Description
Introduction
The rapid evolution of deep learning demands models that are not only accurate but also fast, optimized, and hardware-aware. This course dives deep into PyTorch's compiler stack (TorchDynamo, AOTAutograd, and TorchInductor), paired with Triton for writing custom high-performance GPU kernels.
You will learn how PyTorch transforms eager-mode code into optimized graph execution, how to analyze performance bottlenecks, and how to extend PyTorch with custom kernels. By the end, you will be capable of building and optimizing end-to-end training and inference pipelines with production-grade performance.
Prerequisites
- Strong knowledge of Python
- Solid understanding of PyTorch (autograd, nn modules, tensors)
- Basic C++/CUDA familiarity (for kernel extensions)
- Understanding of GPUs, parallel programming, and performance concepts
- (Optional but helpful) Experience profiling deep-learning workloads
Table of Contents
1. PyTorch Compiler Fundamentals
1.1 Understanding the PyTorch Compiler Stack
1.2 FX Graphs & IR Basics
1.3 Compilation Flow
1.4 Supported Backends
2. Hands-on Profiling and Optimization
2.1 Profiling with PyTorch Profiler
2.2 Bottleneck Identification
2.3 Model-Level Optimization
2.4 Kernel-Level Optimization
3. Adding New Kernels to PyTorch
3.1 PyTorch Dispatcher Overview
3.2 Defining Custom Operators
3.3 Kernel Implementation in C++/CUDA
3.4 Autograd Support
4. AOTAutograd and TorchDynamo
4.1 TorchDynamo Internals
4.2 AOTAutograd Overview
4.3 Compiler Backends for AOTAutograd
4.4 Debugging Graph Breaks
5. Inductor and Its Workings
5.1 What is TorchInductor?
5.2 Graph Lowering Pipeline
5.3 Codegen for CUDA & CPU
5.4 Optimizations in Inductor
6. Integration with Triton
6.1 Why Triton?
6.2 Inductor + Triton Architecture
6.3 Triton IR Overview
6.4 Debugging Triton Kernels Generated by Inductor
7. Writing Optimized Triton Kernels
7.1 Triton Programming Model
7.2 Implementing Custom Kernels
7.3 Optimizing for GPU
7.4 Benchmarking & Validating Kernels
8. Device Integration
8.1 Extending PyTorch for New Hardware
8.2 Device Registration
8.3 Kernel Lowering Pathways
8.4 Testing & Validation
9. Memory Optimization and Efficient Training
9.1 Activation Checkpointing
9.2 Static vs Dynamic Shapes
9.3 Mixed Precision Training (AMP)
9.4 Zero Redundancy Optimizer (ZeRO)
10. Distributed and Parallel Training
10.1 Data Parallelism & DDP Deep Dive
10.2 Model Parallelism
10.3 Distributed Compiler Behavior
10.4 Performance Debugging for Distributed Systems
11. Quantization & Model Compression Techniques
11.1 Quantization Aware Training (QAT)
11.2 Post Training Quantization (PTQ)
11.3 Hardware-Aware Quantization
11.4 Compiler Support for Quantization
12. Advanced Autograd Internals
12.1 Graph-based vs Eager Autograd
12.2 Custom Autograd Functions
12.3 Higher-order Gradients
12.4 Autograd Optimization Strategies
13. Graph Transformations & IR Optimization
13.1 FX Graph Manipulations
13.2 Operator Fusion Strategies
13.3 SSA / IR Lowering Concepts
13.4 Cost Models
14. GPU Performance Engineering
14.1 CUDA Kernel Launch Principles
14.2 Memory-bound vs Compute-bound Analysis
14.3 Efficient Tensor Layouts
14.4 Benchmarking Approaches
15. Real-World Deployment & Productionization
15.1 TorchScript vs torch.compile Deployment
15.2 Exporting with Torch Export
15.3 Serving on Triton Inference Server
15.4 Handling Dynamic Shapes & Latency Constraints
16. Debugging and Troubleshooting Advanced Models
16.1 Debugging Compiler Failures
16.2 Triton Kernel Debugging Tools
16.3 Numerical Stability Issues
16.4 Performance Regression Analysis
17. Ethical & Responsible Model Optimization (Optional)
17.1 Energy Efficiency of Compiled Models
17.2 Fairness Implications of Optimization
17.3 Model Transparency
17.4 Benchmark Reproducibility
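Many of the topics above (graph capture, operator fusion, IR transformations) meet in the FX graph representation introduced in Section 1.2. As a minimal sketch, `torch.fx.symbolic_trace` turns a module into a `GraphModule` whose graph you can inspect and rewrite (the tiny module here is purely illustrative):

```python
import torch
import torch.fx

class TinyNet(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2.0

# symbolic_trace captures the module into an FX GraphModule, the
# graph-based IR that compiler passes and fusion strategies operate on.
gm = torch.fx.symbolic_trace(TinyNet())

# Printing the graph shows placeholder, call_function, and output nodes.
print(gm.graph)
```

The traced `GraphModule` is still callable like the original module, so transformations can be validated by comparing outputs against eager execution.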
This course empowers you to move from using PyTorch to truly mastering its compiler internals and GPU optimization workflow.
By mastering both Inductor and Triton, you gain the ability to write highly efficient model kernels and integrate them directly into PyTorch's compilation flow, unlocking maximum performance across hardware platforms.
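For a flavor of the Triton programming model covered in Sections 6 and 7, here is the standard introductory vector-add kernel (an illustrative sketch, not course material; it requires the `triton` package and a CUDA-capable GPU):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    # One program per BLOCK_SIZE chunk of the input.
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The same block/mask pattern scales up to the fused matmul and attention kernels that Inductor emits, which is why the course pairs the two.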