PyTorch Basics & Tutorial
I’ve created a comprehensive PyTorch tutorial that takes you from basic tensor operations to advanced topics like attention mechanisms and mixed precision training. This hands-on guide includes real code examples and practical implementations that demonstrate core concepts in modern deep learning.
Tutorial Overview
The tutorial consists of 5 progressive modules, each building upon the previous concepts with practical code examples you can run and experiment with.
Part 1: Tensor Fundamentals
Understanding tensors is crucial for any PyTorch work. Here’s how we start:
import torch
import numpy as np
# Creating tensors - the foundation of PyTorch
tensor_from_data = torch.tensor([1, 2, 3, 4])
tensor_zeros = torch.zeros(2, 3)
tensor_ones = torch.ones(2, 3)
tensor_random = torch.randn(2, 3) # Normal distribution
# Essential tensor properties
x = torch.randn(3, 4, 5)
print(f"Shape: {x.shape}")
print(f"Data type: {x.dtype}")
print(f"Device: {x.device}")
print(f"Number of elements: {x.numel()}")
Key Broadcasting Concepts
One of the most powerful features in PyTorch:
a = torch.tensor([[1], [2], [3]]) # 3x1
b = torch.tensor([10, 20, 30]) # shape (3,), broadcast as 1x3
result = a + b # Broadcasting to 3x3
# Result:
# [[11, 21, 31],
# [12, 22, 32],
# [13, 23, 33]]
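If you are unsure whether two shapes are compatible, torch.broadcast_shapes will tell you without allocating anything:

# Compute the broadcast result shape directly
print(torch.broadcast_shapes((3, 1), (3,)))   # torch.Size([3, 3])
# Incompatible shapes raise a RuntimeError rather than failing silently:
# torch.broadcast_shapes((3, 2), (3,))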
Part 2: Automatic Differentiation
PyTorch’s autograd system is what makes deep learning possible. Here’s how gradients flow:
# Basic gradient computation
x = torch.tensor([3.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = x * y + x ** 2
z.backward()
print(f"dz/dx: {x.grad}") # Should be y + 2*x = 2 + 2*3 = 8
print(f"dz/dy: {y.grad}") # Should be x = 3
Higher-Order Gradients
For advanced optimization techniques:
x = torch.tensor([2.0], requires_grad=True)
y = x ** 3
# First derivative
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"dy/dx: {dy_dx}") # 3*2² = 12
# Second derivative
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(f"d²y/dx²: {d2y_dx2}") # 6*2 = 12
Part 3: Neural Network Architecture
Building neural networks with proper PyTorch patterns:
import torch.nn as nn
import torch.nn.functional as F
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Complete XOR problem solution
class XORNet(nn.Module):
    def __init__(self):
        super(XORNet, self).__init__()
        self.fc1 = nn.Linear(2, 4)
        self.fc2 = nn.Linear(4, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x
Training Loop Pattern
The standard PyTorch training pattern:
model = XORNet()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# XOR dataset
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
for epoch in range(1000):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 200 == 0:
        print(f'Epoch [{epoch+1}/1000], Loss: {loss.item():.4f}')
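Once the loss has dropped, a quick evaluation pass confirms the network learned XOR. model.eval() and torch.no_grad() disable training-only behavior and gradient tracking:

model.eval()
with torch.no_grad():
    predictions = (model(X) > 0.5).float()
print(predictions.squeeze())  # should converge to tensor([0., 1., 1., 0.])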
Part 4: Practical Implementation Problems
This section tackles real-world implementation challenges you might encounter:
1. Numerically Stable Softmax
A classic implementation exercise that tests real understanding of numerical stability:
def softmax_from_scratch(x):
    """Numerically stable softmax implementation"""
    exp_x = torch.exp(x - torch.max(x, dim=-1, keepdim=True)[0])
    return exp_x / torch.sum(exp_x, dim=-1, keepdim=True)
# Test against PyTorch implementation
x = torch.randn(3, 5)
our_softmax = softmax_from_scratch(x)
pytorch_softmax = F.softmax(x, dim=-1)
print(f"Difference: {torch.max(torch.abs(our_softmax - pytorch_softmax))}")
2. Custom Dataset Implementation
Essential for real-world applications:
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]
# Usage example
data = torch.randn(100, 10) # 100 samples, 10 features
targets = torch.randint(0, 2, (100,)) # Binary classification
dataset = CustomDataset(data, targets)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
for batch_idx, (batch_data, batch_targets) in enumerate(dataloader):
    # Your training code here
    pass
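The same Dataset plugs directly into utilities such as torch.utils.data.random_split, for example to carve out a validation set:

from torch.utils.data import random_split

# Split the 100-sample dataset 80/20
train_set, val_set = random_split(dataset, [80, 20])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=False)
print(len(train_set), len(val_set))  # 80 20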
3. Batch Normalization from Scratch
Understanding the internals:
class CustomBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        # Learnable parameters
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        # Running statistics
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # Update running statistics; detach so the buffers don't
            # accumulate autograd history across iterations
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * batch_mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * batch_var.detach()
            mean, var = batch_mean, batch_var
        else:
            mean, var = self.running_mean, self.running_var
        x_normalized = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight * x_normalized + self.bias
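A good habit with from-scratch layers is to check them against the built-in. In training mode both normalize with the biased batch variance, so the outputs should agree to floating-point precision:

# Sanity check against nn.BatchNorm1d (same default eps=1e-5)
x = torch.randn(32, 8)
custom_bn = CustomBatchNorm1d(8)
reference_bn = nn.BatchNorm1d(8)
custom_bn.train()
reference_bn.train()
diff = (custom_bn(x) - reference_bn(x)).abs().max()
print(diff.item())  # expected on the order of 1e-7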
Part 5: Advanced Architectures
Multi-Head Attention Implementation
The backbone of modern transformer architectures:
import math
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        for module in [self.W_q, self.W_k, self.W_v, self.W_o]:
            nn.init.xavier_uniform_(module.weight)
            nn.init.constant_(module.bias, 0)

    def create_padding_mask(self, seq, pad_idx=0):
        """Create padding mask to ignore padded tokens"""
        return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

    def create_causal_mask(self, size):
        """Create causal mask for autoregressive generation"""
        # Boolean so it can be combined with the padding mask via &
        mask = torch.tril(torch.ones(size, size, dtype=torch.bool))
        return mask.unsqueeze(0).unsqueeze(0)  # Add batch and head dimensions

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Boolean masks: True = attend; float masks: nonzero = attend
            if mask.dtype == torch.bool:
                scores = scores.masked_fill(~mask, float('-inf'))
            else:
                scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, padding_mask=None, causal_mask=None):
        batch_size, seq_length, d_model = query.size()

        # Linear transformations and reshape for multi-head
        Q = self.W_q(query).view(batch_size, seq_length, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, seq_length, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, seq_length, self.n_heads, self.d_k).transpose(1, 2)

        # Combine masks if both are provided (both are boolean, so & broadcasts)
        combined_mask = None
        if padding_mask is not None:
            combined_mask = padding_mask
        if causal_mask is not None:
            if combined_mask is not None:
                combined_mask = combined_mask & causal_mask
            else:
                combined_mask = causal_mask

        # Apply attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, combined_mask)

        # Concatenate heads and apply final linear transformation
        attention_output = attention_output.transpose(1, 2).contiguous().view(
            batch_size, seq_length, d_model)
        output = self.W_o(attention_output)
        return output, attention_weights
# Usage examples
d_model, n_heads, seq_len, batch_size = 512, 8, 10, 2
attention = MultiHeadAttention(d_model, n_heads)
# Example 1: Self-attention without masks
x = torch.randn(batch_size, seq_len, d_model)
output, weights = attention(x, x, x)
# Example 2: With padding mask (for variable-length sequences)
seq_tokens = torch.randint(1, 1000, (batch_size, seq_len)) # Token IDs
padding_mask = attention.create_padding_mask(seq_tokens, pad_idx=0)
output, weights = attention(x, x, x, padding_mask=padding_mask)
# Example 3: With causal mask (for autoregressive generation)
causal_mask = attention.create_causal_mask(seq_len)
output, weights = attention(x, x, x, causal_mask=causal_mask)
# Example 4: With both masks (common in decoder self-attention)
output, weights = attention(x, x, x, padding_mask=padding_mask, causal_mask=causal_mask)
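Recent PyTorch versions (2.0+) also ship a fused kernel for the inner attention computation, F.scaled_dot_product_attention, which the hand-rolled method above mirrors. Assuming such a version, the core call looks like this:

# Fused attention (PyTorch >= 2.0); inputs are (batch, heads, seq, d_k)
q = torch.randn(batch_size, n_heads, seq_len, d_model // n_heads)
k = torch.randn(batch_size, n_heads, seq_len, d_model // n_heads)
v = torch.randn(batch_size, n_heads, seq_len, d_model // n_heads)
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(fused.shape)  # torch.Size([2, 8, 10, 64])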
Custom Optimizer Implementation
Understanding optimization at a fundamental level:
class SGDWithMomentum:
    def __init__(self, parameters, lr=0.01, momentum=0.9):
        self.parameters = list(parameters)
        self.lr = lr
        self.momentum = momentum
        self.velocities = [torch.zeros_like(p) for p in self.parameters]

    def step(self):
        for param, velocity in zip(self.parameters, self.velocities):
            if param.grad is not None:
                # v <- momentum * v + grad
                velocity.mul_(self.momentum).add_(param.grad)
                # p <- p - lr * v
                param.data.add_(velocity, alpha=-self.lr)

    def zero_grad(self):
        for param in self.parameters:
            if param.grad is not None:
                param.grad.zero_()
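This update rule (v ← μv + g, then p ← p − lr·v) matches torch.optim.SGD with momentum and no dampening, so a one-step comparison makes a handy test:

# Compare one step of the custom optimizer against torch.optim.SGD
p1 = nn.Parameter(torch.randn(5))
p2 = nn.Parameter(p1.detach().clone())
custom_opt = SGDWithMomentum([p1], lr=0.01, momentum=0.9)
builtin_opt = torch.optim.SGD([p2], lr=0.01, momentum=0.9)
p1.grad = torch.ones_like(p1)
p2.grad = torch.ones_like(p2)
custom_opt.step()
builtin_opt.step()
print((p1 - p2).abs().max().item())  # 0.0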
Mixed Precision Training
For efficient training on modern GPUs:
# Requires CUDA
if torch.cuda.is_available():
    device = torch.device('cuda')
    model = model.to(device)
    scaler = torch.cuda.amp.GradScaler()

    # Example training setup
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())

    for epoch in range(10):  # Example: 10 epochs
        for batch_data, batch_targets in dataloader:
            batch_data, batch_targets = batch_data.to(device), batch_targets.to(device)
            optimizer.zero_grad()

            # Run the forward pass in reduced precision where it is safe
            with torch.cuda.amp.autocast():
                output = model(batch_data)
                loss = criterion(output, batch_targets)

            # Scale the loss so float16 gradients do not underflow
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
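If you also need gradient clipping under AMP, the gradients must be unscaled first; a sketch of the standard pattern, replacing the three scaler lines in the inner loop above:

# Between backward() and step(), unscale before clipping
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # gradients are now in their true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)      # skipped automatically if grads contain inf/NaN
scaler.update()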
Getting Started
The complete tutorial is available on GitHub with an interactive runner:
git clone https://github.com/hanfang/pytorch-practice.git
cd pytorch-practice
pip install torch torchvision numpy matplotlib
python run_tutorial.py
The interactive runner lets you:
- Choose specific topics or run the entire curriculum
- Experiment with code examples in real-time
- Track your progress through the learning modules
Interactive Learning Features
- Hands-on Examples: Every concept includes runnable code
- Progressive Complexity: From basic tensors to transformer attention
- Best Practices: Production-ready patterns and debugging techniques
- Performance Tips: Memory optimization and efficient training strategies
Learning Outcomes
After completing this tutorial, you’ll master:
- Tensor Operations: Efficient manipulation and broadcasting rules
- Automatic Differentiation: Gradient computation and custom functions
- Neural Architecture: Building complex models with proper PyTorch patterns
- Advanced Techniques: Attention mechanisms, custom optimizers, mixed precision
- Production Skills: Debugging, profiling, and optimization strategies
Whether you’re building research prototypes or production ML systems, this tutorial provides the deep PyTorch knowledge needed for modern deep learning applications.
The combination of theoretical understanding and hands-on implementation has been crucial in my journey from academic research to building large-scale AI systems. I hope this resource accelerates your own PyTorch mastery!