4. Introduction to Neural Networks with PyTorch
Welcome to Module 4! We now transition from classical machine learning with scikit-learn to the foundations of Deep Learning. Neural Networks are powerful models inspired by the structure of the human brain, capable of learning complex patterns. PyTorch is a leading open-source deep learning framework known for its flexibility and Pythonic feel, widely used in both research and industry.
Learning Objectives
After this module, you will be able to:
- Understand the concept of a Perceptron and the role of activation functions.
- Describe the structure of a basic Feedforward Neural Network.
- Explain the core ideas behind Backpropagation and Gradient Descent for training neural networks.
- Understand key PyTorch concepts: Tensors, Autograd, Modules (nn.Module), Optimizers, and Loss Functions.
- Build and train a simple feedforward neural network in PyTorch for a basic task.
- Grasp the fundamental idea behind Convolutional Neural Networks (CNNs) for image data.
Setup: To follow along, install PyTorch with pip install torch torchvision torchaudio, or, with conda, conda install pytorch torchvision torchaudio cpuonly -c pytorch (CPU-only build).
Perceptrons and Activation Functions
The simplest unit in a neural network is often inspired by the Perceptron. Conceptually, it takes multiple inputs, computes a weighted sum, adds a bias, and then passes the result through an activation function.
- Weighted Sum + Bias: $ z = (\sum_{i} w_i x_i) + b $ (where $x_i$ are inputs, $w_i$ are weights, $b$ is bias)
- Activation Function: Introduces non-linearity into the model, allowing it to learn complex relationships beyond simple linear combinations. Without non-linear activation functions, a deep neural network would just behave like a single linear layer.
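As a concrete illustration, the computation for a single unit can be written directly with PyTorch tensors. This is a minimal sketch; the input values, weights, and bias below are made up for the example:
import torch

x = torch.tensor([1.0, 2.0, 3.0])   # inputs x_i (illustrative values)
w = torch.tensor([0.4, -0.2, 0.1])  # weights w_i
b = torch.tensor(0.5)               # bias b

z = torch.dot(w, x) + b             # weighted sum plus bias
a = torch.sigmoid(z)                # pass the result through an activation function
print(f"z = {z.item():.3f}, activation = {a.item():.3f}")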
Common Activation Functions:
- Sigmoid: Squashes values between 0 and 1. $ \sigma(z) = \frac{1}{1 + e^{-z}} $. Often used in the output layer for binary classification.
- ReLU (Rectified Linear Unit): $ \text{ReLU}(z) = \max(0, z) $. Very popular for hidden layers due to its simplicity and efficiency. It outputs the input directly if positive, otherwise, it outputs zero.
- Tanh (Hyperbolic Tangent): Squashes values between -1 and 1. $ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $.
- Softmax: Used in the output layer for multi-class classification. Converts a vector of scores into a probability distribution (outputs sum to 1).
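A quick sketch comparing these activations on the same arbitrary sample values:
import torch

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # arbitrary sample values
print("Sigmoid:", torch.sigmoid(z))             # squashed into (0, 1)
print("ReLU:   ", torch.relu(z))                # negatives clipped to 0
print("Tanh:   ", torch.tanh(z))                # squashed into (-1, 1)
print("Softmax:", torch.softmax(z, dim=0))      # probabilities that sum to 1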
Feedforward Neural Networks
Also known as Multi-Layer Perceptrons (MLPs), these are the most basic type of artificial neural network.
- Structure: Consists of an input layer, one or more hidden layers, and an output layer.
- Connections: Neurons in one layer are typically fully connected to neurons in the next layer.
- Information Flow: Data flows strictly in one direction – from the input layer, through the hidden layers, to the output layer – without loops (hence “feedforward”).
- Learning: The network “learns” by adjusting the weights and biases of the connections between neurons during training to minimize the difference between its predictions and the actual target values.
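As a small illustration of this structure, such a network can be written with PyTorch's nn.Sequential; the layer sizes below are arbitrary and only show the input -> hidden -> output pattern:
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(4, 16),   # input layer: 4 features -> 16 hidden units
    nn.ReLU(),          # non-linear activation
    nn.Linear(16, 8),   # second hidden layer
    nn.ReLU(),
    nn.Linear(8, 1),    # output layer: a single value
)
print(mlp)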
Backpropagation and Gradient Descent
How does the network learn to adjust its weights and biases?
- Forward Pass: Input data is fed through the network layer by layer, applying weighted sums and activation functions, until an output (prediction) is generated.
- Loss Calculation: A loss function (or cost function) measures how far the network’s prediction is from the true target value. Common loss functions include Mean Squared Error (MSE) for regression and Cross-Entropy Loss for classification.
- Backward Pass (Backpropagation): This is the core learning algorithm. It calculates the gradient (derivative) of the loss function with respect to each weight and bias in the network. It uses the chain rule of calculus to efficiently propagate the error signal backward from the output layer to the input layer. The gradient indicates the direction and magnitude of change needed for each weight/bias to reduce the loss.
- Weight Update (Gradient Descent): An optimizer algorithm (like Stochastic Gradient Descent - SGD, or more advanced ones like Adam) uses the calculated gradients to update the weights and biases. It takes small steps in the opposite direction of the gradient to minimize the loss. The size of these steps is controlled by the learning rate.
This cycle (Forward Pass -> Loss Calculation -> Backward Pass -> Weight Update) is repeated many times (over many epochs and batches of data) until the model’s performance converges.
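The cycle can be seen in miniature with a single linear model and one made-up data point; every name and value here is illustrative, not part of the full example later in this module:
import torch
import torch.nn as nn
import torch.optim as optim

model = torch.nn.Linear(3, 1)                       # tiny model: 3 inputs -> 1 output
criterion = nn.MSELoss()                            # loss function
optimizer = optim.SGD(model.parameters(), lr=0.1)   # optimizer with a learning rate

x = torch.tensor([[1.0, 2.0, 3.0]])                 # one made-up training example
target = torch.tensor([[0.5]])

prediction = model(x)                               # 1. forward pass
loss = criterion(prediction, target)                # 2. loss calculation
optimizer.zero_grad()                               # clear old gradients
loss.backward()                                     # 3. backward pass (compute gradients)
optimizer.step()                                    # 4. weight update (one gradient descent step)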
Building and Training Models in PyTorch
Let’s see how these concepts translate into PyTorch code.
Core PyTorch Concepts:
- Tensors: The fundamental data structure in PyTorch, similar to NumPy arrays but with added capabilities for GPU acceleration and automatic differentiation.
- torch.nn.Module: The base class for all neural network modules (layers, or the entire network itself). You define your network architecture by subclassing nn.Module.
- torch.autograd: PyTorch’s automatic differentiation engine. It tracks operations on tensors and automatically computes gradients during the backward pass.
- Loss Functions (torch.nn): Pre-defined loss functions (e.g., nn.MSELoss, nn.CrossEntropyLoss).
- Optimizers (torch.optim): Implementations of optimization algorithms (e.g., optim.SGD, optim.Adam) used to update model weights.
Simple Feedforward Network Example (Binary Classification):
import torch
import torch.nn as nn # Neural network modules
import torch.optim as optim # Optimization algorithms
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification # Generate synthetic data
from sklearn.preprocessing import StandardScaler
# 1. Generate Synthetic Data (using scikit-learn for convenience)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, n_redundant=2, n_classes=2, random_state=42)
# Convert to PyTorch Tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).unsqueeze(1) # Target needs shape (n_samples, 1) for BCELoss
# Scale features (important for neural networks)
scaler = StandardScaler()
X = torch.tensor(scaler.fit_transform(X), dtype=torch.float32)
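# NOTE: fitting the scaler on the full dataset before splitting leaks test-set statistics;
# acceptable for this small demo, but in practice fit the scaler on the training split only.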
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Define the Network Architecture
class SimpleClassifier(nn.Module):
    def __init__(self, num_features):
        super(SimpleClassifier, self).__init__()  # Call parent class constructor
        self.layer_1 = nn.Linear(num_features, 16)  # Input features -> 16 hidden units
        self.activation_1 = nn.ReLU()  # ReLU activation
        self.layer_2 = nn.Linear(16, 8)  # 16 hidden units -> 8 hidden units
        self.activation_2 = nn.ReLU()
        self.output_layer = nn.Linear(8, 1)  # 8 hidden units -> 1 output unit (for binary)
        self.output_activation = nn.Sigmoid()  # Sigmoid for binary probability

    # Define the forward pass
    def forward(self, x):
        x = self.layer_1(x)
        x = self.activation_1(x)
        x = self.layer_2(x)
        x = self.activation_2(x)
        x = self.output_layer(x)
        x = self.output_activation(x)
        return x
# Create an instance of the model
input_features = X_train.shape[1]
model = SimpleClassifier(num_features=input_features)
print("Model Architecture:\n", model)
# 3. Define Loss Function and Optimizer
learning_rate = 0.01
criterion = nn.BCELoss() # Binary Cross Entropy Loss for binary classification
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Adam optimizer
# 4. Training Loop
num_epochs = 100
batch_size = 32 # Process data in batches
for epoch in range(num_epochs):
    model.train()  # Set model to training mode

    # Simple batching (usually use DataLoader for efficiency)
    permutation = torch.randperm(X_train.size()[0])
    for i in range(0, X_train.size()[0], batch_size):
        optimizer.zero_grad()  # Clear previous gradients
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], y_train[indices]

        # Forward pass
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)

        # Backward pass and optimize
        loss.backward()  # Calculate gradients (Autograd magic!)
        optimizer.step()  # Update weights

    # Print loss every few epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
# 5. Evaluation
model.eval() # Set model to evaluation mode (disables dropout, etc.)
with torch.no_grad(): # Disable gradient calculation for evaluation
    y_pred_tensor = model(X_test)
    y_pred_binary = (y_pred_tensor >= 0.5).float()  # Convert probabilities to 0 or 1
    accuracy = (y_pred_binary == y_test).sum().item() / y_test.shape[0]
print(f'\nTest Accuracy: {accuracy * 100:.2f}%')
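As noted in the training-loop comment above, the manual permutation-based batching is only for illustration; the more idiomatic approach is to wrap the tensors in a TensorDataset and iterate with a DataLoader. A minimal sketch, reusing the model, criterion, optimizer, num_epochs, X_train, and y_train from the example above:
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train, y_train)                         # pair features with targets
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)   # shuffled mini-batches

for epoch in range(num_epochs):
    model.train()
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()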
Convolutional Neural Networks (CNNs) Basics
While feedforward networks work well for tabular data, they aren’t ideal for data with spatial structure, like images. Convolutional Neural Networks (CNNs) are a specialized type of neural network designed primarily for processing grid-like data (e.g., images).
Key Concepts:
- Convolutional Layers: Instead of fully connected layers, CNNs use convolutional layers that apply learnable filters (kernels) across the input image. These filters detect spatial patterns like edges, corners, textures, etc.
- Pooling Layers: Reduce the spatial dimensions (width/height) of the feature maps, making the model more robust to variations in the position of features and reducing computational load. Max Pooling is common.
- Feature Hierarchy: Early layers learn simple features (edges), while deeper layers combine these to learn more complex features (shapes, objects).
CNNs have revolutionized computer vision tasks like image classification, object detection, and segmentation.
In PyTorch, you can use layers such as nn.Conv2d and nn.MaxPool2d to build CNN architectures.
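A minimal sketch of what a small CNN might look like; the channel counts and kernel sizes are arbitrary, and it assumes 28x28 grayscale input images:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # learnable filters
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # halve spatial dimensions: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # 14x14 -> 7x7
    nn.Flatten(),                  # flatten feature maps for a fully connected classifier
    nn.Linear(16 * 7 * 7, 10),     # e.g., 10 output classes
)

dummy_images = torch.randn(4, 1, 28, 28)   # batch of 4 fake grayscale images
print(cnn(dummy_images).shape)             # torch.Size([4, 10])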
Practice Exercises (Take-Home Style)
- Activation Functions: Briefly describe why non-linear activation functions (like ReLU or Sigmoid) are necessary in multi-layer neural networks. What would happen if you only used linear activation functions?
- Expected Result: Non-linearities allow the network to learn complex, non-linear relationships in the data. Without them, a multi-layer network would mathematically collapse into a single linear transformation, unable to model complex patterns.
- PyTorch Tensors: Create a 2x3 PyTorch tensor filled with random numbers. Print the tensor and its shape.
- Expected Result: Output will show a 2x3 tensor with random values and its shape, torch.Size([2, 3]).
- Define a Simple Network: Define a PyTorch nn.Module class for a network with one hidden layer containing 8 neurons and using the ReLU activation function. Assume the input has 5 features and the output predicts a single continuous value (regression; no output activation needed here). Don’t worry about training it.
- Expected Result: A class definition inheriting from nn.Module with an __init__ method defining nn.Linear(5, 8) and nn.Linear(8, 1) layers, and a forward method applying the layers sequentially with nn.ReLU() after the first linear layer (a sample sketch follows the exercise list).
- Loss Function Choice: Which PyTorch loss function (nn.MSELoss or nn.BCELoss) would be appropriate if you were training the network from Exercise 3 for a regression task? Why?
- Expected Result: nn.MSELoss (Mean Squared Error Loss) is appropriate for regression tasks where the goal is to minimize the squared difference between continuous predicted and actual values. nn.BCELoss is for binary classification.
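For reference, a sketch of one possible solution to Exercise 3; the class and attribute names are illustrative:
import torch.nn as nn

class SimpleRegressor(nn.Module):
    def __init__(self):
        super(SimpleRegressor, self).__init__()
        self.hidden = nn.Linear(5, 8)    # 5 input features -> 8 hidden neurons
        self.activation = nn.ReLU()      # non-linearity after the hidden layer
        self.output = nn.Linear(8, 1)    # 8 hidden neurons -> 1 continuous output

    def forward(self, x):
        x = self.hidden(x)
        x = self.activation(x)
        return self.output(x)            # no output activation for regression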
Summary
You’ve been introduced to the fundamental building blocks of neural networks: perceptrons, activation functions, and the feedforward architecture. We discussed the crucial training process involving loss functions, backpropagation (gradient calculation), and gradient descent (weight updates via optimizers). You saw how to implement and train a basic neural network using PyTorch’s core components (Tensor, nn.Module, optim, loss functions, autograd). Finally, we briefly touched upon CNNs, specialized networks for image data.
Additional Resources
- PyTorch Official Tutorials: Excellent starting point, covering basics to advanced topics.
- PyTorch Documentation: Comprehensive API reference.
- Deep Learning with PyTorch Book (Online): A thorough book available freely online.
- 3Blue1Brown - Neural Networks (YouTube Playlist): Fantastic visual intuition for how neural networks work and learn.
- fast.ai Course: Practical deep learning course using PyTorch (and their own library built on top).
Next: Let’s shift focus to processing language data. Proceed to Module 5: Natural Language Processing (NLP) Essentials.