Back to AI Basics: Deeper Convolutional Neural Network (CNN)

Date Created: May 15, 2025

Date Modified:

Code for this topic on GitHub

Introduction

Last time, we built a basic CNN for MNIST digit classification. The simplicity of MNIST allowed us to focus on the core CNN concepts without getting bogged down in complexity. Now, do you think it can fare well on more challenging datasets like CIFAR-10 or CIFAR-100? Probably not, but for the sake of learning, let's try it out.

In case you missed it, here is the previous blog post where we built a basic CNN for MNIST. The architecture is pretty simple: 2 convolutional blocks with 3x3 filters, ReLU activations, max pooling, and a fully connected layer, with 25% dropout.

Basic CNN for CIFAR-10

CNN Architecture
Revisiting the last blog's architecture

For this task, we will only change the input channels from 1 to 3 (for RGB images) and adjust the fully connected layer input size to match CIFAR-10's 32x32 spatial dimensions (versus MNIST's 28x28).

Basic CNN Architecture for CIFAR-10
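
Concretely, a minimal sketch of the adapted model in PyTorch could look like the following. The class name and the exact placement of dropout are assumptions on my part, but this layout reproduces the reported 545,098 parameters.


import torch.nn as nn

class BasicCNN(nn.Module):
    # Basic CNN adapted from the MNIST version for CIFAR-10:
    # 3 input channels and a fully connected layer sized for 32x32 inputs.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 1 -> 3 input channels for RGB
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.25),                             # 25% dropout, as in the MNIST version
            nn.Linear(64 * 8 * 8, 128),                   # FC input resized for CIFAR-10
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))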


============================================================
CIFAR-10 Basic CNN Results Summary
============================================================
Final Train Accuracy: 85.09%
Final Test Accuracy: 74.75%
Total Parameters: 545,098
Total Epochs: 10
Training Time: 185.73 seconds

Train Loss: 0.4164, Train Acc: 85.09%
Test Loss: 0.8065, Test Acc: 74.75%
        

As we can see, even though 74.75% accuracy on CIFAR-10 is decent, it is far from usable in practice. Additionally, there is a significant overfitting gap of ~10% between training and test accuracy.

Improving the CNN Architecture

So in what way can we improve this basic CNN to achieve better performance on CIFAR-10? Yes, we can make it deeper and more complex. We will build a deeper CNN with more convolutional blocks, batch normalization, better initialization, and regularization techniques like dropout and data augmentation.

Deeper CNN Architecture
VGG-style Deeper CNN Architecture appropriate for a 10-class task

Architecture Enhancements:

  1. More convolutional blocks, VGG-style, with a progressive increase in channel depth
  2. Batch normalization after every convolutional layer
  3. Kaiming (He) weight initialization, suited to ReLU activations

Regularization Techniques:

  1. Dropout
  2. Data augmentation on the training set

This is a solid, well-structured CNN architecture, reminiscent of VGG-style networks but enhanced with BatchNorm after every convolutional layer, Kaiming weight initialization, and a progressive increase in channel depth. Its strength lies in its modular design, a reasonable depth for experimentation, and an initialization suited to ReLU-based nets. It should perform much better on CIFAR-10 than the basic CNN.
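
The full model lives in the notebook, but as a sketch of the building blocks described above, a Conv-BatchNorm-ReLU block plus Kaiming initialization might look like this (the helper names are mine, and the exact channel progression is an assumption):


import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One VGG-style unit: Conv -> BatchNorm -> ReLU (assumed layout)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def init_weights(module):
    # Kaiming (He) initialization, the variant suited to ReLU activations
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

# Usage: build the network from conv_block(...) calls with increasing
# channel depth (e.g. 64 -> 128 -> 256), then call model.apply(init_weights).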


============================================================
CIFAR-10 Deeper CNN Results Summary
============================================================
Final Train Accuracy: 83.24%
Final Test Accuracy: 86.45%
Best Test Accuracy: 86.45%
Total Parameters: 7,320,394
Training Time: 457.63 seconds
Overfitting Gap: -3.21%
        

============================================================
COMPARISON: Basic CNN vs Deeper CNN
============================================================
Basic CNN Test Accuracy:   74.75%
Deeper CNN Test Accuracy:  86.45%
Improvement: +11.70%
Basic CNN Parameters:      545,098
Deeper CNN Parameters:     7,320,394
Parameter Increase: 13.4x
        

Okay! So we got some excellent results here. 86.45% test accuracy is solid for CIFAR-10, the overfitting gap is gone (test accuracy actually sits 3.21% above training accuracy, since dropout and augmentation are only active during training), and we have an overall 11.7% improvement.

It seems that BatchNorm and data augmentation really helped; the model is well regularised. For 15 epochs, it took about 457 seconds to train (~30s per epoch), which is reasonable given the 7.3M parameters.
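
The exact augmentation pipeline lives in the notebook, but a typical CIFAR-10 recipe (and a reasonable guess at what was used) combines a padded random crop and a random horizontal flip on the training set only:


from torchvision import transforms

# Typical CIFAR-10 training augmentation (assumed recipe, not necessarily
# identical to the notebook's pipeline)
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # random 32x32 crop from a padded image
    transforms.RandomHorizontalFlip(),          # flip left-right with p=0.5
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # CIFAR-10 channel means
                         (0.2470, 0.2435, 0.2616)),  # CIFAR-10 channel stds
])

# The test set gets no augmentation, only tensor conversion and normalisation
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])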

Pushing Deeper CNN Further

Whenever we deal with practical models that classify images, CIFAR-10 and CIFAR-100 are always our go-to datasets, and they serve as benchmarks for many papers and models.

CIFAR-10 tests "Can your model learn?"
CIFAR-100 tests "Can your model scale?"

CIFAR-10 and CIFAR-100 form a somewhat natural difficulty ladder in terms of model capacity vs. task complexity.

So, let's try to scale our Deeper CNN to CIFAR-100, which has 100 classes with 600 images each, and see how it performs. It will be the same architecture, but the final fully connected layer will output 100 classes instead of 10.
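
If the model constructor exposes a num_classes argument (an assumption about the notebook's code), this is a one-line change. As a sanity check, the parameter difference between the two runs, 7,366,564 - 7,320,394 = 46,170 = 512 x 90 + 90, is exactly what you would expect from swapping a 512-to-10 linear head for a 512-to-100 one.


import torch.nn as nn

# Hypothetical constructor: same architecture, wider output head
model = DeeperCNN(num_classes=100)

# Or, reusing an already-built CIFAR-10 model, swap just the final layer
# (the attribute name `classifier` is an assumption):
model.classifier[-1] = nn.Linear(512, 100)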


============================================================
CIFAR-100 Deeper CNN Results Summary
============================================================
Final Train Accuracy: 46.62%
Final Test Accuracy: 53.57%
Best Test Accuracy: 53.57%
Total Parameters: 7,366,564
Total Epochs: 20
Training Time: 600.31 seconds
        

Tough! The model is struggling to learn on CIFAR-100: 53.57% test accuracy is far from usable, and it is clearly underfitting. Sometimes underfitting simply means the model has not had enough epochs to learn yet.


============================================================
CIFAR-100 Deeper CNN Results Summary
============================================================
Final Train Accuracy: 51.25%
Final Test Accuracy: 57.29%
Best Test Accuracy: 57.29%
Total Parameters: 7,366,564
Total Epochs: 35
Training Time: 1093.35 seconds
        

Going from 20 → 35 epochs gave only a ~4% gain. It is hardly worth increasing the epoch count further at this point.

So, what is the problem here?

  1. Training Difficulty Scaling: As task complexity increases, models need to learn finer class boundaries and handle more intra-class variance. This means we might have hit a bottleneck, not in the model's architecture per se, but in its capacity to learn the more complex features.
  2. Diminishing Returns from Depth: Adding depth past a certain point doesn't yield better performance unless the architecture supports it (e.g., residual connections). So an even deeper network might not bring significant improvements.

Deeper networks help only up to the point where they can still be optimized and still generalize. When task difficulty scales up and you don't adjust other factors (like width, augmentation, learning rate, or architecture), depth alone gives diminishing returns.

To go further, we will need more than just depth. We will need an improved architecture.

We Need MORE FIREPOWER!
We need more firepower!

Introducing Modern CNN Architectures

As of 2025, CNNs are still widely used, but their dominance has been overtaken in many areas—especially vision tasks—by Vision Transformers (ViTs) and hybrid models. That said, CNNs remain efficient and preferred for some real-time and resource-constrained tasks.

The CNN landscape is still vast. You can check out more from PyTorch models and pre-trained weights or Keras applications. You can also check out Hugging Face Vision Transformers for ViTs, hybrid models, and possibly more.
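
For example, with a recent torchvision (0.14 or newer, where list_models was added) you can enumerate every model builder that ships with the library:


from torchvision import models

# List all model builders bundled with torchvision
# (classification, detection, segmentation, video, ...)
print(len(models.list_models()))

# Narrow it down to the classification models defined in torchvision.models
print(models.list_models(module=models)[:10])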

However, for this blog, we will focus on a handful of architectures, starting with...

Residual Network (ResNet)

One of the common problems with deep CNNs when adding more layers is the Vanishing/Exploding gradient problem. This makes it hard to train very deep networks, as gradients can become too small or too large, leading to poor convergence or divergence, which also leads to high training and test loss.

In 2015, researchers at Microsoft Research proposed a new architecture called the Residual Network, or ResNet.

ResNet is all about minimising the Vanishing/Exploding gradient problem by introducing a core idea called skip connections (or residual connections) that allow gradients to flow more easily through the network. Instead of learning a direct mapping H(x), ResNet learns the residual mapping F(x) = H(x) - x, where x is the input, so each block outputs F(x) + x.
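
In code, the idea boils down to "add the input back in before the final activation". Here is a minimal sketch of a residual block, simplified from the real ResNet BasicBlock (which additionally handles stride and channel changes with a 1x1 projection on the shortcut):


import torch.nn as nn

class ResidualBlock(nn.Module):
    # Minimal residual block: output = ReLU(F(x) + x)
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                      # the skip connection keeps x untouched
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # the convs learn F(x) = H(x) - x
        out = out + residual              # H(x) = F(x) + x
        return self.relu(out)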

There are multiple variations of ResNet with different numbers of layers. We will choose the 18-layer variant, ResNet-18 to be precise, since the purpose of this blog is to discuss the underlying concept.

ResNet-18 Architecture
ResNet-18 Architecture (Source)

Key ResNet-18 Features:

  1. 18 weight layers: an initial convolution, four stages of residual blocks (two BasicBlocks per stage, each with two 3x3 convolutions), and a final fully connected layer
  2. Identity skip connections, with a 1x1 convolution on the shortcut path when the spatial size or channel count changes
  3. Batch normalization after every convolution
  4. Global average pooling before the classifier, for a total of roughly 11.2M parameters

Now, we have 2 ways to get ResNet-18: build it from scratch or use a pre-trained model. To use a pre-trained model, you can call the simple function below. Building it from scratch is demonstrated in my notebook; feel free to poke around.


from torchvision.models import resnet18, ResNet18_Weights

# Load model with pretrained weights
weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights)
        

Be careful when using the pre-trained model above, though. The specified weights were trained on ImageNet, a dataset of 224x224 high-resolution natural images, which is completely different from the 32x32 low-resolution CIFAR images. You will want to modify the first conv layer and use a smaller learning rate if you go that route.
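
A common adaptation, shown below as a hedged sketch rather than the notebook's actual code, is to swap the 7x7 stride-2 stem for a 3x3 stride-1 convolution, drop the initial max-pool, and replace the classification head:


import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Adapt ImageNet-pretrained ResNet-18 to 32x32 CIFAR-10 images
# (a common recipe, assumed here, not taken from the notebook)
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)  # was 7x7, stride 2
model.maxpool = nn.Identity()                     # skip the aggressive early downsampling
model.fc = nn.Linear(model.fc.in_features, 10)    # 10 CIFAR-10 classes instead of 1000

# conv1 and fc are now randomly initialised, so fine-tune the whole model
# with a smaller learning rate than you would use when training from scratch.


Anyhow, here is the from-scratch performance: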

ResNet-18 on CIFAR-10


============================================================
ResNet-18 on CIFAR10 Results Summary
============================================================
Final Train Accuracy: 91.77%
Final Test Accuracy: 88.75%
Best Test Accuracy: 88.84%
Total Parameters: 11,173,962
Total Epochs: 15
Training Time: 545.73 seconds
Overfitting Gap: 3.02%
        

============================================================
COMPARISON: Previous CNN vs ResNet-18 on CIFAR10
============================================================
Deeper CNN Test Accuracy:  86.45%
ResNet-18 Test Accuracy:   88.84%
Improvement: +2.39%
Deeper CNN Parameters:     7,320,394
ResNet-18 Parameters:      11,173,962
Parameter Increase: 1.53x
        

ResNet-18 on CIFAR-100


============================================================
ResNet-18 on CIFAR100 Results Summary
============================================================
Final Train Accuracy: 84.64%
Final Test Accuracy: 71.53%
Best Test Accuracy: 71.75%
Total Parameters: 11,220,132
Total Epochs: 25
Training Time: 893.19 seconds
Overfitting Gap: 13.11%
        

============================================================
COMPARISON: Previous CNN vs ResNet-18 on CIFAR100
============================================================
Deeper CNN Test Accuracy:  53.57%
ResNet-18 Test Accuracy:   71.75%
Improvement: +18.18%
Deeper CNN Parameters:     7,366,564
ResNet-18 Parameters:      11,220,132
Parameter Increase: 1.53x
        

Okay, we got some good insights here:

  1. On CIFAR-10, ResNet-18 adds only +2.39% over the Deeper CNN; the simpler task was already largely within the Deeper CNN's reach.
  2. On CIFAR-100, the jump is +18.18%, showing that residual connections really start to pay off as task complexity scales up.
  3. The 13.11% overfitting gap on CIFAR-100 suggests the model now has plenty of capacity, but it needs stronger regularisation (or more data and epochs) to close the gap.

Conclusion

There is no doubt that, for understanding and exploring, the simplest of CNNs can give you a glimpse of how convolutional neural networks work and how they benefit image classification, and possibly more. However, as we have seen, the more complex the imagery, or the larger the images, the harder it is for these basic models to cope. Deeper CNNs with BatchNorm, Kaiming initialization, and regularization techniques can help us achieve better performance on more complex imagery. We also got to know a new architecture called ResNet, which is a great example of how to solve the vanishing/exploding gradient problem in deep networks.

In the next part of this Deep CNNs series, we will continue to explore other modern CNNs and their effectiveness in their own areas.

References

  1. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385v1. Retrieved from https://arxiv.org/pdf/1512.03385
  2. PyTorch. (n.d.). torchvision.models.resnet18. Retrieved from https://docs.pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html
  3. GeeksforGeeks. (2025). Kaiming Initialization in Deep Learning. Last Updated: 02 Apr, 2025. Retrieved from https://www.geeksforgeeks.org/kaiming-initialization-in-deep-learning/
  4. GeeksforGeeks. (2025). Residual Networks (ResNet) - Deep Learning. Last Updated: 07 Apr, 2025. Retrieved from https://www.geeksforgeeks.org/residual-networks-resnet-deep-learning/
  5. GeeksforGeeks. (2025). ResNet18 from Scratch Using PyTorch. Last Updated: 21 Apr, 2025. Retrieved from https://www.geeksforgeeks.org/resnet18-from-scratch-using-pytorch/

Back to the top page of this series

Back to Home