Back to AI Basics: Simple Neural Network

Date Created: May 15, 2025

Date Modified:

Code for this topic on GitHub

You can think of this as step 0 of the series. In this step, we will build a simple neural network from scratch using Python and NumPy. This will help you understand the basic concepts of neural networks before diving into more complex architectures.

I will mainly talk about the math underneath. Hopefully, when you're done with this blog, you will understand what is happening under the hood.

If you want to understand layers in neural networks in a more practical way, you can check out my MLP on MNIST blog. I will show you how to build a simple MLP using PyTorch and train it on the MNIST dataset.

Or if you want to know more about optimizers, loss functions, etc., check out my GitHub repo for this series.

The Working

A neural network works by mimicking how the human brain processes information.

It consists of multiple layers of interconnected nodes (neurons), not just one, where each node takes input, applies weights and a bias, runs it through an activation function, and passes the result to the next layer.

During training, the network adjusts its weights using backpropagation to minimize the error between its predictions and the actual values. Over time, it learns to recognize patterns and make accurate predictions or classifications based on the input data.

In this blog, I will show you how to implement a simple neural network from scratch using Python and NumPy.

The Model

Our model will be: Input Layer (2 neurons) → Hidden Layer (3 neurons with ReLU) → Output Layer (2 neurons with Softmax).

Simple Neural Network Architecture

This is a basic Multilayer Perceptron (MLP) (I talked more about MLPs here). The term typically refers to fully connected feedforward neural networks with at least one hidden layer, just like this one. I will follow my notebook for this blog post closely. You can view it here.

Code for this topic on GitHub

The Walkthrough

The Input

We will use a simple dataset of 200 samples with 2 features and 2 classes. The dataset is generated using NumPy's random number generator.

The input layer has 2 neurons that take the 2 features of the dataset.
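As a concrete sketch, the dataset could be generated as two Gaussian blobs, one per class. The exact recipe below (means, spread, seed) is an assumption for illustration; the notebook may generate the data differently.

import numpy as np

rng = np.random.default_rng(42)

# 200 samples, 2 features, 2 classes: one Gaussian blob per class
# (an illustrative assumption; the notebook may use a different recipe)
n_per_class = 100
X0 = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(n_per_class, 2))   # class 0
X1 = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(n_per_class, 2))     # class 1

X = np.vstack([X0, X1])                                # shape (200, 2)
y = np.array([0] * n_per_class + [1] * n_per_class)    # shape (200,)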

The Hidden Layer

The hidden layer has 3 neurons that apply the ReLU activation function.

The input data is passed through the hidden layer. Each neuron in the hidden layer takes the input from the input layer, applies weights and a bias, and runs it through the ReLU activation function.

output = activation(weights · inputs + bias)

or in this case:

a = ReLU(w · x + b)
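In code, a single neuron's computation is one line. The input, weight, and bias values below are hypothetical, purely to show the formula:

import numpy as np

def relu(z):
    return np.maximum(0, z)

# one neuron: weighted sum of the inputs plus a bias, passed through ReLU
inputs = np.array([0.5, 0.5])      # hypothetical input
weights = np.array([0.6, 0.7])     # hypothetical weights
bias = 0.5                         # hypothetical bias
a = relu(np.dot(weights, inputs) + bias)   # 0.6*0.5 + 0.7*0.5 + 0.5 = 1.15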
                

So for example:

| Layer | Description | Shape | Values |
| --- | --- | --- | --- |
| Input → Hidden Weights \(W_1\) | Weights connecting input to hidden | (2, 3) | [[-0.6, 0.6, 0.4], [-0.6, 0.7, 0.5]] |
| Hidden Biases \(b_1\) | Biases of hidden layer | (1, 3) | [[0.3, 0.5, 0.3]] |
| Hidden → Output Weights \(W_2\) | Weights connecting hidden to output | (3, 2) | [[-0.7, 0.7], [0.8, -0.8], [0.5, -0.5]] |
| Output Biases \(b_2\) | Biases of output layer | (1, 2) | [[-0.4, 0.4]] |
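In NumPy, these example parameters can be written directly as arrays (shapes shown in the comments):

import numpy as np

# the example parameter values from the table above
W1 = np.array([[-0.6, 0.6, 0.4],
               [-0.6, 0.7, 0.5]])     # shape (2, 3)
b1 = np.array([[0.3, 0.5, 0.3]])      # shape (1, 3)
W2 = np.array([[-0.7,  0.7],
               [ 0.8, -0.8],
               [ 0.5, -0.5]])         # shape (3, 2)
b2 = np.array([[-0.4,  0.4]])         # shape (1, 2)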

Let's say we have [0.5, 0.5] as the input.

Step 1: Input → Hidden Layer

Compute hidden layer pre-activation:

\[ z_1 = x \cdot W_1 + b_1 = \begin{bmatrix}0.5, 0.5\end{bmatrix} \cdot \begin{bmatrix}-0.6, 0.6, 0.4\\-0.6, 0.7, 0.5\end{bmatrix} + \begin{bmatrix}0.3, 0.5, 0.3\end{bmatrix} = \begin{bmatrix}-0.3, 1.15, 0.75\end{bmatrix} \]

Step 2: Apply ReLU activation

Apply ReLU activation function:

\[ a_1 = \text{ReLU}(z_1) = \text{ReLU}(\begin{bmatrix}-0.3, 1.15, 0.75\end{bmatrix}) = \begin{bmatrix}0, 1.15, 0.75\end{bmatrix} \]
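Steps 1 and 2 in NumPy, reusing W1 and b1 from above, reproduce the hand calculation:

x = np.array([[0.5, 0.5]])     # shape (1, 2)

z1 = x @ W1 + b1               # [[-0.3, 1.15, 0.75]]
a1 = np.maximum(0, z1)         # ReLU: [[0.0, 1.15, 0.75]]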

Step 3: Hidden → Output Layer

Compute output layer pre-activation:

\[ z_2 = a_1 \cdot W_2 + b_2 = \begin{bmatrix}0, 1.15, 0.75\end{bmatrix} \cdot \begin{bmatrix}-0.7, 0.7\\0.8, -0.8\\0.5, -0.5\end{bmatrix} + \begin{bmatrix}-0.4, 0.4\end{bmatrix} = \begin{bmatrix}0.895, -0.895\end{bmatrix} \]

Step 4: Apply Softmax activation

Softmax formula:

\[ \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]

First, I'll compute \( e^{z_i} \) for each output neuron:

\( e^{0.895} \approx 2.447 \) and \( e^{-0.895} \approx 0.408 \)

Then, I'll compute the sum of these values:

\( \sum_{j=1}^{K} e^{z_j} = 2.447 + 0.408 \approx 2.855 \)

Finally, I'll divide each \( e^{z_i} \) by this sum:

\[ a_2 = \text{softmax}(z_2) = \begin{bmatrix}\frac{2.447}{2.855}, \frac{0.408}{2.855}\end{bmatrix} \approx \begin{bmatrix}0.857, 0.143\end{bmatrix} \]

We can verify that these values sum to 1: 0.857 + 0.143 = 1

In the end, this transformation has converted our original values into a probability distribution, where the first class has about an 85.69% probability and the second class has about a 14.31% probability.
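Steps 3 and 4 in NumPy, continuing from a1 above. Subtracting the row-wise maximum before exponentiating is a common numerical-stability trick that does not change the result; it is not part of the hand calculation.

def softmax(z):
    # subtract the row-wise max for numerical stability (the result is unchanged)
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

z2 = a1 @ W2 + b2              # [[0.895, -0.895]]
a2 = softmax(z2)               # approximately [[0.857, 0.143]]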

And that's just one forward pass through the network! Imagine doing this for thousands of samples in a dataset.

Step 5: Backpropagation

Backpropagation is the process of updating the weights and biases of the network based on the error between the predicted output and the actual output.

Let's go back to the previous result, \( \begin{bmatrix}0.857, 0.143\end{bmatrix} \)

Let's say the true label is class 1 (indexing from 0, so the classes are 0 and 1). The one-hot label is then \( \begin{bmatrix}0, 1\end{bmatrix} \).

In this case, the model is not performing well, since it assigns a high probability to the wrong class.

There are several common loss functions, such as MSE for regression. Since this is a classification problem, we will use cross-entropy.

Cross-entropy loss function:

\[ L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) \]

Where \( y \) is the true label, \( \hat{y} \) is the predicted label, and \( C \) is the number of classes.

Since only \( y_1 \) is 1 and the rest are 0:

\[ L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) = -y_0 \log(\hat{y}_0) - y_1 \log(\hat{y}_1) = -0 \cdot \log(0.857) - 1 \cdot \log(0.143) \approx 1.944 \]
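The same loss in NumPy, with a small epsilon added inside the log as a safeguard against log(0) (not needed in the hand calculation):

y_true = np.array([[0, 1]])                     # one-hot label for class 1

eps = 1e-12                                     # guards against log(0)
loss = -np.sum(y_true * np.log(a2 + eps))       # approximately 1.944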

Now, we need to compute the gradients of the loss function with respect to the weights and biases.

It can be a bit much since we are using 2 different activation functions, so buckle up.

First, let's recall the notation: \( x \) is the input, \( z_1 \) and \( a_1 \) are the hidden layer's pre-activation and activation, \( z_2 \) and \( a_2 \) are the output layer's pre-activation and activation, and \( y \) is the one-hot true label.

Let's calculate the gradients via backpropagation:

1. Compute the gradient of the loss with respect to the output layer activation:

For cross-entropy loss with softmax output, if \( y \) is the one-hot encoded vector of the true class, then:

\[ \frac{\partial L}{\partial a_2} = -\frac{y}{a_2} \]

2. Compute the gradient of the loss with respect to the pre-activation of the output layer:

A key property of softmax combined with cross-entropy loss is that:

\[ \frac{\partial L}{\partial z_2} = a_2 - y \]

This is a convenient simplification. So if \( y = \begin{bmatrix}0, 1\end{bmatrix} \) and \( a_2 = \begin{bmatrix}0.857, 0.143\end{bmatrix} \), then:

\[ \frac{\partial L}{\partial z_2} = a_2 - y = \begin{bmatrix}0.857, 0.143\end{bmatrix} - \begin{bmatrix}0, 1\end{bmatrix} = \begin{bmatrix}0.857, -0.857\end{bmatrix} \]

3. Compute the gradient of the loss with respect to the weights of the output layer:

Using the chain rule:

\[ \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2} = a_1^T \cdot \frac{\partial L}{\partial z_2} \]

Where \( a_1^T \) is the transpose of the hidden layer activation. Writing the product in this order makes the gradient the same shape as \( W_2 \), which is (3, 2).

So:

\[ \frac{\partial L}{\partial W_2} = \begin{bmatrix}0, 1.15, 0.75\end{bmatrix}^T \cdot \begin{bmatrix}0.857, -0.857\end{bmatrix} = \begin{bmatrix}0 \cdot 0.857, 0 \cdot (-0.857)\\1.15 \cdot 0.857, 1.15 \cdot (-0.857)\\0.75 \cdot 0.857, 0.75 \cdot (-0.857)\end{bmatrix} \approx \begin{bmatrix}0, 0\\0.986, -0.986\\0.643, -0.643\end{bmatrix} \]

4. Compute the gradient of the loss with respect to the biases of the output layer:

Similarly:

\[ \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial b_2} = \frac{\partial L}{\partial z_2} \cdot 1 = \frac{\partial L}{\partial z_2} = \begin{bmatrix}0.857, -0.857\end{bmatrix} \]
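As a quick check, here are steps 2 through 4 of the backward pass in NumPy, reusing a1, a2, and y_true from above:

dz2 = a2 - y_true          # [[0.857, -0.857]]
dW2 = a1.T @ dz2           # shape (3, 2), same as W2
db2 = dz2.copy()           # shape (1, 2), same as b2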

5. Compute the gradient of the loss with respect to the hidden layer activation:

\[ \frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} = \frac{\partial L}{\partial z_2} \cdot W_2^T = \begin{bmatrix}0.857, -0.857\end{bmatrix} \cdot \begin{bmatrix}-0.7, 0.8, 0.5\\0.7, -0.8, -0.5\end{bmatrix} \approx \begin{bmatrix}-1.2, 1.37, 0.857\end{bmatrix} \]

6. Compute the gradient of the loss with respect to the pre-activation of the hidden layer:

For ReLU, the gradient is 1 for positive values and 0 otherwise, applied elementwise:

\[ \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot \frac{\partial a_1}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot 1_{\{z_1 > 0\}} \]

Where \( 1_{\{z_1 > 0\}} \) is an indicator that is 1 where \( z_1 > 0 \) and 0 otherwise, and \( \odot \) denotes the elementwise product.

So, with \( z_1 = \begin{bmatrix}-0.3, 1.15, 0.75\end{bmatrix} \):

\[ \frac{\partial L}{\partial z_1} = \begin{bmatrix}-1.2, 1.37, 0.857\end{bmatrix} \odot \begin{bmatrix}0, 1, 1\end{bmatrix} = \begin{bmatrix}0, 1.37, 0.857\end{bmatrix} \]

7. Compute the gradient of the loss with respect to the weights of the hidden layer:

Similar to step 3:

\[ \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1} = x^T \cdot \frac{\partial L}{\partial z_1} = \begin{bmatrix}0.5, 0.5\end{bmatrix}^T \cdot \begin{bmatrix}0, 1.37, 0.857\end{bmatrix} = \begin{bmatrix}0.5 \cdot 0, 0.5 \cdot 1.37, 0.5 \cdot 0.857\\0.5 \cdot 0, 0.5 \cdot 1.37, 0.5 \cdot 0.857\end{bmatrix} \approx \begin{bmatrix}0, 0.685, 0.428\\0, 0.685, 0.428\end{bmatrix} \]

8. Compute the gradient of the loss with respect to the biases of the hidden layer:

Similar to step 4:

\[ \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = \frac{\partial L}{\partial z_1} \cdot 1 = \frac{\partial L}{\partial z_1} = \begin{bmatrix}0, 1.37, 0.857\end{bmatrix} \]
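And steps 5 through 8 in NumPy; the boolean mask (z1 > 0) plays the role of the indicator function:

da1 = dz2 @ W2.T           # approximately [[-1.2, 1.37, 0.857]]
dz1 = da1 * (z1 > 0)       # ReLU gradient mask: [[0.0, 1.37, 0.857]]
dW1 = x.T @ dz1            # shape (2, 3), same as W1
db1 = dz1.copy()           # shape (1, 3), same as b1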

And that's it! We have computed the gradients of the loss function with respect to the weights and biases of the network. Now, we can use these gradients to update the weights and biases using an optimization algorithm like gradient descent.

Step 6: Update Weights and Biases

We can update the weights and biases of the network using gradient descent.

The update rule for gradient descent is:

\[ \begin{align} W = W - \eta \cdot \frac{\partial L}{\partial W} \\\\ b = b - \eta \cdot \frac{\partial L}{\partial b} \end{align} \]

Where \( \eta \) is the learning rate.

So for example, if we have a learning rate of 0.01, we can update the weights or biases as follows:

\[ \begin{align} W_1 = W_1 - 0.01 \cdot \frac{\partial L}{\partial W_1} = W_1 - 0.01 \cdot \begin{bmatrix}0, 0.685, 0.428\\0, 0.685, 0.428\end{bmatrix} = W_1 - \begin{bmatrix}0, 0.00685, 0.00428\\0, 0.00685, 0.00428\end{bmatrix}\\ \approx \begin{bmatrix}-0.6, 0.6, 0.4\\-0.6, 0.7, 0.5\end{bmatrix} - \begin{bmatrix}0, 0.00685, 0.00428\\0, 0.00685, 0.00428\end{bmatrix} \approx \begin{bmatrix}-0.6, 0.59315, 0.39572\\-0.6, 0.69315, 0.49572\end{bmatrix} \end{align} \]
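In code, the whole update is a few in-place subtractions, using the gradients computed above and a learning rate of 0.01:

lr = 0.01                  # learning rate (eta)

W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2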

Step 7: Repeat

Repeat the forward pass, loss calculation, backpropagation, and weight update steps for each batch of data in the training set until the epoch limit is reached.
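Putting it all together, here is a minimal full-batch training loop over the toy dataset from earlier. It is a sketch under the assumptions already noted (the blob dataset, 100 epochs, full-batch updates); the notebook may differ in these details.

# reuses X, y, the parameters W1, b1, W2, b2, lr, and softmax() from above
n_classes = 2
Y = np.eye(n_classes)[y]                        # one-hot labels, shape (200, 2)

for epoch in range(100):
    # forward pass
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)
    z2 = a1 @ W2 + b2
    a2 = softmax(z2)

    # mean cross-entropy loss over the batch
    loss = -np.mean(np.sum(Y * np.log(a2 + 1e-12), axis=1))
    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

    # backward pass, with gradients averaged over the batch
    dz2 = (a2 - Y) / len(X)
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2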

Conclusion

And there you have it. A simple neural network implemented from scratch using Python and NumPy. This is a basic example, but it covers the main calculations of how a neural network works.

In the next blog, I will show you how to build a simple MLP using PyTorch and train it on the MNIST dataset, a classic dataset for image classification.

Back to the top page of this series

Back to Home