Welcome to day one. Before I introduce tokenizers, transformers, or training loops, we start where almost all modern machine learning starts: the neural network. Think of the first day as laying down the foundation you will reuse for the next twenty-nine days. If you have ever felt that neural networks sound like a black box, this post is for you. We will use a simple picture, asking: is this a dog or a cat? We will walk through what actually happens inside the model, in plain language.

What is a neural network?

A neural network is made of layers. Each layer has many small units. Data flows in one direction: each unit takes numbers from the previous layer, updates them, and sends new numbers forward. During training, the network adjusts itself so its outputs get closer to the correct answers on example data. It is not programmed rule by rule. It learns from examples.

Input, hidden, and output layers

The diagram below shows the usual three layer types:
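To make the one-direction flow concrete, here is a minimal sketch of numbers moving through a tiny network in plain Python. The layer sizes, weights, biases, and the ReLU activation are all assumptions chosen for illustration, not values from the post:

```python
# A tiny network: 2 inputs -> 2 hidden units -> 1 output.
# All weights and biases here are made up for illustration.

def relu(x):
    # A common activation: pass positives through, zero out negatives.
    return max(0.0, x)

def layer(inputs, weights, biases, activation):
    # Each unit combines all numbers from the previous layer with its
    # own weights, adds its bias, then applies the activation.
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

x = [0.5, -1.0]  # e.g. two features extracted from a picture
hidden = layer(x, [[0.2, -0.8], [-0.4, 0.3]], [0.1, 0.0], relu)
output = layer(hidden, [[1.0, -1.0]], [0.0], relu)
print(output)
```

Each call to `layer` is one hop forward; stacking calls is all "data flows in one direction" means in code.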
This pattern of learning simple patterns first and bigger patterns later shows up again in language models, even when the internals look different.

Weights, bias, activation, loss

These four pieces appear in almost every network.
Now you decide: if the combined score crosses a threshold, call it a dog; otherwise, call it a cat.
That decision rule is called the activation function. Think of it like a decision switch.
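The decision switch can be sketched as a single unit in plain Python. The features, weights, bias, and the hard step activation below are made-up values for illustration only:

```python
# One unit: weighted sum of features, plus bias, then a decision switch.

def step(z):
    # Decision switch: fire (1) only if the combined signal clears zero.
    return 1 if z > 0 else 0

def unit(features, weights, bias):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return step(z)

# Say output 1 means "dog" and 0 means "cat" (labels are illustrative).
# Hypothetical features: [size, sound pitch].
print(unit([0.9, 0.2], weights=[1.0, -1.5], bias=-0.4))
print(unit([0.1, 0.9], weights=[1.0, -1.5], bias=-0.4))
```

Real networks use smooth switches such as ReLU or sigmoid rather than a hard step, so gradients can flow through them later.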
The learning process is simple. The model makes a prediction, calculates the loss, and then adjusts the weights and bias to reduce the error. This process is repeated many times until the model becomes good at making predictions. In short, weights decide importance, bias adjusts the output, the activation function makes the decision, and loss tells the model how wrong it is so it can improve.

How Neural Networks Reduce Error (Backpropagation)

Now that we understand loss, the next question is: How does the model actually reduce this error? This is where backpropagation comes into the picture.
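Before tracing the backward pass, it helps to pin the loss down as code. This is a minimal sketch assuming squared error on a 0/1 label; the post does not commit to a particular loss function, and the numbers are invented:

```python
# Loss: a single number measuring "how wrong" the prediction was.

def squared_error(prediction, target):
    # Zero when the prediction matches the target; grows as it drifts away.
    return (prediction - target) ** 2

# The model says 0.9 "dog-ness" but the picture was a cat (target 0):
print(squared_error(0.9, 0.0))  # a large error
print(squared_error(0.1, 0.0))  # a much smaller error
```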
Think of it like this. Suppose the model predicted a dog, but the correct answer was a cat. The model now asks, “Which feature misled me the most?” Maybe it gave too much importance to size and ignored sound. So it slightly reduces the weight for size and increases the weight for sound.

This adjustment is not done randomly. It is guided by something called gradients. A gradient tells us how much a small change in a weight or bias will affect the loss. In simple terms, it shows the direction in which we should move to reduce the error. Once we know the direction, we update the weights and bias using a small step. This step size is controlled by a parameter called the learning rate. If the learning rate is too high, the model might overshoot the correct solution. If it is too small, learning becomes very slow.

This whole process happens layer by layer, starting from the output and moving backward toward the input. That is why it is called backpropagation. So the full learning cycle looks like this:

1. Forward pass: the model makes a prediction.
2. Loss: measure how wrong the prediction is.
3. Backward pass: compute gradients for every weight and bias.
4. Update: move each weight and bias a small step, set by the learning rate, in the direction that reduces the loss.
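The update rule described above can be sketched as a tiny gradient-descent loop on a single weight. The squared-error loss, the single training example, and all numbers are assumptions chosen for illustration:

```python
# Gradient descent on one weight, one example, squared-error loss.

def loss(w, x, target):
    return (w * x - target) ** 2

def gradient(w, x, target):
    # d/dw of (w*x - target)^2 is 2 * (w*x - target) * x:
    # how much the loss changes per small change in w.
    return 2 * (w * x - target) * x

w, x, target = 0.0, 1.0, 1.0
learning_rate = 0.1  # the step size: too big overshoots, too small crawls

for _ in range(50):
    w -= learning_rate * gradient(w, x, target)

print(w, loss(w, x, target))  # w approaches 1.0, loss approaches 0
```

Try setting `learning_rate` to 11.0 to watch the overshooting the post warns about: the loss grows instead of shrinking.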
This process repeats many times until the model becomes better and the loss becomes smaller. In short, backpropagation is the method that helps the neural network learn by adjusting its weights and bias in the right direction to reduce errors.

Connection to language models

A large language model is still a neural network: layers, parameters, nonlinearities, a loss, and updates from gradients. The task becomes next-token prediction instead of image labels, and the loss is often cross-entropy. The forward pass, loss, backward pass, and update rhythm are the same. This article used classification to build intuition. Upcoming posts switch the setting to text and tokens, but the training story you read here still applies. Day 2 moves from concepts to code. We will look at PyTorch: tensors, how networks are expressed in code, and how the training loop fits together in practice.
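As a small taste of the next-token setting, here is a hedged sketch of cross-entropy over a toy three-word vocabulary. The vocabulary, logits, and prompt are invented for illustration; real models do the same computation over tens of thousands of tokens:

```python
import math

def cross_entropy(logits, target_index):
    # Softmax turns raw scores into probabilities; the loss is the
    # negative log of the probability given to the correct next token.
    total = sum(math.exp(l) for l in logits)
    prob_target = math.exp(logits[target_index]) / total
    return -math.log(prob_target)

# Toy prompt "the cat ___" with vocab ["sat", "dog", "tree"];
# the correct next token is "sat" (index 0).
confident = cross_entropy([4.0, 0.5, 0.1], target_index=0)
unsure = cross_entropy([0.3, 0.5, 0.1], target_index=0)
print(confident, unsure)  # lower loss when the model favors the right token
```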
30 Days of Building a Small Language Model — Day 1: Neural Networks
Reddit r/LocalLLaMA / 4/4/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- As Day 1 of "30 Days of Building a Small Language Model", the article explains the basics of neural networks in plain language before getting into tokenizers and training loops.
- A neural network is built from input, hidden, and output layers; data flows in one direction, with each unit updating the values it receives from the previous layer and passing new values forward.
- During training, the network adjusts itself so its outputs move closer to the correct answers; it learns from examples rather than being programmed rule by rule.
- Weights, bias, activation, and loss are the four elements that appear in almost every neural network, and the article gives an intuitive account of each one's role.