July 31, 2025
As of today, large language models (LLMs) power everything from chatbots and search engines to code assistants, yet many developers still struggle with the foundational terminology and sizing calculations needed to select and deploy the right model. Even if you've read about attention mechanisms or transformer blocks, terms like parameters, FP16, INT8, and 1‑bit quantization can feel abstract, and overlooking them may lead to costly infrastructure choices or unexpected inference slowdowns.
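In this blog, we'll cover some core concepts that every Computer Science student will benefit from: model parameters, floating-point precision (FP32 and FP16), how model size is calculated, and 1-bit quantization.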
Let's start by understanding model parameters. To put it in the simplest terms, every weight and bias in a neural network is called a parameter. The goal of training an AI model is to discover the set of weights and biases that gives the best performance. In the context of LLMs, parameters primarily refer to the weights in the attention layers, feed-forward networks, and layer normalizations that define the model's behavior. Each parameter (weight or bias) is iteratively adjusted during training (e.g., via gradient descent) to minimize prediction error. LLMs today often contain billions of parameters (GPT‑3, for example, has 175 billion). More parameters generally improve a model's ability to capture complex patterns and long-range dependencies, but they also increase the memory, storage, and compute needed to train and run the model.
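To make "every weight and bias is a parameter" concrete, here is a minimal sketch that counts the parameters of a small fully connected network. The layer sizes are hypothetical and chosen only for illustration; real LLM layers are far larger and structured differently.

```python
def dense_layer_params(inputs: int, outputs: int) -> int:
    """Parameters in one fully connected layer."""
    weights = inputs * outputs  # one weight per input-output connection
    biases = outputs            # one bias per output unit
    return weights + biases

# Hypothetical tiny network: 784 inputs -> 128 hidden units -> 10 outputs.
layer_sizes = [784, 128, 10]
total = sum(
    dense_layer_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
print(total)  # 101770
```

Even this toy network has about a hundred thousand parameters, which gives a feel for how quickly the count grows with layer width.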
When you download an LLM from repositories like Hugging Face, you often see a model.safetensors file. This format stores parameter tensors (trained weights and biases) in a simple binary layout that loads quickly and, unlike pickle-based checkpoints, cannot run arbitrary code when the file is opened. Recently, the GGUF format has emerged as an alternative, offering cross-library compatibility. You may already know it if you have worked with llama.cpp, but we'll save a deeper dive on GGUF for another time.
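As a rough illustration of what sits inside such a file, the sketch below reads only the header of a safetensors file using the standard library. It assumes the published layout (an 8-byte little-endian header length, then a JSON header, then raw tensor bytes); the function names are hypothetical, and in practice you would normally use the official safetensors library instead.

```python
import json
import struct
from math import prod

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to dtype, shape, offsets.
        return json.loads(f.read(header_len).decode("utf-8"))

def count_parameters(header: dict) -> int:
    """Sum the element counts of every tensor listed in the header."""
    return sum(
        prod(info["shape"])
        for name, info in header.items()
        if name != "__metadata__"  # optional metadata entry, not a tensor
    )

# Example usage (assumes a local file named model.safetensors exists):
# header = read_safetensors_header("model.safetensors")
# print(count_parameters(header))
```

Because only the header is read, this works on multi-gigabyte checkpoints without loading the weights into memory.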
This section covers a fundamental concept in computer science that every CS student should be familiar with: floating-point precision. Here, we'll dive into what it actually means to store values in FP32 or FP16 formats in the context of LLMs.
By now, it should be clear that training a large language model involves adjusting weights and biases to produce the final model parameters, which are often saved in formats like safetensors. The terms FP32 and FP16 refer to the precision with which these numerical parameters are stored.
All weights and biases are ultimately numbers. Anybody can tell that the value 3.1415926535 is far more precise than just 3.14. FP32 stores each number in 32 bits, which gives about 7 decimal digits of precision. FP16 uses only 16 bits and manages roughly 3 decimal digits, making it less accurate but more memory-efficient.
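To see the difference in practice, here is a small sketch that stores 3.1415926535 in each format and reads it back. It relies on Python's standard `struct` module, whose `f` and `e` format codes correspond to IEEE 754 single and half precision; the helper name `round_trip` is made up for this example.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store value with the given struct format ('<f' = FP32, '<e' = FP16) and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

x = 3.1415926535
as_fp32 = round_trip(x, "<f")
as_fp16 = round_trip(x, "<e")

print(f"original: {x!r}")
print(f"FP32:     {as_fp32!r}  (error {abs(x - as_fp32):.1e})")
print(f"FP16:     {as_fp16!r}  (error {abs(x - as_fp16):.1e})")
```

The FP16 result comes back as 3.140625, so only the first three significant digits survive, while the FP32 error is several orders of magnitude smaller.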
Now you might wonder: "How can such a large range of numbers be stored in just 32 bits (4 bytes) or 16 bits (2 bytes)?" Let's break that down next.
At the heart of FP16 and FP32 is a standardized format for representing floating-point numbers known as IEEE 754. Whether you're storing a weight like 0.002453 or a massive value like 3.2e8 (3.2 multiplied by 10 to the power of 8, which fits in FP32 but exceeds FP16's largest finite value of 65504), these formats use a clever trick: scientific notation in binary. Both FP16 and FP32 follow this general structure:
[ sign bit ][ exponent ][ mantissa ]
| Format | Total Bits | Sign | Exponent | Mantissa (a.k.a. significand/fraction) |
|---|---|---|---|---|
| FP16 | 16 | 1 | 5 bits | 10 bits |
| FP32 | 32 | 1 | 8 bits | 23 bits |
So even with just 16 or 32 bits, you can represent a wide dynamic range by adjusting the exponent — much like how 1.23 × 10^5 is vastly different from 1.23 × 10^-5, even though the digits 1.23 are the same.
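The bit widths in the table are enough to derive each format's approximate precision and range. The sketch below does that arithmetic using the standard IEEE 754 rules (an implicit leading 1 and a bias of 2^(exponent bits - 1) - 1); the function name and dictionary keys are hypothetical.

```python
import math

def describe_format(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive headline properties of an IEEE 754 binary format from its field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The implicit leading 1 adds one bit of precision on top of the stored mantissa.
    decimal_digits = (mantissa_bits + 1) * math.log10(2)
    # Largest finite value: all mantissa bits set, largest non-reserved exponent.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Smallest positive normal value: mantissa 1.0 with the smallest normal exponent.
    min_normal = 2.0 ** (1 - bias)
    return {
        "bias": bias,
        "decimal_digits": round(decimal_digits, 1),
        "max_finite": max_finite,
        "min_normal": min_normal,
    }

for name, (e_bits, m_bits) in {"FP16": (5, 10), "FP32": (8, 23)}.items():
    print(name, describe_format(e_bits, m_bits))
```

This yields roughly 3.3 decimal digits and a maximum of 65504 for FP16, against roughly 7.2 digits and a maximum near 3.4e38 for FP32.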
Storing 5.75 in FP32

Let's walk through storing 5.75 in FP32 to show how elegant this system is. First, write the number in binary (4 + 1 + 0.5 + 0.25):
5.75 → 101.11
Just like scientific notation in base-10 (e.g. 1.23 × 10^4), we do the same in base-2:
101.11 = 1.0111 × 2^2
We shift the binary point 2 places left to get 1.0111.
So now we have a mantissa of 1.0111 and an exponent of 2 (because of the 2-position shift).

FP32 uses an 8-bit exponent with a bias of 127 to allow for both positive and negative exponents.
Actual exponent = 2
Encoded exponent = 2 + 127 = 129
Binary of 129 = 10000001
So now we have the [sign] bit and the [exponent] bits. Let's calculate the [mantissa].
Drop the leading 1. from the mantissa:
0111
Pad with zeros to make it 23 bits:
01110000000000000000000
Now combine all parts:
Sign bit: 0 (positive number)
Exponent: 10000001
Mantissa: 01110000000000000000000

Final 32-bit FP32 representation:
0 10000001 01110000000000000000000
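The hand calculation can be checked with a few lines of Python. This is a sketch using the standard `struct` module to reinterpret the 4 bytes of an FP32 value as an integer; the helper name `fp32_fields` is made up for this example.

```python
import struct

def fp32_fields(value: float) -> tuple:
    """Return the (sign, exponent, mantissa) bit strings of value stored as FP32."""
    # Pack as big-endian FP32, then reinterpret the same 4 bytes as an unsigned integer.
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{as_int:032b}"
    return bits[0], bits[1:9], bits[9:]

sign, exponent, mantissa = fp32_fields(5.75)
print(sign, exponent, mantissa)  # 0 10000001 01110000000000000000000
print(int(exponent, 2) - 127)    # 2, the actual exponent after removing the bias
```

Trying other values such as -5.75 or 0.1 shows how the sign bit flips and how numbers that are not clean sums of powers of 2 fill the whole mantissa.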
And there you go: a real number packed into just 4 bytes, with enough precision to keep all its significant digits intact. This ability to cover a huge range of values with good precision in such a compact format is exactly why floating-point representation is so critical in deep learning.
Can FP16 store 5.75 just fine?

Yes! That's true.
FP16 can represent 5.75 perfectly because it's a binary-friendly number (i.e., a clean sum of a few powers of 2). The roughly 3-digit precision refers to how many decimal digits of an arbitrary value survive before rounding becomes a problem.
So, while FP16 handles values like 1.5, 2.25, or even 5.75 without issue…
It will struggle with values like:
3.1415926535 → becomes 3.140625 (about 3.14)
0.000123456 → might round to 0.00012 or worse
Small gradient updates (e.g., 1e-8) might vanish altogether
This is why lower precision means lower representational fidelity, especially for small updates during training. And because large language models have billions of weights, usually stored as tensors, the choice of format applies to every one of them.
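These rounding effects are easy to reproduce. The sketch below rounds a few values to FP16 with the standard `struct` module (format code `e`); the helper name `to_fp16` and the weight and update values at the end are hypothetical.

```python
import struct

def to_fp16(value: float) -> float:
    """Round value to the nearest FP16 number and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Sums of a few powers of 2 survive exactly.
for v in (1.5, 2.25, 5.75):
    print(v, "->", to_fp16(v))

# Values needing more than about 3 significant decimal digits get rounded.
print(0.000123456, "->", to_fp16(0.000123456))

# 1e-8 is far below FP16's smallest positive value (about 6e-8), so it rounds to zero.
print(1e-8, "->", to_fp16(1e-8))

# A small update can also vanish when it is added to a much larger weight.
weight, update = 1.0, 1e-4
print(to_fp16(weight + update) == weight)  # True
```

The last line is the practical worry for training: the update is representable on its own, but it is lost once added to a weight near 1.0.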
But it's a tradeoff: what FP16 gives up in accuracy, it gains in memory efficiency.
Every parameter in an LLM must be stored as a numerical value, and just like storing files or images, the format you choose affects both storage space and speed. Numeric precision refers to the number of bits used to represent each parameter.
Here's a breakdown of common formats used in LLMs:
| Format | Bits per parameter | Description |
|---|---|---|
| FP32 | 32 | Standard single-precision floating point. Accurate but memory-intensive. |
| FP16 (half) | 16 | Half the size of FP32. Good balance between performance and memory, supported by most GPUs. |
| INT8 | 8 | Integer quantization. Compresses values to 8 bits. Faster inference with a slight accuracy drop. |
| 1‑bit | 1 | Extreme quantization. Only stores the sign of a parameter (positive/negative). Ultra-compact. |
To compute storage size:
Size (bytes) = (# parameters) × (bits per parameter) / 8
Size (GiB) = Size (bytes) / 1024³
Example: A 7 billion‑parameter model in FP32:
Size = 7e9 params × 32 bits/param = 224e9 bits
= 224e9 / 8 = 28e9 bytes ≈ 26.1 GiB
In FP16:
Size = 7e9 × 16 / 8 = 14e9 bytes ≈ 13.0 GiB
And in INT8:
Size = 7e9 × 8 / 8 = 7e9 bytes ≈ 6.5 GiB
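The same arithmetic fits in a small helper, sketched below. The function name is hypothetical, and the result covers the weights only; activations and other runtime buffers need memory on top of this.

```python
def model_size_gib(num_params: float, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    size_bytes = num_params * bits_per_param / 8
    return size_bytes / 1024**3

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("1-bit", 1)]:
    print(f"{name:>5}: {model_size_gib(7e9, bits):.1f} GiB")
```

For a 7 billion-parameter model this reproduces the 26.1, 13.0, and 6.5 GiB figures above, and shows that a 1-bit version would need under 1 GiB.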
In traditional floating-point representations (like FP32 or even FP16), each weight in a neural network is a real number, possibly something like 0.2341 or -3.4567. These weights take up 32 or 16 bits of memory each.
1-bit quantization, on the other hand, simplifies each weight to only two possible values:
W ∈ {+1, -1}
So instead of 32 bits, each weight is stored using just 1 bit:
1 → +1
0 → -1
This cuts the memory footprint by 32× compared to FP32 and allows highly efficient computation, especially on specialized hardware.
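To illustrate where the 32× figure comes from, here is a minimal sketch that packs eight weights into a single byte using the 1 → +1, 0 → -1 mapping. It uses the sign of each weight to pick the bit; the function names and example weights are hypothetical, and real implementations pack bits with vectorized routines rather than a Python loop.

```python
def pack_signs(weights: list) -> bytes:
    """Pack one bit per weight: 1 for non-negative (+1), 0 for negative (-1)."""
    packed = bytearray((len(weights) + 7) // 8)
    for i, w in enumerate(weights):
        if w >= 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def unpack_signs(packed: bytes, count: int) -> list:
    """Recover the +1/-1 values from the packed bits."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(count)]

weights = [0.2341, -3.4567, 0.5, -0.01, 1.2, -0.7, 0.0, 2.0]
packed = pack_signs(weights)

print(len(weights) * 4, "bytes as FP32 ->", len(packed), "byte packed")
print(unpack_signs(packed, len(weights)))
```

Eight FP32 weights occupy 32 bytes, while their packed signs occupy 1 byte, which is the 32× reduction.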
The process is simple. You begin with a full-precision model (FP32 or FP16) and train it as usual, so the weights can take any real value. After training (or during fine-tuning), you apply the 1-bit quantization step by replacing each weight with either +1 or -1. This is usually done based on the sign of the original weight:
Q(w) = sign(w) = +1 if w ≥ 0 else -1
The result is a model up to 32× smaller than its FP32 version, which makes 1-bit quantization attractive for memory-constrained environments like microcontrollers and edge devices. Multiplications can be replaced with simple operations such as addition/subtraction, or XNOR when the activations are binarized as well, which offers major speed and energy-efficiency advantages. However, this comes at a cost: you lose all magnitude information, leading to degraded model performance unless the model is carefully retrained or fine-tuned.
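Here is a minimal sketch of the sign rule together with an XNOR-based dot product. It assumes the inputs are binarized to +1/-1 as well (otherwise the product reduces to additions and subtractions instead); all function names and example values are hypothetical, and production kernels run this on packed machine words rather than Python integers.

```python
def binarize(values):
    """Q(w) = +1 if w >= 0 else -1, applied element-wise."""
    return [1 if v >= 0 else -1 for v in values]

def to_bits(signs):
    """Encode a +1/-1 vector as an integer: bit i is 1 for +1 and 0 for -1."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n from their bit encodings."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * agree - n  # each agreement adds +1, each disagreement adds -1

w = binarize([0.2341, -3.4567, 0.9, -0.1])
x = binarize([1.0, 2.0, -0.5, -4.0])

reference = sum(wi * xi for wi, xi in zip(w, x))
fast = xnor_dot(to_bits(w), to_bits(x), len(w))
print(reference, fast)
```

Both approaches give the same result, but the XNOR version never multiplies. Note how the magnitudes 0.2341 and -3.4567 collapse to +1 and -1, which is exactly the information loss described above.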
If you found this explanation helpful, please share it with your friends and colleagues! If you have any questions or suggestions for future topics, feel free to ping me on Twitter / X or LinkedIn.
