LLM math for nerds

Large language models (LLMs) power everything from chatbots and search engines to code assistants, yet many developers still struggle with the foundational terminology and sizing calculations needed to select and deploy the right model. Even if you've read about attention mechanisms or transformer blocks, terms like parameters, FP16, INT8, and 1‑bit quantization can feel abstract, and overlooking them can lead to costly infrastructure choices or unexpected inference slowdowns.

In this post, we'll work through a few core concepts that every Computer Science student should be comfortable with:

What exactly a parameter is in an LLM and why its count matters.
The anatomy of floating-point numbers and how they are stored in memory.
How various numeric formats (FP32, FP16, INT8, 1‑bit) store model weights.
Step‑by‑step formulas for calculating memory footprint.

Model Parameters

Let's start by understanding model parameters. In the simplest terms, every weight and bias in a neural network is a parameter. The goal of training an AI model is to discover the set of weights and biases that gives the best performance. In the context of LLMs, parameters primarily refer to the weights in the attention layers, feed-forward networks, and layer normalizations that define the model's behavior. Each parameter is iteratively adjusted during training (e.g., via gradient descent) to minimize prediction error. LLMs today often contain billions of parameters (GPT‑3, for example, has 175 billion). More parameters generally improve a model's ability to capture complex patterns and long-range dependencies, but they also increase:

Storage requirements: Every weight must be saved (e.g., to disk or cloud storage).
Compute cost: Larger matrix multiplications and more layers mean longer runtimes.
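
To make the idea of a parameter count concrete, here is a minimal sketch that counts the weights and biases of a tiny fully connected network. The layer sizes are made up for illustration and do not come from any real model.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output connection, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical toy network: 4 inputs -> 8 hidden units -> 2 outputs.
layer_sizes = [4, 8, 2]
total = sum(
    dense_layer_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
print(total)  # (4*8 + 8) + (8*2 + 2) = 58
```

An LLM is counted the same way, only with far larger matrices and many more layers.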

When you download an LLM from a repository like Hugging Face, you will often see a model.safetensors file. This format stores the parameter tensors (trained weights and biases) in a compact binary form that loads quickly and, unlike pickle-based checkpoints, does not execute arbitrary code on load. More recently, the GGUF format has emerged as an alternative, offering cross-library compatibility. You may already know it if you have worked with llama.cpp, but we'll save a deeper dive on GGUF for another time.

Numeric Meaning

This section covers a fundamental concept in computer science that every CS student should be familiar with. Here, we'll dive into what it actually means to store values in FP32 or FP16 formats in the context of LLMs.

By now, it should be clear that training a large language model involves adjusting weights and biases to produce the final model parameters, which are often saved in formats like safetensors. The terms FP32 and FP16 refer to the precision with which these numerical parameters are stored.

All weights and biases are ultimately numbers, and the value 3.1415926535 is clearly more precise than 3.14. FP32 stores each number in 32 bits, which gives about 7 decimal digits of precision. FP16 uses only 16 bits and gives roughly 3 decimal digits of precision, making it less accurate but more memory-efficient.
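
The "about 7" and "roughly 3" figures can be estimated from the number of significand bits. This is a rough sketch, assuming the IEEE 754 layouts in which FP32 stores 23 mantissa bits and FP16 stores 10, each with one extra implicit leading bit. The helper name `decimal_digits` is made up for this example.

```python
import math


def decimal_digits(stored_mantissa_bits: int) -> float:
    """Approximate decimal digits of precision for a binary float format."""
    effective_bits = stored_mantissa_bits + 1  # add the implicit leading 1
    return effective_bits * math.log10(2)


print(round(decimal_digits(23), 1))  # FP32: 7.2
print(round(decimal_digits(10), 1))  # FP16: 3.3
```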

Now you might wonder: "How can such a large range of numbers be stored in just 32 bits (4 bytes) or 16 bits (2 bytes)?" Let's break that down next.

The Anatomy of Floating-Point Numbers

At the heart of FP16 and FP32 is IEEE 754, a standardized format for representing floating-point numbers. Whether you're storing a small weight like 0.002453 or a massive value like 3.2e8 (3.2 multiplied by 10 to the power of 8), these formats use a clever trick: scientific notation in binary. Both FP16 and FP32 follow this general structure:

[ sign bit ][ exponent ][ mantissa ]

Format	Total Bits	Sign	Exponent	Mantissa (a.k.a. significand/fraction)
FP16	16	1	5 bits	10 bits
FP32	32	1	8 bits	23 bits

Sign (1 bit) — Determines if the number is positive or negative.
Exponent — Controls the scale (how large or small the number is).
Mantissa — Stores the significant digits of the number in binary.

So even with just 16 or 32 bits, you can represent a wide dynamic range by adjusting the exponent — much like how 1.23 × 10^5 is vastly different from 1.23 × 10^-5, even though the digits 1.23 are the same.
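
The dynamic range follows directly from the bit widths in the table. The sketch below assumes the usual IEEE 754 conventions (bias of 2^(k-1) - 1 for a k-bit exponent, with the all-zeros and all-ones exponent codes reserved for special values). The function name `format_range` is hypothetical.

```python
def format_range(exponent_bits: int, mantissa_bits: int):
    """Largest finite value and smallest positive normal value of a format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest: mantissa all ones (just under 2.0) at the top usable exponent.
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Smallest normal: significand 1.0 at the lowest usable exponent.
    smallest_normal = 2.0 ** (1 - bias)
    return largest, smallest_normal


print(format_range(5, 10))  # FP16: largest is 65504.0
print(format_range(8, 23))  # FP32: largest is about 3.4e38
```

With only 5 exponent bits, FP16 tops out at 65504, which is one reason large values can overflow in half precision.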

Example: Storing `5.75` in FP32

Let's walk through storing 5.75 in FP32 to show how elegant this system is.

1. Convert to binary:

5.75 → 101.11

2. Normalize (scientific notation):

Just like scientific notation in base-10 (e.g. 1.23 × 10^4), we do the same in base-2:

101.11 = 1.0111 × 2^2

We shift the binary point 2 places left to get 1.0111. So now we have:

Significand (Mantissa): 1.0111
Exponent: 2 (because of the 2-position shift)
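
Steps 1 and 2 can be reproduced with a short sketch. The helper names `to_binary` and `normalize` are made up for this example, and the code assumes a positive input.

```python
import math


def to_binary(x: float, max_frac_bits: int = 23) -> str:
    """Binary expansion of a non-negative number, e.g. 5.75 -> '101.11'."""
    whole = int(x)
    frac = x - whole
    digits = []
    while frac and len(digits) < max_frac_bits:
        frac *= 2            # shift the next fractional bit into the ones place
        bit = int(frac)
        digits.append(str(bit))
        frac -= bit
    return bin(whole)[2:] + ("." + "".join(digits) if digits else "")


def normalize(x: float):
    """Return (significand, exponent) with 1 <= significand < 2, for x > 0."""
    m, e = math.frexp(x)     # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1


print(to_binary(5.75))                # 101.11
significand, exponent = normalize(5.75)
print(to_binary(significand), exponent)  # 1.0111 2
```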

3. Encode the Exponent with Bias:

FP32 uses an 8-bit exponent with a bias of 127 to allow for both positive and negative exponents.

Bias for FP32 = 127
Bias for FP16 = 15

Actual exponent = 2
Encoded exponent = 2 + 127 = 129
Binary of 129 = 10000001
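
The bias is not arbitrary: for a k-bit exponent field, IEEE 754 uses 2^(k-1) - 1. A minimal sketch of the encoding step follows; the function names are illustrative, and the reserved all-zeros and all-ones exponent codes are ignored.

```python
def exponent_bias(exponent_bits: int) -> int:
    """IEEE 754 bias for an exponent field of the given width."""
    return 2 ** (exponent_bits - 1) - 1


def encode_exponent(actual_exponent: int, exponent_bits: int) -> str:
    """Add the bias and render the result as a fixed-width bit string."""
    encoded = actual_exponent + exponent_bias(exponent_bits)
    return format(encoded, f"0{exponent_bits}b")


print(exponent_bias(8), exponent_bias(5))  # 127 15
print(encode_exponent(2, 8))               # 10000001
```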

With the [exponent] bits worked out, let's calculate the [mantissa] bits.

4. Encode the Mantissa and Combine:

Drop the leading 1. from the mantissa, since it is implied and never stored:

1.0111 → 0111

Pad with zeros to make it 23 bits:

01110000000000000000000

Now combine all parts:

Sign bit: 0 (positive number)
Exponent: 10000001
Mantissa: 01110000000000000000000

Final 32-bit FP32 representation:

0 10000001 01110000000000000000000
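
You can check this result with Python's standard `struct` module, which packs a float into its raw FP32 bytes. The helper name `fp32_fields` is made up for this sketch.

```python
import struct


def fp32_fields(x: float) -> str:
    """Return the FP32 bit pattern of x as 'sign exponent mantissa'."""
    # Pack as a big-endian 32-bit float, then reread the bytes as an integer.
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(as_int, "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


print(fp32_fields(5.75))  # 0 10000001 01110000000000000000000
```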

And there you go: a real number packed into just 4 bytes with none of its significant digits lost. The same layout can hold very large and very small values alike, and that combination of range, precision, and compactness is exactly why floating-point representation is so critical in deep learning.

But Wait....Doesn't FP16 Represent `5.75` Just Fine?

Yes! That's true.

FP16 can represent 5.75 perfectly because it's a binary-friendly number (i.e., a clean sum of powers of 2). The 3-digit precision refers to how many arbitrary decimal digits it can represent before rounding becomes a problem.

So, while FP16 handles values like 1.5, 2.25, or even 5.75 without issue…

It will struggle with values like:

3.1415926535 → becomes 3.140625 (about 3.14)
0.000123456 → rounds to about 0.0001235
Small gradient updates (e.g., 1e-8) are too small for FP16 and vanish to zero
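
These cases can be tried out with the standard `struct` module, whose `e` format code is IEEE 754 half precision. This is a small sketch; the helper name `to_fp16` is made up for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (5.75, 3.1415926535, 0.000123456, 1e-8):
    print(value, "->", to_fp16(value))
```

Here 5.75 survives unchanged, 3.1415926535 comes back as 3.140625, and 1e-8 comes back as 0.0.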

This is why lower precision means less representational fidelity, especially for small updates during training. It matters at scale, because large language models have billions of weights, stored as tensors.

Storing them in FP32 = high memory use (about 700 GB for a 175-billion-parameter model like GPT‑3)
Using FP16 (or bfloat16) = cuts that in half

But it's a tradeoff:

More bits per weight = more accuracy, but more memory and slower training and inference
Fewer bits per weight = less memory and faster, but risk of numerical instability

Numeric Precision

Every parameter in an LLM must be stored as a numerical value, and just like storing files or images, the format you choose affects both storage space and speed. Numeric precision refers to the number of bits used to represent each parameter.

Here's a breakdown of common formats used in LLMs:

Format	Bits per parameter	Description
FP32	32	Standard single-precision floating point. Accurate but memory-intensive.
FP16 (half)	16	Half the size of FP32. Good balance between performance and memory, supported by most GPUs.
INT8	8	Integer quantization. Compresses values to 8 bits. Faster inference with a slight accuracy drop.
1‑bit	1	Extreme quantization. Only stores the sign of a parameter (positive/negative). Ultra-compact.
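
To give a feel for what "compresses values to 8 bits" can mean, here is a sketch of one common scheme, symmetric absmax quantization. The table above does not say which INT8 scheme is meant, so treat this as an assumption; the weights are made-up values and the code assumes at least one weight is non-zero.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127  # one float scale per tensor
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.2341, -3.4567, 1.0, -0.05]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q)
print([round(w, 4) for w in restored])
```

Each restored value is within half a quantization step of the original, which is where the "slight accuracy drop" comes from.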

From Bits to Gigabytes: LLM Size Calculation

To compute storage size:

Size (bytes) = (# parameters) × (bits per parameter) / 8
Size (GiB)   = Size (bytes) / 1024³

Example: A 7 billion‑parameter model in FP32:

Size = 7e9 params × 32 bits/param = 224e9 bits
     = 224e9 / 8 = 28e9 bytes ≈ 26.1 GiB

In FP16:

Size = 7e9 × 16 / 8 = 14e9 bytes ≈ 13.0 GiB

And in INT8:

Size = 7e9 × 8 / 8 = 7e9 bytes ≈ 6.5 GiB
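
The same arithmetic fits in a small helper. The function name `model_size_gib` is made up for this sketch, and it counts only the stored weights.

```python
def model_size_gib(n_params: float, bits_per_param: int) -> float:
    """Storage size of the weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    size_bytes = n_params * bits_per_param / 8
    return size_bytes / 1024**3


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("1-bit", 1)]:
    print(f"{name}: {model_size_gib(7e9, bits):.1f} GiB")
```

For a 7 billion-parameter model this prints 26.1, 13.0, 6.5, and 0.8 GiB respectively.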

1-bit Quantization for Dummies

In traditional floating-point representations (like FP32 or even FP16), each weight in a neural network is a real number, possibly something like 0.2341 or -3.4567. These weights take up 32 or 16 bits of memory each.

1-bit quantization, on the other hand, simplifies each weight to only two possible values:

W ∈ {+1, -1}

So instead of 32 bits, each weight is stored using just 1 bit:

1 → +1
0 → -1

This cuts the memory footprint by a factor of 32 compared to FP32 and allows highly efficient computation, especially on specialized hardware. The process is simple. You begin with a full-precision model (FP32 or FP16) and train it as usual, so the weights can take any real value. After training (or during fine-tuning), you apply the 1-bit quantization step by replacing each weight with either +1 or -1, usually based on the sign of the original weight:

Q(w) = sign(w) = +1 if w ≥ 0 else -1
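
As a sketch of how this looks in practice, the code below applies Q(w) to a list of made-up weights and packs the results eight to a byte, using the 1 → +1, 0 → -1 mapping from above. The function names are hypothetical, and real libraries may lay out the bits differently.

```python
def quantize_1bit(weights):
    """Q(w) = +1 if w >= 0 else -1."""
    return [1 if w >= 0 else -1 for w in weights]


def pack_bits(signs):
    """Pack +1/-1 values into bytes, using bit 1 for +1 and bit 0 for -1."""
    packed = bytearray()
    for start in range(0, len(signs), 8):
        chunk = signs[start:start + 8]
        byte = 0
        for s in chunk:
            byte = (byte << 1) | (1 if s == 1 else 0)
        byte <<= 8 - len(chunk)  # left-align a final partial chunk
        packed.append(byte)
    return bytes(packed)


weights = [0.2341, -3.4567, 0.0, 1.5, -0.1, 2.0, -7.0, 0.3]  # made-up values
signs = quantize_1bit(weights)
print(signs)                   # [1, -1, 1, 1, -1, 1, -1, 1]
print(pack_bits(signs).hex())  # b5
```

Eight weights that would take 32 bytes in FP32 now fit in a single byte.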

That compact size makes 1-bit models attractive for memory-constrained environments like microcontrollers and edge devices. Because multiplications can be replaced with simple binary operations like XNOR or addition/subtraction, they also offer major speed and energy-efficiency advantages. However, this comes at a cost: you lose all magnitude information, which degrades model performance unless the model is carefully retrained or fine-tuned.
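
The XNOR trick can be illustrated with a dot product between two ±1 vectors. This is a minimal sketch with made-up vectors and hypothetical helper names: wherever two signs agree the product is +1, and wherever they differ it is -1, so the dot product is the number of agreements minus the number of disagreements.

```python
def pack(signs):
    """Encode +1 as bit 1 and -1 as bit 0 in a single Python int."""
    value = 0
    for s in signs:
        value = (value << 1) | (1 if s == 1 else 0)
    return value


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two +/-1 vectors of length n from their bit encodings."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # 1 wherever the two signs agree
    agreements = bin(xnor).count("1")  # population count
    return 2 * agreements - n          # agreements minus disagreements


a = [1, -1, 1, 1, -1, 1, -1, 1]
b = [1, 1, -1, 1, -1, -1, -1, 1]
print(binary_dot(pack(a), pack(b), len(a)))  # 2
print(sum(x * y for x, y in zip(a, b)))      # 2, the ordinary dot product
```

No multiplications are needed: one XNOR and one bit count replace eight multiply-adds.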

Outro for Dummies

If you found this explanation helpful, please share it with your friends and colleagues! If you have any questions or suggestions for future topics, feel free to ping me on Twitter / X or LinkedIn.
