Small language models use the same transformer architecture as frontier models, but every design decision is built around efficiency. Here's how tokenization, attention, quantization, and on-device inference actually work under the hood.

How Do Small Language Models Work?

Small Language Models (SLMs) are transformer-based neural networks built to understand and generate text on hardware most people and organizations actually have: a laptop, a phone, an on-premises server. Parameter counts fall roughly in the 1M to 7B range, memory footprints are smaller, and inference is faster, but the underlying machinery is the same as GPT-4 or Claude.

What differs is how deliberately every design decision is made to stay within tight resource budgets. This article walks through the full stack: how raw text becomes numbers, what happens inside the transformer, and how SLMs are trained, compressed, and run on device.

If you're looking for a broader overview covering SLM architectures, use cases, deployment patterns, enterprise adoption, compression techniques, limitations, and the top models in 2026, read our complete Small Language Models (SLMs) Comprehensive Guide.

The Foundation: What Parameters Actually Are

Every time a model reads text and produces output, it runs the input through matrix multiplications using parameters, the learned numeric weights stored inside the model. These weights are what the model ‘knows,’ not as a database of facts, but as distributed patterns learned from billions of examples of language.

To put it in perspective:

  • A 3.8B parameter model like Phi-3 Mini stores its weights in roughly 7.6 GB at FP16 precision (2 bytes per parameter)
  • Quantized to INT4 (4-bit), that drops to approximately 2 GB, small enough to run on a smartphone
  • GPT-4 is estimated at more than 1T parameters in a Mixture-of-Experts setup, an entirely different order of magnitude
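The arithmetic behind these figures is simple enough to check by hand. The sketch below assumes 1 GB = 10^9 bytes and counts only the raw weights (no activations, KV cache, or quantization metadata); the function name is made up for illustration.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 3.8B-parameter model at two precisions.
for precision in ("fp16", "int4"):
    print(precision, round(weight_memory_gb(3.8e9, precision), 2), "GB")
```

Real INT4 formats also store per-group scale factors, so files on disk tend to be a little larger than the raw figure.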

Step 1: Tokenization: Text Becomes Numbers

Before any computation, text must be converted to integers. Models don't see characters or words; they see tokens, which are subword units drawn from a fixed vocabulary.

How it works

Modern SLMs typically use Byte Pair Encoding (BPE), often through a library such as SentencePiece. The algorithm starts with individual characters or bytes and repeatedly merges the most frequent adjacent pair until it reaches a target vocabulary size, typically 32,000 to 128,000 tokens.

| Input Text | Tokens | Token IDs |
| --- | --- | --- |
| "running" | ["run", "ning"] | [1258, 3076] |
| "unbelievable" | ["un", "believ", "able"] | [993, 9091, 481] |
| "SLM" | ["SL", "M"] | [8200, 44] |
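To make the merge loop concrete, here is a toy BPE trainer. It is a minimal sketch: it works on characters rather than bytes, splits on whitespace, and breaks frequency ties by first occurrence, all of which are simplifications compared with production tokenizers. The function names and the tiny corpus are illustrative only.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(w) for w in corpus.split())  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, vocab = train_bpe("run running runner run", num_merges=2)
print(merges)
```

Each learned merge adds one entry to the vocabulary, which is why the target vocabulary size is effectively a cap on the number of merges.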

The tokenizer is fixed at training time and can't be changed at inference. A few things follow from this:

  • The vocabulary defines what the model can natively represent
  • Rare words (technical terms, names, non-English text) get split into more tokens, consuming more context window
  • Token count ≠ word count. "OpenAI" might be 1 token; "Schrödinger" might be 3

Why smaller vocabulary constrains SLMs

SLMs often use smaller vocabularies than frontier LLMs to reduce parameter count. A 32K vocabulary with a 2048-dimensional embedding layer already costs about 128 MB in FP16. Frontier models with 128K+ vocab entries pay proportionally more, a real budget consideration when you're trying to fit everything on a phone.
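The 128 MB figure can be reproduced with a short calculation. This sketch assumes "32K" means 32,768 entries and reports binary megabytes (MiB); with exactly 32,000 entries and decimal megabytes the result is about 131 MB instead. The function name is illustrative.

```python
def embedding_table_mib(vocab_size: int, hidden_dim: int, bytes_per_value: int = 2) -> float:
    """Size of a vocab_size x hidden_dim embedding table in MiB (FP16 by default)."""
    return vocab_size * hidden_dim * bytes_per_value / 2**20

print(embedding_table_mib(32_768, 2048))    # 32K vocabulary
print(embedding_table_mib(128_000, 2048))   # 128K vocabulary, same width
```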

For a deeper comparison of deployment trade-offs, reasoning capabilities, latency, privacy, and enterprise adoption patterns, read our detailed breakdown on SLMs vs LLMs.

Step 2: Embedding: Tokens Become Vectors

Once tokenized, each integer ID maps to a dense vector called an embedding. Think of it as a lookup table: token ID 1258 retrieves a 2048-dimensional vector of floating-point numbers.

These vectors are learned during training so that semantically similar tokens end up geometrically close in the embedding space. "king" and "queen" are closer to each other than either is to "carburetor."

Positional encoding: teaching the model about order

Transformers process tokens in parallel, so they have no built-in sense of sequence order. Positional encodings inject this information. Three approaches are common:

  • Sinusoidal encodings (original 2017 transformer): fixed mathematical functions based on position
  • RoPE (Rotary Position Embedding): used by Llama, Phi-3, Mistral; encodes position by rotating query and key vectors, and generalizes well to sequences longer than those seen in training
  • ALiBi: adds a linear distance bias to attention scores, useful for long-context generalization without explicit position embeddings

Most modern SLMs use RoPE because it handles context extension more gracefully than sinusoidal encodings.
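A small NumPy sketch shows what "rotating query and key vectors" means in practice. It pairs up adjacent dimensions (one common convention; some implementations pair the first half of the vector with the second half instead) and uses the usual base of 10,000. Treat it as an illustration, not any particular model's exact implementation.

```python
import numpy as np

def rope(x, position: int, base: float = 10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # standard 2-D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Same relative offset (3 positions apart) at two different absolute positions.
print(rope(q, 5) @ rope(k, 2), rope(q, 13) @ rope(k, 10))
```

The two printed scores match because the query-key dot product after rotation depends only on the distance between the positions, which is the property that makes RoPE a relative position scheme.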

Step 3: Self-Attention: How the Model Reads Context

Self-attention is how a model figures out which words in a sentence matter most for understanding any given word.

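Take the sentence: "The bank by the river was steep." To understand what "bank" means, you need to connect it to "river." Self-attention lets the model weigh every word in the context at the same time and decide how relevant each one is. In decoder-only models, which includes most SLMs, a causal mask limits each word to itself and the words before it, so it is the representations of later words that combine "bank" with "river."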

For each word, the model computes three things:

  • Query: What am I looking for?
  • Key: What do I represent?
  • Value: What information do I carry?

These are used to produce a score between every pair of words: the dot product of one word's query with another word's key. Higher score = more relevant. The scores are normalized with a softmax and then used to blend the values together into a new, context-aware representation of each word.
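The whole computation fits in a few lines of NumPy. This is a single-head sketch with random, untrained projection matrices (`w_q`, `w_k`, `w_v` are placeholder names), so the weights it prints are meaningless; the point is the shape of the calculation. The optional causal mask is what decoder-only models apply so a token cannot look ahead.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, causal=False):
    """x: (seq_len, d_model). Returns (context vectors, attention weights)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # one score per (query, key) pair
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # hide later positions
    weights = softmax(scores)                 # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))               # 5 tokens, 8-dim embeddings
w_q, w_k, w_v = (rng.standard_normal((8, 4)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v, causal=True)
print(out.shape, weights.sum(axis=-1))
```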

Multi-Head Attention

Rather than doing this once, the model runs attention in parallel across multiple independent "heads." Each head picks up on different kinds of relationships (grammar, meaning, position, etc.). The results get merged at the end.

How SLMs Stay Small and Fast

This is where small models are more than shrunken versions of big ones. The choices below are what let them run on a phone.

Grouped Query Attention (GQA)

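In standard multi-head attention, every head stores its own key and value vectors for each token already processed. This store, called the KV cache, grows with every token the model reads or generates. With many heads and long text, the cache gets large, and reading it slows generation down.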

GQA fixes this by having multiple query heads share a single set of key/value vectors. You get most of the quality with much less memory.

Llama 3.2 3B, for example, pairs 24 query heads with 8 key/value heads, and Mistral 7B pairs 32 query heads with 8.
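As a rough sketch of the mechanism, the function below runs attention with 32 query heads sharing 8 key/value heads, so each group of 4 query heads reads the same keys and values. The head size and sequence length are arbitrary, and the causal mask is omitted for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads
    # Only the n_kv_heads copies are cached; they are expanded on the fly.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((32, 6, 16))   # 32 query heads
k = rng.standard_normal((8, 6, 16))    # 8 key heads
v = rng.standard_normal((8, 6, 16))    # 8 value heads
print(grouped_query_attention(q, k, v).shape)
print("KV cache reduction:", q.shape[0] // k.shape[0], "x")
```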

Sliding Window Attention

Normally, every word attends to every other word, so the cost grows quadratically with the length of the text. Sliding window attention limits each word to a nearby window of words (e.g., the previous 4,096 tokens). This makes the cost grow linearly instead.

Information from outside the window still reaches the model indirectly: each layer passes context a bit further, so after many layers a word can be influenced by text far outside its own window.
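In code, the window is just a different attention mask. The sketch below builds a causal sliding-window mask and counts how many positions each token may attend to; the tiny sizes are for readability only.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where query i may attend to key j: j is not in the future
    and lies within the last `window` positions (including i itself)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
print(mask.sum(axis=1))   # allowed keys per query, capped at the window size
```

Because each row has at most `window` allowed positions, the total work is bounded by `seq_len * window` instead of `seq_len ** 2`.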

SwiGLU Activations

After the attention step, each layer runs a small neural network called a feed-forward network. Modern SLMs use an activation function called SwiGLU instead of the older ReLU. It learns more complex patterns with the same number of parameters, making each layer more expressive.
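For reference, a SwiGLU feed-forward block can be sketched as follows. The weight names (`w_gate`, `w_up`, `w_down`) and sizes are placeholders, and bias terms are omitted, which matches common practice but is an assumption here.

```python
import numpy as np

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gate one projection of x with the SiLU of another, then project back down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.standard_normal((5, d_model))
w_gate = rng.standard_normal((d_model, d_hidden))
w_up = rng.standard_normal((d_model, d_hidden))
w_down = rng.standard_normal((d_hidden, d_model))
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)
```

SwiGLU uses three weight matrices where a ReLU feed-forward block uses two, so models typically shrink the hidden size to keep the parameter count comparable.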

Shared Embeddings

The model uses the same table of word representations for both input and output. Since both are doing essentially the same job, sharing them cuts the parameter count significantly with very little quality loss.

Stacking Layers

One attention block is not a model. Dozens of them get stacked on top of each other. Early layers pick up surface-level patterns; later layers capture more abstract meaning.

Each layer adds its output back to its input (called a residual connection), so information from earlier layers is carried forward rather than overwritten, and gradients can flow through a deep stack during training.

| Model | Parameters | Layers |
| --- | --- | --- |
| TinyLlama | 1.1B | 22 |
| Phi-3 Mini | 3.8B | 32 |
| Llama 3.2 3B | 3B | 28 |
| Mistral 7B | 7B | 32 |

How Text Gets Generated

After all the layers, the model converts its final internal state into a probability distribution over every token in its vocabulary. It picks the next token, appends it to the sequence, and repeats, one token at a time. This is called autoregressive decoding.

Ways to Sample the Next Word

| Strategy | What it does |
| --- | --- |
| Greedy | Always picks the most likely token. Predictable but repetitive. |
| Temperature | Scales the logits before the softmax. Lower = more focused, higher = more creative. |
| Top-k | Only samples from the k most likely tokens. |
| Top-p (nucleus) | Samples from the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9). More adaptive than top-k. |

Most production systems use temperature and top-p together.
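A combined temperature and top-p sampler might look like the sketch below. The default values (0.8 and 0.9) are illustrative rather than recommendations, and the function assumes `temperature > 0`; greedy decoding is the limiting case as temperature approaches zero.

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Sample a token id using temperature scaling followed by nucleus filtering."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum() # renormalize over the nucleus
    return int(rng.choice(keep, p=kept_probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.5, 0.3, -1.0, -3.0]
print([sample_top_p(logits, rng=rng) for _ in range(10)])
```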

The KV Cache

Without caching, each of the 500 steps needed to generate 500 tokens would reprocess the entire sequence so far, recomputing the keys and values of every earlier token from scratch. The KV cache stores those keys and values so only the newest token needs to be processed at each step. On-device, this cache is kept small by limiting it to a sliding window and storing values at lower precision.
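The idea can be sketched for a single attention head. The `KVCache` class and its optional `window` argument are hypothetical names for illustration; a real implementation keeps one such cache per layer and per key/value head, usually in preallocated tensors.

```python
from collections import deque
import numpy as np

def attend(q, keys, values):
    """Attention output for one query vector over stacked keys and values."""
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

class KVCache:
    """Stores keys/values of past tokens; `window` optionally caps how many are kept."""
    def __init__(self, window=None):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)      # oldest entry is dropped once the window is full
        self.values.append(v)

    def attend(self, q):
        return attend(q, np.stack(list(self.keys)), np.stack(list(self.values)))

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((6, d))

cache = KVCache()
for x in tokens:                 # each step projects only the newest token
    cache.append(x @ w_k, x @ w_v)
    out = cache.attend(x @ w_q)

# Same result as recomputing every key and value from scratch at the last step.
full = attend(tokens[-1] @ w_q, tokens @ w_k, tokens @ w_v)
print(np.allclose(out, full))
```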

How SLMs Are Trained

Phase 1: Pretraining

The model starts with random weights and reads through an enormous amount of text, learning to predict the next word at every step. The quality of the data matters more than the quantity.

Microsoft's Phi series showed that a 1.3B model trained on carefully curated, textbook-style data can outperform some 7B models trained on raw web crawls.

Phase 2: Instruction Tuning

The pretrained model can predict text, but it doesn't know how to follow instructions or hold a conversation. This phase trains it on examples formatted as user/assistant exchanges so it learns how to actually be helpful.

For small models, this is often done with LoRA, a technique that freezes the base model and trains small low-rank matrices added alongside its weight matrices, so you don't have to retrain everything. A LoRA fine-tune of a 3B model can run on a single consumer GPU in hours.
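A forward-pass-only sketch of a LoRA layer shows where the savings come from. The class name, rank, and `alpha` scaling value are illustrative assumptions, and no training loop is included; libraries such as PEFT handle that part in practice.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update A @ B (forward pass only)."""
    def __init__(self, weight, rank=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_in, d_out = weight.shape
        self.weight = weight                           # frozen base weights
        self.a = rng.standard_normal((d_in, rank)) * 0.01
        self.b = np.zeros((rank, d_out))               # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight + (x @ self.a @ self.b) * self.scale

    def trainable_params(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(1)
w = rng.standard_normal((1024, 1024))
layer = LoRALinear(w, rank=8)
print("full layer params:", w.size)
print("LoRA trainable params:", layer.trainable_params())
```

With rank 8 on a 1024 x 1024 layer, the adapter trains 16,384 values instead of roughly a million, which is what makes single-GPU fine-tuning feasible.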

Phase 3: Alignment

The final phase refines the model to be more helpful, honest, and safe. Two common approaches:

  • RLHF: Human raters score responses; the model is trained to produce higher-rated outputs.
  • DPO: Simpler alternative. The model is trained directly on pairs of good vs. bad responses without needing a separate scoring model. Increasingly preferred for small models.
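For a single preference pair, the DPO objective reduces to a few lines. The sketch below assumes you already have summed log-probabilities of each response under the model being trained and under a frozen reference copy (the standard DPO setup); `beta = 0.1` is a commonly used value, not a requirement.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the model raises the probability of the preferred response relative to the reference and lowers that of the rejected one, with no reward model in the loop.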

Running on a Phone

Quantization

During training, model weights are stored as 32-bit or 16-bit numbers. At inference, those can be compressed down to 4-bit integers (INT4). A 3.8B model that takes 7.6 GB at 16-bit fits in about 1.9 GB at INT4. Quality loss is typically 1–3% on benchmarks.
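To show what compressing to 4-bit integers involves, here is a minimal symmetric, group-wise quantizer. The group size of 32 and the use of an `int8` array as the container are simplifications for illustration; production formats pack two 4-bit values per byte and often use more elaborate schemes.

```python
import numpy as np

def quantize_int4(weights, group_size=32):
    """Map each group of weights to integers in [-8, 7] with one scale per group."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                      # avoid dividing by zero for all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    """Recover approximate floating-point weights."""
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale, w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The rounding error per weight is at most half a scale step, which is why smaller groups (and therefore tighter scales) trade a little extra storage for better accuracy.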

On devices with Snapdragon chips, for example, the model can run on a dedicated NPU (Neural Processing Unit) rather than the CPU. The NPU is optimized for the matrix operations transformers rely on and uses far less power.

Speculative Decoding

Generation is slow because the model reads all its weights from memory for every single token. Speculative decoding speeds this up:

  1. A small, fast draft model generates several tokens quickly
  2. The larger model verifies all of them in one pass
  3. Correct tokens are kept; the first wrong one gets corrected

With greedy decoding, the output is identical to what the big model would produce alone (with sampling, it follows the same distribution), but with far fewer expensive passes.
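The three steps can be sketched for the greedy case. The two "models" here are toy functions that map a token list to the next token, invented purely for illustration; in a real system the verification loop is a single batched forward pass of the larger model rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes k tokens.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2-3. The target model checks each proposal; stop at the first mismatch.
    accepted, ctx = [], list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)    # replace the wrong token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))    # all accepted: one extra token for free
    return accepted

def generate(target_next, draft_next, prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[:len(prompt) + n_tokens]

# Toy stand-ins: the draft agrees with the target most of the time.
def toy_target(ctx):
    return (ctx[-1] * 3 + 1) % 7

def toy_draft(ctx):
    return toy_target(ctx) if len(ctx) % 3 else 0

print(generate(toy_target, toy_draft, prompt=[1], n_tokens=10))
```

Every token that ends up in the output is one the target model itself would have chosen, so the draft model only affects speed, never the result.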

Why On-Device Inference Is Different from Cloud Inference

  1. Memory bandwidth matters more than compute speed. Generating each token requires reading the model's weights from memory. INT4 quantization cuts that by 4x compared to FP16, directly reducing how long each token takes.
  2. Batch size is always 1 on-device. Cloud servers process thousands of users at once, so they optimize for throughput. On a phone, you're the only user, so the priority is low latency per token.
  3. The KV cache grows with context length. At very long contexts, the cache can exceed 1 GB even for a 3B model. Sliding window caching and INT4 KV quantization keep this manageable.
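The cache size is easy to estimate. The sketch below uses a configuration resembling a 3B model with 28 layers and 8 key/value heads; the head dimension of 128 and the 16,384-token context are assumptions chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2.0):
    """Keys and values (factor of 2) for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

config = dict(n_layers=28, n_kv_heads=8, head_dim=128, seq_len=16_384)
fp16 = kv_cache_bytes(**config)
int4 = kv_cache_bytes(**config, bytes_per_value=0.5)
print(f"FP16 cache: {fp16 / 2**30:.2f} GiB")
print(f"INT4 cache: {int4 / 2**30:.2f} GiB")
```

Capping `seq_len` with a sliding window bounds the cache no matter how long the conversation runs, and the two techniques can be combined.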

If you want to understand how enterprises actually adapt small models for production workflows, we covered that separately in Fine-Tuning SLMs for Enterprise Use Cases.

Summary

SLMs use the same basic architecture as large models - attention, feed-forward layers, autoregressive decoding - but every design decision is made with efficiency in mind: fewer layers, smaller dimensions, shared KV heads, windowed attention, and INT4 quantization for inference.

Training follows the same three phases (pretraining, instruction tuning, alignment), but data quality is the real differentiator. Curated and synthetic data gets more out of fewer parameters than raw web data ever could.

For well-scoped tasks, a small model running locally is faster, cheaper, and keeps your data on the device.

Frequently Asked Questions

Do small language models use the same architecture as GPT-4 or Claude?

Yes, the core architecture is the same. SLMs use transformers, attention mechanisms, and autoregressive decoding just like frontier models. The difference is that every design decision, from layer count to memory-sharing across attention heads, is optimized for tighter compute and memory constraints.

What actually makes a small language model "small"?

Primarily the parameter count, which usually ranges from 1 million to 7 billion parameters. But smallness is also about practicality: lower memory usage, faster inference, and the ability to run on laptops, phones, or edge devices instead of requiring large data center GPUs.

How much quality do you lose when quantizing a model to INT4?

For most workloads, the quality drop is typically around 1 to 3 percent on benchmarks. A 3.8B model that needs around 7.6GB in 16-bit precision can fit into roughly 1.9GB with INT4 quantization. For tasks like classification, extraction, and summarization, the efficiency gain is often worth the tradeoff.

Why does training data quality matter more for small models than large ones?

Large models can absorb noisy data because they have enormous capacity. Small models cannot. Research like Microsoft's Phi series showed that carefully curated textbook-style datasets allowed a 1.3B model to outperform some larger 7B models trained on raw internet-scale data.

Can a small language model actually run on a smartphone?

Yes. Quantized models around 3B to 4B parameters can fit within the memory limits of modern smartphones. Mobile chips such as Snapdragon processors include NPUs designed for transformer workloads, making on-device inference practical without relying on cloud infrastructure.
