Knowledge Distillation: How Frontier Intelligence Gets Cheap

Nish · July 3, 2026

In June 2026, a Chinese lab most people had never heard of released a model that changed the mood of the industry. GLM-5.2, from Zhipu AI (Z.ai), shipped with open weights under an MIT license, scored within a few points of Claude Opus 4.8 and GPT-5.5 on the hardest agentic coding benchmarks, and cost roughly a fifth as much to use. The obvious question is how a company with a fraction of the compute budget of the big US labs gets that close. A large part of the answer, both in what Zhipu’s own technical reports document and in what outside analysts suspect, is a technique that is now quietly load-bearing for the entire model ecosystem: knowledge distillation. This post builds distillation from first principles, from Hinton’s 2015 soft targets to the on-policy methods used in today’s frontier pipelines, and then uses that machinery to read GLM-5.2’s training recipe and the uncomfortable economics it implies.

TL;DR

  • Knowledge distillation trains a small student model to match a large teacher model’s full output distribution rather than a single correct answer. The distribution over wrong answers (“dark knowledge”) encodes how the teacher generalizes, and it is a far denser training signal than a label.
  • Supervision density is the cleanest way to compare training signals: RL delivers roughly one scalar per episode, supervised fine-tuning delivers one token id per position, and distillation delivers a full distribution over the vocabulary at every position. Dense signal is why distillation is so cheap.
  • For generative models the design space is two questions: does the student see the teacher’s logits or only its sampled text (white-box vs black-box), and does training run on the teacher’s outputs or the student’s own (off-policy vs on-policy). The field has marched steadily toward white-box, on-policy.
  • The choice of divergence matters. Forward KL makes a limited student smear probability everywhere the teacher goes (mode-covering); reverse KL makes it commit to what it can actually represent (mode-seeking). This single distinction explains MiniLLM, GKD, and the modern on-policy recipe.
  • Distillation is now a standard stage in production pipelines, not a compression afterthought: Gemma trains on teacher distributions instead of one-hot labels, Llama 3.2’s small models are pruned-and-distilled, Qwen3’s small models skip RL entirely for a tenth of the GPU hours, DeepSeek-R1’s distilled 32B beats running RL directly on the same base by more than 25 points on AIME 2024, and GLM-5 distills itself across post-training stages.
  • The economics are asymmetric: the teacher pays the enormous exploration cost of discovering capability once, and distillation re-radiates it at marginal inference cost. That is why frontier labs now ban distilling their outputs, publish detection reports, and research “antidistillation sampling”, and why open models keep landing months behind the frontier at a tenth of the price.

Scope and honesty notes. Charts labelled schematic illustrate a mechanism with made-up numbers; charts with real benchmark or pricing numbers cite their source in the caption. On GLM-5.2 specifically, this post is careful to separate three things: what Zhipu’s technical reports document, what independent analyses measured, and what commentators speculate. They are very different epistemic categories.

The trick at the heart of it: soft targets

Distillation begins with a compression problem. Buciluă, Caruana and Niculescu-Mizil (2006) wanted to deploy an ensemble of models that was too slow to serve, so they trained one small network to mimic the ensemble’s predictions on unlabelled data. Ba and Caruana (2014) showed shallow networks could match deep ones by regressing on the deep model’s logits rather than the hard labels. Then Hinton, Vinyals and Dean (2015) gave the idea its name, its canonical form, and its enduring metaphor.

Their observation: a trained model’s probability distribution over wrong answers is not noise, it is knowledge. When an image classifier sees a BMW, it might assign a small probability to “garbage truck” and a far smaller one to “carrot”, and that ranking encodes a similarity structure the model spent enormous compute discovering. Hinton called this dark knowledge: information that is present in the model’s outputs but invisible in its final answers. A student trained to match those full distributions inherits the structure, not just the answers.

The same story in language-model terms. Ask a model to complete “The capital of Australia is” and the hard label says only Canberra. The teacher’s distribution says more: Sydney is a plausible confusion (largest city, common misconception), Melbourne less so, Auckland is at least the right region, and Paris is nearly ruled out. To expose that structure, Hinton et al. soften the distribution with a temperature \(T\) in the softmax:

\[p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]

where \(z_i\) are the logits. At \(T = 1\) the teacher’s confidence hides the tail; raising \(T\) flattens the distribution and makes the ranking of every wrong answer visible to the student’s loss.

Three horizontal bar charts of next-token probabilities for the prompt 'The capital of Australia is'. The hard label puts probability 1.0 on Canberra. The teacher distribution at temperature 1 puts 0.96 on Canberra with small mass on Sydney and Melbourne. At temperature 3 the distribution softens to 0.53 on Canberra, revealing the full ranking of alternatives.
Three supervision signals for the same next-token prediction (illustrative logits, softmax computed exactly). The hard label carries a single piece of structure: the answer. The teacher's distribution carries a ranking over every alternative, and temperature turns that ranking up loud enough for the student to learn from. This is Hinton's dark knowledge.
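The softening is easy to reproduce. The sketch below is a minimal temperature softmax; the logits are hypothetical values chosen for illustration, not the ones behind the chart, so the printed probabilities will differ from the figure.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for "The capital of Australia is"
tokens = ["Canberra", "Sydney", "Melbourne", "Auckland", "Paris"]
logits = [9.0, 5.5, 4.5, 3.0, 0.5]

for T in (1.0, 3.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: " + ", ".join(f"{t}={p:.3f}" for t, p in zip(tokens, probs)))
```

Raising the temperature never changes the ranking of the tokens; it only moves mass from the top answer into the tail, which is what makes the tail visible to a loss.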

The classic distillation loss mixes two terms: ordinary cross-entropy against the ground-truth label, and a KL divergence pulling the student’s softened distribution toward the teacher’s:

\[\mathcal{L} = (1 - \lambda)\, \mathcal{L}_{\mathrm{CE}}\!\left(y,\; p_s^{(1)}\right) \;+\; \lambda\, T^2\, D_{\mathrm{KL}}\!\left(p_t^{(T)} \,\Vert\, p_s^{(T)}\right)\]

The \(T^2\) factor is not decoration: gradients of the soft term scale as \(1/T^2\), so multiplying by \(T^2\) keeps the two losses on comparable footing as you tune the temperature. Hinton et al. also showed that in the high-temperature limit, and assuming the logits are zero-meaned for each example, matching softened probabilities becomes equivalent to Ba and Caruana’s logit regression, which is a satisfying unification: soft-target matching is logit matching with the tail probabilities weighted sensibly.
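As a concrete reading of the loss above, here is a minimal single-example sketch in plain Python. The function and argument names (`distillation_loss`, `lam`) are my own, the logits are hypothetical, and a real implementation would run batched inside an autodiff framework rather than on lists.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, lam=0.5):
    # Hard-label cross-entropy, computed at T = 1
    ce = -math.log(softmax(student_logits)[label])
    # Forward KL (teacher first) between the two softened distributions
    soft_kl = kl_divergence(softmax(teacher_logits, temperature),
                            softmax(student_logits, temperature))
    # T^2 rescales the soft term so both terms stay comparable as T changes
    return (1 - lam) * ce + lam * temperature ** 2 * soft_kl

# Hypothetical logits over a five-token vocabulary; the true label is index 0
teacher_logits = [9.0, 5.5, 4.5, 3.0, 0.5]
student_logits = [4.0, 3.5, 1.0, 0.5, 0.0]
print(distillation_loss(student_logits, teacher_logits, label=0))
```

With `lam=1.0` the hard label drops out entirely and the student is trained on the teacher’s distribution alone.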

Why a distribution beats a label

Here is the mental model that, more than any equation, explains why distillation keeps winning on cost: count how much supervision one training sequence carries under each paradigm.

Three rows of twelve token cells. The reinforcement learning row has no per-token signal, just a single +1 reward at the end of the sequence. The supervised fine-tuning row has one thin spike above each token, the single correct token id. The distillation row has a small full distribution of bars above every token.
Schematic. The same 12-token completion under three training signals. RL grades the whole episode with roughly one scalar. SFT names the one correct token at each position, all-or-nothing. Distillation hands the student the teacher's probability for every vocabulary entry at every position: the densest signal per unit of compute. The framing follows Thinking Machines' on-policy distillation post.

Reinforcement learning, for all its power, is informationally starved: an entire multi-thousand-token trajectory is graded by approximately one number, and credit assignment must smear that number across every decision that produced it. Supervised fine-tuning is denser, one correct token per position, but the signal is all-or-nothing and says nothing about how wrong each alternative was. Distillation delivers the teacher’s entire conditional distribution at every single position. When Thinking Machines’ on-policy distillation post benchmarked these against each other (more below), the dense signal translated directly into order-of-magnitude compute savings.
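To put rough numbers on the schematic, the sketch below counts how many supervised values one sequence carries under each paradigm. The 12-token length comes from the figure; the vocabulary size is a hypothetical round number, and a raw count of values is an upper bound on signal rather than a measure of information, since the entries of a distribution are far from independent.

```python
def supervision_values_per_sequence(num_tokens, vocab_size):
    """Count the raw supervised values one sequence carries under each paradigm."""
    return {
        "rl": 1,                                   # one scalar reward per episode
        "sft": num_tokens,                         # one correct token id per position
        "distillation": num_tokens * vocab_size,   # a full distribution per position
    }

# 12 tokens as in the schematic; 100,000 is an assumed vocabulary size
counts = supervision_values_per_sequence(num_tokens=12, vocab_size=100_000)
for paradigm, n in counts.items():
    print(f"{paradigm:>12}: {n:,} values")
```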

The density argument has a second, subtler payoff: soft targets act like a learned, input-dependent form of label smoothing, regularizing the student and letting it generalize from far less data. The Gemma 2 paper leaned on exactly this to break a data bottleneck, training its small models on teacher distributions for “more than 50x the compute-optimal quantity” of tokens: when each token carries a whole distribution, you can profitably train on quantities of data that would be wasteful with one-hot labels. In their words, distillation “replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model”, and it “simulate[s] training beyond the number of available tokens”.

A note on what “knowledge” gets transferred. The literature distinguishes response-based knowledge (the output distributions we have discussed), feature-based knowledge (matching intermediate activations, introduced by FitNets), and relation-based knowledge (matching relationships between layers or samples). The taxonomy runs through every survey since Gou et al. (2021). For LLMs, response-based transfer dominates in practice, so this post stays there; the surveys in the Sources cover the rest of the map.

The first wave: making BERT small

The first large-scale demonstration in NLP was DistilBERT (Sanh et al., 2019), which distilled BERT during pre-training with a triple loss: the usual masked-language-modelling loss, a soft-target distillation loss against BERT’s predictions, and a cosine loss aligning the student’s hidden states with the teacher’s. The result became the reference talking point for the whole field: 40% fewer parameters, 60% faster, keeping about 97% of BERT’s language-understanding performance on GLUE. TinyBERT pushed further down the same road (7.5x smaller, 9.4x faster, above 96% of BERT-base) by distilling attention matrices and hidden states at both pre-training and task-specific stages.

These results established the practical case: for encoder models doing classification-style work, you could buy most of the capability at a fraction of the serving cost. But the models everyone wanted to shrink next were generators, and generation breaks the simple recipe in an instructive way.

Distilling a generator is a different problem

A classifier’s output is one distribution; matching it is one KL term. A language model’s output is a distribution over sequences, factored one token at a time, and each token’s distribution is conditioned on everything before it. That raises two questions that did not exist for BERT, and the answers organize the entire modern literature:

  1. What does the teacher expose? If you have the teacher’s weights (or at least its per-token probabilities), you can match full distributions at every position: white-box distillation. If the teacher is behind an API that returns only sampled text, all you can do is imitate its outputs: black-box distillation.
  2. Whose outputs does training run on? You can train the student on sequences the teacher produced (off-policy), or on sequences the student itself generates, with the teacher grading them (on-policy). Off-policy is simple and cacheable but trains the student on states it may never reach on its own; on-policy trains it exactly where it will actually operate.
A two-by-two grid organizing distillation methods. The horizontal axis is what the teacher exposes, from full next-token distributions (white-box) to only sampled text (black-box). The vertical axis is whose outputs training runs on, from the teacher's (off-policy) to the student's own (on-policy). Classic soft-target methods and Gemma sit in the off-policy white-box quadrant; Alpaca, Orca, and the DeepSeek-R1 distills sit in the off-policy black-box quadrant; MiniLLM, GKD, Qwen3, and GLM-5 sit in the on-policy white-box quadrant; teacher-as-judge feedback sits in the on-policy black-box quadrant.
The design space of LLM distillation. Two questions, four quadrants, and a clear historical drift: the field started top-left with classic soft targets, exploded top-right during the 2023 imitation era, and its current frontier is bottom-left, where the teacher grades every token of the student's own attempts.

The earliest answer for generation, sequence-level knowledge distillation (Kim and Rush, 2016), picked the simplest cell: run the teacher’s beam search, treat its outputs as ground truth, and fine-tune the student on them. For machine translation this was startlingly effective (a 10x faster student losing little quality, and largely removing the need for beam search at inference). Remember this method, because “train on the teacher’s generated text” is exactly what the LLM world would rediscover at scale seven years later.

The imitation era and its cold shower

When ChatGPT and GPT-4 arrived with no open weights, the only distillation available was black-box, and 2023 became the year of imitation. Self-Instruct showed a model could bootstrap tens of thousands of instruction-following examples from a handful of seeds. Alpaca fine-tuned LLaMA-7B on 52K examples generated by text-davinci-003 for under \$600 all-in. Vicuna trained on ~70K shared ChatGPT conversations for about \$300 and, judged by GPT-4, reached “90% of ChatGPT quality”. Orca sharpened the recipe by imitating GPT-4’s explanation traces, its step-by-step reasoning, rather than just its answers, and beat Vicuna by over 100% on BigBench-Hard.

Then came the correction. Gudibande et al., “The False Promise of Imitating Proprietary LLMs” (2023) evaluated imitation models carefully and found that crowdworkers rated them highly while capability benchmarks barely moved: the students had learned the teacher’s style, its confident formatting and pleasant tone, without closing the underlying capability gap. Their conclusion was blunt: imitation models “close little to none of the gap” on tasks not heavily represented in the imitation data, and the highest-leverage move remains building better base models.

This is the imitation gap, and it is the single most important caveat in black-box distillation. A student can only absorb what its base model has the capacity to represent and what the data actually exercises. Mimicking surface form is easy; mimicking the computation that produced it is not.

Reasoning traces changed the answer

Two years later, the field found the conditions under which imitation does work, and the demonstration was impossible to ignore. When DeepSeek released R1 (January 2025), the paper included six small dense models fine-tuned on 800K samples curated from R1’s own outputs, mostly long chain-of-thought reasoning traces. No reinforcement learning on the students at all, just supervised fine-tuning on teacher generations, black-box style.

The numbers rearranged people’s intuitions about what small models could do:

Horizontal bar chart of AIME 2024 pass at 1 scores. GPT-4o scores 9.3, QwQ-32B-Preview 44.0, reinforcement learning run directly on Qwen-32B 47.0, o1-mini 63.6, R1-Distill-Qwen-32B 72.6, and the DeepSeek-R1 teacher 79.8. An annotation notes that on the same 32B base model, distillation beats direct RL by 25.6 points.
AIME 2024 pass@1, from the DeepSeek-R1 paper (arXiv:2501.12948, Tables 5 and 6). The teal and rust bars share the same Qwen-32B base model: distilling the 671B teacher's reasoning traces (teal) beats running the R1 RL recipe directly on the 32B (rust) by 25.6 points, and lands above o1-mini. Even the 7B distill (55.5, not shown) beats GPT-4o and the 32B QwQ-Preview.

The paper’s own ablation is the most quotable result in modern distillation: applying the R1-Zero RL recipe directly to Qwen-32B produced 47.0 on AIME 2024, while simply fine-tuning the same base model on the big teacher’s traces produced 72.6. For a small model, inheriting the reasoning patterns a huge model discovered through expensive RL beats trying to discover them yourself. Discovery is the expensive part; the artifact of discovery is cheap to copy.

Why did this succeed where the 2023 imitation wave disappointed? Three conditions had changed, and they mark the boundary of the imitation gap rather than a contradiction of it. The traces were reasoning, not style: long, checkable chains whose structure transfers. The base models (Qwen 2.5 generation) were strong enough to represent that structure. And the data was curated at scale by the teacher’s own training team, not scraped from chat logs. Around the same time, DeepSeek also ran the arrow sideways: the V3 report describes distilling reasoning from R1-series models back into their general-purpose flagship, “maintain[ing] control over the output style and length” while absorbing R1’s verification and reflection patterns.

The R1 distills are, formally, black-box sequence-level KD, the Kim and Rush recipe from 2016 applied to reasoning. The lesson is not that fancy losses are unnecessary; it is that what data the teacher generates matters at least as much as how the student matches it. The next section is about the losses anyway, because the biggest efficiency wins of 2025-26 came from fixing them.

Getting the divergence right: forward vs reverse KL

Classic distillation minimizes the forward KL divergence \(D_{\mathrm{KL}}(p \,\Vert\, q_\theta)\), teacher \(p\) first. Expand the definition and you see its personality: the expectation runs over the teacher’s distribution, so the loss punishes the student wherever the teacher has mass and the student does not. If a small student cannot represent everything a huge teacher does, forward KL forces it to spread probability over all of it anyway. The result is mode-covering behaviour: the student places mass in regions it cannot actually model well, which for a language model means fluent-looking sequences the student has no business generating, a recipe for degenerate or hallucinated text.

Flip the arguments and the personality flips. Reverse KL, \(D_{\mathrm{KL}}(q_\theta \,\Vert\, p)\), takes its expectation over the student’s distribution: the student is punished for putting mass where the teacher has little, but pays nothing for ignoring teacher modes it cannot reach. The result is mode-seeking: commit to the subset of the teacher’s behaviour you can represent, and represent it faithfully.
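Written out as expectations (a restatement of the standard definitions, with \(p\) the teacher and \(q_\theta\) the student as above), the two divergences differ only in who does the sampling:

\[D_{\mathrm{KL}}(p \,\Vert\, q_\theta) = \mathbb{E}_{y \sim p}\!\left[\log p(y) - \log q_\theta(y)\right], \qquad D_{\mathrm{KL}}(q_\theta \,\Vert\, p) = \mathbb{E}_{y \sim q_\theta}\!\left[\log q_\theta(y) - \log p(y)\right]\]

In the first, any \(y\) the teacher samples where \(q_\theta(y) \approx 0\) costs a large \(-\log q_\theta(y)\). In the second, a teacher mode the student never samples contributes nothing to the expectation.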

Two panels showing a bimodal teacher distribution in grey. In the left panel, the single-Gaussian student that minimizes forward KL spans both modes with a broad curve that puts substantial mass in the valley between them where the teacher has almost none. In the right panel, the student that minimizes reverse KL locks tightly onto the larger mode and ignores the other.
A teacher too complex for its student, computed exactly: the teacher (grey) is a two-component mixture, the student a single Gaussian. Minimizing forward KL (left, indigo) produces the moment-matching, mode-covering fit: it hedges across both modes and confidently occupies the valley where the teacher puts almost no probability. Minimizing reverse KL (right, teal) produces the mode-seeking fit: it commits to one mode it can represent faithfully. For a small LLM imitating a frontier teacher, the left failure mode looks like fluent hallucination.
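The two fits can be reproduced numerically. The sketch below is a brute-force version under assumptions of my own: a hypothetical two-component teacher (weights 0.6 and 0.4, means at -2.5 and 2.5), a Riemann sum on a fixed grid, and a coarse grid search over the student’s mean and standard deviation. It is not the code behind the figure.

```python
import math

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Hypothetical bimodal teacher: mixture weights, means, shared standard deviation
WEIGHTS, MEANS, TEACHER_SIGMA = (0.6, 0.4), (-2.5, 2.5), 0.6

DX = 0.05
XS = [-10 + DX * i for i in range(401)]  # integration grid on [-10, 10]
P = [sum(w * math.exp(log_normal_pdf(x, m, TEACHER_SIGMA))
         for w, m in zip(WEIGHTS, MEANS)) for x in XS]
LOG_P = [math.log(p) for p in P]

def forward_kl(mu, sigma):
    # D_KL(teacher || student): the expectation runs over the teacher
    return DX * sum(p * (lp - log_normal_pdf(x, mu, sigma))
                    for x, p, lp in zip(XS, P, LOG_P))

def reverse_kl(mu, sigma):
    # D_KL(student || teacher): the expectation runs over the student
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

def grid_search(objective):
    # Student means in [-4, 4] (step 0.5), standard deviations in [0.3, 4.0] (step 0.1)
    candidates = [(-4 + 0.5 * i, 0.3 + 0.1 * j)
                  for i in range(17) for j in range(38)]
    return min(candidates, key=lambda c: objective(*c))

mu_f, sigma_f = grid_search(forward_kl)
mu_r, sigma_r = grid_search(reverse_kl)
print(f"forward KL fit: mu={mu_f:.2f}, sigma={sigma_f:.2f}")
print(f"reverse KL fit: mu={mu_r:.2f}, sigma={sigma_r:.2f}")
```

The forward-KL fit lands on the mixture’s overall mean with a wide standard deviation, which is the moment-matching solution; the reverse-KL fit sits on the heavier mode with roughly the teacher component’s width.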

This one distinction organizes the modern white-box literature. MiniLLM (Gu et al., 2023) made the case that generation should distill with reverse KL precisely to stop the student “overestimating the low-probability regions of the teacher distribution”, optimizing it with policy-gradient machinery across students from 120M to 13B. GKD (Agarwal et al., 2023) added the second fix: students trained purely on teacher text suffer a train-inference mismatch (they never see their own mistakes during training, then condition on them at inference, the classic exposure-bias problem), so GKD trains on student-generated sequences with the teacher providing per-token feedback, and generalizes the divergence to interpolations like the Jensen-Shannon divergence to tune the mode-seeking/covering trade-off. DistiLLM (2024) made the on-policy loop cheap with a skew-KL loss and adaptive reuse of student generations (up to 4.3x faster than prior KD methods), and DistiLLM-2 (2025) added a contrastive twist that pushes teacher responses up while pushing the student’s own weaker generations down. Speculative KD (2024) even runs the two policies interleaved, speculative-decoding style: the student proposes tokens, the teacher vetoes and replaces the ones it ranks poorly.
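For intuition about the interpolation, here is a sketch of a generalized Jensen-Shannon divergence on discrete distributions. The parameterization, a `beta`-weighted mixture with both KL terms measured against it, is how I understand the GKD formulation; treat the exact form as an assumption and check it against the paper before relying on it.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(teacher, student, beta=0.5):
    """beta * KL(teacher || m) + (1 - beta) * KL(student || m),
    where m = beta * teacher + (1 - beta) * student."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    mixture = [beta * t + (1 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1 - beta) * kl(student, mixture)

# Hypothetical next-token distributions over a four-token vocabulary
teacher = [0.70, 0.20, 0.08, 0.02]
student = [0.40, 0.40, 0.10, 0.10]
for beta in (0.1, 0.5, 0.9):
    print(beta, generalized_jsd(teacher, student, beta))
```

At `beta = 0.5` this is the ordinary symmetric Jensen-Shannon divergence; moving `beta` toward either end shifts the weighting between the teacher-side and student-side terms.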

On-policy distillation: RL’s relevance, distillation’s density

The synthesis of this whole line arrived in a very readable October 2025 post from Thinking Machines Lab, under the plainest possible name: on-policy distillation. The recipe in one breath: sample rollouts from the student, and instead of an RL reward at the end, have the teacher grade every token with a reverse KL penalty:

\[\mathcal{L}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\!\left[\, \sum_{t} D_{\mathrm{KL}}\!\Big( \pi_\theta(\cdot \mid y_{<t}, x) \,\Vert\, \pi_{\mathrm{T}}(\cdot \mid y_{<t}, x) \Big) \right]\]

where \(\pi_\theta\) is the student and \(\pi_{\mathrm{T}}\) the teacher. It is worth pausing on how neatly this combines the two halves of the post so far. From RL it takes on-policy relevance: the student is corrected on its own trajectories, including its own mistakes, exactly where GKD showed off-policy distillation goes wrong. From distillation it takes density: a full distribution of supervision at every token, where RL would deliver one scalar per episode.
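The objective can be sketched on a toy problem. Everything below is hypothetical: the four-token vocabulary, the `make_toy_policy` helper, and the bias values are stand-ins for real models. The sketch only evaluates the loss on one student rollout; a real implementation would backpropagate it through the student with an autodiff framework.

```python
import math
import random

VOCAB = 4  # toy vocabulary size (assumption)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_toy_policy(bias):
    """Hypothetical policy: prefers token (last + 1) % VOCAB with strength `bias`."""
    def policy(prefix):
        last = prefix[-1] if prefix else 0
        logits = [bias if tok == (last + 1) % VOCAB else 0.0 for tok in range(VOCAB)]
        return softmax(logits)
    return policy

def reverse_kl(student_probs, teacher_probs):
    # D_KL(student || teacher) at a single position
    return sum(q * math.log(q / p)
               for q, p in zip(student_probs, teacher_probs) if q > 0)

def on_policy_distill_loss(student, teacher, prompt, max_new_tokens, rng):
    """Sample a rollout from the student and sum the per-token reverse KL."""
    prefix = list(prompt)
    total = 0.0
    for _ in range(max_new_tokens):
        q = student(prefix)   # student's next-token distribution
        p = teacher(prefix)   # teacher grades the same student-visited state
        total += reverse_kl(q, p)
        token = rng.choices(range(VOCAB), weights=q, k=1)[0]  # on-policy sample
        prefix.append(token)
    return total, prefix[len(prompt):]

teacher = make_toy_policy(bias=4.0)
student = make_toy_policy(bias=1.0)
loss, rollout = on_policy_distill_loss(student, teacher, [0], 8, random.Random(0))
print(f"rollout={rollout}, summed reverse KL={loss:.4f}")
```

The prefix the teacher is queried on is always one the student produced, which is the on-policy half of the recipe; the per-position KL is the dense half.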

The numbers they report are the reason this post exists. Distilling Qwen3-32B into Qwen3-8B-Base for math reasoning, on-policy distillation reached about 70% on AIME’24 in roughly 150 training steps, where the published RL baseline burned 17,920 GPU-hours to reach 67.6%; they estimate roughly 10x less compute than RL for the same performance, and cite the Qwen team reaching a higher score (74.4) at one tenth of RL’s cost. Against SFT the claimed savings are 9x when the dataset already exists and about 30x when you would have had to generate it. The Qwen3 report independently corroborates the scale of the win: their “strong-to-weak” pipeline (off-policy teacher data, then on-policy logit distillation) replaced the full four-stage post-training for every small model at about 1/10th of the GPU hours, with a detail worth savoring: distillation improved not just pass@1 but pass@64, expanding the student’s exploration headroom, where their RL runs improved pass@1 while leaving pass@64 flat. Distillation taught the small models things RL could not.

One more turn of the screw, and it is very 2026: the teacher does not even have to be a different model. In a blackboard clip that circulated widely this June, Sasha Rush lays out targeted on-policy self-distillation for agent training: when a rollout contains a localized mistake (the canonical example is calling a tool that does not exist), you inject a corrective hint into the context just before the error, run a forward pass, and use that hint-conditioned distribution as the teacher for the unhinted model, which is the same weights. The model becomes its own teacher, conditioned on privileged information, and the mistaken action gets down-weighted with surgical, per-token precision instead of waiting for a sparse trajectory-level reward to assign blame. Distillation, which began as a compression trick, has quietly become a general-purpose credit assignment primitive.

Where distillation enters real pipelines

Put the pieces together and you can read any modern model card and spot the distillation stages. There are three standard entry points:

Schematic of a student training pipeline with three stages: pre-training, supervised fine-tuning, and reinforcement learning or post-training. A teacher model at the top feeds three numbered arrows into the stages: one, logits as pre-training targets, used by Gemma 2 and 3 and Llama 3.2; two, teacher-generated SFT data, used by the DeepSeek-R1 distills, Phi, and the Alpaca lineage; three, on-policy distillation where the teacher grades the student's own samples, used by Qwen3, GLM-5 cross-stage distillation, and Thinking Machines.
The three places distillation enters a modern LLM pipeline, with production models that document each. The teacher can be a bigger model in the same family, an earlier flagship, an external frontier model, or (for entry point 3 run as self-distillation) a previous checkpoint of the student itself.

Entry point 1: logits as pre-training targets. Gemma 2 trained its 2B and 9B models “with knowledge distillation instead of next token prediction” from the start, and Gemma 3 made it the default for the whole family. Llama 3.2’s 1B and 3B were built by structurally pruning Llama 3.1 8B and then using logits from the 8B and 70B “as token-level targets” during pre-training. NVIDIA’s Minitron work quantified the economics of the pruning-plus-distillation combo: compressing a 15B model to 8B and 4B with up to 40x fewer training tokens per model than training from scratch, and better quality than from-scratch training at the same token budget. Google’s Gemini 1.5 report states plainly that 1.5 Flash was “online distilled” from 1.5 Pro: the fast tier of a frontier product line is a distill by design.

Entry point 2: teacher-generated SFT data. The R1 distills, above, are the flagship example, and the Phi family is the same idea taken to its logical extreme: Phi-4 is trained overwhelmingly on synthetic data generated by GPT-4, and the report argues its curation pipeline “go[es] beyond distillation”, with the student surpassing its teacher on STEM QA benchmarks. Whether you call curated synthetic data “distillation” is partly semantics; the knowledge flow is teacher to student either way, and the survey literature (Xu et al., 2024) treats data-generation as the workhorse channel of modern LLM distillation, splitting it into labeling, expansion, curation, feature, feedback, and self-knowledge.

Entry point 3: on-policy distillation in post-training. Qwen3 and the Thinking Machines recipe, above. This is also where GLM enters the story.

Case study: GLM-5.2 and the price of the frontier

Now we can read the release that motivated this post with the right vocabulary. First the verified facts. GLM-5.2 (Zhipu AI / Z.ai, announced June 13, 2026, open weights the following week) is a sparse mixture-of-experts model: the GLM-5 family’s technical report describes 744B total parameters with 40B active per token, 256 experts, pre-trained on 28.5T tokens (Artificial Analysis lists 753B for the 5.2 refresh; Zhipu’s own docs do not state a figure). It has a 1M-token context window and ships under an MIT license on Hugging Face.

The reception numbers are what made it news. On Artificial Analysis’s Intelligence Index it scored 51, the highest of any open-weights model ever measured, behind Claude Fable 5 (60), Claude Opus 4.8 (56), and GPT-5.5 (55). On Zhipu’s reported agentic-coding results it trails Opus 4.8 by four to seven points (Terminal-Bench 2.1: 81.0 vs 85.0; SWE-bench Pro: 62.1 vs 69.2), and Arena’s Code Arena Frontend leaderboard placed it second overall, ahead of Opus 4.7 Thinking. Meanwhile the API costs \$1.40 per million input tokens and \$4.40 per million output tokens, against \$5 in and \$25 out for Opus and \$5 in and \$30 out for GPT-5.5.

Two bar charts. Left: Artificial Analysis Intelligence Index scores with Claude Fable 5 at 60, Claude Opus 4.8 at 56, GPT-5.5 at 55, GLM-5.2 highlighted in teal at 51, Gemini 3.5 Flash at 50, DeepSeek V4 Pro at 44, and Kimi K2.6 at 43. Right: blended API price per million tokens, GPT-5.5 at 11.25 dollars, Claude Opus 4.8 at 10 dollars, and GLM-5.2 at 2.15 dollars.
Capability and price, June 2026 readings. Left: Artificial Analysis Intelligence Index v4.1 (artificialanalysis.ai); GLM-5.2 is the top open-weights score ever recorded on the index. Right: API list price per million tokens, blended 3:1 input:output (openrouter.ai, anthropic.com, openai.com list prices). Roughly one fifth of frontier pricing, though note GLM-5.2 tends to spend more output tokens per task, which claws back some of the gap in practice.
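The blended figures in the right-hand chart follow directly from the list prices quoted above. A short check, assuming the 3:1 input:output weighting the caption states:

```python
def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per million tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts

# (input, output) list prices in dollars per million tokens, as quoted in this post
LIST_PRICES = {
    "GLM-5.2": (1.40, 4.40),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}

blended = {name: round(blended_price(i, o), 2) for name, (i, o) in LIST_PRICES.items()}
for name, price in blended.items():
    ratio = blended["GLM-5.2"] / price
    print(f"{name}: ${price:.2f} blended, GLM-5.2 is {ratio:.2f}x of this")
```

The ratio moves with the assumed mix: an output-heavy agentic workload weights the \$4.40 vs \$25 gap more heavily than the \$1.40 vs \$5 one.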

The honest caveats, before the interesting part: independent reviewers consistently find GLM-5.2 strongest exactly where its training focused (agentic coding, tool use) and clearly weaker than the frontier on creative writing, taste, and general judgment; skeptics call it benchmaxxed. “Frontier-adjacent on the benchmarks that matter to developers, at a fraction of the price” is the defensible summary, not “frontier parity”.

What the GLM reports actually document

Zhipu’s own technical reports are unusually explicit that distillation is a core organ of their training pipeline, in the self-distillation sense: the teachers are their own models.

The GLM-4.5 report (2025) describes training separate expert models for reasoning, agentic tasks, and general chat, each with its own SFT and RL, and then a unified SFT stage that “distills the capabilities of different expert models into one hybrid reasoning generalist”. It also documents iterative self-distillation: when an RL run plateaus, the RL-trained model’s own best responses replace the original cold-start data, building a stronger SFT base for the next RL round. Capability ratchets upward with the model repeatedly serving as its own teacher.

The GLM-5 report (2026) adds the on-policy version, which they call on-policy cross-stage distillation. GLM-5’s post-training runs through sequential stages (SFT, then reasoning RL, then agentic RL, then general RL), and the standard failure of such pipelines is that each stage erodes the skills the previous one built. Their fix: the final checkpoint of each earlier stage acts as a teacher during later stages, applying a per-token distillation loss on the student’s own rollouts, exactly the on-policy machinery from the previous section, pointed backwards at the model’s own history.

Four boxes in sequence labelled SFT, Reasoning RL, Agentic RL, and General RL, connected left to right. Teal arcs above the boxes run from each earlier stage to each later stage, indicating that each stage's final checkpoint serves as a teacher for the stages after it via a per-token distillation loss on the student's own rollouts.
GLM-5's on-policy cross-stage distillation, redrawn from the description in section 3.5 of the GLM-5 technical report (arXiv:2602.15763). Sequential post-training normally forgets: reasoning skill decays during agentic RL, agentic skill decays during general RL. Making earlier checkpoints teachers for later stages turns distillation into a memory mechanism.

Notice what this means for the “how do they do it cheaply” question even before any controversy: distillation lets Zhipu reuse every expensive capability it has ever trained rather than re-deriving it, and lets one big RL investment radiate through the whole model family. That is documented, uncontroversial, and probably the larger share of the answer.

The question everyone asks

Did GLM-5.2 also distill from Claude and GPT? Here the epistemic categories matter, so let me keep them separated.

What is official: nothing. The GLM-5 report describes only internal teachers. Zhipu has made no statement on external distillation and did not respond to press inquiries about it. And notably, when Anthropic published its February 2026 report on industrial-scale distillation attacks, the labs it named were DeepSeek, Moonshot AI, and MiniMax; Zhipu was conspicuously absent from the list.

What is measured: suggestive correlations, from one source. Graphistry’s cybersecurity evaluation computed answer-agreement statistics and found GLM-5.2’s responses correlate with GPT-5.5 (Cohen’s kappa 0.80) and Opus 4.8 (0.76) more than those two frontier models correlate with each other (0.63), and wrote that this “may” indicate distillation of both. Correlation with two models that themselves train on overlapping data is evidence, not proof.
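For readers who have not met the statistic, here is a minimal sketch of Cohen's kappa, assuming each model gives one categorical answer per question on a shared question set. This is the textbook definition, not Graphistry's actual pipeline (which is not described here), and the function name and toy answers are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each rater drew
    # independently from its own marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters gave the same single label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Toy data (hypothetical): two models answering four multiple-choice items.
model_x = ["A", "B", "A", "B"]
model_y = ["A", "B", "A", "A"]
kappa_xy = cohens_kappa(model_x, model_y)  # 3/4 observed, 1/2 by chance
```

Kappa subtracts the agreement two models would reach by chance given their own answer frequencies, so 0 means chance-level agreement and 1 means identical answers. It does not subtract agreement that comes from both models simply being right, which is one reason a high kappa between strong models is suggestive rather than conclusive.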

What is speculated: quite a lot, with confidence exceeding evidence. A widely shared take from a Google engineer asserted the interesting version: that frontier-model distillation, if it happened, would have mattered mainly as a cold-start for RL, solving the expensive exploration bootstrap, after which GLM’s own RL pipeline did the climbing. Commentators also point at the model’s Claude-like voice and response patterns. None of this is sourced to Zhipu, and stylistic resemblance is weak evidence in a world where every lab trains on overlapping web data and synthetic corpora.

My read: the technically literate version of the GLM story does not depend on resolving the accusation. Everything this post has covered says a lab in Zhipu’s position would rationally distill something: its own experts (documented), its own earlier checkpoints (documented), and, if it were willing to violate terms of service, the frontier models whose outputs are one API call away (unproven here). The mechanism is the same in every case, and the economics are the point: whoever pays the exploration cost, distillation makes the resulting capability nearly free to copy. Which brings us to the fight.

The economics and the politics of distilling the frontier

The asymmetry is stark when you line up the numbers this post has already cited. Discovering frontier capability costs enormous exploration: R1-scale RL runs, or the 17,920 GPU-hours the Qwen RL baseline spent on one 8B model’s math skills. Copying capability costs a rounding error: 800K curated samples, or a tenth of the GPU hours, or $600 of API calls in Alpaca’s case. The teacher’s forward passes are the only toll, and for an open-weights teacher there is no toll at all. Every capable model that publishes weights or serves an API is, involuntarily, a teacher.

The industry noticed. OpenAI’s terms of use prohibit using outputs “to develop models that compete with OpenAI”, and Anthropic’s terms carry the same clause; the irony that Alpaca, Vicuna, and the whole 2023 open-model boom sat in exactly this gray zone is standard commentary by now. The enforcement era began in January 2025, when OpenAI said it had evidence that groups linked to DeepSeek had extracted data via its API in the run-up to R1, and escalated in February 2026 with Anthropic’s report: over 16 million Claude exchanges harvested through roughly 24,000 fraudulent accounts, targeting agentic reasoning, tool-use traces, and chain-of-thought data, with named labs and described countermeasures.

The countermeasures are becoming a research field of their own. On detection, Lee et al. (2025) quantify distillation across public models via identity-confusion probes (models claiming to be GPT-4) and response homogenization, finding substantial distillation signatures in most open and closed LLMs. On prevention, antidistillation sampling (Savani et al., 2025) has the teacher perturb its own sampling distribution just enough to poison the traces for any student trained on them, trading a tunable amount of output quality for “distillability”, with a small wave of 2026 follow-ups on watermarks and trace-rewriting. Whether a lab can serve useful outputs that cannot be learned from is, I think, one of the most interesting open questions in the field; information that is useful to a human is, by most definitions, informative to a model.

Step back and the strategic picture is uncomfortable for everyone in a different way. For frontier labs, capability now diffuses by default: months, not years, after a breakthrough ships behind an API, its distilled shadow is on Hugging Face. For open-model labs, distillation (of themselves or others) is the only economically sane way to stay adjacent to a frontier they cannot afford to explore. And for the rest of us, the boring practical consequence of this fight is the one visible in the pricing chart above: near-frontier capability keeps getting cheaper, faster than any hardware curve alone can explain.

Where distillation fails

A technique this load-bearing deserves its failure modes listed plainly.

  • The student rarely exceeds the teacher. Distillation redistributes capability; it does not create it. The exceptions prove the rule: Phi-4 beats its teacher on some benchmarks via aggressive data curation, and self-distillation loops like GLM’s ratchet upward only because RL keeps injecting new capability between rounds. Someone still has to pay the exploration bill.
  • The imitation gap never fully closed. Gudibande et al.’s warning still binds black-box distillation: without a strong enough base model and data that exercises real capability, students learn style. The R1 distills escaped it by meeting both conditions, not by repealing it.
  • The capacity gap. A teacher can be too strong. Apple and Oxford’s distillation scaling laws (2025) find student performance is a predictable function of student size, token budget, and teacher quality, that a too-capable teacher actively hurts a small student, and that at large enough compute budgets plain supervised training overtakes distillation. Distillation is a regime, not a law of nature.
  • Diversity loss. Mode-seeking objectives buy faithfulness at the cost of coverage: a reverse-KL student is trained to abandon parts of the teacher’s distribution. Stacked across an industry, this compounds into homogenization, which is exactly what Lee et al. measure ecosystem-wide. If many models are distilled shadows of a few teachers, their failure modes correlate, and “several independent models agree” quietly stops being evidence.
  • The legal and ethical overhang. Distilling an API you agreed not to distill is a contract violation whatever the loss function, and the norms for open-weights teachers (whose licenses mostly permit it) versus closed APIs (whose terms mostly ban it) are still being fought out in reports and, eventually, courtrooms.

A minimal implementation

The core of white-box distillation is small enough to read in one breath. Per batch, with teacher and student logits over the same tokens:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Forward-KL soft-target loss of Hinton et al. (2015).

    Both logit tensors: (batch, seq_len, vocab). The T*T factor keeps
    gradient scale comparable across temperatures.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)
    return T * T * kl.mean()

Swapping in the modern on-policy recipe changes the data flow far more than the mathematics: sample completions from the student, run the teacher’s forward pass over those same tokens, and minimize the per-token reverse KL. In F.kl_div terms the two distributions exchange roles, so the teacher’s log-probabilities become the input, the student’s probabilities become the target, and the expectation is taken under the student. The engineering cost lives in serving the teacher efficiently during training; the loss stays a few lines.
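A minimal sketch of that reverse-KL loss follows, assuming the student and teacher share a vocabulary and that both sets of logits were computed over the same student-sampled tokens. The function name and the mask convention are my own assumptions for illustration, not any particular lab's implementation, and I write the KL out explicitly rather than through F.kl_div so the direction is unambiguous.

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits, mask=None):
    """Per-token reverse KL, KL(student || teacher), averaged over tokens.

    Both logit tensors: (batch, seq_len, vocab), scored on the student's
    own sampled completions. mask (assumed convention): (batch, seq_len),
    1 for completion tokens, 0 for prompt or padding positions.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    # The teacher is a fixed target, so no gradient flows into it.
    log_p_teacher = F.log_softmax(teacher_logits.detach(), dim=-1)
    # Expectation under the student: sum_v p_s(v) * (log p_s(v) - log p_t(v)).
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)
    if mask is None:
        return kl.mean()
    mask = mask.to(kl.dtype)
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with random logits standing in for real model outputs.
torch.manual_seed(0)
student_logits = torch.randn(2, 5, 11, requires_grad=True)
teacher_logits = torch.randn(2, 5, 11)
loss = reverse_kl_loss(student_logits, teacher_logits)
loss.backward()  # gradients reach only the student
```

In a real loop the sampled tokens are treated as fixed data: generate from the student without tracking gradients, then run both models forward over those tokens and apply the loss. That simplification is an assumption of this sketch; published on-policy methods differ in how they handle the sampling step.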

What to take away

  1. Distillation is dark knowledge transfer. The teacher’s full distribution over answers, especially the wrong ones, encodes how it generalizes. Matching distributions transfers structure that labels cannot carry.
  2. Count bits per training sequence. RL: about one scalar per episode. SFT: one token id per position. Distillation: a distribution over the vocabulary at every position. Most of distillation’s cost advantages are this one observation wearing different clothes.
  3. Two questions locate any method. Logits or samples? Teacher’s outputs or student’s? The field’s trajectory has been a steady march toward white-box and on-policy, where the teacher grades every token of the student’s own attempts.
  4. The divergence is a personality choice. Forward KL hedges and hallucinates; reverse KL commits and narrows. When the student is much smaller than the teacher, which is the whole point, this choice is not a detail.
  5. Distillation is now infrastructure, and it cuts both ways. It is how Gemma, Llama 3.2, Qwen3, and GLM ship capable models cheaply, how labs preserve skills across training stages, and simultaneously the mechanism by which frontier capability leaks to anyone with API access. GLM-5.2 is what the equilibrium currently looks like: openly self-distilled, suspected of more, and a sixth of the price of the models it chases.

The metaphor distillation deserves is not theft, though theft happens, and not compression, though compression is where it started. It is teaching: the expensive thing is figuring something out for the first time, and the second-cheapest thing in the world is learning it from someone who already knows. The cheapest, as every student eventually discovers, is being handed the full answer key, one probability distribution per token.

Sources and further reading

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {Knowledge Distillation: How Frontier Intelligence Gets Cheap},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Knowledge-Distillation/}},
  note = {[Online; accessed]},
  year = {2026}
}
