Mixed Effects Models: Partial Pooling for Data That Comes in Groups

Nish · September 3, 2025

⏱️ 45 min read

Table of Contents

Most datasets worth modeling have structure that the standard regression toolkit quietly assumes away: customers belong to markets, students to schools, measurements to the patients they were taken from, satisfaction scores to the support teams that earned them. Rows within a group resemble each other, and the GLM post flagged exactly this as the independence violation that “shrinks the reported uncertainty dishonestly”, promising that mixed models were the escape hatch. This post is that escape hatch, in full: what a mixed effects model (also sold under the names multilevel model, hierarchical model, and random effects model, all the same object) actually is, the shrinkage mathematics that makes it work, how to fit and read one in Python, and how the same idea extends to hierarchical logistic regression, where an integral with no closed form and a subtle question, “the effect on whom, exactly?”, make everything more interesting.

TL;DR

  • Grouped data breaks the independence assumption, and the damage lands on your uncertainty, not your predictions. Two customers from the same market share unmeasured market-level causes, so a model that treats 5,000 clustered rows as 5,000 independent rows reports standard errors that are too small, sometimes several times too small for predictors that vary at the group level.
  • The two obvious fixes are both extreme. Ignoring groups (complete pooling) pretends every group is the same; fitting each group separately (no pooling) hands tiny groups estimates too noisy to use. A mixed model is the principled middle: partial pooling.
  • A mixed model is a regression whose group-level coefficients get a model of their own. Group offsets \(u_j\) are drawn from a shared distribution \(\mathcal{N}(0, \tau^2)\), and the data decides how much groups differ (\(\tau^2\)) versus how noisy individuals are (\(\sigma^2\)).
  • The output is shrinkage with exactly the right strength. A group’s estimate is its raw mean pulled toward the grand mean by a factor \(\lambda_j = \tau^2 / (\tau^2 + \sigma^2 / n_j)\): small groups get pulled hard, large groups barely move. This is the same mathematics as ridge regression applied to group dummies, and the penalty is learned from the data rather than cross-validated.
  • REML exists because maximum likelihood underestimates variances, for the same reason dividing by \(n\) instead of \(n-1\) does; it is the default in every serious implementation, and the ML/REML distinction matters when you compare models.
  • The logistic version (a GLMM) has two honest slopes, not one. The coefficient describes the effect within a group; the population-averaged effect is systematically flatter, by a factor of roughly \(\sqrt{1 + 0.346\,\tau^2}\), because averaging S-curves flattens them. Knowing which one you are reporting is most of the skill of using these models.

The assumption everyone violates

Every model in the GLM family, and least squares before it, makes an assumption so standard it is easy to read past: the observations are independent. Formally, once you know a row’s features, no other row tells you anything more about its outcome. And it fails, routinely, in the most ordinary datasets. The failure has a name, clustering: rows come in groups that share unmeasured causes. Customers in the same market share a local economy, a local competitor, a local sales team. Test scores from the same school share teachers and demographics. Ten blood-pressure readings from one patient are not ten patients.

Here is the running example for this post. A subscription company has 18 customer support teams, and we want to know how support wait time affects customer satisfaction (a 0 to 100 CSAT score). Teams are unbalanced: the smallest has 6 surveyed customers, the largest 230, for 1,434 rows total. Teams genuinely differ, some are simply better than others in ways the dataset does not measure, and that is precisely what makes the rows non-independent: knowing that one customer of team 3 is unusually happy makes it more likely the next customer of team 3 is too, because they share whatever makes team 3 good.

Why care? Because correlated rows carry less information than independent ones, and a model that ignores this reports uncertainty it does not have. The intuition is worth making quantitative, and the section on intraclass correlation below does, but the headline from our simulated example is stark enough: for a predictor that varies at the team level (say, whether the team offers callbacks), ordinary least squares on these 1,434 rows reports a standard error of 0.53 when the honest number is 2.26. Everything OLS learns about a team-level predictor comes from effectively 18 data points, not 1,434, and it has no way to know that. False confidence of this kind is how “significant” effects appear out of thin air, get published or shipped, and fail to replicate.

Pseudoreplication is the classical name for this mistake, coined in ecology where treating 100 measurements of 5 animals as 100 animals was once endemic. The replication crisis literature is full of its fingerprints.

Three answers to grouped data

Facing groups, you have three options, and it clarifies everything to see all three fitted to the same data before any theory. The two extremes are what most people try first.

Complete pooling ignores the groups: fit one regression to all 1,434 rows, one intercept, one slope. It uses all the data, but it answers a question about a fictional average customer of a fictional average team, and its residuals hide the correlated team structure the previous section warned about (with the standard-error consequences we just quantified).

No pooling goes to the other extreme: fit a separate regression per team, 18 intercepts and 18 slopes, no sharing. Now teams can differ, but each team’s fit stands entirely on its own data. For the 230-customer team that is fine. For the 6-customer team it is a disaster in slow motion: its private regression has 6 points and 2 parameters, and any unlucky draw produces confident nonsense.

Partial pooling, the mixed model’s answer, treats the teams as related but not identical: every team gets its own line, but the lines are pulled toward each other, with a strength the data itself chooses. The figure shows all three on our dataset, with the 6-customer team highlighted.

Three side-by-side scatter plots of customer satisfaction against support wait time, all showing the same 1,434 customers in gray with one 6-customer team highlighted in red. The left panel shows a single teal pooled regression line. The middle panel shows 18 separate purple per-team lines, with the highlighted team's line sloping upward with slope +0.71 per minute, opposite to every other team. The right panel shows 18 ember partial-pooling lines from a mixed model, all sloping downward, with the highlighted team's slope corrected to -0.85 per minute.
The pooling spectrum on one simulated dataset. Left: complete pooling fits one teal line and ignores team structure. Middle: no pooling fits each team separately, and the 6-customer team (red) gets an absurd upward slope of +0.71 CSAT per minute of waiting, a pure artifact of 6 noisy points. Right: the mixed model partially pools, and the same team's line is pulled to -0.85, in line with the evidence from all 1,434 customers. Every line in all three panels is genuinely fitted.

Look at the middle panel’s red line. Taken at face value it says this team’s customers get happier the longer they wait, by 0.71 points per minute. Nobody believes that; it is 6 points of noise wearing a trend costume. But notice what your skepticism is actually doing: you are using what you know about the other 17 teams to discount this one’s data. Partial pooling is that instinct, made mathematical, and the right panel shows the result: the mixed model looked at the same 6 points, weighed them against everything it learned across teams, and concluded the slope is about \(-0.85\). The rest of the post is the machinery behind that correction.

The model: giving coefficients a distribution

Start from the regression the GLM post would write for this problem, satisfaction as a linear function of wait time, and let each team \(j\) shift the line up or down by its own offset:

\[y_{ij} = \underbrace{\beta_0 + \beta_1 x_{ij}}_{\textcolor{#0f8ba0}{\text{fixed: shared by everyone}}} + \underbrace{u_j}_{\textcolor{#b93d20}{\text{the team's offset}}} + \underbrace{\varepsilon_{ij}}_{\textcolor{#7b8794}{\text{individual noise}}}\]

where \(i\) indexes customers within team \(j\). So far this is just a regression with a dummy variable per team. The move that makes it a mixed effects model is one extra line, a model for the offsets themselves:

\[u_j \sim \mathcal{N}(0, \textcolor{#b45309}{\tau^2}), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \textcolor{#7b8794}{\sigma^2})\]

Read it generatively, top down. First nature draws a team: a single number \(u_j\) from a bell curve with spread \(\tau\), capturing everything durable about that team (skill, tooling, morale) as a shift of its baseline. Then, within the team, each customer’s outcome scatters around the team’s own line with spread \(\sigma\). Two sources of randomness, two levels, which is why the multilevel-modeling literature and the mixed-models literature are describing the same object:

Two-level schematic. The top panel shows a gold bell curve of team effects centered at zero with four sampled team offsets marked from -7.4 to +8.2. Dashed arrows lead from each sampled offset down to four small scatter panels, each showing one team's customers scattered around that team's ember regression line, with the shared company-wide line dashed in teal for reference.
The generative story, drawn. Level 2 (top): each team's offset is one draw from a shared bell curve with spread tau. Level 1 (bottom): given its offset, a team's customers scatter around its shifted line with spread sigma. The dashed teal line is the same fixed line in every panel; the ember line is the team's own.

The vocabulary follows from the picture. The fixed effects \(\beta_0, \beta_1\) are ordinary coefficients: unknown constants, shared by all groups, the thing you would report to a stakeholder (“each minute of waiting costs about 0.9 points of satisfaction”). The random effects \(u_j\) are not parameters in the same sense: they are unobserved draws from a distribution, and what the model actually estimates is that distribution’s variance \(\tau^2\), alongside \(\sigma^2\). The pair \((\tau^2, \sigma^2)\) are called variance components, and they are the model’s real discovery: how much of the scatter is between teams versus within them. “Mixed” just means both kinds of effect appear in one model.

For \(k\) features and general random-effects structure, the whole family fits in one matrix equation, worth seeing once because every textbook and every solver speaks it:

\[y = X\beta + Zu + \varepsilon, \qquad u \sim \mathcal{N}(0, G), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)\]

\(X\) is the usual feature matrix with coefficient vector \(\beta\); \(Z\) is an indicator-like matrix routing each row to its group’s effects; and \(G\) is the covariance matrix of the random effects (for our random-intercept model, simply \(\tau^2\) times an identity). This formulation, standardized by Laird and Ware in 1982, is the one statsmodels, lme4, and SAS all implement.

One consequence deserves its own subsection, because it is the quantitative answer to “how bad is ignoring groups?”.

Fold the random effect into the noise and the mixed model becomes an ordinary regression with correlated errors: \(y_{ij} = \beta_0 + \beta_1 x_{ij} + (u_j + \varepsilon_{ij})\), where the combined error term \(u_j + \varepsilon_{ij}\) is shared-plus-private. The correlation between two customers of the same team follows in three lines. (This is the post’s first real derivation, so a housekeeping note: every variance and covariance rule these derivations use, plus the handful of Gaussian facts the later ones need, is stated and proved in the appendix at the bottom. Nothing in this post rests on an unproven “it can be shown”.) Take customers \(a\) and \(b\) of team \(j\), and recall covariance ignores constants, so only the random pieces matter (the covariance rules doing the work here are tool 1):

\[\begin{aligned} \mathrm{Cov}(y_{aj},\, y_{bj}) &= \mathrm{Cov}(u_j + \varepsilon_{aj},\; u_j + \varepsilon_{bj}) && \textcolor{#7b8794}{\small\text{the fixed part is constant; covariance only sees the random part}} \\[4pt] &= \mathrm{Cov}(u_j, u_j) + 0 + 0 + 0 && \textcolor{#7b8794}{\small\text{the noise terms are independent of } u_j \text{ and of each other}} \\[4pt] &= \tau^2 && \textcolor{#7b8794}{\small\text{what they share is exactly the team effect's variance}} \end{aligned}\]

Each row’s total variance is \(\tau^2 + \sigma^2\), so the correlation between two same-team rows, the intraclass correlation (ICC), is

\[\rho = \frac{\tau^2}{\tau^2 + \sigma^2}\]

which doubles as the answer to “what fraction of the outcome’s variance lives at the team level?”. An ICC of 0 says groups do not matter; an ICC of 1 says individuals are clones of their group. Customer-analytics and education datasets commonly live between 0.05 and 0.3.

The cost of ignoring \(\rho\) has a classical closed form. For \(J\) groups of \(m\) rows each, compare the variance of the grand mean under clustering to what independence would give. The clustered version: the mean of group \(j\) is \(u_j + \bar\varepsilon_j\) around the fixed part, with variance \(\tau^2 + \sigma^2/m\), and averaging \(J\) independent groups divides that by \(J\). The independent version with the same total row count \(Jm\) would be \((\tau^2 + \sigma^2)/(Jm)\). The ratio, after one line of algebra with \(\rho\), is the design effect:

\[\mathrm{DEFF} = \frac{(\tau^2 + \sigma^2/m)\,/\,J}{(\tau^2 + \sigma^2)\,/\,(Jm)} = \frac{m\tau^2 + \sigma^2}{\tau^2 + \sigma^2} = 1 + (m - 1)\,\rho\]

Your effective sample size is \(n\) divided by that. The formula is worth staring at: the penalty scales with cluster size, not just correlation, which is why a “small” ICC is not a license to ignore clustering.

Line chart of effective sample size as a share of n against intraclass correlation from 0 to 0.5, with three teal curves for cluster sizes 5, 20, and 100. All curves fall steeply; the cluster-of-100 curve collapses fastest. A red marker at ICC 0.10 on the cluster-of-100 curve is annotated: 5,000 customers in clusters of 100 with ICC 0.10 carry the information of about 460 independent ones.
The design effect, 1 + (m-1) rho, converts nominal rows into effective rows. Even a modest ICC of 0.10 destroys most of the information in large clusters: 5,000 customers in 50 markets of 100 are worth about 460 independent customers for anything that varies at the market level. Curves are the exact formula, not simulation.

This figure is the honest accounting behind the standard-error scandal in the opening section, with one refinement worth remembering: the full design-effect penalty applies to predictors that vary between groups (team policies, market attributes, school funding), because for those the groups are your real sample size. Predictors that vary freely within groups are hit far less; in our simulated example the wait-time slope’s standard error barely moved between OLS and the mixed model, while the intercept’s more than doubled and the team-level predictor’s quadrupled. Where your predictor lives in the hierarchy determines how much the hierarchy costs you.

Shrinkage: what partial pooling actually computes

Now for the payoff, the mechanism that fixed the red team in the opening figure. Suppose for a moment the variance components are known and ask: having seen team \(j\)’s data, what is our best guess of its offset \(u_j\)? To keep the algebra clean, work with the model’s intercept-only version (a regression slope changes nothing except that “team mean” becomes “team mean after removing the fixed part”). Let \(D_j\) be team \(j\)’s observed average deviation from the grand mean, based on its \(n_j\) customers. The generative story says

\[D_j = u_j + \bar\varepsilon_j, \qquad u_j \sim \mathcal{N}(0, \tau^2), \qquad \bar\varepsilon_j \sim \mathcal{N}\!\Big(0, \frac{\sigma^2}{n_j}\Big)\]

the truth plus the average of \(n_j\) noise terms (averaging \(n_j\) independent noises divides their variance by \(n_j\), which is why the noise term shrinks with team size; that is tool 1 again). Both pieces are Gaussian, so \((u_j, D_j)\) is jointly Gaussian, and for jointly Gaussian, zero-mean variables the conditional expectation is a known one-liner, proved as tool 2 in the toolbox: \(\mathbb{E}[A \mid B] = \frac{\mathrm{Cov}(A, B)}{\mathrm{Var}(B)}\, B\). In words: your best guess of the unseen \(A\) is the observed \(B\), scaled down by the fraction of \(B\)’s variation that \(A\) actually accounts for. Apply it:

\[\begin{aligned} \mathbb{E}[\,u_j \mid D_j\,] &= \frac{\mathrm{Cov}(u_j,\, D_j)}{\mathrm{Var}(D_j)}\; D_j && \textcolor{#7b8794}{\small\text{the bivariate-Gaussian conditional mean}} \\[4pt] &= \frac{\tau^2}{\tau^2 + \sigma^2 / n_j}\; D_j && \textcolor{#7b8794}{\small\text{Cov}(u_j, u_j + \bar\varepsilon_j) = \tau^2 \text{; the denominator adds the noise-mean variance}} \\[4pt] &= \lambda_j\, D_j, \qquad \lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j} && \textcolor{#7b8794}{\small\text{a shrinkage factor between 0 and 1}} \end{aligned}\]

This estimate is the celebrated BLUP (best linear unbiased predictor, “predictor” rather than “estimator” because \(u_j\) is a random draw, not a parameter), and \(\lambda_j\) is doing something genuinely clever. It is a credibility weight: the ratio of signal variance to signal-plus-noise variance for this team’s mean. Its two limits tell the story. As \(n_j \to \infty\) the noise term \(\sigma^2/n_j\) vanishes, \(\lambda_j \to 1\), and the team keeps its own raw mean: pooling backs off exactly when a team has earned trust. As \(n_j \to 0\), \(\lambda_j \to 0\) and the team’s estimate collapses to the grand mean: with no evidence, be the prior. In between, every team gets pulled toward the center by an amount calibrated to its sample size. The same formula rearranges into a precision-weighted average, the form Bayesians will recognize as a posterior mean: the team’s raw mean weighted by its data precision \(n_j/\sigma^2\), averaged with the grand mean weighted by the prior precision \(1/\tau^2\) (precision is just one over variance, a measure of how much a number deserves to be trusted).

Scatter plot of team intercept deviations against team size on a log scale from 6 to 230. Each team shows two markers: an open purple circle for its raw no-pooling mean and a filled ember dot for its partially pooled BLUP, connected by a gold arrow pointing toward the dashed grand-mean line at zero. Small teams show long arrows; the largest teams show almost none. The shrinkage formula lambda equals tau squared over tau squared plus sigma squared over n is printed on the plot.
Every team's raw mean (open plum) and its BLUP (filled ember), fitted by REML on the running dataset. Gold arrows show the pull toward the grand mean: the 6-customer team moves 35% of the way in, while the 230-customer team moves about 1%. Shrinkage is not a uniform squeeze; it is precisely targeted distrust of small samples.

Two connections make this more than a neat formula. First, this is the James-Stein / empirical Bayes phenomenon wearing regression clothes: Stein’s famous 1956 result that shrinking individual estimates toward a common center dominates estimating each one separately (in total error, once you have three or more groups) is exactly the no-pooling-versus-partial-pooling comparison, and Efron and Morris’s classic baseball example, predicting season batting averages from early-season averages shrunk toward the league mean, is this section with players in place of teams. The mixed model operationalizes the same trade: accept a little bias on each group in exchange for a large variance reduction, and win on average.

Second, and this one reframes the whole model for machine-learning readers: partial pooling is ridge regression on the group dummies, ridge being least squares plus a penalty on the squared size of the coefficients. Write the joint log-density of \((y, u)\) and maximize over \(\beta\) and \(u\) together; dropping constants and flipping sign, you minimize

\[\frac{1}{\sigma^2}\,\lVert y - X\beta - Zu \rVert^2 \;+\; \frac{1}{\tau^2}\,\lVert u \rVert^2\]

which is a least squares fit with dummy variables per team plus an L2 penalty on those dummies, penalty weight \(\sigma^2/\tau^2\) (these are Henderson’s mixed model equations, the computational core of every LMM solver). Everything you know about regularization transfers: pooling toward zero deviation is the prior, the penalty controls the pooling strength, and the fixed effects are unpenalized. The one genuinely new trick is that nothing here is cross-validated: the penalty weight is \(\sigma^2/\tau^2\), and the model estimates both variances from the data. A mixed model is ridge regression that tunes itself, and the next section is about how.

Estimation: maximum likelihood, and why REML exists

With the variance components unknown, fitting has a chicken-and-egg structure: knowing \((\tau^2, \sigma^2)\) makes \(\beta\) easy, and knowing \(\beta\) makes the variances easy. The clean way in is to integrate the random effects out and look at what the model says about the observed data alone. Folding \(u\) into the error as before, the marginal distribution of the whole outcome vector is one big Gaussian,

\[y \sim \mathcal{N}\big(X\beta,\; V\big), \qquad V = ZGZ^\top + \sigma^2 I\]

where \(V\) has the block structure the ICC section derived (statisticians call it compound symmetry): \(\tau^2 + \sigma^2\) on the diagonal, \(\tau^2\) between same-group rows, zero elsewhere, so each group forms one block of uniformly correlated rows. For a fixed \(V\), maximizing the Gaussian log-likelihood over \(\beta\) is a quadratic problem with a closed form. Differentiate the only \(\beta\)-dependent term (the quadratic-gradient rules are tool 3):

\[\begin{aligned} \frac{\partial}{\partial \beta}\; \big( y - X\beta \big)^\top V^{-1} \big( y - X\beta \big) &= -2\, X^\top V^{-1} \big( y - X\beta \big) \;=\; 0 && \textcolor{#7b8794}{\small\text{expand the quadratic, differentiate, set the slope to zero at the optimum}} \\[4pt] \hat\beta &= \big( X^\top V^{-1} X \big)^{-1} X^\top V^{-1} y && \textcolor{#7b8794}{\small\text{generalized least squares: OLS with a whitening metric}} \end{aligned}\]

This is generalized least squares, ordinary least squares computed in the geometry that \(V\) defines, so that highly correlated rows are automatically down-weighted rather than double-counted. The remaining variance parameters are then found numerically: substitute \(\hat\beta(V)\) back in (this is called profiling) and optimize the resulting function of the variance components alone. That optimization is the only genuinely iterative part of fitting an LMM.

But there is a bias trap here, and it is the same one hiding in the most familiar formula in statistics. The maximum likelihood estimate of a plain Gaussian variance divides by \(n\), yet every textbook divides by \(n - 1\). Why: the ML formula measures spread around the estimated mean \(\bar y\), and \(\bar y\) has been chosen to sit in the middle of this particular sample, closer to the data than the true mean is. Residuals from a fitted quantity are systematically too small, by exactly one parameter’s worth. The same thing happens in a mixed model, except worse: estimating \(p\) fixed-effect coefficients eats \(p\) parameters’ worth of spread, and the variance components inherit the downward bias. With many fixed effects or few groups the distortion is material.

Restricted maximum likelihood (REML) is the fix, and its cleanest derivation is one honest Gaussian integral: instead of maximizing the likelihood at the best \(\beta\), average the likelihood over every possible \(\beta\), so no single optimistic choice of \(\beta\) gets to flatter the variances. The two moves it needs are completing the square in matrix form (tool 4) and the Gaussian integral’s determinant (tool 5); here is how they assemble. Completing the square splits the exponent into a \(\beta\)-free part and a quadratic in \(\beta\):

\[\begin{aligned} L_R(V) &= \int L(\beta, V)\; d\beta && \textcolor{#7b8794}{\small\text{integrate } \beta \text{ out instead of maximizing over it}} \\[4pt] &\propto \lvert V \rvert^{-1/2} e^{-\frac{1}{2} r^\top V^{-1} r} \int e^{-\frac{1}{2} (\beta - \hat\beta)^\top X^\top V^{-1} X (\beta - \hat\beta)}\; d\beta && \textcolor{#7b8794}{\small\text{complete the square around the GLS solution } \hat\beta \text{, with } r = y - X\hat\beta} \\[4pt] &\propto \lvert V \rvert^{-1/2}\, \big\lvert X^\top V^{-1} X \big\rvert^{-1/2}\, e^{-\frac{1}{2} r^\top V^{-1} r} && \textcolor{#7b8794}{\small\text{a Gaussian integral contributes its normalizing determinant}} \end{aligned}\]

The new factor \(\lvert X^\top V^{-1} X \rvert^{-1/2}\) is the entire difference between ML and REML, and it is an uncertainty tax: it charges the variance components for the fact that \(\beta\) was estimated rather than known. In the plain-Gaussian special case this correction turns the divide-by-\(n\) estimate into exactly divide-by-\(n-1\), which is the tidiest way to remember what REML is. Practical rules that follow:

  • REML is the right default for estimating variance components, and it is what statsmodels and lme4 do unless told otherwise.
  • Comparing two models that differ in fixed effects requires ML, not REML. The REML likelihood is computed from what remains after the fixed effects are averaged away, and that remainder changes definition whenever \(X\) changes, so two REML numbers built on different \(X\) matrices are not on the same scale. Fit with ML for the comparison, refit the winner with REML.
  • Likelihood-ratio tests on variance components are conservative at the boundary: “is \(\tau^2 = 0\)?” pins the null to the edge of the parameter space, where the usual chi-squared distribution is wrong (the honest reference is a mixture). A significant result is still significant; a marginal one deserves suspicion in the other direction.

Fitting one in Python

Time to make all of this concrete. statsmodels ships a full linear mixed model in MixedLM; the snippet simulates the exact dataset used in every figure above (same seed, same draw order) and fits it.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
team_sizes = [6, 8, 11, 15, 19, 24, 30, 38, 48, 60,
              75, 90, 110, 130, 155, 180, 205, 230]

u = rng.normal(0.0, 6.0, size=len(team_sizes))       # each team's true offset
rows = []
for j, size in enumerate(team_sizes):
    wait = rng.uniform(0, 30, size)
    csat = 78 - 0.9 * wait + u[j] + rng.normal(0, 8.0, size)
    rows.append(pd.DataFrame({"team": j, "wait": wait, "csat": csat}))
df = pd.concat(rows, ignore_index=True)

model = smf.mixedlm("csat ~ wait", df, groups="team")  # random intercept per team
result = model.fit()                                   # REML by default
print(result.summary())

Running it prints (abridged to the rows that matter):

         Mixed Linear Model Regression Results
========================================================
No. Observations: 1434    Method:             REML
No. Groups:       18      Scale:              62.1673
--------------------------------------------------------
             Coef.  Std.Err.    z    P>|z| [0.025 0.975]
--------------------------------------------------------
Intercept    76.268    1.140  66.901 0.000 74.033 78.502
wait         -0.899    0.024 -37.574 0.000 -0.946 -0.852
team Var     19.057    0.901
========================================================

Three rows, three different kinds of quantity, and mapping them back to the theory is the whole skill of reading this table:

  • Intercept and wait are the fixed effects \(\hat\beta\), estimated by GLS at the fitted variance components. The truth in the simulation was \(78\) and \(-0.9\); the model reports \(76.3 \pm 1.1\) and \(-0.899 \pm 0.024\), both covering. These read exactly like GLM coefficients: each minute of waiting costs about 0.9 points of satisfaction, within any given team.
  • team Var is \(\hat\tau^2 = 19.06\), so \(\hat\tau \approx 4.4\) points: the estimated spread of team quality. (The simulation’s true \(\tau\) was 6; with only 18 teams, variance components are estimated noisily, a caveat the playbook section returns to.)
  • Scale is \(\hat\sigma^2 = 62.17\), so \(\hat\sigma \approx 7.9\) points of within-team scatter, against a true 8.

From the two variance components you compute the ICC yourself, and it is worth doing every time:

tau2 = float(result.cov_re.iloc[0, 0])   # 19.06
sigma2 = result.scale                    # 62.17
icc = tau2 / (tau2 + sigma2)             # 0.235

About 24% of the satisfaction variance (after accounting for wait time) lives at the team level. The per-team BLUPs from the shrinkage figure are result.random_effects, a dict of one offset per team, already shrunken exactly as the \(\lambda_j\) formula dictates.

Which library? statsmodels.MixedLM is the mature Python option for linear mixed models and handles random intercepts, random slopes, nested groups, and variance components. R’s lme4 remains the reference implementation; if you need its exact behavior (or its ecosystem of helpers) from Python, pymer4 wraps it directly. The Bayesian route via bambi (a formula interface over PyMC) becomes the practical choice once models get complicated, and is discussed with the GLMM below.

Random slopes: when groups differ in the effect, not just the level

Random intercepts encode “teams differ in baseline”. But the more interesting heterogeneity is usually in the effect: maybe wait time hurts more at some teams than others, because a good team makes a 20-minute wait feel attended-to and a bad team makes it feel like a void. Statistically: the coefficient on wait varies by team. The extension writes each team’s intercept and slope as the fixed pair plus a team-specific deviation vector, now drawn from a two-dimensional Gaussian:

\[y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij} + \varepsilon_{ij}, \qquad \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \mathcal{N}\!\left( 0,\; G = \begin{pmatrix} \tau_0^2 & \rho_{01}\tau_0\tau_1 \\ \rho_{01}\tau_0\tau_1 & \tau_1^2 \end{pmatrix} \right)\]

\(G\) now carries three numbers: the spread of intercepts \(\tau_0\), the spread of slopes \(\tau_1\), and their correlation \(\rho_{01}\), which is a real, interpretable quantity: a negative \(\rho_{01}\) here would mean teams with happier baselines also lose satisfaction faster per minute of waiting (more altitude to lose). Everything from the previous sections generalizes: the BLUPs shrink two-dimensionally toward the population center, and REML estimates \(G\)’s three entries alongside \(\sigma^2\).

rng = np.random.default_rng(53)
G_true = [[6.0**2, -0.4 * 6.0 * 0.35],       # tau0 = 6 CSAT points
          [-0.4 * 6.0 * 0.35, 0.35**2]]      # tau1 = 0.35 per minute, corr -0.4
U = rng.multivariate_normal([0.0, 0.0], G_true, size=len(team_sizes))
rows = []
for j, size in enumerate(team_sizes):
    wait = rng.uniform(0, 30, size)
    csat = (78 + U[j, 0]) + (-0.9 + U[j, 1]) * wait + rng.normal(0, 8.0, size)
    rows.append(pd.DataFrame({"team": j, "wait": wait, "csat": csat}))
df2 = pd.concat(rows, ignore_index=True)

slopes = smf.mixedlm("csat ~ wait", df2, groups="team",
                     re_formula="~wait").fit()   # random intercept AND slope
print(slopes.summary())

The summary gains two rows: wait Var (\(\hat\tau_1^2 = 0.116\), so \(\hat\tau_1 \approx 0.34\) against a true 0.35) and team x wait Cov (\(-1.48\), implying \(\hat\rho_{01} \approx -0.52\)). The fixed slope is \(-1.01 \pm 0.09\), and notice its standard error: four times larger than the random-intercept model’s 0.024, because once slopes genuinely vary by team, learning the average slope is again a “how many teams do you have” problem rather than a “how many customers” one. This dataset is the one behind the opening pooling figure, and the fitted picture:

Two-panel figure. Left: satisfaction against wait time with 18 translucent ember per-team fitted lines of visibly different slopes fanning around a thick teal population-average line with slope -1.01 per minute; the most and least wait-sensitive teams are annotated at -1.54 and -0.44 per minute. Right: a scatter of the 18 teams' fitted intercept-slope pairs with gold 1 and 2 standard deviation ellipses of the fitted covariance matrix G, tilted to show the negative correlation of -0.52, with the population center marked as a teal diamond.
A random-slopes fit (REML) to the running dataset. Left: each team earns its own partially pooled line; the population line (teal) is the fixed effect. Right: the same 18 lines summarized as points in (intercept, slope) space, with the fitted G's 1 and 2 sd ellipses; the tilt is the fitted intercept-slope correlation of -0.52. The team lines are not 18 free parameters; they are draws from this ellipse, and the ellipse is what the model estimates.

The right panel is the mental model worth keeping: a random-slopes model does not fit 18 lines, it fits a distribution over lines (the ellipse) plus 18 shrunken guesses at where each team sits in it.

Two practical warnings, both about ambition. First, random-effect structure is expensive in groups: each new random term adds variances and covariances to \(G\), all estimated from however many groups you have, and with 18 teams a random slope is already pushing it. When the optimizer returns a singular or boundary fit (\(\hat\tau_1^2 = 0\), or \(\hat\rho_{01} = \pm 1\)), the data cannot support the structure; simplify rather than report a boundary estimate as a finding. Second, the psycholinguistics literature spent a decade on exactly this tension: Barr et al.’s influential “keep it maximal” argued for including every random slope your design licenses (omitting a real one deflates standard errors on the corresponding fixed effect, the same dishonesty as ignoring clustering, one level up), while Matuschek et al. showed maximal models can cost real power and convergence. The workable compromise: start with the random effects your question requires (a random slope on the treatment variable if you care about the treatment), test whether they earn their keep, and be honest about boundary fits.

Hierarchical logistic regression: the GLMM

Everything so far had a Gaussian outcome. The GLM post’s whole point, though, was that the linear predictor generalizes past Gaussians through a link function, and the mixed-model idea composes with it cleanly: put the random effects inside the linear predictor. For churn (binary) across markets, the generalized linear mixed model (GLMM) reads

\[y_{ij} \mid u_j \sim \mathrm{Bernoulli}(p_{ij}), \qquad \log \frac{p_{ij}}{1 - p_{ij}} = \beta_0 + \beta_1 x_{ij} + u_j, \qquad u_j \sim \mathcal{N}(0, \tau^2)\]

Same three GLM choices (Bernoulli, linear predictor, logit), plus the fourth choice this post added: which coefficients get a distribution. The left side of that middle equation is the log-odds, the quantity the GLM post spends a section on: the logarithm of churns-per-non-churn, the one scale on which a bounded probability becomes an unbounded number that a linear predictor can live on, and therefore the scale on which all these coefficients are additive. Each market has its own sigmoid, horizontally shifted by its \(u_j\); hierarchical logistic regression is exactly this model, and “Bayesian hierarchical logistic regression” is the same likelihood with priors on \((\beta, \tau)\).

But two things genuinely change, and both are worth understanding before trusting any GLMM output.

The integral that will not close

For the linear model, integrating out \(u\) gave a clean Gaussian marginal, and everything downstream (GLS, REML) flowed from it. Try the same move here. The likelihood contribution of group \(j\) must average the Bernoulli likelihood over what the unobserved \(u_j\) might have been:

\[L_j(\beta, \tau) = \int_{-\infty}^{\infty} \underbrace{\prod_{i=1}^{n_j} p_{ij}(u)^{\,y_{ij}}\, \big(1 - p_{ij}(u)\big)^{1 - y_{ij}}}_{\textcolor{#0f8ba0}{\text{Bernoulli likelihood, given the group effect}}}\; \underbrace{\frac{1}{\sqrt{2\pi}\,\tau} e^{-u^2 / 2\tau^2}}_{\textcolor{#b45309}{\text{how likely that effect was}}}\; du\]

A product of sigmoids times a Gaussian has no closed-form integral; this single unresolvable integral is the computational fact about GLMMs, and every fitting strategy is a stance on it:

  • Laplace approximation: find the mode \(\hat{u}_j\) of the integrand’s exponent, replace the exponent by its second-order Taylor expansion there, and the integral becomes Gaussian, hence solvable by exactly tool 5 from the toolbox. One Gaussian per group, accurate when groups are big enough for their likelihoods to look bell-shaped. This is lme4::glmer’s default and the standard workhorse.
  • Adaptive Gauss-Hermite quadrature: the same idea with \(k\) well-placed evaluation points per group instead of one; exact as \(k\) grows, expensive with more than one random effect per group. The refinement to reach for with small groups (Laplace is its \(k=1\) case).
  • Penalized quasi-likelihood (PQL): an older linearize-and-iterate scheme; fast, but biased for binary data with small groups, and largely superseded.
  • Go Bayesian: put priors on \((\beta, \tau)\) and let MCMC (posterior sampling) average over the \(u_j\) exactly, which is the one approach that gets more attractive as the hierarchy gets deeper. In Python this is bambi/PyMC territory; in the wider world, Stan’s brms.

Two slopes, both honest: conditional vs marginal effects

The second change is conceptual, catches nearly everyone, and does not exist in the linear model. The coefficient \(\beta_1\) in the GLMM is a conditional (cluster-specific) effect: holding the market fixed, one more minute of wait time adds \(\beta_1\) to the log-odds of churn; \(e^{\beta_1}\) is a within-market odds ratio. A different, equally legitimate question is marginal: across the whole customer base, mixing all markets together, how does churn probability move with wait time? In a linear model these coincide (averaging lines gives a line with the same slope). Sigmoids are not lines. Averaging S-curves that are shifted left and right produces a flatter S-curve: at any given wait time, some markets are already saturated near 1 and others still near 0, and neither responds much, diluting the average response.

Plot of churn probability against wait time. Forty thin translucent ember sigmoid curves represent individual markets, shifted horizontally by their random effects. A thick ember curve marks a typical market with logit-scale slope 0.22. A thick teal curve, computed by exactly averaging the market curves, is visibly flatter, matching a logit-scale slope of about 0.16, with a dashed closed-form approximation overlapping it almost perfectly.
Why a GLMM has two slopes. Thin ember curves: individual markets, each a shifted sigmoid (tau = 1.5 here, exaggerated for visibility). Thick ember: the typical market (u = 0), whose logit slope is the model coefficient 0.22. Teal: the population-averaged curve, an exact numerical average of market curves, is flatter, logit slope about 0.16. The dashed curve is the closed-form attenuation approximation, nearly indistinguishable from the exact average.

The flattening even has a usable closed form, and the derivation is a pleasure because it runs on one exact identity. The marginal curve is \(\bar p(\eta) = \mathbb{E}_u[\operatorname{sigmoid}(\eta + u)]\) with \(u \sim \mathcal{N}(0, \tau^2)\). Two steps:

\[\begin{aligned} \operatorname{sigmoid}(z) \;\approx\; \Phi(c z), \quad c = \tfrac{16\sqrt{3}}{15\pi} \approx 0.588 && \textcolor{#7b8794}{\small\text{the logistic curve is almost a rescaled Gaussian CDF } \Phi} \\[6pt] \mathbb{E}_u\big[ \Phi(c\eta + c u) \big] = P\big( Z - c u \le c\eta \big) = \Phi\!\left( \frac{c\eta}{\sqrt{1 + c^2 \tau^2}} \right) && \textcolor{#7b8794}{\small\text{write } \Phi \text{ as a probability over } Z \sim \mathcal{N}(0,1) \text{; then } Z - cu \text{ is just a wider Gaussian}} \\[6pt] \bar p(\eta) \;\approx\; \operatorname{sigmoid}\!\left( \frac{\eta}{\sqrt{1 + c^2 \tau^2}} \right), \qquad \beta_{\text{marginal}} \approx \frac{\beta_{\text{conditional}}}{\sqrt{1 + 0.346\,\tau^2}} && \textcolor{#7b8794}{\small\text{convert back; the whole linear predictor got divided by one number}} \end{aligned}\]

The population-averaged slope is the within-market slope shrunk by \(\sqrt{1 + 0.346\,\tau^2}\) (the constant is \(c^2\)), a result due to Zeger, Liang and Albert. Neither slope is wrong; they answer different questions. “What happens to a market’s churn if its waits grow a minute?” is conditional. “What is the churn-vs-wait gradient across our whole book of business?” is marginal. Report the one your question asks for, and say which it is; a surprising amount of cross-study disagreement in fields that mix GLMMs with the alternatives below is just this factor, unacknowledged.

The dedicated tool for marginal questions is generalized estimating equations (GEE): no random effects, no integral, just the population-averaged model fitted directly with a working correlation structure (a rough guess at how same-group rows correlate; “exchangeable”, every pair equally correlated, is the usual default) and robust standard errors that stay honest under clustering even when that guess is wrong. The GLM post name-checked it alongside mixed models as the other escape hatch; this is the difference between them.

Fitting the GLMM, and what Python honestly offers

import statsmodels.api as sm
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(23)
n_markets, n_per = 16, 250
u = rng.normal(0.0, 1.2, n_markets)                  # true tau = 1.2
market = np.repeat(np.arange(n_markets), n_per)
wait = rng.uniform(0, 30, n_markets * n_per)
eta = -3.2 + 0.22 * wait + u[market]                 # same betas as the GLM post
churned = rng.binomial(1, 1 / (1 + np.exp(-eta)))
df = pd.DataFrame({"market": market, "wait": wait, "churned": churned})

# conditional model: a GLMM with a random intercept per market
glmm = BinomialBayesMixedGLM.from_formula(
    "churned ~ wait", {"market": "0 + C(market)"}, df).fit_vb()

# marginal model: GEE with exchangeable working correlation
gee = sm.GEE.from_formula("churned ~ wait", groups="market", data=df,
                          family=sm.families.Binomial(),
                          cov_struct=sm.cov_struct.Exchangeable()).fit()

The two fits, on the same 4,000 simulated customers whose true within-market slope is 0.22 with \(\tau = 1.2\):

  • The GLMM recovers the conditional slope: \(0.226\), with the market spread estimated at \(\hat\tau \approx 1.29\).
  • The GEE recovers the marginal slope: \(0.179\), and the attenuation formula predicted \(0.22 / \sqrt{1 + 0.346 \times 1.2^2} = 0.180\). The agreement to three decimals is the theory working, not a coincidence of the seed.
  • A naive pooled logistic regression also lands near 0.18, because ignoring clusters implicitly averages over them, but it reports a standard error of 0.006 where GEE’s honest, cluster-robust one is 0.016. Same point estimate, three times overconfident: the opening section’s lesson, replayed in logit space.

One candid ecosystem note: BinomialBayesMixedGLM fits by variational Bayes (fit_vb) or a MAP approximation (fit_map), which is serviceable for random intercepts but less battle-tested than lme4::glmer’s Laplace machinery, and its posterior spreads are approximate. For GLMMs that matter, the strong choices from Python are pymer4 (which calls glmer under the hood) or going properly Bayesian with bambi, where the same model is three lines and returns full posteriors:

import bambi as bmb

model = bmb.Model("churned ~ wait + (1|market)", df, family="bernoulli")
posterior = model.fit()     # NUTS via PyMC; summaries via ArviZ

The (1|market) syntax is lme4’s formula convention (random intercept per market; (wait|market) would add a random slope), which has become the shared language of the entire mixed-models world and is worth learning to read regardless of your library.

A practical playbook

Decisions that come up every time, with the honest trade-offs:

  • Random effects or dummy variables? Dummies (fixed group effects) estimate each group free of distributional assumptions, but cannot say anything about unseen groups, spend a parameter per group, and give small groups their terrible no-pooling estimates. Random effects assume exchangeable, Gaussian-ish group offsets, pool information, and generalize to new groups. Prefer random effects when there are many groups, when groups are a sample from a population you care about, or when small groups need stabilizing; prefer dummies when there is a handful of specific groups (five regions you will never resample) whose individual identities are the point.
  • The econometrician’s objection deserves respect. Random effects assume \(u_j\) is uncorrelated with the predictors. If good teams also get easier customers, that assumption fails and the fixed-effects (dummy) estimator, which differences the groups away, is the safe one for causal readings; the Hausman test compares the two, and the Mundlak device (add each group’s mean of \(x\) as a predictor, giving the within-group slope its own coefficient) usually gets you the best of both. Related: a predictor’s within-group and between-group effects can genuinely differ, and group-mean centering makes the model estimate both instead of an uninterpretable blend.
  • How many groups do you need? Variance components are estimated across groups, so with very few (below roughly 5 to 10) \(\hat\tau^2\) is mostly noise and partial pooling loses its footing; Gelman and Hill’s advice is that with so few groups a mixed model typically adds little over dummies, though it rarely hurts. Group sizes can be tiny (that is what shrinkage is for); it is the group count that carries the burden.
  • Nested or crossed? Teams within regions within countries are nested (each level sits inside one unit of the level above); customers crossed with products (every customer can rate every product) are crossed, and recommender-style data is full of this. Both are mixed models, but crossed effects break the block structure that makes nested computation cheap; lme4 handles crossed effects natively, statsmodels’ MixedLM only awkwardly (via variance components), which is a real reason to switch tools.
  • Diagnostics carry over, plus one new plot. Residual-versus-fitted and Q-Q checks from the GLM post still apply to the within-group residuals; the new citizen is the Q-Q plot of the BLUPs, since the model assumes the group effects themselves are Gaussian. A couple of wildly outlying group effects often mean a missing group-level predictor, which is a finding, not a nuisance. Watch convergence warnings and boundary fits as first-class results.
  • Prediction has two modes. For a known group, include its BLUP (statsmodels: add result.random_effects[g] to the fixed prediction); for a new group, the random effect is unknown, so predict from the fixed effects and widen the uncertainty by \(\tau\). If your production model will mostly score unseen groups, the fixed-plus-tau mode is the one to validate.

Where mixed models run out

  • Few groups, unstable variance components. Below a handful of groups, \(\hat\tau^2\) is barely estimable, and everything built on it (shrinkage weights, honest standard errors) inherits the wobble. Bayesian versions with weakly informative priors on \(\tau\) degrade most gracefully here.
  • The Gaussian random-effects assumption is a real assumption. A few genuinely aberrant groups (one fraudulent market) violate it, and shrinkage will partially launder the anomaly toward the mean rather than flag it; check the BLUP Q-Q plot, and consider heavier-tailed random-effect distributions in a Bayesian fit.
  • GLMM approximations have failure zones. Laplace and especially PQL degrade with binary outcomes and small groups; adaptive quadrature or MCMC are the fallbacks, at rising cost.
  • Deep or crossed hierarchies outgrow the classical software. Students within classrooms within schools within districts, crossed with years: past two or three levels, formula-based frequentist tooling gets brittle, and probabilistic-programming frameworks (PyMC, Stan) become the practical path, where a hierarchy is just more lines in the generative story.
  • None of this creates causal identification. Partial pooling fixes variance, not confounding: a mixed model with markets does not make wait time exogenous within markets. The coefficients remain conditional associations, exactly as the GLM post’s caveats had it, and the random-versus-fixed-effects assumption above is where clustering and causality intersect.

Closing thoughts

Three ideas to carry out. First, clustering is an uncertainty problem before it is a prediction problem: ignoring groups often barely moves point estimates while wrecking standard errors, which is the quietest and most consequential failure mode in applied regression. Second, partial pooling is regularization with a derivation: the BLUP’s shrinkage weight \(\lambda_j = \tau^2/(\tau^2 + \sigma^2/n_j)\) is ridge regression on group dummies whose penalty the data estimates, and the James-Stein result says the middle path is not a compromise but a dominant strategy. Third, in nonlinear hierarchies, “the effect” is two numbers: the within-group slope the GLMM estimates and the population-averaged slope flattened by \(\sqrt{1 + 0.346\,\tau^2}\), and most confusion around hierarchical logistic regression dissolves the moment you say which one you mean. The GLM post ended by promising that its recipe generalized; this is what one deliberate generalization, a distribution on the coefficients, buys: models that treat “how much should I trust this group’s data?” as a statistical question with a computable answer.

Appendix: the probability toolbox

Every derivation in this post reduces to five facts, each provable in a few lines with nothing beyond first-course probability and the product rule. Each tool below states its fact in the open, folds the proof behind a click for when you want it, and links both ways: the map jumps down to each tool, every tool ends by linking back to the sections it powers, and the body text above links here at the exact step where each tool gets used.

Tool In one line Powers
1. Covariance rules constants vanish, covariance adds term by term, and averaging \(n\) noises divides variance by \(n\) ICC, design effect, BLUP setup
2. Gaussian conditioning the best guess of an unseen \(A\) from an observed \(B\) is \(B\) times a regression slope shrinkage
3. Quadratic gradients \(\beta^\top A \beta\) differentiates like \(a x^2\): the gradient is \(2 A \beta\) GLS
4. Completing the square a quadratic in \(\beta\) splits into “distance from the best fit” plus a \(\beta\)-free remainder REML
5. The Gaussian integral a Gaussian bump’s total volume is the inverse square root of its curvature’s determinant REML, Laplace

1. Covariance rules, and why averaging tames noise

The facts. Variance is the expected squared deviation from the mean, \(\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}X)^2]\); covariance is the expected product of two variables’ deviations, \(\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)]\), positive when they move together; correlation is covariance rescaled to live in \([-1, 1]\). Three rules follow directly from the definitions plus the linearity of expectation, and they are the only ones the post uses:

  • Constants vanish. Shifting \(X\) by a constant does not change its deviations, so neither variance nor covariance sees fixed quantities. This is why the fixed effects dropped out of every covariance computation above.
  • Covariance is linear in each slot. \(\mathrm{Cov}(aX + Y,\, W) = a\,\mathrm{Cov}(X, W) + \mathrm{Cov}(Y, W)\): expand the defining product and push the expectation through the sum. Two consequences the post used constantly: \(\mathrm{Cov}(X, X) = \mathrm{Var}(X)\), and \(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)\).
  • Independent variables have zero covariance, since the expectation of a product of independent, zero-mean deviations factors into a product of zeros.

The workhorse consequence: the average \(\bar\varepsilon\) of \(n\) independent noise terms, each with variance \(\sigma^2\), has variance \(\sigma^2 / n\). Noise cancels itself at rate \(1/n\), which is the entire reason large groups earn trust.

Proof that averaging divides variance by n \[\begin{aligned} \mathrm{Var}(\bar\varepsilon) &= \mathrm{Var}\!\Big( \tfrac{1}{n} \textstyle\sum_i \varepsilon_i \Big) = \tfrac{1}{n^2}\, \mathrm{Var}\!\Big( \textstyle\sum_i \varepsilon_i \Big) && \textcolor{#7b8794}{\small\text{scaling by } \tfrac{1}{n} \text{ scales deviations by } \tfrac{1}{n} \text{, hence variance by } \tfrac{1}{n^2}} \\[4pt] &= \tfrac{1}{n^2} \textstyle\sum_i \mathrm{Var}(\varepsilon_i) && \textcolor{#7b8794}{\small\text{expand the variance of the sum; independence kills every cross term}} \\[4pt] &= \frac{\sigma^2}{n} && \textcolor{#7b8794}{\small n \text{ identical terms, each } \sigma^2 \text{, divided by } n^2} \end{aligned}\]

Powers: the ICC derivation and its design effect, plus the BLUP setup, where \(\mathrm{Var}(u_j + \varepsilon_{ij}) = \tau^2 + \sigma^2\) is the sum rule with the cross term dead, an instance of what textbooks call the law of total variance: total spread is spread between groups plus average spread within them.

2. Conditioning in a Gaussian world is regression

The fact. For jointly Gaussian, zero-mean \((A, B)\), the best guess of the unseen \(A\) after observing \(B\) is

\[\mathbb{E}[A \mid B] = \frac{\mathrm{Cov}(A, B)}{\mathrm{Var}(B)}\, B\]

the observed \(B\), scaled down by the fraction of \(B\)’s variation that \(A\) accounts for.

Proof, via a decomposition trick worth knowing \[\begin{aligned} A &= cB + W, \qquad c = \frac{\mathrm{Cov}(A, B)}{\mathrm{Var}(B)} && \textcolor{#7b8794}{\small\text{split } A \text{ into a multiple of } B \text{ plus a leftover } W := A - cB} \\[4pt] \mathrm{Cov}(W, B) &= \mathrm{Cov}(A, B) - c\, \mathrm{Var}(B) = 0 && \textcolor{#7b8794}{\small c \text{ was chosen precisely to make the leftover uncorrelated with } B} \\[4pt] \mathbb{E}[A \mid B] &= cB + \mathbb{E}[W \mid B] = cB && \textcolor{#7b8794}{\small\text{for jointly Gaussian variables, uncorrelated means independent, so } B \text{ tells us nothing about } W} \end{aligned}\]

Read it as regression. The coefficient \(c\) is exactly a regression slope: shrinkage is regression of the unobserved truth on the observed data. Only the last proof line needs Gaussianity; without it, \(cB\) is still the best linear guess, which is why BLUPs keep their optimality-among-linear-predictors even when normality is shaky.

Powers: the BLUP derivation, with \(A = u_j\) and \(B = D_j\).

3. The derivative of a quadratic

The facts. For a symmetric matrix \(A\), the gradient of \(\beta^\top A \beta\) with respect to \(\beta\) is \(2A\beta\), and the gradient of \(b^\top \beta\) is \(b\): the matrix twins of \(\frac{d}{dx}(a x^2) = 2ax\) and \(\frac{d}{dx}(bx) = b\).

Why the factor of 2 appears

Write the quadratic form as an explicit double sum, \(\beta^\top A \beta = \sum_i \sum_k \beta_i A_{ik} \beta_k\), and differentiate with respect to one coordinate \(\beta_j\): the product rule produces two copies of \(\sum_k A_{jk} \beta_k\), one from the terms where \(\beta_j\) plays the left role and one where it plays the right, and symmetry of \(A\) makes the two copies identical. Stack the coordinates back into a vector and you have \(2A\beta\).

Applied to the GLS objective, expanding first and using both rules:

\[(y - X\beta)^\top V^{-1} (y - X\beta) = y^\top V^{-1} y \; - \; 2\, \beta^\top X^\top V^{-1} y \; + \; \beta^\top \big( X^\top V^{-1} X \big) \beta\]

whose gradient is \(-2 X^\top V^{-1} y + 2 X^\top V^{-1} X \beta\); setting it to zero gives the GLS normal equations.

Powers: the generalized-least-squares derivation, and it reappears inside tool 4.

4. Completing the square, matrix edition

The fact. With \(\hat\beta\) the GLS solution and \(r = y - X\hat\beta\) its residual, the exponent of the mixed model’s likelihood splits cleanly:

\[(y - X\beta)^\top V^{-1} (y - X\beta) \;=\; r^\top V^{-1} r \;+\; (\beta - \hat\beta)^\top X^\top V^{-1} X (\beta - \hat\beta)\]

a \(\beta\)-free part measuring how well the best fit does, plus a pure penalty for being away from it.

Proof, by expanding around the best fit \[\begin{aligned} y - X\beta &= r + X(\hat\beta - \beta) && \textcolor{#7b8794}{\small\text{add and subtract } X\hat\beta} \\[4pt] (y - X\beta)^\top V^{-1} (y - X\beta) &= r^\top V^{-1} r \; + \; 2 (\hat\beta - \beta)^\top X^\top V^{-1} r \; + \; (\hat\beta - \beta)^\top X^\top V^{-1} X (\hat\beta - \beta) && \textcolor{#7b8794}{\small\text{expand the quadratic in its two pieces}} \\[4pt] X^\top V^{-1} r &= X^\top V^{-1} y - X^\top V^{-1} X \hat\beta = 0 && \textcolor{#7b8794}{\small\text{the GLS normal equations from tool 3, verbatim}} \\[4pt] (y - X\beta)^\top V^{-1} (y - X\beta) &= r^\top V^{-1} r + (\beta - \hat\beta)^\top X^\top V^{-1} X (\beta - \hat\beta) && \textcolor{#7b8794}{\small\text{the cross term dies; the sign flip inside the square is free}} \end{aligned}\]

The geometry behind the algebra: \(\hat\beta\) is the point where the residual is \(V^{-1}\)-orthogonal to everything \(X\) can express, so moving \(\beta\) away from \(\hat\beta\) adds error in a direction the residual cannot see.

Powers: the second line of the REML derivation.

5. The Gaussian integral remembers its determinant

The fact. Integrating a Gaussian bump over all of \(\mathbb{R}^p\) yields the inverse square root of its curvature matrix’s determinant:

\[\int_{\mathbb{R}^p} e^{-\frac{1}{2} v^\top A v}\; dv = (2\pi)^{p/2}\, \lvert A \rvert^{-1/2}\]
Proof, one rotation and p one-dimensional integrals \[\begin{aligned} \int e^{-z^2 / 2\omega}\, dz &= \sqrt{2\pi\omega} && \textcolor{#7b8794}{\small\text{the one-dimensional case: a Gaussian density with variance } \omega \text{ integrates to 1}} \\[4pt] v^\top A v &= \textstyle\sum_i \lambda_i w_i^2 && \textcolor{#7b8794}{\small\text{rotate to } A\text{'s eigen-axes; a rotation does not distort volume, so the integral is unchanged}} \\[4pt] \int e^{-\frac{1}{2} \sum_i \lambda_i w_i^2}\, dw &= \textstyle\prod_i \sqrt{2\pi / \lambda_i} && \textcolor{#7b8794}{\small\text{the integrand factors into } p \text{ independent one-dimensional Gaussians, variance } 1/\lambda_i \text{ each}} \\[4pt] &= (2\pi)^{p/2} \big( \textstyle\prod_i \lambda_i \big)^{-1/2} = (2\pi)^{p/2}\, \lvert A \rvert^{-1/2} && \textcolor{#7b8794}{\small\text{and a determinant is exactly the product of the eigenvalues}} \end{aligned}\]

The interpretation is the useful part: a sharply curved likelihood (large \(\lvert A \rvert\)) encloses little volume, a flat one encloses a lot, and that is precisely how REML’s \(\lvert X^\top V^{-1} X \rvert^{-1/2}\) factor charges the variance components for how much freedom the fixed effects had. The same integral, applied to a Taylor expansion instead of an exact quadratic, is the Laplace approximation from the GLMM section; approximation there, identity here.

Powers: the third line of the REML derivation, with \(A = X^\top V^{-1} X\); conceptually, the Laplace approximation.

Sources and further reading

Citation Information

If you find this content useful, please cite this work as:

Bhana, Nish. "Mixed Effects Models: Partial Pooling for Data That Comes in Groups". Nish Blog (September 2025). https://www.nishbhana.com/Mixed-Effects-Models/

Or use the BibTeX citation:

@article{bhana2025mixedeffects,
  title   = {Mixed Effects Models: Partial Pooling for Data That Comes in Groups},
  author  = {Bhana, Nish},
  journal = {nishbhana.com},
  year    = {2025},
  month   = {September},
  url     = {https://www.nishbhana.com/Mixed-Effects-Models/}
}

x.com, Facebook