Table of Contents
- TL;DR
- What does a model actually predict?
- When the straight line breaks
- The GLM recipe: three choices
- Where the sigmoid actually comes from
- One model, two spaces
- What the model learns, and what “fitting” means
- Reading the coefficients: features as evidence
- The assumptions, and how to actually check them
- Where the recipe runs out
- Closing thoughts
- Sources and further reading
Linear regression and logistic regression are usually taught as two different models: one fits lines to continuous targets, the other fits S-curves to binary ones. That framing hides the more useful truth. They are the same model with two settings changed, and the framework that makes this precise, the generalized linear model (GLM), was one of statistics’ great unification moments: in 1972, Nelder and Wedderburn showed that ordinary regression, the probit models of bioassay, and the log-linear models of contingency tables were all instances of one recipe, fittable by one algorithm. This post works through that recipe from an applied angle: what these models actually predict, the two honest ways to read their outputs, how to use their coefficients as evidence about which features matter, and the assumptions you should check before trusting any of it.
TL;DR
- A GLM predicts a conditional mean, \(\mathbb{E}[\,y \mid x\,]\): the average outcome among cases with those features, not the value of any individual case. Linear regression predicts the center of a Gaussian; logistic regression predicts the parameter of a coin flip. Everything else follows from taking that seriously.
- Every GLM is three choices: a distribution for how observations scatter (the random component), a weighted sum of features \(\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k\) (the systematic component), and a link function that connects the sum to the mean. Linear regression chooses Gaussian + identity; logistic regression chooses Bernoulli + logit, whose inverse is the sigmoid.
- The sigmoid is derived, not chosen. Write the Bernoulli distribution in exponential-family form and its natural parameter turns out to be the log-odds; inverting that relationship is the sigmoid. No one picked it because it looked nice.
- Every GLM has two faces: a straight line in the link space (log-odds, for logistic regression) and a curve in the outcome space (probability). The model learns in the first space; you report in the second. Coefficients only have their clean “one unit of \(x\) adds \(\beta\)” reading in the link space.
- Coefficients are evidence about features, with caveats. Exponentiating a logistic coefficient gives an odds ratio; confidence intervals tell you whether a feature carries signal. But coefficients are conditional on the other features, sensitive to scale and collinearity, and silent about causation.
- The recipe has known failure modes: assumed linearity in the link, no automatic interactions, sensitivity to outliers and separation, and a hard-coded mean-variance relationship. Knowing the diagnostics is what separates using a GLM from trusting one blindly.
What does a model actually predict?
Start with the model everyone meets first. The usual way to write linear regression is as a sum:
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k\]Each feature \(x_j\) gets a learned weight \(\beta_j\), the weights times the features are added up, and the result is the prediction. This is the formula view, and it is the one that makes linear models so prized as interpretable models: increase \(x_1\) by one unit while holding everything else fixed, and the prediction moves by exactly \(\beta_1\).
But real data never sits exactly on the line, and the equation as written has no room for that. So the version you meet in a statistics course adds one more term:
\[y = \underbrace{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}_{\textcolor{#0f8ba0}{\beta^\top x}} \; + \; \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)\]Two things to unpack here, because they carry the rest of the post. First, the underbrace introduces the compact notation \(\textcolor{#0f8ba0}{\beta^\top x}\): stack the weights into a vector \(\beta\) and the features into a vector \(x\) (with a constant 1 in front so \(\beta_0\) rides along), and the dot product reproduces the weighted sum exactly. Second, the error term \(\varepsilon\) is where all the statistics lives. It collects everything about the outcome that the features do not explain (unmeasured causes, genuine randomness, measurement noise), and the equation says nothing testable until you commit to properties for it. The classical commitments are that the errors average to zero at every \(x\), have the same spread \(\sigma\) everywhere, are independent of each other, and follow a Gaussian. Hold that thought: these are exactly the “assumptions of linear regression” we will check later, and this is the only place they enter.
Now for the reading everyone should actually remember, which follows from that equation in three lines. Take the average of both sides among observations sharing the same features \(x\) (that is what the conditional expectation \(\mathbb{E}[\, \cdot \mid x\,]\) means):
\[\begin{aligned} \mathbb{E}[\, y \mid x \,] &= \mathbb{E}\big[\, \textcolor{#0f8ba0}{\beta^\top x} + \varepsilon \mid x \,\big] && \textcolor{#7b8794}{\small\text{take the conditional average of both sides of the model}} \\[4pt] &= \textcolor{#0f8ba0}{\beta^\top x} + \mathbb{E}[\, \varepsilon \mid x \,] && \textcolor{#7b8794}{\small\text{the average of a sum is the sum of averages; } \beta^\top x \text{ is fixed once } x \text{ is}} \\[4pt] &= \textcolor{#0f8ba0}{\beta^\top x} && \textcolor{#7b8794}{\small\text{the zero-mean assumption: errors average out at every } x} \end{aligned}\]The noise disappears under averaging, and what remains is the statement worth memorizing: the line is not “the value of \(y\)”; it is the conditional mean of \(y\), the average outcome among cases with those feature values. A prediction from linear regression is an estimate of \(\mathbb{E}[\,y \mid x\,]\), nothing more and nothing less, which is why individual observations landing off the line is not a model failure but the model working as described.
There is one last repackaging step, and it is the one that unlocks everything after this section. Instead of writing “a fixed line plus Gaussian noise”, fold both into a single distributional statement, since adding Gaussian noise to a constant just recenters the Gaussian:
\[y \mid x \;\sim\; \mathcal{N}\big(\textcolor{#b93d20}{\mu},\, \sigma^2\big), \qquad \textcolor{#b93d20}{\mu} = \textcolor{#0f8ba0}{\beta^\top x}\]This is the probabilistic view: the model predicts the mean \(\textcolor{#b93d20}{\mu}\) of a Gaussian distribution, and the observed \(y\) scatters around that mean with spread \(\sigma\). It is the same model as the formula view, and the same model as the error-term view, but notice what changed: the error term has vanished as a separate object, because the distribution now carries it. Every assumption that used to be a claim about \(\varepsilon\) became part of the distribution choice. That repackaging is what generalizes, because it splits any such model into two swappable parts: a recipe for the mean, and a distribution describing the scatter around it. A natural question follows: what happens when a Gaussian is the wrong description of the scatter?
When the straight line breaks
Suppose the outcome is binary: a customer churns or does not, an email is spam or is not. The quantity worth predicting is a probability, \(p(y = 1 \mid x)\), and you could try to model it with the machinery we already have: code the outcome as 0 or 1 and fit a straight line to it. The result fails in a way a picture makes obvious.
The visible failure is the range: a line \(\beta^\top x\) spans every real number, but a probability must live between 0 and 1. There are two quieter failures underneath it. The noise around the line is not remotely Gaussian (the outcome only takes two values, and its variance \(p(1-p)\) changes with the prediction itself, largest at \(p = 0.5\) and shrinking toward the extremes), and least squares, which is the right way to fit under Gaussian noise, is therefore the wrong way to fit here. Three failures, one diagnosis: we swapped the outcome type but kept the distribution.
The fix that generalizes is not to bend the line by hand. It is to go back to the probabilistic view and change the two pieces that no longer fit: the distribution, and the way the weighted sum connects to its mean.
The GLM recipe: three choices
A generalized linear model is precisely that recipe. You make three choices, and every model in the family is a different way of filling in the same three slots:
- The random component: a distribution describing how outcomes scatter. This is the only source of randomness in a GLM; the separate additive error term \(\varepsilon\) from the opening section is gone for good, absorbed into the distribution the way we saw the Gaussian absorb it. The distribution’s shape carries all of it.
- The systematic component: the linear predictor \(\textcolor{#0f8ba0}{\eta} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k\). This part never changes. It is the sum-of-weighted-features idea, kept intact across the whole family.
- The link function \(g\): the bridge between the two, \(g(\textcolor{#b93d20}{\mu}) = \textcolor{#0f8ba0}{\eta}\). Its inverse \(g^{-1}\) takes the unbounded sum and maps it onto whatever range the mean is allowed to occupy.
Filling the slots recovers the familiar models, plus relatives that usually get taught as separate topics:
| Outcome type | Random component | Link \(g(\mu)\) | Inverse link \(g^{-1}(\eta)\) | The model |
|---|---|---|---|---|
| continuous | Gaussian | identity | \(\eta\) | linear regression |
| binary | Bernoulli | logit \(\log\frac{\mu}{1-\mu}\) | sigmoid \(\frac{1}{1+e^{-\eta}}\) | logistic regression |
| count | Poisson | \(\log \mu\) | \(e^{\eta}\) | Poisson regression |
| positive, skewed | Gamma | \(\log \mu\) (common) | \(e^{\eta}\) | Gamma regression |
Odds, quickly. The logit link in that table speaks in odds, so pin the word down before going further. Odds re-express a probability as successes per failure: \(p = 0.75\) means odds of \(0.75 / 0.25 = 3\), or three to one. The log of that number is the log-odds: probability 0.75 is log-odds \(\log 3 \approx 1.1\), a coin flip (\(p = 0.5\), odds 1) is log-odds 0, and unlikely events get negative log-odds. Crucially, as \(p\) sweeps from 0 to 1, the log-odds sweeps across every real number. That is the entire qualification for being a good link: it puts a bounded probability onto exactly the unbounded scale a linear predictor lives on.
The table raises the obvious question: who decided the Bernoulli gets the logit and the Gaussian gets the identity? The pairings look arbitrary, and they are not. They fall out of a single derivation, and it is worth seeing once in full.
Where the sigmoid actually comes from
The unifying object behind GLMs is the exponential family of distributions. A distribution belongs to it if its density can be massaged into the shape
\[p(y;\, \eta) = b(y)\, \exp\big(\eta\, T(y) - a(\eta)\big)\]The four pieces have names and jobs, though you never compute with them directly:
- \(\eta\) is the natural parameter: the one knob the distribution exposes in this form, and the thing a GLM will make linear in the features.
- \(T(y)\) is the sufficient statistic, and for both distributions we rewrite below it is simply \(y\) itself (the multinomial mentioned later is the usual exception: its \(T(y)\) is a vector of indicators).
- \(a(\eta)\) is the log-partition function, a normalizing term whose only job is to make the probabilities sum (or integrate) to one.
- \(b(y)\) is whatever factor is left over that involves only \(y\) and not the parameter.
The form looks like bureaucracy, but the payoff is concrete: if you can massage your distribution into this shape, the natural parameter tells you what the link function should be. The game is pure pattern-matching, and every move it needs comes from this toolkit of log rules (plus one identity), all first-course material:
\[\log(ab) = \log a + \log b \qquad\quad \log(a^c) = c \log a \qquad\quad \log a - \log b = \log \frac{a}{b} \qquad\quad z = e^{\log z}\]Take the Bernoulli distribution with success probability \(\phi\). Its whole job is to say “\(y = 1\) with probability \(\phi\), \(y = 0\) with probability \(1 - \phi\)”, and the standard way to write that as a single formula is \(\phi^{\,y}(1-\phi)^{1-y}\): plug in \(y = 1\) and the second factor vanishes, leaving \(\phi\); plug in \(y = 0\) and the first vanishes, leaving \(1-\phi\). Now rewrite it step by step:
\[\begin{aligned} p(y;\, \phi) &= \phi^{\,y} (1-\phi)^{1-y} && \textcolor{#7b8794}{\small\text{one expression for both outcomes: try } y = 1 \text{ and } y = 0} \\[4pt] &= \exp\Big( \log\big( \phi^{\,y} (1-\phi)^{1-y} \big) \Big) && \textcolor{#7b8794}{\small\text{wrap in } e^{\log(\cdot)} \text{, which changes nothing}} \\[4pt] &= \exp\big( y \log \phi + (1-y) \log (1-\phi) \big) && \textcolor{#7b8794}{\small\text{log of a product adds; exponents come down as factors}} \\[4pt] &= \exp\big( y \log \phi - y \log (1-\phi) + \log (1-\phi) \big) && \textcolor{#7b8794}{\small\text{multiply out } (1-y)\log(1-\phi)} \\[4pt] &= \exp\Big( y \,\textcolor{#0f8ba0}{\log \tfrac{\phi}{1-\phi}} + \log (1-\phi) \Big) && \textcolor{#7b8794}{\small\text{factor } y \text{ out of the first two terms: } \log a - \log b = \log \tfrac{a}{b}} \\[4pt] &= b(y)\, \exp\big( \textcolor{#0f8ba0}{\eta}\, y - a(\eta) \big) && \textcolor{#7b8794}{\small\text{the target shape, with } b(y) = 1} \end{aligned}\]Now match the last two lines term by term. The factor multiplying \(y\) must be the natural parameter, so \(\textcolor{#0f8ba0}{\eta = \log\frac{\phi}{1-\phi}}\), which is exactly the log-odds of success from the aside above. The leftover term must be \(-a(\eta)\), so \(a(\eta) = -\log(1-\phi)\) (which the inversion below turns into \(\log(1 + e^{\eta})\), though we will not need it). The log-odds was not imposed on the Bernoulli; it was sitting inside the distribution’s algebra all along.
A GLM says: make that natural parameter the thing that is linear in the features, \(\eta = \beta^\top x\). To get back from \(\eta\) to the probability the model predicts, invert the relationship:
\[\begin{aligned} \textcolor{#0f8ba0}{\eta} &= \log \frac{\phi}{1-\phi} && \textcolor{#7b8794}{\small\text{the natural parameter is the log-odds}} \\[4pt] e^{\eta} &= \frac{\phi}{1-\phi} && \textcolor{#7b8794}{\small\text{exponentiate both sides}} \\[4pt] e^{\eta} (1 - \phi) &= \phi && \textcolor{#7b8794}{\small\text{multiply through by } 1-\phi} \\[4pt] e^{\eta} &= \phi\, (1 + e^{\eta}) && \textcolor{#7b8794}{\small\text{move } e^{\eta}\phi \text{ across, factor } \phi \text{ out}} \\[4pt] \phi &= \frac{e^{\eta}}{1 + e^{\eta}} && \textcolor{#7b8794}{\small\text{divide both sides by } 1 + e^{\eta}} \\[4pt] \textcolor{#b93d20}{\phi} &= \frac{1}{1 + e^{-\eta}} && \textcolor{#7b8794}{\small\text{divide top and bottom by } e^{\eta} \text{: the sigmoid, forced on us}} \end{aligned}\]That final line is logistic regression’s S-curve. It was not selected from a menu of squashing functions because it is smooth and pretty; it is the unique inverse of the Bernoulli’s own natural parameter. The same exercise on the Gaussian is shorter. Fixing \(\sigma^2 = 1\) (its value does not affect where the maximum of the likelihood sits):
\[\begin{aligned} p(y;\, \mu) &= \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{(y - \mu)^2}{2} \Big) && \textcolor{#7b8794}{\small\text{the Gaussian density with } \sigma = 1} \\[4pt] &= \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{y^2}{2} + \mu y - \frac{\mu^2}{2} \Big) && \textcolor{#7b8794}{\small\text{expand } (y - \mu)^2 = y^2 - 2\mu y + \mu^2 \text{, distribute the } -\tfrac{1}{2}} \\[4pt] &= \underbrace{\frac{1}{\sqrt{2\pi}}\, e^{-y^2 / 2}}_{b(y)}\; \exp\big( \textcolor{#0f8ba0}{\mu}\, y - \tfrac{\mu^2}{2} \big) && \textcolor{#7b8794}{\small\text{split off the factor that ignores } \mu \text{: } e^{a+b} = e^a e^b} \end{aligned}\]Here the natural parameter is \(\mu\) itself: \(\eta = \mu\), and inverting that gives the identity function. Linear regression’s “no transformation at all” is not a special privilege; it is the same recipe, where the algebra happens to yield the simplest possible link. Run the exercise a third time on a multinomial outcome and the softmax function falls out, which is why the output layer of every classification network looks the way it does. One derivation, three model families.
Terminology worth pinning down. The link \(g\) maps the mean to the linear predictor, \(g(\mu) = \eta\); its inverse \(g^{-1}\) (sometimes called the response function) maps back, \(\mu = g^{-1}(\eta)\). When the link is the one the exponential family hands you, as the logit is for the Bernoulli, it is called the canonical link, and it comes with practical perks: the fitting problem stays well behaved and the math below simplifies beautifully.
One model, two spaces
The recipe means every GLM leads a double life, and knowing which life you are looking at is most of the practical skill of reading one.
In the link space, the model is linear, full stop. For logistic regression this is log-odds space: each unit of a feature adds its coefficient \(\beta_j\) to the log-odds, no matter what the other features are doing. This is the space the model learns in, the space where coefficients have their clean additive meaning, and the space where the phrase “linear model” remains true.
In the outcome space, after applying \(g^{-1}\), the model is a curve, and effects are no longer additive. Moving a feature by one unit changes the predicted probability by an amount that depends on where you already are: near \(p = 0.5\) the sigmoid is steep and a unit of \(x\) moves the probability a lot; near 0.05 or 0.95 the curve has flattened and the same unit barely registers. This is the space you report in, because stakeholders want probabilities, not log-odds.
Both readings are correct, and each answers a different question. “What did the model learn about this feature?” is a link-space question with a one-number answer. “What happens to this customer’s churn probability if we cut their wait by five minutes?” is an outcome-space question whose answer depends on the customer. A surprising amount of confusion about logistic regression dissolves once these two questions stop being conflated.
What the model learns, and what “fitting” means
The probabilistic view also settles how the weights should be chosen. Since the model assigns a probability (a density, for continuous outcomes) to everything it could have seen, you can ask of any candidate \(\beta\): how probable was the data we actually saw? Choosing the \(\beta\) that maximizes that probability is maximum likelihood estimation, and it needs one setup step. Treating the observations as independent, the probability of seeing the whole dataset is the product of the individual probabilities, and it pays to work with its logarithm:
\[L(\beta) = \prod_{i=1}^{n} p(y_i \mid x_i;\, \beta), \qquad \ell(\beta) = \log L(\beta)\]Taking the log is a free move for two reasons: the log is an increasing function, so whatever \(\beta\) maximizes \(\ell\) also maximizes \(L\), and it converts an unwieldy product of \(n\) terms into a sum, which is far easier to differentiate. Both famous special cases now fall out. Take the Gaussian first:
\[\begin{aligned} \ell(\beta) &= \sum_{i=1}^{n} \log p(y_i \mid x_i;\, \beta) && \textcolor{#7b8794}{\small\text{the log of the product is the sum of the logs}} \\[4pt] &= \sum_{i=1}^{n} \log \Bigg( \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(y_i - \beta^\top x_i)^2}{2\sigma^2} \Big) \Bigg) && \textcolor{#7b8794}{\small\text{substitute the Gaussian density with mean } \beta^\top x_i} \\[4pt] &= \sum_{i=1}^{n} \Bigg( \log \frac{1}{\sqrt{2\pi}\,\sigma} \; - \; \frac{(y_i - \beta^\top x_i)^2}{2\sigma^2} \Bigg) && \textcolor{#7b8794}{\small\text{log of a product adds; } \log e^{z} = z \text{ unwraps the exponential}} \\[4pt] &= \; n \log \frac{1}{\sqrt{2\pi}\,\sigma} \; - \; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - \beta^\top x_i \big)^2 && \textcolor{#7b8794}{\small\text{the first term appears } n \text{ times and does not involve } \beta} \end{aligned}\]Maximizing \(\ell\) over \(\beta\) therefore means minimizing \(\sum_i (y_i - \beta^\top x_i)^2\). Least squares is not a separate idea: it is what maximum likelihood becomes when the noise is Gaussian. The Bernoulli case runs the same way. With \(p_i = \sigma(\beta^\top x_i)\) as the predicted probability:
\[\begin{aligned} \ell(\beta) &= \sum_{i=1}^{n} \log \Big( p_i^{\,y_i} (1 - p_i)^{1 - y_i} \Big) && \textcolor{#7b8794}{\small\text{each observation contributes its Bernoulli probability}} \\[4pt] &= \sum_{i=1}^{n} \Big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big) && \textcolor{#7b8794}{\small\text{binary cross-entropy, with its sign flipped}} \end{aligned}\]The loss deep learning calls binary cross-entropy or log loss is, again, just maximum likelihood under the GLM’s distributional choice. And here is the unification payoff, the result that makes the whole exponential-family detour feel earned: differentiate either log-likelihood with respect to one weight \(\beta_j\) and watch what survives. The computation needs two calculus facts. The first is that the derivative of \(\log z\) is \(1/z\). The second is the sigmoid’s famously tidy derivative, worth deriving once because it does the heavy lifting:
\[\begin{aligned} \frac{d\sigma}{d\eta} &= \frac{d}{d\eta} \big( 1 + e^{-\eta} \big)^{-1} && \textcolor{#7b8794}{\small\text{the sigmoid, written as a power}} \\[4pt] &= \frac{e^{-\eta}}{\big( 1 + e^{-\eta} \big)^2} && \textcolor{#7b8794}{\small\text{chain rule: } \tfrac{d}{dz} z^{-1} = -z^{-2} \text{, times the inner derivative } -e^{-\eta}} \\[4pt] &= \frac{1}{1 + e^{-\eta}} \cdot \frac{e^{-\eta}}{1 + e^{-\eta}} && \textcolor{#7b8794}{\small\text{split the fraction into two factors}} \\[4pt] &= \sigma(\eta) \big( 1 - \sigma(\eta) \big) && \textcolor{#7b8794}{\small\text{the first factor is } \sigma \text{; check the second by computing } 1 - \sigma} \end{aligned}\]In model terms: \(\frac{\partial p}{\partial \eta} = p(1-p)\), the derivative of the predicted probability with respect to the linear predictor is the Bernoulli variance. Armed with both facts, differentiate the logistic log-likelihood, remembering that \(p_i\) depends on \(\beta_j\) only through \(\eta_i = \beta^\top x_i\), whose derivative with respect to \(\beta_j\) is just the feature \(x_{ij}\):
\[\begin{aligned} \frac{\partial \ell}{\partial \beta_j} &= \sum_{i=1}^{n} \Big( \frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i} \Big)\, \frac{\partial p_i}{\partial \beta_j} && \textcolor{#7b8794}{\small\text{differentiate each log term: } \tfrac{d}{dz}\log z = \tfrac{1}{z} \text{, chain rule}} \\[4pt] &= \sum_{i=1}^{n} \frac{y_i - p_i}{p_i (1 - p_i)}\, \frac{\partial p_i}{\partial \beta_j} && \textcolor{#7b8794}{\small\text{common denominator: } y_i (1 - p_i) - (1 - y_i)\, p_i = y_i - p_i} \\[4pt] &= \sum_{i=1}^{n} \frac{y_i - p_i}{p_i (1 - p_i)}\; p_i (1 - p_i)\, x_{ij} && \textcolor{#7b8794}{\small\text{chain rule through the sigmoid: } \tfrac{\partial p_i}{\partial \beta_j} = p_i (1 - p_i)\, x_{ij}} \\[4pt] &= \sum_{i=1}^{n} \big( y_i - \textcolor{#b93d20}{p_i} \big)\, x_{ij} && \textcolor{#7b8794}{\small\text{the variance factors cancel exactly}} \end{aligned}\]Run the same derivative for linear regression (differentiate \(-\tfrac{1}{2}(y_i - \beta^\top x_i)^2\) with the chain rule) and you get \(\sum_i (y_i - \textcolor{#b93d20}{\mu_i})\, x_{ij}\). The gradient is identical in form for every GLM with its canonical link: prediction error times feature value, summed over the data. The models differ only in how the prediction \(\mu_i\) is produced from \(\eta_i\). This is why one fitting algorithm (iteratively reweighted least squares, a Newton-style method that repeatedly solves a weighted linear regression) covers the entire family, and it is the precise sense in which Nelder and Wedderburn’s unification was a discovery rather than a definition.
A practical warning hiding in this section. A tempting shortcut for binary data is to fit ordinary least squares and then squash the fitted line through a sigmoid afterward. The curve even looks plausible. But the two-step version optimizes the wrong objective (squared error in the wrong space) and its decision boundary can be visibly worse than the maximum-likelihood fit, a comparison Albert Rapp’s GLM guide demonstrates directly. The link function is half the story; fitting by likelihood is the other half.
Reading the coefficients: features as evidence
Now the applied payoff. You fit a GLM not only to predict but to learn which features matter and how. The rules for reading coefficients follow from everything above, and they differ by space.
For linear regression, the coefficient reading is the familiar one: holding the other features fixed, one unit of \(x_j\) shifts the expected outcome by \(\beta_j\), in the outcome’s own units. The only common trap is comparing coefficient magnitudes across features with different scales; a coefficient of 40 on “years of tenure” and 0.4 on “monthly price” says nothing about relative importance until the features are put on a common scale, for instance by standardizing them.
For logistic regression, the coefficient lives in log-odds space, and translating it for humans takes one step: exponentiate. Since one unit of \(x_j\) adds \(\beta_j\) to the log-odds, it multiplies the odds by \(e^{\beta_j}\), a quantity called the odds ratio. A coefficient of \(0.61\) means each unit multiplies the odds of the outcome by \(e^{0.61} \approx 1.84\), an 84% increase in odds; a coefficient of \(-0.75\) means each unit multiplies the odds by \(e^{-0.75} \approx 0.47\), cutting them roughly in half. For a fast probability-scale intuition there is also Gelman and Hill’s divide-by-4 rule: \(\beta_j / 4\) is the maximum change in probability per unit of \(x_j\). It is not a separate fact but a reuse of the sigmoid derivative from the previous section: the slope of the probability in \(x_j\) is \(p(1-p)\,\beta_j\), and \(p(1-p)\) peaks at \(0.25\) when \(p = 0.5\), where the curve is steepest. A coefficient of 0.61 therefore moves the probability by at most about 15 percentage points per unit.
Here is what this looks like on a concrete (simulated) churn problem, where the true drivers are known because we built them in:
import numpy as np
import pandas as pd
import statsmodels.api as sm
rng = np.random.default_rng(23)
n = 4000
features = pd.DataFrame({
"monthly_price": rng.normal(size=n), # standardized: units are 1 sd
"tenure": rng.normal(size=n), # standardized
"support_tickets": rng.normal(size=n), # standardized
"on_autopay": rng.binomial(1, 0.55, n),
"weekday_signup": rng.binomial(1, 0.70, n), # deliberately no true effect
})
log_odds = (-1.1 + 0.52 * features["monthly_price"] - 0.68 * features["tenure"]
+ 0.38 * features["support_tickets"] - 0.45 * features["on_autopay"])
churned = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
X = sm.add_constant(features)
model = sm.GLM(churned, X, family=sm.families.Binomial()).fit()
report = pd.DataFrame({
"coef (log-odds)": model.params,
"odds ratio": np.exp(model.params),
"or 2.5%": np.exp(model.conf_int()[0]),
"or 97.5%": np.exp(model.conf_int()[1]),
}).round(2)
print(report)
The fitted odds ratios, with their confidence intervals, are exactly the kind of evidence the applied question needs, and a forest plot is the standard way to read them at a glance:
The plot demonstrates the healthy way to ask “is this feature important?”: look at the effect size and its uncertainty together. Each point is a fitted odds ratio, its distance from the dashed line at 1.0 is the size of the effect (multiplying the odds by 1 changes nothing), and the whisker through it is a 95% confidence interval, the range of effect sizes the data is still consistent with once its noise is accounted for. Tenure’s odds ratio of 0.49 with a tight interval is strong evidence: the effect is large and the data pins it down. Weekday signup’s 0.89 with an interval still touching 1 is exactly what a useless feature looks like: the data cannot rule out “no effect at all”, and the model recovered that from data alone.
Four caveats keep this honest, and they apply to any coefficient-based importance reading:
- Coefficients are conditional, not marginal. \(\beta_j\) is the effect of \(x_j\) with every other feature held fixed, which can differ from, and even flip the sign of, the raw correlation between \(x_j\) and the outcome. Scikit-learn’s coefficient-interpretation guide shows a wage model where age correlates positively with wage but takes a negative coefficient once experience is in the model.
- Correlated features split the credit unstably. When two features carry overlapping information, the model can assign the shared signal to either, and small data changes swing the split. Check variance inflation factors, or refit on resamples and watch how much the coefficients move.
- Significance is not importance. With enough rows, a practically negligible effect earns a tiny p-value. The odds ratio (how big) and the interval (how certain) answer different questions, and both matter more than the p-value’s binary verdict.
- None of this is causal. A coefficient summarizes association under the model. Reading it as “if we change this feature, the outcome will respond” requires causal assumptions no fitting procedure can supply.
The assumptions, and how to actually check them
Every reading above leans on the model being roughly right, so the recipe’s three choices come with checkable assumptions. For linear regression these are exactly the four commitments we attached to the error term \(\varepsilon\) back in the opening section, packaged as the LINE mnemonic: Linearity of the mean in the features (the errors really do average to zero at every \(x\), so the line is not missing structure), Independence of the errors, Normality of the errors, and Equal variance (the spread around the line does not change with the prediction level).
Checking them takes two definitions, plus a notation habit worth making explicit: a hat on a symbol marks an estimate computed from data, as opposed to the true unknown quantity. So \(\hat{\beta}\) is the trained weights, and \(\hat{\mu}_i = \hat{\beta}^\top x_i\) is the fitted value for observation \(i\): run training row \(i\) back through the trained model and record what it says the average outcome should have been for a case like that. The residual is then everything the model could not explain about that row:
\[e_i = y_i - \hat{\mu}_i \qquad \textcolor{#7b8794}{\small\text{what actually happened, minus what the model expected}}\]Residuals matter because they are the observable stand-in for the error term. You can never see \(\varepsilon_i\) itself (that would require knowing the true \(\beta\)), but if the model is roughly right, the leftovers that training could not remove should behave like the \(\varepsilon\) we assumed: centered on zero, same spread everywhere, no structure, roughly Gaussian. That is the entire logic of residual diagnostics: we assumed things about \(\varepsilon\), we cannot check \(\varepsilon\), so we check its shadow. The remarkable practical fact is that two plots of the residuals catch most violations:
A residuals-versus-fitted plot puts each row’s prediction on one axis and its residual on the other, so it asks directly: does how wrong the model is depend on what it predicts? It should not, and a healthy plot looks like static: an even, patternless band around zero at every prediction level. Curvature means the linear form is missing something (try a transformed feature or an interaction); a funnel means variance grows with the mean, which biases the standard errors your confidence intervals are built from (consider modeling the variance, transforming the target, or robust standard errors). The Q-Q plot asks the remaining question, whether the errors are Gaussian-shaped, by sorting the residuals and plotting them against the values a perfect Gaussian sample of the same size would show at each rank: matching shapes sit on the line. Heavy tails peeling off it at the ends mean outliers are more common than the Gaussian admits, and since least squares punishes errors quadratically, those few points can drag the whole fit toward themselves. Producing both takes four lines:
import matplotlib.pyplot as plt
# y: a continuous target, X: sm.add_constant(features), as before
ols = sm.OLS(y, X).fit()
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(ols.fittedvalues, ols.resid, s=12) # want: an even band, no shape
sm.qqplot(ols.resid, line="45", fit=True, ax=axes[1]) # want: points on the line
One calibrating note from Gelman, Hill and Vehtari’s Regression and Other Stories: ranked by how much they damage your conclusions, the assumptions run validity and representativeness of the data first, linearity next, independence and equal variance after that, and normality last, despite normality being the one people most obsessively test. A mild Q-Q wobble mostly affects prediction intervals; a missed nonlinearity poisons every coefficient.
For logistic regression the checklist shifts, because there is no Gaussian to check and raw residuals from a 0/1 outcome are inherently striped. What to verify instead:
- Linearity in the log-odds, the assumption doing the heaviest lifting: each continuous feature should relate linearly to the logit, not to the probability. Plot binned empirical log-odds against each feature, or add a spline for the feature and see whether the model prefers it.
- Independence, as before: repeated measurements of the same customer, or clustered data, violate it and shrink the reported uncertainty dishonestly (mixed models and GEEs are the escape hatches).
- No separation. If some feature or combination perfectly splits the classes, the maximum-likelihood estimate does not exist: the optimizer pushes coefficients toward infinity, and the telltale symptoms are absurd coefficient magnitudes with enormous standard errors, R’s “fitted probabilities numerically 0 or 1” warning, or a perfect-separation error from statsmodels. Small datasets with many binary features are the usual culprits. Regularization or Firth’s penalized likelihood restores finite, usable estimates.
- Influential points. Cook’s distance flags observations that single-handedly move the fit; a handful of extreme rows deserving a second look is a data-quality finding, not a modeling nuisance. Checking how your feature distributions behave, as in the feature distribution analysis post, pairs naturally with this.
Where the recipe runs out
GLMs earn their keep through transparency, but the same three-choice structure that makes them readable hard-codes their limitations. Knowing them tells you when to reach for something else, and often the something else is a targeted patch rather than a different model family.
- The linear predictor is still linear. A GLM bends the output through the link, but effects in \(\eta\) remain straight lines. If churn risk genuinely U-curves in customer age, the model cannot say so unless you hand it polynomial or spline features; generalized additive models automate exactly this patch.
- Interactions must be hand-built. “Price sensitivity is different for long-tenure customers” is invisible to a main-effects GLM until you add the product term yourself. Tree ensembles like the ones in the XGBoost tuning post discover interactions automatically, which is a fair summary of when they outpredict GLMs, and of what they charge you in interpretability for it.
- Maximum likelihood is not robust. Like least squares, a GLM fit will chase a few wild rows; the outlier diagnostics above are load-bearing, not optional.
- The mean-variance relationship is fixed by the family. Poisson regression asserts the variance equals the mean; real count data is routinely more dispersed, which leaves standard errors overconfident even when predictions look fine. Quasi-likelihood or a negative binomial family are the standard fixes. This failure mode is easy to miss because it corrupts the uncertainty, not the point estimates.
- Coefficient readings degrade with feature engineering. The more transformed, interacted, and encoded the features become, the further “one unit of \(x_j\)” drifts from anything a stakeholder recognizes. The model stays fittable; the clean story is what erodes.
None of these are exotic. They are the ordinary price of a model simple enough to read, and the diagnostics in the previous section exist precisely because the price is payable.
Closing thoughts
Three ideas are worth carrying out of this post. First, models predict distributions’ means, not numbers: the probabilistic view is the one that survives contact with new outcome types, and the formula view is its shadow in the simplest case. Second, the two spaces resolve most interpretation confusion: learning happens where the model is linear, reporting happens where the outcome lives, and coefficients only carry their clean meaning in the first space. Third, the unification is practical, not aesthetic: one derivation explains why the sigmoid exists, why least squares and cross-entropy are the same idea wearing different distributions, and why one gradient, error times feature, fits the whole family. Linear and logistic regression stop being two things to memorize and become one thing understood twice.
Sources and further reading
- The Ultimate Beginner’s Guide to Generalized Linear Models (Albert Rapp) — a concrete-first walkthrough in R; the source of the OLS-then-squash versus maximum-likelihood comparison echoed above.
- Explaining generalized linear models (GLMs) (Very Normal, 2024) — a friendly 12-minute video treatment of the three-component recipe, pitched at research practitioners.
- Stanford CS229 lecture notes, Part I (Ng and Ma) — the exponential-family and GLM-construction sections this post’s derivations follow; an older edition of the notes also works the multinomial case through to softmax regression.
- Generalized Linear Models (Nelder and Wedderburn, 1972) — the original unification paper, which established iteratively reweighted least squares as the family’s shared fitting algorithm.
- Penn State STAT 504, Lesson 6 — clean textbook statements of the three components and worked odds-ratio examples.
- FAQ: How do I interpret odds ratios in logistic regression? (UCLA OARC) — the probability-to-odds-to-log-odds ladder with worked numbers.
- Regression and Other Stories (Gelman, Hill and Vehtari) — the assumptions-ranked-by-importance framing and the divide-by-4 rule.
- Common pitfalls in the interpretation of coefficients of linear models (scikit-learn) — the best single worked example of scale, conditionality, and collinearity traps in coefficient reading.
- statsmodels GLM documentation — the reference for the families, links, and IRLS fitting used in the code snippets.
- McCullagh and Nelder, Generalized Linear Models (2nd ed., 1989) — the definitive monograph, for the full theory including quasi-likelihood and deviance-based model checking.
