The Kalman filter from first principles

2026-06-28 · 7 min · state-estimation · kalman-filter · slam · robotics · explainer

I spend a lot of time on odometry and SLAM, and underneath almost all of it is the same idea: you have a noisy sensor, and a model of how the thing you're tracking moves, and you want to fuse them into a single best estimate that also tells you how much to trust itself. That's the Kalman filter. It's the optimal recursive estimator for a linear system with Gaussian noise, and once it clicks, you see it everywhere — GPS, IMUs, radar tracking, and the LiDAR-inertial filters I'll build on in the next article.

Let me derive it the way it actually makes sense: as fusing two Gaussians.

Two sources of information

You're tracking a state $\mathbf{x}$ — say the position and velocity of something. You have two things:

A process model: how the state evolves on its own. For constant velocity, $p_{t} = p_{t-1} + v\,\Delta t$ . You can predict forward, but the prediction drifts — it accumulates uncertainty.
A measurement model: a sensor that observes some function of the state, noisily. A position sensor gives you $z = p + \text{noise}$ .

Neither is enough alone. The prediction drifts; the sensor is noisy and jittery. The Kalman filter combines them, and the key fact is that combining two Gaussian estimates yields a third Gaussian that is sharper than either input.

prediction × measurement = posterior

predictionmeasurementposterior

prediction spread σ_pred10

measurement spread σ_z7

Kalman gain K

0.67

posterior mean

55.4

posterior σ

5.7 (< both)

The posterior mean is μ_pred + K(z − μ_pred): the gain K = σ²_pred / (σ²_pred + σ²_z) is the fraction of the way you move from the prediction toward the measurement. Shrink the sensor noise and K → 1 (snap to the measurement); shrink the prediction noise and K → 0 (keep the prediction). Either way the posterior is narrower than both inputs — combining two noisy estimates always tells you more than either alone. That’s the whole filter.

That's the whole update step. If the prediction is $\mathcal{N}(\mu_1, \sigma_1^2)$ and the measurement is $\mathcal{N}(\mu_2, \sigma_2^2)$ , their normalized product has

\mu = \mu_1 + K(\mu_2 - \mu_1), \qquad \sigma^2 = (1-K)\,\sigma_1^2, \qquad K = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}.

$K$ is the Kalman gain — the fraction of the way you move from the prediction toward the measurement, set by which one is more certain. Trust the sensor ( $\sigma_2 \to 0$ ) and $K \to 1$ ; trust the prediction ( $\sigma_1 \to 0$ ) and $K \to 0$ . Everything else is this, generalized to vectors.

The predict-update cycle

In general the state is a vector $\mathbf{x}$ with covariance $\mathbf{P}$ , and the models are linear maps with Gaussian noise:

\mathbf{x}_t = \mathbf{F}\mathbf{x}_{t-1} + \mathbf{B}\mathbf{u}_t + \mathbf{w}, \quad \mathbf{w}\sim\mathcal{N}(0,\mathbf{Q}); \qquad \mathbf{z}_t = \mathbf{H}\mathbf{x}_t + \mathbf{v}, \quad \mathbf{v}\sim\mathcal{N}(0,\mathbf{R}).

$\mathbf{F}$ is the state-transition, $\mathbf{H}$ maps state to measurement, $\mathbf{Q}$ is process noise (how much the model drifts), $\mathbf{R}$ is measurement noise. The filter alternates two steps forever:

Predict — push the state and its uncertainty forward through the model:

\hat{\mathbf{x}} = \mathbf{F}\mathbf{x} + \mathbf{B}\mathbf{u}, \qquad \hat{\mathbf{P}} = \mathbf{F}\mathbf{P}\mathbf{F}^{\!\top} + \mathbf{Q}.

Update — correct with the measurement, by the matrix Kalman gain:

\mathbf{y} = \mathbf{z} - \mathbf{H}\hat{\mathbf{x}} \quad(\text{innovation}), \qquad \mathbf{S} = \mathbf{H}\hat{\mathbf{P}}\mathbf{H}^{\!\top} + \mathbf{R} \quad(\text{innovation covariance}),

\mathbf{K} = \hat{\mathbf{P}}\mathbf{H}^{\!\top}\mathbf{S}^{-1}, \qquad \mathbf{x} = \hat{\mathbf{x}} + \mathbf{K}\mathbf{y}, \qquad \mathbf{P} = (\mathbf{I} - \mathbf{K}\mathbf{H})\hat{\mathbf{P}}.

The Kalman loop: predict pushes the estimate and its covariance forward through the motion model (uncertainty grows by Q); update pulls it back toward the measurement by the gain K (uncertainty shrinks). The two steps alternate for every timestep.

Watch it run on a constant-velocity tracker. The state is $[\text{position}, \text{velocity}]$ , the sensor sees only position, and the filter has to infer velocity and smooth the noise. Drag $R$ and $Q$ to feel the trust trade-off the gain encodes:

constant-velocity Kalman filter · live

truthmeasurementsestimate ±1σ

measurement noise R (distrust sensor)49

process noise Q (let state drift)0.50

Raise R and the filter trusts its prediction over the sensor — the estimate smooths out but lags. Raise Qand it trusts fresh measurements more — it snaps to the data but gets jittery. The Kalman gain is exactly this ratio, computed optimally each step from the covariances, and the band is the filter’s own honest estimate of how unsure it is.

In code

The whole thing is a few lines of linear algebra. Here's the constant-velocity tracker above, in NumPy — no framework, runnable as-is:

import numpy as np
 
dt = 1.0
F = np.array([[1, dt],      # constant-velocity transition
              [0, 1]])
H = np.array([[1.0, 0.0]])  # observe position only
Q = 0.5 * np.array([[dt**3/3, dt**2/2],   # process noise (drift)
                    [dt**2/2, dt]])
R = np.array([[49.0]])      # measurement noise (sensor variance)
 
x = np.array([[0.0], [0.0]])      # initial state
P = np.eye(2) * 50.0              # initial uncertainty
 
def step(x, P, z):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    y = z - H @ x                 # innovation
    S = H @ P @ H.T + R           # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P
 
for z in measurements:            # stream of noisy position readings
    x, P = step(x, P, np.array([[z]]))
    # x[0] is the smoothed position estimate, P[0,0] its variance

Two knobs do all the tuning. R says how noisy the sensor is; Q says how much you let the model drift. Get their ratio right and the filter is optimal; get it wrong and it either lags reality (Q too small) or chases noise (R too small). In practice you start from the sensor's datasheet for R and tune Q until the innovation $\mathbf{y}$ looks like white noise.

When the world isn't linear

The plain Kalman filter assumes $\mathbf{F}$ and $\mathbf{H}$ are linear. Robotics is full of rotations and projections that aren't. Three extensions matter, and the last one is the bridge to LiDAR-inertial SLAM:

Extended KF (EKF). The dynamics $f$ and measurement $h$ are nonlinear, so you linearize them at the current estimate: use the Jacobians $\mathbf{F} = \left.\frac{\partial f}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}}$ and $\mathbf{H} = \left.\frac{\partial h}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}}$ in place of the matrices, run the same equations. Cheap, and it works when the nonlinearity is mild over one step.
Iterated EKF (iEKF). One linearization point can be bad if the prior is far from the truth. So relinearize: after the update, recompute $\mathbf{H}$ at the new estimate and redo the update, iterating until it converges. Each iteration is a Gauss-Newton step on the maximum-a-posteriori objective. This is what FAST-LIO uses — it matters because the point-to-plane LiDAR residual is very nonlinear in the pose.
Error-state / on-manifold KF. You can't add a vector to a rotation $\mathbf{R}\in SO(3)$ and stay on the manifold. So you track the state on the manifold but the error (and its covariance) in the tangent space, fusing with the $\boxplus/\boxminus$ operators instead of $+/-$ . This keeps rotations valid and the covariance minimal (3 numbers for orientation, not 9).

That last combination — an iterated, error-state Kalman filter on a manifold — is exactly the engine inside FAST-LIO2, fusing a high-rate IMU prediction with thousands of raw LiDAR points per scan. The Kalman gain there gets one more clever rewrite to handle those thousands of measurements cheaply, which is where I'll pick up next.

The one-paragraph summary

A Kalman filter holds a Gaussian belief over a state. Predict moves the belief through a motion model and inflates its uncertainty; update multiplies it by the measurement's Gaussian, which sharpens it and pulls the mean toward the data by the gain $\mathbf{K}$ — optimally weighted by which source is more certain. Linearize for nonlinear systems (EKF), relinearize for hard ones (iEKF), and track the error in the tangent space for rotations. That's the whole toolkit, and it runs real robots.