How Kalman Filters Work, Part 2

Getting Serious

It's time to talk about improving a filter's speed, accuracy, and reliability. In this section, we're going to focus on several techniques that turn a basic filter into a filter fit for implementation. We'll go over the reasons one might use them and simple descriptions of how they work. The goal is to provide a foundation on what things are commonly useful so that a reader will know what to seek out.

Fidelity

When implementing the math for a Kalman filter as code, there's one particularly bad thing that sometimes happens: the covariance matrix, due to the accumulation of roundoff errors over time, slowly loses its symmetry and may even lose its positive-semidefiniteness. One could just say, “This means that the covariance matrix is invalid and all things computed from it are garbage,” and then move on. On the other hand, it's actually easy to see why, so let's visit the math for a moment. If a matrix, $P$, is positive-semidefinite, then $a^T P a \geq 0$ for any non-trivial vector $a$. Simple enough. So let's fill in what that would mean for a covariance matrix in particular, where $P = E\left( (x - \bar{x}) (x - \bar{x})^T \right)$:

$$ \begin{align} a^T P a & = a^T E\left( (x - \bar{x}) (x - \bar{x})^T \right) a \\ & = E\left( a^T (x - \bar{x}) (x - \bar{x})^T a \right) \\ \end{align} $$

Of course, both $a$ and $x$ are vectors, so $b = a^T (x - \bar{x})$ is a scalar. That means we have $E(b^2)$. Clearly $b^2$ can't be negative for any real $b$! So, covariance matrices must be positive-semidefinite (the “semi-” means it's possible for $a^T P a$ to be 0; for positive-definite, $a^T P a \gt 0$).

If the covariance matrix becomes non-positive-semidefinite (indefinite), it's invalid and all things computed from it are garbage. With straight-forward implementations of the Kalman filter, roundoff causes this to happen, so how do we prevent it from becoming a problem?

Estimating the Right Things

First, we need to make sure we're estimating the right things. Suppose we were trying to estimate the three interior angles of a triangle, so our state vector, $x$, consisted of these three angles. Of course the sum of the angle is $\pi$. We could state this as $\pi = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} x$. There's no uncertainty here; the values must sum to exactly $\pi$, which means the variance of $\begin{bmatrix} 1 & 1 & 1 \end{bmatrix} x$ is 0. Let's write that out:

$$ \begin{align} 0 & = E\left( \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} (x - \bar{x}) (x - \bar{x})^T \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}^T \right) \\ & = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} E\left( (x - \bar{x}) (x - \bar{x})^T \right) \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}^T \\ \end{align} $$

If $P$ is the covariance of $x$ and we let $a = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}^T$, then we can substitute these into the above:

$$ 0 = a^T P a $$

And now we see the problem: the covariance matrix is dangerously close to being indefinite (0 is very close to being < 0). This means we'll probably have problems running the filter.

Fortunately, the fix is a win-win. We should only estimate two of the angles, which means the covariance matrix is 2-by-2 and will be positive-definite! This will be just as accurate as before, mathematically, but it will be a smaller and better structured calculation for the computer. In other words, the filter would be faster and more stable. The most important part of any algorithm is setting up the problem correctly before one ever starts to code, and this is the right way to set up the problem here.

Further, it's helpful to pick states that have roughly similar scales (a few orders of magnitude is fine). For instance, if your state vector includes the distance to some other planet as well as the aberation of a lens on the telescope, you should probably not use meters to represent both. That would result in numbers with wildly different scales being multiplied and added together in the filter, which would cause severe roundoff error. For instance, when using doubles, (1,000,000,000,000 + 0.000,001) - 1,000,000,000,000 == 0. In filters, using mixed units can be a good idea; just be careful with them!

Joseph Form

Still, some problems will have covariance matrices in which some values get really small relative to other values. This is dangerous ground for a Kalman filter, especially during the correction of the covariance:

P = P - K * H * P;

That subtraction is potentially deadly; roundoff could cause one of the diagonals to be < 0. Fortunately, there's an easy way to deal with this. When we derived the above equation in part 1, we at one point had an equation like this:

$$ P^+_k = P^-_k - P_{xy} K^T - K P_{xy}^T + K P_{yy} K^T$$

We then substituted in the meaning of $K$, and a lot of things cancelled, but we won't cancel just yet. Instead, let's substitute $P_{xy} = P^-_k H^T$ and $P_{yy} = H P^-_k H^T + R$ into the above equation. With some rearranging, we'd obtain:

$$ P^+_k = \left( I - K H \right) P^-_k \left( I - K H \right)^T + K R K^T $$

where $I$ is an identity matrix. This is called the Joseph form update, and it's best implemented as:

A = eye(n) - K * H;
P = A * P * A.' + K * R * K.';

where .' means “transpose” and eye(n) produces an n-by-n identity matrix.

Clearly, this requires a little more matrix math, but the result is that two necessarily positive-semidefinite things add together; it can't accidentally make P become indefinite. Using Joseph form is an easy way to protect a filter against problems with roundoff error. In fact, the example EKF and KF code use Joseph form.

As a bonus, this form didn't ever say what $K$ was; $K$ could be any matrix. As long as $x^+_k = x^-_k + K \Delta z$, then the covariance of $x^+_k$, $P^+_k$, is updated from $P^-_k$ using the Joseph form. This will be a useful property later.

Square-Root Filtering

On Apollo, they had really small word lengths — bits given to the representation of a floating point number — in their onboard computer. As a result, they were very susceptible to roundoff error in the filter, which they found through simulations. They tried many hacks to get around the problem, and older books still discuss these hacks. However, the hacks present their own problems (no surprise). The right solution was found by a guy named James E. Potter. In a computer, the number of bits determines how large a number can be represented. Every additional bit doubles the size of number that can be represented (this is only literally true for integers, but it's similar for floating-point types). However, imagine if, instead of storing the concept of $x$, one stored the concept of $\sqrt{x}$. The square root requires a smaller range; so with the same word size in the computer, we can represent a much larger $x$. Of course, we can't add $\sqrt{x}$ with $\sqrt{y}$ using the same arithmetic, so storing the square-root form would require more work. However, Potter, taking some work home with him for a weekend, found an elegant way to do this for Kalman filters.

In a “square-root” filter, we don't store the covariance matrix at all, but instead we store the matrix square root, $C$, of the covariance matrix, $P$:

$$ P = C C^T $$

Here, $C$ is called the lower Cholesky factor for the covariance matrix. There are other definitions of “matrix square root”, but this one is standard for filters.

Recalling that all operations on the covariance matrix are symmetric, one begins to see how using “one side” of those operations could make the algorithm could work to update $C$ over time, without ever explicitly needing $P$.

In fact, this wasn't the end. The literal matrix square-root is only one type of decomposition; there are other types of decomposition that are useful. A common one is $U D U^T$ decomposition, where $P = U D U^T$, with $U$ being an upper-unitriangular matrix (zeros below the diagonal, ones on the diagonal, other values above the diagonal) and $D$ being a diagonal matrix (zeros everywhere but the diagonal). The $U$ and $D$ factors are then updated directly by the UD filter algorithm. This isn't literally a “square root”, but it functions similarly, and is one of the best square-root forms available.

Interestingly, square-root filters don't require much more work from the computer than a standard Kalman filter, and yet they have drastically improved numerical properties. For this reason, they're highly recommended when reliability is a requirement. Because there are many different forms, and because the derivation of a square-root filter is tedious, we won't go into the math here. A good book is the best reference. See the box to the right.

Square-root formulations are available for Kalman filters, information filters (see below), and sigma-point filters, and these all have long, well-tested histories. In fact, the Kalman filter on the Apollo Guidance Computer was a square-root form, and it's a testament to Potter's quick work that the very first Kalman filter to run in a vehicle was in fact a square-root form.

Speed

The computation of the Kalman gain requires the inversion of an $n_z$-by-$n_z$ matrix, where $n_z$ is the dimension of the measurement, and this inversion is by far the most costly part of the algorithm. However, there are good ways to address this too.

Sequential Updates

Often, the measurement represents data from a collection of different sensors. The noise from two different sensors is generally uncorrelated. The covariance matrix will therefore be block-diagonal, something like so:

$$ R = \begin{bmatrix} R_{11} & R_{12} & 0 & 0 \\ R_{21} & R_{22} & 0 & 0 \\ 0 & 0 & R_{33} & R_{34} \\ 0 & 0 & R_{43} & R_{44} \\ \end{bmatrix} = \begin{bmatrix} R_a & 0_{2x2} \\ 0_{2x2} & R_b \\ \end{bmatrix} $$

In cases like this, we can actually use multiple, sequential updates to the state and covariance matrix. The first update would be (this is just the correction portion of the EKF):

$$ \hat{z}_a = h_a(\hat{x}^-_k) $$ $$ H_a = \left. \frac{\partial}{\partial x} h_a \right\rvert_{\hat{x}^-_k} $$ $$ K_a = P^-_k H_a^T (H_a P^-_k H_a^T + R_a)^{-1} $$ $$ \hat{x}^a_k = \hat{x}^-_k + K_a \left( z_a - \hat{z}_a \right) $$ $$ P^a_k = P^-_k - K_a H_a P^-_k $$

and then the second update would be:

$$ \hat{z}_b = h_b(\hat{x}^a_k) $$ $$ H_b = \left. \frac{\partial}{\partial x} h_b \right\rvert_{\hat{x}^a_k} $$ $$ K_b = P^a_k H_b^T (H_b P^a_k H_b^T + R_b)^{-1} $$ $$ \hat{x}^+_k = \hat{x}^a_k + K_b \left( z_b - \hat{z}_b \right) $$ $$ P^+_k = P^a_k - K_b H_b P^a_k $$

That is, each uncorrelated block of the measurement covariance matrix can be used to independently update the state estimate and covariance matrix, each building on the previous correction. If the observation function were linear, the results would be identical to updating all at once with the full measurement covariance matrix. For nonlinear observation functions, the results will be very close to identical and actually might be slightly better.

Inverting multiple small matrices is definitely less work than inverting one large one, so this may help save some time. In the extreme case that every element of the measurement is uncorrelated, then the measurement covariance matrix is simply diagonal, the above corrections can be carried out in a loop, and the inversion is simply scalar division! This is much faster than a matrix inversion. Here's the sequential scalar correction process as rough MATLAB code:

for i = 1:n_z                          % For each element of z...
  
  z_hat = h(x);                        % Predict scalar measurement.
  H     = calc_H(x, i);                % Calculate ith row of Jacobian.
  P_xy  = P * H.';                     % Calculate parts of Kalman gain.
  P_yy  = H * Pxy + R(i,i);            % Note that this is a scalar!
  y     = z(i) - z_hat(i);             % Calculate y for the ith sensor.
  x     = x + P_xy * (y/P_yy);         % Correct state.
  P     = P - 1/P_yy * (P_xy * Pxy.'); % Correct covariance.
  
end

Note that this is a natural time to also consider the case that not all sensors update at the same rate. Sensors that update at different rates generally have uncorrelated noise, so they fit nicely into this process:

for i = 1:n_z % For each element of the measurement...
  
  if updated(i) % If this sensor was updated on this sample...
    ... [ as above ] ...
  end
  
end

The use of sequential scalar updates can make a Kalman filter extremely fast. For realistic applications, a reduction to a small percentage of the runtime of the “full matrix” update is common. Further, it requires less RAM, since the various intermediate values are scalar and column vectors.

Finally, when the measurement covariance matrix is constant but not diagonal, we can create a pseudo-measurement whose measurement covariance matrix will be diagonal. Let's call the pseudo-measurement $\tilde{z}$ with covariance $\tilde{R}$ so it's related to the regular measurement by:

$$ \tilde{z} = L z $$ $$ \tilde{R} = L R L^T $$

We want to pick $L$ so that $\tilde{R}$ is diagonal, so let's move the $L$ terms to the left:

$$ L^{-1} \tilde{R} L^{-T} = R $$

We can use UDU decomposition on $R$, so that:

$$ U D U^T = R $$

where $D$ is diagonal and $U$ is upper-unitriangular (like we saw above). $U$ is therefore invertible, so $L = U^{-1}$ and $\tilde{R} = L R L^T$. When $R$ is constant, this can all be done once, offline. Then, the pseudo-measurements can be formed by multiplying by $L$, and the Jacobian will need an $L$ too. Then we can run the filter with sequential scalar updates. Here's are some snippets of code for this:

% Done once, offline:
[U, D] = udfactor(R);
L = inv(U); % Or use back-substitution on this since it's triangular.
R_twiddle = L * R * L.';

...

% In the filter:
z_twiddle     = L * z;
z_hat_twiddle = L * z_hat;
H_twiddle     = L * H;

One can now use sequential scalar updates on the “twiddle” system.

Note that the factorization of $R$ takes as long as an $n_z$-by-$n_z$ matrix inversion, so this only saves time when $R$ is constant, allowing this diagonalization to be done offline. Further, note that if $R$ is constant and block diagonal, a similar procedure can be done for each block individually.

Information Filtering

Another good option for speed is much more advanced, but is also elucidating: information filters. We'll touch on them only lightly. These filters correspond exactly to linear and extended Kalman filters, but instead of updating the covariance matrix (or some “square-root” form of it), they update the inverse of the covariance matrix, which is called the information matrix. The entire Kalman filter can be re-cast to work with this information matrix, and this causes something strange to happen: the correction portion of an information filter is as easy as the propagation portion of a Kalman filter (no inverses!), but the propagation portion of an information filter is as hard as the correction portion of a Kalman filter (inverses). A key difference is that, while a Kalman filter inverts an $n_z$-by-$n_z$ matrix (where $n_z$ is the dimension of the measurement), an information filter inverts an $n_x$-by-$n_x$ matrix (where $n_x$ is the dimension of the state). Therefore, when $n_z > n_x$ (there are a lot of measurements), information filters require less work.

Further, because the information filter is the dual of the Kalman filter, many tricks that work on the correction portion of the Kalman filter can work in the propagation portion of the information filter, like sequential updates. They're also amenable to the square-root forms, and square-root information filters are common in the literature.

We could fill a separate article with information filter… information, but that's not our point here. Instead, let's just focus on how information filters calculate the state, and to do this, let's start with a different problem: least-squares fitting. For this, let the state that we're trying to estimate be constant so that we don't have to worry about process noise or propagation for this quick example. Let the measurement be linear:

$$ z_k = H x + r_k $$

where $r_k$ is Gaussian with zero mean and constant covariance $R$ and $H$ is constant.

If we collected $n$ samples and were to use all of these samples in one big batch to calculate $\hat{x}$ in order to minimize the sum of the squares of all $z_k - H \hat{x}$, then a statistics textbook will tell us that this is weighted least-squares fitting and that the best right equation is:

$$ \hat{x} = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} \left( \frac{1}{n} \sum_{k=1}^n z_k \right) $$

(We'll just take that as given. See a statistics book for a derivation.)

Of course, information filters, like Kalman filters, don't use data in one big batch. Instead, they update over time, as each new measurement comes in. That update process looks like this:

$$ Y_k = Y_{k-1} + H^T R^{-1} H $$ $$ v_k = v_{k-1} + H^T R^{-1} z_k $$

where $Y$ is the information matrix at whatever sample and $v$ is the information state. What will we have after $n$ samples of this process, if $Y_0$ and $v_0$ start as all zeros (no information)?

$$ Y_n = \sum_{k=1}^n H^T R^{-1} H = n H^T R^{-1} H $$ $$ v_n = \sum_{k-1}^n H^T R^{-1} z_k = H^T R^{-1} \sum_{k-1}^n z_k $$

So let's construct the state estimate after $n$ samples, starting from the equation for the batch least squares estimate of $\hat{x}_n$ (the thing we got from the statistics book above):

$$ \hat{x}_n = \left( H^T R^{-1} H \right)^{-1} H^T R^{-1} \frac{1}{n} \sum_{k=1}^n z_k $$

Let's move $n$ into the inverted matrix, and then we'll see our $Y_n$ and $z_n$ values, plain as day:

$$ \begin{align} \hat{x}_n = & \left( n H^T R^{-1} H \right)^{-1} \left( H^T R^{-1} \sum_{k=1}^n z_k \right) \\ & = Y_n^{-1} v_n \\ \end{align} $$

which is in fact how the information filter produces the new estimate of the state. This makes it clear that an information filter simply collects the measurements over time into a compact form that at any moment can be used to construct the exact same state estimate as would be produced by a batch least-squares fit all of previous data. Of course, Kalman filters do this too, because information filters and Kalman filters ultimately do the same thing, but it's much clearer in the formulation of the information filter how it relates to typical least-squares fitting.

While this example was simplified, information filters work anywhere that linear and extended Kalman filters work. They can even work when no state is initially known, because “infinite covariance”, which can't be represented in a computer, corresponds to “zero information”, which can easily be represented as information matrix full of zeros. This is sometimes a nice advantage.

There are a few caveats, however. Information filters work with the inverses of several matrices: the Jacobian of the propagation function, the process noise covariance matrix, and the measurement noise covariance matrix. Sometimes inverting these is easy (they're diagonal or block diagonal). Sometimes they're constant and can be done offline. Sometimes analytical forms to construct the inverses are feasible. In general, though, they require more work. Information filters therefore are often faster than Kalman filters only when $n_z$ is much bigger than $n_x$ and the measurement noise covariance matrix is constant (in which case its inverse is constant and can be computed offline). Further, despite their potential advantages, information filters are much less common in practice than Kalman filters, and so it can be much harder to find relevant information. All of these things combine to make information filters an attractive option for some problems, but somewhat more advanced.

Simulations

We'll soon discuss how to improve the accuracy of a Kalman filter, but first we need to talk about how we determine accuracy in the first place: simulations. These are really easy to construct for state estimators, because we already have all of the parts we need, including a propagation function, and measurement function, the noise covariance matrices, and some initial condition.

A Basic Simulation

First, we define the initial conditions for the true system and for the filter. Sometimes a filter is initialized with the first measurement, such as this:

% True stuff
R   = ... something ...;     % Measurement covariance matrix
x_0 = ... something ...;     % Initial true state
z_0 = h(x_0) + covdraw(R_0); % Initial measurement (with noise)

% Filter stuff
x_hat_0 = ... something with z_0 ... % Initial estimate
P_0     = ... some function of R ... % Initial covariance

Other times, the filter always starts at a certain state, but we'd still like for the initial covariance to match the initial error between the truth and the estimate. So, we can form the truth as a random draw from the covariance, and this will have the proper statistics.

% Filter stuff
x_hat_0 = ... something ...; % Constant initial estimate
P_0     = ... something ...; % Constant initial covariance

% True stuff
x_0     = x_hat_0 + covdraw(P_0); % Truth is estimate + noise

The next part is the simulation loop, which usually looks something like this:

% Set the "k-1" values to the initial values.
x_km1     = x_0;
x_hat_km1 = x_hat_0;
P_km1     = P_0;

% Loop for n time steps
for k = 2:n+1

  % Propagate and measure the truth.
  u   = ... something ...; % Form the input vector.
  x_k = f(x_km1, u);       % Update the true state.
  x_k = x_k + covdraw(Q);  % Add noise with covariance Q.
  z_k = h(x_k, u);         % Form the measurement.
  z_k = z_k + covdraw(R);  % Add noise with covariance R.
  
  % Run the filter.
  [x_hat_k, P_k] = my_filter(x_hat_km1, P_km1, u, z_k);
  
  % Calculate and store the error and the covariance matrix.
  errors(:,k) = x_k - x_hat_k;
  P(:,:,k)    = P_k;

end

(In the appendices, we'll discuss how to draw noise with a given covariance to make a function like covdraw. That function is also available in the example code, here.)

Mismatched Simulations

The models used by the filter are often much simpler than models used in a detailed simulation, and are almost certaintly simpler than reality. Therefore, the simulation of the truth may use a different state, propagation function, process noise, etc. This is called a mismatched model and is useful for determining how well a filter does when it's imperfect. It's often useful to first make a basic simulation that uses exactly the same models as the filter, just to make sure the filter works with respect to its assumptions. Then, one might proceed to integrate it into a bigger, more realistic simulation. Of course, one should always be able to use truth to calculate the filter's error, and any respectable simulation would record this error for later analysis.

Monte Carlo Methods

A Monte Carlo test simply refers to running the simulation many times, each with a different random number seed so that all the random draws are different. We can then use the results to determine accuracy and associated statistics. Suppose our simulation function returns, among other things, the time history of the estimator's error and covariance matrices, like so:

% For each run of the Monte Carlo test...
for r = 1:nr
  rng(r); % Set the random number seed.
  [errors(:,:,r), P(:,:,:,r), (etc.)] = my_simulation(...);
end

We could now look at typical errors, etc. Critically, we could also see if our covariance matrices match the errors over time. Recall that we can use a covariance matrix to calculate a ellipsoid, like a 3σ ellipsoid. If that covariance matrix were right, then we should see that 99.7% of the errors are within that ellipsoid. So, for each sample of each run, we could determine if the real error was inside the 3σ ellipsoid. We'd expect to see that it is about 99.7% of the time. If fewer errors are within the ellipsoids, then that means our filter may be optimistic — the errors are greater than the covariance matrix “thinks” they are, and this is potentially a problem. The problem with optimism is that the filter trusts itself too much and outside data too little. The estimate may diverge from reality, and subsequent data will not sufficiently correct it. As it diverges, the Jacobians become more and more wrong, and proper correction becomes less and less likely. This can be prevented by acknowledging additional sources of uncertainty in the system in order to keep the covariance matrix consistent with reality. We'll examine this more below.

If too many errors are within the ellipsoid, then the filter is pessimistic, which means we might be able to get better estimates. This is generally not a big problem. There is statistical value in a little humility.

Actually calculating the ellipsoids would be laborious, and there are much better ways to determine if the true errors are consistent with the filter's covariance matrix. For the most demanding test, take the estimation error for each sample of each run and normalize it by the corresponding covariance matrix:

$$ \epsilon_k = \left( x_k - \hat{x}^+_k \right)^T \left( P^+_k \right)^{-1} \left( x_k - \hat{x}^+_k \right) $$

This will be a scalar. Now add up all of these normalized estimation error squared (NEES) values for sample $k$ of each run. This sum should have a chi-square density with $n_r n_x$ degrees of freedom, where $n_r$ is the number of runs and $n_x$ is the dimension of the state. The 95% confidence bounds for a chi-square distribution with this number of degrees of freedom can be looked up in a table or computed with statistics software. Naturally, 95% of the NEES values should be inside these bounds, and if so (or if nearly so), we say that the filter is consistent.

Note that for a single run, the bounds will be very loose. As the number of runs increases, the bounds become much tighter. A filter may appear consistent with 10 runs but badly inconsistent with 1000 runs. When running a linear Kalman filter on linear problem that is modeled correctly, the filter will be consistent and have very nearly 95% of its errors inside the bounds. Because extended filters deal with nonlinearities, they often have fewer errors inside of the bounds, and this can be address in a number of ways, as we'll see below.

There are many more such tests, and they are well-documented in Bar-Shalom et al. under “filter consistency”.

Accuracy and Consistency

So far, we've talked about ways to improve a filter's numerical stability and speed without changing the theoretical results. However, if we revisit some of the assumptions in extended Kalman filters, we'll find that we can potentially improve their accuracy too.

Iteration

For linear problems that are perfectly well described, the Kalman filter is the optimal algorithm in a least-squares sense (like we started to see above). There's no beating it. However, when applied to real problems, especially nonlinear problems, there's room for improvement. Consider the calculation of the state-innovation covariance and innovation covariance:

$$ P_{xy} = P^-_k H^T $$ $$ P_{yy} = H P^-_k H^T + R $$

These are only correct insofar as $H$ and $R$ are correct, and those are often calculated based on the propagated state estimate. There's certainly error, and the extended Kalman filter does not take this into account. However, there's a good solution here. Moments after we calculate these terms, we calculate the Kalman gain and correct the state. At this point, the state is closer to the truth than it was when we calculated $H$ and $R$. So, we can re-calculate those terms and redo the correction, gaining a little bit more accuracy. For highly nonlinear problems, this can be a big difference in the accuracy of the results. Further, once we've calculated a better estimate, we could once again re-calculate the matrices, etc. However, the benefit falls off quickly. Usually two or three iterations (the normal first pass plus one or two more) is sufficient.

Note that on each correction phase, the state is corrected relative to the propagated state; it is not corrected with respect to the last correction! For instance, the code would look something like this:

x_prior = x;       % Store the state before any corrections are made.
z_hat   = h(x);    % Predict the measurement based on the prior.
for k = 1:3        % For 3 iterations...
  H = calc_H(x);                   % Calculate the system matrices.
  R = calc_R(x);                   % Sometimes we calc. this too.
  K = P * H.' / (H * P * H.' + R); % Calculate the Kalman gain.
  x = x_prior + K * (z - z_hat);   % Correct wrt prior.
end
P = P - K * H * P; % Finally, correct the covariance too.

An extended Kalman filter that uses iteration is called an iterated extended Kalman filter (IEKF).

This is a powerful and easy technique, but it comes with a cost. The calculation of the Kalman gain is the most expensive part of a Kalman filter, and each iteration requires that we do it again. It may, therefore, be useful to figure out a good metric to determine when to continue iterating, so that it's only done when necessary. This will be problem dependent, so there are no general rules to follow.

Smoothing

When a new measurement comes in, the Kalman correction is calculated, and the result is the best estimate of the state that can be obtained using all previous data. As times goes on, this will continue to be true for future measurements, but the Kalman filter does not correct old estimates. For instance, if our initial estimate were really bad, but after 100 measurements we had a good estimate, we may wish to propagate backwards to figure out the best estimate of the initial state. This process of using current measurements to correct previous estimates is called smoothing, and there are several ways that it can be done.

For instance, in an offline calculation, one may wish to smooth an entire time history of collected data and estimates. Suppose we were alpine skiing and running an application to track our progress in real-time and also wanted to see our full trajectory going down the mountain. The application tracking us might very well use a Kalman filter to correct our current position and velocity estimates in real-time as each new measurement comes in. And the end of the day, when we get back to the lodge and plug in to our laptops, the application may start at the end of the time history with the final, best estimate of position, and work backwards to smooth out what our trajectory must have been, back to the time we started the application. Maybe we could even use two devices, one in a chest pocket and one just above a boot, so that after smoothing, we could see the moguls we conquered as the oscillating difference in the two positions.

Alternately, sometimes smoothing is used to correct only a few samples back, so that the current and previous few samples are the best estimate of the recent trajectory. This type of thing can be done relatively quickly (it adds one $n_x$-by-$n_x$ matrix inversion per smoothed sample), and so is sometimes part of embedded applications.

There is another potential application for smoothing: just like we discussed how iteration could be used to generate better system matrices, smoothing too can be used for such a purpose. Imagine calculating a smoothed trajectory back some number of samples. This smoothed trajectory can be used to calculate the relevant system matrices at each time step. The original estimate and covariance from the beginning of the smoothed portion (not the smoothed estimate and covariance) can then proceed forward in time, using these more accurate system matrices, along with the stored measurements, to update as normal. Naturally, this is expensive, because it requires storing the original estimated states and covariance matrices plus the smoothed state for each sample of the smoothed time interval, as well as re-calculating the system matrices and re-running the Kalman correction for each sample! If one were to smooth five samples, the runtime and RAM requirements would be something like 10 times greater than a baseline extended Kalman filter (and actually many times greater than that if the baseline filter uses sequential scalar updates). On the other hand, it can help an estimate on a highly-nonlinear problem converge towards the truth much more quickly, with much more accurate covariance matrices.

The Rausch-Tung-Striebel (RTS) smoother is the easiest and most common form of smoother. We used the $^-$ superscript to mean “prior” and $^+$ to mean “posterior” for the relevant sample, so let's use $^{++}$ to mean “smoothed”. Then the RTS smoother equations can correct backwards in time as:

$$ x^{++}_k = x^+_k + C_k \left( \hat{x}^{++}_{k+1} - \hat{x}^{-}_{k+1} \right) $$

where the matrix $C_k$ maps a difference between the smoothed estimate and the originally predicted estimate for sample $k+1$ to a correction on the estimate for sample $k$. The $C_k$ matrix is formed from the posterior covariance for sample $k$ and the propagated covariance from sample $k+1$:

$$ C_k = P^+_k F_k^T \left( P^-_{k+1} \right)^{-1}$$

All of these terms were used in the regular Kalman filter, so they're already available. However, they must all be stored in order to use smoothing. Further, notice that the smoothed covariance is not required. However, it's often desired, and the covariance of the smoothed estimate can also be calculated alongside the smoothed estimate:

$$ P^{++}_k = P^+_k + C_k \left( P^{++}_{k+1} - P^-_{k+1} \right) C_k^T $$

Notice how similar this is to the correction of the state estimate above.

We start the smoothing process at the most recent sample, $n$, where the most recent estimate is the first smoothed estimate ($\hat{x}^{++}_{n} = \hat{x}^{+}_{n}$), and we continue to apply these corrections backwards in time as far as is desired.

There are better forms than the RTS smoother for a variety of reasons, though these are much less common and are pretty advanced. A quick search on “smoothing” in a journal about controls will provide several days worth of reading material.

Tuning the Noise

So far, we've talked about increasing the accuracy of the estimated state. However, a Kalman filter's performance is only as good as its covariance matrices, so we need to make sure those represent reality well too. We saw above that we can test how well the real errors match the filter's covariance (consistency of the filter), so when we find that a filter is inconsistent, what can we do?

Generally, filters are inconsistent on the side of optimism, which is bad. In this case, one very simple method is to increase the values in the process noise covariance matrix used by the filter. It's easy to see why; this makes the covariance larger. And since the Kalman gain is the “ratio” of the covariance of the prediction to the covariance of the innovation, this means the gain will weigh the measurement more heavily than it otherwise would, helping to prevent the estimate from diverging.

Sometimes, we can calculate exactly how much we should add to the covariance matrix. For instance, in the bouncing ball example, what if there were uncertainty in gravity? Maybe the system doesn't know if the ball is bouncing at sea level or at the top of Mount Everest, and the differences in gravitational acceleration between those places may matter. We could calculate the variance of gravity in the locations where one might find the ball, and we could then calculate how much this uncertainty affects the state uncertainty. We would then add this to the process noise covariance matrix.

Other times, it's not clear where additional uncertainty may come from. In fact, there may not be any additional uncertainty at all (perhaps the simulation draws noise in exactly the way the filter expects). In this case, inconsistency would have to be due to the nonlinearities of the problem. That's much harder to satisfy. How should we increase the process noise covariance to account for this? Do we increase the noise by a constant factor? Add values only to the diagonals? Allow it to change over time? There are no good guidelines here, and how it's done is arbitrary. Monte Carlo testing may reveal an acceptable value to use. One may even use a simple search/optimization strategy to find values such that the 95% confidence bounds contain 95% of the errors, but this may take a long time to run (every iteration of the search will require a large number of simulations in order to have good statistics). This is often called tuning, and it's no fun. This is sometimes necessary, but sometimes, there's a better way, which we'll get to below.

On the rare occassion that a filter is found to be pessimistic, one should check the simulation to make sure all of the appropriate uncertainty is represented with random draws. If so, the simulation may need better fidelity, which may require producing additional errors (e.g., the times at which sensor values are sampled). The point here is that filters are simplified and make a lot of assumptions about correlation, sample times, etc., and filters just aren't commonly pessimistic once these are taken into account. If somehow it were, one might simply declare success. Potentially, one could increase the measurement noise covariance by a small amount, causing the filter gain to “trust” the propagated state a little more.

The uncomfortable part about tuning for filter consistency is that estimates often look better when a filter is optimistic (it will be smoother). The problem is the inherent threat of divergence, which might not be observed in short simulations or small numbers of runs. So, tuning is often a toss-up between consistency and accuracy of the estimate on most runs. One must use good judgment.

To restate: tuning the noise is a simple measure that doesn't require any modification to the filter itself. When it's clear how to add more uncertainty to the process noise covariance matrix, that's great, and one should do so, but sometimes it's unclear, and this can become a tedious thing to fix. In these cases, there may be a better way.

Consider Covariance

State estimators assume that there's uncertainty in three places: the current estimate, the process noise, and the measurement noise. Those sources are all assumed to be uncorrelated, but sometimes this isn't enough. Let's reconsider the falling coffee filter from part 1. During its slow descent, the air density will make a difference in how quickly it falls (we did not explicitly model air density in the example). Suppose there's uncertainty in the air density. The way this affects the state depends on the velocity, which is part of the state, and so uncertainty in air density will correlate with uncertainty in velocity. We can't easily determine how the air density uncertainty could become part of the process noise the right way. What do we do?

We might make air density part of the state! Then, it would have uncertainty, because it would have a place in the filter's covariance matrix, and it would correlate properly with the velocity and position. Now assume there's also uncertainty in the mass of the package. We could add this to the state too. However, we would never be able to estimate both the air density and the mass (they appear only as a product in the dynamics, so we can only estimate the product of the two terms, not the terms individually). Would this cause problems for our estimator? Yes. Further, the state is getting too large, causing the filter to have to do too much work. Perhaps we don't really care very much what the air density and mass actually are, but we just want to consider the uncertainty they add so the filter doesn't become overconfident. We can do this with consider covariance (where the word “consider” has been awkwardly adjectived). Here's how.

Let's start by adding to the state vector all of the quantities that include uncertainty. Supposing there are $n_x$ real states and $n_c$ consider parameters, then the augmented state, $x_a$ now has dimension $n_a = n_x + n_c$. We want to estimate the original state, the first $n_x$ elements, but we don't care to estimate the last $n_c$ elements. The covariance will look like this:

$$ P_a = \begin{bmatrix} P & P_{xc} \\ P_{xc}^T & P_{cc} \end{bmatrix} $$

We next create the system matrices. Since the propagation function doesn't update the consider parameters, its Jacobian for the consider parameters is just an identity matrix. However, the Jacobian of the propagation of the state with respect to the consider parameters, $F_c$, will certainly have values.

$$ F_a = \begin{bmatrix} F & F_c \\ 0 & I \end{bmatrix} $$

where those zeros and the identity matrix have the appropriate dimensions.

The process noise doesn't affect the consider parameters, so:

$$ Q_a = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix} $$

The observation matrix is similar.

$$ H_a = \begin{bmatrix} H & H_c \end{bmatrix} $$

Now we compute everything as normal, with the exception that we don't want to bother updating the consider parameters. So, the augmented gain is:

$$ K_a = \begin{bmatrix} K_x \\ 0 \end{bmatrix} $$

That is, the augmented gain is computed using the same equations as before, but the bottom portion, corresponding to the consider parameters, is zeroed out.

We can now use this augmented Kalman gain to correct the state (note that the consider parameters will not be corrected). There's one last thing we need to do before correcting the covariance, however. Recall that the $P^+ = P^- - K H P^-$ form was found starting from the Joseph form and then substituting the Kalman gain, with a bunch of things cancelling. But our $K_a$ is no longer the Kalman gain, because we've replaced some values with zeros. As a result, we need to stick with a more general equation, like the Joseph form:

$$ P^+_a = \left( I - K_a H_a \right) P^-_a \left( I - K_a H_a \right)^T + K_a R K_a $$

If we were to multiply this out, we'd find that the bottom-right corner of the covariance matrix, $P_{cc}$, doesn't change. In fact, this makes sense. We're never updating the consider parameters, so why would their uncertainty change? For this reason, one doesn't usually literally create the augmented system, but instead stores $P$ and $P_{xc}$ separately, with separate equations to update them.

All of the equations for this consider Kalman filter can be found with simple algebraic manipulations of the above. Here's the result as code:

% Propagate the state covariance and the state-consider covariance.
P    =   F   * P      * F.' + Q ... % Normal part
       + F   * P_xc   * F_c.' ...   % New parts
       + F_c * P_xc.' * F.' ...
       + F_c * P_cc   * F_c;
P_xc = F * P_xc + F_c * P_cc;

% ... and later on, in the correction portion...

% Calculate the parts of the Kalman gain (just for the state)
P_xy = P * H.' + P_xc * H_c.';
P_yy =   H   * P      * H.' + R ... % Normal part
       + H   * P_xc   * H_c.' ...   % New parts
       + H_c * P_xc.' * H.' ...
       + H_c * P_cc   * H_c;
       
% Calculate the Kalman gain.
K = P_xy / P_yy;

% Correct the state covariance and the state-consider covariance.
P    =   P - K * H * P ... % Normal part
       - K * H_c * P_xc.'; % New part
P_xc = P_xc - K * H * P_xc - K * H_c * P_cc;

Alternately, for numerical stability, one can literally use Joseph form for the update of the full system. It can even be broken down into pieces so that the full augmented covariance matrix need not be stored.

With this, additional sources of uncertainty can be represented, and they can even correlate with the normal parts of the state, allowing the Kalman filter to make the most of its ability to estimate when correlation is known. Clearly, much more matrix multiplication is required, but the size of the inverted matrix hasn't changed. Further, this method works with iteration, sequential scalar updates, sigma-point filters, and the other techniques we've seen, making consider covariance a powerful option in making statistically proper filters that also estimate as well as possible.

Before we wrap up this section, we should note that “consider Kalman filter“ can refer to several different types of things. Some are just used to analyze the effects of additional noise without affecting the estimates. Others use the additional covariance to increase the Kalman gain. What we're looking at in particular is known as the minimum-variance consider Kalman filter or the Schmidt-Kalman filter, named after its inventor, Stanely F. Schmidt, who (it's not surprising at this point) was an engineer working on Apollo.

Wrap-Up

There are many techniques for working with Kalman filters, with these constituting a powerful selection of methods that are relevant today. It's clear that Kalman filters have been tried, tested, and tweaked for a huge variety of circumstances, and methods discussed here allow one to build an accurate, fast, and reliable filter ready for the real world. Though the literature can become confusing, the author hopes this article has been helpful in building an intuition for the various ideas involved and in helping interested readers to navigate the space of options available to them.

To close, let's note that methods like these have helped Kalman filters fit an enormous variety of problems, and the ideas put forth in Kalman's original paper quickly matured during Apollo and through the 1970s to phenomenal success. It's the filter's elegant simplicity and adaptability that enabled the work many of the rest of us now depend on. In recognition of this, in 2009, Dr. Rudolf Kalman was awarded the National Medal of Science in the United States — the country's highest scientific and technological honor. It's a fitting “thank you” to Dr. Kalman and the many mathematicians, scientists, and engineers in and around Apollo, from those of us who seek to take Kalman filters to yet more places.

Happy filtering.

Part 3: Useful Little Things (the Appendices) >>