State estimation has many applications in general robotics, for
example autonomous driving localization and environment prediction.
Kalman filter is a classical yet powerful algorithm that tackles such
problem beautifully. Although there are already many articles,
textbooks and papers on how to derive the algorithm, I found most of
them too heavy on the theoretical side and might be hard for a
first-time learner who comes from an engineering background to follow.
Therefore, I would shamelessly attempt to fill this hole with a series
of posts.
This is going to be the first post of the series that only focuses on
the 1 dimensional case. Future posts will talk about the multi-variate
version of Kalman filter.
Spoiler: there will be a lot of math equations. But rest assured,
nothing will exceed the level of basical calculus arithmetics.
Single Step Prediction
Let's say we have a one dimensional linear Markovian System, whose
transition function is know. This means
The state of the system can be represented as a single scalar.
Let's denote it as x.
The transition function is a linear function, where the next state
only depends on the current state. Therefore we can write the
transition function as
xt+1=a⋅xt
Suppose you have an estimation of xt in the form of a Gaussian
distribution
xt∼N(x^t,σt2)
Based on that, what is your best-effort guess about the next state
xt+1? First of all, the reason that we can make such a
prediction of the next state is because the transition function
actually reveals the relationship between xt and xt+1, which
happens to be linear in this case. I know that intuitively,
you would guess the answer immediately:
xt+1∼N(ax^t,a2σt2)
And that is the correct answer. But how do you prove that? Or more
generally, if we have a random variable X∼N(μ,σ2), can
we prove that another random variable that satisfies Y=aX actually
follows the distribution Y∼N(aμ,a2σ2)?
Let
ϕ(x)=2π1⋅e−21x2
be the pdf (probability density function) of a standard gaussian
distribution N(0,1). It is easy to derive that the pdf of a
general gaussian distribution N(μ,σ2) that X follows
would be
fX(x)=2πσ1⋅e−2σ21(x−μ)2=σ1ϕ(σx−μ)
Using the trick called differential form, the probability of X taking a specific value x is1
P[X=x]=fX(x)dx=σ1ϕ(σx−μ)dx
Okay, so what does pdf of Y (i.e. P(Y=y)) look
like? It turns out that we can easily derive that with a bit
transformation:
This basically shows that Y's pdf is nothing but the pdf of
N(aμ,a2σ2), hence conclude the proof.
1
An intuitve perspective here that helps understanding that is fX(x)dx is actually the area, whose fundamental unit is probability!
The above proved conclusion enables us to solve the prediction problem
at the very beginning of this section,
based on xt+1=a⋅xt and xt∼N(x^t,σt2)
we can predict xt+1∼N(ax^t,a2σt2)
equivalently, this means we can obtain estimation of xt+1 (without observing it) as
{x^t+1σt+1==ax^ta2σt2
Uncertainty in Transition Function
Now it is time to introduce one variation on top of the simplest case
we discussed in the previous section. In reality the transition is
usually not perfect, which means that there is an error associated
with it. Mathematically, it means
xt+1=a⋅xt+et
As usual for simplicity we assume the error is a zero-mean random
variable follows a Gaussian distribution, i.e.
et∼N(0,σet2)
How should we revise our prediction under such condition? Remember we
are still solving the following question - if we already have an
estimation of xt as
xt∼N(x^t,σt2)
what is a good estimation of xt+1, given that we know (although
not precisely in this case) the transition function?
Here we are going to introduce another useful idea - the generative
model. The generative model basically describes the procedure to
get a sample value of a random variable. In this particular case, the
generative model of xt+1 consits of:
Sample a⋅xt out of the distribution N(ax^t,a2σt2) (Note that this is the conclusion from the previous section)
Sample et out of the distribution N(0,σet2)
Construct xt+1 by adding the two sampled values up
So you can see that the generative model is basically an
interpretation of the problem formulation, provides no new knowledge
at all. However, with such interpretation it is clear to see that
xt+1 as a random variable is basically the sum of two
independently distributed gaussian random variables!
I will cheat here by referring to the generating function based
proof
from wikipedia. As you have probably guessed, the conclusion is that
if i.i.d {XY∼N(μX,σX2)∼N(μY,σY2) then Z=X+Y∼N(μX+μY,σX2+σY2)
By plugging in our generative model, we obtain
xt+1∼N(ax^t,a2σt2+σet2)
which means such good prediction would be
{x^t+1σt+1==ax^ta2σt2+σet2
Yep, just add the variance of the error to the estimation of variance. Pretty simple, right?
Let There Be Observations
So we know how to predict xt+1 given xt, which is great. This
means that if we happen to know the initial state of the system,
x0, we can start to predict x1, and then x2, ..., till any
xt, which will be
{x^tσt==atx^ta⋅(a⋅(a⋯)+σet−22)+σet−12
As we can see, there is one fatal problem in the above prediction. As
t grows, our estimation will be less precise because the variance is
going to grow very quickly. This is because ecah time we make
prediction for one more step, the variance of error will be added to
it. To see it more clearly, when a=1, we will have
σt=σe02+σe12+⋯+σet−12
It is easy to understand this error accumulation intuitively. As
we make more predictions, we are not getting new information about the
system. Think about it - if you only know what a cat looks like when
it is 1-week old, how can you precisely predict how it looks like when
he is 3 years old? If you only know your weight before COVID-19
keeping us at home, how do you precisely estimate your current weight?
The key here is that you need constant feedback to guide your
estimation when it gets
Okay let's take one more step in the problem formulation, so that it
will be more realistic. Now we are allowed to take measurement of
the system via an observation function. In the weight example,
this translates to you are allowed to weigh yourself with an electric
scale every now and then. It seems that with the method to take
measurement, we do not need to estimate the state anymore, we can
simply observe the readings and get the precise value! Except for that
in reality the measurement is usually not always accurate. Therefore,
the observation function has an associated error as well.
Assuming a linear observation function, we will have the readings
mathematically as:
yt=ht⋅xt+rt, where rt∼N(0,σrt)
Note that when you take measurement at time t+1, you can directly
observe yt+1 (though yt+1 is not xt+1, and the latter
is what we want to estimate). The question now becomes: how to make a
good estimation about xt+1, given
A good estiamtion of the previous state xt, and
The current measurement reading yt
Let's first find the generative model interpretation of this. We
can see that yt+1 is generated in the following 3 steps:
Sample xt+1=a⋅xt+et out of the distribution N(ax^t,a2σt2+σet2) (Note that this is the conclusion from the previous section)
Sample rt+1 out of the distribution N(0,σrt+12)
Directly compute yt+1=ht+1⋅xt+1+rt+1
Let's stop for a while to take a closer look at the above generative
model. The distribution N(ax^t,a2σt2+σet2) comes from the conclusion of the previous section,
which represents the best2 estimation of xt+1 we can get based
pure prediction. The mean and variance of this distribution will
be used quite a lot in the derivation below, so it is good to give it
some name. Let
{xt+1′σt+1′2==ax^ta2σt2+σet2
Note that both xt+1′ and sigmat+1′ are determinsitic
values, i.e. neither of them is random variable.
2
I am being informal here as we haven't formally defined what
the best estimation means. We will likely get to this topic in
the future, so bear with me for now.
Although the above generative model is about generating yt+1, it
is xt+1 that we actually want to estimate. We can do this by
deriving the pdf of xt+1. The following derivation will
likely seem very tedious, but I will try to be clear on each step and
trust me, this will be the last challenge in this post!
Given that we have observed yt+1=y, what is the probability of
xt+1=x? Such probability can be written as
∀x,P[xt+1=x∣yt+1=y]=fxt+1(x)dx
where fxt+1(x) is the unknown (yet) pdf of xt+1 that
we want to derive. Also note that there is ∀x in the
statement, which is very important. It means that the equation
holds for every single x.
By applying Bayes's law, the left hand side can also be transformed as
So there are 3 items on the right hand side. Let's crack them one by
one.
The simplest one here is P[yt+1=y]. Since it does not
depend on x, this can be just written as
P[yt+1=y]=Const⋅dy
Next comes P[xt+1=x], without conditioning on the
value of yt+1. This is the pure prediction we discussed
above, which ∼N(xt+1′,σt+1′2). Therefore it is
simply
P[xt+1=x]=Const⋅exp(−2σt+1′2(x−xt+1′)2)dx
The last one P[rt+1=y−ht+1x] is about rt+1,
which happens to be following a Gaussian distirbution as well (even
better, the mean is zero)! This means
Note dr can be written as the form above because of
differential form arithmetics. It is good to understand the rules
behind them, but when you get familiar with the rules, they are just
no more strange than the rules you use to take derivatives.
Therefore, take the above 3 expanded components and plug them back,
and keep in mind that by differential form rule dx∧dx=0, we have
Note that since y is basically the value of yt+1, it is
replaced with yt+1.
This means that xt+1 follows Gaussian distirbution! We can even
directly tell what is the mean and what is the variance of the
estimation from the above formula, i.e.
K is clearly a number between 0 and 1. This means that the mean
of estimation of xt+1 is actually a weighted combination
of xt+1′ and yt+1/ht+1. It is worth noting that
xt+1′ is the best guess you can have based on pure prediction
yt+1/ht+1 is the best guess you can have based on pure observation
So this is basically about trusting both of those two evidences with a
grain of salt. And how much you trust each of them depends on the
variance of each guess. The bigger variance, the less trustworthy.
Very reasonable, right?
What about the estimated variance of xt+1? As we have defined
K, it can be written as
σt+12=K⋅σt+1′2
This is intuitively just updating the pure prediction based variance
estimation as we have observation now. Note that becauese K<1, the
final estimated variance is going to be smaller than the actual pure
prediction based estimated variance!
So at this moment we can now summarize the procedure of Kalman Filter
update, i.e. how to obtain t+1-step estimation based on t-step
estimation and new observation.
Step I: Compute the pure prediction based estimation.
{xt+1′σt+1′2==ax^ta2σt2+σet2
Step II: Compute the combination weight K, which is often called the Kalman Gain.
K=σrt+12+ht+12σt+1′2σrt+12
Step III: Use Kalman Gain K and observation yt+1 to
update the pure prediction based estimation.
This post demonstrated the derivation of 1D Kalman Filter, and also
slightly touched the intuitive interpretation of it. Also, I think
many of the techniques used here such as generative model and
differential forms can find their applications in many other
situations.
However, in reality, 1D Kalman Filter is rarely useful enough. This
post should have prepared you for the next journey - multivariate
Kalman Filter. Stay tuned!