<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>Break&#x27;s Forge World</title>
    <subtitle>Break&#x27;s Forge World</subtitle>
    <link rel="self" type="application/atom+xml" href="https://www.breakds.org/atom.xml"/>
    <link rel="alternate" type="text/html" href="https://www.breakds.org"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2024-08-25T00:00:00+00:00</updated>
    <id>https://www.breakds.org/atom.xml</id>
    <entry xml:lang="en">
        <title>The Intuitive VAE</title>
        <published>2024-08-25T00:00:00+00:00</published>
        <updated>2024-08-25T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/the-intuitive-vae/"/>
        <id>https://www.breakds.org/the-intuitive-vae/</id>
        
        <content type="html" xml:base="https://www.breakds.org/the-intuitive-vae/">&lt;p&gt;Variational Autoencoder, also known as VAE, is an elegant algorithm in machine learning. This post summarizes my attempt to teach the math behind VAE in an intuitive way.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;maximum-likelihood-estimation-mle&quot;&gt;Maximum Likelihood Estimation (MLE)&lt;&#x2F;h1&gt;
&lt;p&gt;A common problem (arguably the central problem) in machine learning is learning the underlying distribution of a dataset $X_{\text{Data}}$. This dataset contains $n$ samples:&lt;&#x2F;p&gt;
&lt;p&gt;$$
X_{\text{Data}} = \{ x_1, x_2, \cdots, x_n \}, \text{ where } x_i \in \mathbb{R}^d
$$&lt;&#x2F;p&gt;
&lt;p&gt;Assuming the underlying distribution has a probability density function (pdf) of $p(x)$, we aim to find (or fit) a parameterized function $p_\theta(x)$, such that by optimizing with respect to $\theta$,&lt;&#x2F;p&gt;
&lt;p&gt;$$
p_\theta(x) \text{ can closely approximate } p(x)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Why do we want to learn $p(x)$? Because:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;We can then sample from it (e.g., &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;openai.com&#x2F;index&#x2F;dall-e-3&#x2F;&quot;&gt;DALL·E 3&lt;&#x2F;a&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;We can evaluate $p_\theta(y)$ to determine how likely $y$ is to be a sample from the underlying distribution.&lt;&#x2F;li&gt;
&lt;li&gt;We can use it for various downstream tasks.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The maximum likelihood estimation approach suggests maximizing the following objective with respect to $\theta$:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\theta^{*} = \arg\max_{\theta} \prod_{i=1}^n p_\theta(x_i)
$$&lt;&#x2F;p&gt;
&lt;p&gt;This is often simplified by taking the logarithm of the objective, converting the product into a &lt;strong&gt;sum&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\theta^{*} = \arg\max_{\theta} \sum_{i=1}^n \log p_\theta(x_i)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Now, if we assume the underlying distribution is as simple as a Gaussian, $p_\theta$ can be parameterized with $\theta = (\mu, \sigma)$:&lt;&#x2F;p&gt;
&lt;p&gt;$$
p_\theta(x) = \frac{1}{\sqrt{2 \pi}\sigma} \exp \left( -\frac{(x - \mu)^2}{2 \sigma^2}\right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;The optimization then becomes:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\theta^{*} = \arg\min_{\theta} \left[n \cdot \log \sigma + \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 \right]
$$&lt;&#x2F;p&gt;
&lt;p&gt;Finding the minimum is straightforward:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\mu &amp;amp;= \text{np.mean}(X_\text{Data}) \\
\sigma &amp;amp;= \text{np.std}(X_\text{Data})
\end{cases}
$$&lt;&#x2F;p&gt;
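This closed-form answer can be sanity-checked numerically. A minimal sketch, assuming NumPy; the synthetic dataset and its parameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=100_000)

# MLE for a single Gaussian: sample mean and population standard deviation.
mu_hat = np.mean(data)
sigma_hat = np.std(data)  # ddof=0 matches the MLE, not the unbiased estimator

def log_likelihood(mu, sigma):
    """Sum of log N(x | mu, sigma^2) over the whole dataset."""
    return np.sum(-np.log(np.sqrt(2 * np.pi) * sigma)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

# The closed-form solution should score at least as well as nearby perturbations.
best = log_likelihood(mu_hat, sigma_hat)
print(mu_hat, sigma_hat, best)
```

Perturbing either parameter away from `np.mean` / `np.std` can only decrease the log-likelihood.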
&lt;h1 id=&quot;bring-in-the-generative-model&quot;&gt;Bring in the Generative Model&lt;&#x2F;h1&gt;
&lt;p&gt;Although modeling using a single Gaussian, as described above, is widely used, it is not suitable when the underlying distribution is very complex (e.g., the set of all meaningful images). A common technique to model such complex distributions is to assume the generation process follows something like $p(z) \cdot p(x|z)$. This means you first generate a &lt;strong&gt;code&lt;&#x2F;strong&gt; $z$ from $p(z)$, which broadly describes the characteristics of the final data point, and then sample the final data point from $p(x|z)$.&lt;&#x2F;p&gt;
&lt;p&gt;This assumption is intuitive, and we can understand it through an analogy of sampling a random human being. First, a gene is sampled, which determines many traits of the human, such as height, skin color, etc. We can then sample many individuals from this gene pool, and although they might all be different, they should share some similarities within the cohort. Here, the gene corresponds to $z$, which encodes the general characteristics, and the &quot;generated&quot; human corresponds to $x$. We often assume $p(z) = \mathcal{N}(0, I)$ because many traits are arguably Gaussian distributed&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Given this, estimating $p(x)$ becomes all about estimating the underlying $p(x|z)$ since $p(z) = \mathcal{N}(0, I)$ is assumed to be &lt;strong&gt;known&lt;&#x2F;strong&gt;. If we model $p(x|z)$ with a neural network parameterized by $\theta$, we can write the model for $p(x)$ as&lt;&#x2F;p&gt;
&lt;p&gt;$$
p_\theta(x) = \int_z p_\theta(x|z) p(z) \, \mathrm{d}z
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Terminology&lt;&#x2F;strong&gt;: We will use the subscript $\theta$ to indicate that a distribution&#x2F;pdf is induced by the parameterized pdf $p_\theta(x|z)$.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, directly optimizing this model using the MLE objective is intractable due to the integral. For now, let&#x27;s take a closer look at the model.&lt;&#x2F;p&gt;
&lt;p&gt;It is possible to generate the same $x$ from two different codes, $z_1$ and $z_2$, as long as $p_\theta(x|z_1) &amp;gt; 0$ and $p_\theta(x|z_2) &amp;gt; 0$. Each sample $x$ can therefore ask &quot;how likely is it that each code generated me?&quot;, and the answer is another set of induced distributions:&lt;&#x2F;p&gt;
&lt;p&gt;$$
p_\theta(z|x) = \frac{p_\theta(x|z) \cdot p(z)}{p_\theta(x)}
$$&lt;&#x2F;p&gt;
&lt;p&gt;From now on, we will refer to $p_\theta(x|z)$ as the (learnable) &lt;strong&gt;decoder&lt;&#x2F;strong&gt; because it deciphers the code into an actual sample. We will call $p_\theta(z|x)$ the $\theta$-induced &lt;strong&gt;encoder&lt;&#x2F;strong&gt;. The &quot;$\theta$-induced&quot; prefix is important because we want to distinguish it from something else introduced in the next section.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;It is perfectly fine to use another form for $p(z)$, but the Gaussian distribution is one of the easiest to work with.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;h1 id=&quot;kl-divergence-and-eblo&quot;&gt;KL Divergence and ELBO&lt;&#x2F;h1&gt;
&lt;p&gt;An embarrassing fact about the above generative model is that even if we find a way to successfully learn the &lt;strong&gt;decoder&lt;&#x2F;strong&gt; $p_\theta(x|z)$, it would still be difficult to recover the induced encoder $p_\theta(z|x)$&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. A natural idea is to learn a separate &lt;strong&gt;encoder&lt;&#x2F;strong&gt; $q_\phi(z|x)$, &lt;strong&gt;independently parameterized&lt;&#x2F;strong&gt; by $\phi$.&lt;&#x2F;p&gt;
&lt;p&gt;I understand your concern: what if $q_\phi(z|x)$ is not consistent with the $\theta$-induced encoder $p_\theta(z|x)$? Well, we can start by analyzing a term that measures the inconsistency between two distributions, namely the &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Kullback%E2%80%93Leibler_divergence&quot;&gt;KL divergence&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathrm{KL}(q_\phi(z|x) \mathrel{\Vert} p_\theta(z|x) ) = \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \right]
$$&lt;&#x2F;p&gt;
&lt;p&gt;In hindsight, we want to transform the above expression to make $p_\theta(x)$ appear, so that it relates to the MLE objective. This is easily done by applying Bayes&#x27; theorem to $p_\theta(z|x)$ (twice):&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\mathrm{KL}(q_\phi(z|x) \mathrel{\Vert} p_\theta(z|x) ) &amp;amp;=&amp;amp; \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \right] \\
&amp;amp;=&amp;amp; \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log \frac{q_\phi(z|x) \cdot p_\theta(x)}{p_\theta(x|z) \cdot p(z)} \right] \\
&amp;amp;=&amp;amp; \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x) \right] +
\mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log \frac{q_\phi(z|x)}{p(z)} \right] - \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right] \\
&amp;amp;=&amp;amp; \log p_\theta(x) + \mathrm{KL}\left(q_\phi(z|x) \mathrel{\Vert} p(z) \right) - \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right]
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;The most confusing part might be why we can take $p_\theta(x)$ out of the expectation. This is because the expectation is with respect to $z$, and $p_\theta(x)$ is independent of $z$. This is why I explicitly write out the variable of the expectation. Rearranging the terms on both sides of the equation, we have:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right] - \mathrm{KL}\left(q_\phi(z|x) \mathrel{\Vert} p(z) \right) + \mathrm{KL}(q_\phi(z|x) \mathrel{\Vert} p_\theta(z|x) )
$$&lt;&#x2F;p&gt;
&lt;p&gt;If we denote part of the right-hand side as:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathrm{ELBO} = \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right] - \mathrm{KL}\left(q_\phi(z|x) \mathrel{\Vert} p(z) \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Then the equation for $\log p_\theta(x)$ becomes:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\log p_\theta(x) = \mathrm{ELBO} + \mathrm{KL}(q_\phi(z|x) \mathrel{\Vert} p_\theta(z|x) )
$$&lt;&#x2F;p&gt;
&lt;p&gt;Because KL divergence is always &lt;strong&gt;non-negative&lt;&#x2F;strong&gt;, the ELBO is a &lt;strong&gt;lower bound&lt;&#x2F;strong&gt; on $\log p_\theta(x)$, which is precisely what we want to maximize in MLE. This is also how the ELBO gets its name (Evidence Lower Bound).&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;The induced encoder refers to the conditional distribution $p_\theta(z|x)$, which is difficult to compute directly.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
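To make the lower-bound relationship concrete, here is a toy model (my own choice, purely for illustration) where everything is tractable: $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the induced encoder is $p(z|x) = \mathcal{N}(x&#x2F;2, 1&#x2F;2)$. The gap is positive for an arbitrary Gaussian $q$ and vanishes exactly when $q$ equals the induced encoder:

```python
import numpy as np

# Toy model where every quantity is available in closed form:
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)  =>  p(x) = N(0, 2),  p(z|x) = N(x/2, 1/2)

def log_px(x):
    # log density of N(0, 2) at x
    return -0.5 * np.log(4 * np.pi) - x ** 2 / 4

def elbo(x, m, s):
    # E_q[log p(x|z)] with q = N(m, s^2), in closed form:
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # KL( N(m, s^2) || N(0, 1) ), in closed form:
    kl = 0.5 * (m ** 2 + s ** 2 - 1.0 - np.log(s ** 2))
    return recon - kl

x = 1.7
gap_bad = log_px(x) - elbo(x, m=0.0, s=1.0)             # arbitrary q: positive gap
gap_opt = log_px(x) - elbo(x, m=x / 2, s=np.sqrt(0.5))  # q = induced encoder: zero gap
print(gap_bad, gap_opt)
```

The gap printed first is the KL divergence between the chosen $q$ and the induced encoder, which is why it closes exactly when the two coincide.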
&lt;h1 id=&quot;properties-of-elbo&quot;&gt;Properties of ELBO&lt;&#x2F;h1&gt;
&lt;p&gt;Can we just maximize the ELBO as a surrogate to maximize $\log p_\theta(x)$? The answer is (not a straightforward) yes. It’s not straightforward because, typically, maximizing a lower bound does not necessarily mean the objective value is maximized—unless the gap is zero!&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s examine the most interesting property of the ELBO, as highlighted by Hung-yi Lee in his &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=8zomhgKrsmQ&amp;amp;t=14s&quot;&gt;talk about VAE&lt;&#x2F;a&gt;. What happens if we &lt;strong&gt;keep $\theta$ fixed&lt;&#x2F;strong&gt; and maximize the ELBO with respect to $\phi$?&lt;&#x2F;p&gt;
&lt;p&gt;$$
\phi^{*} = \arg\max_{\phi} \mathrm{ELBO}
$$&lt;&#x2F;p&gt;
&lt;p&gt;In this case, since $\log p_\theta(x)$ does not change (because $\theta$ is fixed), a maximized ELBO must imply a &lt;strong&gt;minimized gap&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\log p_\theta(x) = \mathrm{ELBO} \uparrow + \mathrm{KL}(q_\phi(z|x) \mathrel{\Vert} p_\theta(z|x)) \downarrow
$$&lt;&#x2F;p&gt;
&lt;p&gt;However, note that the gap is a KL divergence, whose &lt;strong&gt;minimum value is 0&lt;&#x2F;strong&gt;! Although it may be difficult for the optimizer to reach a global optimum in practice, this clearly suggests that the effect of maximizing the ELBO with respect to $\phi$ is to drive the KL divergence term toward 0, making the ELBO a &lt;strong&gt;tighter lower bound&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This also makes sense from another perspective, because the mathematical meaning of a shrinking $\mathrm{KL}(q_\phi(z|x) \mathrel{\Vert} p_\theta(z|x))$ is that the learned encoder $q_\phi(z|x)$ is becoming more consistent with the $\theta$-induced encoder $p_\theta(z|x)$.&lt;&#x2F;p&gt;
&lt;p&gt;We now arrive at a more intuitive understanding of (maximizing) the ELBO. If we maximize the ELBO with respect to both $\theta$ and $\phi$, it will simultaneously attempt to:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Raise the lower bound of the target &lt;strong&gt;higher and higher&lt;&#x2F;strong&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Close the gap&lt;&#x2F;strong&gt; between the lower bound and the target.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h1 id=&quot;computing-elbo-for-training&quot;&gt;Computing ELBO for Training&lt;&#x2F;h1&gt;
&lt;p&gt;Recall that the ELBO is defined as:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathrm{ELBO} = \mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right] - \mathrm{KL}\left(q_\phi(z|x) \mathrel{\Vert} p(z) \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;first term&lt;&#x2F;strong&gt; $\mathbb{E}_{z \sim q_\phi(\cdot|x)} \left[ \log p_\theta(x|z) \right]$ essentially means that, for each sample $x$ in the batch, we should first evaluate $q_\phi(\cdot|x)$ and sample a code $z$ from the $x$-conditioned distribution. We can then use $\log p_\theta(x|z)$ for this particular $z$ as a &lt;strong&gt;representative&lt;&#x2F;strong&gt; of the first term. This approach is common when working with expectations in the loss function, since such a single-sample estimate is statistically unbiased.&lt;&#x2F;p&gt;
&lt;p&gt;If your model $p_\theta(x|z)$ is defined as a conditional Gaussian distribution (which is typically the case when using VAE), $p_\theta(x|z)$ is just $\mathcal{N}(\mu_\theta(z), \sigma_\theta(z))$. In this case, maximizing the first term is &lt;strong&gt;almost&lt;&#x2F;strong&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; equivalent to minimizing the Euclidean distance between $\mu_\theta$ and $x$ (i.e., $p_\theta(x|z)$ should reconstruct $x$). This is why this term is often referred to as the &lt;strong&gt;reconstruction loss&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;second term&lt;&#x2F;strong&gt; is even simpler. It implies that $q_\phi(z|x)$, even when conditioned on $x$, should not deviate too far from the &lt;strong&gt;prior distribution&lt;&#x2F;strong&gt; $p(z)$. When $q_\phi(z|x)$ is modeled as a conditional Gaussian distribution, and $p(z)$ is also assumed to be Gaussian, the KL divergence can be computed in &lt;strong&gt;closed form&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
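For the Gaussian-on-Gaussian case, the closed form can be sanity-checked against a Monte Carlo estimate. A minimal sketch, assuming NumPy; the particular mean and standard deviation below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.4   # parameters of q(z|x) for one particular x (made up)

# Closed form: KL( N(mu, sigma^2) || N(0, 1) )
kl_closed = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))

# Monte Carlo estimate of E_q[ log q(z) - log p(z) ] for comparison.
z = rng.normal(mu, sigma, size=2_000_000)
log_q = -np.log(np.sqrt(2 * np.pi) * sigma) - (z - mu) ** 2 / (2 * sigma ** 2)
log_p = -0.5 * np.log(2 * np.pi) - z ** 2 / 2
kl_mc = np.mean(log_q - log_p)
print(kl_closed, kl_mc)
```

The two numbers agree up to Monte Carlo noise, which is exactly why the closed form is preferred in the training loss: it is exact and cheap.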
&lt;p&gt;In practice, you can combine these two terms with a weighting factor as a hyperparameter, allowing you to emphasize which term&#x27;s effect is more important for your specific application. You can even schedule the weight dynamically to perform curriculum training.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;Technically, you also need to account for $\sigma_\theta(z)$.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;hr &#x2F;&gt;
&lt;h1 id=&quot;comments&quot;&gt;Comments&lt;&#x2F;h1&gt;
&lt;p&gt;The current Zola theme I am using does not support utterances. I am considering switching to &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;welpo&#x2F;tabi&quot;&gt;this theme&lt;&#x2F;a&gt; to enable utterances comments, but for now, if you have comments or want to discuss, please open a &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;breakds&#x2F;www.breakds.org&#x2F;issues&quot;&gt;GitHub issue&lt;&#x2F;a&gt; manually. Sorry about the inconvenience!&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Transformation of Probabilistic Distributions</title>
        <published>2023-04-14T00:00:00+00:00</published>
        <updated>2023-04-14T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/transformation-of-probabilistic-distribution/"/>
        <id>https://www.breakds.org/transformation-of-probabilistic-distribution/</id>
        
        <content type="html" xml:base="https://www.breakds.org/transformation-of-probabilistic-distribution/">&lt;h1 id=&quot;motivation&quot;&gt;Motivation&lt;&#x2F;h1&gt;
&lt;blockquote&gt;
&lt;p&gt;To use &lt;code&gt;rsample&lt;&#x2F;code&gt; or not to use &lt;code&gt;rsample&lt;&#x2F;code&gt;, that is the question.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;If you have ever faced this question while implementing a deep learning algorithm, for example, a policy gradient algorithm for reinforcement learning, this post is for you.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;&#x2F;strong&gt;: Please note that I am more interested in making the math intuitive rather than strict here.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;the-concepts&quot;&gt;The concepts&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;transformation-of-random-variables&quot;&gt;Transformation of Random Variables&lt;&#x2F;h2&gt;
&lt;p&gt;Our exploration starts with a math problem that you might find in the assignments of an introductory-level statistics course. Suppose we have a 1D random variable $\mathbf{X}$ following the standard Gaussian distribution,&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathbf{X} \sim \mathcal{N}(0, 1)
$$&lt;&#x2F;p&gt;
&lt;p&gt;the probability density function (pdf) of this distribution is&lt;&#x2F;p&gt;
&lt;p&gt;$$
p(x) = \frac{1}{\sqrt{2\pi}} \exp \left( -\frac{x^2}{2} \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Now, assume $\mathbf{Y}$ is another random variable that has a close relationship with $\mathbf{X}$.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathbf{Y} = \sigma \mathbf{X} + \mu =: f(\mathbf{X})
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Question&lt;&#x2F;strong&gt;: What is the pdf of $\mathbf{Y}$? What kind of distribution does $\mathbf{Y}$ follow?&lt;&#x2F;p&gt;
&lt;p&gt;To answer this question, we can start from simple principles. Let&#x27;s denote the pdf of $\mathbf{Y}$ as $q(\cdot)$. The integral of both pdf $p(\cdot)$ and $q(\cdot)$ over $\mathbb{R}$ must be equal to 1.&lt;&#x2F;p&gt;
&lt;p&gt;$$
1 = \int_{-\infty}^{+\infty}p(x) \mathrm{d}x = \int_{-\infty}^{+\infty}q(y) \mathrm{d}y
$$&lt;&#x2F;p&gt;
&lt;p&gt;Based on intuition, it is not hard to see that the above constraint relating $p(\cdot)$ and $q(\cdot)$ is too &lt;strong&gt;loose&lt;&#x2F;strong&gt;. The probability of $\mathbf{X}$ taking any value $x$ must be equal to the probability of $\mathbf{Y}$ taking the corresponding value $y = f(x)$. This means that&lt;&#x2F;p&gt;
&lt;p&gt;$$
\forall x \in \mathbb{R}, \mathbb{P} \{ X = x \} = \mathbb{P} \{ Y = f(x) \}
$$&lt;&#x2F;p&gt;
&lt;p&gt;The above equation can be expressed as an equivalence of derivative forms (intuitively, though not rigorously, you can understand these as the forms you usually write to the right of $\int$):&lt;&#x2F;p&gt;
&lt;p&gt;$$
\forall x \in \mathbb{R} \text{ and } y = f(x), p(x) \mathrm{d}x = q(y) \mathrm{d}y
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that since $y = f(x)$, and fortunately $f$ is a linear function and therefore invertible, we can now simplify the above as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
&amp;amp;&amp;amp;p(x) \mathrm{d}x &amp;amp;&amp;amp; = q(y) \mathrm{d}y \\
\Rightarrow &amp;amp;&amp;amp; p(f^{-1}(y)) \mathrm{d}(f^{-1}(y)) &amp;amp;&amp;amp; = q(y) \mathrm{d}y
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;We are now blocked by $\mathrm{d}(f^{-1}(y))$, which is a derivative form. Without a formal definition, we can simplify it with very intuitive rules. For example, in this case since $f(x) = \sigma x + \mu$,&lt;&#x2F;p&gt;
&lt;p&gt;$$
f^{-1}(y) = \frac{y - \mu}{\sigma}
$$&lt;&#x2F;p&gt;
&lt;p&gt;The form $\mathrm{d}(f^{-1}(y))$ is, in plain words, just the change of $f^{-1}(\cdot)$ within an infinitesimal neighborhood around a specific $y$. By applying a symbolic trick (dividing by $\mathrm{d}y$ and multiplying it back), we can derive:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\mathrm{d}(f^{-1}(y)) &amp;amp;&amp;amp; = &amp;amp;&amp;amp; \frac{\mathrm{d}(f^{-1}(y))}{\mathrm{d}y} \cdot \mathrm{d} y \\
&amp;amp;&amp;amp; = &amp;amp;&amp;amp; \frac{\mathrm{d}((y - \mu) &#x2F; \sigma)}{\mathrm{d}y} \cdot \mathrm{d}y \\
&amp;amp;&amp;amp; = &amp;amp;&amp;amp; \frac{1}{\sigma} \cdot \mathrm{d} y \\
&amp;amp;&amp;amp; = &amp;amp;&amp;amp; \frac{\mathrm{d}y}{\sigma}
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Please note that $\mathrm{d}(\cdot)&#x2F;\mathrm{d}y$ is just what we usually call &lt;strong&gt;the derivative w.r.t. $y$&lt;&#x2F;strong&gt;. More generally, if the transformation function $f$ is invertible and differentiable, the above equation reduces to&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{equation}
\mathrm{d}(f^{-1}(y)) = \frac{\mathrm{d} y}{f&#x27;(f^{-1}(y))}
\end{equation}
$$&lt;&#x2F;p&gt;
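This identity can be sanity-checked numerically with a nonlinear invertible function; the choice of $f(x) = e^x$ below is mine, purely for illustration:

```python
import numpy as np

# Numeric check of  d(f^{-1}(y)) = dy / f'(f^{-1}(y))  for an invertible,
# differentiable f. Here f(x) = exp(x), so f^{-1} = log and f' = exp.
f_inv = np.log
f_prime = np.exp

y = 2.3
h = 1e-6
lhs = (f_inv(y + h) - f_inv(y)) / h   # numerical derivative of f^{-1} at y
rhs = 1.0 / f_prime(f_inv(y))         # the right-hand side of the formula
print(lhs, rhs)
```

Both sides evaluate to $1 &#x2F; y$, as expected for the logarithm.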
&lt;p&gt;Now, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
&amp;amp;&amp;amp;p(x) \mathrm{d}x &amp;amp;&amp;amp; = q(y) \mathrm{d}y &amp;amp;&amp;amp; \\
\Rightarrow &amp;amp;&amp;amp; p(f^{-1}(y)) \mathrm{d}(f^{-1}(y)) &amp;amp;&amp;amp; = q(y) \mathrm{d}y &amp;amp;&amp;amp; \\
\Rightarrow &amp;amp;&amp;amp; p(f^{-1}(y)) \frac{\mathrm{d} y}{f&#x27;(f^{-1}(y))} &amp;amp;&amp;amp; = q(y) \mathrm{d}y &amp;amp;&amp;amp; \\
\Rightarrow &amp;amp;&amp;amp; p(\frac{y - \mu}{\sigma}) \cdot \frac{\mathrm{d}y}{\sigma} &amp;amp;&amp;amp; = q(y) \mathrm{d}y &amp;amp;&amp;amp; \\
\Rightarrow &amp;amp;&amp;amp; \frac{1}{\sigma} p(\frac{y - \mu}{\sigma}) &amp;amp;&amp;amp; = q(y) &amp;amp;&amp;amp; (\text{ eliminate } \mathrm{d}y)\\
\Rightarrow &amp;amp;&amp;amp; q(y) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( -\frac{(y - \mu)^2}{2 \sigma^2} \right)  &amp;amp;&amp;amp;
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Now, we have shown (somewhat intuitively and symbolically, via derivative form arithmetic) that such a linear transformation of the random variable $\mathbf{X}$ still follows a Gaussian distribution. In fact,&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathbf{Y} \sim \mathcal{N}(\mu, \sigma^2)
$$&lt;&#x2F;p&gt;
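A quick numerical check of this conclusion, assuming NumPy (the particular $\mu$ and $\sigma$ below are arbitrary):

```python
import numpy as np

mu, sigma = 1.5, 0.8
rng = np.random.default_rng(0)

# Push samples of X ~ N(0, 1) through f(x) = sigma * x + mu ...
y = sigma * rng.standard_normal(1_000_000) + mu

# ... and compare against the density from the change-of-variables formula:
#   q(y) = (1 / sigma) * p((y - mu) / sigma)
def q(v):
    u = (v - mu) / sigma
    return np.exp(-u ** 2 / 2) / (np.sqrt(2 * np.pi) * sigma)

# q should integrate to 1 (rectangle-rule integration over a wide grid) ...
grid = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20_001)
total = np.sum(q(grid)) * (grid[1] - grid[0])

# ... and the empirical moments of Y should match N(mu, sigma^2).
print(total, np.mean(y), np.std(y))
```

The integral is approximately 1, and the sample mean and standard deviation land on $\mu$ and $\sigma$ up to Monte Carlo noise.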
&lt;p&gt;Pretty straightforward, right?&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>VLAN Configuration by Examples</title>
        <published>2023-02-11T00:00:00+00:00</published>
        <updated>2023-02-11T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/vlan-configuration-by-examples/"/>
        <id>https://www.breakds.org/vlan-configuration-by-examples/</id>
        
        <content type="html" xml:base="https://www.breakds.org/vlan-configuration-by-examples/">&lt;h1 id=&quot;why-am-i-writing-this&quot;&gt;Why am I writing this?&lt;&#x2F;h1&gt;
&lt;p&gt;As I worked on upgrading my home network with a NixOS router, I found myself once again needing to
update the VLAN configuration on my Aruba Instant On 1930 PoE switch. However, I felt hesitant to do
so due to my previous struggles in grasping the concept of VLAN despite reading multiple online
articles.&lt;&#x2F;p&gt;
&lt;p&gt;Fortunately, my friend Hao recommended an informative &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zhuanlan.zhihu.com&#x2F;p&#x2F;545383921&quot;&gt;post on the topic&lt;&#x2F;a&gt; which, combined with an hour of experimentation, finally
allowed me to understand VLAN sufficiently to implement my ideas. In this post, I aim to share my
newfound practical knowledge with examples, hoping to assist others who may have encountered similar
difficulties.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer&lt;&#x2F;strong&gt;: Not being a network engineer, my understanding and explanation of VLANs are based on a simplified mental model. While I believe that this model is both easy to understand and accurate enough for practical use, it may not encompass all technical intricacies and complexities of the concept.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;example-1-an-unmanaged-switch&quot;&gt;Example 1: An Unmanaged Switch&lt;&#x2F;h1&gt;
&lt;p&gt;A switch, specifically a Layer 2 (L2) switch, is a networking device with several physical ports,
each typically featuring an RJ45 or SFF Ethernet interface. Every port is capable of connecting to a
single device and the switch operates on L2 using MAC addresses.&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8       
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+   
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          |   |   |   |   |   |   |   |   |   
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+   
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            A   B   C   D
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, if you connect devices &lt;code&gt;A&lt;&#x2F;code&gt;, &lt;code&gt;B&lt;&#x2F;code&gt;, &lt;code&gt;C&lt;&#x2F;code&gt;, and &lt;code&gt;D&lt;&#x2F;code&gt; to ports 1, 2, 3, and
4 of an unmanaged switch (see above), the devices will be interconnected. Device
&lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; can send packets to each other, as if they were directly connected with
an Ethernet cable.&lt;&#x2F;p&gt;
&lt;p&gt;When &lt;code&gt;A&lt;&#x2F;code&gt; sends a packet to port 1, it &lt;strong&gt;enters&lt;&#x2F;strong&gt; the switch, and when the packet
reaches &lt;code&gt;B&lt;&#x2F;code&gt; via port 2, it &lt;strong&gt;leaves&lt;&#x2F;strong&gt; the switch. In this post, the statement
&quot;&lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; can send packets to each other&quot; means that the packet is not
dropped by the &lt;strong&gt;VLAN rules&lt;&#x2F;strong&gt; upon entering the switch via port 1 or upon leaving
the switch via port 2.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tagged-packet-and-untagged-packet&quot;&gt;Tagged Packet and Untagged Packet&lt;&#x2F;h2&gt;
&lt;p&gt;A packet can be tagged with a VLAN ID, which is just an integer. A packet that
has a VLAN ID is called a &quot;tagged packet&quot;, and a packet that does not have a
VLAN ID is called an &quot;untagged packet&quot;.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;example-2-managed-switch-with-single-vlan&quot;&gt;Example 2: Managed Switch with Single VLAN&lt;&#x2F;h1&gt;
&lt;p&gt;In this example, let&#x27;s assume there is only one VLAN ID &lt;code&gt;10&lt;&#x2F;code&gt;, and a packet can
either be tagged with &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; or untagged.&lt;&#x2F;p&gt;
&lt;p&gt;In a managed switch, each physical port can be configured as &quot;Tagged&quot; (T),
&quot;Untagged&quot; (U), or &quot;Not Participating&quot; (blank) with respect to &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;. By
changing the interconnectivity of the ports for a specific VLAN, we can form a
virtual switch for that VLAN. Consider the example below:&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8       
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+   
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 10   | T | U | U |   |   |   |   |   |   
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+   
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            A   B   C   D
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, ports 1, 2, and 3 form a virtual switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, with port 1 being
&quot;Tagged&quot; for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, and ports 2 and 3 being &quot;Untagged&quot; for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;. We can
temporarily ignore the other ports, as they are not participating in the virtual
switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The rules for the virtual switch are straightforward. We only need to consider
the behavior of the packet when it enters and leaves the virtual switch.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;If a physical port is &quot;Tagged&quot; for VLAN 10:
&lt;ul&gt;
&lt;li&gt;It only allows packets tagged with VLAN 10 to enter.&lt;&#x2F;li&gt;
&lt;li&gt;When a packet leaves this port, it will be tagged with VLAN 10.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;If a physical port is &quot;Untagged&quot; for VLAN 10:
&lt;ul&gt;
&lt;li&gt;It only allows untagged packets to enter.&lt;&#x2F;li&gt;
&lt;li&gt;When a packet leaves this port, it will be untagged, regardless of whether
it was tagged with VLAN 10 before.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;With these rules in mind, we can understand the behavior of the packet in
different scenarios. For example:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;If device &lt;code&gt;A&lt;&#x2F;code&gt; sends an untagged packet to port 1, it will be dropped because
port 1 is a tagged port for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; and only accepts packets tagged with
&lt;code&gt;VLAN 10&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;If device &lt;code&gt;A&lt;&#x2F;code&gt; sends a packet tagged with &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; to port 1, the packet will
enter the switch and reach devices &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt; via ports 2 and 3 as untagged
packets. The tag &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; will be stripped when the packet leaves ports 2
and 3 because they are untagged ports.&lt;&#x2F;li&gt;
&lt;li&gt;If device &lt;code&gt;B&lt;&#x2F;code&gt; sends an untagged packet to port 2, it will be accepted and
delivered to device &lt;code&gt;A&lt;&#x2F;code&gt; as a packet tagged with &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, and to device &lt;code&gt;C&lt;&#x2F;code&gt;
as an untagged packet.&lt;&#x2F;li&gt;
&lt;li&gt;If device &lt;code&gt;C&lt;&#x2F;code&gt; sends a packet tagged with &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; to port 3, it will be
dropped because port 3 only accepts untagged packets.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
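The four scenarios above can be sketched as a tiny simulation. The port table and the way I represent packets are my own simplification for illustration, not a model of real switch firmware:

```python
# Port modes for the VLAN 10 virtual switch in Example 2.
PORTS = {1: "T", 2: "U", 3: "U"}   # port number to mode; other ports do not participate

def ingress(port, tag):
    """Return True if a packet with VLAN tag `tag` (None = untagged) may enter."""
    mode = PORTS.get(port)
    if mode == "T":
        return tag == 10        # tagged port: only VLAN 10 packets enter
    if mode == "U":
        return tag is None      # untagged port: only untagged packets enter
    return False                # not participating in this VLAN

def egress(port):
    """Return the tag a packet carries when it leaves `port`."""
    return 10 if PORTS.get(port) == "T" else None

# Scenario 1: untagged packet into port 1 is dropped.
print(ingress(1, None))                      # False
# Scenario 2: VLAN-10 packet into port 1 reaches ports 2 and 3 untagged.
print(ingress(1, 10), egress(2), egress(3))  # True None None
# Scenario 3: untagged packet into port 2 leaves port 1 tagged with VLAN 10.
print(ingress(2, None), egress(1))           # True 10
# Scenario 4: VLAN-10 packet into port 3 is dropped.
print(ingress(3, 10))                        # False
```

The two rules (what may enter, what tag it carries on the way out) are enough to reproduce all four scenarios.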
&lt;h1 id=&quot;example-3-managed-switch-with-2-vlans&quot;&gt;Example 3: Managed Switch with 2 VLANs&lt;&#x2F;h1&gt;
&lt;p&gt;Things become more interesting when there are multiple VLANs. This is also the
reason why people create VLANs: to form many virtual (logical) switches out of a
single physical switch device. The seemingly complicated rules are not there
just to drop packets. They exist to give the devices options to choose which
virtual switch they want a packet to be sent to.&lt;&#x2F;p&gt;
&lt;p&gt;In this example, let&#x27;s consider a switch that is configured to form two virtual
switches, one for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; and one for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 10   | T | U | U |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 20   |   |   | T | U |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          |   |   |   |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            A   B   C   D
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can treat the switch as two separate virtual switches: the virtual switch for
&lt;code&gt;VLAN 10&lt;&#x2F;code&gt; consists of ports 1, 2, and 3, and the virtual switch for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;
consists of ports 3 and 4. Devices &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; are both only connected to the
virtual switch for VLAN 10, and can only communicate with each other and with
other devices on that same virtual switch. Device &lt;code&gt;D&lt;&#x2F;code&gt; is only connected to the
virtual switch for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;, and can only communicate with other devices on
that virtual switch. Device &lt;code&gt;C&lt;&#x2F;code&gt; is connected to both virtual switches through a
shared physical port, port 3.&lt;&#x2F;p&gt;
&lt;p&gt;When each device sends packets, it can now decide which virtual switch to send
them to by tagging the packets accordingly. Once a packet enters a virtual
switch, the rules that control how it leaves the switch &lt;strong&gt;remain the same&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Device &lt;code&gt;A&lt;&#x2F;code&gt; can only send packets to the virtual switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, since
that is the only virtual switch that port 1 participates in. In order to send
packets to the virtual switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, device &lt;code&gt;A&lt;&#x2F;code&gt; must tag the packets
with &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, since port 1 is a &quot;tagged&quot; port for that virtual switch.&lt;&#x2F;li&gt;
&lt;li&gt;Similarly, device &lt;code&gt;B&lt;&#x2F;code&gt; can only send packets to the virtual switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, but since port 2 is an &quot;untagged&quot; port for that virtual switch, device
&lt;code&gt;B&lt;&#x2F;code&gt; must send untagged packets to that virtual switch. Any other type of
packet sent from device &lt;code&gt;B&lt;&#x2F;code&gt; will be dropped.&lt;&#x2F;li&gt;
&lt;li&gt;Similarly, device &lt;code&gt;D&lt;&#x2F;code&gt; can only send packets to the virtual switch for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;, and the packets must be untagged to avoid being dropped.&lt;&#x2F;li&gt;
&lt;li&gt;Device &lt;code&gt;C&lt;&#x2F;code&gt; can choose to send packets to either virtual switch by tagging the
packets appropriately. Specifically, it can send untagged packets to the
virtual switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, or packets tagged with &lt;code&gt;VLAN 20&lt;&#x2F;code&gt; to the virtual
switch for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;. Any other types of packets sent from device &lt;code&gt;C&lt;&#x2F;code&gt; will be
dropped.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
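&lt;p&gt;These sending rules can be captured in a short sketch. The &lt;code&gt;CONFIG&lt;&#x2F;code&gt; table and &lt;code&gt;classify&lt;&#x2F;code&gt; helper below are hypothetical, written only to mirror this example&#x27;s diagram:&lt;&#x2F;p&gt;

```python
# Illustrative model of the two-VLAN switch above (not a real switch API).
# For each VLAN, map member ports to their role: "T" (tagged) or "U" (untagged).
CONFIG = {
    10: {1: "T", 2: "U", 3: "U"},
    20: {3: "T", 4: "U"},
}

def classify(port, tag):
    """Return the VLAN a packet enters, or None if it is dropped.
    tag is the packet's VLAN tag, or None for an untagged packet."""
    for vlan, members in CONFIG.items():
        role = members.get(port)
        if role == "T" and tag == vlan:
            return vlan          # a matching tag enters via a tagged membership
        if role == "U" and tag is None:
            return vlan          # an untagged packet enters via an untagged membership
    return None                  # no membership accepts the packet: dropped

# Device C (port 3) chooses the virtual switch by tagging:
assert classify(3, None) == 10   # untagged -> VLAN 10 (port 3 is "U" there)
assert classify(3, 20) == 20     # tagged 20 -> VLAN 20 (port 3 is "T" there)
assert classify(1, None) is None # device A's untagged packet is dropped
```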
&lt;h1 id=&quot;example-4-sharing-two-physical-ports&quot;&gt;Example 4: Sharing Two Physical Ports&lt;&#x2F;h1&gt;
&lt;p&gt;Now, let&#x27;s put what we&#x27;ve learned into practice! Consider the following slightly
different example below:&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 10   | T | U | U | T |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 20   |   |   | T | U |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          |   |   |   |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            A   B   C   D
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Question&lt;&#x2F;strong&gt;: If device &lt;code&gt;C&lt;&#x2F;code&gt; wants to send a packet to device &lt;code&gt;D&lt;&#x2F;code&gt;, what can it do?&lt;&#x2F;p&gt;
&lt;p&gt;Device &lt;code&gt;C&lt;&#x2F;code&gt; is connected to port 3 and device &lt;code&gt;D&lt;&#x2F;code&gt; is connected to port 4. Both
ports 3 and 4 participate in both virtual switches. This means that device &lt;code&gt;C&lt;&#x2F;code&gt;
has two choices:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Device &lt;code&gt;C&lt;&#x2F;code&gt; can send an untagged packet to device &lt;code&gt;D&lt;&#x2F;code&gt; via the virtual switch
for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;. This is possible because port 4 is also a member of the
virtual switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;. However, device &lt;code&gt;D&lt;&#x2F;code&gt; will actually receive the
packet tagged with &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;, because port 4 is &quot;tagged&quot; for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Alternatively, device &lt;code&gt;C&lt;&#x2F;code&gt; can send a packet tagged with &lt;code&gt;VLAN 20&lt;&#x2F;code&gt; to device
&lt;code&gt;D&lt;&#x2F;code&gt; via the virtual switch for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;. However, device &lt;code&gt;D&lt;&#x2F;code&gt; will actually
receive the packet untagged, because port 4 is &quot;untagged&quot; for VLAN 20.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;In this way, device &lt;code&gt;C&lt;&#x2F;code&gt; has the flexibility to decide not only which virtual
switch to use but also how the packet should be tagged upon reaching device &lt;code&gt;D&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
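&lt;p&gt;Device &lt;code&gt;C&lt;&#x2F;code&gt;&#x27;s two options can be checked with a small sketch; the &lt;code&gt;deliver&lt;&#x2F;code&gt; helper is hypothetical and only models this example&#x27;s configuration:&lt;&#x2F;p&gt;

```python
# Illustrative sketch of device C's two delivery options in Example 4
# (a hypothetical helper, not a real switch API).
CONFIG = {
    10: {1: "T", 2: "U", 3: "U", 4: "T"},
    20: {3: "T", 4: "U"},
}

def deliver(src_port, tag, dst_port):
    """Return ("delivered", received_tag) if the packet reaches dst_port,
    or ("dropped", None) if no virtual switch carries it there."""
    for vlan, members in CONFIG.items():
        role = members.get(src_port)
        accepted = (role == "T" and tag == vlan) or (role == "U" and tag is None)
        if accepted and dst_port in members:
            # the egress tag depends on dst_port's role in this VLAN
            received = vlan if members[dst_port] == "T" else None
            return ("delivered", received)
    return ("dropped", None)

# Option 1: untagged from C enters VLAN 10; D receives it tagged with 10.
assert deliver(3, None, 4) == ("delivered", 10)
# Option 2: tagged 20 from C enters VLAN 20; D receives it untagged.
assert deliver(3, 20, 4) == ("delivered", None)
```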
&lt;h1 id=&quot;one-extra-rule-each-physical-port-can-only-be-untagged-once&quot;&gt;One Extra Rule: Each Physical Port Can Only Be &quot;Untagged&quot; Once&lt;&#x2F;h1&gt;
&lt;p&gt;The following configuration is &lt;strong&gt;invalid&lt;&#x2F;strong&gt; as port 3 is untagged for both &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; and &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;. Why?&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 10   | T | U | U |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 20   |   |   | U | U |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          |   |   |   |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            A   B   C   D
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is because packets are not allowed to be duplicated and sent to multiple
virtual switches. Consider the case when device &lt;code&gt;C&lt;&#x2F;code&gt; sends an untagged packet to
port 3: it is &lt;strong&gt;undecidable&lt;&#x2F;strong&gt; whether it should go into the switch for &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; or the
switch for &lt;code&gt;VLAN 20&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, a configuration where a physical port participates in multiple
virtual switches as an untagged port is not valid.&lt;&#x2F;p&gt;
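&lt;p&gt;This extra rule is easy to check mechanically. The &lt;code&gt;valid&lt;&#x2F;code&gt; helper below is a hypothetical sketch of such a check:&lt;&#x2F;p&gt;

```python
# A minimal validity check for the rule above: each physical port may be an
# untagged ("U") member of at most one VLAN. (Illustrative, not a vendor API.)
def valid(config):
    untagged_seen = set()
    for members in config.values():
        for port, role in members.items():
            if role == "U":
                if port in untagged_seen:
                    return False     # port untagged in two VLANs: ambiguous
                untagged_seen.add(port)
    return True

# The invalid configuration from the diagram above:
assert not valid({10: {1: "T", 2: "U", 3: "U"}, 20: {3: "U", 4: "U"}})
# Example 3's configuration is fine:
assert valid({10: {1: "T", 2: "U", 3: "U"}, 20: {3: "T", 4: "U"}})
```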
&lt;h1 id=&quot;final-example-designing-an-one-armed-router&quot;&gt;Final Example: Designing a One-Armed Router&lt;&#x2F;h1&gt;
&lt;p&gt;A common use case for VLANs is when you need to use a computer with a single
ethernet port as your router. Normally, a router should have at least two ports,
one for the uplink (the modem that your ISP provides) and one for the downlink
(the rest of your home devices, usually via a switch). In this example, we&#x27;ll
assume you want to connect two devices: a WiFi access point and a PC.&lt;&#x2F;p&gt;
&lt;p&gt;To do this, you will need two virtual switches spanning four ports. We can use &lt;code&gt;VLAN 10&lt;&#x2F;code&gt; to
connect the router and the uplink modem, and &lt;code&gt;VLAN 20&lt;&#x2F;code&gt; to connect the router, the
WiFi AP, and the PC.&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 10   | U | T |   |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 20   |   | T | U | U |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          |   |   |   |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          Modem |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;              Router|   PC 
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                    |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                  WiFi AP
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With this configuration, you can use the PC and WiFi AP simultaneously without
needing a multi-port router.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;bonus-example-adding-multiple-wifi-networks-with-a-single-wifi-ap&quot;&gt;Bonus Example: Adding Multiple WiFi Networks with a Single WiFi AP&lt;&#x2F;h1&gt;
&lt;p&gt;In my case, I also need to set up 3 separate WiFi networks on a single WiFi
AP: one for personal devices, one for IoT devices (i.e. smart home stuff), and
one for guests. By using VLANs to separate the personal, IoT, and guest
networks, we can ensure that devices on each network are isolated from each
other, providing an extra layer of security for our home network.&lt;&#x2F;p&gt;
&lt;p&gt;Fortunately my &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.amazon.com&#x2F;Aruba-Instant-Indoor-Access-Point&#x2F;dp&#x2F;B07V3J5TXJ&quot;&gt;WiFi
AP&lt;&#x2F;a&gt;
supports VLAN tagging, so I can create &lt;code&gt;VLAN 30&lt;&#x2F;code&gt; and &lt;code&gt;VLAN 40&lt;&#x2F;code&gt; for the IoT
devices and the guest network. This also means adding two more virtual switches to
connect the router and the WiFi AP.&lt;&#x2F;p&gt;
&lt;p&gt;A revised diagram is shown below.&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                          Ports
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;            1   2   3   4   5   6   7   8
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 10   | U | T |   |   |   |   |   |   |   (Uplink)
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 20   |   | T | U | U |   |   |   |   |   (Personal Network)
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 30   |   | T | T |   |   |   |   |   |   (IoT Network)
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;VLAN 40   |   | T | T |   |   |   |   |   |   (Guest Network)
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +---+---+---+---+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          |   |   |   |   |   |   |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          +-|-+-|-+-|-+-|-+---+---+---+---+
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;          Modem |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                |   |   |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;              Router|   PC 
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                    |
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;                  WiFi AP
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Thank you for reading, and I hope this post helps you!&lt;&#x2F;p&gt;
&lt;h1 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;&#x2F;h1&gt;
&lt;p&gt;Special thanks to &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;chat.openai.com&#x2F;chat&quot;&gt;ChatGPT&lt;&#x2F;a&gt;, an AI language model
trained by OpenAI, for helping me revise and improve this post.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Brief Notes on Tech Leading</title>
        <published>2021-07-05T00:00:00+00:00</published>
        <updated>2021-07-05T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/notes-on-management/"/>
        <id>https://www.breakds.org/notes-on-management/</id>
        
<content type="html" xml:base="https://www.breakds.org/notes-on-management/">&lt;p&gt;For quite some time, I have been leading software engineering teams. This is not a piece of advice for current or prospective tech leads; in fact, I believe that many of you are better at managing a group than I am. Nonetheless, I want to convey what I have learned from this incredible journey, in the hope of inspiring some readers.&lt;&#x2F;p&gt;
&lt;p&gt;There are no rules you can follow to reach the global optimum of software engineering management, just as there are no rules for many other kinds of art in the world. Your strategy will almost certainly be determined by the situation, the team, the individuals, the objectives, and sometimes even your own style as the leader.&lt;&#x2F;p&gt;
&lt;p&gt;However, there are certain areas of focus that I attempt to improve, and I have found that they have helped me construct a more successful and cohesive team. In general, I prefer to establish a team that can accomplish ambitious long-term goals rather than a team that can accomplish simple tasks quickly.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Engineers are most productive when they are inspired rather than forced. Do not simply ask them to do exactly what you want them to do.&lt;&#x2F;li&gt;
&lt;li&gt;Create an environment where the team is allowed to take risks, express their thoughts, and try technically demanding solutions freely. Mistakes help people grow; punishment does not.&lt;&#x2F;li&gt;
&lt;li&gt;The majority of so-called metrics are not well-defined, and an ill-defined metric is far worse than having none at all.&lt;&#x2F;li&gt;
&lt;li&gt;In terms of improving development performance, DevOps and tooling are more significant than you may realize.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Or, to summarize it all another way: try not to be a lazy manager.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Minimalist&#x27;s Kalman Filter Derivation, Part II</title>
        <published>2020-09-25T00:00:00+00:00</published>
        <updated>2020-09-25T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/kalman-filter-part-2/"/>
        <id>https://www.breakds.org/kalman-filter-part-2/</id>
        
        <content type="html" xml:base="https://www.breakds.org/kalman-filter-part-2/">&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;As promised, in this post we will derive the multi-variate
version of the Kalman filter. It will be a bit more math-intensive because
we are focusing on &lt;strong&gt;derivation&lt;&#x2F;strong&gt;, but as in the previous post I
will try my best to keep the equations intuitive and easy to
understand.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;bayes-filter&quot;&gt;Bayes Filter&lt;&#x2F;h2&gt;
&lt;p&gt;The Kalman filter is actually a special form of Bayes filter. This means
that the Bayes filter solves a (slightly) more general
problem. We will first give a high-level overview of the Bayes filter and
then add constraints to turn it into the Kalman filter problem.&lt;&#x2F;p&gt;
&lt;p&gt;We take this route because&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Although more general, the Bayes filter is actually more
straightforward to derive.&lt;&#x2F;li&gt;
&lt;li&gt;Understanding the connection between the Kalman filter and the Bayes
filter gives a much better picture of the great ideas
behind both of them.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Unlike the previous post, we are looking at a system with a
multi-variate ($n$-dimensional) state space. This means that the
system undergoes a series of states&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_1, x_2, x_3, ..., x_t, x_{t+1}, ... \in \mathbb{R}^n
$$&lt;&#x2F;p&gt;
&lt;p&gt;Similarly, we are not able to directly observe the states. What we can
do is, at each timestamp $t$, take a measurement to obtain the
observations&lt;&#x2F;p&gt;
&lt;p&gt;$$
y_1, y_2, y_3, ..., y_t, y_{t+1}, ... \in \mathbb{R}^m
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that here $m$ is not necessarily equal to $n$. Bayes filter aims
to solve the problem of estimating (the distribution of) $x_t$ given
the observed trajectory $y_{1..t}$, i.e. estimating the probability
density function (pdf):&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left(x_t \mid y_{1..t}\right) = ?
$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;solve-bayes-filter-problem&quot;&gt;Solving the Bayes Filter Problem&lt;&#x2F;h3&gt;
&lt;p&gt;Bayes filter assumes that you know&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
p\left( x_t \mid x_{t-1} \right) &amp;amp; \textrm{the transition model}\\
p\left( y_t \mid x_t \right) &amp;amp; \textrm{the measurement model} \\
p \left( x_{t-1} \mid y_{1..t-1} \right) &amp;amp; \textrm{the previous state estimation}
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;The first step is to obtain the estimation of $x_t$ purely based on
prediction (i.e. without the newest observation $y_t$). By applying
the &lt;strong&gt;transition model&lt;&#x2F;strong&gt; and the previous state estimation, we have:&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left( x_t \mid y_{1..t-1} \right) = \int_{x} p\left(x_t \mid x_{t-1} = x\right) \cdot p\left(x_{t-1} = x \mid y_{1..t-1} \right) \mathrm{d}x
$$&lt;&#x2F;p&gt;
&lt;p&gt;We then look at the &lt;strong&gt;posterior&lt;&#x2F;strong&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
p \left( x_t \mid y_{1..t} \right) &amp;amp;=&amp;amp; \frac{p\left( x_t, y_t \mid y_{1..t-1}\right)}{p\left( y_t \mid y_{1..t-1}\right)} \\
&amp;amp;\propto&amp;amp; p\left( x_t, y_t \mid y_{1..t-1}\right) \\
&amp;amp;=&amp;amp; p \left(y_t \mid x_t \right) \cdot p\left( x_t \mid y_{1..t-1} \right)
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that both terms on the RHS are known as they are just the
&lt;strong&gt;measurement model&lt;&#x2F;strong&gt; and the pure prediction estimation.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;If you recognize it - yes we are applying Bayes inference here.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;By obtaining the pdf $p\left( x_t \mid y_{1..t} \right)$ we have derived
the estimated distribution of the current state at $t$. These two
steps are the entire Bayes filter. Yes, it is just that simple
and straightforward. $\blacksquare$&lt;&#x2F;p&gt;
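&lt;p&gt;The two steps above can be sketched for a discrete (finite) state space, where the integral becomes a sum. The transition and likelihood numbers below are made up purely for illustration:&lt;&#x2F;p&gt;

```python
def bayes_step(prior, transition, likelihood):
    """One Bayes filter step on a finite state space.
    prior[i]         = p(x_{t-1} = i | y_{1..t-1})
    transition[i][j] = p(x_t = j | x_{t-1} = i)
    likelihood[j]    = p(y_t | x_t = j)
    Returns the posterior p(x_t | y_{1..t})."""
    n = len(prior)
    # Step 1: pure prediction; the integral becomes a finite sum.
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Step 2: the posterior is proportional to measurement model times prediction.
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Toy 2-state example (numbers are made up):
posterior = bayes_step(
    prior=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    likelihood=[0.7, 0.1],
)
# posterior is normalized and favors state 0, which the measurement supports.
```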
&lt;h3 id=&quot;kalman-filter-is-a-special-bayes&quot;&gt;Kalman Filter Is a Special Bayes Filter&lt;&#x2F;h3&gt;
&lt;p&gt;We say that the Kalman filter is a special form of Bayes filter because it
imposes 3 constraints on the Bayes filter, one for each of the known
conditions:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The transition model is assumed to be &lt;strong&gt;linear&lt;&#x2F;strong&gt; with &lt;strong&gt;Gaussian&lt;&#x2F;strong&gt;
error. This means that&lt;&#x2F;p&gt;
&lt;p&gt;$$
p \left( x_t \mid x_{t-1} \right) = \textrm{  pdf of } N( F_tx_{t-1}, Q_t)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Here $F_t$ is an $n \times n$ matrix, and $Q_t$ is an $n \times n$
&lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Covariance_matrix&quot;&gt;covariance
matrix&lt;&#x2F;a&gt; describing
the error. The most important properties of the covariance matrices
are that they are&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;always symmetric&lt;&#x2F;li&gt;
&lt;li&gt;always &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Definite_symmetric_matrix&quot;&gt;positive semi-definite&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;invertible whenever they are strictly positive definite, which the Kalman filter assumes&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The measurement model is also assumed &lt;strong&gt;linear&lt;&#x2F;strong&gt; with &lt;strong&gt;Gaussian&lt;&#x2F;strong&gt;
error.&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left( y_t \mid x_t \right) = \textrm{ pdf of } N(H_tx_t, R_t)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Similarly, $H_t$ is an $m \times n$ matrix and $R_t$ is an $m \times
m$ covariance matrix.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The estimated distribution of $x_t \mid y_{1..t}$ is assumed
Gaussian, i.e.&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left( x_t \mid y_{1..t} \right) = \textrm{ pdf of } N(\hat{x}_t, P_t)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Here in the above formula&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$\hat{x}_t$ is the estimated mean of the state at $t$.&lt;&#x2F;li&gt;
&lt;li&gt;$P_t$ is an $n \times n$ matrix representing the estimated
covariance of the state at $t$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We can now follow Bayes Filter&#x27;s 2 steps to solve Kalman Filter.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pure-prediction-in-kalman-filter&quot;&gt;Pure Prediction in Kalman Filter&lt;&#x2F;h2&gt;
&lt;p&gt;As shown above the first step is about computing&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left( x_t \mid y_{1..t-1} \right) = \int_{x} p\left(x_t \mid x_{t-1} = x\right) \cdot p\left(x_{t-1} = x \mid y_{1..t-1} \right) \mathrm{d}x
$$&lt;&#x2F;p&gt;
&lt;p&gt;We can continue to simplify it, since we are dealing with the Kalman filter
and we know that both pdfs involved on the RHS are &lt;strong&gt;Gaussian&lt;&#x2F;strong&gt;. Note
that if we take the &lt;strong&gt;generative model&lt;&#x2F;strong&gt; view of the above equation,
it actually says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The random variable $x_t|y_{1..t-1}$ is generated by&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Sample $x_{t-1} \mid y_{1..t-1}$ from $N(\hat{x}_{t-1}, P_{t-1})$&lt;&#x2F;li&gt;
&lt;li&gt;Sample $e_t$ from $N(0, Q_t)$&lt;&#x2F;li&gt;
&lt;li&gt;Obtain $x_{t} \mid y_{1..t-1} = F_t \cdot (x_{t-1} \mid y_{1..t-1}) + e_t$&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;For ease of reading we are going to use $x_t$ as short for the
conditional variable $x_t \mid y_{1..t-1}$ and $x_{t-1}$ as short for
the conditional random variable $x_{t-1} \mid y_{1..t-1}$.&lt;&#x2F;p&gt;
&lt;p&gt;Recall that the moment generating function for a Gaussian distribution
$N(\mu, \Sigma)$ is&lt;&#x2F;p&gt;
&lt;p&gt;$$
g(k) = \mathbb{E} \left[ e^{k^\intercal x}\right] = \exp \left[ k^\intercal\mu + \frac{1}{2} k^\intercal \Sigma k \right]
$$&lt;&#x2F;p&gt;
&lt;p&gt;By applying this we can try to obtain the moment generating function
for the random variable $x_t \mid y_{1..t-1}$ by the following:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
g(k) &amp;amp;=&amp;amp; \mathbb{E} \left[ e^{k^\intercal x_t}\right] = \mathbb{E} \left[ e^{k^\intercal (F_t x_{t-1} + e_t)} \right] \\
&amp;amp;=&amp;amp; \mathbb{E} \left[ e^{k^\intercal F_t x_{t-1}} \right] \cdot \mathbb{E} \left[ e^{k^\intercal e_t} \right] \\
&amp;amp;=&amp;amp; \mathbb{E} \left[ e^{(F_{t}^{\intercal} k )^\intercal x_{t-1}} \right] \cdot \mathbb{E} \left[ e^{k^\intercal e_t} \right] \\
&amp;amp;=&amp;amp; \exp \left[ (F_{t}^{\intercal}k)^\intercal \hat{x}_{t-1} + \frac{1}{2} (F_{t}^{\intercal}k)^\intercal P_{t-1} (F_{t}^{\intercal}k) \right] \cdot \exp \left[ \frac{1}{2} k^\intercal Q_t k\right] \\
&amp;amp;=&amp;amp; \exp \left[ k^\intercal F_t \hat{x}_{t-1} + \frac{1}{2}k^\intercal \left( F_{t} P_{t-1} F_{t}^{\intercal} + Q_t \right)k \right]
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Now it becomes super clear that $x_t \mid y_{1..t-1}$ also follows a
Gaussian distribution. In fact&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left(x_t \mid y_{1..t-1}\right) = \textrm{ pdf of } N(F_t\hat{x}_{t-1}, F_t P_{t-1} F_t^{\intercal} + Q_t)
$$&lt;&#x2F;p&gt;
&lt;p&gt;The mean and covariance determine the pure-prediction estimation.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
x_{t}&#x27; &amp;amp;=&amp;amp; F_t \hat{x}_{t-1} &amp;amp; \textrm{pure prediction mean} \\
P_{t}&#x27; &amp;amp;=&amp;amp; F_t P_{t-1} F_t^\intercal + Q_t &amp;amp; \textrm{pure prediction covariance}
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Remember these two quantities; we will use them in the next section.&lt;&#x2F;p&gt;
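&lt;p&gt;A quick numerical sketch of the prediction step $x&#x27; = F\hat{x}$, $P&#x27; = FPF^\intercal + Q$ (the matrices below are made-up toy values):&lt;&#x2F;p&gt;

```python
# Numerical sketch of the pure-prediction step with made-up matrices.
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # toy constant-velocity transition
Q = np.array([[0.1, 0.0], [0.0, 0.1]])   # process noise covariance
x_hat = np.array([2.0, 0.5])             # previous state estimate
P = np.array([[0.5, 0.1], [0.1, 0.3]])   # previous estimate covariance

x_pred = F @ x_hat                       # pure prediction mean
P_pred = F @ P @ F.T + Q                 # pure prediction covariance

# The predicted covariance stays symmetric, as a covariance must.
assert np.allclose(P_pred, P_pred.T)
```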
&lt;h2 id=&quot;posterior-in-kalman-filter&quot;&gt;Posterior in Kalman Filter&lt;&#x2F;h2&gt;
&lt;p&gt;The second step in Bayes filter is just to compute the actual
estimation called &lt;strong&gt;posterior&lt;&#x2F;strong&gt; with&lt;&#x2F;p&gt;
&lt;p&gt;$$
p \left( x_t \mid y_{1..t} \right) \propto p \left(y_t \mid x_t \right) \cdot p\left( x_t \mid y_{1..t-1} \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;You might think it is now straightforward to simplify this
equation, since we know both terms on the RHS are Gaussian pdfs whose
parameters are known. While it is true that we can directly compute
the product of the two Gaussian pdfs, that approach involves some complicated
matrix inversions that we would have to deal with. To
avoid such complexity, we choose to estimate an &lt;strong&gt;auxiliary&lt;&#x2F;strong&gt; random
variable&lt;&#x2F;p&gt;
&lt;p&gt;$$
Y = H_tx_t
$$&lt;&#x2F;p&gt;
&lt;p&gt;first. Note that $H_t$ is just a known constant matrix, and $Y$ is
basically a linear transformation of $x_t$. $Y$ has a physical
meaning as well: it is the observation value we would obtain if there were no
noise in the measurement. This also points out an important property
of $Y$:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
r_t &amp;amp;=&amp;amp; y_t - Y &amp;amp; \textrm{the measurement noise random variable} \\
r_t &amp;amp;\sim&amp;amp; N(0, R_t) &amp;amp; \textrm{as we assume Gaussian error}
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;relationship-between-y-and-x-t&quot;&gt;Relationship between $Y$ and $x_t$&lt;&#x2F;h3&gt;
&lt;p&gt;Since $Y$ is obtained by just applying a linear transformation on
$x_t$, it is obvious that if $x_t$ follows a Gaussian distribution,
$Y$ also does. Nonetheless we will try to derive it formally via
moment generating function.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
g_Y(k) &amp;amp;=&amp;amp; \mathbb{E}\left[ e^{k^\intercal Y}\right] = \mathbb{E}\left[ e^{k^\intercal H_t x_t}\right]
= \mathbb{E}\left[ e^{(H_t^\intercal k)^\intercal x_t}\right] \\
&amp;amp;=&amp;amp; \exp \left[ (H_t^\intercal k)^\intercal \mu_{x_t} + \frac{1}{2} (H_t^\intercal k)^\intercal \Sigma_{x_t} (H_t^\intercal k)\right] \\
&amp;amp;=&amp;amp; \exp \left[ k^\intercal (H_t\mu_{x_t}) + \frac{1}{2} k^\intercal (H_t \Sigma_{x_t} H_t^\intercal) k\right]
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;The above equation shows that $Y \sim N(H_t\mu_{x_t}, H_t \Sigma_{x_t}
H_t^\intercal)$.&lt;&#x2F;p&gt;
&lt;p&gt;This actually tells &lt;strong&gt;two stories&lt;&#x2F;strong&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The &lt;strong&gt;pure prediction&lt;&#x2F;strong&gt; estimation of $Y$, i.e. $Y | y_{1..t-1}$ is
actually&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
Y | y_{1..t-1} &amp;amp;\sim&amp;amp; N(y&#x27;, S) \\
y&#x27; &amp;amp;=&amp;amp; H_tx_t&#x27; \\
S &amp;amp;=&amp;amp; H_tP_t&#x27;H_t^\intercal
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The same relationship holds for the final posterior estimation of
$Y$ (i.e. $Y | y_{1..t}$) and $x_t$ (i.e. $x_t | y_{1..t}$)&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
x_t | y_{1..t} &amp;amp;\sim&amp;amp; N(\hat{x}_t, P_t) \\
Y | y_{1..t} &amp;amp;\sim&amp;amp; N(H_t\hat{x}_t, H_tP_tH_t^\intercal)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;derivation-of-the-posterior-of-y&quot;&gt;Derivation of the Posterior of $Y$&lt;&#x2F;h3&gt;
&lt;p&gt;Rewriting the posterior equation above so that it is w.r.t. $Y$, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
p \left( Y = y\mid y_{1..t} \right) \propto p \left(y_t \mid Y = y\right) \cdot p\left( Y = y\mid y_{1..t-1} \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Let us take a closer look at the RHS. The first term&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
p \left(y_t \mid Y = y \right) &amp;amp;=&amp;amp; p \left(r_t = y_t - y \right) \\
&amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[-\frac{1}{2} (y_t - y)^\intercal R_t^{-1} (y_t - y)\right] \\
&amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[-\frac{1}{2} (y - y_t)^\intercal R_t^{-1} (y - y_t)\right] \\
&amp;amp;=&amp;amp; \textrm{pdf of } N(y_t, R_t) \textrm{ w.r.t. } Y
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;And the second term is the pure prediction estimation of $Y$ which we
have already derived in the previous subsection. It is&lt;&#x2F;p&gt;
&lt;p&gt;$$
p\left( Y = y\mid y_{1..t-1} \right) = \textrm{ pdf of } N(H_tx_t&#x27;, H_tP_t&#x27;H_t^\intercal) \textrm{ w.r.t. } Y
$$&lt;&#x2F;p&gt;
&lt;p&gt;For ease of reading let&#x27;s denote both of them as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
N(y_t, R_t)_Y &amp;amp;=&amp;amp; \textrm{pdf of } N(y_t, R_t) \textrm{ w.r.t. } Y \\
N(H_tx_t&#x27;, H_tP_t&#x27;H_t^\intercal)_Y &amp;amp;=&amp;amp; \textrm{ pdf of } N(H_tx_t&#x27;, H_tP_t&#x27;H_t^\intercal) \textrm{ w.r.t. } Y
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Back to the posterior of $Y$, we can now see&lt;&#x2F;p&gt;
&lt;p&gt;$$
p \left( Y = y\mid y_{1..t} \right) \propto N(y_t, R_t)_Y \cdot N(H_tx_t&#x27;, H_tP_t&#x27;H_t^\intercal)_Y
$$&lt;&#x2F;p&gt;
&lt;p&gt;Okay, so the posterior pdf is actually the product of two Gaussian
pdfs. We now need to apply our Lemma III (proof in the Appendix),
which says that the product of two Gaussian pdfs with parameters $\mu_1$,
$\Sigma_1$, $\mu_2$ and $\Sigma_2$ is the pdf of an &lt;strong&gt;unnormalized&lt;&#x2F;strong&gt;
Gaussian, s.t.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\Sigma &amp;amp;=&amp;amp; \Sigma_2 - K\Sigma_2 \\
\mu &amp;amp;=&amp;amp; \mu_2 + K (\mu_1 - \mu_2)
\end{cases}
\textrm{ , where } K = \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}
$$&lt;&#x2F;p&gt;
&lt;p&gt;(everyone is encouraged to read the appendix for the proofs of all the
lemmas, as they are rather simple).&lt;&#x2F;p&gt;
&lt;p&gt;Now plug in what we have here&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\Sigma_1 &amp;amp;=&amp;amp; R_t &amp;amp;\textrm{ and }&amp;amp; \mu_1 &amp;amp;=&amp;amp; y_t \\
\Sigma_2 &amp;amp;=&amp;amp;  H_tP_t&#x27;H_t^\intercal &amp;amp;\textrm{ and }&amp;amp; \mu_2 &amp;amp;=&amp;amp; H_tx_t&#x27;
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;which gives the parameters for the posterior distribution of $Y|y_{1..t}$&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
K_Y &amp;amp;=&amp;amp; H_tP_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1} \\
\Sigma_Y &amp;amp;=&amp;amp; H_tP_t&#x27;H_t^\intercal - K_Y H_tP_t&#x27;H_t^\intercal \\
&amp;amp;=&amp;amp; H_tP_t&#x27;H_t^\intercal - H_tP_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1} H_tP_t&#x27;H_t^\intercal \\
\mu_Y &amp;amp;=&amp;amp; H_tx_t&#x27; + K_Y (y_t - H_tx_t&#x27;) \\
&amp;amp;=&amp;amp; H_tx_t&#x27; + H_tP_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1}(y_t - H_tx_t&#x27;)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;derivation-of-the-posterior-of-x-t&quot;&gt;Derivation of the Posterior of $x_t$&lt;&#x2F;h3&gt;
&lt;p&gt;Right, the above &lt;strong&gt;does&lt;&#x2F;strong&gt; look complicated. But you do not have to
remember it; we derived it intentionally as a stepping stone toward our
final goal of estimating the posterior distribution of $x_t$. We now
go back to look at the last equation of the previous subsection:&lt;&#x2F;p&gt;
&lt;p&gt;$$
Y | y_{1..t} \sim N(H_t\hat{x}_t, H_tP_tH_t^\intercal)
$$&lt;&#x2F;p&gt;
&lt;p&gt;It is straightforward to see that&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
H_tP_tH_t^\intercal &amp;amp;=&amp;amp; H_tP_t&#x27;H_t^\intercal - H_tP_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1} H_tP_t&#x27;H_t^\intercal \\
H_t\hat{x}_t &amp;amp;=&amp;amp; H_tx_t&#x27; + H_tP_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1}(y_t - H_tx_t&#x27;)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that this holds for whatever $H_t$ we put there. Therefore, by
stripping $H_t$ from both sides, we have derived the parameters for the
posterior estimation of $x_t$:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
P_t &amp;amp;=&amp;amp; P_t&#x27; - P_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1} H_tP_t&#x27; \\
\hat{x}_t &amp;amp;=&amp;amp; x_t&#x27; + P_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1}(y_t - H_tx_t&#x27;)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;In fact, we can further simplify it by defining $K$ (which is usually
called the &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;dsp.stackexchange.com&#x2F;questions&#x2F;2347&#x2F;how-to-understand-kalman-gain-intuitively&quot;&gt;Kalman gain&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;$$
K = P_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1}
$$&lt;&#x2F;p&gt;
&lt;p&gt;and the posterior estimation of $x_t$ can then be simplified as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
P_t &amp;amp;=&amp;amp; P_t&#x27; - KH_tP_t&#x27; \\
\hat{x}_t &amp;amp;=&amp;amp; x_t&#x27; + K(y_t - H_tx_t&#x27;)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;That concludes the derivation of the multi-variate Kalman filter. $\blacksquare$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Based on the derivation, the Kalman filter can be used to obtain the
posterior estimation following the Bayes filter&#x27;s approach. The steps
are&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Compute the pure prediction estimation parameters&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
x_t&#x27; &amp;amp;=&amp;amp; F_t \hat{x}_{t-1} \\
P_t&#x27; &amp;amp;=&amp;amp; F_tP_{t-1}F_t^\intercal + Q_t
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute the Kalman gain $K$&lt;&#x2F;p&gt;
&lt;p&gt;$$
K = P_t&#x27;H_t^\intercal(R_t + H_tP_t&#x27;H_t^\intercal)^{-1}
$$&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute the posterior&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
P_t &amp;amp;=&amp;amp; P_t&#x27; - KH_tP_t&#x27; \\
\hat{x}_t &amp;amp;=&amp;amp; x_t&#x27; + K(y_t - H_tx_t&#x27;)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
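&lt;p&gt;The three steps above can be sketched in a few lines of NumPy. This is my own illustrative sketch of the summary (the function name &lt;code&gt;kalman_step&lt;&#x2F;code&gt; and the argument order are arbitrary choices), not code from any particular library:&lt;&#x2F;p&gt;

```python
import numpy as np

def kalman_step(x_hat, P, y, F, Q, H, R):
    """One Kalman filter iteration, following steps 1-3 of the summary."""
    # Step 1: pure prediction estimation.
    x_prime = F @ x_hat
    P_prime = F @ P @ F.T + Q
    # Step 2: Kalman gain K = P' H^T (R + H P' H^T)^{-1}.
    K = P_prime @ H.T @ np.linalg.inv(R + H @ P_prime @ H.T)
    # Step 3: posterior update.
    x_new = x_prime + K @ (y - H @ x_prime)
    P_new = P_prime - K @ H @ P_prime
    return x_new, P_new
```

&lt;p&gt;Each call consumes the previous posterior $(\hat{x}_{t-1}, P_{t-1})$ and one measurement $y_t$, and returns the new posterior $(\hat{x}_t, P_t)$.&lt;&#x2F;p&gt;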
&lt;h1 id=&quot;appendix&quot;&gt;Appendix&lt;&#x2F;h1&gt;
&lt;p&gt;I want the post to be very self-contained. Therefore I prepared 3
lemmas in the appendix so that the main article won&#x27;t be too distracting.
All 3 lemmas are about the product of two Gaussian pdfs, and it is
suggested that you read them in order.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;appendix-lemma-i&quot;&gt;Appendix -  Lemma I&lt;&#x2F;h2&gt;
&lt;p&gt;The product of 2 scalar Gaussian pdfs is an &lt;strong&gt;unnormalized&lt;&#x2F;strong&gt; pdf of
another scalar Gaussian. To be more specific, if&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
f(x) &amp;amp;=&amp;amp; f_1(x)f_2(x) \textrm{, where } \\
f_1(x) &amp;amp;=&amp;amp; \textrm{pdf of } N(\mu_1, \sigma_1^2) \\
f_2(x) &amp;amp;=&amp;amp; \textrm{pdf of } N(\mu_2, \sigma_2^2)
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;then $f(x)$ is, up to a constant factor, the pdf of $N(\mu, \sigma^2)$ with&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\sigma^2 &amp;amp;=&amp;amp; \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1} \\
\mu &amp;amp;=&amp;amp; \frac{\sigma^2}{\sigma_1^2}\mu_1 +\frac{\sigma^2}{\sigma_2^2}\mu_2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;proof:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Expanding the pdfs $f_1$ and $f_2$, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
f_1(x) &amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[ -\frac{(x - \mu_1)^2}{2 \sigma_1^2} \right] \\
f_2(x) &amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[ -\frac{(x - \mu_2)^2}{2 \sigma_2^2} \right] \\
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Therefore&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
f(x) &amp;amp;=&amp;amp; f_1(x) \cdot f_2(x) \\
&amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[ -\frac{(x - \mu_1)^2}{2 \sigma_1^2} -\frac{(x - \mu_2)^2}{2 \sigma_2^2} \right] \\
&amp;amp;=&amp;amp; \textrm{const} \cdot \exp -\frac{1}{2}\left[ \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right) x^2 - 2\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right)x\right]
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Therefore we know that it must be an unnormalized Gaussian form.
Assuming the parameters are $\mu$ and $\sigma^2$, we can then force&lt;&#x2F;p&gt;
&lt;p&gt;$$
\frac{(x - \mu)^2}{\sigma^2} = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right) x^2 - 2\left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right)x + \textrm{const}
$$&lt;&#x2F;p&gt;
&lt;p&gt;which gives&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\frac{1}{\sigma^2} &amp;amp;=&amp;amp; \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \\
\frac{\mu}{\sigma^2} &amp;amp;=&amp;amp; \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Solve it and we get&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\sigma^2 &amp;amp;=&amp;amp; \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1} \\
\mu &amp;amp;=&amp;amp; \frac{\sigma^2}{\sigma_1^2}\mu_1 +\frac{\sigma^2}{\sigma_2^2}\mu_2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;This concludes the proof. $\blacksquare$&lt;&#x2F;p&gt;
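&lt;p&gt;If you want to convince yourself of Lemma I without redoing the algebra, here is a quick numeric check (all the parameter values are arbitrary): the ratio between $f_1(x)f_2(x)$ and the pdf of $N(\mu, \sigma^2)$ should be a constant that does not depend on $x$.&lt;&#x2F;p&gt;

```python
from math import sqrt, pi, exp

def gauss_pdf(x, mu, var):
    # pdf of N(mu, var) evaluated at x.
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

mu1, var1 = 1.0, 4.0
mu2, var2 = 3.0, 1.0

# Parameters given by Lemma I.
var = 1.0 / (1.0 / var1 + 1.0 / var2)
mu = var / var1 * mu1 + var / var2 * mu2

# The ratio f1(x) * f2(x) / pdf_of_N(mu, var)(x) should not depend on x.
xs = [-2.0, 0.0, 1.5, 4.0]
ratios = [gauss_pdf(x, mu1, var1) * gauss_pdf(x, mu2, var2)
          / gauss_pdf(x, mu, var) for x in xs]
```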
&lt;h2 id=&quot;appendix-lemma-ii&quot;&gt;Appendix -  Lemma II&lt;&#x2F;h2&gt;
&lt;p&gt;Lemma II is just the multi-variate version of Lemma I.&lt;&#x2F;p&gt;
&lt;p&gt;The product of 2 &lt;strong&gt;multi-variate&lt;&#x2F;strong&gt; Gaussian pdfs of $n$ dimensions is
an &lt;strong&gt;unnormalized&lt;&#x2F;strong&gt; pdf of another $n$-dimensional multi-variate Gaussian.
To be more specific, if&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
f(x) &amp;amp;=&amp;amp; f_1(x)f_2(x) \textrm{, where } \\
f_1(x) &amp;amp;=&amp;amp; \textrm{pdf of } N(\mu_1, \Sigma_1) \\
f_2(x) &amp;amp;=&amp;amp; \textrm{pdf of } N(\mu_2, \Sigma_2)
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;then $f(x)$ is, up to a constant factor, the pdf of $N(\mu, \Sigma)$ with&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\Sigma &amp;amp;=&amp;amp; (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \\
\mu &amp;amp;=&amp;amp; \Sigma \Sigma_1^{-1} \mu_1 + \Sigma \Sigma_2^{-1} \mu_2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that Lemma II is awfully similar to Lemma I.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;proof:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We again expand first, which gives&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
f_1(x) &amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[ -\frac{1}{2} (x-\mu_1)^\intercal \Sigma_1^{-1} (x - \mu_1) \right] \\
f_2(x) &amp;amp;=&amp;amp; \textrm{const} \cdot \exp \left[ -\frac{1}{2} (x-\mu_2)^\intercal \Sigma_2^{-1} (x - \mu_2) \right]
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Plug them into $f(x)$, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
f(x) &amp;amp;=&amp;amp; f_1(x) \cdot f_2(x) \\
&amp;amp;=&amp;amp; \textrm{const} \cdot \exp -\frac{1}{2} \left[ x^\intercal(\Sigma_1^{-1} + \Sigma_2^{-1})x - 2x^\intercal (\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)\right]
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;This shows that $f(x)$ is the pdf of a Gaussian. Similarly, assuming the
parameters of the Gaussian are $\mu$ and $\Sigma$, we can force&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
&amp;amp;&amp;amp;x^\intercal(\Sigma_1^{-1} + \Sigma_2^{-1})x - 2x^\intercal (\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)  \\
&amp;amp;=&amp;amp; (x - \mu)^\intercal \Sigma^{-1} (x - \mu) \\
&amp;amp;=&amp;amp; x^\intercal \Sigma^{-1} x - 2 x^\intercal \Sigma^{-1} \mu + \textrm{const}
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;This gives the equation&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\Sigma^{-1} &amp;amp;=&amp;amp; \Sigma_1^{-1} + \Sigma_2^{-1} \\
\Sigma^{-1}\mu &amp;amp;=&amp;amp; \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Solve it and we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\Sigma &amp;amp;=&amp;amp; (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \\
\mu &amp;amp;=&amp;amp; \Sigma \Sigma_1^{-1} \mu_1 + \Sigma \Sigma_2^{-1} \mu_2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;This concludes the proof of Lemma II. $\blacksquare$&lt;&#x2F;p&gt;
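&lt;p&gt;Lemma II admits the same kind of numeric sanity check as Lemma I (matrices and evaluation points below are arbitrary): the ratio between $f_1(x)f_2(x)$ and the pdf of $N(\mu, \Sigma)$ should be a constant independent of $x$.&lt;&#x2F;p&gt;

```python
import numpy as np

def mvn_pdf(x, mu, S):
    # pdf of the multi-variate Gaussian N(mu, S) evaluated at x.
    d = x - mu
    k = len(mu)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(S))
    return np.exp(-0.5 * d @ np.linalg.inv(S) @ d) / norm

S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 1.5]])
mu1 = np.array([1.0, 0.0])
mu2 = np.array([0.0, 2.0])

# Parameters given by Lemma II.
S = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
mu = S @ np.linalg.inv(S1) @ mu1 + S @ np.linalg.inv(S2) @ mu2

# The ratio f1(x) * f2(x) / pdf_of_N(mu, S)(x) should not depend on x.
points = [np.array([0.0, 0.0]), np.array([1.0, 1.5]), np.array([-1.0, 2.0])]
ratios = [mvn_pdf(x, mu1, S1) * mvn_pdf(x, mu2, S2) / mvn_pdf(x, mu, S)
          for x in points]
```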
&lt;h2 id=&quot;appendix-lemma-iii&quot;&gt;Appendix - Lemma III&lt;&#x2F;h2&gt;
&lt;p&gt;Lemma III further simplifies Lemma II with a transformation. It states
that the solution of Lemma II can be rewritten as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\Sigma &amp;amp;=&amp;amp; \Sigma_2 - K\Sigma_2 \\
\mu &amp;amp;=&amp;amp; \mu_2 + K(\mu_1 - \mu_2)
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;where&lt;&#x2F;p&gt;
&lt;p&gt;$$
K = \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;proof:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We first apply some transformation on $\Sigma$.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\Sigma^{-1} &amp;amp;=&amp;amp; \Sigma_1^{-1} + \Sigma_2^{-1} \\
&amp;amp;=&amp;amp; \Sigma_1^{-1}\Sigma_1(\Sigma_1^{-1} + \Sigma_2^{-1})\Sigma_2\Sigma_2^{-1} \\
&amp;amp;=&amp;amp; \Sigma_1^{-1} \left[ \Sigma_1(\Sigma_1^{-1} + \Sigma_2^{-1})\Sigma_2 \right] \Sigma_2^{-1} \\
&amp;amp;=&amp;amp; \Sigma_1^{-1} (\Sigma_1 + \Sigma_2) \Sigma_2^{-1} \\
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Taking the inverse of both sides, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\Sigma = \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\Sigma_1 = \Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_2
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that we implicitly used the property that covariance matrices
and their inverses are &lt;strong&gt;symmetric&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We then apply some transformation on $\mu$.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\mu &amp;amp;=&amp;amp; \Sigma \Sigma_1^{-1}\mu_1 + \Sigma \Sigma_2^{-1}\mu_2 \\
&amp;amp;=&amp;amp; \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\Sigma_1 \Sigma_1^{-1}\mu_1 + \Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_2 \Sigma_2^{-1}\mu_2 \\
&amp;amp;=&amp;amp; \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\mu_1 + \Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\mu_2 \\
&amp;amp;=&amp;amp; \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\mu_1 + (\Sigma_1 + \Sigma_2 - \Sigma_2)(\Sigma_1 + \Sigma_2)^{-1}\mu_2 \\
&amp;amp;=&amp;amp; \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\mu_1 +  \mu_2 -\Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\mu_2 \\
&amp;amp;=&amp;amp; \mu_2 + \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;It is now clear that if we define $K = \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}$, then&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mu = \mu_2 + K(\mu_1 - \mu_2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;This proves the first half of Lemma III.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Let us take another look at $\Sigma$.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\Sigma &amp;amp;=&amp;amp; \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\Sigma_1 \\
&amp;amp;=&amp;amp; K\Sigma_1 \\
&amp;amp;=&amp;amp; K(\Sigma_1 + \Sigma_2 - \Sigma_2) \\
&amp;amp;=&amp;amp; K(\Sigma_1 + \Sigma_2) - K\Sigma_2 \\
&amp;amp;=&amp;amp; \Sigma_2(\Sigma_1 + \Sigma_2)^{-1}(\Sigma_1 + \Sigma_2) - K\Sigma_2 \\
&amp;amp;=&amp;amp; \Sigma_2 - K\Sigma_2
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;The above proves the second half of Lemma III.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Therefore the proof is concluded. $\blacksquare$&lt;&#x2F;p&gt;
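&lt;p&gt;As a final sanity check, the Lemma II form and the Lemma III form of $(\mu, \Sigma)$ should agree numerically. A quick sketch with arbitrary (symmetric positive-definite) matrices:&lt;&#x2F;p&gt;

```python
import numpy as np

S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 1.5]])
mu1 = np.array([1.0, 0.0])
mu2 = np.array([0.0, 2.0])

# Lemma II form.
S_a = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
mu_a = S_a @ np.linalg.inv(S1) @ mu1 + S_a @ np.linalg.inv(S2) @ mu2

# Lemma III form, with K = Sigma_2 (Sigma_1 + Sigma_2)^{-1}.
K = S2 @ np.linalg.inv(S1 + S2)
S_b = S2 - K @ S2
mu_b = mu2 + K @ (mu1 - mu2)
```

&lt;p&gt;The Lemma III form is the one we used in the main derivation, since it avoids inverting $\Sigma_1$ and $\Sigma_2$ individually.&lt;&#x2F;p&gt;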
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Minimalist&#x27;s Kalman Filter Derivation, Part I</title>
        <published>2020-08-03T00:00:00+00:00</published>
        <updated>2020-08-03T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/kalman-filter-part-1/"/>
        <id>https://www.breakds.org/kalman-filter-part-1/</id>
        
        <content type="html" xml:base="https://www.breakds.org/kalman-filter-part-1/">&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;&#x2F;h2&gt;
&lt;p&gt;State estimation has many applications in general robotics, for
example autonomous driving localization and environment prediction.
The Kalman filter is a classical yet powerful algorithm that tackles such
problems beautifully. Although there are already many articles,
textbooks and papers on how to derive the algorithm, I found most of
them too heavy on the theoretical side and hard for a
first-time learner who comes from an engineering background to follow.
Therefore, I will shamelessly attempt to fill this hole with a series
of posts.&lt;&#x2F;p&gt;
&lt;p&gt;This is going to be the first post of the series, focusing only on
the one-dimensional case. Future posts will talk about the multi-variate
version of the Kalman filter.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Spoiler:&lt;&#x2F;strong&gt; there will be a lot of math equations. But rest assured,
nothing will exceed the level of basic calculus.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;single-step-prediction&quot;&gt;Single Step Prediction&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s say we have a one-dimensional linear Markovian system, whose
transition function is known. This means&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The state of the system can be represented as a single scalar.
Let&#x27;s denote it as $x$.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The transition function is a linear function, where the next state
only depends on the current state. Therefore we can write the
transition function as&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_{t+1} = a  \cdot x_t
$$&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Suppose you have an estimation of $x_t$ in the form of a Gaussian
distribution&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_t \sim N(\hat{x}_t, \sigma_t^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Based on that, what is your best-effort guess about the next state
$x_{t+1}$? First of all, the reason that we &lt;strong&gt;can&lt;&#x2F;strong&gt; make such a
prediction of the next state is because the transition function
actually reveals the relationship between $x_t$ and $x_{t+1}$, which
happens to be &lt;strong&gt;linear&lt;&#x2F;strong&gt; in this case. I know that &lt;strong&gt;intuitively&lt;&#x2F;strong&gt;,
you would guess the answer immediately:&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_{t+1} \sim N(a\hat{x}_t, a^2\sigma_t^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;And that is the correct answer. But how do you prove that? Or more
generally, if we have a random variable $X \sim N(\mu, \sigma^2)$, can
we prove that another random variable that satisfies $Y = aX$ actually
follows the distribution $Y \sim N(a\mu, a^2\sigma^2)$?&lt;&#x2F;p&gt;
&lt;p&gt;Let&lt;&#x2F;p&gt;
&lt;p&gt;$$
\phi(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{1}{2}x^2}
$$&lt;&#x2F;p&gt;
&lt;p&gt;be the &lt;strong&gt;pdf&lt;&#x2F;strong&gt; (probability density function) of a standard gaussian
distribution $N(0, 1)$. It is easy to derive that the &lt;strong&gt;pdf&lt;&#x2F;strong&gt; of a
general gaussian distribution $N(\mu, \sigma^2)$ that $X$ follows
would be&lt;&#x2F;p&gt;
&lt;p&gt;$$
f_X(x) = \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2\sigma^2}(x-\mu)^2} = \frac{1}{\sigma}\phi \left( \frac{x - \mu}{\sigma} \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Using the trick called &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Differential_form&quot;&gt;differential form&lt;&#x2F;a&gt;, the probability of $X$ taking a specific value $x$ is&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathbb{P} [X = x] = f_X(x) \mathrm{d}x = \frac{1}{\sigma}\phi \left( \frac{x - \mu}{\sigma} \right) \mathrm{d} x
$$&lt;&#x2F;p&gt;
&lt;p&gt;Okay, so what does the &lt;strong&gt;pdf&lt;&#x2F;strong&gt; of $Y$ (i.e. $\mathbb{P}(Y = y)$) look
like? It turns out that we can easily derive it with a bit of
transformation:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\mathbb{P} \left[ Y = y \right] &amp;amp;=&amp;amp; \mathbb{P} \left[ X = \frac{y}{a} \right] \\\
&amp;amp;=&amp;amp; f_X \left(\frac{y}{a}\right) \mathrm{d}x \\\
&amp;amp;=&amp;amp; f_X \left(\frac{y}{a}\right) \frac{\mathrm{d}y}{a} \\\
&amp;amp;=&amp;amp; \frac{1}{\sigma}\phi \left( \frac{\frac{y}{a} - \mu}{\sigma} \right) \frac{\mathrm{d}y}{a} \\\
&amp;amp;=&amp;amp; \frac{1}{a\sigma}\phi \left( \frac{y - a\mu}{a\sigma} \right) \mathrm{d}y
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;This basically shows that $Y$&#x27;s &lt;strong&gt;pdf&lt;&#x2F;strong&gt; is nothing but the &lt;strong&gt;pdf&lt;&#x2F;strong&gt; of
$N(a\mu, a^2\sigma^2)$, which concludes the proof.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;An intuitive perspective that helps understanding here: $f_X(x) \mathrm{d}x$ is actually an area, whose &lt;strong&gt;fundamental unit&lt;&#x2F;strong&gt; is probability!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;The conclusion proved above enables us to solve the prediction problem
from the very beginning of this section,&lt;&#x2F;p&gt;
&lt;p&gt;$$
\textrm{based on } x_{t+1} = a  \cdot x_t \textrm{ and } x_t \sim N(\hat{x}_t, \sigma_t^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;$$
\textrm{we can predict } x_{t+1} \sim N(a\hat{x}_t, a^2\sigma_t^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;equivalently, this means we can obtain estimation of $x_{t+1}$ (without observing it) as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\hat{x}_{t+1} &amp;amp;=&amp;amp; a\hat{x}_t \\
\sigma_{t+1}^2 &amp;amp;=&amp;amp; a^2 \sigma_t^2
\end{cases}
$$&lt;&#x2F;p&gt;
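&lt;p&gt;The scaling property we just proved can also be checked with a quick Monte Carlo simulation (the parameter values and sample size below are arbitrary choices of mine):&lt;&#x2F;p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
a, mu, sigma = 2.0, 1.0, 0.5
n = 200_000

# Sample X ~ N(mu, sigma^2) and scale it; Y = a * X should be
# distributed as N(a * mu, a^2 * sigma^2).
x = rng.normal(mu, sigma, size=n)
y = a * x
```

&lt;p&gt;The empirical mean and standard deviation of &lt;code&gt;y&lt;&#x2F;code&gt; land close to $a\mu$ and $a\sigma$, as the proof predicts.&lt;&#x2F;p&gt;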
&lt;h2 id=&quot;uncertainty-in-transition-function&quot;&gt;Uncertainty in Transition Function&lt;&#x2F;h2&gt;
&lt;p&gt;Now it is time to introduce one variation on top of the simplest case
we discussed in the previous section. In reality the transition is
usually not perfect, which means that there is an error associated
with it. Mathematically, it means&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_{t+1} = a \cdot x_t + e_t
$$&lt;&#x2F;p&gt;
&lt;p&gt;As usual, for simplicity we assume the error is a zero-mean random
variable that follows a Gaussian distribution, i.e.&lt;&#x2F;p&gt;
&lt;p&gt;$$
e_t \sim N \left(0, \sigma_{e_t}^2 \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;How should we revise our prediction under such condition? Remember we
are still solving the following question - if we already have an
estimation of $x_t$ as&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_t \sim N(\hat{x}_t, \sigma_t^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;what is a good estimation of $x_{t+1}$, given that we know (although
not precisely in this case) the transition function?&lt;&#x2F;p&gt;
&lt;p&gt;Here we are going to introduce another useful idea - the &lt;strong&gt;generative
model&lt;&#x2F;strong&gt;. The &lt;strong&gt;generative model&lt;&#x2F;strong&gt; basically describes the procedure to
get a sample value of a random variable. In this particular case, the
generative model of $x_{t+1}$ consists of:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Sample $a \cdot x_t$ out of the distribution $N(a\hat{x}_t, a^2\sigma_t^2)$ (Note that this is the conclusion from the previous section)&lt;&#x2F;li&gt;
&lt;li&gt;Sample $e_t$ out of the distribution $N(0, \sigma_{e_t}^2)$&lt;&#x2F;li&gt;
&lt;li&gt;Construct $x_{t+1}$ by adding the two sampled values up&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;So you can see that the &lt;strong&gt;generative model&lt;&#x2F;strong&gt; is basically an
interpretation of the problem formulation, and provides no new knowledge
at all. However, with such an interpretation it is clear to see that
$x_{t+1}$ as a random variable is basically the sum of two
&lt;strong&gt;independently distributed&lt;&#x2F;strong&gt; Gaussian random variables!&lt;&#x2F;p&gt;
&lt;p&gt;I will cheat here by referring to the &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sum_of_normally_distributed_random_variables#:~:text=Independent%20random%20variables,-Let%20X%20and&amp;amp;text=This%20means%20that%20the%20sum,squares%20of%20the%20standard%20deviations&quot;&gt;generating function based
proof&lt;&#x2F;a&gt;
from wikipedia. As you have probably guessed, the conclusion is that&lt;&#x2F;p&gt;
&lt;p&gt;$$
\textrm{if independent } \begin{cases}
X &amp;amp;\sim N(\mu_X, \sigma_X^2) \\
Y &amp;amp;\sim N(\mu_Y, \sigma_Y^2)
\end{cases} \textrm{ then } Z = X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;By plugging in our generative model, we obtain&lt;&#x2F;p&gt;
&lt;p&gt;$$
x_{t+1} \sim N(a\hat{x}_t, a^2\sigma_t^2 + \sigma_{e_t}^2)
$$&lt;&#x2F;p&gt;
&lt;p&gt;which means such &lt;strong&gt;good prediction&lt;&#x2F;strong&gt; would be&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\hat{x}_{t+1} &amp;amp;=&amp;amp; a\hat{x}_t \\
\sigma_{t+1} &amp;amp;=&amp;amp; a^2 \sigma_t^2 + \sigma_{e_t}^2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Yep, just add the variance of the error to the estimated variance. Pretty simple, right?&lt;&#x2F;p&gt;
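&lt;p&gt;A Monte Carlo check of this conclusion, following the generative model step by step (parameter values and sample size are again my own arbitrary choices):&lt;&#x2F;p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
a, x_hat, sigma, sigma_e = 0.9, 2.0, 0.4, 0.3
n = 200_000

# Follow the generative model: sample x_t, apply the transition, add noise.
x_t = rng.normal(x_hat, sigma, size=n)
e_t = rng.normal(0.0, sigma_e, size=n)
x_next = a * x_t + e_t

# Prediction from the formulas above: N(a * x_hat, a^2 sigma^2 + sigma_e^2).
pred_mean = a * x_hat
pred_var = a ** 2 * sigma ** 2 + sigma_e ** 2
```

&lt;p&gt;The empirical mean and variance of &lt;code&gt;x_next&lt;&#x2F;code&gt; match &lt;code&gt;pred_mean&lt;&#x2F;code&gt; and &lt;code&gt;pred_var&lt;&#x2F;code&gt; closely.&lt;&#x2F;p&gt;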
&lt;h2 id=&quot;let-there-be-observations&quot;&gt;Let There Be Observations&lt;&#x2F;h2&gt;
&lt;p&gt;So we know how to predict $x_{t+1}$ given $x_t$, which is great. This
means that if we happen to know the initial state of the system,
$x_0$, we can start to predict $x_1$, and then $x_2$, ..., till any
$x_t$, which will be&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\hat{x}_t &amp;amp;=&amp;amp; a^t\hat{x}_0 \\
\sigma_t^2 &amp;amp;=&amp;amp; a^2 \cdot (a^2 \cdot ( a^2 \cdots) + \sigma_{e_{t-2}}^2) + \sigma_{e_{t-1}}^2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;As we can see, there is one fatal problem in the above prediction. As
$t$ grows, our estimation becomes less precise, because the variance
grows very quickly: each time we make the
prediction for one more step, the variance of the error is added to
it. To see it more clearly, when $a=1$, we will have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\sigma_t^2 = \sigma_{e_0}^2 + \sigma_{e_1}^2 + \cdots + \sigma_{e_{t-1}}^2
$$&lt;&#x2F;p&gt;
&lt;p&gt;It is easy to understand this &lt;strong&gt;error accumulation&lt;&#x2F;strong&gt; intuitively. As
we make more predictions, we are not getting new information about the
system. Think about it - if you only know what a cat looks like when
it is 1 week old, how can you precisely predict what it looks like when
it is 3 years old? If you only know your weight before COVID-19
kept us at home, how do you precisely estimate your current weight?
The key here is that you need &lt;strong&gt;constant&lt;&#x2F;strong&gt; feedback to guide your
estimation as it drifts.&lt;&#x2F;p&gt;
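&lt;p&gt;The accumulation is just the recurrence $\sigma_{t+1}^2 = a^2\sigma_t^2 + \sigma_{e_t}^2$ applied repeatedly. A tiny sketch with $a = 1$ and made-up error variances:&lt;&#x2F;p&gt;

```python
# Recurrence sigma_{t+1}^2 = a^2 * sigma_t^2 + sigma_{e_t}^2 with a = 1:
# every pure-prediction step adds the transition error variance.
a = 1.0
var = 0.0                      # variance of the initial (exactly known) state
error_vars = [0.1, 0.2, 0.15]  # made-up sigma_{e_t}^2 values
for ve in error_vars:
    var = a ** 2 * var + ve
# var is now the sum 0.1 + 0.2 + 0.15 of all error variances
```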
&lt;p&gt;Okay, let&#x27;s take one more step in the problem formulation, so that it
will be &lt;strong&gt;more realistic&lt;&#x2F;strong&gt;. Now we are allowed to take measurements of
the system via an &lt;strong&gt;observation function&lt;&#x2F;strong&gt;. In the weight example,
this translates to being allowed to weigh yourself with an electronic
scale every now and then. It seems that with the means to take
measurements, we do not need to estimate the state anymore: we can
simply observe the readings and get the precise value! Except that
in reality the measurement is usually not accurate. Therefore,
the &lt;strong&gt;observation function&lt;&#x2F;strong&gt; has an associated &lt;strong&gt;error&lt;&#x2F;strong&gt; as well.
Assuming a &lt;strong&gt;linear&lt;&#x2F;strong&gt; observation function, we can write the readings
mathematically as:&lt;&#x2F;p&gt;
&lt;p&gt;$$
y_t = h_t \cdot x_t + r_t, \textrm{ where } r_t \sim N \left(0, \sigma_{r_t}^2 \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that when you take a measurement at time $t + 1$, you can directly
observe $y_{t+1}$ (though $y_{t+1}$ is not $x_{t+1}$, and the latter
is what we want to estimate). The question now becomes: how do we make a
good estimation of $x_{t+1}$, given&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A good estimation of the previous state $x_t$, and&lt;&#x2F;li&gt;
&lt;li&gt;The current measurement reading $y_{t+1}$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Let&#x27;s first find the &lt;strong&gt;generative model&lt;&#x2F;strong&gt; interpretation of this. We
can see that $y_{t+1}$ is generated in the following 3 steps:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Sample $x_{t+1} = a \cdot x_t + e_t$ out of the distribution $N(a\hat{x}_t, a^2\sigma_t^2 + \sigma_{e_t}^2)$ (Note that this is the conclusion from the previous section)&lt;&#x2F;li&gt;
&lt;li&gt;Sample $r_{t+1}$ out of the distribution $N(0, \sigma_{r_{t+1}}^2)$&lt;&#x2F;li&gt;
&lt;li&gt;Directly compute $y_{t+1} = h_{t+1} \cdot x_{t+1} + r_{t+1}$&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Let&#x27;s stop for a while to take a closer look at the above generative
model. The distribution $N(a\hat{x}_t, a^2\sigma_t^2 + \sigma_{e_t}^2)$ comes from the conclusion of the previous section,
which represents the best&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; estimation of $x_{t+1}$ we can get based on
&lt;strong&gt;pure prediction&lt;&#x2F;strong&gt;. The mean and variance of this distribution will
be used quite a lot in the derivation below, so it is good to give them
names. Let&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
x&#x27;_{t+1} &amp;amp;=&amp;amp; a\hat{x}_t \\
\sigma&#x27;^2_{t+1} &amp;amp;=&amp;amp; a^2 \sigma_t^2 + \sigma_{e_t}^2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that both $x&#x27;_{t+1}$ and $\sigma&#x27;^2_{t+1}$ are deterministic
values, i.e. neither of them is a random variable.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;I am being informal here as we haven&#x27;t formally defined what
&lt;strong&gt;the best&lt;&#x2F;strong&gt; estimation means. We will likely get to this topic in
the future, so bear with me for now.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;Although the above generative model is about generating $y_{t+1}$, it
is $x_{t+1}$ that we actually want to estimate. We can do this by
deriving the &lt;strong&gt;pdf&lt;&#x2F;strong&gt; of $x_{t+1}$. The following derivation may
seem very tedious, but I will try to be clear on each step, and
trust me, this will be the last challenge in this post!&lt;&#x2F;p&gt;
&lt;p&gt;Given that we have observed $y_{t+1} = y$, what is the probability of
$x_{t+1} = x$? Such probability can be written as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\forall x, \mathbb{P} [ x_{t+1} = x \mid y_{t+1} = y] = f_{x_{t+1}}(x) \mathrm{d}x
$$&lt;&#x2F;p&gt;
&lt;p&gt;where $f_{x_{t+1}}(x)$ is the unknown (yet) &lt;strong&gt;pdf&lt;&#x2F;strong&gt; of $x_{t+1}$ that
we want to derive. Also note that there is $\forall x$ in the
statement, which is &lt;strong&gt;very important&lt;&#x2F;strong&gt;. It means that the equation
holds for every single $x$.&lt;&#x2F;p&gt;
&lt;p&gt;By applying Bayes&#x27;s law, the left hand side can also be transformed as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\forall x,  \mathbb{P} [ x_{t+1} = x \mid y_{t+1} = y] &amp;amp;=&amp;amp; \frac{\mathbb{P}[y_{t+1}=y \mid x_{t+1} = x] \mathbb{P}[x_{t+1} = x]}{\mathbb{P}[y_{t+1} = y]} \\
&amp;amp;=&amp;amp; \frac{\mathbb{P}[r_{t+1} = y - h_{t+1}x] \mathbb{P}[x_{t+1} = x]}{\mathbb{P}[y_{t+1} = y]}
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;So there are 3 items on the right hand side. Let&#x27;s crack them one by
one.&lt;&#x2F;p&gt;
&lt;p&gt;The simplest one here is $\mathbb{P}[y_{t+1} = y]$. Since it does not
depend on $x$, this can be just written as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathbb{P}[y_{t+1} = y] = \mathrm{Const} \cdot dy
$$&lt;&#x2F;p&gt;
&lt;p&gt;Next comes $\mathbb{P}[x_{t+1} = x]$, without conditioning on the
value of $y_{t+1}$. This is the &lt;strong&gt;pure prediction&lt;&#x2F;strong&gt; we discussed
above, which $ \sim N(x&#x27;_{t+1}, \sigma&#x27;^2_{t+1})$. Therefore it is
simply&lt;&#x2F;p&gt;
&lt;p&gt;$$
\mathbb{P}[x_{t+1} = x] = \mathrm{Const} \cdot \exp\left(-\frac{(x-x&#x27;_{t+1})^2}{2\sigma&#x27;^2_{t+1}}\right) \mathrm{d} x
$$&lt;&#x2F;p&gt;
&lt;p&gt;The last one, $\mathbb{P}[r_{t+1} = y - h_{t+1}x]$, is about $r_{t+1}$,
which happens to follow a Gaussian distribution as well (even
better, its mean is zero)! This means&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\mathbb{P}[r_{t+1} = y - h_{t+1}x] &amp;amp;=&amp;amp; \mathrm{Const} \cdot \exp \left( -\frac{(y - h_{t+1}x)^2}{2\sigma^2_{r_{t+1}}}  \right) \mathrm{d}r \\
&amp;amp;=&amp;amp; \mathrm{Const} \cdot \exp \left( -\frac{(y - h_{t+1}x)^2}{2\sigma^2_{r_{t+1}}}  \right) (\mathrm{d}y - h_{t+1}\mathrm{d}x)
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that $\mathrm{d}r$ can be written in the form above because of
differential form arithmetic. It is good to understand the rules
behind it, but once you get familiar with them, they are
no stranger than the rules you use to take derivatives.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, taking the above 3 expanded components and plugging them back,
keeping in mind the differential form rule $\mathrm{d}x \wedge
\mathrm{d}x = 0$, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\forall x,  &amp;amp;&amp;amp; f_{x_{t+1}}(x) \mathrm{d}x \\
&amp;amp;=&amp;amp; \mathbb{P} [ x_{t+1} = x \mid y_{t+1} = y] \\
&amp;amp;=&amp;amp; \frac{\mathrm{Const} \cdot
\exp \left(
-\frac{(x-x&#x27;_{t+1})^2}{2\sigma&#x27;^2_{t+1}} -\frac{(y - h_{t+1}x)^2}{2\sigma^2_{r_{t+1}}}
\right) (\mathrm{d}x \wedge \mathrm{d} y - h_{t+1} \mathrm{d}x \wedge \mathrm{d}x)}
{\mathrm{Const} \cdot \mathrm{d}y} \\
&amp;amp;=&amp;amp; \mathrm{Const} \cdot \exp \left(
-\frac{(x-x&#x27;_{t+1})^2}{2\sigma&#x27;^2_{t+1}} -\frac{(y - h_{t+1}x)^2}{2\sigma^2_{r_{t+1}}}
\right) \mathrm{d} x
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s then take a closer look at the terms inside $\exp()$&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
&amp;amp;&amp;amp;-\frac{(x-x&#x27;_{t+1})^2}{2\sigma&#x27;^2_{t+1}} -\frac{(y - h_{t+1}x)^2}{2\sigma^2_{r_{t+1}}} \\
&amp;amp;=&amp;amp;
-\frac{(\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1})x^2 -
2(\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y)x + \mathrm{Const}}
{2\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}} \\
&amp;amp;=&amp;amp; -\frac{1}{2} \frac{x^2 -
2\frac{\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}x}
{\frac{\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}} + \mathrm{Const}  \\
&amp;amp;=&amp;amp; - \frac{1}{2}\frac{\left(x - \frac{\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \right)^2}
{\frac{\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}} + \mathrm{Const}
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Plugging this back into the equation above, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\forall x,  f_{x_{t+1}}(x) \mathrm{d}x &amp;amp;=&amp;amp; \mathbb{P} [ x_{t+1} = x \mid y_{t+1} = y] \\
&amp;amp;=&amp;amp; \mathrm{Const} \cdot \exp \left(  - \frac{1}{2}\frac{\left(x - \frac{\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \right)^2}
{\frac{\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}} \right) \mathrm{d} x
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Removing $\mathrm{d}x$ from both sides, we have&lt;&#x2F;p&gt;
&lt;p&gt;$$
\forall x,  f_{x_{t+1}}(x) =
\mathrm{Const} \cdot \exp \left(  - \frac{1}{2} \frac{\left(x - \frac{\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y_{t+1}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \right)^2}
{\frac{\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}} \right)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that since $y$ is just the observed value of $y_{t+1}$, it has
been replaced with $y_{t+1}$.&lt;&#x2F;p&gt;
&lt;p&gt;This means that $x_{t+1}$ follows a Gaussian distribution! We can even
read off the mean and the variance of the estimate directly from the
formula above, i.e.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\hat{x}_{t+1} &amp;amp;=&amp;amp; \frac{\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y_{t+1}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \\
\sigma^2_{t+1} &amp;amp;=&amp;amp; \frac{\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;h2 id=&quot;making-sense-of-the-result&quot;&gt;Making Sense of the Result&lt;&#x2F;h2&gt;
&lt;p&gt;The answer above still looks very complicated, so let me try to
interpret it in a more intuitive way in this section.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s start with the mean&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{darray}{rcl}
\hat{x}_{t+1} &amp;amp;=&amp;amp; \frac{\sigma^2_{r_{t+1}} x&#x27;_{t+1} + \sigma&#x27;^2_{t+1}h_{t+1}y_{t+1}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \\
&amp;amp;=&amp;amp; \frac{\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \cdot x&#x27;_{t+1} +
\frac{\sigma&#x27;^2_{t+1}h_{t+1}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \cdot y_{t+1} \\
&amp;amp;=&amp;amp; \frac{\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \cdot x&#x27;_{t+1} +
\frac{\sigma&#x27;^2_{t+1}h^2_{t+1}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \cdot
\frac{y_{t+1}}{h_{t+1}} \\
&amp;amp;=&amp;amp; K \cdot x&#x27;_{t+1} + (1 - K) \cdot \frac{y_{t+1}}{h_{t+1}}
\end{darray}
$$&lt;&#x2F;p&gt;
&lt;p&gt;Note that in the above formula, we let&lt;&#x2F;p&gt;
&lt;p&gt;$$
K = \frac{\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}
$$&lt;&#x2F;p&gt;
&lt;p&gt;$K$ is clearly a number between $0$ and $1$. This means that the
estimated mean of $x_{t+1}$ is actually a &lt;b&gt;weighted combination&lt;&#x2F;b&gt;
of $x&#x27;_{t+1}$ and $y_{t+1} &#x2F; h_{t+1}$. It is worth noting that&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$x&#x27;_{t+1}$ is the best guess you can have based on &lt;b&gt;pure prediction&lt;&#x2F;b&gt;&lt;&#x2F;li&gt;
&lt;li&gt;$y_{t+1} &#x2F; h_{t+1}$ is the best guess you can have based on &lt;b&gt;pure observation&lt;&#x2F;b&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;So this is basically about trusting both pieces of evidence, each with
a grain of salt. How much you trust each of them depends on the
variance of the corresponding guess: the bigger the variance, the less
trustworthy. Very reasonable, right?&lt;&#x2F;p&gt;
&lt;p&gt;What about the estimated variance of $x_{t+1}$? With $K$ as defined
above, it can be written as&lt;&#x2F;p&gt;
&lt;p&gt;$$
\sigma^2_{t+1} = K \cdot \sigma&#x27;^2_{t+1}
$$&lt;&#x2F;p&gt;
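&lt;p&gt;This is just a one-line rewrite of the variance we derived: multiplying the definition of $K$ by $\sigma&#x27;^2_{t+1}$ gives&lt;&#x2F;p&gt;
&lt;p&gt;$$
K \cdot \sigma&#x27;^2_{t+1} = \frac{\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} \cdot \sigma&#x27;^2_{t+1}
= \frac{\sigma&#x27;^2_{t+1}\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}} = \sigma^2_{t+1}
$$&lt;&#x2F;p&gt;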
&lt;p&gt;Intuitively, this just updates the pure-prediction-based variance
estimate now that we have an observation. Note that because $K &amp;lt; 1$, the
final estimated variance is going to be smaller than the pure-prediction-based
estimated variance!&lt;&#x2F;p&gt;
&lt;p&gt;At this point we can summarize the procedure of the Kalman Filter
update, i.e. how to obtain the $(t+1)$-step estimate from the $t$-step
estimate and a new observation.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Step I:&lt;&#x2F;b&gt; Compute the pure prediction based estimation.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
x&#x27;_{t+1} &amp;amp;=&amp;amp; a\hat{x}_t \\
\sigma&#x27;^2_{t+1} &amp;amp;=&amp;amp; a^2 \sigma_t^2 + \sigma_{e_t}^2
\end{cases}
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Step II:&lt;&#x2F;b&gt; Compute the combination weight $K$, which is often called the &lt;b&gt;Kalman Gain&lt;&#x2F;b&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;$$
K = \frac{\sigma^2_{r_{t+1}}}{\sigma^2_{r_{t+1}} + h^2_{t+1}\sigma&#x27;^2_{t+1}}
$$&lt;&#x2F;p&gt;
&lt;p&gt;&lt;b&gt;Step III:&lt;&#x2F;b&gt; Use Kalman Gain $K$ and observation $y_{t+1}$ to
update the pure prediction based estimation.&lt;&#x2F;p&gt;
&lt;p&gt;$$
\begin{cases}
\hat{x}_{t+1} &amp;amp;=&amp;amp; K \cdot x&#x27;_{t+1} + (1 - K) \cdot \frac{y_{t+1}}{h_{t+1}} \\
\sigma^2_{t+1} &amp;amp;=&amp;amp; K \cdot \sigma&#x27;^2_{t+1}
\end{cases}
$$&lt;&#x2F;p&gt;
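&lt;p&gt;The three steps above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical variable names (&lt;code&gt;a&lt;&#x2F;code&gt;, &lt;code&gt;h&lt;&#x2F;code&gt;, &lt;code&gt;var_e&lt;&#x2F;code&gt;, &lt;code&gt;var_r&lt;&#x2F;code&gt; stand for $a$, $h_{t+1}$, $\sigma^2_{e_t}$, and $\sigma^2_{r_{t+1}}$), not code from any library:&lt;&#x2F;p&gt;

```python
# Minimal sketch of the 1D Kalman Filter update derived above.
# All quantities are scalars; names mirror the notation of this post.

def kalman_update(x_hat, var, y_next, a, h, var_e, var_r):
    # Step I: pure-prediction-based estimation.
    x_pred = a * x_hat
    var_pred = a * a * var + var_e
    # Step II: the Kalman Gain K, i.e. the weight on the pure prediction.
    K = var_r / (var_r + h * h * var_pred)
    # Step III: blend the prediction with the observation-based guess y/h.
    x_hat_next = K * x_pred + (1 - K) * (y_next / h)
    var_next = K * var_pred
    return x_hat_next, var_next
```

&lt;p&gt;For example, with $a = h = 1$, no process noise, unit observation noise, and a prior estimate of $0$ with unit variance, a single observation $y_{t+1} = 1$ yields the estimate $0.5$ with variance $0.5$: the prediction and the observation are trusted equally.&lt;&#x2F;p&gt;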
&lt;h2 id=&quot;summarry&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;This post demonstrated the derivation of the 1D Kalman Filter, and
also touched on its intuitive interpretation. I think many of the
techniques used here, such as generative models and differential
forms, can find applications in many other situations.&lt;&#x2F;p&gt;
&lt;p&gt;However, in reality, the 1D Kalman Filter alone is rarely sufficient. This
post should have prepared you for the next journey: the multivariate
Kalman Filter. Stay tuned!&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Declarative Docker Container Service in NixOS</title>
        <published>2020-05-24T00:00:00+00:00</published>
        <updated>2020-05-24T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://www.breakds.org/declarative-docker-in-nixos/"/>
        <id>https://www.breakds.org/declarative-docker-in-nixos/</id>
        
        <content type="html" xml:base="https://www.breakds.org/declarative-docker-in-nixos/">&lt;h2 id=&quot;important-update-2020-05-24&quot;&gt;Important Update 2020.05.24&lt;&#x2F;h2&gt;
&lt;p&gt;After upgrading to NixOS 20.03, docker containers started to be
addressed by the container&#x27;s actual name instead of its systemd
service&#x27;s name. This means that to specify the database
container from the filerun web server&#x27;s container, you need to change
the value of &lt;code&gt;FR_DB_HOST&lt;&#x2F;code&gt; from &lt;code&gt;docker-filerun-mariadb.service&lt;&#x2F;code&gt; to
&lt;code&gt;filerun-mariadb&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;&#x2F;h2&gt;
&lt;p&gt;One of the biggest conveniences of NixOS is that many of the
services you may want to run are already packaged as a &quot;service&quot;. This means
that you can easily spin up a service like openssh with&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nix&quot; class=&quot;language-nix z-code&quot;&gt;&lt;code class=&quot;language-nix&quot; data-lang=&quot;nix&quot;&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;services&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;openssh&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;enable&lt;&#x2F;span&gt; &lt;span class=&quot;z-invalid z-illegal&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-constant z-language z-nix&quot;&gt;true&lt;&#x2F;span&gt;&lt;span class=&quot;z-invalid z-illegal&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In fact, you can find a whole lot of such predefined services with
&lt;code&gt;services.&lt;&#x2F;code&gt; prefix in the &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nixos.org&#x2F;nixos&#x2F;options.html#services.&quot;&gt;NixOS
Options&lt;&#x2F;a&gt; site.&lt;&#x2F;p&gt;
&lt;p&gt;I also run &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;www.filerun.com&#x2F;&quot;&gt;FileRun&lt;&#x2F;a&gt; as my NAS server
(similar to &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;nextcloud.com&#x2F;&quot;&gt;NextCloud&lt;&#x2F;a&gt; but I found FileRun to
be more user friendly and hassle-free)&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. The official &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.filerun.com&#x2F;docker&quot;&gt;setup
guide&lt;&#x2F;a&gt; illustrated how to use &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.docker.com&#x2F;compose&#x2F;&quot;&gt;Docker
Compose&lt;&#x2F;a&gt; to run the service. I found
it acceptable to run the services with docker containers, but having to use
&lt;code&gt;docker-compose&lt;&#x2F;code&gt; to manage the containers makes it &lt;strong&gt;less consistent&lt;&#x2F;strong&gt;
and &lt;strong&gt;less automatic&lt;&#x2F;strong&gt; compared with my other services.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Since the service is not managed in the NixOS configuration, I have
to manually bring it up and down with &lt;code&gt;docker-compose&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;All the other services are managed automatically, and the
declarative configuration makes them easier to manage. I want my
FileRun instance to enjoy that as well.&lt;&#x2F;li&gt;
&lt;li&gt;In the future I might want to have more container-based services.
Experimenting with nix-native docker container-based services can
be helpful for that purpose.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Therefore, I decided to write a nix service to replace the
&lt;code&gt;docker-compose&lt;&#x2F;code&gt; based solution, which is documented in this
post.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-original-docker-compose&quot;&gt;The Original Docker-Compose&lt;&#x2F;h2&gt;
&lt;p&gt;The docker-compose file (slightly adapted from the online doc provided by
FileRun) looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;version: &amp;#39;2&amp;#39;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;services:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;  db:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    image: mariadb:10.1
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    environment:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      MYSQL_ROOT_PASSWORD: filerunpasswd
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      MYSQL_USER: filerun
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      MYSQL_PASSWORD: filerunpasswd
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      MYSQL_DATABASE: filerundb
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    volumes:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      - &#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;db:&#x2F;var&#x2F;lib&#x2F;mysql
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;  web:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    image: afian&#x2F;filerun
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    environment:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      FR_DB_HOST: db
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      FR_DB_PORT: 3306
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      FR_DB_NAME: filerundb
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      FR_DB_USER: filerun
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      FR_DB_PASS: filerunpasswd
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      APACHE_RUN_USER: delegator
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      APACHE_RUN_USER_ID: 600
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      APACHE_RUN_GROUP: delegator
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      APACHE_RUN_GROUP_ID: 600
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    depends_on:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      - db
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    links:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      - db:db
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    ports:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      - &amp;quot;6000:80&amp;quot;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;    volumes:
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      - &#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;web:&#x2F;var&#x2F;www&#x2F;html
&lt;&#x2F;span&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;      - &#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;user-files:&#x2F;user-files
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It basically defines 2 docker containers, one for the database and one
for the FileRun web server itself, which is based on PHP and Apache. I
know little about either technology (part of the reason why I leave
them managed by docker containers with official images).&lt;&#x2F;p&gt;
&lt;p&gt;One thing worth emphasizing is that in order to set up the
communication between those two containers, a &lt;strong&gt;link&lt;&#x2F;strong&gt; is configured
for the web server container.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-database-container&quot;&gt;The Database Container&lt;&#x2F;h2&gt;
&lt;p&gt;With the new &lt;code&gt;docker-containers&lt;&#x2F;code&gt; option in NixOS configuration, bringing
up the MariaDB docker container is as simple as&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nix&quot; class=&quot;language-nix z-code&quot;&gt;&lt;code class=&quot;language-nix&quot; data-lang=&quot;nix&quot;&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;docker-containers&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun-mariadb&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-invalid z-illegal&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;image&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;mariadb:10.1&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;environment&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_ROOT_PASSWORD&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;randompasswd&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_USER&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_PASSWORD&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;randompasswd&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_DATABASE&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerundb&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;volumes&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;db:&#x2F;var&#x2F;lib&#x2F;mysql&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-invalid z-illegal&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is basically a direct translation of the first half of the
previous docker-compose file. Nothing interesting yet.&lt;&#x2F;p&gt;
&lt;p&gt;To verify that it actually works, let&#x27;s run &lt;code&gt;docker ps&lt;&#x2F;code&gt;, and it will
show the container with name &lt;code&gt;docker-filerun-mariadb.service&lt;&#x2F;code&gt; (note
the naming convention). We can get into the docker container with&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;bash&quot; class=&quot;language-bash z-code&quot;&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;z-source z-shell z-bash&quot;&gt;&lt;span class=&quot;z-meta z-function-call z-shell&quot;&gt;&lt;span class=&quot;z-variable z-function z-shell&quot;&gt;$&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-meta z-function-call z-arguments z-shell&quot;&gt; docker exec&lt;span class=&quot;z-variable z-parameter z-option z-shell&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-parameter z-shell&quot;&gt; -&lt;&#x2F;span&gt;it&lt;&#x2F;span&gt; docker-filerun-mariadb.service &#x2F;bin&#x2F;bash&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And once you are in the docker, the command&lt;&#x2F;p&gt;
&lt;pre class=&quot;z-code&quot;&gt;&lt;code&gt;&lt;span class=&quot;z-text z-plain&quot;&gt;mysql -u filerun -prandompasswd filerundb
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;should get you connected to the database.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;setting-up-the-bridge-networks&quot;&gt;Setting up the Bridge Networks&lt;&#x2F;h2&gt;
&lt;p&gt;By reading the documentation on &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;docs.docker.com&#x2F;network&#x2F;bridge&#x2F;&quot;&gt;docker
network&lt;&#x2F;a&gt;, it became clear to
me that I need to create a user-defined bridge network and put the two
docker containers in it, so that they can communicate with each other.
This replicates the &quot;link&quot; behavior in the docker-compose setup.&lt;&#x2F;p&gt;
&lt;p&gt;A bridge network can be created with the command &lt;code&gt;docker network create&lt;&#x2F;code&gt;. In order to ensure that such a bridge network is up, I am using
a trick that I learned from &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;kj.orbekk.com&#x2F;&quot;&gt;KJ&lt;&#x2F;a&gt; - write a
oneshot systemd service to do that.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nix&quot; class=&quot;language-nix z-code&quot;&gt;&lt;code class=&quot;language-nix&quot; data-lang=&quot;nix&quot;&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;systemd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;services&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;init-filerun-network-and-files&lt;&#x2F;span&gt; &lt;span class=&quot;z-invalid z-illegal&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;description&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;Create the network bridge filerun-br for filerun.&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;after&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;network.target&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;wantedBy&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;multi-user.target&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;serviceConfig&lt;&#x2F;span&gt;.&lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;Type&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;oneshot&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;   &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;script&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-other z-nix&quot;&gt;let&lt;&#x2F;span&gt; &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;dockercli&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span class=&quot;z-markup z-italic&quot;&gt;&lt;span class=&quot;z-punctuation z-section z-embedded z-begin z-nix&quot;&gt;${&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;config&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;virtualisation&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;docker&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;package&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-embedded z-end z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&#x2F;bin&#x2F;docker&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;           &lt;span class=&quot;z-keyword z-other z-nix&quot;&gt;in&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-other z-start z-nix&quot;&gt;&amp;#39;&amp;#39;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;             # Put a true at the end to prevent getting non-zero return code, which will
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;             # crash the whole service.
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;             check=$(&lt;span class=&quot;z-markup z-italic&quot;&gt;&lt;span class=&quot;z-punctuation z-section z-embedded z-begin z-nix&quot;&gt;${&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;dockercli&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-embedded z-end z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; network ls | grep &amp;quot;filerun-br&amp;quot; || true)
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;             if [ -z &amp;quot;$check&amp;quot; ]; then
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;               &lt;span class=&quot;z-markup z-italic&quot;&gt;&lt;span class=&quot;z-punctuation z-section z-embedded z-begin z-nix&quot;&gt;${&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;dockercli&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-embedded z-end z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; network create filerun-br
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;             else
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;               echo &amp;quot;filerun-br already exists in docker&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;             fi
&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-string z-quoted z-other z-nix&quot;&gt;           &lt;span class=&quot;z-punctuation z-definition z-string z-other z-end z-nix&quot;&gt;&amp;#39;&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-invalid z-illegal&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This ensures that the network exists whenever it is needed. Attaching
the database container to the bridge network then takes only one extra
line (see the last line of the snippet below).&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nix&quot; class=&quot;language-nix z-code&quot;&gt;&lt;code class=&quot;language-nix&quot; data-lang=&quot;nix&quot;&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;docker-containers&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun-mariadb&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-invalid z-illegal&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;image&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;mariadb:10.1&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;environment&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_ROOT_PASSWORD&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;randompasswd&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_USER&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_PASSWORD&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;randompasswd&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;MYSQL_DATABASE&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerundb&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;volumes&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;db:&#x2F;var&#x2F;lib&#x2F;mysql&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;extraDockerOptions&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;--network=filerun-br&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-invalid z-illegal&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h2 id=&quot;the-web-server-container&quot;&gt;The Web Server Container&lt;&#x2F;h2&gt;
&lt;p&gt;The web server container then follows much the same pattern as the
database container.&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nix&quot; class=&quot;language-nix z-code&quot;&gt;&lt;code class=&quot;language-nix&quot; data-lang=&quot;nix&quot;&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-variable z-parameter z-name z-nix&quot;&gt;docker-containers&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-nix&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-invalid z-illegal&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;image&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;afian&#x2F;filerun&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;environment&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-attrset-or-function z-nix&quot;&gt;{&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;FR_DB_HOST&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun-mariadb&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;  &lt;span class=&quot;z-comment z-line z-number-sign z-nix&quot;&gt;# !! IMPORTANT&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;FR_DB_PORT&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;3306&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;FR_DB_NAME&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerundb&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;FR_DB_USER&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;filerun&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;FR_DB_PASS&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;randompasswd&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;APACHE_RUN_USER&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;delegator&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;APACHE_RUN_USER_ID&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;600&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;APACHE_RUN_GROUP&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;delegator&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;APACHE_RUN_GROUP_ID&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;600&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;ports&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;6000:80&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;volumes&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;web:&#x2F;var&#x2F;www&#x2F;html&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;    &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&#x2F;home&#x2F;delegator&#x2F;filerun&#x2F;user-files:&#x2F;user-files&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;  &lt;span class=&quot;z-entity z-other z-attribute-name z-multipart z-nix&quot;&gt;extraDockerOptions&lt;&#x2F;span&gt; &lt;span class=&quot;z-keyword z-operator z-bind z-nix&quot;&gt;=&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;[&lt;&#x2F;span&gt; &lt;span class=&quot;z-string z-quoted z-double z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-string z-double z-start z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;--network=filerun-br&lt;span class=&quot;z-punctuation z-definition z-string z-double z-end z-nix&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt; &lt;span class=&quot;z-punctuation z-definition z-list z-nix&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator z-bind z-nix&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;span class=&quot;z-source z-nix&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-attrset z-nix&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-invalid z-illegal&quot;&gt;;&lt;&#x2F;span&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It joins the same bridge network. The most important line here (marked
above) sets the value of the environment variable
&lt;code&gt;&quot;FR_DB_HOST&quot;&lt;&#x2F;code&gt;. I did some experiments and found that within the same
bridge network, one container can reach another by using that container&#x27;s
name as the hostname. Since NixOS&#x27;s &lt;code&gt;docker-containers&lt;&#x2F;code&gt; module names
containers following exactly this convention, I simply put the
other container&#x27;s name there &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Important Note&lt;&#x2F;strong&gt;: If you are using NixOS 19.09 or older,
the naming convention for docker containers is different.
Nothing else needs to change; just make sure your &lt;code&gt;FR_DB_HOST&lt;&#x2F;code&gt; is
set to &lt;code&gt;docker-filerun-mariadb.service&lt;&#x2F;code&gt; instead.&lt;&#x2F;p&gt;
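&lt;p&gt;For illustration, on those older versions the only binding that would change is the following (a sketch; every other attribute stays exactly as in the snippet above):&lt;&#x2F;p&gt;
&lt;pre data-lang=&quot;nix&quot; class=&quot;language-nix z-code&quot;&gt;&lt;code class=&quot;language-nix&quot; data-lang=&quot;nix&quot;&gt;# NixOS 19.09 and older: containers are addressed by their systemd unit name
&amp;quot;FR_DB_HOST&amp;quot; = &amp;quot;docker-filerun-mariadb.service&amp;quot;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;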
&lt;p&gt;With those, everything should be up and running!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;A more comprehensive FileRun service module, built on what is
demonstrated in this article, can be found
&lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;git.breakds.org&#x2F;breakds&#x2F;nixvital&#x2F;src&#x2F;branch&#x2F;master&#x2F;modules&#x2F;services&#x2F;filerun.nix&quot;&gt;here&lt;&#x2F;a&gt;.
I omitted the details of adding options and other flexibility to the
service module, as those might be distracting here.&lt;&#x2F;p&gt;
&lt;p&gt;I found it very simple to spin up docker-container-based services
with the &lt;code&gt;docker-containers&lt;&#x2F;code&gt; module. I hope it helps you as well.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;You will need a license to actually self-host FileRun. See &lt;a rel=&quot;noopener&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;filerun.com&#x2F;pricing&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;It would be better if I could read the container&#x27;s name
directly from &lt;code&gt;config.docker-containers.filerun-mariadb&lt;&#x2F;code&gt;, so that this
would keep working even if the naming convention changes. I could
not find such an interface in the &lt;code&gt;docker-containers&lt;&#x2F;code&gt; module, though.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
        
    </entry>
</feed>
