Central limit theorem in a nutshell
In its simplest form, the central limit theorem states that the probability distribution of the sum of an increasingly large number of independent and identically distributed random variables (IIDs) approaches a Gaussian distribution.1
It seems that a bunch of concepts get involved here. But a simple example would just be tossing a coin $N$ times. The outcome of each toss, $x_i$, is considered as one of the IIDs, described by a common probability distribution $p(x_i)$. If we define $$ x_i=\begin{cases} 1,&\text{head} \\ 0,&\text{otherwise} \end{cases} \quad i=1,2,…,N $$ then the total number of heads we get, $X$, is the sum of those IIDs $$ X=\sum_{i=1}^N x_i. \tag{1} $$ Note that $X$ itself is also a random variable. The central limit theorem tells us that the probability distribution for $X$ approaches a Gaussian distribution as $N\rightarrow\infty$.
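As a quick numerical illustration of the coin example, here is a minimal sketch using NumPy (the sample sizes and variable names are just for this example). It simulates many experiments of $N$ tosses each and checks that $X$ behaves like a Gaussian variable: roughly 68%, 95%, and 99.7% of the values fall within 1, 2, and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000          # tosses per experiment
trials = 100_000  # number of experiments
p_head = 0.3      # a biased coin works just as well

# X = number of heads in each experiment (the sum of N IIDs).
X = rng.binomial(1, p_head, size=(trials, N)).sum(axis=1)

# If X is (approximately) Gaussian, about 68%, 95%, and 99.7% of the values
# should fall within 1, 2, and 3 standard deviations of the mean.
mu, sigma = X.mean(), X.std()
for s in (1, 2, 3):
    print(s, np.mean(np.abs(X - mu) < s * sigma))
```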
The beauty of this theorem is that it doesn’t matter what the original probability distribution $p(x_i)$ looks like; that is, the coin could be biased toward either side in some non-trivial way. The theorem applies as long as we are dealing with IIDs, i.e., we use the same coin for each trial, with the reasonable assumption that the outcomes are independent of each other.
The goal of this post is to make sense of what’s going on as simply as possible, while keeping things more or less self-contained.2 I’ll first review some basic concepts related to random variables, things like moments, cumulants, the characteristic function, etc., where the emphasis is on the meaning of cumulants. Then we derive the result that lies at the heart of the central limit theorem: all the cumulants of the sum are proportional to the number of IIDs.
Characteristic function and moments
To keep this post short and focused, I do need to assume you have some basic idea of what a random variable described by a probability distribution is. The thing I’m going to review is that we can learn some important properties of a probability distribution by looking at its cumulants, which can be generated from the logarithm of the characteristic function of the distribution.
Let’s start by considering a single random variable $x$ described by a probability distribution $p(x)$. The so-called characteristic function is given by the Fourier transform of the distribution 3 $$ \tilde p(k)=\langle\mathrm e^{-\mathrm ikx}\rangle=\int p(x)\mathrm e^{-\mathrm ikx}\mathrm dx, $$ where $\langle\cdots\rangle$ denotes the expectation value. The motivation for defining such a characteristic function is that it easily generates all the moments. To see this, first note that $$ \tilde p(k) =\sum_{m=0}^{\infty}\frac{(-\mathrm ik)^m}{m!}\langle x^m\rangle, $$ where $\langle x^m\rangle$ denotes the $m$-th moment of the distribution, which can then be generated by taking the $m$-th derivative of the characteristic function with respect to $k$ and setting $k=0$ afterwards: $$ \langle x^m\rangle=\mathrm i^m\frac{\mathrm d^m\tilde p(k)}{\mathrm d k^m}\bigg|_{k=0},\quad m=1,2,… $$
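As a sanity check of the moment formula, here is a sketch using SymPy, applied to the coin toss from the earlier example (where $x=1$ with probability $p$ and $0$ otherwise; the symbol names are mine):

```python
import sympy as sp

k, p = sp.symbols('k p', real=True)

# Characteristic function of a single coin toss:
# <exp(-i k x)> = (1 - p) * exp(0) + p * exp(-i k)
p_tilde = (1 - p) + p * sp.exp(-sp.I * k)

# m-th moment: i^m * (d^m/dk^m) p_tilde(k), evaluated at k = 0.
for m in range(1, 4):
    moment = sp.simplify(sp.I**m * sp.diff(p_tilde, k, m).subs(k, 0))
    print(m, moment)  # every moment of a 0/1 variable equals p
```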
Cumulants
Well, it turns out that compared to moments, cumulants are often more useful. The $m$-th cumulant of a distribution, denoted by $\langle x^m\rangle_{\text c}$, can also be generated by the characteristic function. The only change we need to make is to take the log before taking the derivative: $$ \langle x^m\rangle_{\text c}=\mathrm i^m\frac{\mathrm d^m \ln \tilde p(k)}{\mathrm d k^m}\bigg|_{k=0}. \tag{2} $$
Eq. $(2)$ can be regarded as the definition of cumulants. From this definition alone, it is not clear at all what cumulants can tell us about the distribution. Things become clearer when we start looking at the first few cumulants.
The first cumulant is the same as the first moment or the mean $$ \langle x\rangle_{\text c}=\langle x\rangle, $$ while the second cumulant is known as the variance or the square of the standard deviation $$ \langle x^2\rangle_{\text c}=\langle x^2\rangle-\langle x\rangle^2. $$ So the first cumulant tells us where the distribution is centered and the square root of the second tells us the typical size of the spread around the mean. See Fig. 1. Higher-order cumulants give more subtle details of the shape of the distribution, but you will see that the first two are sufficient for our purpose.
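To get a feel for cumulants numerically, here is a small sketch using SciPy’s `kstat` (an unbiased estimator of cumulants from samples); the exponential distribution is chosen only because it is visibly asymmetric, and its $m$-th cumulant happens to be $(m-1)!$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Samples from an exponential distribution with scale 1,
# whose m-th cumulant is (m - 1)!.
x = rng.exponential(scale=1.0, size=1_000_000)

print(stats.kstat(x, 1))  # first cumulant:  the mean           ~ 1
print(stats.kstat(x, 2))  # second cumulant: the variance       ~ 1
print(stats.kstat(x, 3))  # third cumulant:  measures asymmetry ~ 2
```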
Many independent random variables
Cumulants become more useful when we consider many random variables, where the above formalism can be easily generalized. The joint characteristic function of many random variables is given by $$ \tilde P(\boldsymbol k)=\int P(\boldsymbol x)\mathrm e^{-\mathrm i\boldsymbol k\cdot\boldsymbol x}\mathrm d^N\boldsymbol x, \tag{3} $$ where $P(\boldsymbol x)$ is the joint probability distribution of the $N$ random variables collected in the vector $\boldsymbol x=(x_1,…,x_N)$, and the vector $\boldsymbol k=(k_1,…,k_N)$ is defined similarly. Note that we use upper and lower case $p$ for different distribution functions.
Then various cumulants can be generated in a similar way $$ \langle x_1^{m_1}\cdots x_N^{m_N}\rangle_{\text c} =\mathrm i^{m_1+\cdots+m_N}\frac{\partial^{m_1}}{\partial k_1^{m_1}}\cdots\frac{\partial^{m_N}}{\partial k_N^{m_N}}\ln \tilde P(\boldsymbol k)\bigg|_{\boldsymbol k=0}. \tag{4} $$
I know that writing down such a huge and general formula is intimidating. But it starts to make sense once we apply it to some simple cases. So let us consider independent random variables. Independence means that the joint probability distribution is given by the product of the individual probability distributions $$ P(\boldsymbol x)=\prod_{i=1}^Np_i(x_i). $$ Then the Fourier integral in Eq. $(3)$ factorizes, and hence the joint characteristic function also becomes a product $$ \tilde P(\boldsymbol k) =\prod_{i=1}^N\int p_i(x_i)\mathrm e^{-\mathrm ik_ix_i}\mathrm dx_i=\prod_{i=1}^N\tilde p_i(k_i). $$ Taking the logarithm $$ \ln\tilde P(\boldsymbol k)=\sum_{i=1}^N\ln\tilde p_i(k_i) \tag{5} $$ and plugging Eq. $(5)$ into $(4)$, it is not hard to see that all the cumulants involving different variables vanish: since each term in the sum depends on a single $k_i$, any mixed partial derivative gives zero, $$ \langle x_1^{m_1}\cdots x_N^{m_N}\rangle_{\text c}=0, \quad\text{if two or more $m_i\neq 0$}. $$
For example, the cumulant involving two different independent random variables (known as the covariance) vanishes: $$ \langle x_ix_j\rangle_{\text c}=\langle x_ix_j\rangle-\langle x_i\rangle\langle x_j\rangle=0. $$ Independence implies no correlation. So, roughly speaking, cumulants involving different variables measure the correlations between them.
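A quick numerical check of the vanishing covariance, as a sketch with two independent variables drawn from unrelated distributions (chosen arbitrarily; the sampling noise scales like $1/\sqrt{n}$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Two independent random variables with different distributions.
x = rng.exponential(1.0, size=n)
y = rng.uniform(-1.0, 1.0, size=n)

# The covariance <x y> - <x><y> should vanish up to sampling noise.
print(np.mean(x * y) - np.mean(x) * np.mean(y))  # ~ 0
```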
Cumulants of the sum of IIDs
Finally, we are ready to prove the central limit theorem for IIDs. Here we simplify things further and consider $N$ random variables that are not only independent but also identically distributed.
Remember that we want to determine the probability distribution of the sum of the IIDs, $X$, given by Eq. $(1)$. The IIDs have a common probability distribution $p(x_i)$, so the joint probability distribution is $$ P_{\text{IIDs}}(\boldsymbol x)=\prod_{i=1}^Np(x_i). $$ Since $\mathrm e^{-\mathrm ikX}=\prod_{i=1}^N\mathrm e^{-\mathrm ikx_i}$ and the $x_i$ are independent, the characteristic function of $X$ factorizes as $$ \tilde p_X(k) =\prod_{i=1}^N\int p(x_i)\mathrm e^{-\mathrm ikx_i}\mathrm dx_i =[\tilde p(k)]^N, $$ where $\tilde p(k)$ is the common characteristic function of the IIDs. So $$ \ln\tilde p_X(k)=N\ln\tilde p(k). $$ Since all the cumulants are generated from the logarithm of the characteristic function according to Eq. $(2)$, the $m$-th cumulant of $X$ is simply $$ \boxed{\langle X^m\rangle_{\text c}=N\langle x^m\rangle_{\text c}} \tag{6} $$ where $\langle x^m\rangle_{\text c}$ denotes the common $m$-th cumulant of the IIDs.
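Here is a numerical check of the boxed result, again a sketch with exponentially distributed IIDs whose $m$-th cumulant is $(m-1)!$ (the estimator `scipy.stats.kstat` supports cumulants up to fourth order):

```python
import math

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

N = 50            # number of IIDs in each sum
trials = 200_000  # number of independent realizations of the sum

# IIDs drawn from an exponential distribution with scale 1.
x = rng.exponential(1.0, size=(trials, N))
X = x.sum(axis=1)

# The m-th cumulant of X should be N times the m-th cumulant of a single x_i.
for m in (1, 2, 3):
    print(m, stats.kstat(X, m), N * math.factorial(m - 1))
```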
When cumulants are extensive
The boxed equation above is what I mean by extensive cumulants, i.e., all the cumulants are proportional to $N$. Recall that, as illustrated in Fig. 1, each cumulant defines a length scale that determines a certain property of the distribution. The first one determines where the distribution is centered, while all the others determine properties of its shape.
Let us define the length scale associated with each cumulant as $$ \ell_m=\left(\langle X^m\rangle_{\text c}\right)^{1/m}. $$ Then from Eq. $(6)$, $$ \ell_m\propto N^{1/m}. $$ For example, $$ \ell_1=\langle X\rangle_{\text c}=\langle X\rangle=N\langle x\rangle\propto N $$ is the mean and $$ \ell_2=\sqrt{\langle X^2\rangle_{\text c}}=\sqrt{N\langle x^2\rangle_{\text c}}\propto \sqrt N $$ is the standard deviation.
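Plugging in numbers makes the scaling easy to see. A small sketch, again assuming exponentially distributed IIDs with $\langle x^m\rangle_{\text c}=(m-1)!$:

```python
import math

# Cumulants of a single exponential variable with scale 1: kappa_m = (m - 1)!.
kappa = {m: math.factorial(m - 1) for m in (1, 2, 3)}

for N in (10, 1_000, 100_000):
    ell = {m: (N * kappa[m]) ** (1 / m) for m in (1, 2, 3)}
    # Relative to the mean, the spread shrinks like 1/sqrt(N);
    # relative to the spread, the next length scale shrinks like N**(-1/6).
    print(N, ell[2] / ell[1], ell[3] / ell[2])
```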
As $N\rightarrow\infty$, $\ell_1$ dominates. But $\ell_1$, i.e. the mean, does not tell us anything about the shape of the distribution. So we also need to include $\ell_2$, the standard deviation. All the other $\ell_m$’s become unimportant in this limit: for $m>2$, $\ell_m/\ell_2\propto N^{1/m-1/2}\rightarrow 0$, so the higher cumulants are negligible compared to the spread. Now you should realize that a distribution characterized only by its mean and standard deviation is a Gaussian distribution.
More formally, if only the first two cumulants of a probability distribution are nonzero, the logarithm of the characteristic function contains only a linear and a quadratic term in $k$, hence $$ \tilde p_X(k)=\exp\left(-\mathrm i\ell_1k-\frac{\ell_2^2k^2}{2}\right). $$ The probability distribution can then be recovered from the inverse Fourier transform $$ p_X(X)=\frac{1}{2\pi}\int\tilde p_X(k)\mathrm e^{\mathrm ikX}\mathrm dk =\frac{1}{\sqrt{2\pi\ell_2^2}}\exp\left[-\frac{(X-\ell_1)^2}{2\ell_2^2}\right], $$ which is exactly a Gaussian distribution.
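If you want to double-check the inverse Fourier transform numerically, here is a minimal sketch (the values of $\ell_1$ and $\ell_2$ are arbitrary, and the integral is approximated by a plain Riemann sum over a finite range of $k$):

```python
import numpy as np

ell1, ell2 = 2.0, 0.7  # arbitrary mean and standard deviation

# Characteristic function whose logarithm has only a linear and a quadratic term.
k, dk = np.linspace(-40, 40, 200_001, retstep=True)
p_tilde = np.exp(-1j * ell1 * k - 0.5 * (ell2 * k) ** 2)

# Inverse Fourier transform at a few values of X, by direct numerical integration,
# compared with the Gaussian density with mean ell1 and standard deviation ell2.
for X in (1.0, 2.0, 3.0):
    p_num = (p_tilde * np.exp(1j * k * X)).sum().real * dk / (2 * np.pi)
    p_gauss = np.exp(-(X - ell1) ** 2 / (2 * ell2**2)) / np.sqrt(2 * np.pi * ell2**2)
    print(X, p_num, p_gauss)  # the two columns agree
```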
1. The conditions (independent & identical) can be relaxed to some extent. But to keep things simple, here we won’t touch on those generalizations. ↩︎
2. I’ll closely follow chapter 2 of Kardar, Statistical Physics of Particles, Cambridge University Press, 2007. ↩︎
3. Other definitions are used as well. But as you can see from the following discussion, the exponential function is essential, while the numerical factor is not. ↩︎