Some Takeaways from Statistical Consequences of Fat Tails
[Probability Theory]
A review of Nassim Nicholas Taleb’s probability theory research and commentary book, SCoFT
Isaac Gelman, December 2020


The primary takeaway of Statistical Consequences of Fat Tails is this:

        Take nothing for granted.

This includes some of the most widely accepted statistical techniques. NNT uses the phrase it is what it is to describe the realization that another 300 years of data would be required to test a statistical hypothesis, or that a dataset has no variance: that our best guess of a distribution’s standard deviation will not converge within a lifetime’s worth of data.

The book takes the reader back to first principles, with a particular focus on “fat-tailed” random variables such as SPY and BTC daily movements. Fat-tailed random variables challenge our conceptions of mean and standard deviation. Linear regression also breaks under fat tails. The book makes a convincing case that power law distributions should be the default for modeling data, rather than the thin-tailed Normal distribution.

The book is also, somewhat esoterically, a commentary on the state of modern data analysis. As data infiltrates every aspect of life, analysts and researchers are hasty to apply commoditized and hyper-convenient statistical packages. Much like the massive influx of users into the Robinhood trading app, a spike in use can also mean a spike in misuse. We would like to avoid crashes such as 2008, the flash crash, or the fate of the fund Long Term Capital Management, and we certainly want to maintain scientific integrity and cut the risk of mass statistical misunderstanding.

What are Fat Tails?


Any distribution with more density in the tails than the Normal distribution is said to have thick tails. This corresponds to raw kurtosis > 3. Having “thick tails” is not difficult: the tail density only needs to decay more slowly than the Normal’s, i.e., more slowly than e^(−x²/2).

Fat-tailed distributions are the thickest-tailed of these. Power laws are the canonical example: they are distributions with so much additional density in their tails that some moments E[X^p] are no longer finite.

Power Law Distributions

The quintessential example of the power law distribution is engineer and sociologist Vilfredo Pareto’s discovery that 20% of taxpayers had 80% of the income across countries in Europe. One parameter of the Pareto power law distribution is α, known as the tail index. Pareto’s 80-20 example corresponds to α = 1.16. The tail index describes the behavior of density decay in the tail, as its name implies.
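The α = 1.16 figure can be recovered directly: for a Pareto distribution, the share of the total held by the top fraction q of the population is q^(1 − 1/α), a standard Pareto identity (not spelled out in the book’s text here). Setting the top 20% share to 80% and solving for α:

```python
import math

# For a Pareto distribution, the share of the total held by the
# top fraction q of the population is q ** (1 - 1/alpha).
# Solve 0.2 ** (1 - 1/alpha) = 0.8 for alpha:
alpha = 1.0 / (1.0 - math.log(0.8) / math.log(0.2))
print(round(alpha, 2))  # -> 1.16
```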

P(X > x) ~ L(x)·x^(−α), where X is the random variable, P(X > x) is the probability of X exceeding x (the survival function), and L(x) is a slowly varying function.

Recall the inverse square law from Newton’s gravitation, F = G·m1·m2/r^2 (α = 2), wherein the force of gravity is inversely proportional to the square of the distance between two masses. Similarly, our survival function on the left-hand side, P(X > x), operates according to a power law: it is inversely proportional to x^α for a fixed tail index α.

The strange thing about power law distributions is that, depending on the tail index α, some of their moments may not exist, i.e., may be infinite. There is no finite mean if α < 1, and no finite variance if α < 2. The same applies to skewness at α < 3, kurtosis at α < 4, and so on. The tails get thicker as α gets smaller.

Pseudo-convergence: A tail index less than 2 doesn’t mean that we can’t compute the sample variance of a dataset. Rather, betting on the stability of that variance is unwise, because the sample variance will never converge and can in fact “spike” at any time. Furthermore, if the 4th moment (kurtosis) doesn’t exist, convergence of the 2nd moment (variance) may be unbearably slow.

The Central Limit Theorem, which is typically very useful for sums and averages, requires a finite variance, so sums of variables with tail index α < 2 do not obey it. The assumption behind the analytic Black-Scholes-Merton price for a financial option, namely that the random walk sum of movements converges to the Normal distribution, is also violated, so that breaks too. If the tail index is slightly over 2, the sum will converge to the Normal in the limit, but very slowly. As Taleb says, "you will eventually get there if you have a long, very long life."

Let’s say we know that a set of samples comes from a power law distribution. It stands to reason that the “tail” events, the unlikely events of atypically large magnitude, are the most indicative of the tail behavior. But these tail events are rare. Without a deep understanding of the underlying process that generated the samples, it can be tough to rule out that the data was generated by a power law. In this sense, we might consider that “most” processes are fat tailed by default; or, we should at least assume they are until we have enough quantitative or qualitative data to prove otherwise.

Moments, Moments, Moments: Mean & Variance

Below we sample 100,000 points from a power law distribution with tail index α = 1.8. While the sample variance is computable, it never converges, so we should not trust the number or bet on it. A single observation can repeatedly send it flying.
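A minimal sketch of such an experiment (illustrative only; the book’s own code may differ). Pareto samples are drawn by inverse-CDF sampling, and the running sample variance is tracked as observations accumulate:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 1.8  # tail index < 2: the theoretical variance is infinite
n = 100_000

# Inverse-CDF sampling: if U ~ Uniform(0, 1), then U**(-1/alpha) is
# Pareto-distributed with P(X > x) = x**(-alpha) for x >= 1.
x = rng.uniform(size=n) ** (-1.0 / alpha)

# Running (cumulative) sample variance after each new observation.
counts = np.arange(1, n + 1)
running_mean = np.cumsum(x) / counts
running_var = np.cumsum(x**2) / counts - running_mean**2

# The single largest observation accounts for a large share of the
# sum of squares, which is what sends the sample variance "flying".
print(x.max() ** 2 / np.sum(x**2))
```

Plotting running_var against counts shows the behavior described above: long stretches of apparent stability punctuated by upward spikes whenever a new maximum arrives.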



But maximum likelihood methods still work:
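For instance, the maximum likelihood estimate of the tail index for Pareto samples with known minimum 1 has a closed form, alpha_hat = n / sum(log(x_i)). A sketch (illustrative, not the book’s code) showing that it behaves well on the same kind of data whose sample variance does not:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true = 1.8
# Pareto(x_min = 1) samples via inverse-CDF sampling.
x = rng.uniform(size=100_000) ** (-1.0 / alpha_true)

# Closed-form MLE of the tail index when x_min = 1 is known:
# alpha_hat = n / sum(log(x_i)). Unlike the sample variance, this
# estimator is consistent and converges at the usual sqrt(n) rate.
alpha_hat = x.size / np.log(x).sum()
print(alpha_hat)  # close to the true alpha of 1.8
```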



Analyzing Daily Bitcoin Movements

We can directly analyze daily Bitcoin price movement data. The Student T distribution is a two-tailed power law distribution, and its “degrees of freedom” parameter translates to the tail index. Here, we find that α = 1.39: finite mean, but no finite variance. Visualizing the empirical CDF against the maximum-likelihood-estimated CDF, we see that they match well. However, the sample variance exhibits the typical pseudo-convergent behavior even after two years of samples. (The log-rescaled, infinite-support version of this result is omitted, as the presentation here is more direct for intuition.)
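A sketch of the fitting step. Since the actual BTC return series is not reproduced here, the `returns` array below is a hypothetical stand-in, simulated from a Student t with the tail index reported above; with real data it would hold two years of daily log returns:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for ~2 years of daily BTC log returns:
# simulated from a Student t with df = 1.39 (the tail index above).
rng = np.random.default_rng(1)
returns = stats.t.rvs(df=1.39, loc=0.0, scale=0.03, size=730,
                      random_state=rng)

# Maximum likelihood fit of all three Student t parameters;
# the fitted df plays the role of the tail index alpha.
df_hat, loc_hat, scale_hat = stats.t.fit(returns)
print(df_hat)
```

Overlaying stats.t.cdf(grid, df_hat, loc_hat, scale_hat) on the empirical CDF reproduces the visual goodness-of-fit check described above.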






Figure: Cumulative Sample Variance

Slowness of Convergence

Even if the tail index satisfies 2 < α < 4, the variance may converge too slowly to be useful. Compare the speed of convergence of a Student T distribution with α = 3 degrees of freedom (infinite kurtosis) against samples from one with α = 5 degrees of freedom (finite kurtosis): at α = 3, it takes over 95,000 data points to converge to the true variance. The code is omitted for brevity.
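Since the book’s code is omitted, here is a minimal sketch of one way to run the comparison, under the assumption that “converge” means the running second moment settling near the true variance df/(df − 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000

for df in (3, 5):  # df = 3: infinite kurtosis; df = 5: finite kurtosis
    x = stats.t.rvs(df=df, size=n, random_state=rng)
    true_var = df / (df - 2.0)  # Student t variance for df > 2
    counts = np.arange(1, n + 1)
    # The t distribution has mean 0, so E[X^2] equals the variance.
    running_var = np.cumsum(x**2) / counts
    # Relative error of the running estimate after all n samples:
    print(df, abs(running_var[-1] - true_var) / true_var)
```

Plotting running_var for both cases makes the contrast visible: the df = 5 curve hugs 5/3 early on, while the df = 3 curve keeps lurching toward 3 as late tail events arrive.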