Two mathematical tidbits that I came across recently, posted mainly for my own benefit.

**1. L1 distance between probability density functions**

If $X$ and $Y$ are two random variables with densities $f$ and $g$, respectively, the density of the scaled variable $X_a = aX$ is

$$f_a(x) = \frac{1}{a}\, f\!\left(\frac{x}{a}\right),$$

and similarly for $Y_a = aY$, where $a > 0$. Using this transformation rule, we see that the $L_1$ distance between $f_a$ and $g_a$ satisfies

$$\int \left| f_a(x) - g_a(x) \right| dx = \int \frac{1}{a} \left| f\!\left(\frac{x}{a}\right) - g\!\left(\frac{x}{a}\right) \right| dx.$$

Setting $u = x/a$, we get

$$\int \left| f_a(x) - g_a(x) \right| dx = \int \left| f(u) - g(u) \right| du,$$

so the $L_1$ distance is unchanged by the scaling. This latter result also holds for $d$-dimensional vector random variables (for which the transformation rule is $f_a(x) = a^{-d} f(x/a)$).
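This scaling invariance is easy to check numerically. Here is a minimal sketch using two normal densities (my choice of example densities, not necessarily the ones in the figures), comparing the $L_1$ distance before and after scaling by $a = 3$:

```python
import numpy as np

# Two example densities: a standard normal f and a unit-variance normal g with mean 1.
def f(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(x):
    return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)

def l1_distance(p, q, lo=-60.0, hi=60.0, n=1_000_001):
    # Approximate the integral of |p - q| over a wide interval by a Riemann sum.
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return np.sum(np.abs(p(x) - q(x))) * dx

a = 3.0  # scale factor
# Densities of the scaled variables aX and aY: (1/a) f(x/a) and (1/a) g(x/a).
fa = lambda x: f(x / a) / a
ga = lambda x: g(x / a) / a

d = l1_distance(f, g)
da = l1_distance(fa, ga)
print(d, da)  # the two L1 distances agree up to numerical error
```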

In their book on the $L_1$ approach to nonparametric density estimation, Luc Devroye and László Györfi prove a much more general version of this result, which includes as a special case the fact that the $L_1$ distance between PDFs is invariant under continuous, strictly monotonic transformations of the coordinate axes. In the one-dimensional case, this means that one can get a visual idea of the $L_1$ distance between two PDFs by bringing the real line into a finite interval via a monotonic transformation.

The figure below shows the transformed versions of the PDFs above, obtained by mapping the real line onto a finite interval with such a monotonic transformation. The area of the gray region inside the interval below is the same as the area of the gray region in the third plot above (which is spread over the whole real line).
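The invariance under monotonic transformations can also be verified numerically. Below is a sketch using the logistic sigmoid as the map onto $(0, 1)$ — my choice for illustration; the transformation used in the figures may differ:

```python
import numpy as np

# Monotonic map from the real line onto (0, 1): the logistic sigmoid.
T = lambda x: 1.0 / (1.0 + np.exp(-x))
T_inv = lambda u: np.log(u / (1.0 - u))
T_prime = lambda x: T(x) * (1.0 - T(x))

def f(x):  # standard normal density
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(x):  # normal density with mean 1
    return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)

def transformed(p):
    # Density of T(X) on (0, 1) when X has density p, by the change-of-variables rule.
    return lambda u: p(T_inv(u)) / T_prime(T_inv(u))

# L1 distance on the real line ...
x = np.linspace(-40.0, 40.0, 800_001)
d_line = np.sum(np.abs(f(x) - g(x))) * (x[1] - x[0])

# ... equals the L1 distance of the transformed densities on (0, 1).
ft, gt = transformed(f), transformed(g)
u = np.linspace(1e-6, 1 - 1e-6, 800_001)
d_unit = np.sum(np.abs(ft(u) - gt(u))) * (u[1] - u[0])

print(d_line, d_unit)  # the two values agree up to numerical error
```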

You can read the proof of the general version of the theorem in the introductory chapter of the Devroye-Györfi book, which you can download here. Luc Devroye has many other freely accessible resources on his website, and he offers the following quote as an aid in understanding him.

**2. L1 penalty and sparse regression: A mechanical analogy**

Suppose we would like to find a linear fit of the form $y = \beta^\top x$ to a data set $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and $\beta \in \mathbb{R}^p$. The ordinary least-squares approach to this problem consists of seeking a $p$-dimensional vector $\beta$ that minimizes

$$R(\beta) = \sum_{i=1}^{n} \left( y_i - \beta^\top x_i \right)^2.$$

One sometimes considers regularized versions of this approach by including a “penalty” term for $\beta$, and minimizing the alternative objective function

$$R(\beta) + \lambda P(\beta),$$

where $\lambda > 0$ is a tuning parameter, and $P$ is a function that measures the “size” of the vector $\beta$. When

$$P(\beta) = \sum_{j=1}^{p} \beta_j^2 = \lVert \beta \rVert_2^2,$$

we have what is called *ridge regression*, and when

$$P(\beta) = \sum_{j=1}^{p} \lvert \beta_j \rvert = \lVert \beta \rVert_1,$$

we have the so-called *LASSO*. Clearly, in both cases, increasing the weight $\lambda$ of the penalty term results in solutions with “smaller” $\beta$, according to the appropriate notion of size.
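The shrinkage effect is easy to see concretely in the ridge case, which has a closed-form solution. Here is a small sketch on synthetic data (the data and the $\lambda$ values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # Closed-form minimizer of sum_i (y_i - beta.x_i)^2 + lam * ||beta||_2^2.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)  # larger lambda gives a "smaller" beta, as measured by its norm
```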

Perhaps a bit surprisingly, for the case of the $L_1$ penalty (LASSO), one often gets solutions where some components of $\beta$ are not just small, but exactly zero. I recently came across an intuitive explanation of this fact based on a mechanical analogy, on a blog devoted to compressed sensing and related topics. The following three slides, reproduced from a presentation by Julien Mairal, perhaps do not exactly constitute a “proof without words”, but are really helpful nevertheless. The terms in the figures represent the $L_2$ and $L_1$ versions of the penalty function as “spring” and “gravitational” energies, respectively. Increasing the spring constant makes $\beta$ smaller, but not zero. Increasing the gravity (or the mass), on the other hand, eventually makes $\beta$ exactly zero.
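The analogy can be reduced to a one-dimensional calculation. The sketch below (my own rendering of the spring/gravity picture, not taken from the slides) compares the minimizers of a quadratic data-fit term plus a spring energy versus a gravitational energy:

```python
import numpy as np

# A point at position b is pulled toward b0 by the data-fit term (b - b0)^2,
# while the penalty acts as a spring energy k*b^2 (L2) or a gravitational
# energy mg*|b| (L1).

def argmin_spring(b0, k):
    # min_b (b - b0)^2 + k*b^2 has the closed-form solution b0 / (1 + k):
    # shrunk toward zero, but never exactly zero for finite k.
    return b0 / (1.0 + k)

def argmin_gravity(b0, mg):
    # min_b (b - b0)^2 + mg*|b| is solved by soft-thresholding:
    # exactly zero once the "gravity" mg is large enough.
    return np.sign(b0) * max(abs(b0) - mg / 2.0, 0.0)

b0 = 1.0
spring = [argmin_spring(b0, k) for k in (1.0, 10.0, 100.0)]
gravity = [argmin_gravity(b0, mg) for mg in (1.0, 10.0, 100.0)]
print(spring)   # approaches zero but never reaches it
print(gravity)  # hits exactly zero once mg >= 2*|b0|
```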

In case you haven’t seen it before, here is a visual from the original LASSO paper by Tibshirani, comparing the $L_1$ and $L_2$ penalties in a two-dimensional problem ($\hat{\beta}$ denotes the solution of the ordinary least-squares problem, without any constraint or penalty):

This provides another intuitive explanation of the sparsity of solutions obtained from the LASSO. In order to make sense of this figure, you should keep in mind that the regularized problems above are equivalent to the constrained problem

$$\min_{\beta} R(\beta) \quad \text{subject to} \quad P(\beta) \le t,$$

where $t$ is analogous to the tuning parameter $\lambda$, and $P$ stands for the $L_1$ or $L_2$ penalty, as above. (If this was a bit too cryptic, you can take a look at the original LASSO paper, which, according to Google Scholar, has been cited more than 9000 times.)
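The penalized/constrained equivalence can be sanity-checked in one dimension by brute force: solve the penalized problem, then verify that the constrained problem with budget $t$ equal to the penalized solution’s size gives the same answer. A sketch (the values of $b_0$ and $\lambda$ are arbitrary):

```python
import numpy as np

b0, lam = 1.0, 0.6
grid = np.linspace(-2.0, 2.0, 400_001)

# Penalized problem: min_b (b - b0)^2 + lam*|b|, by grid search.
b_pen = grid[np.argmin((grid - b0)**2 + lam * np.abs(grid))]

# Constrained problem: min_b (b - b0)^2 subject to |b| <= t, with t = |b_pen|.
t = abs(b_pen)
feasible = grid[np.abs(grid) <= t]
b_con = feasible[np.argmin((feasible - b0)**2)]

print(b_pen, b_con)  # the two solutions coincide
```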