Archive for the ‘Statistics’ Category

Elastic net, LASSO, and LARS in Python

July 26, 2012

I’m currently looking for Python implementations of the LASSO and the elastic net, otherwise known as L1 and L1/L2 regularised linear regression respectively. The options seem to be scikit-learn and glmnet-python. The former offers coordinate descent or the LARS algorithm coded in pure Python (with NumPy, obviously), whereas the latter simply wraps Jerome Friedman’s Fortran code from the R glmnet package.
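As a sketch of the scikit-learn route (class names are from a current sklearn.linear_model and may differ from the version discussed here; the alpha and l1_ratio values are arbitrary illustrations, not tuned):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LassoLars

# Synthetic data in the spirit of the timing experiment: N=400, P/10
# non-zero coefficients drawn from N(0, 9), Gaussian noise of variance 0.01.
rng = np.random.default_rng(0)
N, P = 400, 1000
X = rng.standard_normal((N, P))
beta = np.zeros(P)
beta[: P // 10] = rng.normal(0.0, 3.0, size=P // 10)
y = X @ beta + rng.normal(0.0, 0.1, size=N)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1, coordinate descent
lars = LassoLars(alpha=0.1).fit(X, y)                 # L1, LARS algorithm
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # L1/L2 mix
```

All three expose the fitted coefficients as `coef_`, and the L1 penalty makes most of them exactly zero.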

Timing comparison

Figure: runtime comparison between the LASSO/elastic net implementations in scikit-learn and glmnet-python. x-axis: number of features P; y-axis: time in seconds. Synthetic data with N=400, P/10 non-zero coefficients sampled from N(0, 9), and Gaussian noise of variance 0.01.


As you might expect, the Fortran code is significantly faster in general, although for large P the scikit-learn LARS implementation is competitive with glmnet, presumably because the Python overhead becomes less noticeable. Unfortunately, as far as I can see, scikit-learn does not include a LARS implementation for the elastic net.

A Bayesian view of the Stein estimator

November 17, 2008

I’ve been attending Statistical Theory, a course in Part III Maths at Cambridge, taught by Richard Samworth. In the latest lecture, Richard showed that the Stein estimator of the mean of a multivariate Gaussian has lower expected loss (measured as the squared Euclidean distance between the true and estimated means) than the MLE for any value of \theta, provided the dimension is at least 3. From a frequentist perspective this seems completely unintuitive, but from a Bayesian perspective it is much more reasonable.

Assume we observe a vector X drawn from a p-dimensional multivariate normal with mean \theta and identity covariance matrix. The MLE of \theta is then just X, but the Stein estimator is

\theta_s=\left( 1-\frac{p-2}{\|X\|^2} \right) X

The fact that this estimator outperforms the MLE is termed shrinkage, because the estimate is shrunk towards 0.
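A quick Monte Carlo check of the dominance claim (my own sketch, not part of the original argument): draw many copies of X around a fixed \theta and compare the average squared errors of the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 10, 20000
theta = rng.normal(0.0, 2.0, size=p)          # an arbitrary fixed true mean
X = theta + rng.standard_normal((trials, p))  # many draws of X ~ N_p(theta, I)

norm2 = np.sum(X ** 2, axis=1, keepdims=True)
stein = (1 - (p - 2) / norm2) * X             # Stein estimator per draw

risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))       # approx p
risk_stein = np.mean(np.sum((stein - theta) ** 2, axis=1)) # strictly smaller
```

The MLE's risk comes out near p, while the Stein estimator's risk is smaller, as the theory says it must be for p >= 3.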

How would a Bayesian approach this problem? First let’s put a Gaussian prior on \theta, so

\theta \sim N_p(0,\lambda^{-1}I)

where \lambda is a precision (inverse variance). In a fully Bayesian setting we would then put a Gamma prior on \lambda, but unfortunately we would then have to resort to sampling to infer the posterior over \theta. Assuming \lambda is known, the posterior of \theta is

P(\theta|X,\lambda) \propto P(X|\theta) P(\theta|\lambda) = N_p(\theta;(1+\lambda)^{-1}X,(1+\lambda)^{-1}I)
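For completeness, this is the standard Gaussian conjugate update, obtained by completing the square in \theta:

P(\theta|X,\lambda) \propto \exp\left(-\tfrac{1}{2}\|X-\theta\|^2 - \tfrac{\lambda}{2}\|\theta\|^2\right) \propto \exp\left(-\tfrac{1+\lambda}{2}\left\|\theta - \tfrac{X}{1+\lambda}\right\|^2\right),

so the posterior precision is 1+\lambda and the posterior mean is (1+\lambda)^{-1}X.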

Thus the posterior mean of \theta is
E(\theta|X,\lambda)=(1+\lambda)^{-1}X

Now let’s find a point estimate of \lambda. This empirical Bayes shortcut is not ideal, but it is at least tractable. To do this we’ll first integrate out \theta:

P(X|\lambda)=\int P(X|\theta) P(\theta|\lambda) d\theta
=N_p(X;0,\lambda_x^{-1}I)
where
\lambda_x=\frac{\lambda}{1+\lambda}.
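The integral can also be bypassed: writing X=\theta+\epsilon with independent noise \epsilon \sim N_p(0,I), the covariances add, so

\mathrm{Var}(X) = \lambda^{-1}I + I = \frac{1+\lambda}{\lambda}I = \lambda_x^{-1}I,

which gives the same \lambda_x.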

An estimate of the marginal variance \lambda_x^{-1} is given by
\widehat{\lambda_x^{-1}}=\frac{\|X\|^2}{p-1}
(note E\|X\|^2 = p\,\lambda_x^{-1}, so this is close to, but not exactly, the MLE \|X\|^2/p).

Substituting for \lambda and rearranging gives
\hat{\lambda} = \left( \frac{\|X\|^2}{p-1}-1 \right)^{-1}

Substituting into the expression for E(\theta|X,\lambda) above and rearranging gives
E(\theta|X) = \left( 1-\frac{p-1}{\|X\|^2} \right) X,
which is very close to the Stein estimate. I suspect that some choice of prior on \lambda would result in a MAP estimate giving the p-2 term. (In fact, since E\left[(p-2)/\|X\|^2\right] = \lambda_x exactly under the marginal, plugging the unbiased estimate \hat{\lambda}_x = (p-2)/\|X\|^2 into E(\theta|X,\lambda) = (1-\lambda_x)X recovers the Stein estimator exactly.)
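How much does the p-1 versus p-2 term matter in practice? A Monte Carlo sketch (my own check, with an arbitrary \lambda=0.5, not part of the original derivation) comparing the two shrinkage estimators under the model's own prior:

```python
import numpy as np

rng = np.random.default_rng(1)
p, trials, lam = 10, 20000, 0.5
theta = rng.normal(0.0, np.sqrt(1.0 / lam), size=(trials, p))  # theta ~ N(0, lam^{-1} I)
X = theta + rng.standard_normal((trials, p))                   # X ~ N_p(theta, I)

norm2 = np.sum(X ** 2, axis=1, keepdims=True)
eb = (1 - (p - 1) / norm2) * X      # empirical-Bayes plug-in, p-1 term
stein = (1 - (p - 2) / norm2) * X   # Stein estimator, p-2 term

risk_eb = np.mean(np.sum((eb - theta) ** 2, axis=1))
risk_stein = np.mean(np.sum((stein - theta) ** 2, axis=1))
risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))  # approx p
```

Both shrinkage estimators come out well below the MLE's risk, and nearly indistinguishable from each other, which is consistent with the p-1 estimate being "very close" to Stein's.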

The conclusion is that an estimator with unintuitively desirable properties in a frequentist framework is, from a Bayesian perspective, an intuitively sensible one.