## Archive for December, 2008

### Binomial p-values for comparing pyrosequencing samples

December 10, 2008

Regarding my work on 454 pyrosequencing error rates with Professor Holmes this summer, I was recently asked how to calculate a p-value for comparing two draws from a Binomial distribution, to test the hypothesis that the number of substitutions seen in the sample is significantly greater than the number of substitutions seen in the control. There is actually no need to use the Poisson approximation, and the Binomial distribution very naturally takes care of varying coverage. I explain my approach here.
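As a concrete illustration, here is one standard Binomial construction of such a p-value — a sketch of the general idea, not necessarily the exact approach the post describes, and `binomial_pvalue` and the counts below are made-up placeholders. The trick is to condition on the total number of substitutions K: under the null hypothesis of a common per-base error rate, each of the K substitutions falls in the sample with probability p = n_sample / (n_sample + n_control), so the sample's count is Binomial(K, p), and varying coverage enters only through p.

```python
from math import comb

def binomial_pvalue(k_sample, n_sample, k_control, n_control):
    """One-sided p-value for the sample's substitution rate exceeding the control's.

    Conditions on K = k_sample + k_control total substitutions: under the null,
    each one lands in the sample with probability p = n_sample / (n_sample + n_control),
    so the sample count is Binomial(K, p). Varying coverage enters only through p.
    """
    K = k_sample + k_control
    p = n_sample / (n_sample + n_control)
    # Upper tail P(X >= k_sample) for X ~ Binomial(K, p)
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i)
               for i in range(k_sample, K + 1))

# Hypothetical counts: 30 substitutions in 10,000 sample bases
# versus 15 in 12,000 control bases.
pval = binomial_pvalue(30, 10_000, 15, 12_000)
```

Because the test conditions on the observed total, no Poisson approximation is needed, and unequal coverage between sample and control is handled automatically.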

### OK, Matlab isn’t that weird

December 10, 2008

So as Johan pointed out, I made a pretty big mistake in my previous post: rand(10000) does not generate 10000 random numbers, it generates a 10000×10000 matrix of random numbers. Here’s the corrected code:

```matlab
tic
a = binornd(1,0.9,10000,1);
toc
tic
a = rand(10000,1) < 0.9;
toc
tic
for i = 1:10000
    a = binornd(1,0.9);
end
toc
tic
for i = 1:10000
    a = rand < 0.9;
end
toc
```

And here are the corrected times:

```
Elapsed time is 0.002682 seconds.
Elapsed time is 0.000368 seconds.
Elapsed time is 0.462683 seconds.
Elapsed time is 0.000401 seconds.
```

This does seem a lot more reasonable. The extra overhead in calling binornd presumably explains why (1) takes a lot longer than (2). (4) is almost as fast as (2) because (4) is automatically vectorised. However, it appears the compiler does not manage to vectorise (3), since it is a load slower than (2).
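The same kind of per-call-overhead experiment is easy to run in other environments too. Here is a rough Python analogue using the standard library's timeit — not the original Matlab benchmark, and the timings it prints will of course differ from the ones above:

```python
import random
import timeit

n = 10_000

# One call that produces all n draws at once (vectorised-style).
t_vec = timeit.timeit(
    lambda: random.choices([0, 1], weights=[0.1, 0.9], k=n), number=20)

# One library call per draw, inside an explicit loop,
# mirroring the Matlab "rand < 0.9" idiom.
t_loop = timeit.timeit(
    lambda: [1 if random.random() < 0.9 else 0 for _ in range(n)], number=20)

print(f"single call: {t_vec:.4f}s   per-draw loop: {t_loop:.4f}s")
```

As in the Matlab case, the interesting quantity is how much of the cost is function-call overhead rather than actual random number generation.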

Although reasonable, there still seems to be a lot of inefficiency in how the random number generator’s output is used in binornd. rand generates a random number in [0,1]. If it generated a true real from the uniform distribution on [0,1], we could construct an infinite sequence of independent binary random variables just by taking its binary representation. Let’s assume that it does generate a true draw from the uniform distribution up to the numerical accuracy defined by the 64-bit internal representation of a double: we then have 52 random bits, since out of the 64 bits 52 define the fractional part. The distribution I actually want to draw from (0 with probability 0.9 and 1 with probability 0.1) has entropy -0.9 log2(0.9) - 0.1 log2(0.1) ≈ 0.469 bits. When we do the rand < 0.9 operation, we therefore throw away 52 - 0.469 ≈ 51.5 bits, which is pretty wasteful in any situation where the random number generation is critical, such as Monte Carlo simulation.
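The information content of one such biased draw is the Shannon entropy of a Bernoulli(0.1) variable, which a couple of lines of Python can verify (entropy_bits is just an illustrative helper, not anything from the post):

```python
from math import log2

def entropy_bits(p):
    """Shannon entropy, in bits, of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0  # a deterministic outcome carries no information
    return -p * log2(p) - (1 - p) * log2(1 - p)

h = entropy_bits(0.1)   # information in one draw with P(1) = 0.1
wasted = 52 - h         # bits of the double's fraction discarded per comparison
```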

### Matlab is weird

December 5, 2008

Try running this code in Matlab:

```matlab
tic
a = binornd(1,0.9,10000,1);
toc
tic
a = rand(10000) < 0.9;
toc
tic
for i = 1:10000
    a = binornd(1,0.9);
end
toc
tic
for i = 1:10000
    a = rand < 0.9;
end
toc
```

This is the output I get:

```
Elapsed time is 0.001861 seconds.
Elapsed time is 3.184994 seconds.
Elapsed time is 0.455896 seconds.
Elapsed time is 0.000405 seconds.
```

This seems very strange. I can see that the second approach is wasteful: generating a random number between 0 and 1 carries a lot more information than the one bit the cutoff reduces it to. But why is the final loop so fast? Aren’t loops meant to be bad in Matlab? It’s reassuring that the binornd function is faster than the second approach, but it’s even stranger that binornd then becomes slower when called inside the loop in the third approach!

My conclusion is that Matlab is a very strange platform, and that you should be very careful about assuming one way of doing something will be the fastest. It also drives home the point that it’s a pretty dodgy environment to use for performance testing of algorithms!

### “Ridiculous stats”

December 1, 2008So Johan recently showed me this ridiculous nature article, where they do linear regression on winning olympic 100m sprint times to conclude that by 2150 women will run faster than men. Two responses (here and here) criticize the article, but I thought I should compare to the obvious Bayesian approach: GP regression. I used Carl’s code to perform the regression in Matlab, with squared exponential covariance function, and I optimized the hyperparameters using the minimise function. The two plots below show the results.

These plots show that GP regression agrees pretty well with intuition: the data tells us nothing about what will happen past about 2030.
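For illustration, GP regression with a squared-exponential covariance can be sketched in a few lines of plain Python. This is not the gpml Matlab code the post used, and the inputs below are toy placeholders rather than the Olympic times; it just shows why the posterior reverts to the prior far from the data:

```python
from math import exp

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy; A untouched
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def sq_exp(x1, x2, ell, sf):
    """Squared-exponential covariance: sf^2 * exp(-(x1 - x2)^2 / (2 ell^2))."""
    return sf ** 2 * exp(-((x1 - x2) ** 2) / (2.0 * ell ** 2))

def gp_predict(xs, ys, x_star, noise=0.1, ell=1.0, sf=1.0):
    """Posterior mean and variance of a zero-mean GP at test input x_star."""
    n = len(xs)
    K = [[sq_exp(xs[i], xs[j], ell, sf) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [sq_exp(x, x_star, ell, sf) for x in xs]
    alpha = solve(K, ys)       # K^{-1} y
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    v = solve(K, k_star)       # K^{-1} k_star
    var = sq_exp(x_star, x_star, ell, sf) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, var

# Toy data (NOT the sprint times): predict near and far from the inputs.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.5, 0.4, 0.1]
m_near, v_near = gp_predict(xs, ys, 1.5)
m_far, v_far = gp_predict(xs, ys, 100.0)
```

Far from the training inputs the covariances in k_star vanish, so the prediction reverts to the prior (mean 0, variance sf²), which is exactly the behaviour in the plots: beyond the data the GP is honest about knowing nothing.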