# Choosing Histogram Bin Widths

## by Andre on 9 February 2007

Sam at Everyday Scientist doesn’t like changing histogram bin widths when comparing data sets. When I first read his post, it sounded pretty reasonable, but it just comes back to how to pick histogram bin sizes in the first place. It seemed like the kind of thing I should know, so I tried to find something about selecting bin widths for histograms and came across this paper by David Scott: Biometrika, Vol. 66, No. 3. (Dec., 1979), pp. 605-610. (Penn doesn’t have an electronic subscription to Biometrika, but the paper’s also available through JSTOR... What the heck, just download it here [pdf] for your educational purposes).

In this paper, Scott derives "the formula for the optimal histogram bin width [...] which asymptotically minimizes the integrated mean squared error":

$$h^* = \left( \frac{6}{\int f'(x)^2 \, dx} \right)^{1/3} n^{-1/3}$$

where n is the number of samples and f is the underlying probability distribution. For Gaussian data, this becomes a nice rule of thumb:

$$h = 3.49 \, s \, n^{-1/3}$$

where s is an estimate of the standard deviation.
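For concreteness, here is a minimal sketch of Scott's Gaussian rule of thumb, $h = 3.49\,s\,n^{-1/3}$, in Python (the function name and the example data are mine, not from the paper):

```python
import numpy as np

def scott_bin_width(data):
    """Scott's rule of thumb for histogram bin width: h = 3.49 * s * n^(-1/3)."""
    data = np.asarray(data)
    n = data.size
    s = data.std(ddof=1)  # sample estimate of the standard deviation
    return 3.49 * s * n ** (-1.0 / 3.0)

# Example: 1000 draws from a standard normal
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = scott_bin_width(x)
edges = np.arange(x.min(), x.max() + h, h)  # bin edges of width h
```

For a standard normal sample of n = 1000, this gives a bin width of roughly 0.35, since s is near 1 and n^(-1/3) = 0.1.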

Getting back to Sam’s post, the figure he didn’t like came from Ron Vale’s lab and one of the things I noticed is that on the Biometrika site they list some papers that cite the Scott paper (scroll down). One of them is also from Vale’s lab. This means that at least some people in Vale’s group are familiar with this rule and probably used it when choosing the bin widths in that figure. In other words, far from being a mistake or a trick to make the data look nicer, those bin widths were probably chosen to best represent the underlying probability distribution.

Does anyone know of updates or improvements on the Scott paper that we should know about?

I do know of a better method—based on information—but I have yet to properly write it up. I put a note at http://cosmo.nyu.edu/hogg/research/2007/02/09/binning.pdf that sets it out quickly.

Yes, there is a better way. It’s a field of statistics called density estimation. The idea is that instead of dropping each point into a bin, you instead sum some set of functions centered at each data point (for instance, Gaussians). The trick then becomes choosing the width, and it turns out there are quite universal ways of choosing appropriate widths. There’s a rather inscrutable math text on it (Combinatorial Methods in Density Estimation by Devroye and Lugosi), but that phrase should lead you to some introductions through Google.
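As a rough illustration of the idea, here is a minimal Gaussian kernel density estimate in Python. The bandwidth rule used below (Silverman's $1.06\,s\,n^{-1/5}$) is one common off-the-shelf choice, not necessarily the one recommended in Devroye and Lugosi:

```python
import numpy as np

def gaussian_kde(data, bandwidth, grid):
    """Sum a Gaussian of width `bandwidth` centered at each data point,
    normalized so the result integrates to one."""
    data = np.asarray(data)
    diffs = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (data.size * bandwidth)

rng = np.random.default_rng(1)
x = rng.standard_normal(500)
grid = np.linspace(-4, 4, 200)
# Silverman's rule of thumb for the bandwidth (one common choice)
h = 1.06 * x.std(ddof=1) * x.size ** (-1.0 / 5.0)
density = gaussian_kde(x, h, grid)
```

Unlike a histogram, the estimate is smooth and does not depend on where the bin edges fall, only on the bandwidth.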

I mentioned this over in the comments to the other post, but I figured I’d note it here too: my immediate reaction to the data in the Everyday Scientist post was skepticism, but your post and the comments on the other post here have been quite interesting (and illuminating) – so thanks.

This is good stuff. I have always just played with bin width to see if I got widely varying results, and if not, picked one where each bin had a good number of counts.

However, Sam still has a point. What he is bothered by is that it is difficult to compare the two graphs. So maybe what Sam should have said is that the Vale lab should have taken enough data so that both histograms could have similar bin widths.

Thanks David and Frederick! Frederick, does your blog have a feed?

In theory, my blog should have a feed like all WordPress blogs, but it's having enormous technical difficulties at the moment (it's a small host run by one guy, but it's also the only public host I know of that has Latexrender support, and I don't have anywhere to run my own WordPress install at this point). For instance, you can't comment at all, which makes it practically useless as a blog. It's enough to drive me back to just posting essays on a webpage with latex2html.

I pulled out the Devroye and Lugosi book again today (which isn’t nearly as inscrutable as I remember). The kernel density estimate occupies a couple of later chapters (didn’t get to them today while reading on the train; probably tomorrow), but I did find some gems. If I can get my blog to cooperate I’ll try to post them. There’s this particularly interesting connection to Kolmogorov entropy, and a lot of the foundational issues I’ve been having with probability recently may be resolvable in this framework.

The things I do while waiting for cells to recover from electroporation…

I think that you are mixing up two different issues.

1. When you are comparing two data sets using histograms, you need to use the same bin size and the same axis limits.

2. When you use a histogram, no single bin width is perfect. Scott's rule of thumb is based on beautifully Gaussian data and really doesn't apply to data that is not Gaussian; and if the data is Gaussian, you don't need to look at a histogram anyway. You need to plot the same data several times using different bin widths to extract information about the distribution.

If you want a basic example of this look at www.ggobi.org/book

or at the tips case study on my Statistics 503 web page.
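The suggestion above, replotting the same data at several bin widths, can be sketched quickly in Python (the sample and the particular widths are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# A bimodal sample: no single bin width tells the whole story
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 300)])

for h in (0.1, 0.25, 0.5, 1.0):
    edges = np.arange(x.min(), x.max() + h, h)
    counts, _ = np.histogram(x, bins=edges)
    print(f"bin width {h}: {len(edges) - 1} bins, max count {counts.max()}")
```

At the narrowest width the histogram is noisy; at the widest, the two modes start to blur together. Seeing the sequence is more informative than any one plot.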

I have a Bayesian method of choosing bin sizes that works quite nicely. It avoids the assumption, made by Scott and others, that the form of the underlying density is known.

I am currently submitting the paper to Phys Rev E, but a preprint can be found at:

http://arxiv.org/abs/physics/0605197

The MATLAB package is available for free and can be downloaded at:

http://knuthlab.rit.albany.edu/downloads/OPTBINS_Package.zip
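For the curious, the method selects the number of equal-width bins by maximizing a marginal posterior. Here is a minimal Python sketch of that kind of selection, using the log-posterior as I understand it from the preprint; the OPTBINS package is the authoritative implementation, and details here may differ:

```python
import numpy as np
from scipy.special import gammaln

def log_posterior(data, m):
    """Relative log posterior for m equal-width bins
    (as I read Knuth's preprint; unnormalized)."""
    n = data.size
    counts, _ = np.histogram(data, bins=m)
    return (n * np.log(m)
            + gammaln(m / 2.0)
            - m * gammaln(0.5)
            - gammaln(n + m / 2.0)
            + np.sum(gammaln(counts + 0.5)))

def optbins(data, max_bins=100):
    """Pick the bin count that maximizes the posterior."""
    data = np.asarray(data)
    scores = [log_posterior(data, m) for m in range(1, max_bins + 1)]
    return int(np.argmax(scores)) + 1

rng = np.random.default_rng(3)
x = rng.standard_normal(2000)
m_best = optbins(x)
```

The appeal is that the bin count comes out of the data itself rather than from an assumed density.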

Enjoy,

Kevin

Thanks Kevin, I’ll have a look at it. I’m actually running into this problem again in a case where I don’t have a guess of the underlying distribution so your method, or perhaps one of the methods recommended above, will come in handy.