by Andre on 10 February 2007
Sam at Everyday Scientist doesn’t like changing histogram bin widths when comparing data sets. When I first read his post, it sounded pretty reasonable, but it really comes back to the question of how to pick histogram bin sizes in the first place. It seemed like the kind of thing I should know, so I went looking for something on selecting bin widths for histograms and came across this paper by David Scott: “On optimal and data-based histograms,” Biometrika, Vol. 66, No. 3. (Dec., 1979), pp. 605-610. (Penn doesn’t have an electronic subscription to Biometrika, but the paper’s also available through JSTOR… What the heck, just download it here [pdf] for your educational purposes).
In this paper, Scott derives “the formula for the optimal histogram bin width […] which asymptotically minimizes the integrated mean squared error” h:
$$h = \left[ \frac{6}{\int_{-\infty}^{\infty} f'(x)^2 \, dx} \right]^{1/3} n^{-1/3}$$
where n is the number of samples and f is the underlying probability density (f' is its first derivative). For Gaussian data, this becomes a nice rule of thumb:
$$h = 3.49 \, s \, n^{-1/3}$$
where s is an estimate of the standard deviation.
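To make the rule of thumb concrete, here is a minimal Python sketch of how one might apply it. The function names scott_bin_width and scott_bin_count are mine, not from Scott's paper, and I'm assuming that s is the usual sample standard deviation (n - 1 in the denominator) and that the bins should span the full range of the data.

```python
import math
import statistics


def scott_bin_width(samples):
    """Bin width from the Gaussian rule of thumb: h = 3.49 * s * n**(-1/3).

    Assumption: s is the sample standard deviation (n - 1 denominator).
    The constant 3.49 is approximately (24 * sqrt(pi)) ** (1/3).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate s")
    s = statistics.stdev(samples)
    return 3.49 * s * n ** (-1.0 / 3.0)


def scott_bin_count(samples):
    """Number of equal-width bins needed to cover the data range."""
    h = scott_bin_width(samples)
    if h == 0:
        return 1  # all samples identical; a single bin is enough
    data_range = max(samples) - min(samples)
    return max(1, math.ceil(data_range / h))
```

For example, scott_bin_width(measurements) on a list of measurements returns a width in the same units as the data. Because the width shrinks only as n^(-1/3), two data sets with different sample sizes or spreads will generally end up with different bin widths under this rule.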
Getting back to Sam’s post, the figure he didn’t like came from Ron Vale’s lab. One thing I noticed is that the Biometrika site lists some papers that cite the Scott paper (scroll down), and one of them is also from Vale’s lab. This suggests that at least some people in Vale’s group are familiar with this rule and probably used it when choosing the bin widths in that figure. In other words, far from being a mistake or a trick to make the data look nicer, those bin widths were probably chosen to best represent the underlying probability distribution.
Does anyone know of updates or improvements on the Scott paper that we should know about?