Sam at Everyday Scientist doesn’t like changing histogram bin widths when comparing data sets. When I first read his post, it sounded pretty reasonable, but it really comes down to how you pick histogram bin sizes in the first place. It seemed like the kind of thing I should know, so I went looking for something about selecting bin widths for histograms and came across this paper by David Scott: “On optimal and data-based histograms,” Biometrika, Vol. 66, No. 3 (Dec., 1979), pp. 605-610. (Penn doesn’t have an electronic subscription to Biometrika, but the paper’s also available through JSTOR... What the heck, just download it here [pdf] for your educational purposes).
In this paper, Scott derives “the formula for the optimal histogram bin width [...] which asymptotically minimizes the integrated mean squared error” h:

$$h^* = \left[ \frac{6}{n \int_{-\infty}^{\infty} f'(x)^2 \, dx} \right]^{1/3}$$

where n is the number of samples and f is the underlying probability density. For Gaussian data, this becomes a nice rule of thumb:

$$h = 3.49 \, s \, n^{-1/3}$$

where s is an estimate of the standard deviation.
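To make the Gaussian rule of thumb (h = 3.49 s n^(-1/3)) concrete, here is a minimal Python sketch using only the standard library. The function names `scott_bin_width` and `bin_edges`, the choice to start the bins at the sample minimum, and the synthetic data are all my own assumptions for illustration, not anything taken from Scott's paper. The constant 3.49 is the rounded value of (24 sqrt(pi))^(1/3), which is what the general formula gives when f is a Gaussian.

```python
import math
import random
import statistics

# (24 * sqrt(pi))**(1/3) is roughly 3.49, the constant in the Gaussian rule.
SCOTT_CONSTANT = (24.0 * math.sqrt(math.pi)) ** (1.0 / 3.0)


def scott_bin_width(samples):
    """Bin width from the Gaussian rule of thumb: h = 3.49 * s * n**(-1/3)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate s")
    s = statistics.stdev(samples)  # sample standard deviation (n - 1 denominator)
    return SCOTT_CONSTANT * s * n ** (-1.0 / 3.0)


def bin_edges(samples, width):
    """Equal-width bin edges starting at the sample minimum (an arbitrary choice)."""
    lo, hi = min(samples), max(samples)
    n_bins = max(1, math.ceil((hi - lo) / width))
    return [lo + i * width for i in range(n_bins + 1)]


# Synthetic Gaussian data, purely for illustration.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]

h = scott_bin_width(data)
edges = bin_edges(data, h)
print(f"bin width: {h:.3f}, number of bins: {len(edges) - 1}")
```

Note that the width depends on both n and s, so two data sets with different sample sizes or spreads will generally end up with different bin widths under this rule.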
Getting back to Sam’s post, the figure he didn’t like came from Ron Vale’s lab. One of the things I noticed is that the Biometrika site lists some papers that cite the Scott paper (scroll down), and one of them is also from Vale’s lab. This means that at least some people in Vale’s group are familiar with this rule and probably used it when choosing the bin widths in that figure. In other words, far from being a mistake or a trick to make the data look nicer, those bin widths were probably chosen to best represent the underlying probability distribution.
Does anyone know of updates or improvements on the Scott paper that we should know about?
Biocurious is written by Andre Brown and Philip Johnson, since 2005. Content of the weblog is licensed under a Creative Commons Attribution-Share Alike 3.0 License.