The Significance of Significance

So earlier this year, with hope in my heart, I entered the Wellcome Science Writing Prize for 2013. Yesterday, like many others, I got the email saying that my piece had not been shortlisted this year, so I decided to put it up here so you can have a read if you’re curious/bored.

I’m not disappointed; though I am shockingly thin-skinned in some areas of my life, I’ve always been a complete and utter hard arse when it comes to my writing. No doubt I’ll enter again next year! Commiserations to my friends who also didn’t make the cut, and congratulations to any reading this who did (though I’m not aware of any up to this point).

Bear in mind that I could have written a much longer piece on p-values, and no doubt will at some point in future, but this was a strict 800 words. Without further ado…

The Significance of Significance

“Scientists have found a significant link…”

“A significant new study shows…”

Don’t scientists drive you mad sometimes? With every report on a new study they seem full of bluster. We get excited that there’s a cure for cancer, and it turns out to be a study on the colour of seals’ noses. That’s because there’s a difference between what a journalist might mean by “significant” and what a scientist intends: statistical significance.

Statistical significance is a way of showing how big the role of chance is in our findings. The smaller the role of chance, the more we can be certain that the “Thing We Did” affected the “Result”. The bigger the role of chance, the less sure we can be of our theory (hypothesis). We can figure it all out with calculations that give us something called a probability, or “p”-value, which shows the role of chance. But not all scientists are happy with how p-values are used, and some think that using p-values gives us a skewed view of reality.
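
If you’re curious what that calculation looks like in practice, here is a minimal sketch in Python – the groups and numbers are made up purely for illustration, and it leans on SciPy’s standard two-sample t-test rather than anything special:

```python
# Minimal sketch: comparing two groups on made-up data.
from scipy import stats

# The "Thing We Did" group vs. a control group (hypothetical measurements)
treated = [5.1, 5.8, 6.2, 5.9, 6.4, 5.7]
control = [4.9, 5.0, 5.3, 4.8, 5.2, 5.1]

# The p-value asks: if chance alone were at work, how often would we see
# a gap between the groups at least this big?
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The smaller that printed p-value, the harder it is to wave the result away as luck.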

Surprisingly, it all started with beer. In the early 1900s, Guinness Chief Brewer William Sealy Gosset wanted to find the best barley to make his pints. But that meant testing lots of barley crops against each other. There are lots of reasons why crops may vary – where and when they were planted, the sowing method, what the weather was like that year and the number of pests, for example. To eliminate the role of these chance factors, Gosset needed lots of barley, land and time – not a very economical way of pursuing economy. Gosset needed to find a significant result from as few harvests as he could.

Luckily, Guinness employed the best graduates and Gosset was no exception. A gifted mathematician and chemist, he worked with renowned statistician Karl Pearson, and finally published his calculations for small sample significance testing, “t-tables”, in Pearson’s journal Biometrika in 1908.

But it was statistician Ronald Fisher who took p-values from brewing into mainstream science. He wanted to reproduce Gosset’s tables in his 1925 book, Statistical Methods for Research Workers, but due to copyright disputes with Pearson he had to rework Gosset’s t-tables. One of Fisher’s tables gets the blame for science’s fixation with a magic number:

“P” is less than 0.05

It means the role of chance in your study is less than 5%, or 1 in 20. If you toss a coin 1000 times and expect a roughly even split between heads and tails, you might shrug off, say, 520 heads and 480 tails as chance. Once you get past about 530 heads, though, a test at the 5% level starts to suggest your coin was unevenly weighted.
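
A quick sketch makes the threshold concrete – the counts are just examples, and SciPy’s exact binomial test stands in for the back-of-the-envelope sums above:

```python
# Where does p < 0.05 kick in for 1000 tosses of a fair coin?
from scipy.stats import binomtest

for heads in (520, 535, 550):
    result = binomtest(heads, n=1000, p=0.5)  # two-sided by default
    print(f"{heads} heads out of 1000: p = {result.pvalue:.4f}")

# The p-value slips under 0.05 once the head count reaches the low 530s.
```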

Scientists quickly latched onto small-sample significance testing and the power of “p”, particularly in psychology. In his article Negativland, Prof Keith Laws says that in the 1920s only 17% of psychology papers used significance tests, but by the 1960s it had risen to 90%. And the figure still rises.

So what’s the problem? Why the controversy? The 0.05 means too much. Publication bias is a common problem in science journals – we’re more likely to get published if our study supports our initial theory, preferably with a big “wow”. If our study shows that the role of chance was too big to judge, we might not even bother trying to get published, and shut the study away in a drawer instead. In a study of 609 American Psychological Association members, 82% said they’d submit a paper that supports their hypothesis, but only 43% would try with a non-significant finding. This mirrors the views of journal editors. Publication in a major journal is a career high, so some scientists try hard to get those big results and tiny p-values. “Data dredging” programs exist to nudge your p-value down as far as possible. While falsification is thankfully rare, these “makeovers” are common practice and skew visible science towards the sensational when, in reality, science is much more commonplace and uncertain.
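
A toy simulation shows how strong that filter is (entirely hypothetical numbers): run lots of small studies of a treatment that does nothing at all, and “publish” only the ones that clear p < 0.05.

```python
# Sketch: publication bias in miniature (all numbers hypothetical).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
published_effects = []
n_studies, n_per_group = 1000, 20

for _ in range(n_studies):
    treated = rng.normal(0, 1, n_per_group)   # the true effect is zero
    control = rng.normal(0, 1, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:                              # only "significant" results get written up
        published_effects.append(abs(treated.mean() - control.mean()))

print(f"'Significant' studies: {len(published_effects)} of {n_studies}")
print(f"Average published effect size: {np.mean(published_effects):.2f} (truth: 0)")
```

Roughly one study in twenty clears the bar by luck alone, and the ones that do all look impressively large – the drawer swallows the rest.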

What’s more, p-values place too much emphasis on certainty rather than on the size of the actual effect. Economist Deirdre McCloskey explains this well. Imagine you have two diet pills: one promises to lose you around 20lbs, but with high variance – you might lose 10lbs, or 30lbs. The other pill promises only 5lbs, but with a lower variance, maybe between 4.8 and 5.2lbs. Which would you choose? The one with the bigger effect, of course – the one that says you might lose more. The impact of the result matters more to most of us than how sure we are of it.
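
Here is the same point as a toy calculation (invented numbers again): the modest-but-consistent pill ends up with by far the smaller p-value, even though the other pill is the one most dieters would actually want.

```python
# Sketch of McCloskey's point: a small, ultra-consistent effect can look
# "more significant" than a big, variable one. All numbers invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

pill_a = rng.normal(20, 15, 12)   # big average loss, wildly variable
pill_b = rng.normal(5, 0.2, 12)   # small average loss, almost identical for everyone

for name, losses in (("Pill A", pill_a), ("Pill B", pill_b)):
    _, p = stats.ttest_1samp(losses, 0)   # test against "no weight loss at all"
    print(f"{name}: mean loss = {losses.mean():.1f}lbs, p = {p:.2g}")

# Pill B's p-value is microscopic; Pill A's is far less impressive on paper,
# yet Pill A is the one with the effect people actually care about.
```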

We may talk of changing the p-value culture, but Edwin Boring first criticised p-values back in 1919 – so how much longer until it changes? The misguided emphasis on p-values is too significant to ignore.