P-Hacking — Part 02: Issue with the P value & why do researchers p-hack?
In these types of studies, scientists either reject or fail to reject the null hypothesis.
This binary decision process can result in 4 possible outcomes:
1. The null is true and we correctly fail to reject it.
2. The null is true but we incorrectly reject it (a false positive, or Type I error).
3. The null is false and we correctly reject it.
4. The null is false and we incorrectly fail to reject it (a false negative, or Type II error).
Out of these outcomes, scientists who expect to see a relationship are usually hoping for the third. The trouble starts when they fail to reject the null: that result is treated as a lack of evidence, not as evidence that nothing happened, so researchers feel pushed to find something significant. To see where the p-value fits in, consider how scientific experiments typically work.
As the scientific method suggests, many studies use a control group and an experimental group. The experimental group is exposed to the condition the scientists want to learn about, whereas the control group is not; it serves as the baseline for comparison. This is where the p-value comes in. With it, scientists try to determine whether any difference found between the two groups is due to random chance and sampling error, or due to the actual factor being tested. The p-value ranges from 0 to 1, and the general rule is that the lower the p-value, the better, although the extreme values (exactly 0 or 1) are rarely seen in real study results. In practice, anything less than .05 is considered statistically significant and worthy of publication. But it is not that simple, and a higher standard than that is needed before accepting most experimental results. These issues were laid out in a 2005 paper by John P. A. Ioannidis entitled “Why Most Published Research Findings Are False”.
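To make the control-versus-experimental comparison concrete, here is a minimal sketch in Python of how such a p-value is typically computed with a two-sample t-test. The data are simulated, and the group sizes, means, and the use of SciPy's ttest_ind are illustrative assumptions, not taken from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated measurements: the experimental group gets a small real effect,
# the control group does not. All numbers here are made up for illustration.
control = rng.normal(loc=10.0, scale=2.0, size=30)
experimental = rng.normal(loc=10.5, scale=2.0, size=30)

# Welch's two-sample t-test: is the difference between the group means larger
# than what random sampling variation alone would plausibly produce?
t_stat, p_value = stats.ttest_ind(experimental, control, equal_var=False)

print(f"difference in means: {experimental.mean() - control.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# By the usual convention, p < .05 would be labelled "statistically significant".
```

A single, pre-planned test like this is the situation the p-value was designed for; the problems described next arise when many such tests are tried.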
The problem with the p-value — P-hacking
P-hacking is manipulating data or analyses to artificially obtain significant p-values; in other words, it is choosing analyses based on what makes the p-value significant rather than on what the best analysis plan would be. In this way scientists work around statistical significance and end up with final results that appear to prove their hypothesis. The p-value is only really valid for a single comparison.
Once you are comparing a whole set of variables, the probability that at least one of them gives you a false positive rises quickly, creating publishable, “significant” results: in other words, a p-hack. Researchers can also make many small decisions about their analysis that push the p-value down. Consider a situation where you analyze your data and find it nearly reaches statistical significance, so you decide to collect just a few more data points to be sure, and you stop collecting as soon as the p-value drops below .05, incorrectly confident that the additional data points could only have made the result more significant and ignoring the possibility that they could have made it less significant. Numerical simulations show that relationships can cross the significance threshold just by adding more data points, even though a much larger sample would show there is really no relationship at all. Furthermore, there are many ways to increase the likelihood of significant results: measuring two dependent variables, adding more observations, controlling for gender, or dropping one of three experimental conditions. Combining these strategies pushes the likelihood of a false positive to over sixty percent, even while nominally using p < 0.05.
(Source — Simmons, Nelson & Simonsohn, 2011, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, Psychological Science)
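To see how fast the multiple-comparisons problem inflates false positives, here is a minimal simulation sketch. Both groups are drawn from the same distribution, so the null hypothesis is true for every measure and any “significant” difference is a false positive; the choice of twenty dependent variables and thirty subjects per group is an assumption made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 10_000  # simulated studies
n_measures = 20         # dependent variables checked in each study
n_per_group = 30        # subjects per group
alpha = 0.05

studies_with_false_positive = 0
for _ in range(n_experiments):
    # Control and treatment come from the SAME distribution: every
    # "significant" measure below is a false positive by construction.
    control = rng.normal(size=(n_measures, n_per_group))
    treatment = rng.normal(size=(n_measures, n_per_group))
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
    if (p_values < alpha).any():
        studies_with_false_positive += 1

rate = studies_with_false_positive / n_experiments
# With 20 independent measures this lands near 1 - 0.95**20, about 64%.
print(f"studies with at least one p < {alpha}: {rate:.0%}")
```

The point is not the exact percentage but the shape of the problem: every comparison that was not planned in advance is another chance for pure noise to cross the .05 threshold.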
How much of the published research is actually false? The naive answer, based on the standard p-value threshold, is 5%: if everyone uses p < .05 as the cut-off for statistical significance, you would expect five of every hundred results to be false positives. Unfortunately, that underestimates the problem. Here is an explanation of the issue by Dr. Derek Muller (from the video “Is Most Published Research Wrong?”).
“Imagine you’re a researcher in a field where there are a thousand hypotheses currently being investigated. Let’s assume that ten percent of them reflect true relationships and the rest are false, but no one of course knows which are which; that’s the whole point of doing the research. Now, assuming the experiments are pretty well designed, they should correctly identify around, say, 80 of the hundred true relationships. This is known as a statistical power of eighty percent, so 20 results are false negatives; perhaps the sample size was too small or the measurements were not sensitive enough. Now consider that, of those 900 false hypotheses, using a p-value of .05, forty-five will be incorrectly considered true. As for the rest, they will be correctly identified as false, but journals rarely publish null results: they make up just ten to thirty percent of papers depending on the field. This means that the papers that eventually get published will include 80 true positive results, 45 false positive results, and maybe 20 true negative results.
Nearly a third of published results will be wrong even with the system working normally. Things get even worse if studies are under-powered, and analysis shows they typically are, if there is a higher ratio of false-to-true hypotheses being tested, or if the researchers are biased…”
“So, recently, researchers in a number of fields have attempted to quantify the problem by replicating some prominent past results. The Reproducibility Project repeated a hundred psychology studies but found only 36% had a statistically significant result the second time around, and the strength of the measured relationships was on average half that of the original studies. An attempted verification of 53 studies considered landmarks in the basic science of cancer only managed to reproduce six, even working closely with the original studies’ authors. These results are even worse…”
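To make the arithmetic of that example explicit, here is a short sketch using the same illustrative numbers (one thousand hypotheses, a 10% base rate of true relationships, 80% power, a .05 threshold, and roughly twenty published null results); these figures are assumptions from the example above, not measured data.

```python
# Illustrative numbers from the example above, not real measurements.
n_hypotheses = 1000
share_true = 0.10     # 10% of tested hypotheses reflect real relationships
power = 0.80          # chance a well-designed study detects a true relationship
alpha = 0.05          # false-positive rate per test at the p < .05 threshold

n_true = share_true * n_hypotheses         # 100 true relationships
n_false = n_hypotheses - n_true            # 900 false ones

true_positives = power * n_true            # 80 correctly detected
false_positives = alpha * n_false          # 45 false alarms
published_negatives = 20                   # only a handful of null results get published

published = true_positives + false_positives + published_negatives
wrong_share = false_positives / published

print(f"published results: {published:.0f}")                            # 145
print(f"share of published results that are wrong: {wrong_share:.0%}")  # about 31%
```

That is where the “nearly a third” figure comes from, before accounting for under-powered studies, unlikelier hypotheses, or researcher bias.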
The American Statistical Association put out a statement clarifying the p-value because, as the ASA’s executive director Ronald L. Wasserstein remarks, “The p-value was never intended to be a substitute for scientific reasoning.” Overemphasis on the p-value often leads to the neglect of other information in studies, such as effect size. In some cases p-hacking is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it. Although p-values are helpful in assessing how incompatible the data are with a specified statistical model, other factors should also be weighed: the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of the assumptions behind the analysis. The p-value is also easily misinterpreted. For example, it is often equated with the strength of a relationship, yet a tiny effect can produce a very low p-value if the sample size is large enough.
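That last point is easy to demonstrate. In the sketch below, the two groups differ by an assumed, practically negligible 0.02 standard deviations (a made-up figure), yet with a million observations per group the p-value comes out vanishingly small.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A practically negligible difference in means: 0.02 standard deviations,
# chosen here purely for illustration.
n = 1_000_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.02, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_b, group_a)
print(f"observed effect: {group_b.mean() - group_a.mean():.3f} standard deviations")
print(f"p-value: {p_value:.1e}")  # tiny p-value despite a trivial effect
```

A low p-value here says only that the effect is probably not exactly zero; it says nothing about whether the effect is large enough to matter.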
Based on his research, Ioannidis lists several interesting corollaries about the probability that a research finding is actually true in “Why Most Published Research Findings Are False”.
Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
Why are researchers lured into p-hacking?
Survival in the scientific community
Science is my passion. We respect scientists for delivering the knowledge that moves humanity forward. So why do some of them fake and manipulate results in the ways described above? In science, being able to publish your results is your ticket to job stability, a higher salary, and prestige. In this quest for positive results, things can go wrong. If you are choosing analyses based on what makes the p-value significant, not on what the best analysis plan is, that is p-hacking.
Scientists design experiments to get the lowest possible p-value at the expense of solid scientific reasoning and investigation, just as some teachers teach to the test. Sometimes, after the experiment is conducted, they select the specific analysis methods that are most likely to produce statistically significant results, and pick which results to include or leave out. There are many reasons that pave the way to these situations.
One of them is that results with statistically significant effects are much more likely to be published. Journals are far more likely to publish results that reach statistical significance, so if one method of data analysis yields a p-value below .05, you are likely to go with that method. Publication is also more likely if the result is novel and unexpected, which encourages researchers to investigate more and more unlikely hypotheses and further decreases the ratio of true to spurious relationships being tested. Two researchers recently found that the number of published studies containing p-values in their abstracts doubled from 1990 to 2014, and of those studies that included a p-value, 96% reported one below .05. Getting published can play a huge role in career advancement, so the drive to get results published, win funding, and advance a career may be the main motive. Scientists are under constant pressure to publish, with tenure and funding on the line, and to get published it helps to have results that seem new and striking.
A researcher puts a lot of heart, time, and effort into a study; now imagine he or she gets a non-significant result overall. That is pretty disappointing, and no one is likely to publish non-results, yet their careers depend on publishing. The following two quotes illustrate the point.
“There is no cost to getting things wrong, the cost is not getting them published!… My success as a scientist depends on me publishing my findings, and I need to publish as frequently as possible in the most prestigious outlets that I can.” — Brian Nosek, PhD
“Replication studies are so rarely funded and so underappreciated that they never get published; no one wants to do them. There’s no reward system in place that enables it to happen, so you just have all of these exploratory studies out there that are taken as fact, as scientific fact, that have never actually been confirmed.” — Elizabeth Iorns, PhD
The issue is that there is currently no practical way of determining whether published results are genuine. What about replication? Science is supposed to self-correct by having other scientists replicate the findings of an initial discovery, and that has happened many times in history. In theory it works, but in practice it is complicated. In one case, three researchers attempted to replicate an experiment and found that the results were nowhere near as significant as originally claimed. When they tried to publish their new findings in the same journal as the original paper, they were rejected because the journal refused to publish replication studies. Sadly, the successful strategy in science is to not even attempt to replicate old studies, because few journals will publish them. On top of that, your results probably won’t be statistically significant anyway, it may be hard to convince colleagues that the original experiment does not reproduce, and if you are unlucky you might even be accused of simply not conducting the experiment correctly. So a far “better” approach is to test novel and unexpected hypotheses and then p-hack your way to a statistically significant result!
In the next article, read about some interesting experiments that show how easily the p-value can be manipulated.