The Economist, the Wall Street Journal and Science have all reported on their front pages that a surprisingly large percentage of scientific results cannot be reproduced. In other words, they are probably wrong. The actual percentage of non-reproducible results in fields such as medicine and psychology is estimated to be more than 50%. There are several causes for this reproducibility crisis, but an important one is the use of the p-value based hypothesis test (‘p < 0.05 means significant result’), a standard method from the 1930s that has several major flaws, yet is still used almost universally. Our Machine Learning group is developing new, more robust testing methods that are much more reliable and that won’t so easily tell you that, for example, a treatment works when in fact it doesn’t. They are also more flexible. Suppose, for example, that you tested 50 patients and got an almost-but-not-quite significant result – the treatment may work, but you are not sure yet. Your boss tells you there is money available to test additional patients. With standard p-value based testing you cannot do this, because you have to specify in advance how many data points you will collect. With the new methods, a researcher can keep gathering data for as long as required.
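The contrast can be made concrete with a small simulation. The sketch below is only an illustration of the general idea, not the group's actual method: it uses coin flips under a true null hypothesis, a z-test p-value checked after every observation (the "peeking" that classical testing forbids), and a simple likelihood-ratio e-process, for which Ville's inequality guarantees that rejecting whenever E ≥ 1/α keeps the false rejection rate below α no matter when you stop. All numbers (200 flips, 2000 trials, the p = 0.6 alternative) are arbitrary choices for the demo.

```python
import random
from math import erf, sqrt

random.seed(1)
ALPHA, N, TRIALS = 0.05, 200, 2000

def z_p_value(heads, n):
    """Two-sided z-test p-value for 'the coin is fair' after n flips."""
    z = (heads - 0.5 * n) / sqrt(0.25 * n)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p_rejects = e_rejects = 0
for _ in range(TRIALS):
    heads, e = 0, 1.0
    p_hit = e_hit = False
    for n in range(1, N + 1):
        x = random.random() < 0.5       # the null is true: a fair coin
        heads += x
        e *= 1.2 if x else 0.8          # likelihood ratio vs. alternative p = 0.6
        if z_p_value(heads, n) < ALPHA:
            p_hit = True                # naive peeking: reject as soon as p < 0.05
        if e >= 1 / ALPHA:
            e_hit = True                # e-process: reject whenever E >= 1/alpha
    p_rejects += p_hit
    e_rejects += e_hit

p_rate, e_rate = p_rejects / TRIALS, e_rejects / TRIALS
print(f"false rejection rate, p-value with peeking: {p_rate:.3f}")
print(f"false rejection rate, e-value with peeking: {e_rate:.3f}")
```

Run as written, the peeking p-value test rejects a true null far more often than the nominal 5%, while the e-process stays below 5% even though it, too, is checked after every single data point – which is exactly the flexibility described above.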
This research is relevant whenever data are scarce. Even in our big data age this happens all the time – consider drug testing (for ethical and financial reasons, only a limited number of subjects can be tested), but also, for example, trying to infer whether some rare form of cancer occurs more often in the vicinity of a particular industrial plant. If 50 people in the Netherlands have the disease, 15 of whom live near the plant, does this indicate that something is wrong? Or not? This research can thus be relevant for applied scientists, but also for government agencies.
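To see what such a question involves, here is the classical calculation one might start from – not the group's e-value methodology, just textbook statistics for intuition. The 20% figure for the fraction of the population living near the plant is a hypothetical number invented for this sketch; with it, the question becomes how surprising 15 or more cases out of 50 would be if living near the plant had no effect.

```python
from math import comb

n, k = 50, 15     # 50 cases nationwide, 15 of them near the plant
p_near = 0.20     # hypothetical: 20% of the population lives near the plant

# One-sided binomial tail: the chance of 15 or more cases falling near
# the plant purely by chance, if residence has no effect on disease risk.
p_value = sum(comb(n, i) * p_near**i * (1 - p_near)**(n - i)
              for i in range(k, n + 1))
print(f"P(X >= {k}) under Binomial({n}, {p_near}) = {p_value:.4f}")
```

With these made-up numbers the tail probability lands in the ambiguous few-percent range – suggestive, but not conclusive – which is precisely the regime where being able to collect more data without invalidating the test matters.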
We have developed new testing methods that allow data gathering to continue as long as required, and that have proven optimality properties that are much stronger than those available with the standard hypothesis test. In the past we have also provided advice about statistical issues in lawsuits. For example, whenever a patient died in a particular hospital ward, the same nurse was always on duty – this happened seven times within one year. Could it be a coincidence? (In the particular case we were involved in, it definitely could). We are also more generally active in societal aspects of probability and statistics. For example, we gave crash-courses on statistics and probability to public prosecutors and judges. We were also involved in the successful effort to establish a more sophisticated and fairer system for allocation of pupils to secondary schools for which there is more demand than capacity – the previously used system was essentially an all-or-nothing lottery.
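Why the nurse coincidence "definitely could" happen by chance comes down to a base-rate argument, which a back-of-the-envelope calculation makes vivid. All figures below are invented for illustration and are not the numbers from the actual case: suppose a full-time nurse on a small ward covers about a third of all shifts, and that a country has on the order of 5000 such ward nurses. Then the expected number of nurses who happen to be on duty at all seven incidents, with no wrongdoing anywhere, is not small at all.

```python
# Hypothetical figures for illustration only; not taken from the actual case.
f = 1 / 3           # assumed fraction of shifts one full-time nurse covers
incidents = 7       # seven deaths within one year, as in the anecdote
nurses = 5000       # assumed order of magnitude of ward nurses in a country

p_one_nurse = f ** incidents          # chance one given nurse is on duty every time
expected_matches = nurses * p_one_nurse

print(f"probability for one given nurse: {p_one_nurse:.2e}")
print(f"expected purely coincidental matches nationwide: {expected_matches:.2f}")
```

Even though the per-nurse probability is tiny, multiplying it by the number of nurses being implicitly "monitored" yields an expectation above one – so somewhere, some nurse matching all seven incidents is roughly what chance alone predicts.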
Contact person: Peter Grünwald
Research group: Machine Learning (ML)
Research partners: Leiden University, University of Amsterdam