I gave a workshop on preregistration to honours students last Friday and mentioned that preregistration provides evidence that you created your hypothesis in advance of seeing the data. The students naturally asked how to prove that the data were collected after the preregistration. I pointed out that one can’t even prove that the data are real. At least, nobody has put together an accepted method for this.
I have advocated for research funders and universities to create infrastructure for data chain-of-custody records to solve this, much as major camera companies are building trusted timestamping and the like into their camera hardware to certify that their photos aren’t deepfakes.
I haven’t heard of any movement on this in my field of psychology, where human data is collected. Some manufacturers of Internet-connected biological instruments, I believe, have their machines automatically create records and send them to a manufacturer-managed database.
The lack of action on this is galling, even by those with plenty of resources. Harvard, hello – don’t you want to protect your reputation from the damage done by the several fraudsters that, as more and more of us now know, you hired and promoted?
So now the flood has begun: AI-written, AI-fabricated-data papers contaminating science, making fraud much easier and so surely greatly increasing its rate. At least the journal publishers have a financial interest in doing something about this, but I don’t see them getting their act together to create real infrastructure for this.
So I predict that in the short- to medium-term, it will be a case of rich-get-richer, where major Western universities that people/publishers trust more will get a pass, while those who (sometimes justifiably) are seen as more questionable (e.g., researchers at medical schools in China) due to high apparent past rates of fraud will be increasingly discriminated against. Merton’s norm of universalism (science is judged by the work itself, not by the reputation of the scientist) will rapidly be undermined.
Much has been said about how expensive academic journals are. Large companies like Elsevier, Sage, Springer Nature, Taylor & Francis, and Wiley publish most of the major journals, and their shareholders pocket much of the “rent” they receive thanks to academics’ labor.
There are alternatives. One of them is based on Wikipedia, whose process for vetting information is more transparent than that of most journals. The back-and-forth between authors and other Wikipedia volunteers that results in changes to Wikipedia is right there in the talk pages, available as it is happening, and anyone can chime in. Contrast this with academic journals, which are largely a closed shop.
To be fair, while the “shops” may be closed, they do have more windows than they used to. Many journals have come out from behind the paywalls, and now practice more accountability, such as by indicating which editor handled each article, and by having a policy on editors publishing in their own journal. To their credit, the Association for Psychological Science journals, for example, have long had a policy that when an editor or associate editor submits to their own journal, the review process for their article is managed by an external guest editor to avoid conflicts of interest. When I was an associate editor several years ago at one APS journal (AMPPS), this is what we did. I recently realized that not all APS editors are aware of their own policy, however, and that sort of forgetting is another example of why keeping the windows open, so that we can see what is happening inside, is important.
Photo: Public domain.
As part of the open windows principle, we should also expect journals to produce evidence that they effectively evaluate submissions for whether they are scientifically sound. Now, if asked how we can be confident that they are publishing quality scholarship, most journal editors would point to peer review. When asked to produce examples, however, they’d have to say something like “Can’t do that! Peer review reports are confidential.”
This “you’ll just have to trust us” type of situation is ironic for a class of people who long have held skepticism to be critical to what they do. And for me as an acculturated academic, I confess it almost feels like a betrayal to state this as plainly as I have. I imagine colleagues trying to push aside the point, with responses like “Alex, you know we try hard to get good peer reviewers, besides, in the end, science yields things that work, so your point is misleading.”
Photo: Public domain.
I actually agree that science works on average, but often readers need to know whether there is much reason to have confidence in particular papers. Fortunately not all editors are so defensive that they cannot acknowledge this point. It took time, but by about a decade or so ago, a bunch of journal editors had freed the peer review reports from the confines of their password-protected journal management systems, allowing anyone to read them. Finally, readers had direct evidence of how well a journal is actually vetting its articles. Just as importantly, readers no longer had to rely on the overall journal reputation to make a guess about the process undergone by an individual article – they could actually see the peer review reports for an article they were interested in.
While the processes happening inside journals had to be dragged into the open, Wikipedia and its associated projects have always had openness baked in.
One project associated with Wikipedia is the WikiJournal of Science. This is a proper scholarly journal, one indexed by mainstream publication databases such as the Scopus database maintained by Elsevier. But unlike a conventional journal, most of the peer review process at the WikiJournal of Science happens in the open from the beginning. It’s all in the “Discuss” page that sits alongside each article.
In another convergence with mainstream journals, four years ago the prestigious eLife journal announced that they would only review manuscripts that had already been published elsewhere as a preprint, as part of their “long-term plan to create a system of curation around preprints that replaces journal titles as the primary indicator of a paper’s perceived quality and impact”. This has always been the preferred route for the WikiJournal of Science – manuscripts ideally are submitted by linking to a publicly-available preprint.
I’ve been an associate editor for the WikiJournal of Science for a year or so. One manuscript I handled reported a study suggesting that geckos spontaneously “play” by running in running wheels. As the editor, I was pleased to have the opportunity to usher in new knowledge about these gravity-defying reptiles.
The Australian house gecko. CC-BY me.
One of my first jobs was to email several experts on animal play to ask them to review the manuscript, which the author had posted as a preprint on WikiJournal Preprints. Two agreed, and after receiving the peer reviews, I posted them on the preprint’s Discuss page where, if anyone else were moved to do so, they could also comment. The author responded to the reviewers’ comments, and those responses also can be seen on the Discuss page. Much of the reviewing process, then, works like a conventional journal, just more transparently and able to appear in real time.
When I edited that manuscript, I had no scientific knowledge of animal play (moreover, I had consistently resisted our dog’s offers to give me real-world experience).
Hugo. More attractive dogs abound, but yeah, you can use this photo if you want.
It would have been nice if we had had a more knowledgeable editor for the gecko manuscript, but we’re currently spread pretty thin in the editorial department. That’s one reason for this post (apply to be an editor! You don’t need to know anything about geckos!).
As the 🦎 example illustrates, like a conventional academic journal, we publish original research at the WikiJournal of Science. But the most common use of the journal is for academic peer review of articles that are intended for Wikipedia itself, and these typically don’t include original research. Before I joined the journal as an editor, for example, I saw that the Wikipedia article for “multiple object tracking” was a bit spotty in its coverage. Unsurprising, of course, as it’s quite an obscure topic. But because I had just written a short book on object tracking, I considered myself well-placed to write a more comprehensive Wikipedia entry. The eventual article I wrote was based on my book, together with others’ publications, so it didn’t count as the type of original research that is prohibited by Wikipedia.
I submitted my draft Wikipedia article to the WikiJournal, and it eventually passed peer review. As a result, the editor replaced the existing Wikipedia entry with my article. This was quite satisfying – given how widely Wikipedia is used, my contributions to this obscure topic are probably now much more influential than if they had remained confined to academic journals and my book.
A nice aspect of the WikiJournal of Science is that part of the revision process occurs almost instantaneously, thanks to its wiki infrastructure. As I read through a submission, I typically make small edits on the preprint itself to improve the language, just as many Wikipedians do when they come across a Wikipedia entry they are interested in. The reviewers of the manuscript are able to do the same thing. The author is not obliged to keep these edits, of course; they can revert them and explain why in their response letter.
This really should be seen as basic functionality, as it is similar to the nearly universally-used Track Changes in Word or Google Docs. But despite most of us collaborating on documents in that way for decades now, most academic journals still don’t have this functionality.
Reviewers and editors at traditional journals typically aren’t able to enter the journal’s system and directly make suggestions on the manuscript. Instead, they write their comments in a separate, standalone document or form. This lack of functionality for scholarly communication is one illustration of how little the scientific community has gotten for the billions of dollars that they have been paying to publishers each year (the previous link is for APCs alone; it doesn’t even include the subscription payments, and the free peer reviewing that academics do).
The unwieldiness of journals’ systems is not because corporations generally don’t deliver good products or continually improve their service; many do. But academic journals are not part of a functioning economic market. In the dreamworld of a functioning market for scholarly communication, the journal that provides the best service and features would win the most market share. In the world we actually live in, the owners of the journals (who are sometimes the publisher, and sometimes a scientific society) simply wait for submissions from the researchers. They know that researchers will stick with the journals that have the highest impact factors in their field, which then results in those journals maintaining a high impact factor, with little effect of the fees charged or the quality of the services provided.
I think that all of this means that you should support diamond open access journals in general, not just the WikiJournal of Science.
Diamond open access journals are those that are free to read and to publish in. They typically use open source software (the wiki infrastructure in the case of the WikiJournal of Science, and Open Journal Systems for thousands of other diamond OA journals) hosted by a nonprofit institution, such as a university. The open source software does tend to be more klunky than the big publishers’ systems, which does mean it’s more annoying for the academic editors involved. But the alternative, the tradeoff of letting corporate publishers handle things in return for billions of dollars and a corruption of academic values, is an even worse deal.
But why should you, an individual scholar, have to do something about this? The primary way that scientists in a field come together to get things done (aside from doing science, reviewing, and editing itself) is through scholarly societies. Scientific societies were designed to serve scientists’ interests. They should be leading the way to reducing dependence on corporate publishers and creating diamond OA journals.
But many scientific societies have been captured by their publishers. Here’s how it happened. As part of a contract giving a publisher the right to publish the society’s journal, the publisher provides the society with a payment. Over the years, this payment rose, reflecting the steady increase in subscription and/or APC fees. While the payment is only a small fraction of how much the journal makes (otherwise the publisher wouldn’t have the high profits that they do), it’s a substantial amount of money for a scientific society, and quite a high percentage in the current era of declining in-person conference fees. Societies pay much of their staff salaries off of this, and many hired more staff with this money. For many societies, these staff end up making most of the society’s decisions, or advising the academics who ostensibly make the decisions but offer little resistance. As the staff’s jobs depend on maintaining the society’s revenue, giving up the publishing income is a non-starter. This dynamic has played out even at some of the most respected and active scientific societies, as we recently learned in the case of the American Association for the Advancement of Science.
Within psychology, the Association for Psychological Science (APS) is another example. Six months ago, APS suddenly announced they were starting a new journal, with no evidence of consultation with academics. Indeed, the announcement was strangely light on details of why they were starting a journal and what the vision for it was. So I wasn’t the only one who suspected this was concocted simply to create a new revenue stream.
Yesterday, I did some digging. The publisher used by APS, Sage, maintains a spreadsheet with their list of publication fees (APCs) for the open access journals they publish. Advances in Psychological Science Open is now in that list, just below Advances in Methods and Practices in Psychological Science, formerly APS’ only fully open access journal. The price to publish in the new journal? Two grand and five bills!
That APC (Article Processing Charge) of $2500 is $1500 more than that for APS’ better-established journal (AMPPS).
In short, APS is starting an expensive journal that has little to no buy-in from the community (judging from social media) and hoping that demand for the prestige of the APS brand, combined with the reject-and-refer system developed by PLOS, and perfected by Nature Publishing, will bring the money rolling in.
If you’re a tenured academic, you shouldn’t be editing for journals like that!
I better re-phrase that. Because admittedly, I myself took an editorial stipend from APS, first at Perspectives on Psychological Science over ten years ago when some of us started the Registered Replication Report format there, and subsequently when we co-founded the journal Advances in Methods and Practices in Psychological Science.
Here’s my rewrite: if you are a tenured academic, you should be devoting a bunch of your time to cultivating alternatives to the usual money-sucking journal racket.
Over at freejournals.org, we highlight quality diamond OA journals and we diamond OA editors try to support each other. So here I am, trying to promote this. While not many people read this blog, a lot of people are occasionally forced to read emails from me (simply because I am a more-or-less tenured academic). Therefore, I have changed my email signature: I now advertise the diamond OA initiatives that I am most involved in.
My email signature, some of the time.
And now it is time for me to turn to other activities for avoiding the news.
Postscript. Perhaps the biggest challenge facing the WikiJournal of Science is our high liability insurance bill (for things like defamation suits); my colleagues have contacted dozens of insurers but none would give us a lower bill. And that was before Elon Musk started threatening Wikipedia! If you think you can help us, please get in touch.
Don’t allow your writing to be tied to one platform – register your science-related blog with Rogue Scholar, the free blog indexing service helping bring science blogs into scholarly database infrastructure.
I checked with Martin Fenner, who created and runs Rogue Scholar, and he said it works fine with non-paywalled Substack blog posts, I think because the full text of free posts is provided by Substack in the RSS feed… I believe Rogue Scholar needs the full text partly to find some of the metadata needed to populate scholarly databases.
I should say, however, that I found the registration page difficult to navigate and needed Martin’s help to register my blog. Turns out that this is because the site was recently re-worked and some parts still rely on the legacy codebase, causing some of the site’s internal links to be confusing. Growing pains are to be expected, however, especially for a free project, one that I believe is very much worth your support!
I confess that I am an experiment chauvinist – I look down on studies that are purely observational, studies that don’t manipulate anything. Where does my prejudice come from? One factor is that as a perceptual and cognitive psychologist, when I do science, I’m usually interested in the causes, or underlying mechanisms, of a phenomenon. For the phenomena that I’m interested in, typically one can easily do a controlled experiment that allows one to infer a cause of the phenomenon.
For many aspects of the universe that humans are interested in understanding, experiments are often not feasible, and sometimes wouldn’t even be appropriate to achieve the sort of knowledge researchers are after. Below, for a class I teach called “Good science, bad science”, I tried to get beyond my provincial experiment-centrism to explain to students the value of studies that make observations but don’t manipulate anything.
—
Most sciences advance through a combination of observational and experimental studies, often done by different researchers. For example, in medicine, treatments for diseases are usually best tested with experiments, for example with half of a group of patients randomly assigned to one treatment and half assigned to another. However, observational studies, where the researchers don’t actually manipulate anything, can also be critically important to advancing knowledge.
Decades of study of health records found that people who exercise more have less heart disease. Because this was an observational study, however, the lower heart disease rates in those who exercise a lot might have been due to confounding factors. Perhaps only people who start out without chronic diseases are able to exercise much, or perhaps people who live in rural areas with cleaner air are more likely to do outdoor activities, which often involve exercise. So the lower incidence of heart disease might have been due to breathing cleaner air, not to getting more exercise.
An experiment in which a random half of the study participants are assigned to exercise, and the other half are not, helped resolve the debate, because random assignment ensures that, on average, there will be no confounding difference between the groups, such as living in a place with cleaner air. However, doing experiments with people is often more difficult, and much more expensive, than simply observing them. If everything goes well (e.g., those assigned to the exercise group actually do the exercises, and those assigned to the other group don’t), one may be able to safely conclude that exercise reduces the chance of heart disease. There are problems, however, in generalizing this to the real world, where very few people may actually exercise as intensively, as frequently, or in the same way as those who followed the exercise protocol of the experiment.
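For students who like to see the mechanics, here is a minimal simulation sketch of why random assignment works (the population, the confounder, and all the numbers are invented for illustration): even though clean air is never measured or controlled, randomly assigning people to the exercise program balances it across the two groups on average.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: each person either lives somewhere with
# clean air or doesn't. In observational data this confounder could
# be correlated with exercising; here we never measure or control it.
population = [{"clean_air": random.random() < 0.5} for _ in range(10_000)]

# Random assignment to the exercise program ignores the confounder...
for person in population:
    person["exercise"] = random.random() < 0.5

exercisers = [p for p in population if p["exercise"]]
controls = [p for p in population if not p["exercise"]]

def rate(group):
    """Proportion of a group living with clean air."""
    return statistics.mean(p["clean_air"] for p in group)

# ...so, on average, the two groups end up balanced on it.
print(f"clean-air rate, exercise group: {rate(exercisers):.3f}")
print(f"clean-air rate, control group:  {rate(controls):.3f}")
```

The two printed rates come out nearly identical, which is the whole point: whatever confounders exist, measured or not, randomization spreads them evenly across the groups on average.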
Often, neither observational studies nor experiments fully answer a research question by themselves, but when they both point to the same conclusion, we can justifiably be very confident in that conclusion.
Some fields of research, such as astronomy, are almost entirely observational. For many thousands of years, people have speculated about what causes the motion of the stars and planets. Various hypotheses were invented, hypotheses which could never be tested with experiments, because people were never able to change the movements of celestial bodies. However, by amassing a very large set of observations, people made progress by revising their theories so that they could explain more and more of the observations.
Tycho Brahe’s observatory, which also extended underground. Image: public domain.
Johannes Kepler was uncompromising in his quest to explain the precise observations of the movements of the planets that had been made by Tycho Brahe. In his dogged attempts to fit the data, Kepler came up with the idea that the planets followed elliptical orbits. This explained Brahe’s observations better than the circular-orbits version of heliocentrism, contributing to heliocentrism’s eventual triumph.
In biology, ideas about the origins of plants and animals came about almost entirely through considering observations. From the meticulous records of a long line of European naturalists, Darwin knew of many thousands of observations regarding various plants and animals. When combined with his own observations during the voyage of the Beagle, including in then-remote (to Europeans) places like Australia, Darwin formulated his theory of “descent with modification”, now known simply as “the theory of evolution”.
An illustration by G.R. Waterhouse of a native rat that Darwin and he caught in southwest Australia and documented for European science. Image public domain.
The concept of reproducibility and replication, a focus of this class, can be more complicated for observational sciences than for the experimental sciences. If subsequent researchers wanted to confirm that Australia had a rat species that really looked the way Waterhouse and Darwin had illustrated, they could go to southwest Australia and set out a trap with cheese as Darwin had done. But even if they put the trap in exactly the same location, they would be unlikely to end up with the exact same rat in their trap, and because the local population of those rats may have shifted locations, they might not catch any at all. Because the world is always changing, it can be hard to know whether a difference in observations should cast much doubt on a previous study.
Sometimes one can build replication into the initial effort to make observations. For example, when an important event is predicted to occur in astronomy, researchers arrange for multiple telescopes around the globe to collect observations near-simultaneously. That way, if one telescope yields different results than the others, the researchers will know that they should investigate whether it was functioning correctly before trusting its observations.
I’ll never forget that email. It was 2016, and I had been helping psychology researchers design studies that, I hoped, would replicate important and previously published findings. As part of a replication-study initiative that I and the other editors had set up at the journal Perspectives on Psychological Science, dozens of labs around the world would collect new data to provide a much larger dataset than that of the original studies.
With the replication crisis in full swing, we knew that data dredging and other inappropriate research practices meant that some of the original studies were unlikely to replicate. But we also thought our wide-scale replication effort would confirm some important findings. Upon receiving the “this is not credible” message, however, I began to be haunted by another possibility — that at least one of those landmark studies was a fraud.
The study in question was reminiscent of many published in high-impact journals in the mid-2010s. It indicated that people’s mood or behavior could be shifted a surprising amount by a subtle manipulation. The study had found that people became happier when they described a previous positive experience in a verb tense suggesting an ongoing experience — rather than one set firmly in the past. Unfortunately for psychology’s reputation, social-priming studies like that had been falling like a house of cards, and our replication failed, too. In response, the researchers behind the original study submitted a new experiment that appeared to shore up their original findings. With their commentary, the researchers provided the raw data for the new study, which was unusual at the time, but it was our policy to require it. This was critical to what happened next.
One scholar involved in the replication attempt had a close look at the Excel spreadsheet containing the new data. The spreadsheet had nearly 200 rows, one for each person who had supposedly participated in the experiment. But the responses of around 70 of them appeared to be exact duplicates of other people in the dataset. When the duplicates were removed, the main result was no longer statistically significant.
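The check that caught the problem is simple enough to show in a few lines. Here is a sketch in Python with made-up toy data (the real spreadsheet had nearly 200 rows): it flags any row whose responses exactly match an earlier row’s, ignoring the participant ID.

```python
# Toy data: (participant ID, responses...). Invented for illustration.
rows = [
    ("p1", 4, 3, 5),
    ("p2", 2, 1, 4),
    ("p3", 4, 3, 5),  # identical responses to p1
    ("p4", 5, 2, 1),
    ("p5", 2, 1, 4),  # identical responses to p2
    ("p6", 4, 3, 5),  # identical responses to p1
]

# Map each unique response pattern to the first participant who gave
# it, and flag later participants whose pattern is an exact duplicate.
seen = {}
duplicates = []
for pid, *responses in rows:
    key = tuple(responses)
    if key in seen:
        duplicates.append((pid, seen[key]))
    else:
        seen[key] = pid

print(duplicates)  # → [('p3', 'p1'), ('p5', 'p2'), ('p6', 'p1')]
```

With only three questions, chance duplicates are plausible; with dozens of responses per row, as in a real study, exact duplication on the scale of 70 rows out of 200 is a red flag.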
After thanking the scholar who had caught the problem, I pointed out the data duplication to the researchers behind the original study. They apologized for what they described as an innocent data-processing mistake. Then, rather conveniently, they discovered some additional data they said they had accidentally omitted. With that data added in, the result was statistically significant again. By this point, the scholar who had caught the duplication had had enough. The new data, and possibly the old, were no longer credible.
I conducted my own investigation of the Excel data. I confirmed the irregularities and found even more inconsistencies when I examined the raw data exactly as downloaded from the online service used to run the study. The other journal editors and I still didn’t believe that the reason for the irregularities was fraud — all along, the researchers behind the original study had seemed very nice and were very obliging about our data requests — but we decided that we shouldn’t publish the commentary that accompanied the questionable new data. We also reported them to their university’s research-integrity office. After an investigation, the university found that the data associated with the original study had been altered in strategic ways by a graduate student who had also produced the data for the new study. The case was closed, and the paper was retracted, but the cost had been substantial, involving thousands of hours of work by dozens of people involved in the replication, the university investigators, and at least one harried journal editor (me).
More recently, two high-profile psychology researchers, Francesca Gino of Harvard and Dan Ariely of Duke, faced questions about their published findings. The data in Excel files they have provided show patterns that seem unlikely to have occurred without inappropriate manipulation of the numbers. Indeed, one of Ariely’s Excel files contains signs of the sort of data duplication that occurred with the project I handled back in 2016.
Ariely and Gino both maintain that they never engaged in any research misconduct. They have suggested that unidentified others among their collaborators are at fault. Well, wouldn’t it be nice, for them and for all of us, if they could prove their innocence? For now, a cloud of suspicion hangs over both them and their co-authors. As the news has spread and the questions have remained unresolved, the cloud has grown to encompass other papers that Ariely and Gino were involved in, for which clear data records have not yet been produced. Perhaps as much to defend their own reputations as to clean up the scientific record, Gino’s collaborators have launched a project to forensically examine more than 100 of the papers that she has co-authored. This vast reallocation of academic expertise and university resources could, in a better system, be avoided.
How? Researchers need a record-keeping system that indicates who did what and when. I have been using Git to do this for more than a decade. The standard tool of professional software developers, Git allows me to manage my psychology-experiment code, analysis code, and data, and provides a complete digital paper trail. When I run an experiment, the data are recorded with information about the date, time, and host computer. The lines of code I write in R to do my analysis are also logged. An associated website, GitHub, stores all of those records and allows anyone to see them. If someone else in my lab contributes data or analysis, they and their contributions are also logged. Sometimes I even write up the resulting paper through this system, embedding analysis code within it, with every data point and statistic in the final manuscript traceable back to its origin.
My system is not 100 percent secure, but it does make research misconduct much more difficult. Deleting inconvenient data points would be detectable. Moreover, if trusted timestamping is used, the log of file changes is practically unimpeachable. Git is not easy to learn, but the basic concept of “version history” is today part of Microsoft Word, Google Docs, and other popular software and systems. Colleges and universities should ensure that whatever software their researchers use keeps good records of what the researchers do with their files.
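The core of such a paper trail is nothing exotic. A cryptographic hash of a data file, recorded at collection time (and, better still, posted publicly or sent to a timestamping service), makes later alterations detectable, because any change to the file changes the digest. Here is a minimal sketch in Python’s standard library, with an invented data file for illustration:

```python
import hashlib
from datetime import datetime, timezone

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest; any change to the data changes it."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical raw data, exactly as it came off the experiment computer.
original = b"participant,rt_ms\n1,432\n2,518\n"

# Record the digest (and when it was logged) at collection time.
# Posting the digest publicly, or to a timestamping service, makes
# the record trustworthy even if the researcher's own disk isn't.
record = {
    "digest": fingerprint(original),
    "logged_at": datetime.now(timezone.utc).isoformat(),
}

# Later, quietly deleting an inconvenient data point is detectable:
altered = b"participant,rt_ms\n1,432\n"
print(record["digest"] == fingerprint(original))  # → True
print(record["digest"] == fingerprint(altered))   # → False
```

This is essentially what Git does for every commit, just stripped down to a single file; Git adds the who, the when, and the full history of changes on top.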
While enabling more recording of version history would be only a small step, it could go a long way. The Excel files that Gino and Ariely have provided have little to no embedded records indicating what changes were made and when. That’s not surprising — their Excel files were created years ago, before Excel could record a version history. Even today, however, with its default setting, Excel deletes from its record any changes older than 30 days. Higher-ed institutions should set their enterprise Excel installations to never delete their version histories. This should also be done for other software that researchers commonly use.
Forensic data sleuthing has found that a worrying number of papers published today contain major errors, if not outright fraud. When the anesthesiologist John Carlisle scrutinized work submitted to the journal he edited, Anaesthesia, he found that of 526 submitted trials, 73 (14 percent) had what seemed to be false data, and 43 (8 percent) were so flawed they would probably be retracted if their data flaws became public (he termed these “zombie” trials). Carlisle’s findings suggest that the literature in some fields is rapidly becoming littered with erroneous and even falsified results. Fortunately, the same record-keeping that allows one to conduct an audit in cases of fraud can also help colleges, universities, journals, and researchers prevent errors in the first place.
Errors will always occur, but they are less likely to cause long-lasting damage if someone can check for them, whether that’s a conscientious member of the research team, a reviewer, or another researcher interested in the published paper. To better check the chain of calculations associated with a scientific claim, more researchers should be writing their articles in a system that can embed code, so that the calculations behind each statistic and point on a plot can be checked. These are sometimes called “executable articles” because pressing a button executes code that can use the original data to regenerate the statistics and figures.
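The underlying idea of an executable article can be sketched in a few lines of Python (the dataset and the numbers here are invented): the reported statistics are generated from the data at the press of a button, rather than copied in by hand, so each one is traceable to its source and transcription errors are impossible.

```python
import statistics

# Toy dataset standing in for a study's raw measurements.
reaction_times = [432, 518, 467, 495, 441, 503]

mean_rt = statistics.mean(reaction_times)
sd_rt = statistics.stdev(reaction_times)

# The manuscript sentence is generated from the data, so every
# reported number is traceable to the code and data that produced it.
sentence = (
    f"Mean reaction time was {mean_rt:.1f} ms "
    f"(SD = {sd_rt:.1f} ms, N = {len(reaction_times)})."
)
print(sentence)  # → Mean reaction time was 476.0 ms (SD = 34.9 ms, N = 6).
```

If a data point is later corrected, re-running the document regenerates every statistic and figure, which is exactly the audit trail that hand-typed numbers in a Word manuscript can never provide.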
Scholars don’t need to develop such systems from scratch. A number of services have sprung up to help those of us who are not seasoned programmers. A cloud service called Code Ocean facilitates the creation of executable papers, preserving the software environment originally used so that the code still executes years later. Websites called Overleaf and Authorea help researchers create such documents collaboratively rather than leaving it all on one researcher’s computer. The biology journal eLife has used a technology called Stencila to permit researchers to write executable papers with live code, allowing a paper’s readers to adjust the parameters of an analysis or simulation and see how that changes its results.
Universities and colleges, in contrast, have generally done very little to address fraud and errors. When I was a Ph.D. student in psychology at Harvard, there were two professors on the faculty who were later accused of fraud. One of them owned up to the fraud and helped get her work retracted. The other, Marc Hauser, “lawyered up” and fought the accusations, but nevertheless he was found by Harvard to have committed scientific misconduct (the U.S. Office of Research Integrity also found him to have fabricated data).
As a result, Harvard had more than a decade after the findings of serious fraud by two of its faculty members to prepare for, and try to prevent, future misconduct. When news of the Gino scandal broke, I was shocked to learn how little Harvard seemed to have improved its policies. Indeed, Harvard scrambled to rewrite its misconduct policies in the wake of the new allegations, opening up the university to accusations of unfair process, and to Gino’s $25-million lawsuit.
The problems go well beyond Harvard or Duke or even the field of psychology. Not long after John Carlisle reported his alarming findings from clinical-trial datasets in anesthesiology, a longtime editor of the prestigious BMJ (formerly the British Medical Journal) suggested that it was time to assume health research is fraudulent until proven otherwise. Today, a number of signs suggest that the problems have only worsened.
Marc Tessier-Lavigne is a prominent neuroscientist and was, until recently, president of Stanford University. He had to resign after evidence emerged of “apparent manipulation of research data by others” in several papers that came from his lab — but not until after many months of dogged reporting by the Stanford student newspaper. Elsewhere in the Golden State, the University of Southern California is investigating the star neuroscientist Berislav Zlokovic over accusations of doctored data in dozens of papers, some of which led to drug trials in progress.
In biology labs like those of Tessier-Lavigne and Zlokovic, the data associated with a scientific paper often include not only numbers but also images from gel electrophoresis or microscopy. An end-to-end chain of certified data provenance there presents a greater challenge than in psychology, where everything involved in an experiment may be in the domain of software. To chronicle a study, laboratory machines and microscopes need to record data in a usable, timestamped format, and must be linked into an easy-to-follow laboratory notebook.
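What such an end-to-end record might look like in software is easy to sketch. Below is a toy hash-chained log in Python (the field names and scheme are my illustration, not any existing instrument standard): each record carries a timestamp and the hash of the previous record, so any later edit to an earlier record breaks the chain.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_record(chain, instrument_id, payload):
    """Append a timestamped, hash-linked measurement record.

    Toy chain-of-custody logging: each record stores the hash of the
    previous one, so tampering with any record invalidates the rest.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {
        "instrument": instrument_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    chain.append(record)
    return record

def chain_is_intact(chain):
    """Recompute every hash; any edited record breaks the chain."""
    prev = "0" * 64
    for rec in chain:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if body["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if recomputed != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

A real system would also need trusted timestamping by an independent party, since a fraudster who controls the entire log could simply regenerate it from scratch; the sketch only shows why hash-linking makes silent after-the-fact edits detectable.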
If we want science to be something that society can still trust, we must embrace good data management. The $25 million that Harvard could lose to Gino — while a mere drop in the operating budget — would go far if spent on developing good data-management systems and training researchers in their use. The reputational returns to Harvard, to its scholars, and to academic science in general would repay the investment many times over. It’s time to stop pretending academic fraud isn’t a problem, and to do something about it.
Today the Gates Foundation announced that they will “cease support for individual article publishing fees, known as APCs, and mandate the use of preprints while advocating for their review”. I am excited by this news because over the last couple of decades, it’s been disheartening to see large funders continue to pour money down the throats of high-profit multinational publishers.
In their announcement, the Gates Foundation has recommendations for research funders that include the following:
Invest funding into models that benefit the whole ecosystem and not individual funded researchers.
They also state that funders, and researchers, should support innovative initiatives that facilitate peer review and curation separately from traditional publication.
Diamond OA journals, which are free to authors as well as readers, clearly fit the bill, as do journal-independent review services such as Peer Community In, PreReview, and COAR-Notify. I’m an (unpaid) advisory board member of the Free Journal Network, which supports (and does some light vetting of) diamond OA journals. I’m also an associate editor at the free WikiJournal of Science, at Meta-Psychology, and at the forthcoming Meta-ROR metascience peer-review platform. All of these initiatives are oriented around providing free peer review of preprints.
Such initiatives have had trouble attracting funding, as have preprint servers, despite the enormous benefit preprint servers provide: rapid dissemination of research, much faster than through journals.
Agreements like Germany’s DEAL (and Australia’s planned deal) facilitate publisher lock-in: when funders have an agreement with a publisher, researchers are unfortunately pushed toward high-profit, progress-undermining publishers like Elsevier, because publishing with Elsevier is then free while publishing with more progressive, lower-cost publishers may not be. That is why my favorite episode in the history of such negotiations is the extended periods when German and Californian universities did not have access to Elsevier publications, pushing them away from Elsevier rather than toward it. As Björn Brembs and I wrote in 2017, the best DEAL is no deal. And as an Australian colleague was quoted saying, the proposed agreement with Elsevier would “enshrine a national debt to wealthy international publishers, who were likely to tack on hefty increases once an agreement was reached.”
To evaluate and build on previous findings, a researcher sometimes needs to know exactly what was done before.
Computational reproducibility is the ability to take the raw data from a study and re-analyze it to reproduce the final results, including the statistics.
Empirical reproducibility is demonstrated when the study, done again by another team, yields the critical results reported by the original.
Poor computational reproducibility
Economics. Reinhart and Rogoff, two respected Harvard economists, reported in a 2010 paper that growth slows when a country’s debt rises to more than 90% of GDP. Austerity backers in the UK and elsewhere invoked this result many times. When a postgraduate student failed to replicate the result, Reinhart and Rogoff sent him their Excel file: they had unwittingly failed to select the entire list of countries as input to one of their formulas. Fixing this diminished the reported effect, and a variant of the original method yielded the opposite of the result that had been used to justify billions of dollars’ worth of national budget decisions.
A systematic study found that only about 55% of studies could be reproduced, and that’s counting only those for which the raw data were available (Vilhuber, 2018).
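The Reinhart and Rogoff slip belongs to a whole class of spreadsheet errors, truncated ranges, that scripted analyses can catch mechanically. A toy sketch in Python (the growth figures are invented, not the actual dataset): the guard fails loudly if any rows were dropped before averaging.

```python
def mean_growth(values, expected_n):
    """Average with a row-count guard: refuses to run on a truncated
    range, the class of error behind the Reinhart-Rogoff spreadsheet slip."""
    if len(values) != expected_n:
        raise ValueError(f"expected {expected_n} rows, got {len(values)}")
    return sum(values) / len(values)

# Invented growth figures for five hypothetical countries:
growth = [2.4, -0.1, 3.2, 1.5, 2.0]

full_average = mean_growth(growth, expected_n=5)   # all rows present
# mean_growth(growth[:3], expected_n=5)            # would raise ValueError
```

A spreadsheet fails silently when a formula’s range stops short; a scripted analysis can assert the expected number of rows and crash instead of quietly publishing a wrong number.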
Cancer biology. The Reproducibility Project: Cancer Biology found that for none of 51 papers could a complete replication protocol be designed without input from the original authors (Errington, 2019).
Not sharing data or analysis code is common: Ioannidis and colleagues (2009) could fully reproduce only about 2 of 18 microarray-based gene-expression studies, mostly due to lack of complete data sharing.
Artificial intelligence (machine learning). A survey of reinforcement-learning papers found that only about 50% included code, and in a study of publications associated with neural-network recommender systems, only 40% were found to be reproducible (Barber, 2019).
Poor empirical reproducibility
Wet-lab biology. Amgen researchers were shocked when they were only able to replicate 11% of 53 landmark studies in oncology and hematology (Begley and Ellis, 2012).
“I explained that we re-did their experiment 50 times and never got their result. He said they’d done it six times and got this result once, but put it in the paper because it made the best story.” – Begley
A Bayer team reported that only ~25% of published preclinical studies could be validated to the point where projects could continue (Prinz et al., 2011). The most careful effort so far (Errington, 2013) set out to replicate 50 high-impact cancer biology studies but, largely due to poor computational reproducibility and methods sharing, decided only 18 could be fully attempted; it has finished only 14, of which 9 are partial or full successes.
62% of 21 social science experiments published in Science and Nature between 2010 and 2015 replicated, using samples on average five times bigger than the original studies to increase statistical power (Camerer et al., 2018).
61% of 18 laboratory economics experiments successfully replicated (Camerer et al., 2016).
39% of 100 experimental and correlational psychology studies replicated (Nosek et al., 2015).
53% of 51 other psychology studies replicated (Klein et al., 2018; Ebersole et al., 2016; Klein et al., 2014), and ~50% of 176 other psychology studies (Boyce et al., 2023).
Medicine
Trials: data for >50% were never made available, ~50% of outcomes were not reported, and authors’ data are lost at ~7% per year (DeVito et al., 2020).
I list six of the causes of this sad state of affairs in another post.
Ferrari Dacrema, M., Cremonesi, P., & Jannach, D. (2019). Are we really making much progress? A worrying analysis of recent neural recommendation approaches. Proceedings of the 13th ACM Conference on Recommender Systems, 101–109. doi:10.1145/3298689.3347058
Nosek, B. A., Aarts, A. A., Anderson, C. J., Anderson, J. E., Kappes, H. B., & the Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712.
Many of the practices associated with modern science emerged in the early days of the Royal Society of London for Improving Natural Knowledge, which was founded in 1660. Today, it is usually referred to simply as “the Royal Society”. When the Royal Society chose a coat of arms, they included the motto Nullius in verba.
Nullius in verba is usually taken to mean “Take nobody’s word for it”, which was a big departure from tradition. People previously had mostly been told to take certain things completely on faith, such as the proclamations of the clergy and even the writings of Aristotle.
In the early 1600s, René Descartes had written a book urging people to be skeptical of what others claim, no matter who they are.
This caught on in France, even among the public — many people started referring to themselves as Cartesians. Meanwhile in Britain, the ideas of Francis Bacon were becoming influential. His skepticism was less radical than Descartes’, and included many practical suggestions for how knowledge could be advanced.
Bacon’s mix of skepticism with optimism about advancing knowledge using observation led, in 1660, to the founding in London of “a Colledge for the Promoting of Physico-Mathematicall Experimentall Learning”. This became the Royal Society.
The combination of skepticism and the opening up of knowledge advancement to contemporary people, not just traditional authorities, set the stage for the success of modern science. When multiple skeptical researchers take a close look at the evidence behind a new claim and are unable to find major problems with the evidence, everyone can then be more confident in the claim. As the historian David Wootton has put it, “What marks out modern science is not the conduct of experiments”, but rather “the formation of a critical community capable of assessing discoveries and replicating results.”
Taking the disregard of traditional authority further, in the 20th century the sociologist Robert Merton suggested that scientists value universalism. By universalism, Merton meant that in science, claims are evaluated without regard to the sort of person providing the evidence. Evidence is evaluated by scientists, Merton wrote, based on “pre-established impersonal criteria”.
Universalism provides a vision of science that is egalitarian, and universalism is endorsed by large majorities of today’s scientists. However, those who endorse it don’t always follow it in practice. Scientific organizations such as the Royal Society can be elitist. For example, sometimes the scholarly journals that societies publish treat reports by famous researchers with greater deference than those by other researchers.
Placing some trust in authorities (such as famous researchers) is almost unavoidable, because in life we have to make decisions about what to do even when we can’t be completely certain of the facts. In such situations, it can be appropriate to “trust” authorities, believing their proclamations. We don’t have the resources to assess all kinds of scientific evidence ourselves, so we have to look to those who seem to have a track record of making well-justified claims in a particular area. But when it comes to the development of new, cutting-edge knowledge, science thrives on the skepticism that drives the behavior of some researchers.
Together, the values of communalism, skepticism, and a mixture of universalism and elitism shaped the growth of scientific institutions, including the main way in which researchers officially communicated their findings: through academic journals.
In a clever bit of rhetoric, Professor Dorothy Bishop came up with “the four horsemen of irreproducibility”: publication bias, low statistical power, p-hacking, and HARKing. In an attempt at more complete coverage of the causes of the replication crisis, here I’m expanding on Dorothy’s four horsemen by adding two more causes, and using different wording. This gives me six P’s of the replication crisis! Not super-catchy, but I think this is useful.
1. For me, P-hacking was always the first thing that came to mind as a reason that many published results don’t replicate. Ideally, when there is nothing to be found in a comparison (such as no real difference between two groups), with the p < .05 criterion used in many sciences, only 5% of studies will yield a false positive result. However, researchers hoping for a result will try all sorts of analyses to get the p-value below .05, partly because that makes the result much easier to publish. This is p-hacking, and it can greatly elevate the rate of false positives in the literature.
Substantial proportions of psychologists, criminologists, applied linguists and other sorts of researchers admit to p-hacking. Nevertheless, p-hacking may be responsible for only a minority of the failures to successfully replicate previous results. Three of the other p’s below also contribute to the rate of false positives, and while researchers have tried, it’s very hard to sort out their relative importance.
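The inflation from p-hacking is easy to demonstrate by simulation. The sketch below (my own toy Monte Carlo, using a normal approximation to the two-sample t-test) compares the false-positive rate of testing one planned outcome against the rate of testing five outcomes and claiming success if any reaches significance, when in truth there is no effect anywhere.

```python
import random
from statistics import mean, stdev

def t_significant(a, b, crit=1.98):
    """Two-sample t-test for equal group sizes; with n = 50 per group
    the df ~ 98 two-tailed critical value for p < .05 is about 1.98."""
    n = len(a)
    se = ((stdev(a) ** 2 + stdev(b) ** 2) / n) ** 0.5
    return abs(mean(a) - mean(b)) / se > crit

def false_positive_rates(n_sims=2000, n=50, n_outcomes=5, seed=1):
    """Simulate null experiments. 'Honest': test one pre-chosen outcome.
    'Hacked': test several outcomes, report if ANY is significant."""
    rng = random.Random(seed)
    honest = hacked = 0
    for _ in range(n_sims):
        sigs = []
        for _ in range(n_outcomes):
            a = [rng.gauss(0, 1) for _ in range(n)]  # same distribution
            b = [rng.gauss(0, 1) for _ in range(n)]  # for both groups
            sigs.append(t_significant(a, b))
        honest += sigs[0]      # only the planned outcome counts
        hacked += any(sigs)    # cherry-pick across outcomes
    return honest / n_sims, hacked / n_sims
```

Testing one planned outcome comes in near the nominal 5%, while cherry-picking across five independent outcomes pushes the false-positive rate toward 1 − 0.95⁵ ≈ 23%, and real p-hacking offers far more than five forking paths.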
2. Prevarication, which means lying, is unfortunately responsible for some proportion of the positive but false results in the literature. How important is it? That’s very difficult to estimate. Within a psychology laboratory, it is possible to arrange things so that one can measure the rate at which people lie (for example, to win additional money in a study), which helps, but some of the most famous researchers to run such studies have, well, lied about their own findings. And we know that fraudsters work in many research areas, not just dishonesty research. In some areas of human endeavor, regular audits are conducted, but not in science.
3. Publication bias is the tendency of researchers to publish only findings that they find interesting, that were statistically significant, or that confirmed what they expected based on their theoretical perspective. This has colossally distorted reality in some fields, favoring researchers’ pet theories and producing lots of papers about phenomena that may not actually exist. Anecdotally, I have heard about psychology laboratories that used to run a dozen studies every semester and publish only the ones that yielded statistically significant results. For those areas where researchers are always testing for something that truly exists (are there any such fields?), publication bias results in inflated estimates of its size.
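That inflation of effect-size estimates can also be simulated. In the toy sketch below (my own illustration, again with a normal approximation), the true standardized effect is d = 0.2, but averaging only the small studies that happened to reach significance yields a much larger published estimate.

```python
import random
from statistics import mean, stdev

def mean_published_effect(true_d=0.2, n=20, n_sims=5000, crit=2.02, seed=3):
    """Simulate many small studies of a true effect d = true_d, 'publish'
    only the significant ones, and average the published effect sizes."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_sims):
        a = [rng.gauss(true_d, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        pooled_sd = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
        d = (mean(a) - mean(b)) / pooled_sd   # observed Cohen's d
        t = d * (n / 2) ** 0.5                # t statistic, equal groups
        if abs(t) > crit:                     # "significant" -> published
            published.append(d)
    return mean(published)
```

Because a small study can reach significance only when sampling error happens to exaggerate the effect, the average published d here comes out well above the true 0.2, which is one reason literatures built on selective publication overestimate their phenomena.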
4. Low statistical power. Most studies in psychology and neuroscience are underpowered, so even if the hypotheses being investigated are true, the chance that any particular study will yield statistically significant evidence for those hypotheses is small. Thus, researchers are used to studies not working, but to get a publication, they know they need a statistically significant result. This can drive them toward publication bias, as well as p-hacking. It also means that attempts to replicate published results often don’t yield a significant result even when the original result is real, making it difficult to resolve the uncertainty about what is real and what is not.
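How bad “underpowered” typically is can be seen in a few lines. The sketch below (a toy Monte Carlo of my own, normal approximation) estimates the power of a two-group study with 20 participants per group to detect a true effect of d = 0.3, a fairly typical effect size in psychology.

```python
import random
from statistics import mean, stdev

def estimated_power(true_d=0.3, n=20, n_sims=2000, crit=2.02, seed=4):
    """Fraction of simulated studies of a TRUE effect that reach p < .05
    (two-tailed t critical value for df = 38 is about 2.02)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(true_d, 1) for _ in range(n)]  # real effect present
        b = [rng.gauss(0, 1) for _ in range(n)]
        se = ((stdev(a) ** 2 + stdev(b) ** 2) / n) ** 0.5
        if abs(mean(a) - mean(b)) / se > crit:
            hits += 1
    return hits / n_sims
```

In this sketch power comes out in the vicinity of 15%: the large majority of such studies of a perfectly real effect will fail to reach significance, which is exactly the bind that drives researchers toward publication bias and p-hacking.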
5. A particularly perverse practice that has developed in many sciences is pretending you predicted the results in advance. Also known as HARKing (hypothesizing after the results are known), this gives readers much higher confidence in published phenomena and theories than they deserve. Infamously, the psychologist Daryl Bem gave students and fellow researchers the following advice:
There are two possible articles you can write: (1) the article you planned to write when you designed your study or (2) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (2).
If one follows this advice, with every study the goalpost is moved to match the interesting aspects of the data, even though pure chance is often the only cause of those interesting findings. It’s practices like this, together with publication bias and p-hacking, that are believed to be responsible for Bem’s apparent discovery that ESP is real, which he published in a prestigious social psychology journal.
6. Even when a scientific result reflects a true phenomenon rather than being spurious, it can be difficult for subsequent researchers to replicate it. We already ran into this above with the fact that most published studies have low statistical power. Another factor is poor reporting practices (yes, I’m counting this as another ‘p’!). In their papers, researchers often do not describe their study in enough detail for other researchers to be able to duplicate what was done. For example, the Reproducibility Project: Cancer Biology initially aimed to replicate 193 experiments, but none were described in sufficient detail in the original paper to enable the researchers to design protocols to repeat them, and for 32% of the associated papers, the authors never responded to inquiries or declined to share reagents, code, or data.
The six P’s don’t exhaust the reasons for poor reproducibility. Simple errors, for example, are another cause, and such errors are surely committed both by original researchers and by replicating researchers (although replication studies seem to be held to a higher standard by journal editors and reviewers than are original studies).
Many steps have been suggested to improve the dire situation that the six P’s (and more) have led to. At the places most relevant for science, however, such as journals and universities, these measures are often ignored or adopted only grudgingly, so there remains a long way to go.