TXST researchers participate in large-scale collaboration, release new findings on research credibility

TXST's Angela Jones, Ph.D., and Sean Patrick Roche, Ph.D., both contributed to the article “Investigating the replicability of the social and behavioral sciences” published in the journal Nature.

[Photo: Angela Jones, Ph.D.]

Findings from the Systematizing Confidence in Open Research and Evidence (SCORE) program—a collaborative effort involving 865 researchers, including two Texas State University faculty members—have been published in Nature as a collection of three papers, alongside the release of five additional preprints.

Funded by the U.S. Defense Advanced Research Projects Agency (DARPA), SCORE examined multiple dimensions of research repeatability, including reproducibility, robustness and replicability, and assessed how well human experts and machine-based methods can predict whether findings will replicate.

Angela Jones, Ph.D., and Sean Patrick Roche, Ph.D., both associate professors in the TXST School of Criminal Justice and Criminology, contributed to the article “Investigating the replicability of the social and behavioral sciences,” with lead author Andrew Tyner of the Center for Open Science in Charlottesville, Va. Roche and Jones conducted one of the many replications featured in the full dataset and also reviewed and provided comments on the complete manuscript.

[Photo: Sean Patrick Roche, Ph.D.]

“Science only works if findings hold up when someone else tests them,” Roche said. “This study gives us the clearest picture yet of where the social and behavioral sciences stand on that front — and shows us that the challenge of replicability isn't confined to any one field. We’re proud to be a part of the kind of honest self-examination that makes science stronger.”

The program’s outcomes will help strengthen how research is interpreted and communicated, supporting authors, reviewers, funders, policymakers and readers in understanding and using research evidence. Improving credibility assessment will help focus attention and resources on the areas of research that can most effectively accelerate the production of knowledge and solutions.

SCORE was coordinated by the Center for Open Science (COS). Human expert assessments were conducted by two independent teams, repliCATS and Replication Markets. Three teams led by researchers at Pennsylvania State University, Two Six Technologies and the University of Southern California implemented machine-learning and algorithmic approaches to predicting replicability, and the Metascience Lab at Eötvös Loránd University coordinated robustness assessments.

Across the program, SCORE sampled claims from 3,900 papers published between 2009 and 2018 in 62 journals spanning criminology, economics, education, health, management, psychology, political science, sociology and other fields. 

To support consistent interpretation of these results, SCORE also released a short preprint that explains the standardized terminology used in the program. In this research, “reproducibility” refers to re-running the same analysis on the same data; “robustness” tests whether conclusions hold under reasonable alternative analyses of the same data; and “replicability” refers to whether findings hold up when tested with new data. 

Key findings include: 

  • About half of tested findings replicated, consistent with prior large-scale replication efforts: Of 164 papers subjected to replication attempts, 49% replicated by the most common criterion (statistical significance with the same pattern as the original study). 
  • Field differences were limited: No social and behavioral science field showed consistently higher repeatability overall, although variation in data availability and sharing practices contributed to observed differences in reproducibility across fields. 
  • Humans forecast replication reasonably well; tested machine methods did not: By the best-performing metric, the two human-assessment teams achieved success rates of 76% and 78%, respectively. Three distinct machine-based methods were tested, and none proved effective at predicting which claims would replicate.
  • Different analyses led to different results: 72% of reproduction tests reproduced at least approximately, and 53% reproduced precisely. In robustness testing of 100 papers, however, 34% of independent re-analyses matched the original within a narrow tolerance (±0.05 Cohen’s d units), and 57% matched within a wider tolerance.
  • There is no single indicator of credibility: The papers emphasize that credibility assessments are diverse and that no single measure captures credibility. Repeatability is just one type of credibility indicator, and outcomes varied substantially across the different repeatability assessments: reproducibility, robustness and replicability.

“The main message of SCORE is a simple one: research is hard. And, in some ways, the hard work begins after making a discovery,” said Tim Errington, senior director of Research at COS and one of the SCORE project leaders. 

“A tremendous amount of work is needed to verify and have enough confidence in new discoveries to build foundations for further discovery.”

SCORE’s analyses focus on papers published from 2009 to 2018. Since that period, the social and behavioral sciences have continued to evolve their policies and practices, including strengthened journal requirements for sharing data and code and, in some cases, the incorporation of reproducibility checks into publication processes. The collection situates SCORE’s results within that broader trajectory of ongoing reform.

In addition to the papers, SCORE has released openly accessible datasets, algorithms, and replication and reanalysis materials to support further research on scientific credibility. 

For more information, visit the Nature website.

For more information, contact:

TXST Office of Media Relations, 512-245-2180