Retraction Watch on Förster (Blog)

Retraction Watch has two pieces out on alleged data irregularities in work published by the social psychologist Jens Förster.

They cite an unnamed (!) report (LOWI? UvA?) dated 2012:

These papers report 40 experiments involving a total of 2284 participants (2242 of which were undergraduates). We apply an F test based on descriptive statistics to test for linearity of means across three levels of the experimental design. Results show that in the vast majority of the 42 independent samples so analyzed, means are unusually close to a linear trend. Combined left-tailed probabilities are 0.000000008, 0.0000004, and 0.000000006, for the three papers, respectively.
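
To make the quoted method concrete, here is a minimal sketch of such a linearity test, assuming the standard contrast-based approach; it is illustrative only, not the report’s actual code, and the function names and numbers are mine. The idea: a tiny left-tailed p-value says the three group means sit closer to a straight line than honest sampling noise should allow, and Fisher’s method combines such p-values across independent samples into the kind of vanishingly small numbers quoted above.

```python
# Sketch of a "too linear" test for one three-group study (illustrative
# only; the report's actual computation may differ). Requires numpy/scipy.
import numpy as np
from scipy import stats

def left_tail_linearity_p(means, sds, ns):
    """Left-tailed p-value for nonlinearity across three ordered groups.

    means, sds, ns: per-group sample mean, SD, and size (low/control/high).
    Small values mean the observed means are closer to a straight line
    than chance fluctuation would predict.
    """
    m_low, m_mid, m_high = means
    N = sum(ns)
    # The quadratic contrast (1, -2, 1) captures all departure from
    # linearity for three equally spaced levels.
    contrast = m_low - 2 * m_mid + m_high
    ss_nonlin = contrast**2 / (1 / ns[0] + 4 / ns[1] + 1 / ns[2])
    # Pooled within-group mean square, reconstructed from reported SDs.
    ms_within = sum((n - 1) * sd**2 for n, sd in zip(ns, sds)) / (N - 3)
    # LEFT tail of F(1, N-3) -- the opposite tail from an ordinary ANOVA.
    return stats.f.cdf(ss_nonlin / ms_within, 1, N - 3)

def fisher_combined_p(pvals):
    """Fisher's method: combine independent p-values into a single one."""
    chi2_stat = -2 * np.sum(np.log(pvals))
    return stats.chi2.sf(chi2_stat, 2 * len(pvals))

# Made-up numbers, purely for illustration:
p = left_tail_linearity_p(means=(2.0, 3.0, 4.0), sds=(1.2, 1.1, 1.3), ns=(20, 20, 20))
```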

While at [Jacobs University](http://www.jacobs-university.de), I took two or three classes with Dr. Förster, and he was a great teacher. I don’t know social psychology or this research in particular, and I have insufficient training in statistics, but I’ve read these reports and comments with great interest.

It strikes me that something is off about the tone, and that the scientific community is conflating two standards that may apply here:

  1. A (quasi-criminal) prosecution of potential wrongdoing, concerned with assigning guilt and issuing consequences.
  2. An investigation into, and debate over, the reproducibility, statistical rigor, etc. of the research, concerned with finding truth and communicating doubt.

For 1., we have well-established standards and procedures, epitomized by a criminal proceeding. The powers wielded under this kind of proceeding are quite severe, including punishment but also – clearly the case here – irreparable damage to personal reputation. Because (quasi-criminal) proceedings are such a powerful instrument, the rule of law has greatly curtailed their use, and rightfully so. For example, such proceedings happen in well-defined courts – not on random websites and comment threads. Under such proceedings, “defendants” also have quite extensive rights, and these rights must be balanced against other social goods. The burden of proof is always on the prosecution, and until guilt is proven beyond reasonable doubt, we presume innocence.

None of this is the case here, and Dr. Förster may be right to complain:

I do feel like the victim of an incredible witch hunt directed at psychologists after the Stapel-affair.

For 2., however, the standards and procedures are quite different. The powers over individuals under this process are very limited; there is nothing (or ought to be nothing) dishonorable about being wrong, about your hypotheses being disconfirmed, or about your findings turning out to be irreproducible. In fact, science might progress best if the individual consequences are strictly limited and everyone is free to tinker. Because the powers wielded here are so small and the ultimate subject is the research, not the person, we can afford a much harsher standard. Science needs little gatekeeping, and no singular, regulated arena of adjudication; research should be open for everyone to reproduce and investigate. Of course, transparency and openness also mean that everyone signs their work by name, which has not been the case here. Crucially, the burden of proof also shifts. Under (positivist) norms, at least, a hypothesis is assumed false unless confirmed. That’s how the null-hypothesis convention operates: unless you can show that a result is not a fluke, it is assumed to be a fluke.

Under this standard, if and to the extent that the (anonymous) report’s findings hold up, there seem to be some serious questions about the published work by Förster. According to the report, among other alleged irregularities, the reported effects are too linear to be plausible; that is, the effect (over the three levels of the independent variable) is not only present, it is also near-perfectly linear.

Richard Gill notes:

there is no reason from psychology why the population mean score of control subjects should be precisely half way between population mean scores of “high” treated subjects and “low” treated subjects. In study after study. There is every reason from statistics why, even if this were true in the population, it won’t be anywhere close to true in small samples. And if it is not even true in the population, it is even less likely to be close to true in small samples.
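
A quick simulation (a sketch with invented numbers, not anyone’s actual data) makes Gill’s point concrete: even when the population means sit exactly on a line, small-sample means wander, so “unusually linear” samples should appear only at the chance rate of whatever cutoff you pick – never in study after study.

```python
# Simulation sketch of Gill's argument; all numbers are invented. With
# POPULATION means exactly on a line (2, 3, 4) and groups of 20, the
# left-tailed linearity p-value is uniform, so only about 5% of samples
# look "too linear" at the 0.05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, hits = 20, 10_000, 0
for _ in range(reps):
    groups = [rng.normal(mu, 1.0, n) for mu in (2.0, 3.0, 4.0)]
    m = [g.mean() for g in groups]
    # Quadratic-contrast F test for nonlinearity, as sketched above.
    ss_nonlin = (m[0] - 2 * m[1] + m[2]) ** 2 / (6 / n)
    ms_within = np.mean([g.var(ddof=1) for g in groups])
    p_left = stats.f.cdf(ss_nonlin / ms_within, 1, 3 * n - 3)
    hits += p_left < 0.05
print(hits / reps)  # about 0.05: "too linear" samples occur at chance rate
```

And if the population means are not exactly linear, as Gill adds, near-linear sample means become rarer still.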

If this argument holds water, then under standard 2., it doesn’t matter why these irregularities are observed. Science, again, is not concerned with assigning blame. But as a finding, absent another explanation, the observed pattern would seem to be very, very unlikely.

It’s the inverse of non-significance: the data are too un-random to be credible, and maybe a similar logic should apply. Considering the odds of this result – if the report has it right – the published data do not in fact confirm the stated hypotheses, because those hypotheses would imply some non-linearity or, at the very least, noise.

It is insufficient to note, as the original (UvA?) report cited by Förster does, that

it is always possible (according to the reviewer) that we will understand odd patterns in psychology at a later point in time

It would be only a slight exaggeration, then, to count non-significant findings as confirmation, on the rationale that some later theory might explain away the non-significance.

To sum up:

  • Has there been wrongdoing? – That’s not a proper question for science under standard 2.
  • Do the results hold up to rigorous standards? – Yeah, maybe; they just report a different finding than the one intended. They allegedly report a yet-unexplained anomaly.
  • Should the paper be retracted? – Probably not; it should be edited instead. We should publish non-findings and unexplained anomalies.

So where does this leave science?

I think we should strengthen our procedures under 2. (reproducibility etc.) and keep far, far away from the stuff of 1. (prosecution).

This might imply:

  • greatly improved transparency and openness. In the age of GitHub, what reason is there for data, analyses, and working materials not to be completely out in the open, for everyone to vet, and for decades to come?
  • as a corollary, maybe the old, closed-science mode of peer-review-once-then-publish should be scrapped for a more continuous, dynamic mode of peer review (again, GitHub for science comes to mind)
  • a greater appreciation of non-findings and anomalies; those are fit for publishing, too. Maybe someone with results like Dr. Förster’s ought not to have to argue that his hypotheses are confirmed in order to get published.
  • a greater appreciation for deeply skeptical people who double- and triple-check other people’s work (in other words, party poopers)
  • not mixing up 1. (prosecution) and 2. (reproducibility etc.). In the search for truth, we don’t suspect intent, question character, or invoke consequences. Those things are someone else’s job, and subject to much more demanding standards.