### A modest proposal for parapsych research

Nov. 14th, 2010 01:55 pm **lederhosen**

Once upon a time, the usual explanation for parapsychology "findings" was bad experimental technique (subjects were getting cues as to the correct answer, that sort of thing). Sometimes due to deliberate dishonesty, often because experimental design is hard.

These days, experimental technique has tightened up a lot, due largely to sceptics and partly to those parapsych researchers who think they've got something and want to be taken seriously.

This has established that if psi effects exist, they're *weak*\*. If we had psychics who could predict the outcome of a coin-toss with 100% accuracy - or even 55% - one of them would long ago have claimed James Randi's money. To detect weak effects, you need to run very large experiments containing thousands or millions of trials, and perhaps apply sophisticated statistical techniques to analyse the data.

The problem with this is that with a large data set and a lot of choices about how to perform your analysis, it's awfully easy to cherry-pick for significance. So, my suggestion is:

Every time a human subject generates a data point (makes a prediction, etc etc), we should use a random number generator to generate, say, a thousand fake versions of the same data point, all consistent with the null hypothesis. From these, we form one real data set, and a thousand fakes.

Stats analysis is then performed blind - the statistician decides on appropriate analysis techniques without knowing which is the real data set. The decision on journal publication is also made blind; only after the article is irrevocably committed to publication does anybody get to find out whether the 'significant' data set identified is the real one.
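As a rough illustration of the procedure, here is a minimal Python sketch. It assumes the simplest possible case, which the post does not specify: every trial is a binary hit or miss with a 50% chance of a hit under the null hypothesis. The names `build_blinded_pool`, `pick_most_extreme` and `hit_rate` are made up for this example, and the "real" data is itself simulated at chance level as a stand-in for a subject's actual guesses.

```python
import random

def build_blinded_pool(real, n_decoys=1000, p_null=0.5, seed=None):
    """Mix the real data set with null-hypothesis decoys in a shuffled pool.

    Returns (blinded, key); key is the position of the real data set and
    must be kept away from the analyst until publication is committed.
    """
    rng = random.Random(seed)  # in real use the seed must stay secret too
    # One fake version of every real data point, for each decoy data set.
    decoys = [[int(rng.random() < p_null) for _ in real]
              for _ in range(n_decoys)]
    pool = decoys + [list(real)]
    order = list(range(len(pool)))
    rng.shuffle(order)
    blinded = [pool[i] for i in order]
    key = order.index(n_decoys)  # where the real data set ended up
    return blinded, key

def hit_rate(trials):
    """The statistic the blinded analyst chose: fraction of hits."""
    return sum(trials) / len(trials)

def pick_most_extreme(blinded, statistic):
    """Index of the data set the analyst would flag as 'significant'."""
    scores = [statistic(d) for d in blinded]
    return max(range(len(blinded)), key=scores.__getitem__)

# Stand-in for a subject's guesses: 500 trials simulated at chance level.
demo_rng = random.Random(42)
real = [int(demo_rng.random() < 0.5) for _ in range(500)]

blinded, key = build_blinded_pool(real, n_decoys=1000, seed=7)
chosen = pick_most_extreme(blinded, hit_rate)

# Unblinding happens only after the decision to publish has been made.
print("analyst's pick is the real data set:", chosen == key)
```

With no real effect, the analyst's pick should coincide with the real data set only about one time in 1001, however the analysis was tuned. Real data with structure beyond independent coin flips would need a more careful decoy generator, as the next paragraph notes.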

This wouldn't fix all problems with parapsych research (and depending on the nature of the data, it might take some work to generate fake data that can pass as the real thing) but I think it would be useful in many cases. This needn't be restricted to parapsych; I'm pretty sure there are other fields that would benefit from this too.

\*I'm not inclined to dignify the "we have strong psi powers that completely vanish under fraud-proof conditions" argument with a response.

## no subject

Date: 2010-11-14 04:54 am (UTC) winterkoninkje

> To detect weak effects, you need to run very large experiments containing thousands or millions of trials, and perhaps apply sophisticated statistical techniques to analyse the data.

I think, fundamentally, that's the problem. Ignoring parapsych for the moment, it has become clear to me over the last few years that our current methods of detecting significance cannot handle truly large data sets. This shows up in NLP research where, because we use million-word corpora for training and test on thousands of sentences, almost every difference in measurements will count as "significant". In reality, however, those "significant" differences (even when the effect size is something notable like +0.5~1% absolute accuracy, against a 93~95% baseline) can be easily overwhelmed by minor changes to modeling parameters or by switching to a different testing corpus (even in the same domain!). So most folks in NLP and similar machine learning fields don't even bother with measuring significance, because the best it can do is confirm when negligible differences are insignificant.
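To put rough numbers on this, here is a small Python sketch. It assumes a plain two-sided z-test of an observed hit rate against a chance rate of 50%, with a fixed observed rate of 50.1%; both figures and the function name `two_sided_p` are chosen purely for illustration and come from neither the post nor any NLP experiment.

```python
import math

def two_sided_p(hit_rate, n, p_null=0.5):
    """Two-sided p-value for an observed hit rate over n trials,
    using the normal approximation to the binomial."""
    se = math.sqrt(p_null * (1 - p_null) / n)  # standard error under the null
    z = (hit_rate - p_null) / se
    return math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))

# The same tiny excess over chance, at three different sample sizes.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}: p = {two_sided_p(0.501, n):.3g}")
```

The effect size never changes, yet the p-value goes from unremarkable at ten thousand trials to vanishingly small at a hundred million. At that scale any systematic bias of a tenth of a percentage point, whatever its cause, will register as "significant".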

The problem is that enough data can make almost every difference appear significant. So if psi is going to be discovered (or NLP/ML is going to be *scientific* about its research) we need some folks to figure out new stats methods which don't break under giga-scale sampling. I'm no statistician, but I'd guess that such solutions would have to consider something like the distribution over the "local" distributions represented in the data; something like gathering statistics not about the data but rather about the results of a continuous version of k-fold cross validation over the data. Because large data are always going to be influenced by a number of factors, and we need to filter out the local effects of such influences before we can hope to find anything out about a weak global influence.

## no subject

Date: 2010-11-14 08:50 am (UTC) lederhosen

The problem in a vast number of fields is that people assume they can just use a standard prepackaged statistical test without having to think about what assumptions it might be making, and that just isn't so.

I am not a great fan of significance testing; while it certainly has its points, I think a lot of its appeal is that it can be used without having to think about certain issues that one probably should think about, e.g. "what are your priors?" The word 'significance' itself is also problematic, because its meaning is so hugely different from the everyday meaning.

Somewhat different topic, but Anscombe's quartet neatly illustrates the pitfalls of relying on a couple of simple numeric measures without actually looking at the data.
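For anyone who hasn't seen it, the quartet is small enough to check by hand. The Python sketch below uses the standard published values of Anscombe's four data sets and computes the mean of x, the mean of y and the Pearson correlation for each; the helper name `pearson` is just for this example.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / (sxx * syy) ** 0.5

# Data sets I-III share the same x values; IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {name: (mean(xs), mean(ys), pearson(xs, ys))
           for name, (xs, ys) in quartet.items()}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four summaries agree to about two decimal places, but a scatter plot of each set shows a noisy line, a smooth curve, a straight line with one outlier, and a vertical stack with one extreme point.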

## no subject

Date: 2010-11-14 08:39 pm (UTC) winterkoninkje

> The problem in a vast number of fields is that people assume they can just use a standard prepackaged statistical test without having to think about what assumptions it might be making, and that just isn't so.

Definitely. That's certainly a major problem in psychology from what I've seen. A lot of NLP folks are really good with their stats, so they know better, but the problem is that assumptions like IID are so pervasive in statistical theory that sometimes practitioners make assumptions because that's where the math is. Looking for your keys where the light is good instead of where you lost them, and all that.