[personal profile] lederhosen
Once upon a time, the usual explanation for parapsychology "findings" was bad experimental technique (subjects getting cues to the correct answer, that sort of thing) - sometimes due to deliberate dishonesty, more often because experimental design is hard.

These days, experimental technique has tightened up a lot, due largely to sceptics and partly to those parapsych researchers who think they've got something and want to be taken seriously.

This has established that if psi effects exist, they're weak*. If we had psychics who could predict the outcome of a coin-toss with 100% accuracy - or even 55% - one of them would long ago have claimed James Randi's money. To detect weak effects, you need to run very large experiments containing thousands or millions of trials, and perhaps apply sophisticated statistical techniques to analyse the data.

The problem with this is that with a large data set and a lot of choices about how to perform your analysis, it's awfully easy to cherry-pick for significance. So, my suggestion is:

Every time a human subject generates a data point (makes a prediction, and so on), we should use a random number generator to generate, say, a thousand fake versions of the same data point, all consistent with the null hypothesis. From these we form one real data set and a thousand fakes.

Stats analysis is then performed blind - the statistician decides on appropriate analysis techniques without knowing which is the real data set. The decision on journal publication is also made blind; only after the article is irrevocably committed to publication does anybody get to find out whether the 'significant' data set identified is the real one.
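As a rough sketch of how the decoy generation and blinding might work for a simple hit/miss experiment (the function name, the 0/1 encoding, and the escrowed index are my own illustrative choices, not anything specified in the post):

```python
import random

def make_blinded_datasets(real_data, n_fakes=1000, p_null=0.5, seed=None):
    """Hide one real dataset among n_fakes decoys drawn under the null.

    real_data : list of 0/1 trial outcomes (1 = subject was correct).
    p_null    : success probability under the null hypothesis.
    Returns (datasets, real_index); real_index would be sealed by a
    third party and revealed only after publication is committed.
    """
    rng = random.Random(seed)
    fakes = [[1 if rng.random() < p_null else 0 for _ in real_data]
             for _ in range(n_fakes)]
    all_sets = fakes + [list(real_data)]
    order = list(range(n_fakes + 1))
    rng.shuffle(order)                 # hide which slot holds the real data
    datasets = [all_sets[i] for i in order]
    real_index = order.index(n_fakes)  # escrowed, never shown to the analyst
    return datasets, real_index
```

The statistician then receives only `datasets`, in shuffled order, and must commit to an analysis before `real_index` is unsealed.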

This wouldn't fix all problems with parapsych research (and depending on the nature of the data, it might take some work to generate fake data that can pass as the real thing) but I think it would be useful in many cases. This needn't be restricted to parapsych; I'm pretty sure there are other fields that would benefit from this too.

*I'm not inclined to dignify the "we have strong psi powers that completely vanish under fraud-proof conditions" argument with a response.

Date: 2010-11-14 04:54 am (UTC)
From: [personal profile] winterkoninkje
To detect weak effects, you need to run very large experiments containing thousands or millions of trials, and perhaps apply sophisticated statistical techniques to analyse the data.

I think, fundamentally, that's the problem. Ignoring parapsych for the moment, it has become clear to me over the last few years that our current methods of detecting significance cannot handle truly large data sets. This shows up in NLP research where, because we use million-word corpora for training and test on thousands of sentences, almost every difference in measurements will count as "significant". In reality, however, those "significant" differences (even when the effect size is something notable like +0.5~1% absolute accuracy, against a 93~95% baseline) can be easily overwhelmed by minor changes to modeling parameters or by switching to a different testing corpus (even in the same domain!). So most folks in NLP and similar machine learning fields don't even bother with measuring significance, because the best it can do is confirm when negligible differences are insignificant.
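To put toy numbers on that point (my own illustration, not drawn from any NLP experiment): a plain normal-approximation test on a 50.5% hit rate is nowhere near significant at a thousand trials, but astronomically "significant" at a million, even though the effect is identical.

```python
import math

def two_sided_p(successes, n, p0=0.5):
    """Normal-approximation z-test of an observed proportion against p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)     # standard error under the null
    z = (p_hat - p0) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# The same 50.5% hit rate at two sample sizes:
print(two_sided_p(505, 1_000))           # large p - not significant
print(two_sided_p(505_000, 1_000_000))   # vanishingly small p
```

Nothing about the underlying effect changed between the two calls; only the sample size did.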

The problem is that enough data can make almost every difference appear significant. So if psi is going to be discovered (or NLP/ML is going to be scientific about its research) we need some folks to figure out new stats methods which don't break under giga-scale sampling. I'm no statistician, but I'd guess that such solutions would have to consider something like the distribution over the "local" distributions represented in the data; something like gathering statistics not about the data but rather about the results of a continuous version of k-fold cross validation over the data. Because large data are always going to be influenced by a number of factors, and we need to filter out the local effects of such influences before we can hope to find anything out about a weak global influence.
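One crude way to realise "statistics about the results of repeated sub-sampling" (my own sketch of the idea above, not an established method): measure the effect on many random subsets of the test data, then compare the mean difference between two systems to its spread across subsets. A whole-set difference that is small relative to that spread is fragile, however "significant" the single global test says it is.

```python
import random
import statistics

def effect_stability(scores_a, scores_b, n_splits=200, frac=0.5, seed=0):
    """Resample paired per-item scores and report the measured
    difference's mean and spread across resamples.

    scores_a, scores_b : per-item scores for two systems on one test set.
    Returns (mean difference, stdev of the difference across subsets).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    k = max(1, int(frac * n))
    diffs = []
    for _ in range(n_splits):
        idx = rng.sample(range(n), k)            # one random half of the data
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / k)
    return statistics.mean(diffs), statistics.stdev(diffs)
```

If the stdev is comparable to the mean, switching corpora (or modelling parameters) can plausibly flip the sign of the result.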

Date: 2010-11-14 08:39 pm (UTC)
From: [personal profile] winterkoninkje
Yeah, errors often aren't independent. Also, because the sources are so heterogeneous, we can't ever really assume that we can avoid sampling bias (nor that samples are identically distributed), which is why things are so genre specific and why changing testing corpora can completely alter the outcomes.

The problem in a vast number of fields is that people assume they can just use a standard prepackaged statistical test without having to think about what assumptions it might be making, and that just isn't so.

Definitely. That's certainly a major problem in psychology from what I've seen. A lot of NLP folks are really good with their stats, so they know better, but the problem is that assumptions like IID are so pervasive in statistical theory that sometimes practitioners make assumptions because that's where the math is. Looking for your keys where the light is good instead of where you lost them, and all that.

