Without knowing anything about NLP, the testing you're talking about sounds like something where the 'independence of errors' assumption mightn't hold up, which would definitely be a problem for large sample sizes.
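Without claiming this matches the testing setup in question, here's a minimal stdlib-Python sketch of why non-independent errors are a problem: it simulates autocorrelated data (an AR(1) process, chosen here purely for illustration) and compares the naive standard error of the mean, which assumes independence, against how much the sample mean actually varies across replications.

```python
import math
import random
import statistics

def ar1_sample(n, rho, rng):
    """Draw n points from a zero-mean AR(1) process with autocorrelation rho."""
    x = rng.gauss(0.0, 1.0)  # start at the stationary distribution
    scale = math.sqrt(1.0 - rho * rho)  # keeps the marginal variance at 1
    out = []
    for _ in range(n):
        out.append(x)
        x = rho * x + scale * rng.gauss(0.0, 1.0)
    return out

rng = random.Random(0)
n, rho, reps = 200, 0.7, 500

means, naive_ses = [], []
for _ in range(reps):
    xs = ar1_sample(n, rho, rng)
    means.append(statistics.fmean(xs))
    # The textbook formula s / sqrt(n) silently assumes independent errors:
    naive_ses.append(statistics.stdev(xs) / math.sqrt(n))

empirical_se = statistics.stdev(means)      # how much the mean really varies
avg_naive_se = statistics.fmean(naive_ses)  # what the i.i.d. formula claims
print(f"naive SE (assumes independence): {avg_naive_se:.3f}")
print(f"actual SE of the mean:           {empirical_se:.3f}")
```

With positive autocorrelation the naive standard error comes out far too small, so a standard test will report "significant" differences much more often than it should, and a larger sample only makes the test more confidently wrong, not less.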
The problem in a vast number of fields is that people assume they can just use a standard prepackaged statistical test without having to think about what assumptions it might be making, and that just isn't so.
I am not a great fan of significance testing; while it certainly has its points, I think a lot of its appeal is that it can be used without having to think about certain issues that one probably should think about, e.g. "what are your priors?" The word 'significance' itself is also problematic, because its statistical meaning (roughly, 'unlikely to have arisen by chance under the null hypothesis') is so hugely different from its everyday meaning of 'important': with a big enough sample, a tiny and practically irrelevant effect can be highly significant.
Somewhat different topic, but Anscombe's quartet neatly illustrates the pitfalls of relying on a couple of simple numeric measures without actually looking at the data.
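To make that concrete, here's a short stdlib-Python check using the actual published quartet values: all four datasets come out with essentially identical means, variances, and correlations, even though plotting them reveals four completely different shapes (a linear trend, a curve, an outlier-driven fit, and a vertical cluster with one leverage point).

```python
import math
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no third-party dependencies."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Anscombe's quartet (Anscombe, 1973); datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    print(f"{name:>3}: mean(x)={statistics.fmean(xs):.2f}  "
          f"mean(y)={statistics.fmean(ys):.2f}  "
          f"var(y)={statistics.variance(ys):.2f}  "
          f"r={pearson(xs, ys):.3f}")
```

Every row prints mean(x) = 9.00, mean(y) ≈ 7.50, var(y) ≈ 4.12, and r ≈ 0.82, which is exactly the point: the summary numbers cannot distinguish the four datasets, but a scatterplot does so instantly.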
Date: 2010-11-14 08:50 am (UTC)