**lederhosen**

(Not sure whether to use my politics icon or my maths icon for this one.)

Since a couple of y'all have been discussing this: there has been much mention of the alleged fact that 70% of African American voters in California voted for Prop 8. But how reliable is that number?

First off, there are a couple of different types of error that can affect a poll. One is 'sampling error', which comes from random variation in who you select for your poll. To understand sampling error, think of tossing a coin a hundred times - you would expect to get 50 heads and 50 tails, but because every toss is random and unrelated to the last, you could easily end up with 40-60 or 60-40.
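If you want to see that coin-toss spread for yourself, here's a quick simulation (purely illustrative, with an arbitrary seed so the run is repeatable):

```python
import random

random.seed(0)  # arbitrary seed, just so the run is repeatable

# Toss a fair coin 100 times, in ten separate runs, counting heads each time.
# The counts scatter around 50 - that scatter IS the sampling error.
counts = [sum(random.random() < 0.5 for _ in range(100)) for _ in range(10)]
print(counts)
print("spread:", min(counts), "to", max(counts))
```

Run it a few times with different seeds and you'll routinely see counts drifting into the low 40s and high 50s, even though the coin is perfectly fair.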

You can reduce sampling error by taking a larger sample. As a rule of thumb, if you're sampling people completely at random, independently of one another, your margin of error for something like this is about 1/sqrt(n), where n is the number of people polled. For instance, a lot of political polls work on samples of around a thousand people, which gives them a margin of error of about 1/sqrt(1000), i.e. about 1/30, or ~ 3%. In the end, your sample size is a compromise between accuracy and expense.
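As a sanity check on that rule of thumb, here's a small sketch comparing 1/sqrt(n) against the standard 95% formula for a proportion near 50-50 (1.96 is the usual normal-approximation multiplier):

```python
import math

def margin_of_error(n, p=0.5):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# The 1/sqrt(n) rule of thumb vs. the textbook formula at p = 0.5.
# The rule of thumb is slightly conservative (1/sqrt(n) > 1.96 * 0.5/sqrt(n)).
for n in (1000, 2240):
    print(n, round(1 / math.sqrt(n), 4), round(margin_of_error(n), 4))
```

The n = 2240 row is the CNN California sample size mentioned below; both formulas put its margin at roughly 2%.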

[For stats geeks: this is the 95% confidence interval, and that approximation breaks down if the true underlying distribution is a long way off 50-50, but it's good enough for present purposes.]

The main intent of exit polls is to pick the overall outcome of a vote - will California go to Obama, or McCain? Will Proposition 8 pass or fail? The CNN exit poll for California had a sample size of 2240 people, which will give you about a 2% margin of sampling error (actually a bit more, for reasons covered below). Not perfect, but close enough to be useful.

Now, these polls also collect demographic data that can let you examine how the vote broke down among groups of interest - handy if you want to compare men vs. women, old vs. young, etc. But it's not the primary purpose of the poll, and so the sample size isn't chosen to guarantee a certain level of accuracy in those 'subpopulation' estimates.

(Why, yes, **lederhosen** HAS been working on sample design for subpopulation estimates for most of the last year...)

What this means is that you should not assume the raw number reported there means anything at all, unless you have a margin of error for it.

In this case, the CNN poll notes that African Americans are 10% of the overall sample, i.e. around 224 people. This is not a large number, and it means that the margin of sampling error would be fairly large - around 7%.
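For the curious, that 7% figure is just the same 1/sqrt(n) rule of thumb applied to the subgroup on its own - something like:

```python
import math

total_sample = 2240
share = 0.10                            # African Americans ~ 10% of the sample
subgroup_n = int(total_sample * share)  # ~ 224 people

# The same 1/sqrt(n) rule of thumb, applied to the subgroup alone:
subgroup_margin = 1 / math.sqrt(subgroup_n)
print(subgroup_n, round(subgroup_margin, 3))  # 224 0.067, i.e. ~7 points
```

Cutting the sample to a tenth of its size roughly triples the margin of error (sqrt(10) ~ 3.2), which is why subgroup numbers are so much shakier than the headline figure.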

But as I noted above, this assumes that the people in your sample are selected completely independently of one another. This isn't quite true for CNN's exit polls. As discussed here, what they actually do is select a random sample of *precincts*, and then draw their sample from these precincts. This is common practice among survey organisations (my work included), because it just isn't practical to drive over to one precinct, sample a couple of people, and then go on to the next one. 'Clustering' the sample makes it more cost-effective, but it also means that your selections are no longer independent of one another - if you sample Bob, you're also likely to sample other people from his precinct, who may have similar characteristics.

Calculating how that affects the accuracy of your poll is fiddly, and I doubt CNN/EMR have enough data to do a precise calculation on that. (When I need to do this sort of thing - not for political polling, but things like employment estimates etc. - I have the luxury of access to 20Gb of individual Census records, which makes life easier, but it's still fairly fiddly.) But at a very rough guess, this effect will probably increase that margin to around 10-15%. (Note that the clustering effects estimated for the general population may not translate to a given subpopulation, so the impact on the subgroup estimate could be worse than on the overall result.)
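For the stats geeks again: one common way to ballpark the clustering effect is Kish's 'design effect', 1 + (b - 1) * rho, where b is the average number of respondents per cluster and rho is the intra-cluster correlation. The numbers below are purely hypothetical - I have no idea what the real per-precinct counts or correlations were - but they show how even a modest correlation inflates that 7% margin:

```python
import math

def design_effect(avg_cluster_size, icc):
    """Kish's approximation: how much clustering inflates sampling variance."""
    return 1 + (avg_cluster_size - 1) * icc

# Hypothetical numbers: ~224 subgroup respondents spread over precincts at
# ~15 per precinct, with a modest intra-precinct correlation of 0.05.
base_margin = 1 / math.sqrt(224)            # ~7% under independent sampling
deff = design_effect(15, 0.05)              # variance inflated by 1.7x
clustered_margin = base_margin * math.sqrt(deff)
print(round(clustered_margin, 3))           # 0.087, i.e. nearly 9 points
```

Push the cluster size or correlation a bit higher and you land in the 10-15% range mentioned above - without anyone having done anything wrong in the survey design.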

Throw in the potential for non-sampling error due to systematic biases (e.g. exit polls don't catch absentee voters, some people refuse to answer them, etc etc) and... well, while it's quite possible that African-Americans *were* more favourable to Prop 8 overall, the exit poll is far from clinching evidence for that.