(Meet Rasmus Pianowski, a 17-year-old kid from Hamburg, Germany, obsessed with American politics. Apparently, he followed Nate Silver's 538.com pretty closely, and has been inspired to try out his own election prediction models. He's done well enough to earn himself a research intern position at Pollster.com. He's volunteered for a lot of Democratic statewide candidates -- Al Franken, for one -- and recently was inspired by Tyler Gernant's candidacy to volunteer for him...Enjoy! - promoted by Jay Stevens)
This is going to be a fairly technical post, so don't say I didn't warn you.
I'm going to liveblog the incoming results of the Democratic Congressional Primary tomorrow, and I likely won't have time to explain what I'm doing then, so I'll simply do it now.
I'm going to employ two separate statistical methods to extrapolate incoming county returns to the rest of the state, so that we'll hopefully know roughly where the race is going to end up once 10 or 15 counties have reported.
The first one is a fairly straightforward ordinary least squares multiple regression analysis.
In layman's terms, that means the results of the counties that have already reported are broken down and analyzed based on up to 16 different socio-economic and political variables. The goal is to explain what drives the election results, which is fairly intuitive. For example, in the 2008 general election in Alabama, it's obvious that the racial make-up of each county is a good predictor of the McCain-Obama result.
Now, you need to quantify that, and you also need to expand it to more than just one variable, and that's what this OLS multiple linear regression does.
Basically, we'll end up with a formula like this (I'm making this up):
GernantVote% = 0.00004 * MedianHouseholdIncome - 1.2 * PercentageWithoutHealthInsurance + 0.8 * GoreVoteshare + 20
Then, we just need to plug in the MHI, percentage w/o health insurance and Gore's voteshare for each MT county, and we have an estimate for each county. I'm weighting them by the percentage that the counties historically contribute to Democratic Primary electorates, and then we have an estimate for the whole primary. This is a fairly robust method.
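For readers who want to see the mechanics, here is a minimal sketch of that fit-then-plug-in step in Python with NumPy. Everything in it is an assumption made for illustration: the county numbers are invented, the helper names (fit_ols, predict, statewide_estimate) are hypothetical, and the "results" are generated from the made-up formula above so the fit has something to recover. It is not the actual model or data used for the liveblog.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients, intercept last."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Plug each county's variables into the fitted formula."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def statewide_estimate(county_pred, weights):
    """Average county estimates, weighted by each county's share of the electorate."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(county_pred, w) / w.sum())

# Invented data for six "reported" counties:
# columns = median household income, % without health insurance, Gore voteshare.
reported_X = np.array([
    [32000, 22, 35],
    [41000, 17, 42],
    [36500, 19, 31],
    [45000, 14, 47],
    [38000, 21, 39],
    [29500, 24, 33],
], dtype=float)

# Fake "results" built from the made-up formula in the text (not real returns).
made_up_coef = np.array([0.00004, -1.2, 0.8, 20.0])
reported_y = np.column_stack([reported_X, np.ones(len(reported_X))]) @ made_up_coef

coef = fit_ols(reported_X, reported_y)

# Two hypothetical counties that have not reported yet, with invented weights.
unreported_X = np.array([[34000, 20, 37], [43000, 15, 44]], dtype=float)
county_pred = predict(coef, unreported_X)
estimate = statewide_estimate(county_pred, [0.03, 0.07])
```

With real returns the fit won't be exact, of course, and with only a handful of counties reporting you can't sensibly use all 16 variables at once, since you need more reported counties than coefficients.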
The second one is a so-called k-nearest-neighbor analysis.
Basically, I'm measuring how similar Montana's counties are to each other after standardizing their variables. If two counties have roughly similar values on a number of variables, their similarity score will be very low; if they're very different, it will be high. (So the score really works like a distance: lower means more alike.)
The 'twin counties', the two most similar to each other, are Sanders and Mineral at 0.69, while the two most different ones are Gallatin and Roosevelt at 5.08.
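One common way to get a score like this is to convert every variable to z-scores and take the Euclidean distance between counties. The post doesn't spell out the exact formula behind the numbers above, so treat the following as a sketch under that assumption; the function name county_distances is hypothetical.

```python
import numpy as np

def county_distances(features):
    """Pairwise Euclidean distance between counties on standardized variables.

    features: one row per county, one column per variable.
    Assumes every variable actually varies across counties (non-zero std).
    """
    X = np.asarray(features, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    diff = z[:, None, :] - z[None, :, :]       # all county-pair differences
    return np.sqrt((diff ** 2).sum(axis=-1))   # lower = more similar
```

Standardizing first matters because otherwise a variable measured in dollars would swamp one measured in percentage points.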
Then, for each county that hasn't reported yet, I'm taking the three most similar counties out of those that have reported and averaging their results.
When those results are again weighted by turnout, we have a second estimate of where the race is going to end up.
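The fill-in-and-weight step can be sketched roughly like this. It assumes you already have a matrix of pairwise county scores (lower = more similar) and a dict of results for the counties that have reported; the names knn_fill and weighted_total are hypothetical, and this is an illustration rather than the exact code behind the liveblog.

```python
import numpy as np

def knn_fill(dist, reported, k=3):
    """Estimate every county's result from its k most similar reported counties.

    dist:     (n, n) array of pairwise scores, lower = more similar.
    reported: dict mapping county index -> actual result (e.g. vote share).
    Counties that have reported keep their actual result.
    """
    dist = np.asarray(dist, dtype=float)
    rep_idx = np.array(sorted(reported))
    rep_val = np.array([reported[i] for i in rep_idx], dtype=float)
    est = np.empty(dist.shape[0])
    for i in range(dist.shape[0]):
        if i in reported:
            est[i] = reported[i]
        else:
            nearest = np.argsort(dist[i, rep_idx])[:k]  # k closest reported counties
            est[i] = rep_val[nearest].mean()
    return est

def weighted_total(est, weights):
    """Combine county estimates, weighted by expected turnout share."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(est, w) / w.sum())
```

If fewer than three counties have reported, the sketch simply averages whatever is available, which is one more reason not to trust the earliest numbers too much.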
This should work fairly well when the counties come in in an order that's independent of the voting results. If, for example, McDonald's strongholds all come in early and Tyler's late, it wouldn't do well, because it could only choose from McDonald counties when looking for similar counties.
If both approaches show roughly the same result, we can be relatively sure that it's going to work out that way.
I'm already very much looking forward to tomorrow.