April 29, 2008

Re-Pre-Visiting North Carolina

A few days ago, I did a post on the North Carolina primary that, judging by the comments, went over a lot of people’s heads. I’m going to try to slow this one down a bit, because it really is some fascinating stuff, and because regression is pretty straightforward when your idiot commentator doesn’t speak in jargon.

During that last post, I ran a series of regression analyses that indicated that Obama should get between 50 and 55% of the vote in North Carolina. I found that two factors alone (the percentage AA population and the percentage of college educated voters) worked to explain a massive 70-90% of the difference between Obama’s share of the vote and Hillary’s share of the vote in the South. This was true when I used data from the state level as well as when I used data from the Congressional district level.

Because I’m a big dork, I’ve now run that test at the county level. The results are surprisingly robust. Because I want people to “get” this one, I’m reprinting something I posted earlier here explaining regression analysis and how it works. It’s down at the end, and gives you everything you need in order to “get” what I’m doing.

The New Regression

What I did this time was take the county data for Obama’s performance in every county in the Old Confederacy, excluding FL and AR. I split Edwards’ vote in SC 50-50 between Obama and Clinton. I ran the regression to see how Obama’s performance correlated with the African American percentage in a county and the % of college educated voters.

The analysis had an adjusted r-square of .792. This means roughly that 79% of the difference in Obama’s performance in any two counties in the South is explainable solely on the basis of race and education. Want to know why he got 60% of the vote in Madison county, Alabama, but only 15% of the vote in neighboring Jackson county? Madison has Huntsville, and is 21% black and 34% college educated. Neighboring Jackson is 3% black and 10% college educated. What the r-square of .792 means is that roughly 36 points of the 45-point difference between Madison and Jackson counties are attributable just to the racial and educational differences between the counties. What are the other 9 points attributable to? We don’t know. There’s some variable, or series of variables, out there that will explain it. We just don’t know what it is. And given that voting is made up of thousands of individualized decisions, it is probably impossible to figure out all of the variables at work here.

Again, this is stunning, though not entirely surprising. As might be expected, the two variables are statistically significant, with t-stats of 54 and 27. This means that we are about 99.9999999999999% certain that the relationship between Obama’s percentage and college education we see is not due to random chance, and 99.99999999999999999999999999% certain that the relationship between Obama’s percentage and race is not due to random chance. [And, for the real statistics geeks, the residuals average out to zero. There is some evidence of heteroskedasticity in the variables, but not, I think, enough to affect the significance of the variables at this level].

Finally, the coefficient for race is .83 and the coefficient for college education is .94. This means that for every additional percentage point of African Americans in a particular county, Obama’s vote percentage in the county rises .83 points. And for every additional percentage point of college educated voters in a particular county, Obama’s vote percentage in the county rises .94 points.

North Carolina

What we can then do is take these estimates for the rest of the South, and apply them to North Carolina’s counties. For each county, I multiplied the black percentage times .83 and the college educated percentage times .94, added them together and then added 17.5 (the constant; I can explain in the comments if you want). From that, we should be able to generate a pretty good picture of what North Carolina should look like. Doing so, we get this prediction

UPDATE — Just to clarify, green is Obama and Hillary is blue.  The bluest Hillary color means she is predicted to get 65%+, the greenest Obama color means he is predicted to get 65%+.  The colors change shade in roughly 3% increments:

If, on next Wednesday, the map of North Carolina looks anything like this, you can bet it will be splayed all over the front page here. If, however, it doesn’t, I’ll forget I ever wrote this.
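The per-county arithmetic behind the map is easy to sketch. Below is a minimal Python version using the constant and coefficients quoted above. The function and variable names are mine, and the inputs are the rounded Madison and Jackson county (Alabama) figures from earlier in this post, used only because their numbers are quoted here; the North Carolina county data itself is not reproduced.

```python
# Constant and coefficients as quoted in the post; names are illustrative.
CONSTANT = 17.5
COEF_BLACK = 0.83
COEF_COLLEGE = 0.94


def predicted_obama_share(pct_black, pct_college):
    """Predicted Obama vote share (in points) for one county."""
    return CONSTANT + COEF_BLACK * pct_black + COEF_COLLEGE * pct_college


# Rounded (percent black, percent college educated) figures quoted earlier.
counties = {
    "Madison, AL": (21, 34),
    "Jackson, AL": (3, 10),
}

predictions = {
    name: predicted_obama_share(black, college)
    for name, (black, college) in counties.items()
}

for name, share in predictions.items():
    print(f"{name}: {share:.1f}")
```

Because the inputs are rounded, treat the printed figures as approximate.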

Incidentally, weighting the results for county population, we end up with Hillary losing 45-55%. The interesting thing is that the residuals for the regression of counties across the South (i.e. the difference between whatever the model predicts and the actual results) are evenly distributed between positive and negative for all states except Tennessee. For Tennessee, the model overstates Obama’s performance in almost every county. Is NC like Tennessee? We shall see . . .
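To see why the weighting matters, here is a rough Python sketch of a population-weighted statewide figure. The three counties, their populations, and their predicted shares are entirely hypothetical, chosen only to show how one big county can pull the statewide number away from the simple average of county predictions.

```python
# Hypothetical counties: (name, population, predicted Obama share in points).
counties = [
    ("County A", 900_000, 62.0),
    ("County B", 100_000, 38.0),
    ("County C", 50_000, 41.0),
]


def simple_average(rows):
    """Average of county predictions, every county counted equally."""
    return sum(share for _, _, share in rows) / len(rows)


def population_weighted_average(rows):
    """Average of county predictions, each weighted by its population."""
    total_pop = sum(pop for _, pop, _ in rows)
    return sum(pop * share for _, pop, share in rows) / total_pop


unweighted = simple_average(counties)
weighted = population_weighted_average(counties)
print(f"unweighted: {unweighted:.1f}, weighted: {weighted:.1f}")
```

Weighting by actual primary turnout rather than total population would arguably be closer to the real vote, but population is what I used here.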

My explanation of regression analysis is below the fold.


Regression for Dummies
Basically what OLS (Ordinary Least Squares) Regression does is determine how certain defined variables (independent variables) affect a given outcome (the dependent variable). For example, let’s just say we were trying to determine what the various factors have been in the Democratic Primary so far. Why did Obama win SC, while Hillary won CA? You might postulate that the African American (AA) population, the Hispanic population, and the number of college students are a big part of the story.

Regression allows us to test how changes in those variables affect the dependent variable (your outcome). What you do with regression analysis is open Excel and make a column for each of your independent variables (AA population, Hispanic population, and college students), with a row for each state. Then you enter a column with what it is you are trying to explain, in this case the percentage of the vote Obama received in each state. Excel will spit out a page that tells you some key statistics: your r-square, your coefficient for each independent variable, your t-stat for each independent variable, and the residuals. More on those later.

Regression basically gives you an equation for a line. It gives you the best possible equation for explaining the relationship between the independent variables you have chosen and your dependent variable. This equation is basically simple algebra: it gives you a Y-intercept and a coefficient for each independent variable you have chosen. In other words, let’s say that the OLS model gives you a coefficient of “2” for AA. That tells you that for every 1% of AA population in the Democratic primary, Obama beats Hillary by another two points. Let’s say that Hispanic population is “-1.” That tells you that for every Hispanic percent of the population, Obama loses a point. And if college is .5, he gains a half point for every percentage of college students. There will also be an “intercept,” which you should think of as a baseline. Let’s say the baseline is 10.

If we have identified all of the factors that influence voting, we should be able to go back and plug the variables in for each state and perfectly explain the outcome. If State A had 10% AA population, 20% Hisp. and 5% student, that should mean, if our variables are “right,” that Obama won the state by 12.5 points: the intercept of 10, plus (2*10) = 20 for the AA population, plus (-1*20) = -20 for Hispanics, plus (.5*5) = 2.5 for students.
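The State A arithmetic can be written out in a few lines of Python. This is only a sketch of the made-up example above: the coefficients are the invented ones (2, -1, .5, baseline 10), not fitted values, and the names are mine.

```python
# Made-up intercept and coefficients from the example above, not fitted values.
INTERCEPT = 10
COEFFICIENTS = {"aa": 2, "hispanic": -1, "student": 0.5}

# State A's percentages from the example.
state_a = {"aa": 10, "hispanic": 20, "student": 5}


def predicted_margin(state):
    """Baseline plus each coefficient times that state's percentage."""
    return INTERCEPT + sum(COEFFICIENTS[k] * state[k] for k in COEFFICIENTS)


margin = predicted_margin(state_a)
print(margin)  # 12.5
```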

Of course, you rarely get all the “right” variables, because life is too complicated. So what regression gives you is an “r-square” which basically tells you how well your independent variables explain the dependent variable. For a “perfect” explanation, your r-square is 1. For a completely wrong explanation, your r-square is 0.
In other words, if I hypothesize that my paycheck is related to the number of hours I work at McDonald’s, I would input my hours worked for each pay period as my independent variable, and my paycheck as a dependent variable. And, voila, my r-square should be one, since there should be a perfect correlation between hours worked and paycheck. In fact, my coefficient will be my hourly wage. If, however, I think that my paycheck is determined by the number of french fries I serve, I will input FF served each pay period as my independent variable, and my eventual paycheck as my dependent variable. My r-square should be close to zero, unless I work on a per-french-fry commission. Now it may not be exactly zero, since busy times of the year may cause me to work more, allowing french fries sold to serve as something of a proxy for the number of hours I work. But you get the idea. (In reality, this is an important thing to keep in mind with regression: it only measures correlation, not causation. So if you find out that A correlates with C, it may not be the case that A is causing C, because A could actually be caused by B, which also causes C. For example, working more hours means you serve more french fries, which means that either hours or FF served can tell you what your paycheck is, even though hours worked is the causal agent.)

Now let’s say I run my hourly regression, and my r-square is .8. Why isn’t it 1.00? What could I possibly be missing??? A-ha, some pay periods I worked overtime, so my paycheck isn’t a simple function of hours x wages. So I can insert a second variable for number of overtime hours worked. My r-square should improve, since I’ve put in a new variable that better explains my line. You may notice that this only explains linear relationships, and you’re right. Life rarely has linear relationships, which is why you almost never see an r-square of one in practice. There are ways of testing r-square’s significance, but for our purposes, just accept r-square as a basic test of how well your variables explain a given outcome.
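The paycheck example can be checked with a small self-contained Python sketch. Everything numeric here is hypothetical (the $8 wage, the $4 overtime premium, the six pay periods), and the little `ols` helper is my own bare-bones solver written for illustration, not what Excel does internally. It fits the paycheck on hours alone, then on hours plus overtime hours, and compares the r-squares.

```python
def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(xr[i] * xr[j] for xr in X) for j in range(k)]
        + [sum(xr[i] * yi for xr, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(rows, y, coefs):
    """Share of the variation in y that the fitted equation accounts for."""
    mean_y = sum(y) / len(y)
    fitted = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


WAGE = 8.0              # hypothetical hourly wage
OVERTIME_PREMIUM = 4.0  # hypothetical extra pay per overtime hour
hours = [60, 72, 80, 65, 78, 70]
overtime = [0, 5, 0, 10, 0, 3]
paychecks = [WAGE * h + OVERTIME_PREMIUM * o for h, o in zip(hours, overtime)]

hours_only = [(h,) for h in hours]
both = list(zip(hours, overtime))

coefs_hours = ols(hours_only, paychecks)
coefs_both = ols(both, paychecks)
r2_hours = r_squared(hours_only, paychecks, coefs_hours)
r2_both = r_squared(both, paychecks, coefs_both)
print(f"hours only: {r2_hours:.3f}, hours + overtime: {r2_both:.3f}")
```

With hours alone the r-square falls short of 1; once overtime hours go in as a second variable, the fit becomes essentially perfect and the two coefficients come back as the wage and the overtime premium.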

One other important thing is the t-stat, which tells you whether a given independent variable has any relation to the dependent variable. For example, if we are trying to predict what factors explain whether someone gets a speeding ticket, we could input the average speed a driver drives on a trip, the driver’s skin color (1 for black, 0 for white), whether he had a sticker saying he gave to cops (1 for yes, 0 for no), and whether he wore white tennis shoes (1 for yes, 0 for no). Our dependent variable would be whether the driver got a ticket on the trip or not. We would get a coefficient for each variable we put in, which tells us how much each additional MPH a driver drove increased the probability of getting a ticket (almost certainly positive, indicating driving faster increased your odds of getting a ticket), how much being black affected things (almost certainly positive), how much having a sticker affected your chances (probably negative, since having a sticker decreases your odds), and a coefficient for white tennis shoes.

The t-stat tells you if the variable is statistically significant, i.e. if we are 95% certain that there is a relationship between the independent variable and the dependent variable. I won’t explain the math, but you want the t-stat to be greater than 2 or less than -2. So here you would probably get a very high t-stat for speed, probably a lower but still significant t-stat for skin color and stickers, and probably a t-stat pretty close to zero for white tennis shoes. The exception, again, is if white tennis shoes correlate well with some other factor that drives the outcome. If we used red Converse shoes as our variable, there shouldn’t be a relationship, except that red Converse shoes might correlate well with “alternative” or (showing my age) “skater” kids, who might be more likely to get a ticket from a cop.
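For the curious, here is a rough Python sketch of where a t-stat comes from: the coefficient divided by its standard error. It is a simplification of the ticket example, and the assumptions are mine: the ten trips are invented data, and each variable is fit on its own in a one-variable regression rather than all together as described above. The 0/1 ticket outcome is fit by plain OLS, as in the example.

```python
import math


def slope_t_stat(x, y):
    """t-stat for the slope in a one-variable OLS fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the slope's standard error (n - 2 d.o.f.).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(ss_res / (n - 2) / sxx)
    return slope / std_err


# Hypothetical data for ten trips (1 = yes, 0 = no).
ticket = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
speed = [55, 58, 60, 62, 65, 68, 70, 73, 75, 80]
white_shoes = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]

t_speed = slope_t_stat(speed, ticket)
t_shoes = slope_t_stat(white_shoes, ticket)
print(f"speed: {t_speed:.2f}, white shoes: {t_shoes:.2f}")
```

In this invented data, speed clears the "greater than 2" rule of thumb and white tennis shoes land at about zero, which is the pattern described above. With only ten observations the exact 95% cutoff is a bit higher than 2, so the rule of thumb is best saved for larger samples.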

Finally, you get the residuals, which tell you for each observation how much you’re off. Going back to the Obama/Clinton example, let’s assume that the model we chose actually explains the NJ results really well. You would get a small residual for NJ. Let’s say it doesn’t explain SC that well; you would get a large residual for SC.

by @ 8:06 pm. Filed under Poll Watch, Poll Watch - NC

17 Responses to “Re-Pre-Visiting North Carolina”

  1. Alex Knepper Says:

    During that last post, I ran a series of regression analyses that indicated that Hillary should get between 50 and 55% of the vote in North Carolina. I found that two factors alone — the percentage AA population and the percentage of college educated voters — worked to explain a massive 70-90% of the difference between Obama’s share of the vote and Hillary’s share of the vote in the South. This was true when I used data from the state level as well as when I used data from the Congressional district level.

    You mean Obama?

  2. Sean Oxendine Says:

    Blah blah blah. You and your little nits.

  3. Alex Knepper Says:

    So if you find out that A correlates with C, it may not be the case that A is causing C, because A could actually be caused by B, which also causes C (ie working more hours means you serve more french fries, which means that hours or FF served can tell you what your wage is, even though hours worked is the causal agent).

    Ye Olde Tyme Lurking Variable!

  4. Doug Forrester Says:

    ~60% of North Carolina DEM primary voters are white. ~30-35% are black.

    My informed guess is that Hillary finishes with 43-45% of the total vote.

  5. Matthew E. Miller Says:

    Sean,

    You might want to explain who is blue and who is green on that map.

  6. Doug Forrester Says:

    Blue must be Hill and green Obama.

    How appropriate.

  7. PnGrata Says:

    So your Appalachia outline map was great, but if the actual results look anything like this at the end, you can go get yourself a job with any polling firm in country at the drop of a hat.

  8. E Dogg Says:

    how big of a margin is dark blue?

  9. Clarence Claus Says:

    Sean, are you a statistician or something?

  10. Will North Carolina Have A Bit Of Tennessee Flavor? : Post Politics: Political News and Views in Tennessee Says:

    [...] Sean Oxendine does an interesting regression analysis designed to predict the outcome of the North Carolina primary and comes up with some interesting findings: Incidentally, weighting the results for county population, we end up with Hillary losing 45-55%. The interesting thing is that the residuals (ie the difference between whatever the model predicts and the actual results) are evenly distributed for positive and negative for all states except Tennessee. For Tennessee, almost every county overstates Obama’s performance. Is NC like Tennessee? [...]

  11. Clarence Claus Says:

    I noticed in South Carolina, Barack Obama got only like 29% of the white vote, but that was with Edwards in the race. In Alabama, he got roughly the same amount. Therefore just about all Edwards voters in the South would be for Hillary because they vote along racial lines. North Carolina has fewer blacks than South Carolina, so she might eke out a win, but it will be very close.

  12. Seth Holladay » Links » links for 2008-05-01 Says:

    [...] race42008.com » Blog Archive » Re-Pre-Visiting North Carolina (tags: politicalmaps northcarolina) [...]

  13. North Carolina Primary Maps - May 6, 2008 | Political Maps Says:

    [...] race42008.com - North Carolina Democratic Primary Prediction [...]

  14. Miako Says:

    Map significantly overstates Obama’s performance. A grayscale map would be much more effective.

  15. Sean Oxendine Says:

    Maybe so. But this conservative blogger is not about to pick which candidate to assign black to and which to assign white to. Seems like a no-win situation for me . . .

  16. race42008.com » Blog Archive » Poll Watch: SurveyUSA Indiana and North Carolina Democratic Primary Says:

    [...] also predicted last week that given the demographics of North Carolina, it was likely that the best Hillary could [...]

  17. Primary Day Predictions | Stop Her Now Blog Says:

    [...] really detailed analysis at Race42008 says Hillary takes Indiana by 14, losing North Carolina by [...]
