staticnotes.org

Reproducing Nate Silver's regression analysis on COVID death rates

⋅ 12 minute read

I spent some time recently studying causal inference methods. Two great resources for this are:

When I learn about an unfamiliar topic, I will always try to look for real-world examples. In this case I keep my eyes open about articles sharing statistical findings, where I can attempt to reproduce the results.

In Nate Silver’s recent article Fine, I’ll run a regression analysis. But it won’t make you happy external link he shows the effect of state partisanship and COVID vaccination rates on COVID death rates. The models used are simple linear regressions with one to three independent variables (partisanship, age, vaccination rates).

Silver’s point in the article is only partially about his finding that state partisanship is a good predictor for COVID death rates in that state. Instead, he tries to dismiss critics that argue that simple models are not valid when they leave out additional variables, e.g. age, co-morbidities. While he agrees that one has to justify a particular model design, simple models can deliver true insights while standing up to scrutiny. Moreover he adds that “there’s a general tendency [in the profession] to overfit models.”

This post is not about Silver’s point or the political aspect of the result. I simply wanted to reproduce the findings in the article.

In his article Silver runs four linear regressions:

  1. impact of state partisanship on COVID death rates
  2. impact of state partisanship and state age on COVID death rates
  3. impact of state partisanship, state age, and state vaccination rates on COVID death rates
  4. impact of state age and state vaccination rates on COVID death rates

We need to collect data for every US state on partisanship, age structure, vaccination rates, and COVID death rates. It turned out, perhaps not surprisingly, that finding and assembling the dataset from open sources took significantly more time than setting up the regressions. I had three issues when I tried to assemble the data from the sources that the article links to:

So I had to use in some cases different sources.

Building the data set

We are building the data set from four sources:

  1. US states COVID cases and deaths: https://github.com/nytimes/covid-19-data/blob/62ef34cfcb60214be873a38d73619da9ea57d50b/us-states.csv external link
  2. US states age statistics: https://www.prb.org/resources/which-us-states-are-the-oldest/ external link
  3. US states vaccination rates: https://ourworldindata.org/us-states-vaccinations external link
  4. US states election results 2020: https://www.kaggle.com/code/paultimothymooney/2020-usa-election-vote-percentages-by-state/output external link

COVID death rates

We want to calculate the COVID death rates (COVID deaths per 1M population) between two points in time (1. February 2021 when vaccines became widely available and 23. March 2023 as in the article). Silver links to worldometer.info external link as his dataset for covid deaths per state. However, I couldn’t find a way to download the data in timeseries form (numbers for each day) without scraping the website. Instead we are using the equivalent numbers published by the New York Times. Let’s load the CSV file into a pandas dataframe:

1import pandas as pd
2import numpy as np
3import statsmodels.api as sm
4import matplotlib.pyplot as plt
1df = pd.read_csv("us-states.csv")
1df['date'] = pd.to_datetime(df['date'])
2# create two series for the two dates of interest and rename the columns
3deaths_february = df[df.date == '2021-02-01'][["state", "deaths"]].rename(columns = {"deaths" : "deaths_2021_02_01"})
4deaths_last = df[df.date == '2023-03-23'][["state", "deaths"]].rename(columns = {"deaths" : "deaths_2023_03_23"})

With the two pandas series deaths_february and deaths_last we can calculate the number of death in this time period.

1deaths = deaths_last.merge(deaths_february, how = 'left', on = 'state').dropna(axis = 'index')
2deaths["deaths_after_vaccine"] = deaths["deaths_2023_03_23"] - deaths["deaths_2021_02_01"]

Now we can look at the state numbers for COVID deaths between 01.02.2021 and 23.03.2023:

1deaths

state deaths_2023_03_23 deaths_2021_02_01 deaths_after_vaccine
0 Alabama 21631 7688.0 13943.0
1 Alaska 1438 253.0 1185.0
3 Arizona 33190 13124.0 20066.0
4 Arkansas 13068 4895.0 8173.0
5 California 104277 41284.0 62993.0
6 Colorado 14245 5737.0 8508.0
7 Connecticut 12270 7119.0 5151.0
8 Delaware 3352 1101.0 2251.0
9 District of Columbia 1432 916.0 516.0
10 Florida 87141 26684.0 60457.0
11 Georgia 41055 13821.0 27234.0
12 Guam 416 130.0 286.0
13 Hawaii 1851 407.0 1444.0
14 Idaho 5456 1737.0 3719.0
15 Illinois 41618 21273.0 20345.0
16 Indiana 26179 9989.0 16190.0
17 Iowa 10770 4906.0 5864.0
18 Kansas 10232 3809.0 6423.0
19 Kentucky 18348 3995.0 14353.0
20 Louisiana 18835 8912.0 9923.0
21 Maine 2981 595.0 2386.0
22 Maryland 16672 7154.0 9518.0
23 Massachusetts 24441 14607.0 9834.0
24 Michigan 42311 15527.0 26784.0
25 Minnesota 14964 6270.0 8694.0
26 Mississippi 13431 6056.0 7375.0
27 Missouri 23998 7182.0 16816.0
28 Montana 3701 1235.0 2466.0
29 Nebraska 5068 2031.0 3037.0
30 Nevada 12093 4281.0 7812.0
31 New Hampshire 3018 1059.0 1959.0
32 New Jersey 36097 21513.0 14584.0
33 New Mexico 9110 3297.0 5813.0
34 New York 80109 43354.0 36755.0
35 North Carolina 29746 9397.0 20349.0
36 North Dakota 2529 1447.0 1082.0
37 Northern Mariana Islands 41 2.0 39.0
38 Ohio 42061 11233.0 30828.0
39 Oklahoma 16549 3564.0 12985.0
40 Oregon 9451 1980.0 7471.0
41 Pennsylvania 50701 21741.0 28960.0
42 Puerto Rico 5848 1836.0 4012.0
43 Rhode Island 3915 2173.0 1742.0
44 South Carolina 20192 7283.0 12909.0
45 South Dakota 3222 1778.0 1444.0
46 Tennessee 29035 9660.0 19375.0
47 Texas 94518 37405.0 57113.0
48 Utah 5316 1669.0 3647.0
49 Vermont 939 175.0 764.0
50 Virgin Islands 130 24.0 106.0
51 Virginia 23782 6474.0 17308.0
52 Washington 15905 4404.0 11501.0
53 West Virginia 8132 2028.0 6104.0
54 Wisconsin 16485 6439.0 10046.0
55 Wyoming 2014 596.0 1418.0

Age characteristics

We load data for US states population age characteristics and calculate the proportion of the state population equal or older than 65. We would expect that states with an older population are more severely affected by COVID (all else being equal).

1states_population = pd.read_csv("states_population_age_2020.csv", thousands=',').drop(["Rank", "Population Ages 65+ (percent of state population)"], axis = 1)
2states_population = states_population.rename(columns = {"State" : "state", "Total Resident Population (thousands)": "total_population_thousands", "Population Ages 65+ (thousands)" : "population_age65_thousands"})
3states_population["perc_over_65"] =  states_population["population_age65_thousands"] / states_population["total_population_thousands"] * 100.0

Next we add the perc_over_65 column to the deaths dataframe.

1# drop due to missing population data for DC, Guam, Puerto Rico, Northern Mariana Islands, Virgin Islands
2deaths_and_states = deaths.merge(states_population[["state", "total_population_thousands", "perc_over_65"]], on = 'state', how = 'left').dropna()

Moreover, to use the same variable scaling as in the article, we scale the deaths_after_vaccine variable from absolute numbers to be COVID deaths per million population.

1deaths_and_states["deaths_per_million"] = deaths_and_states["deaths_after_vaccine"] / ((deaths_and_states["total_population_thousands"] * 1000) / 1_000_000 )
1# dataframe with deaths, population, and age variable
2deaths_and_states.head(5)

state deaths_2023_03_23 deaths_2021_02_01 deaths_after_vaccine total_population_thousands perc_over_65 deaths_per_million
0 Alabama 21631 7688.0 13943.0 4922.0 17.757009 2832.791548
1 Alaska 1438 253.0 1185.0 731.0 13.132695 1621.067031
2 Arizona 33190 13124.0 20066.0 7421.0 18.515025 2703.948255
3 Arkansas 13068 4895.0 8173.0 3031.0 17.683933 2696.469812
4 California 104277 41284.0 62993.0 39368.0 15.179841 1600.106686

Presidential election results

Similar to the article we want to generate a variable biden which indicates Joe Biden’s margin of victory over Donald Trump in the 2020 elections.

1# load election results by state for 2020 presidential election
2election_results = pd.read_csv("democratic_vs_republican_votes_by_usa_state_2020.csv").drop("usa_state_code", axis = 1)
3election_results["biden"] = election_results["percent_democrat"] - 50
4election_results.head(10)

state DEM REP usa_state percent_democrat biden
0 Alabama 843473 1434159 Alabama 37.032892 -12.967108
1 Alaska 45758 80999 Alaska 36.098993 -13.901007
2 Arizona 1643664 1626679 Arizona 50.259682 0.259682
3 Arkansas 420985 761251 Arkansas 35.609218 -14.390782
4 California 9315259 4812735 California 65.934760 15.934760
5 Colorado 1753416 1335253 Colorado 56.769307 6.769307
6 Connecticut 1059252 699079 Connecticut 60.241900 10.241900
7 Delaware 295413 199857 Delaware 59.646859 9.646859
8 District of Columbia 258561 14449 District of Columbia 94.707520 44.707520
9 Florida 5294767 5667834 Florida 48.298456 -1.701544
1# merge to deaths dataframe
2death_and_election = deaths_and_states.merge(election_results[["state", "biden"]], how = "left", on = 'state')

Vaccination rates

Lastly, we add the vaccination rates per state after the pandemic was over. I use the vaccination rates from 10.5.2023 as this data was easily available.

1vaccinations_raw = pd.read_csv("us_state_vaccinations.csv").rename(columns={"location": "state"})
2vaccinations_raw.replace({'New York State': 'New York'}, inplace=True)
3# chose date and relevant columns
4vaccinations = vaccinations_raw[vaccinations_raw.date == '2023-05-10'][["state","people_vaccinated_per_hundred"]]

Next, we merge the vaccination rates with the dataframe to get our final dataset.

1dataset = death_and_election.merge(vaccinations, how = "left", on = 'state')
2dataset[["state", "deaths_per_million", "biden", "perc_over_65", "people_vaccinated_per_hundred"]].head(5)

state deaths_per_million biden perc_over_65 people_vaccinated_per_hundred
0 Alabama 2832.791548 -12.967108 17.757009 65.12
1 Alaska 1621.067031 -13.901007 13.132695 73.23
2 Arizona 2703.948255 0.259682 18.515025 78.37
3 Arkansas 2696.469812 -14.390782 17.683933 70.09
4 California 1600.106686 15.934760 15.179841 85.07

I am also storing the final dataset as a parquet file in case you want to play with it. To find it, follow the link at the bottom of the page.

1dataset[["state", "deaths_per_million", "biden", "perc_over_65", "people_vaccinated_per_hundred"]].to_parquet("death_partisanship_final_dataset.parquet", engine = 'pyarrow')

Regressions

With the assembled dataset we can now reproduce Nate Silver’s findings.

1. Regression: state partisanship on death rates

Silver starts with a one-variable baseline model. And indeed biden, Joe Biden’s winning margin, is a statistically significant predictor of COVID deaths. According to the model, one percentage point increase in Biden’s margin reduced expected COVID deaths by ~30 covid deaths per million population.

1x = dataset[['biden']]
2y = dataset['deaths_per_million']
3# add a constant term to the model
4x = sm.add_constant(x)
5# run regression
6model = sm.OLS(y, x).fit()
7print_model = model.summary()
8print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     deaths_per_million   R-squared:                       0.286
Model:                            OLS   Adj. R-squared:                  0.272
Method:                 Least Squares   F-statistic:                     19.27
Date:                Sun, 14 Jan 2024   Prob (F-statistic):           6.22e-05
Time:                        18:05:09   Log-Likelihood:                -381.14
No. Observations:                  50   AIC:                             766.3
Df Residuals:                      48   BIC:                             770.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2031.3050     72.204     28.133      0.000    1886.129    2176.481
biden        -29.7773      6.784     -4.389      0.000     -43.418     -16.137
==============================================================================
Omnibus:                        0.718   Durbin-Watson:                   2.263
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.732
Skew:                          -0.266   Prob(JB):                        0.694
Kurtosis:                       2.737   Cond. No.                         10.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can also plot the data set and the regression line to get a more intuitive overview of the result:

1fig, ax = plt.subplots(figsize = (9, 9))
2ax.scatter(dataset["biden"],  dataset["deaths_per_million"])
3plt.xlabel("Margin % for Biden (positive values mean Democrats won)")
4plt.ylabel("Post vaccine (Feb 2021) deaths per 1M population")
5b, m = model.params
6ax.axline(xy1=(0, b), slope=m, color='red')
7plt.grid()

2. Regression: state partisanship and age on death rates

Silver’s article is a response to criticism that the model above is not valuable because different states have different age structures and that may explain most of the variation in death rates. Silver remarks that the results hold even when controlling for age. We can add age perc_over_65 to the model and re-run the regression and indeed we get the same result. biden is still significant with a similar coefficient.

1x2 = dataset[['biden', 'perc_over_65']]
2y2 = dataset['deaths_per_million']
3
4x2 = sm.add_constant(x2)
5model2 = sm.OLS(y2, x2).fit()
6print_model2 = model2.summary()
7print(print_model2)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     deaths_per_million   R-squared:                       0.397
Model:                            OLS   Adj. R-squared:                  0.371
Method:                 Least Squares   F-statistic:                     15.44
Date:                Sun, 14 Jan 2024   Prob (F-statistic):           6.99e-06
Time:                        18:05:12   Log-Likelihood:                -376.95
No. Observations:                  50   AIC:                             759.9
Df Residuals:                      47   BIC:                             765.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          256.0210    609.788      0.420      0.677    -970.714    1482.756
biden          -32.9481      6.397     -5.151      0.000     -45.817     -20.079
perc_over_65   101.6118     34.690      2.929      0.005      31.824     171.400
==============================================================================
Omnibus:                        1.194   Durbin-Watson:                   2.318
Prob(Omnibus):                  0.551   Jarque-Bera (JB):                1.061
Skew:                          -0.162   Prob(JB):                        0.588
Kurtosis:                       2.365   Cond. No.                         162.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3. Regression: state partisanship, age, and vaccination rates on death rates

Clearly your political leanings are not a predictor for your chance of dying from COVID. The assumption behind the model is that Republicans have been less likely to get vaccinated, and unvaccinated humans have a higher mortality w.r.t COVID. We can check what happens if we add the vaccination rates as a variable to the model.

We can see below that vaccination rate is significant and explains away the effect of the biden variable.

1x3 = dataset[['biden', 'perc_over_65', 'people_vaccinated_per_hundred']]
2y3 = dataset['deaths_per_million']
3x3 = sm.add_constant(x3)
4model3 = sm.OLS(y3, x3).fit()
5print_model3 = model3.summary()
6print(print_model3)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     deaths_per_million   R-squared:                       0.450
Model:                            OLS   Adj. R-squared:                  0.414
Method:                 Least Squares   F-statistic:                     12.55
Date:                Sun, 14 Jan 2024   Prob (F-statistic):           4.01e-06
Time:                        18:05:13   Log-Likelihood:                -374.63
No. Observations:                  50   AIC:                             757.3
Df Residuals:                      46   BIC:                             764.9
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                          1528.3900    841.133      1.817      0.076    -164.723    3221.503
biden                           -15.9990     10.110     -1.583      0.120     -36.349       4.351
perc_over_65                    114.7298     34.042      3.370      0.002      46.207     183.253
people_vaccinated_per_hundred   -18.5072      8.743     -2.117      0.040     -36.106      -0.908
==============================================================================
Omnibus:                        0.262   Durbin-Watson:                   2.192
Prob(Omnibus):                  0.877   Jarque-Bera (JB):                0.454
Skew:                           0.027   Prob(JB):                        0.797
Kurtosis:                       2.537   Cond. No.                     1.08e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

4. Regression: age and vaccination rates on death rates

In a final step, we remove the correlated variable biden from the model to demonstrate the impact of age and vaccination rate on death rate:

1x4 = dataset[['perc_over_65', 'people_vaccinated_per_hundred']]
2y4 = dataset['deaths_per_million']
3x4 = sm.add_constant(x4)
4
5model4 = sm.OLS(y4, x4).fit()
6
7print_model4 = model4.summary()
8print(print_model4)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     deaths_per_million   R-squared:                       0.420
Model:                            OLS   Adj. R-squared:                  0.396
Method:                 Least Squares   F-statistic:                     17.03
Date:                Sun, 14 Jan 2024   Prob (F-statistic):           2.74e-06
Time:                        18:05:14   Log-Likelihood:                -375.95
No. Observations:                  50   AIC:                             757.9
Df Residuals:                      47   BIC:                             763.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                          2386.5532    653.203      3.654      0.001    1072.478    3700.629
perc_over_65                    117.0242     34.551      3.387      0.001      47.516     186.532
people_vaccinated_per_hundred   -29.4653      5.423     -5.434      0.000     -40.374     -18.556
==============================================================================
Omnibus:                        0.653   Durbin-Watson:                   2.211
Prob(Omnibus):                  0.722   Jarque-Bera (JB):                0.527
Skew:                           0.244   Prob(JB):                        0.768
Kurtosis:                       2.881   Cond. No.                         829.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Jupyter Notebook

You can find the jupyter notebook and the datasets for this post here external link .

If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

#notebook   #statistics