4  Empirical Applications of Instrumental Variables Regression

In this chapter we apply instrumental variables (IV) regression, a class of regression methods designed for the case where the error term u is correlated with the regressor of interest, so that OLS estimates the corresponding coefficient inconsistently.

We have previously addressed the issue of omitted variables bias by adding the omitted variables to the regression, trying to mitigate the risk of biased estimation of the causal effect of interest. However, if we don’t have data on the omitted factors, multiple regression is not sufficient.

The same issue arises when causality runs both from X to Y and from Y to X, so that there is simultaneous causality bias. Again, the resulting estimation bias cannot be corrected by multiple regression.

Instrumental variables (IV) regression is a general solution to obtain a consistent estimator of the unknown causal coefficients when the regressor X is correlated with the error term u. In this chapter we focus on the IV regression tool called two-stage least squares (TSLS).

4.1 Data Set Description

We will use the data set CigarettesSW which comes with the package AER (Kleiber and Zeileis 2008). It is a panel data set that contains observations on cigarette consumption and several economic indicators for the 48 continental U.S. states in the years 1985 and 1995.

# load the data set 
library(AER)
data("CigarettesSW")
# get an overview
summary(CigarettesSW)
     state      year         cpi          population           packs       
 AL     : 2   1985:48   Min.   :1.076   Min.   :  478447   Min.   : 49.27  
 AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606   1st Qu.: 92.45  
 AZ     : 2             Median :1.300   Median : 3697472   Median :110.16  
 CA     : 2             Mean   :1.300   Mean   : 5168866   Mean   :109.18  
 CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500   3rd Qu.:123.52  
 CT     : 2             Max.   :1.524   Max.   :31493524   Max.   :197.99  
 (Other):84                                                                
     income               tax            price             taxs       
 Min.   :  6887097   Min.   :18.00   Min.   : 84.97   Min.   : 21.27  
 1st Qu.: 25520384   1st Qu.:31.00   1st Qu.:102.71   1st Qu.: 34.77  
 Median : 61661644   Median :37.00   Median :137.72   Median : 41.05  
 Mean   : 99878736   Mean   :42.68   Mean   :143.45   Mean   : 48.33  
 3rd Qu.:127313964   3rd Qu.:50.88   3rd Qu.:176.15   3rd Qu.: 59.48  
 Max.   :771470144   Max.   :99.00   Max.   :240.85   Max.   :112.63  
                                                                      

Use ?CigarettesSW for a detailed description of the variables.

4.2 Problem Description

The relation between commodity demand and prices is a fundamental and widely observed issue in economics. Health economics focuses on how individual health-related behaviors are influenced by healthcare systems and regulatory policies. Smoking serves as a prime example in public policy discussions due to its association with various illnesses and negative impacts on society.

Cigarette consumption could potentially be reduced by increasing taxes on cigarettes. The question is by how much taxes must be increased to reach a certain reduction in cigarette consumption.

Economists commonly estimate elasticities to answer this kind of question. But an OLS regression of log quantity on log price cannot be used to estimate the price elasticity of the demand for cigarettes, since price and quantity are determined simultaneously by the interaction of demand and supply.

In this case, the effect on demand quantity of a change in price can instead be estimated using IV regression.

4.3 The IV Estimator with a Single Regressor and a Single Instrument

Consider the simple regression model

Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n \tag{5.1}

where the error term u_i is correlated with the regressor X_i (X is endogenous) such that the OLS estimator is inconsistent for the true \beta_1 (the causal effect of X on Y). Instrumental variables estimation uses an additional, “instrumental” variable Z to isolate that part of X that is uncorrelated with u, to obtain a consistent estimator for \beta_1.

Z must satisfy two conditions to be a valid instrument:

1. Instrument relevance condition: X and its instrument Z must be correlated: \rho_{Z_i,X_i} \neq 0.

2. Instrument exogeneity condition: The instrument Z must not be correlated with the error term u: \rho_{Z_i,u_i} = 0.


The Two-Stage Least Squares Estimator

As its name suggests, TSLS proceeds in two stages. In the first stage, the endogenous regressor X is decomposed into a problem-free component, uncorrelated with the error term, that is explained by the instrument Z, and a problematic component that may be correlated with the error u_i. The second stage uses the problem-free component to estimate \beta_1.

The first stage regression model is X_i = \pi_0 + \pi_1 Z_i + \nu_i where \pi_0 + \pi_1 Z_i is the component of X_i explained by Z_i and \nu_i is the problematic component that cannot be explained by Z_i and exhibits correlation with u_i.

Using the OLS estimates \hat{\pi}_0 and \hat{\pi}_1, we obtain the predicted values \widehat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i, i=1,\ldots,n. If Z is a valid instrument, the \widehat{X}_i are problem-free in the sense that \widehat{X} is exogenous in the second-stage regression, which is the OLS regression of Y on \widehat{X}.

From the second stage regression we obtain the TSLS estimators \widehat\beta_0^{TSLS} and \widehat\beta_1^{TSLS}. For a single instrument case the TSLS estimator of \beta_1 is:

\widehat\beta_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} = \frac{ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})(Z_i - \bar{Z})} {\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Z_i - \bar{Z})}, \tag{5.2}

which is the ratio of the sample covariance between Z and Y to the sample covariance between Z and X.

Assuming Z meets the requirements of a valid instrument, (5.2) is a consistent estimator for \beta_1 in (5.1). The Central Limit Theorem (CLT) implies that in large samples the distribution of \widehat\beta_1^{TSLS} is well approximated by a normal distribution. Consequently, we can use t-statistics and confidence intervals, which R functions such as coeftest() compute.
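The following sketch illustrates (5.2) on simulated data. All parameter values, the seed and the variable names are assumptions chosen for illustration and have nothing to do with the cigarette data. The regressor x is endogenous by construction, so OLS should miss the true slope of 1.5, while the covariance ratio and an explicit two-stage computation should agree with each other and land close to it.

```r
# Simulated illustration (assumed parameters, not taken from CigarettesSW)
set.seed(1)
n <- 10000
z <- rnorm(n)                          # instrument: relevant and exogenous by construction
u <- rnorm(n)                          # structural error term
x <- 1 + 0.8 * z + 0.6 * u + rnorm(n)  # x depends on u, hence it is endogenous
y <- 2 + 1.5 * x + u                   # true beta_1 = 1.5

# OLS slope: inconsistent because x and u are correlated
beta_ols <- unname(coef(lm(y ~ x))[2])

# TSLS via the covariance ratio in (5.2)
beta_iv <- cov(z, y) / cov(z, x)

# TSLS via two explicit stages
x_hat <- fitted(lm(x ~ z))
beta_2s <- unname(coef(lm(y ~ x_hat))[2])

c(OLS = beta_ols, IV_ratio = beta_iv, two_stage = beta_2s)
```

In this design the OLS slope converges to a value above the true coefficient, whereas the two IV computations are numerically identical up to floating-point error.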


For our problem, we are interested in estimating \beta_1 in

\log(Q_i^{\text{cigarettes}}) = \beta_0 + \beta_1 \log(P_i^{\text{cigarettes}}) + u_i \tag{5.3}

where Q_i^{\text{cigarettes}} is the number of cigarette packs sold per capita (the demand), P_i^{\text{cigarettes}} is the after-tax average real price per pack of cigarettes in state i and u_i represents other factors that affect the demand of cigarettes.

The instrumental variable we will use for instrumenting the endogenous regressor \log(P_i^{\text{cigarettes}}) is SalesTax, the portion of taxes on cigarettes arising from the general sales tax, measured in dollars per pack (in real dollars, deflated by the Consumer Price Index).

Before using TSLS, it is essential to ask whether the two conditions for instrument validity hold. First, SalesTax is plausibly a relevant instrument, since a higher sales tax increases the after-tax sales price.

Second, since the sales tax plausibly influences the quantity sold only indirectly, through the price, it is plausible that SalesTax is exogenous. The credibility of this assumption will be further discussed later, but for now we keep it as a working hypothesis.

We first perform some transformations in order to obtain deflated cross section data for the year 1995, as we will consider data for the cross section of states in 1995 only. We also compute the sample correlation between the sales tax and price per pack.

# compute real per capita prices
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)

#  compute the sales tax
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)

# check the correlation between sales tax and price
cor(CigarettesSW$salestax, CigarettesSW$price)
[1] 0.6141228
# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995")

The estimate of approximately 0.614, computed here over both years and with the nominal price, indicates that salestax and price exhibit positive correlation. However, a correlation analysis like this is not sufficient for checking whether the instrument is relevant. As mentioned, we will discuss later the issue of checking whether an instrument is relevant and exogenous.

The first stage regression is

\log(P_i^{\text{cigarettes}}) = \pi_0 + \pi_1SalesTax_i + \nu_i

We can estimate this model in R using lm().

# perform the first stage regression
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
coeftest(cig_s1, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error  t value  Pr(>|t|)    
(Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first stage regression yields:

\log(\widehat{P_i^{\text{cigarettes}}}) = \underset{(0.029)}{4.617} + \underset{(0.005)}{0.031} \, SalesTax_i

indicating a positive relationship between the price of cigarettes and the sales tax.

How much of the observed variation in \log(P_i^{\text{cigarettes}}) is explained by the instrument SalesTax? This can be answered by looking at the regression’s R^2

# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared
[1] 0.4709961

which states that about 47\% of the variation in after tax prices is explained by the variation of the sales tax across states.

Next, we store \log(\widehat{P_i^{\text{cigarettes}}}), the fitted values obtained by the first stage regression cig_s1, in the variable lcigp_pred.

# store the predicted values
lcigp_pred <- cig_s1$fitted.values

Now in the second stage we run the regression of \log(Q_i^{\text{cigarettes}}) on \log(\widehat{P_i^{\text{cigarettes}}}) to obtain \widehat\beta_0^{TSLS} and \widehat\beta_1^{TSLS}:

# perform the second stage regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC)

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  9.71988    1.70304  5.7074 7.932e-07 ***
lcigp_pred  -1.08359    0.35563 -3.0469  0.003822 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus estimating the model (5.3) using TSLS yields

\widehat{\log(Q_i^{\text{cigarettes}})} = \underset{(1.70)}{9.72} \,- \underset{(0.36)}{1.08} \log(P_i^{\text{cigarettes}}) \tag{5.4}

Strictly speaking, this estimated regression function should be written using the second-stage regressor, the predicted value \log(\widehat{P_i^{\text{cigarettes}}}). It is, however, conventional and more convenient simply to report the estimated regression function with \log(P_i^{\text{cigarettes}}) rather than \log(\widehat{P_i^{\text{cigarettes}}}).

Instead of manually performing TSLS in steps, we can use the function ivreg() from the AER package in R to compute the TSLS estimators in just one line of code. It is coded similarly as lm(). Instruments can be included in the standard regression formula by separating the model equation from the instruments using a vertical bar.

For our regression of interest the correct formula would be log(packs) ~ log(rprice) | salestax

# perform TSLS using 'ivreg()'
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax, data = c1995)

coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  9.71988    1.52832  6.3598 8.346e-08 ***
log(rprice) -1.08359    0.31892 -3.3977  0.001411 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both approaches yield the same coefficient estimates, but the standard errors differ from those computed with the manual two-step approach. This is because the standard errors reported for the second-stage regression using lm() are invalid: they do not account for the fact that the second-stage regressor consists of predictions from the first-stage regression.

In contrast, ivreg() performs the necessary adjustment automatically. The step-by-step computation was shown only to demonstrate the mechanics of the procedure. In practice, it is recommended to use ivreg() when estimating by TSLS.
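Why the naive second-stage standard errors are invalid can be sketched on simulated data. The second-stage lm() builds its residuals from the fitted values of X, whereas the TSLS residuals are built from the actual X. The design, the seed and the helper se_slope() below are assumptions made for illustration, and the helper uses the homoskedasticity-only formula rather than the robust one employed elsewhere in this chapter.

```r
# Simulated illustration (assumed design, not the cigarette data)
set.seed(2)
n <- 500
z <- rnorm(n)
u <- rnorm(n)
x <- 0.5 * z + 0.7 * u + rnorm(n)   # endogenous regressor
y <- 1 + 2 * x + u

s1 <- lm(x ~ z)                     # first stage
x_hat <- fitted(s1)
s2 <- lm(y ~ x_hat)                 # second stage
b <- unname(coef(s2))

# residuals that lm() uses in the second stage: based on x_hat
res_naive <- y - b[1] - b[2] * x_hat
# TSLS residuals: same coefficients, but based on the actual x
res_tsls <- y - b[1] - b[2] * x

# hypothetical helper: homoskedasticity-only standard error of the slope
se_slope <- function(res) {
  sqrt(sum(res^2) / (n - 2) / sum((x_hat - mean(x_hat))^2))
}

se_naive <- se_slope(res_naive)     # reproduces the lm() standard error
se_tsls <- se_slope(res_tsls)       # uses the correct residuals
c(naive = se_naive, tsls = se_tsls)
```

The two numbers differ only through the residual variance, which is the adjustment the manual approach misses.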

Additionally, it is important to compute heteroskedasticity-robust standard errors using vcovHC(), just like in multiple regression.

The TSLS estimate \widehat\beta_1^{TSLS} of -1.08 suggests that the demand for cigarettes is actually elastic. Its interpretation is that an increase in the price of 1\% is estimated to reduce consumption on average by approximately 1.08\%.
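As a rough check on how precisely this elasticity is estimated, a large-sample 95\% confidence interval can be worked out from the ivreg() output above, using the robust standard error of about 0.32. The normal critical value 1.96 is an assumption here; with only 48 observations a t-based interval would be slightly wider.

-1.0836 \pm 1.96 \times 0.3189 \approx [-1.71, \, -0.46]

The interval contains -1, so at the 5\% level the data do not rule out unit-elastic demand, and the conclusion that demand is elastic should be read with this uncertainty in mind.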

Recalling the discussion of instrument exogeneity, perhaps this estimate should not yet be taken too seriously. Even though the elasticity was estimated using an instrumental variable, there might still be omitted variables that are correlated with the sales tax per pack. A multiple IV regression would be more appropriate to mitigate that risk.

4.4 Multiple IV Regression: The General IV Regression Model

The General Instrumental Variables Regression Model and Terminology

Y_i = \beta_0 + \beta_1 X_{1i} + \ldots + \beta_k X_{ki} + \beta_{k+1} W_{1i} + \ldots + \beta_{k+r} W_{ri} + u_i \tag{5.5}

with i=1, \ldots,n is the general instrumental variables regression model where:

  • Y_i is the dependent variable,

  • \beta_0, \ldots, \beta_{k+r} are 1+k+r unknown regression coefficients,

  • X_{1i}, \ldots, X_{ki} are k endogenous regressors,

  • W_{1i}, \ldots, W_{ri} are r exogenous regressors, which are uncorrelated with u_i,

  • u_i is the error term,

  • Z_{1i}, \ldots, Z_{mi} are m instrumental variables.

The coefficients are overidentified if m > k, they are underidentified if m < k, and they are exactly identified when m=k. Estimation of the IV regression model requires exact identification or overidentification.
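The classification above can be written down as a small helper. The function identification_status() is hypothetical and not part of AER. It only compares the number of instruments m with the number of endogenous regressors k.

```r
# Hypothetical helper (not part of any package): classify identification
# from the number of instruments m and endogenous regressors k
identification_status <- function(m, k) {
  stopifnot(m >= 0, k >= 1)
  if (m > k) {
    "overidentified"
  } else if (m == k) {
    "exactly identified"
  } else {
    "underidentified"
  }
}

identification_status(m = 1, k = 1)  # one instrument, one endogenous regressor
identification_status(m = 2, k = 1)  # two instruments, one endogenous regressor
identification_status(m = 1, k = 2)  # too few instruments for TSLS
```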


TSLS in the General IV Model

First-stage regression(s): Regress each of the endogenous variables (X_{1i}, \ldots ,X_{ki}) on all instrumental variables (Z_{1i}, \ldots, Z_{mi}), all exogenous variables (W_{1i}, \ldots, W_{ri}) and an intercept. Compute the fitted values (\hat X_{1i}, \ldots , \hat X_{ki}).

Second-stage regression: Regress the dependent variable on the predicted values of all endogenous regressors, all exogenous variables and an intercept using OLS. This gives \hat \beta^{TSLS}_0, \ldots,\hat \beta^{TSLS}_{k+r}, the TSLS estimates of the model coefficients.
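The two stages can be sketched in base R for one endogenous regressor, two instruments and one exogenous regressor. The data are simulated under assumed parameter values. As a cross-check, the sketch also evaluates the matrix form of the estimator, (X'P_Z X)^{-1} X'P_Z y, where P_Z projects onto the instruments and exogenous regressors. This closed form is not spelled out in this chapter and is included only for comparison.

```r
# Simulated illustration (assumed design): k = 1, m = 2, r = 1
set.seed(3)
n <- 2000
w1 <- rnorm(n)                       # exogenous regressor
z1 <- rnorm(n)                       # instruments
z2 <- rnorm(n)
u <- rnorm(n)
x1 <- 0.5 * z1 + 0.5 * z2 + 0.3 * w1 + 0.8 * u + rnorm(n)  # endogenous
y <- 1 + 2 * x1 - 1 * w1 + u         # true coefficients: 1, 2, -1

# first stage: endogenous regressor on all instruments and exogenous variables
x1_hat <- fitted(lm(x1 ~ z1 + z2 + w1))

# second stage: y on fitted values and exogenous variables
beta_two_stage <- unname(coef(lm(y ~ x1_hat + w1)))

# matrix form for comparison
X <- cbind(1, x1, w1)
Z <- cbind(1, z1, z2, w1)
PZX <- Z %*% solve(crossprod(Z), crossprod(Z, X))  # columns of X projected on Z
beta_matrix <- unname(drop(solve(crossprod(PZX, X), crossprod(PZX, y))))

rbind(beta_two_stage, beta_matrix)
```

Both rows should coincide up to numerical precision. As in the single-regressor case, only the coefficients from the manual second stage are usable, not its standard errors.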


The IV Regression Assumptions

  1. E(u_i|W_{1i}, \ldots, W_{ri})=0

  2. (X_{1i}, \ldots, X_{ki},W_{1i}, \ldots, W_{ri}, Z_{1i}, \ldots, Z_{mi}, Y_i) are i.i.d. draws from their joint distribution.

  3. All variables have nonzero finite fourth moments, i.e., outliers are unlikely.

  4. The Zs are valid instruments


Two Conditions For Valid Instruments

For a set of m instruments Z_{1i}, \ldots, Z_{mi} to be valid, they must meet two conditions:

1. Instrument Relevance

If there are k endogenous variables, r exogenous variables and m \geq k instruments Z, and \hat X^*_{1i}, \ldots , \hat X^*_{ki} are the predicted values from the k population first stage regressions, it must hold that (\hat{X}^{*}_{1i}, \ldots , \hat {X}^{*}_{ki},W_{1i}, \ldots, W_{ri}, 1) are not perfectly multicollinear. 1 denotes the constant regressor which equals 1 for all observations.

If there is only one endogenous regressor X_i, there must be at least one non-zero coefficient on the Z and the W in the population regression for this condition to be valid. If all of the coefficients are zero, all the \hat{X}^{*}_i are just the mean of X such that there is perfect multicollinearity.

2. Instrument Exogeneity

All m instruments must be uncorrelated with the error term: \rho_{Z_{1i}, u_i} = 0, \ldots, \rho_{Z_{mi}, u_i} = 0


Using a TSLS function such as ivreg() becomes even more advantageous with a larger set of potentially endogenous regressors and instruments. Its use is straightforward, but some care is needed to specify the regression formula correctly.

Let’s imagine we would like to estimate the model

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 W_{1i} + u_i

where X_{1i} and X_{2i} are endogenous regressors that shall be instrumented by Z_{1i}, Z_{2i} and Z_{3i}, and W_{1i} is an exogenous regressor.

The corresponding data is available in a data.frame with column names y, x1, x2, w1, z1, z2 and z3.

While it might be tempting to specify the argument formula in the call of ivreg() as y ~ x1 + x2 + w1 | z1 + z2 + z3 , this is wrong. It is necessary to list all exogenous variables as instruments too, that is joining them by +’s on the right of the vertical bar: y ~ x1 + x2 + w1 | w1 + z1 + z2 + z3 where w1 is “instrumenting itself”.

See ?ivreg for the documentation of the function, where this is explained.

If we have a large number of exogenous variables, it might be convenient to provide an update formula with a . right after the | (the dot stands for all regressors listed to the left of the |) and to exclude all endogenous variables using a -.

For example, if there is one exogenous regressor w1 and one endogenous regressor x1 with instrument z1, the corresponding formula would be y ~ w1 + x1 | w1 + z1, which is equivalent to y ~ w1 + x1 | . - x1 + z1.
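The equivalence of the two formula styles can be checked on simulated data. The data frame d and its coefficients are assumptions made for illustration. Only ivreg() from AER is taken from the text.

```r
library(AER)

# Simulated illustration (assumed design): one exogenous regressor w1,
# one endogenous regressor x1, one instrument z1
set.seed(4)
n <- 500
d <- data.frame(w1 = rnorm(n), z1 = rnorm(n))
u <- rnorm(n)
d$x1 <- 0.7 * d$z1 + 0.4 * d$w1 + 0.6 * u + rnorm(n)
d$y <- 1 + 2 * d$x1 + 0.5 * d$w1 + u

# explicit list of instruments: w1 instruments itself
fit_explicit <- ivreg(y ~ w1 + x1 | w1 + z1, data = d)

# update-style formula: start from all regressors, drop x1, add z1
fit_update <- ivreg(y ~ w1 + x1 | . - x1 + z1, data = d)

# both specifications should give the same coefficients
all.equal(coef(fit_explicit), coef(fit_update))
```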

Application to the Demand for Cigarettes

As explained, although our previous regression function \log(Q_i^{\text{cigarettes}}) = 9.72 - 1.08 \log(P_i^{\text{cigarettes}}) was estimated using IV regression, it is plausible that this estimate is biased, as the TSLS estimator is inconsistent for the true \beta_1 if the instrument (the real sales tax per pack) correlates with the error term.

There might still be omitted variables that are correlated with the sales tax per pack, such as income. States with higher incomes may rely less on sales tax and more on income tax to fund their state government. Additionally, the demand for cigarettes is likely influenced by income. Therefore, we aim to reevaluate our demand equation by incorporating income as a control variable:

\log(Q_i^{\text{cigarettes}}) = \beta_0 + \beta_1 \log(P_i^{\text{cigarettes}}) + \beta_2 \log(income_i) + u_i \tag{5.6}

Before estimating (5.6) using ivreg() we define income as real per capita income rincome, we append it to the data set CigarettesSW and we create a subset again for the year 1995. Then we estimate the model following the instructions previously explained.

# add rincome to the dataset and create subset for 1995
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)
c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + 
                    salestax, data = c1995)
coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
log(rincome)  0.21452    0.31175  0.6881  0.494917    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We obtain

\log(Q_i^{\text{cigarettes}}) = \underset{(1.26)}{9.43} - \underset{(0.37)}{1.14} \log(P_i^{\text{cigarettes}}) + \underset{(0.31)}{0.21} \log(income_i) \tag{5.7}

We can now add the cigarette-specific taxes (cigtax_i) as a further instrumental variable and estimate again using TSLS.

# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi)
c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | 
                    log(rincome) + salestax + cigtax, data = c1995)
coeftest(cig_ivreg3, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
log(rincome)  0.28040    0.25389  1.1044    0.2753    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using the two instruments salestax_i and cigtax_i, we have m=2 instruments and k=1 endogenous regressor, so the coefficient on the endogenous regressor \log(P_i^{\text{cigarettes}}) is now overidentified.

The new TSLS estimate of (5.6) with two instruments is

\log(Q_i^{\text{cigarettes}}) = \underset{(0.96)}{9.89} - \underset{(0.25)}{1.28} \log(P_i^{\text{cigarettes}}) + \underset{(0.25)}{0.28} \log(income_i) \tag{5.8}

When we compare the estimates from models (5.7) and (5.8), we observe smaller standard errors in (5.8).

The standard error of the estimated price elasticity is smaller by one-third in this equation (0.25 versus 0.37). The reason is that more information is being used in this estimation: using two instruments explains more of the variation in cigarette prices than just one.

If the instruments are valid, which is something essential to be checked, (5.8) would be considered more reliable.
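The claim that two instruments explain more of the variation in prices than one can be inspected directly by comparing the R^2 of the two first-stage regressions. The sketch below rebuilds the variables so that it runs on its own. The object names fs_one and fs_two are assumptions.

```r
library(AER)
data("CigarettesSW")

# rebuild the real variables used in this chapter
CigarettesSW$rprice <- with(CigarettesSW, price / cpi)
CigarettesSW$rincome <- with(CigarettesSW, income / population / cpi)
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
CigarettesSW$cigtax <- with(CigarettesSW, tax / cpi)
c1995 <- subset(CigarettesSW, year == "1995")

# first-stage regressions with one and with two instruments
fs_one <- lm(log(rprice) ~ log(rincome) + salestax, data = c1995)
fs_two <- lm(log(rprice) ~ log(rincome) + salestax + cigtax, data = c1995)

r2_one <- summary(fs_one)$r.squared
r2_two <- summary(fs_two)$r.squared
c(one_instrument = r2_one, two_instruments = r2_two)
```

Because the first model is nested in the second, the R^2 cannot decrease when cigtax is added. How much it increases indicates how much additional price variation the second instrument picks up.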

4.5 Instrument Validity

If the general sales tax and the cigarette-specific tax are not valid instruments, TSLS becomes inadequate for estimating the previously discussed demand elasticity for cigarettes. Although both variables are likely relevant, their exogeneity remains a separate issue.

Stock and Watson (2020) argue that cigarette-specific taxes could be endogenous due to state-specific historical factors, such as the economic significance of tobacco farming and cigarette production industries, which may advocate for lower cigarette-specific taxes.

Given the plausibility that states reliant on tobacco cultivation have higher smoking rates, this introduces endogeneity into cigarette-specific taxes. While incorporating data on the scale of the tobacco and cigarette industry into regression analysis could potentially address this concern, such data is unavailable.

Given that the role of the tobacco and cigarette industry varies across states but remains consistent over time, we will utilize the panel structure of CigarettesSW.

As outlined in the panel data chapter, conducting regressions based on data changes between two time periods eradicates state-specific and time-invariant effects. Our focus is on estimating the long-term elasticity of cigarette demand, thus we will examine changes in variables between 1985 and 1995.
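Why differencing removes the state-specific effects can be seen in one step. Suppose, as an assumption made only for this sketch, that demand in each of the two years contains a state effect \alpha_i that is constant over time, a year effect \lambda_t, and an error v_{i,t}:

\log(Q_{i,t}^{\text{cigarettes}}) = \alpha_i + \lambda_t + \beta_1 \log(P_{i,t}^{\text{cigarettes}}) + \beta_2 \log(\text{income}_{i,t}) + v_{i,t}, \quad t \in \{1985, 1995\}

Subtracting the 1985 equation from the 1995 equation gives

\Delta \log(Q_i^{\text{cigarettes}}) = (\lambda_{1995} - \lambda_{1985}) + \beta_1 \Delta \log(P_i^{\text{cigarettes}}) + \beta_2 \Delta \log(\text{income}_i) + (v_{i,1995} - v_{i,1985})

so \alpha_i drops out. In this notation, the intercept of the differenced model is the change in the year effect and its error term is the difference of the two errors.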

Consequently, the model to be estimated via TSLS, employing the general sales tax and the cigarette-specific tax as instruments, is as follows:

\begin{align} \log(Q_{i,1995}^{\text{cigarettes}}) - \log(Q_{i,1985}^{\text{cigarettes}}) &= \beta_0 + \beta_1 \left[ \log(P_{i,1995}^{\text{cigarettes}}) - \log(P_{i,1985}^{\text{cigarettes}}) \right] \\ &\quad + \beta_2 \left[ \log(\text{income}_{i,1995}) - \log(\text{income}_{i,1985}) \right] + u_i \tag{5.9} \end{align}

We first create differences from 1985 to 1995 for the dependent variable, the regressors and both instruments:

# subset data for year 1985
c1985 <- subset(CigarettesSW, year == "1985")

# define differences in variables
packsdiff <- log(c1995$packs) - log(c1985$packs)

pricediff <- log(c1995$price/c1995$cpi) - log(c1985$price/c1985$cpi)

incomediff <- log(c1995$income/c1995$population/c1995$cpi) -
log(c1985$income/c1985$population/c1985$cpi)

salestaxdiff <- (c1995$taxs - c1995$tax)/c1995$cpi - 
                                  (c1985$taxs - c1985$tax)/c1985$cpi

cigtaxdiff <- c1995$tax/c1995$cpi - c1985$tax/c1985$cpi
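The differences above subtract the 1985 subset from the 1995 subset row by row, which presumes that both subsets list the states in the same order. A small sketch of how this could be checked, together with an alternative that matches states explicitly via merge(), is given below. The object names same_order, both and packsdiff_merged are assumptions.

```r
library(AER)
data("CigarettesSW")

c1985 <- subset(CigarettesSW, year == "1985")
c1995 <- subset(CigarettesSW, year == "1995")

# check that row i refers to the same state in both subsets
same_order <- identical(as.character(c1985$state), as.character(c1995$state))
same_order

# alternative: match on state explicitly, so row order does not matter
both <- merge(c1985, c1995, by = "state", suffixes = c("_85", "_95"))
packsdiff_merged <- log(both$packs_95) - log(both$packs_85)
length(packsdiff_merged)
```

Note that merge() sorts its result by state, so packsdiff_merged is ordered by state rather than by the original row order.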

We now estimate three different IV regressions of (5.9) using ivreg():

  1. TSLS using just the difference in the sales taxes between 1985 and 1995 as instrument.

  2. TSLS using just the difference in the cigarette-specific taxes between 1985 and 1995 as instrument.

  3. TSLS using both the difference in the sales taxes and the difference in the cigarette-specific taxes between 1985 and 1995 as instruments.

# estimate the three models
cig_ivreg_diff1 <- ivreg(packsdiff ~ pricediff + incomediff | incomediff + 
                         salestaxdiff)

cig_ivreg_diff2 <- ivreg(packsdiff ~ pricediff + incomediff | incomediff + 
                         cigtaxdiff)

cig_ivreg_diff3 <- ivreg(packsdiff ~ pricediff + incomediff | incomediff + 
                         salestaxdiff + cigtaxdiff)

To obtain robust coefficient summaries for all models we use coeftest() together with vcovHC()

# robust coefficient summary for 1.
coeftest(cig_ivreg_diff1, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.117962   0.068217 -1.7292   0.09062 .  
pricediff   -0.938014   0.207502 -4.5205 4.454e-05 ***
incomediff   0.525970   0.339494  1.5493   0.12832    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# robust coefficient summary for 2.
coeftest(cig_ivreg_diff2, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.017049   0.067217 -0.2536    0.8009    
pricediff   -1.342515   0.228661 -5.8712 4.848e-07 ***
incomediff   0.428146   0.298718  1.4333    0.1587    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# robust coefficient summary for 3.
coeftest(cig_ivreg_diff3, vcov = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.052003   0.062488 -0.8322    0.4097    
pricediff   -1.202403   0.196943 -6.1053 2.178e-07 ***
incomediff   0.462030   0.309341  1.4936    0.1423    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can now present a tabulated summary of the estimation results with stargazer() (Hlavac 2022):

# load stargazer 
library(stargazer)

# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(cig_ivreg_diff1, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff2, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff3, type = "HC1"))))

# generate table
stargazer(cig_ivreg_diff1, cig_ivreg_diff2, cig_ivreg_diff3,
  se = rob_se,
  type="html",
  omit.stat = "f", df=FALSE)

<table style="text-align:center"><tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr>
<tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
<tr><td style="text-align:left"></td><td colspan="3">packsdiff</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">pricediff</td><td>-0.938<sup>***</sup></td><td>-1.343<sup>***</sup></td><td>-1.202<sup>***</sup></td></tr>
<tr><td style="text-align:left"></td><td>(0.208)</td><td>(0.229)</td><td>(0.197)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">incomediff</td><td>0.526</td><td>0.428</td><td>0.462</td></tr>
<tr><td style="text-align:left"></td><td>(0.339)</td><td>(0.299)</td><td>(0.309)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td style="text-align:left">Constant</td><td>-0.118<sup>*</sup></td><td>-0.017</td><td>-0.052</td></tr>
<tr><td style="text-align:left"></td><td>(0.068)</td><td>(0.067)</td><td>(0.062)</td></tr>
<tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>48</td><td>48</td><td>48</td></tr>
<tr><td style="text-align:left">R<sup>2</sup></td><td>0.550</td><td>0.520</td><td>0.547</td></tr>
<tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.530</td><td>0.498</td><td>0.526</td></tr>
<tr><td style="text-align:left">Residual Std. Error</td><td>0.091</td><td>0.094</td><td>0.091</td></tr>
<tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
</table>

In the table we observe different negative estimates for the coefficient on pricediff, all of them highly significant. How should we select the one to trust? This depends on the validity of the instruments employed. It would be useful to check for weak instruments.

4.5.1 Checking for Weak Instruments

Instruments that poorly explain changes in the endogenous regressor X are labeled as weak instruments. These weak instruments can lead to inaccurate estimates of the coefficient on the endogenous regressor.

Let’s simplify this concept by considering a scenario with only one endogenous regressor, X, and m instruments denoted as Z_1, \ldots, Z_m. If, in the population first-stage regression of a TSLS estimation, the coefficients for all instruments are zero, it implies that these instruments fail to explain any variation in X.

While encountering such a situation in practice is unlikely, there is a simple rule of thumb available for the most common situation in practice, the case of a single endogenous regressor.

Rule of Thumb for Checking for Weak Instruments

Compute the F-statistic which corresponds to the hypothesis that the coefficients on Z_1, \ldots, Z_m are all zero in the first-stage regression. If the F-statistic is less than 10, the instruments are weak, in which case the TSLS estimator is biased (also in large samples) and TSLS t-statistics and confidence intervals are unreliable.
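What the rule of thumb guards against can be illustrated with a small Monte Carlo experiment. The design below is an assumption made for illustration: the true coefficient is 1, and pi1 controls how strongly the instrument moves the endogenous regressor. For simplicity the sketch uses the homoskedasticity-only first-stage F-statistic reported by summary() rather than a robust version.

```r
# Monte Carlo sketch (assumed design): weak versus strong instrument
set.seed(5)
sim_tsls <- function(pi1, n = 100, reps = 2000) {
  est <- numeric(reps)
  fstat <- numeric(reps)
  for (r in seq_len(reps)) {
    z <- rnorm(n)
    u <- rnorm(n)
    x <- pi1 * z + 0.8 * u + rnorm(n)   # endogenous regressor
    y <- 1 * x + u                      # true beta_1 = 1
    est[r] <- cov(z, y) / cov(z, x)     # TSLS with a single instrument
    fstat[r] <- summary(lm(x ~ z))$fstatistic[1]  # first-stage F
  }
  c(median_estimate = median(est), median_F = median(fstat))
}

weak <- sim_tsls(pi1 = 0.05)   # instrument barely moves x
strong <- sim_tsls(pi1 = 1)    # instrument moves x strongly
rbind(weak, strong)
```

With the weak instrument the typical first-stage F-statistic stays well below 10 and the median estimate is pulled away from 1, toward the inconsistent OLS value. With the strong instrument the median estimate is close to the true coefficient.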


In R this would be implemented by running the first-stage regression using lm() and computing the heteroskedasticity-robust F-statistic by means of linearHypothesis(). Let’s compute this for all three models:

# first-stage regressions
mod_relevance1 <- lm(pricediff ~ salestaxdiff + incomediff)
mod_relevance2 <- lm(pricediff ~ cigtaxdiff + incomediff)
mod_relevance3 <- lm(pricediff ~ incomediff + salestaxdiff + cigtaxdiff)

# check instrument relevance for model (1)
linearHypothesis(mod_relevance1, 
                 "salestaxdiff = 0", 
                 vcov = vcovHC, type = "HC1")
Linear hypothesis test

Hypothesis:
salestaxdiff = 0

Model 1: restricted model
Model 2: pricediff ~ salestaxdiff + incomediff

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1     46                        
2     45  1 28.445 3.009e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check instrument relevance for model (2)
linearHypothesis(mod_relevance2, 
                 "cigtaxdiff = 0", 
                 vcov = vcovHC, type = "HC1")
Linear hypothesis test

Hypothesis:
cigtaxdiff = 0

Model 1: restricted model
Model 2: pricediff ~ cigtaxdiff + incomediff

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F   Pr(>F)    
1     46                       
2     45  1 98.034 7.09e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check instrument relevance for model (3)
linearHypothesis(mod_relevance3, 
                 c("salestaxdiff = 0", "cigtaxdiff = 0"), 
                 vcov = vcovHC, type = "HC1")
Linear hypothesis test

Hypothesis:
salestaxdiff = 0
cigtaxdiff = 0

Model 1: restricted model
Model 2: pricediff ~ incomediff + salestaxdiff + cigtaxdiff

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1     46                        
2     44  2 76.916 4.339e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

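All three first-stage F-statistics (about 28.4, 98.0 and 76.9) exceed 10, so by the rule of thumb the instruments are not weak in any of the three models. Instrument relevance, however, says nothing about instrument exogeneity.

When coefficients are overidentified (m > k), like in our third model, we can apply the overidentifying restrictions test (also called the J-test), which is an approach to test the hypothesis that additional instruments are exogenous.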

J-Statistic / Overidentifying Restrictions Test

Take \hat u^{TSLS}_i, i = 1, \ldots, n, the residuals from TSLS estimation of the general IV regression model (5.5), and run the OLS regression to estimate the coefficients in

\hat u^{TSLS}_i = \delta_0 + \delta_1 Z_{1i} + \ldots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \ldots + \delta_{m+r} W_{ri} + e_i \tag{5.10}

where e_i is the regression error term. Now test the joint hypothesis

H_0 : \delta_1 = 0, \ldots, \delta_m = 0

which states that all instruments are exogenous. Let F denote the homoskedasticity-only F-statistic for this null hypothesis. The overidentifying restrictions test statistic, also called the J-statistic, is then

J = mF

Under the null hypothesis that all the instruments are exogenous, and if e_i is homoskedastic, in large samples

J \sim \chi^2_{m-k}

where m - k is the degree of overidentification, or in other words, the number of instruments minus the number of endogenous regressors.
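
The recipe above can be sketched in base R on simulated data. This is only an illustration under stated assumptions: the data-generating process, the variable names (z1, z2, w, x, y) and the by-hand TSLS via two lm() calls are hypothetical and are not the chapter's cigarette data or the ivreg() workflow. Both instruments are exogenous by construction here.

```r
set.seed(1)
n <- 500

# simulated data (assumption): two instruments (m = 2), one exogenous control,
# one endogenous regressor (k = 1)
z1 <- rnorm(n)
z2 <- rnorm(n)
w  <- rnorm(n)
v  <- rnorm(n)
u  <- 0.5 * v + rnorm(n)                      # error correlated with x through v
x  <- 1 + 0.8 * z1 + 0.6 * z2 + 0.5 * w + v   # endogenous regressor
y  <- 2 - 1 * x + 0.5 * w + u

# TSLS by hand: first stage, then second stage on the fitted values
x_hat  <- fitted(lm(x ~ z1 + z2 + w))
stage2 <- lm(y ~ x_hat + w)
b      <- unname(coef(stage2))

# TSLS residuals must use the actual x, not the first-stage fitted values
u_hat <- y - (b[1] + b[2] * x + b[3] * w)

# auxiliary regression of the residuals on all instruments and exogenous regressors
aux_unrestricted <- lm(u_hat ~ z1 + z2 + w)
aux_restricted   <- lm(u_hat ~ w)

# homoskedasticity-only F-statistic for H0: coefficients on z1 and z2 are zero
F_stat <- anova(aux_restricted, aux_unrestricted)$F[2]

m <- 2
k <- 1
J <- m * F_stat

# p-value from the chi-squared distribution with m - k degrees of freedom
p_value <- pchisq(J, df = m - k, lower.tail = FALSE)
c(J = J, p_value = p_value)
```

The point of writing it out this way is that the degrees of freedom enter only in the last step, so using m - k instead of m is an explicit choice rather than a default.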


To conduct the overidentifying restrictions test for model three, which is the only model where the coefficient on the difference in log prices is overidentified (m=2, k=1), allowing computation of the J-statistic, we proceed as follows:

  1. We use the residuals stored in cig_ivreg_diff3 and regress them on both instruments and the presumably exogenous regressor incomediff.

  2. Once more, we employ linearHypothesis() to examine whether the coefficients on both instruments are zero, a prerequisite for fulfilling the exogeneity assumption. It’s important to note that we specify test = "Chisq" to obtain a chi-squared distributed test statistic instead of an F-statistic.

# compute the J-statistic
cig_iv_OR <- lm(residuals(cig_ivreg_diff3) ~ incomediff + salestaxdiff + cigtaxdiff)

cig_OR_test <- linearHypothesis(cig_iv_OR, 
                               c("salestaxdiff = 0", "cigtaxdiff = 0"), 
                               test = "Chisq")
cig_OR_test
Linear hypothesis test

Hypothesis:
salestaxdiff = 0
cigtaxdiff = 0

Model 1: restricted model
Model 2: residuals(cig_ivreg_diff3) ~ incomediff + salestaxdiff + cigtaxdiff

  Res.Df     RSS Df Sum of Sq Chisq Pr(>Chisq)  
1     46 0.37472                                
2     44 0.33695  2  0.037769 4.932    0.08492 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Caution! The p-value reported by linearHypothesis() is misleading here, because it is computed with 2 degrees of freedom, the number of restrictions tested. The degree of overidentification is m-k=2-1=1, so the J-statistic follows a \chi^2_1 distribution rather than the \chi^2_2 distribution assumed by linearHypothesis().

We can easily compute the correct p-value using pchisq().

# compute correct p-value for J-statistic
pchisq(cig_OR_test[2, 5], df = 1, lower.tail = FALSE)
[1] 0.02636406
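
To see how much the choice of degrees of freedom matters, the following sketch evaluates the rounded statistic 4.932 from the output above under both distributions and compares it with the corresponding 5% critical values. The variable names are hypothetical, and because the statistic is rounded the p-values agree with the reported ones only up to rounding.

```r
# J-statistic as printed in the linearHypothesis() output (rounded)
J <- 4.932

# p-value under the chi-squared(2) default used by linearHypothesis()
p_df2 <- pchisq(J, df = 2, lower.tail = FALSE)

# p-value under the correct chi-squared(1) distribution (m - k = 1)
p_df1 <- pchisq(J, df = 1, lower.tail = FALSE)

# 5% critical values for both distributions
crit_df1 <- qchisq(0.95, df = 1)
crit_df2 <- qchisq(0.95, df = 2)

round(c(p_df2 = p_df2, p_df1 = p_df1, crit_df1 = crit_df1, crit_df2 = crit_df2), 4)
```

The statistic lies above the \chi^2_1 critical value but below the \chi^2_2 one, which is why the test decision at the 5% level depends on using the correct degrees of freedom.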

Since the reported value is smaller than 0.05, we reject the null hypothesis that both instruments are exogenous at the 5% level. From this we can deduce that one of the following statements is true:

  1. The general sales tax is an invalid instrument for the per-pack price of cigarettes.

  2. The cigarette-specific tax is an invalid instrument for the per-pack price of cigarettes.

  3. Both instruments are invalid.

Stock and Watson (2020) argue that the case for the exogeneity of the general sales tax is stronger than that for the cigarette-specific tax, since the political process can link changes in the cigarette-specific tax to changes in the cigarette market and smoking policy.

Taking this into consideration, the most trustworthy IV estimate of the long-run elasticity of demand for cigarettes is -0.94, the TSLS estimate obtained using the general sales tax as the only instrument.

4.6 Summary

The instrumental variable selected for our model is the general sales tax. The IV regression model that uses this instrument is

\begin{align} \log(Q_{i,1995}^{\text{cigarettes}}) - \log(Q_{i,1985}^{\text{cigarettes}}) &= -0.118 -0.938 \left[ \log(P_{i,1995}^{\text{cigarettes}}) - \log(P_{i,1985}^{\text{cigarettes}}) \right] \\ &\quad + 0.526 \left[ \log(\text{income}_{i,1995}) - \log(\text{income}_{i,1985}) \right] + u_i \tag{5.9} \end{align}

This estimate indicates that cigarette consumption is quite responsive to price in the long run, with an elasticity close to one in absolute value: over a 10-year period, a 1% increase in the average price per pack is expected to reduce consumption by about 0.94% on average. Over the long term, increases in the price per pack can therefore substantially decrease cigarette consumption.
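
As a rough illustration of what the estimated elasticity implies, the sketch below compares the usual first-order reading (percentage change in quantity equals the elasticity times the percentage change in price) with the exact change implied by the log-log form, holding income fixed. The price increases of 1% and 10% and all variable names are hypothetical choices for illustration.

```r
# estimated long-run price elasticity from the IV regression (sales tax as instrument)
elasticity <- -0.938

# hypothetical price increases of 1% and 10%
price_increase <- c(0.01, 0.10)

# first-order approximation: relative change in quantity = elasticity * relative change in price
approx_change <- elasticity * price_increase

# exact relative change implied by the constant-elasticity (log-log) specification
exact_change <- (1 + price_increase)^elasticity - 1

# report everything in percent
round(100 * cbind(price_increase, approx_change, exact_change), 2)
```

For small price changes the two readings nearly coincide, while for a 10% increase the exact reduction is somewhat smaller than the approximation suggests (roughly 8.6% rather than 9.4%).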

We have seen how straightforward it is to estimate IV regression models in R with the ivreg() function from the package AER, which greatly simplifies the implementation of the TSLS estimation approach.

Besides IV estimation itself, we have discussed why it is important to test for weak instruments and how to conduct the corresponding tests, including the overidentifying restrictions test for instrument exogeneity when there are more instruments than endogenous regressors.

Furthermore, we have carried out a long-run analysis of the demand for cigarettes and its price elasticity, and drawn a conclusion after selecting the most trustworthy instrumental variable.


4.7 References

Hlavac, Marek. 2022. Stargazer: Well-Formatted Regression and Summary Statistics Tables. Bratislava, Slovakia: Social Policy Institute. https://CRAN.R-project.org/package=stargazer.

Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer-Verlag. https://CRAN.R-project.org/package=AER.

Stock, James H., and Mark W. Watson. 2020. Introduction to Econometrics, Fourth Update, Global Edition. Pearson Education Limited.