9  Machine Learning Methods


OLS regression requires sufficient degrees of freedom (N-K)

Asymptotic theory assumes N - K \rightarrow \infty

Standard asymptotic results are invalid if K/N \rightarrow \kappa > 0

OLS estimation typically has little or no bias but a large variance

Performance is measured by the “risk”, typically mean-squared error:

E(b - \beta)^2 = \underbrace{ E\left\{[b - E(b)]^2 \right\}}_{\text{variance}} + \underbrace{[E(b) - \beta]^2}_{\text{bias}^2}

The variance represents the unsystematic error and the bias the systematic error. If the parameters are estimated only once, the distinction becomes irrelevant.
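A minimal Monte Carlo sketch of this decomposition, using a deliberately biased shrinkage estimator of a mean; the shrinkage factor, sample size, and number of replications are illustrative choices, not taken from the notes:

```python
# Verify numerically that MSE = variance + bias^2 for a shrinkage estimator.
import numpy as np

rng = np.random.default_rng(0)
beta, n, reps = 1.0, 50, 20_000            # illustrative values

# a deliberately biased shrinkage estimator of the mean: b = 0.8 * ybar
draws = rng.normal(beta, 1.0, size=(reps, n))
b = 0.8 * draws.mean(axis=1)

mse = np.mean((b - beta) ** 2)
variance = np.var(b)
bias_sq = (np.mean(b) - beta) ** 2
print(mse, variance + bias_sq)             # agree up to simulation noise
```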



Ridge estimation

Introducing an L_2 penalty: S^R_\lambda = (y - X\beta)^\prime(y - X\beta) + \lambda \|\beta\|^2 where \|\beta\| = \sqrt{\sum_{j=1}^{K} \beta_j^2} denotes the Euclidean (L_2) norm

minimizing S^R_\lambda yields the Ridge estimator

\widehat\beta_R = \left(X'X + \lambda I_K\right)^{-1}X'y

If K > N then X'X is singular but X'X + \lambda I_K is not!

Since X'X + \lambda I_K is “larger” than X'X (in a matrix sense) the coefficients in \widehat\beta_R “shrink towards zero” as \lambda gets large

Choice of \lambda is typically data driven (see below)
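A minimal sketch of the ridge formula above in NumPy on simulated data; the data-generating process and the two values of \lambda are illustrative assumptions:

```python
# Ridge estimator: beta_R = (X'X + lambda * I_K)^{-1} X'y
import numpy as np

rng = np.random.default_rng(1)
N, K, lam = 100, 10, 5.0                   # illustrative sizes and penalty
X = rng.normal(size=(N, K))
beta = np.zeros(K); beta[:3] = 1.0
y = X @ beta + rng.normal(size=N)

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# coefficients shrink towards zero as lambda grows
beta_ridge_big = np.linalg.solve(X.T @ X + 1e4 * np.eye(K), X.T @ y)
print(np.abs(beta_ridge).sum(), np.abs(beta_ridge_big).sum())
```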


9.1 Lasso Regression

“Sparse regression”: Many of the coefficients are actually zero

Define the L_r norm as

\|\beta\|_r = \left( \sum_{j=1}^{K} |\beta_j|^r \right)^{1/r}

such that

S^L_\lambda = (y - X\beta)^\prime(y - X\beta) + \lambda \|\beta\|_1

L_1 penalty corresponds to the constraint:

\sum_{j=1}^{K} |\beta_j| \leq \tau

Solution of the minimization problem by means of quadratic programming

The solution typically involves zero coefficients
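A short sketch illustrating this sparsity with scikit-learn's Lasso; the sparse data-generating process and penalty value are illustrative assumptions (scikit-learn's alpha corresponds to the penalty up to a rescaling of the objective):

```python
# The L1 penalty sets many coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, K = 100, 20
X = rng.normal(size=(N, K))
beta = np.zeros(K); beta[:3] = 2.0          # sparse truth
y = X @ beta + rng.normal(size=N)

fit = Lasso(alpha=0.5).fit(X, y)            # alpha plays the role of lambda
print(np.sum(fit.coef_ != 0), "nonzero coefficients out of", K)
```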


Some more details

The regressors and the dependent variable are typically standardized:

\tilde{x}_{j,i} = (x_{j,i} - \bar{x}_j) / s_j where \bar{x}_j and s_j are the mean and standard deviation of x_j

Relationship to the pre-test estimator: For a simple regression, the first-order condition is given by:

\begin{align*} \frac{2}{N} \left( \sum_{i=1}^{N} \tilde{x}_i \tilde{y}_i - \sum_{i=1}^{N} \tilde{x}_i^2 \, \beta \right) \pm \lambda &\stackrel{!}{=} 0 \\ b - \widehat\beta^L_\lambda \pm \frac{N}{2} \lambda &\stackrel{!}{=} 0 \end{align*}

and therefore:

\widehat\beta^L_\lambda = \begin{cases} b + \lambda^* & \text{if } b < -\lambda^* \\ 0 & \text{if } -\lambda^* \leq b \leq \lambda^* \\ b - \lambda^* & \text{if } b > \lambda^* \end{cases} where \lambda^* = \lambda \cdot N/2
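A minimal sketch of this soft-thresholding rule; the function name and the test values are illustrative:

```python
# Soft thresholding: Lasso estimate in the simple standardized regression,
# given the OLS estimate b and the threshold lambda* = lambda * N / 2.
import numpy as np

def soft_threshold(b, lam_star):
    if b > lam_star:
        return b - lam_star
    if b < -lam_star:
        return b + lam_star
    return 0.0

# equivalently: np.sign(b) * max(abs(b) - lam_star, 0)
print(soft_threshold(1.2, 0.5), soft_threshold(-0.3, 0.5))   # 0.7, 0.0
```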


Selecting the shrinkage parameter

Trade-off between bias (large \lambda) and variance (small \lambda).

Choose \lambda that minimizes the MSE = Bias^2 + Var

leave-one-out cross validation:

Drop one observation, forecast it based on the remaining N-1 observations, and repeat for each observation

k-fold cross validation:

Randomly divide the set of observations into k groups (folds) of approximately equal size. Each fold is treated in turn as a validation set, while the remaining k-1 folds are employed for parameter estimation

evaluate the loss (MSE) for each observation conditional on \lambda and compute the average loss as a function of \lambda

Minimize the loss (MSE) with respect to \lambda

k is typically between 5 and 10
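A sketch of k-fold cross-validation over a grid of penalty values, assuming scikit-learn's Lasso and KFold; the grid and the simulated data are illustrative choices:

```python
# 5-fold cross-validation: average validation MSE as a function of lambda.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
N, K = 200, 20
X = rng.normal(size=(N, K))
y = X[:, 0] * 2.0 + rng.normal(size=N)

lambdas = np.logspace(-3, 1, 30)
cv_mse = []
for lam in lambdas:
    fold_mse = []
    for train, val in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        fit = Lasso(alpha=lam, max_iter=10_000).fit(X[train], y[train])
        fold_mse.append(np.mean((y[val] - fit.predict(X[val])) ** 2))
    cv_mse.append(np.mean(fold_mse))         # average loss given lambda

best_lambda = lambdas[int(np.argmin(cv_mse))]
print(best_lambda)
```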



Refinements

post-Lasso estimation: Re-estimate the parameters by OLS, leaving out the regressors whose coefficients were set to zero by the LASSO

oracle property: the asymptotic distribution of the estimator is the same as if we knew which coefficients are equal to zero

Original LASSO does not exhibit the oracle property

Adaptive LASSO: with weighted penalty \sum_{j=1}^{K} \widehat w_j |\beta_j| and \widehat w_j = 1/|b_j|^{\nu} \quad \text{for some } \nu > 0, where b_j denotes the OLS estimate

if K > N replace b_j by the simple regression coefficient

Adaptive LASSO possesses the oracle property

Elastic net: hybrid method LASSO/Ridge

S^{EN}_{\lambda_1,\lambda_2} = (y - X\beta)^\prime \,(y - X\beta) + \lambda_1 \, \|\beta\|_1 \,+ \,\lambda_2 \,\|\beta\|_2^2
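A sketch of the elastic net using scikit-learn's ElasticNet, which parameterizes the two penalties through alpha and l1_ratio rather than (\lambda_1, \lambda_2) directly; the data and tuning values are illustrative assumptions:

```python
# Elastic net: hybrid of the L1 and squared L2 penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)

# l1_ratio = 1 is the pure LASSO, l1_ratio = 0 is (essentially) ridge
fit = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)
print(np.sum(fit.coef_ != 0), "nonzero coefficients")
```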


9.2 Dimension reduction techniques

if k is large, it is desirable to reduce the dimensionality by using linear combinations

Z_m = \sum_{j=1}^{k} \phi_{jm}X_j \quad \quad m = 1, \ldots, M

where \phi_{1m}, \ldots ,\phi_{km} are unknown constants and M \ll k

Using these linear combinations (“common factors”, principal components) the regression becomes

y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i

Inserting these linear combinations shows that the elements of \beta fulfill the restriction

\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}

choose \phi_{1m}, \ldots ,\phi_{km} such that the “loss of information” is minimal


Principal component regression

Choose the linear combinations Z_m such that they explain as much as possible of the variation in the X_j:

X_j = \sum_{m=1}^{M} \alpha_{jm} Z_m + v_j such that the variance of v_j is minimized:

S^2(\alpha, \phi) = \sum_{j=1}^{k} \sum_{i=1}^{N} v_{ji}^2

The linear combinations are obtained from the eigenvectors of the sample covariance matrix of X

The number of linear combinations, M, can be determined by considering the ordered eigenvalues

Another approach is the method of Partial Least-Squares, where the linear combinations are found sequentially by considering the covariance with the dependent variable
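A sketch of principal component regression along these lines: the weights \phi are the leading eigenvectors of the sample covariance matrix, y is regressed on the components, and \beta is recovered from \beta_j = \sum_m \theta_m \phi_{jm}; the value of M and the simulated data are illustrative choices:

```python
# Principal component regression from the eigendecomposition of cov(X).
import numpy as np

rng = np.random.default_rng(5)
N, K, M = 200, 10, 3
X = rng.normal(size=(N, K))
y = X @ np.r_[np.ones(3), np.zeros(K - 3)] + rng.normal(size=N)

Xc = X - X.mean(axis=0)                       # demeaned regressors
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigval)[::-1]              # ordered eigenvalues, largest first
phi = eigvec[:, order[:M]]                    # K x M matrix of weights

Z = Xc @ phi                                  # N x M matrix of components
theta, *_ = np.linalg.lstsq(np.c_[np.ones(N), Z], y, rcond=None)
beta_pcr = phi @ theta[1:]                    # beta_j = sum_m theta_m phi_jm
print(beta_pcr.round(2))
```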


Computational Details

Let X denote an N \times K matrix of regressors. Then

X = ZA' + V, where Z is N \times M with M < K. Inserting this into the model for y yields

\begin{align*} y &= Z\underbrace{A'\beta}_{\theta} + \underbrace{u + V\beta}_{\epsilon} \\ &= Z\theta + \epsilon \end{align*}

the estimates can be obtained from the “singular value decomposition”:

\begin{align*} X &= UDV' \\ \text{where} \quad U'U &= I_N \quad \text{and} \quad V'V = I_K \end{align*}

and D is an N \times K diagonal matrix containing the K singular values

The dimension reduction is obtained by dropping the last N - M columns of U and the last K - M columns of V.

The matrix Z (loading matrix) is obtained as U_M and F = D_M V'_M is the matrix of factors
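A sketch of this SVD route in NumPy, following the notation above (Z = U_M and F = D_M V'_M); the simulated X and the choice of M are illustrative assumptions:

```python
# Dimension reduction via the singular value decomposition X = U D V'.
import numpy as np

rng = np.random.default_rng(6)
N, K, M = 200, 10, 3
X = rng.normal(size=(N, K))

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V'
Z = U[:, :M]                                       # keep the first M columns
F = np.diag(d[:M]) @ Vt[:M, :]                     # M x K matrix of factors

print(np.linalg.norm(X - Z @ F))   # approximation error from dropping K - M terms
```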


9.3 Regression trees / Random forest

The regression function is approximated by a piecewise constant function:

f(x) = \sum_{j=1}^{J} \mathbf{1}(x \in R_j)c_j

where R_j = [s_{j-1},s_j) is a region (interval) of the predictor space

Finding the best split point s for some splitting variable x_j is simple if there are only two regions (J = 2):

\min_s \left[ \sum_{i \in R_1} (y_i - \widehat c_1)^2 + \sum_{i \in R_2} (y_i - \widehat c_2)^2 \right] where \widehat c_j is the sample mean of the y_i in region R_j

We can then proceed by searching for the next optimal split within the regions R_1 and R_2, and so on

How far do we need to grow this tree? Typically we stop when a node contains only a few observations (node size < 5, say)

Pruning the tree: Keep only the branches that result in a sufficient (involving some parameter \alpha) reduction of the objective function. \alpha is chosen by cross validation.
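A minimal sketch of the single best-split search (J = 2) described above; the step-function data-generating process is an illustrative assumption:

```python
# For each candidate split point s, fit the two region means and pick the s
# minimizing the total sum of squared errors.
import numpy as np

def best_split(x, y):
    best_s, best_sse = None, np.inf
    for s in np.unique(x)[1:]:                 # candidate split points
        left, right = y[x < s], y[x >= s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_s, best_sse = s, sse
    return best_s, best_sse

rng = np.random.default_rng(7)
x = rng.uniform(size=200)
y = np.where(x < 0.4, 1.0, 3.0) + 0.1 * rng.normal(size=200)
print(best_split(x, y))   # recovers a split close to 0.4
```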



Bagging: Full sample Z = \{ (x'_1, y_1), (x'_2, y_2), \ldots, (x'_N, y_N) \}

A bootstrap sample is obtained by drawing randomly with replacement from the set \{1, 2, \ldots, N\}, giving Z^*_b = \{ (x_1^{*\prime}, y_1^*), (x_2^{*\prime}, y_2^*), \ldots, (x_N^{*\prime}, y_N^*) \}. Let \widehat f_b^*(x) denote the prediction from the regression fitted to Z^*_b. The aggregated prediction is obtained as

\hat f_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \widehat f^*_{b}(x)

Bootstrap aggregation stabilizes the unstable outcome of a regression tree

Random forests add one modification: when growing a tree on a bootstrapped dataset, before each split select m < p of the input variables at random as candidates for splitting, e.g. m = \text{int}(\sqrt{p}).

This reduces the correlation among the trees grown on different bootstrap samples
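A sketch of bagging with this random-forest modification via scikit-learn's RandomForestRegressor, where max_features="sqrt" corresponds to m = int(\sqrt p); the data and the remaining settings are illustrative assumptions:

```python
# Random forest: B bootstrap trees, m = sqrt(p) candidate variables per split,
# predictions averaged over the trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
N, p = 500, 10
X = rng.normal(size=(N, p))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=N)

forest = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                               min_samples_leaf=5, random_state=0).fit(X, y)
print(forest.predict(X[:3]))   # aggregated (averaged) predictions
```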