MLR Review

Posted by Shsun on January 31, 2020

Multiple linear regression (MLR) is a statistical technique that uses several independent variables (IVs) to predict a dependent variable (DV). Multiple regression is the extension of ordinary least-squares (OLS) regression to more than one explanatory variable.

Overfitting and Multicollinearity

Some pre-work is needed to prevent overfitting and multicollinearity. Adding more IVs can explain more variation, but it can also lead to overfitting. Not only are the independent variables potentially related to the dependent variable (DV), they are also potentially related to each other, which is called multicollinearity. The ideal is for all of the independent variables to be correlated with the dependent variable but NOT with each other.
First, we should check the relationship between each IV and the DV using scatterplots and correlations. Then check the relationships among the IVs using scatterplots and correlations. The non-redundant IVs can be used to find the best-fitting model.
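
As a minimal sketch of these checks, assuming the DV and all candidate IVs sit in a numeric data frame called df (a hypothetical name), base R's pairs() and cor() cover both steps:

    # Scatterplot matrix: every variable plotted against every other variable
    pairs(df)

    # Correlation matrix: ideally the IVs correlate with the DV
    # but not strongly with each other
    round(cor(df), 2)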

Multiple Linear Regression

Multiple regression model: $y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{p}x_{p} + \varepsilon$

Multiple regression equation: $E(y) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{p}x_{p}$

Estimated multiple regression equation: $\hat{y} = b_{0} + b_{1}x_{1} + b_{2}x_{2} + \dots + b_{p}x_{p}$

Each coefficient is the estimated change in y corresponding to a one-unit change in its variable, while all other variables are held constant.
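
For instance, with a hypothetical fitted model containing two IVs, coef() returns the estimates b0, b1, and b2 described above:

    # Hypothetical data frame df with outcome Y and predictors X1, X2
    fit <- lm(Y ~ X1 + X2, data = df)
    coef(fit)
    # coef(fit)["X1"] is the estimated change in Y for a one-unit increase in X1,
    # holding X2 constant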

Model Selection

There are two types of MLR model selection: forward selection and backward selection. Forward selection starts with the linear regression model containing the most significant IV. Then we add the remaining IVs one at a time; the process stops when no significant additions are possible. Backward selection starts with a full model that includes all IVs. We remove the least significant IV one at a time and stop when all p-values are significant.

    # Fit the full model with all candidate IVs and inspect the coefficient summary
    summary(lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8))
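
Both procedures can also be automated with base R's step(), which selects by AIC rather than the p-value rule described above; a sketch, assuming the same hypothetical variables in a data frame df:

    full_model <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = df)
    null_model <- lm(Y ~ 1, data = df)

    # Backward selection: start from the full model and drop terms
    step(full_model, direction = "backward")

    # Forward selection: start from the intercept-only model and add terms
    step(null_model, scope = formula(full_model), direction = "forward")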

We can compute the VIF (variance inflation factor) with the vif() function in the car package to check for multicollinearity.
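
A sketch of this check, assuming a hypothetical fitted model fit:

    library(car)

    fit <- lm(Y ~ X1 + X2 + X3, data = df)  # hypothetical model
    vif(fit)  # values above roughly 5-10 are a common rule of thumb for problematic multicollinearity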

The coefficient of determination (R-squared) is a statistical metric that measures how much of the variation in the outcome can be explained by the variation in the independent variables. R-squared always increases as more predictors are added to the MLR model, even if those predictors are not related to the outcome variable.
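
This can be seen by comparing nested models; a sketch with hypothetical variables, where X3 may be unrelated to Y:

    fit_small <- lm(Y ~ X1 + X2, data = df)
    fit_large <- lm(Y ~ X1 + X2 + X3, data = df)

    summary(fit_small)$r.squared
    summary(fit_large)$r.squared      # never smaller than the value above
    summary(fit_large)$adj.r.squared  # adjusted R-squared penalizes the extra predictor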

Relative Importance of Variables

Relative weight analysis is implemented in R by the rwa package, which provides an rwa() function to calculate the relative importance of predictors (see reference 2 below). Another method is Dominance Analysis (DA).
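
A minimal sketch of the rwa() call, assuming it takes a data frame, the outcome column name, and a vector of predictor names (hypothetical variables Y and X1-X3):

    library(rwa)

    # Relative weights of X1-X3 for predicting Y
    rwa(df, outcome = "Y", predictors = c("X1", "X2", "X3"))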

DA defines importance in a unique way by comparing pairs of predictors (from a selected model) across all subset models to determine relative importance or dominance (Azen et al., 2009). In ordinary least squares regression with p predictors and a continuous outcome, DA defines the additional contribution of a given predictor to a given subset model as the change in R-squared when the predictor is added to that model. DA considers one predictor to completely dominate another if its additional contribution to every possible subset model is greater than that of the other predictor.

We can use the dominanceanalysis package in R. DA relies on the R-squared values of all possible combinations of predictors and measures relative importance through pairwise comparisons of all predictors in the model as they relate to the outcome variable. DA is used to determine whether a predictor is more important than the other predictors. There are three types of dominance criteria: complete, conditional, and general. These three criteria work in a hierarchical way: complete dominance implies conditional dominance, and conditional dominance implies general dominance. It is recommended to use complete dominance. To evaluate the robustness of DA results, we can use a bootstrap analysis with bootDominanceAnalysis().
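
A sketch of how this might look with the dominanceanalysis package, assuming a hypothetical full model with three predictors (the number of bootstrap replications, R = 1000, is an arbitrary choice):

    library(dominanceanalysis)

    fit <- lm(Y ~ X1 + X2 + X3, data = df)  # hypothetical full model
    da  <- dominanceAnalysis(fit)
    summary(da)  # complete, conditional, and general dominance results

    # Bootstrap analysis to evaluate the robustness of the dominance results
    boot_da <- bootDominanceAnalysis(fit, R = 1000)
    summary(boot_da)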

References
  1. Multiple Linear Regression, the very basics
  2. RWA Web: A Free, Comprehensive, Web-Based, and User-Friendly Tool for Relative Weight Analyses
  3. MLR definition
  4. A Dominance Analysis Approach to Determining Predictor Importance in Third, Seventh, and Tenth Grade Reading Comprehension Skills
  5. Using Dominance Analysis to Determine Predictor Importance in Logistic Regression