Currently I am analysing meteorological impacts on stomatal conductance, as well as how these interactions affect surface ozone concentrations. Previously I listed many variables and tried to find patterns, using MERRA2 meteorology and surface ozone concentrations to analyse the impacts of meteorology on ozone in the US. Given MERRA2 meteorology, SPEI data, and surface ozone concentrations, use the following steps to practice basic data analysis in R:
- Calculate the moving average of meteorological variables and get the deseasonalized daily values.
- Do regression between ozone and meteorology.
- Calculate PCA for each grid
- Plot maps of the PCs with ggplot
Linear regression
Pre-work before regression or correlation is important. The first two steps are calculating moving averages (MA) and correlations. A moving average smooths out the noise of random outliers and emphasizes long-term trends. In this case, we subtract the MA from the daily values to remove the long-term trend and focus on deseasonalized day-to-day variability. Normalizing the variables is also necessary, because the original variables may have very different scales: performing PCA on un-normalized variables leads to disproportionately large loadings for the variables with high variance, which in turn makes the leading PCs depend mostly on those variables. All of this is easy to do in R.
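As a minimal sketch, assuming a data frame `met` with one row per day and a temperature column `T2M` (both names hypothetical, as is the 30-day window), the moving average, the deseasonalized anomaly, and the normalization could look like:

```r
# A minimal sketch; `met`, `T2M`, and the 30-day window are assumptions
k  = 30                                               # window length in days
ma = stats::filter(met$T2M, rep(1/k, k), sides = 2)   # centered moving average
t2m.deseas = met$T2M - ma                             # subtract MA to remove the long-term trend
t2m.norm   = scale(t2m.deseas)                        # normalize to zero mean, unit variance
```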
Then we use the deseasonalized daily values to calculate correlations. Now we can use `lm` or `glm` for regression analysis. `lm` fits models of the form Y = Xb + e, while `glm` fits models of the form f(Y) = Xb + e. `glm` generalizes the linear model to the exponential family of distributions, whereas `lm` assumes the data follow a specific distribution: the Normal (Gaussian). `lm.beta` adds standardized regression coefficients to objects created by `lm`.
```r
library(lm.beta)  # for standardized regression coefficients

# `df` is the assumed data frame holding the deseasonalized daily values
reg = lm(O3 ~ T2M + QV2M + PRECTOT + GWETROOT + SLP + WIND + SWGDN + CLDTOT,
         data = df, na.action = na.exclude)
reg.st = lm.beta(reg)
```
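For comparison, the same model can be fitted with `glm()`: with the default Gaussian family (identity link) it reproduces the `lm()` fit, while other families generalize it. A sketch, reusing the assumed data frame `df`:

```r
# Equivalent to the lm() fit above when the family is gaussian with identity link
reg.glm = glm(O3 ~ T2M + QV2M + PRECTOT + GWETROOT + SLP + WIND + SWGDN + CLDTOT,
              data = df, family = gaussian(), na.action = na.exclude)
```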
Model selection
Forward selection
First we start with a linear regression model `lm` containing only the most significant independent variable. Then we extend the model with the remaining independent variables, one at a time. We stop when no remaining variable yields a significant improvement.
```r
summary(lm(O3 ~ T2M + WIND, data = df))
summary(lm(O3 ~ T2M + WIND + QV2M, data = df))  # add a variable and check its p-value
```
Backward selection
Start with the full model and remove the most insignificant independent variable, one at a time. We stop when all remaining p-values are significant.
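Backward elimination can also be automated with base R's `step()`, which drops terms as long as the AIC improves (note it ranks models by AIC rather than p-values). A sketch, refitting the full model on complete cases so every candidate model sees the same rows:

```r
# step() requires a constant set of rows, so drop NAs before fitting
reg.full = lm(O3 ~ T2M + QV2M + PRECTOT + GWETROOT + SLP + WIND + SWGDN + CLDTOT,
              data = na.omit(df))
reg.back = step(reg.full, direction = "backward")  # drop terms while AIC improves
summary(reg.back)
```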
PCA analysis
The fundamentals of PCA have been discussed over and over again. The aim of PCA is to create a new set of uncorrelated variables that are linear combinations of the initial variables and explain as much of the initial variation as possible. When analysing meteorological variables, we often use PCA to extract uncorrelated principal components, since the original meteorological factors are usually correlated with each other. Basically, PC1 is a linear combination of the original predictor variables that captures the maximum variance in the dataset; it defines the direction of highest variability in the data. PC2 is also a linear combination of the original predictors; it captures the largest share of the remaining variance and is uncorrelated with PC1.
There are two general methods to perform PCA in R:

- `princomp()`: the spectral decomposition approach, which examines the covariances/correlations between variables
- `prcomp()` and `PCA()` (the latter from the FactoMineR package): the singular value decomposition (SVD) approach

The simplified format of these two functions is:
```r
prcomp(x, scale = FALSE)
# x:     a numeric matrix or data frame
# scale: logical; whether the variables should be scaled to have
#        unit variance before the analysis takes place

princomp(x, cor = FALSE, scores = TRUE)
# cor:    logical; if TRUE, the data will be centered and scaled before analysis
# scores: logical; if TRUE, the coordinates on each principal component are calculated
```
The output elements returned by the functions `prcomp()` and `princomp()`:

| `prcomp()` name | `princomp()` name | Description |
|---|---|---|
| `sdev` | `sdev` | the standard deviations of the principal components |
| `rotation` | `loadings` | the matrix of variable loadings (columns are eigenvectors) |
| `center` | `center` | the variable means (subtracted before analysis) |
| `scale` | `scale` | the variable standard deviations (the scaling applied to each variable) |
| `x` | `scores` | the coordinates of the individuals (observations) on the principal components |
Here we focus on `prcomp()`:

```r
# X.data is the assumed data frame of deseasonalized meteorological variables
pca = prcomp(~ ., data = X.data, scale = TRUE, na.action = na.exclude)
PC.sd   = pca$sdev       # standard deviations of the PCs
PC.rot  = pca$rotation   # variable loadings (eigenvectors)
PC.data = pca$x          # PC scores of each observation
pca.var = pca$sdev^2     # variance explained by each PC
pca.var.per = round(pca.var / sum(pca.var) * 100, 1)  # percentage of total variance
```
By default, `prcomp()` expects samples (observations) as rows and variables as columns. It returns `x`, `sdev`, and `rotation`. `x` contains the PC scores for drawing a graph: `plot(pca$x[,1], pca$x[,2])` plots the observations on the first two PCs, and the x and y axes can be labelled with the percentage of the variation in the original data that PC1 and PC2 account for. `pca$rotation` records the loading scores for each PC; loading scores can be used to determine which variables have the largest effects, and `pca$rotation[,1]` gives the loading scores for PC1. One disadvantage of PCA is interpreting the PCs. When PC1 is highly correlated with all variables, we can say it summarizes the data well, and instead of using the individual meteorological variables to make predictions, we can just use PC1. If two meteorological variables are correlated with each other, one new variable (i.e., PC1) may summarize both.
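For instance, the percentages stored in `pca.var.per` above can be used to label the axes of the PC1 vs PC2 scatter plot:

```r
# Scatter plot of the first two PCs, axes labelled with percent variance explained
plot(pca$x[, 1], pca$x[, 2],
     xlab = paste0("PC1 (", pca.var.per[1], "%)"),
     ylab = paste0("PC2 (", pca.var.per[2], "%)"))
```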
PCR analysis
We also want to know whether the new PCs can predict the phenomenon well after the PCA. We use `lm()` to see whether PC1–PC4 can predict `Y.n` (i.e., ozone) well.
```r
# PC1-PC4 are the first four columns of pca$x; Y.n is the assumed
# deseasonalized ozone series
PC.df  = data.frame(Y.n = Y.n, pca$x[, 1:4])
PC.reg = lm(Y.n ~ -1 + PC1 + PC2 + PC3 + PC4, data = PC.df, na.action = na.exclude)
```
To see how well this model predicts ozone, we can use the `fitted()` function to extract the predictions:

```r
fitted(PC.reg)
```
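One quick check, as a sketch, is to compare the fitted values against the observed ozone via their correlation and a scatter plot:

```r
pred = fitted(PC.reg)                  # padded with NA where input rows had NA
cor(pred, Y.n, use = "complete.obs")   # correlation between predicted and observed ozone
plot(Y.n, pred, xlab = "Observed O3", ylab = "Predicted O3")
abline(0, 1, lty = 2)                  # 1:1 reference line
```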
Factor analysis
The intuition of PCA is to find new combinations of variables that capture the largest variance, while FA looks for hidden (latent) variables that drive the observed variables, based on their correlations. FA in R is easy to do with the function `factanal()`, which fits a common factor model by the method of maximum likelihood. Note that the function can analyze either raw data or a correlation or covariance matrix. To begin with, here I use SynFlux data combined with FLUXNET meteorology and SPEI data in a two-factor model:
```r
ml = df[c('liq_precip','wind','vpd','qv2m','t2m','swgdn','gwetroot','cldtot','spei')]
factanal(ml, factors = 2)
```
Then we get output like this:
```
Call:
factanal(x = ml, factors = 2)
Uniquenesses:
liq_precip wind vpd qv2m t2m swgdn gwetroot cldtot spei
0.983 0.999 0.005 0.005 0.005 0.790 0.876 0.723 1.000
Loadings:
Factor1 Factor2
liq_precip 0.125
wind
vpd 0.998
qv2m 0.993
t2m 0.799 0.598
swgdn 0.136 0.437
gwetroot -0.182 -0.301
cldtot -0.377 -0.367
spei
Factor1 Factor2
SS loadings 1.834 1.782
Proportion Var 0.204 0.198
Cumulative Var 0.204 0.402
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 2236.75 on 19 degrees of freedom.
The p-value is 0
```
At the bottom of the output, the p-value is very small, which indicates that the hypothesis of a perfect model fit is rejected.
The fit improves if you allow more factors, which shows the trade-off between the number of factors and the parsimony of the model. Similar to PCA, we can look at the cumulative proportion of variance, and once it reaches an acceptable level we can stop adding factors. The Kaiser rule is to discard components whose eigenvalues are below 1.0.
```r
ev = eigen(cor(ml, use = "complete.obs"))  # eigenvalues of the correlation matrix
ev$values
```
By counting how many eigenvalues exceed 1.0, we can decide on the number of factors. There is no definitive way to determine the number of factors.
Another method is to use the variance inflation factor (VIF, wiki). VIF measures how much the variance of an estimated regression coefficient increases when the predictors are correlated, which detects multicollinearity in regression analysis:

VIF_i = 1 / (1 - R_i^2)

where R_i^2 is the coefficient of determination from regressing the i-th predictor on all the other predictors. A VIF of 1 indicates a predictor that is uncorrelated with the others, while a VIF greater than 5 indicates high correlation.
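In practice, VIFs are usually computed with `vif()` from the car package on a fitted model, e.g. the full regression from above (assuming car is installed):

```r
library(car)
vif(reg)  # one VIF per predictor; values above ~5 flag multicollinearity
```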