R PCA

Posted by Shsun on January 14, 2020

A simple way to do PCA in R:

pc <- princomp(X)   # X: a numeric matrix or data frame
plot(pc)            # scree plot of the component variances
summary(pc)         # standard deviations and proportions of variance
loadings(pc)        # variable loadings
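
As an aside (my own comparison, not part of the snippet above): princomp estimates the covariance matrix with divisor n and uses an eigendecomposition, while prcomp uses the SVD with divisor n - 1, so their component variances differ by a factor of n/(n - 1). A toy check:

set.seed(1)                            # assumed toy data, just to make X concrete
X <- matrix(rnorm(100), ncol = 5)
p1 <- princomp(X)                      # eigendecomposition, divisor n
p2 <- prcomp(X)                        # SVD, divisor n - 1
p1$sdev^2 * nrow(X) / (nrow(X) - 1)    # rescaled princomp variances ...
p2$sdev^2                              # ... match the prcomp variances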

PCA in R

An example with the demo dataset decathlon2, as described here.

Individual (athlete) names are stored as the row names of the data frame. To start, we keep the first 10 variables (the event results, measured in seconds or metres) and the first 23 individuals as decathlon2.active, and run the PCA with prcomp. Setting scale. = TRUE standardizes the variables first, i.e. the PCA is done on the correlation matrix:

library(factoextra)
data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]           # 23 active individuals, 10 event variables
res.pca <- prcomp(decathlon2.active, scale. = TRUE)

Then we can use fviz_eig to visualize the eigenvalues, i.e. the percentage of variance explained by each principal component. The same percentages can be computed by hand from the standard deviations returned by prcomp, which produces the same barplot as fviz_eig. fviz_contrib then plots the contribution of each variable to a given PC.

fviz_eig(res.pca, addlabels = TRUE)
pct <- round(res.pca$sdev^2 / sum(res.pca$sdev^2) * 100, 1)   # percentage of variance per PC
barplot(pct)                                                  # same picture as fviz_eig
fviz_contrib(res.pca, choice = 'var', axes = 1)               # variable contributions to PC1
fviz_contrib(res.pca, choice = 'var', axes = 2)               # ... and to PC2

We can also inspect the PCA results with get_eigenvalue, get_pca_var, and get_pca_ind:

# Eigenvalues
eig.val <- get_eigenvalue(res.pca)
eig.val

        eigenvalue variance.percent cumulative.variance.percent
Dim.1   4.1242133        41.242133                    41.24213
Dim.2   1.8385309        18.385309                    59.62744
Dim.3   1.2391403        12.391403                    72.01885
Dim.4   0.8194402         8.194402                    80.21325
Dim.5   0.7015528         7.015528                    87.22878
Dim.6   0.4228828         4.228828                    91.45760
Dim.7   0.3025817         3.025817                    94.48342
Dim.8   0.2744700         2.744700                    97.22812
Dim.9   0.1552169         1.552169                    98.78029
Dim.10  0.1219710         1.219710                   100.00000

# Results for Variables
res.var <- get_pca_var(res.pca)
res.var$coord          # Coordinates
res.var$contrib        # Contributions to the PCs
res.var$cos2           # Quality of representation

# Results for individuals
res.ind <- get_pca_ind(res.pca)
res.ind$coord          # Coordinates
res.ind$contrib        # Contributions to the PCs
res.ind$cos2           # Quality of representation

get_eigenvalue gives the eigenvalues (component variances) and the proportion of overall variance each explains. At this step you can decide how many leading PCs to retain.
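
For example (a sketch; the cutoffs are common heuristics, not from the post):

eig.val <- get_eigenvalue(res.pca)
sum(eig.val$eigenvalue > 1)                           # Kaiser rule: keep PCs with eigenvalue > 1
which(eig.val$cumulative.variance.percent >= 80)[1]   # PCs needed to reach 80% of the variance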

get_pca_var gives the eigenvectors (the cosines of the rotation of the variables into the components). This step also computes the loadings; we may skip it if we don't need to interpret the PCs. Loadings are eigenvectors scaled by the respective eigenvalues: loading = eigenvector * sqrt(eigenvalue). Loadings are the covariances between the variables and the components.
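
We can verify this relation on the prcomp output, where res.pca$rotation holds the eigenvectors and res.pca$sdev the square roots of the eigenvalues (a sketch):

ld <- sweep(res.pca$rotation, 2, res.pca$sdev, "*")   # eigenvector * sqrt(eigenvalue)
head(ld[, 1:2])                                       # matches res.var$coord from get_pca_var()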

More notes about eigenvalues. In PCA, the covariance matrix is split into a scale part (the eigenvalues) and a direction part (the eigenvectors). Eigenvectors are the coefficients of an orthogonal transformation; they are unit-scaled loadings. Loadings are the covariances/correlations between the original variables and the unit-scaled components. Eigenvalues are the variances explained by the PCs.
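
The same split can be computed directly: since we ran prcomp with scale. = TRUE, eigen() on the correlation matrix reproduces it (a sketch):

ev <- eigen(cor(decathlon2.active))
ev$values         # eigenvalues = res.pca$sdev^2, the variances of the PCs
ev$vectors[, 1]   # first eigenvector = res.pca$rotation[, 1], up to sign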

Finally, we can project supplementary individuals onto the PCs with predict, and draw the standard graphs with fviz_pca_ind (individuals, colored by cos2 or by group), fviz_pca_var (the correlation circle of variables), and fviz_pca_biplot:

# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10]
ind.sup[, 1:6]
ind.sup.coord <- predict(res.pca, newdata = ind.sup)
ind.sup.coord[, 1:4]

fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
)

groups <- as.factor(decathlon2$Competition[1:23])  ## color individuals by groups
fviz_pca_ind(res.pca,
             col.ind = groups, # color by groups
             palette = c("#00AFBB",  "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             ellipse.type = "confidence",
             legend.title = "Groups",
             repel = TRUE
)

fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
)

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
)
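
predict.prcomp centers and scales the new rows with the parameters learned from the active individuals before rotating them; as a sanity check (a sketch, not in the original code), the supplementary coordinates can be reproduced by hand:

Xsup <- scale(ind.sup, center = res.pca$center, scale = res.pca$scale)
Xsup %*% res.pca$rotation    # matches ind.sup.coord from predict()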

Quick notes!

  • each PC is the line that maximizes the sum of squares (SS) of the distances of the projected points from the origin
  • a PC1 slope of 0.25 means that for every 4 units along the x axis (original variable) we go up 1 unit along the y axis; PC1 is a linear combination of x and y
  • SVD: singular value decomposition
  • SS(distances for PC1) = Eigenvalue for PC1 (checked numerically in the sketch after this list)
  • sqrt(Eigenvalue for PC1) = singular value for PC1
  • SS(distances for PC1)/(n-1) = Variation for PC1
  • SS(distances for PC2)/(n-1) = Variation for PC2
  • total variation around both PCs = Variation for PC1 + Variation for PC2
  • PC1 accounts for: Variation for PC1 / total variation around both PCs
  • sum of squares total = sum of squares due to regression + sum of squares error (residual): SST = SSR + SSE
  • residual DF = sample size - number of regressors (counting the intercept)
  • regression DF = number of regressors (counting the intercept) - 1
  • mean square = sum of squares / DF
  • total DF = residual DF + regression DF
  • F score = mean square of regression / mean square of residual; in F(2,7) = 10, 2 is the numerator DF and 7 the denominator DF
  • T score = estimated coefficient / its standard error
  • Z score = (observation - mean) / standard deviation
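
Several of these identities are easy to check numerically. Below is a sketch, not from the original notes: the SVD check reuses the scaled decathlon data from above, and the ANOVA check fits a toy model on the built-in mtcars data.

Xs <- scale(decathlon2.active)
sv <- svd(Xs)
sv$d[1]^2 / (nrow(Xs) - 1)   # singular value^2 / (n-1) = Variation for PC1 ...
res.pca$sdev[1]^2            # ... equals the first eigenvalue reported by prcomp

fit <- lm(mpg ~ wt + hp, data = mtcars)   # n = 32, 2 predictors + intercept
a <- anova(fit)
a                         # Df, Sum Sq, Mean Sq and F per row; last row is the residual
sum(a$`Sum Sq`)           # SST = sum of the regression and residual sums of squares
summary(fit)$fstatistic   # overall F with numerator DF = 2, denominator DF = 29
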
References
  1. Score coefficient in PCA and factor analysis
  2. Steps done in factor analysis compared to steps done in PCA
  3. PCA on correlation or covariance: does PCA on correlation ever make sense?
  4. ANOVA table