A simple way to do PCA in R:
pc <- princomp(X)   # X: a numeric data matrix or data frame
plot(pc)            # scree plot of component variances
summary(pc)         # proportion of variance explained by each component
loadings(pc)        # variable loadings
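As a quick sanity check on what princomp returns, here is a minimal sketch using the built-in USArrests data (an arbitrary choice for illustration; any numeric matrix works): the component variances equal the eigenvalues of the covariance matrix, up to princomp's divisor of n rather than n - 1.

```r
pc <- princomp(USArrests)

# princomp divides by n, cov() by n - 1, hence the (n - 1)/n factor
n  <- nrow(USArrests)
ev <- eigen(cov(USArrests))$values * (n - 1) / n
all.equal(as.numeric(pc$sdev^2), ev, check.attributes = FALSE)  # TRUE
```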
PCA in R
An example with the demo dataset decathlon2, as described here. The first column is individual names. To start, we keep the first 10 variables and the first 23 individuals as decathlon2.active. Here we use prcomp to do the PCA:
library(factoextra)
data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]
res.pca <- prcomp(decathlon2.active, scale = TRUE)
Then we can use fviz_eig to visualize the eigenvalues, i.e. the percentage of variance explained by each principal component. Below we also compute the percentages by hand from the standard deviations returned by prcomp, which produces the same barplot as fviz_eig. Finally, fviz_contrib plots the contribution of each variable to a given PC.
fviz_eig(res.pca, addlabels = TRUE)
var.pct <- round(res.pca$sdev^2 / sum(res.pca$sdev^2) * 100, 1)  # % variance per PC
barplot(var.pct)  # same barplot as fviz_eig
fviz_contrib(res.pca, choice = 'var', axes = 1)  # variable contributions to PC1
fviz_contrib(res.pca, choice = 'var', axes = 2)  # ... and to PC2
We can also inspect the PCA results with get_eigenvalue, get_pca_var, and get_pca_ind.
# Eigenvalues
eig.val <- get_eigenvalue(res.pca)
eig.val
eigenvalue variance.percent cumulative.variance.percent
Dim.1 4.1242133 41.242133 41.24213
Dim.2 1.8385309 18.385309 59.62744
Dim.3 1.2391403 12.391403 72.01885
Dim.4 0.8194402 8.194402 80.21325
Dim.5 0.7015528 7.015528 87.22878
Dim.6 0.4228828 4.228828 91.45760
Dim.7 0.3025817 3.025817 94.48342
Dim.8 0.2744700 2.744700 97.22812
Dim.9 0.1552169 1.552169 98.78029
Dim.10 0.1219710 1.219710 100.00000
# Results for Variables
res.var <- get_pca_var(res.pca)
res.var$coord # Coordinates
res.var$contrib # Contributions to the PCs
res.var$cos2 # Quality of representation
# Results for individuals
res.ind <- get_pca_ind(res.pca)
res.ind$coord # Coordinates
res.ind$contrib # Contributions to the PCs
res.ind$cos2 # Quality of representation
get_eigenvalue gives the eigenvalues (component variances) and the proportion of overall variance explained. At this step you can decide how many leading PCs to retain.
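One common rule for deciding how many PCs to keep is a cumulative-variance threshold (80% here is an assumed cutoff, not a universal one). A base-R sketch on the built-in USArrests data, so no extra package is needed:

```r
res <- prcomp(USArrests, scale. = TRUE)
cum.pct <- cumsum(res$sdev^2) / sum(res$sdev^2) * 100  # cumulative.variance.percent
n.keep  <- which(cum.pct >= 80)[1]  # smallest number of PCs reaching 80%
```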
get_pca_var gives the eigenvectors (direction cosines of the rotation of variables into components). This step also computes loadings; you may skip it if you don't need to interpret the PCs. Loadings are eigenvectors scaled by the square roots of the respective eigenvalues: loading = eigenvector * sqrt(eigenvalue). Loadings are the covariances between variables and components.
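The loading formula can be checked numerically with base R (USArrests as a stand-in dataset): for standardized data, scaling each eigenvector column by sqrt(eigenvalue) reproduces the correlations between the original variables and the component scores.

```r
res <- prcomp(USArrests, scale. = TRUE)
loadings <- res$rotation %*% diag(res$sdev)  # eigenvector * sqrt(eigenvalue)
cors <- cor(scale(USArrests), res$x)         # variable-component correlations
all.equal(loadings, cors, check.attributes = FALSE)  # TRUE
```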
More notes about eigenvalues. In PCA, the covariance matrix is split into a scale part (eigenvalues) and a direction part (eigenvectors). Eigenvectors are the coefficients of an orthogonal transformation and are unit-scaled loadings. Loadings are the covariances/correlations between the original variables and the unit-scaled components. Eigenvalues are the variances explained by the PCs.
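The scale/direction split can be verified directly: rebuilding V diag(lambda) V' from eigen() recovers the covariance matrix (USArrests used here as an arbitrary numeric dataset).

```r
S <- cov(USArrests)
e <- eigen(S)                          # $values = scale, $vectors = direction
S.rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)
all.equal(S, S.rebuilt, check.attributes = FALSE)  # TRUE
```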
# Data for the supplementary individuals
ind.sup <- decathlon2[24:27, 1:10]
ind.sup[, 1:6]
ind.sup.coord <- predict(res.pca, newdata = ind.sup)
ind.sup.coord[, 1:4]

fviz_pca_ind(res.pca,
             col.ind = "cos2",        # color by quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE             # avoid text overlapping
)

# Color individuals by groups
groups <- as.factor(decathlon2$Competition[1:23])
fviz_pca_ind(res.pca,
             col.ind = groups,        # color by groups
             palette = c("#00AFBB", "#FC4E07"),
             addEllipses = TRUE,      # concentration ellipses
             ellipse.type = "confidence",
             legend.title = "Groups",
             repel = TRUE
)

fviz_pca_var(res.pca,
             col.var = "contrib",     # color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE             # avoid text overlapping
)

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF",  # variables color
                col.ind = "#696969"   # individuals color
)
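What predict() does for supplementary individuals is simply to apply the centering/scaling learned on the active individuals and then multiply by the rotation matrix. A base-R sketch, holding out the last rows of USArrests (an assumed split for illustration; the decathlon2 split works the same way):

```r
train <- USArrests[1:40, ]
new   <- USArrests[41:50, ]
res   <- prcomp(train, scale. = TRUE)

# manual center/scale + rotation matches predict()
manual <- scale(new, center = res$center, scale = res$scale) %*% res$rotation
all.equal(predict(res, newdata = new), manual, check.attributes = FALSE)  # TRUE
```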
Quick notes!
- maximize the sum of squared distances (SS) of the projected points from the origin
- a PC1 slope of 0.25 means that for every 4 units along the x axis (original variable), we go up 1 unit along the y axis (PC1 is a linear combination of x and y)
- SVD: singular value decomposition
- SS(distances for PC1) = Eigenvalue for PC1
- sqrt(SS(distances for PC1)) = singular value for PC1
- SS(distances for PC1)/(n-1) = Variation for PC1
- SS(distances for PC2)/(n-1) = Variation for PC2
- total variation around both PC1 and PC2 = Variation for PC1 + Variation for PC2
- PC1 accounts for: Variation for PC1 / total variation around both PCs
- sum of squares total = sum of squares due to regression + sum of squares error (residual): SST = SSR + SSE
- residual DF = sample size - number of estimated coefficients (regressors plus intercept)
- regression DF = number of estimated coefficients - 1
- mean square = sum of squares/DF
- total DF = residual DF + regression DF
- F score = mean square of regression / mean square of residual; in F(2, 7) = 10, 2 is the numerator DF and 7 the denominator DF
- t score = estimated coefficient / its standard error
- z score = (observation - mean) / standard deviation
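The SVD bullet points above can be checked numerically: for a centered data matrix X with n rows, each singular value d satisfies d^2 = SS(distances) = eigenvalue, so d^2 / (n - 1) is the variation for that PC, matching prcomp's sdev^2.

```r
X <- scale(USArrests, center = TRUE, scale = FALSE)  # centered, not scaled
d <- svd(X)$d                                        # singular values
res <- prcomp(USArrests)                             # default: center only

# d^2 / (n - 1) = variance of each PC
all.equal(d^2 / (nrow(X) - 1), res$sdev^2, check.attributes = FALSE)  # TRUE
```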