Principal Component Analysis

How I run a Principal Component Analysis and choose the number of factors

Marie Vaugoyeau

2 minutes read

In the article about correlations, we saw that the variables of the airquality dataset are correlated.
Sometimes a Principal Component Analysis (PCA) is needed to derive uncorrelated variables before analyzing the data.
That is the subject of this blog article, and in particular how many new variables are needed.

PCA

As before, I use the airquality dataset.
To run the PCA, I use the FactoMineR package.

library(FactoMineR)
D <- airquality

# PCA() standardizes the variables by default (scale.unit = TRUE)
pca <- PCA(D)

pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  2.3175145              38.625242                          38.62524
## comp 2  1.1646466              19.410776                          58.03602
## comp 3  0.9830994              16.384990                          74.42101
## comp 4  0.7904881              13.174802                          87.59581
## comp 5  0.4347422               7.245704                          94.84151
## comp 6  0.3095092               5.158486                         100.00000
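
To make the link between the three columns explicit, here is a minimal sketch in base R. It only reuses the eigenvalues printed above (copied by hand, so nothing is recomputed from airquality); the names eig, pct and cum_pct are mine. Because the variables are standardized, the eigenvalues add up to the number of variables (6), and each percentage is simply the eigenvalue divided by that total.

```r
# Eigenvalues copied from the pca$eig output above
eig <- c(2.3175145, 1.1646466, 0.9830994, 0.7904881, 0.4347422, 0.3095092)

# Share of the total variance carried by each component, in percent
pct <- eig / sum(eig) * 100

# Running total of explained variance, in percent
cum_pct <- cumsum(pct)

# Side-by-side view, comparable to the pca$eig table
round(data.frame(eigenvalue = eig, pct = pct, cum_pct = cum_pct), 2)
```

The first two components together carry about 58% of the variance, which is the figure to keep in mind for the rest of the article.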

The question is: how many dimensions do we need to keep?

The wonderful psycho package by Dominique Makowski has the answer. Thanks to him!

Number of factors retained by psycho::n_factors()

library(magrittr)
library(psycho)

choice <- D %>% psycho::n_factors()
choice
## The choice of 2 factors is supported by 4 (out of 10; 40%) methods (Eigenvalues (Kaiser Criterion), BIC, Sample Size Adjusted BIC, VSS Complexity 1).
summary(choice)
## # A tibble: 6 x 4
##   n.Factors n.Methods Eigenvalues Cum.Variance
##       <int>     <dbl>       <dbl>        <dbl>
## 1         1         3       2.43         0.406
## 2         2         4       1.17         0.601
## 3         3         1       0.997        0.768
## 4         4         0       0.790        0.899
## 5         5         1       0.407        0.967
## 6         6         1       0.198        1
plot(choice)

On the plot, which displays the summary, the yellow area shows the number of methods supporting each number of factors. The red line is the eigenvalues and the blue line is the cumulative proportion of explained variance.
According to this approach, where 2 factors are supported by 4 of the 10 methods, we can keep the first two dimensions of the PCA.
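
One of the methods listed in the output, the Kaiser criterion, is simple enough to check by hand: keep the components whose eigenvalue is greater than 1. The sketch below applies it to the eigenvalues from pca$eig (copied by hand; n_keep is a name I chose). It is only an illustration of that single rule, not a reimplementation of what psycho::n_factors() does internally.

```r
# Eigenvalues copied from the pca$eig output of the PCA section
eig <- c(2.3175145, 1.1646466, 0.9830994, 0.7904881, 0.4347422, 0.3095092)

# Kaiser criterion: keep the components with an eigenvalue above 1,
# i.e. those carrying more variance than one standardized variable
n_keep <- sum(eig > 1)
n_keep
```

The third eigenvalue (0.98) sits just under the threshold, so this rule alone is a close call here, which is one reason to look at several methods at once.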

Extraction of the variables

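dimdesc() from FactoMineR gives the correlation between each variable and each dimension, along with its p-value.
X is the new dataset, built from the coordinates of the individuals on the first two dimensions of the PCA.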

dimdesc(pca, axes = 1:2)
## $Dim.1
## $Dim.1$quanti
##         correlation      p.value
## Temp      0.8657470 3.027143e-47
## Ozone     0.8283780 7.735036e-40
## Month     0.4466436 7.164874e-09
## Solar.R   0.3851781 8.816862e-07
## Wind     -0.7145176 3.380623e-25
## 
## 
## $Dim.2
## $Dim.2$quanti
##         correlation      p.value
## Month     0.5579040 6.798713e-14
## Day       0.5418723 4.714049e-13
## Wind     -0.1779546 2.775569e-02
## Solar.R  -0.7203875 9.044341e-26

X <- cbind(pca$ind$coord[, 1], pca$ind$coord[, 2]) %>% set_colnames(c("PC1", "PC2"))
head(X)
##          PC1        PC2
## 1 -0.5697737 -1.5388946
## 2 -0.6628665 -0.9220601
## 3 -1.5357042 -1.2459632
## 4 -1.5359488 -2.4670249
## 5 -2.1908721 -1.6677619
## 6 -1.9484779 -1.5487626
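
Since the goal was to get uncorrelated variables, a quick sanity check is to look at the correlation between the two new columns. The sketch below is self-contained and uses base R's prcomp() instead of FactoMineR, on complete cases only (na.omit). Both are my assumptions for the example, so the scores will not match the values of X above exactly; complete, fit, scores and cor_pc are illustrative names.

```r
# Base R alternative to FactoMineR::PCA(), for illustration only.
# Assumption: rows with missing values are dropped before the PCA.
complete <- na.omit(airquality)

# Centered and scaled PCA, i.e. a PCA on the correlation matrix
fit <- prcomp(complete, center = TRUE, scale. = TRUE)

# Scores of the individuals on the first two components
scores <- fit$x[, 1:2]

# Correlation matrix of the two new variables
cor_pc <- cor(scores)
round(cor_pc, 10)
```

The off-diagonal value is zero up to numerical precision: the principal components are uncorrelated by construction, whereas the original variables were not.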

And you, how do you choose the number of factors to keep from a PCA?