Principal Component Analysis

How I do Principal Component Analysis and choose the number of factors

In the article about correlations, we saw that the variables in airquality were correlated.
Sometimes we need Principal Component Analysis (PCA) to obtain uncorrelated variables before analyzing the data.
That is the subject of this blog article, and especially how many new variables are needed.

PCA

As before, I use airquality as data.
To do PCA, I use the package FactoMineR.

library(FactoMineR)
D<-airquality
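
Before running the PCA, it is worth knowing that airquality contains missing values. The lines below are a minimal base-R check, not part of the original analysis. As far as I know, FactoMineR's PCA() replaces missing entries with the column mean and warns about it; treat that as an assumption to confirm in the package documentation.

```r
# Count missing values per column of the built-in airquality data
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Number of rows with no missing value at all
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```

If many rows are incomplete, the way they are handled (imputation or removal) can change the eigenvalues noticeably.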

pca<-PCA(D)
pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  2.3175145              38.625242                          38.62524
## comp 2  1.1646466              19.410776                          58.03602
## comp 3  0.9830994              16.384990                          74.42101
## comp 4  0.7904881              13.174802                          87.59581
## comp 5  0.4347422               7.245704                          94.84151
## comp 6  0.3095092               5.158486                         100.00000

The question is: how many dimensions do we need to keep?
The wonderful package psycho by Dominique Makowski has the answer. Thanks to him!

Number of factors retained by psycho::n_factors()

library(magrittr)
library(psycho)
choice <- D %>% psycho::n_factors()
choice
## The choice of 2 factors is supported by 5 (out of 10; 50%) methods (Parallel Analysis, Eigenvalues (Kaiser Criterion), BIC, Sample Size Adjusted BIC, VSS Complexity 1).

summary(choice)
## # A tibble: 6 x 4
##   n.Factors n.Methods Eigenvalues Cum.Variance
##       <int>     <dbl>       <dbl>        <dbl>
## 1         1         3       2.43         0.406
## 2         2         5       1.17         0.601
## 3         3         0       0.997        0.768
## 4         4         0       0.790        0.899
## 5         5         1       0.407        0.967
## 6         6         0       0.198        1

plot(choice)

On the plot, which shows the summary, the number of methods is drawn in yellow. The red line is the eigenvalues and the blue line is the cumulative proportion of explained variance.
According to this method, we can keep the first two dimensions of the PCA.

Extraction of the variables

dimdesc from FactoMineR gives, for each dimension, the correlations with the original variables and their p-values.
X is the new data set built from the PCA coordinates.

dimdesc(pca, axes = 1:2)
## $Dim.1
## $Dim.1$quanti
##         correlation      p.value
## Temp      0.8657470 3.027143e-47
## Ozone     0.8283780 7.735036e-40
## Month     0.4466436 7.164874e-09
## Solar.R   0.3851781 8.816862e-07
## Wind     -0.7145176 3.380623e-25
##
##
## $Dim.2
## $Dim.2$quanti
##         correlation      p.value
## Month     0.5579040 6.798713e-14
## Day       0.5418723 4.714049e-13
## Wind     -0.1779546 2.775569e-02
## Solar.R  -0.7203875 9.044341e-26

X<-cbind(pca$ind$coord[,1], pca$ind$coord[,2]) %>% set_colnames(c("PC1", "PC2"))
head(X)
##          PC1        PC2
## 1 -0.5697737 -1.5388946
## 2 -0.6628665 -0.9220601
## 3 -1.5357042 -1.2459632
## 4 -1.5359488 -2.4670249
## 5 -2.1908721 -1.6677619
## 6 -1.9484779 -1.5487626
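
Since the goal was to obtain uncorrelated variables, the retained components can be checked directly. The sketch below is not the FactoMineR workflow used above: it swaps in base R's prcomp() and, as an assumption, drops the incomplete rows with na.omit(), so its numbers will not match the output shown earlier. It recomputes the eigenvalues, applies the Kaiser criterion (eigenvalue greater than 1) by hand, and looks at the correlation between the first two components. The names D_complete, pc, eig, cum_var, n_kaiser, scores and score_cor are chosen only for this illustration.

```r
# Base-R sketch: prcomp() instead of FactoMineR::PCA().
# Assumption: rows with missing values are dropped, so the values
# differ from the FactoMineR output shown in the article.
D_complete <- na.omit(airquality)
pc <- prcomp(D_complete, center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared standard deviations
eig <- pc$sdev^2
cum_var <- cumsum(eig) / sum(eig)

# Kaiser criterion: count the components whose eigenvalue exceeds 1
n_kaiser <- sum(eig > 1)

# Scores on the first two components and their correlation matrix
scores <- pc$x[, 1:2]
score_cor <- cor(scores)

print(round(cbind(eigenvalue = eig, cum_var = cum_var), 3))
print(n_kaiser)
print(round(score_cor, 10))
```

The off-diagonal entries of score_cor should be zero up to rounding, which is exactly the property we wanted from the new variables.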

And you, how do you choose the number of factors to keep from a PCA?