# Principal Component Analysis

How I run a Principal Component Analysis and choose the number of factors

In the article about correlations, we saw that the variables of the airquality data set are correlated.

Sometimes we need a Principal Component Analysis (PCA) to build uncorrelated variables before analyzing the data.

That is the subject of this blog article, and especially how many new variables are needed.

# PCA

As before, I use the airquality data set.

To do PCA, I use the package **FactoMineR**.

```
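library(FactoMineR)
# airquality ships with base R; it contains NAs in Ozone and Solar.R
D <- airquality
# PCA() standardizes the variables by default (scale.unit = TRUE)
pca <- PCA(D)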
```

`pca$eig`

```
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  2.3175145              38.625242                          38.62524
## comp 2  1.1646466              19.410776                          58.03602
## comp 3  0.9830994              16.384990                          74.42101
## comp 4  0.7904881              13.174802                          87.59581
## comp 5  0.4347422               7.245704                          94.84151
## comp 6  0.3095092               5.158486                         100.00000
```
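To see where these eigenvalues come from, here is a minimal base R sketch. It assumes that missing values are replaced by column means before the analysis and that the variables are standardized, so the eigenvalues are those of the correlation matrix. The names `D_imp`, `ev` and `eig` are mine, not part of **FactoMineR**.

```r
# Assumption: NAs are replaced by the column mean before the analysis
D_imp <- airquality
for (v in names(D_imp)) {
  D_imp[[v]][is.na(D_imp[[v]])] <- mean(D_imp[[v]], na.rm = TRUE)
}

# Eigenvalues of the correlation matrix (PCA on standardized variables)
ev <- eigen(cor(D_imp))$values

# Same three columns as pca$eig: eigenvalue, % of variance, cumulative %
eig <- data.frame(
  eigenvalue = ev,
  percentage = 100 * ev / sum(ev),
  cumulative = cumsum(100 * ev / sum(ev))
)
eig
```

With six standardized variables the eigenvalues sum to 6, so each percentage is simply the eigenvalue divided by 6. If the assumptions above hold, the values should be close to those of `pca$eig`.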

The question is: how many dimensions do we need to keep?

The wonderful package **psycho** by Dominique Makowski has the answer. Thanks to him!

# Number of factors retained by psycho::n_factors()

```
library(magrittr)
library(psycho)
choice <- D %>% psycho::n_factors()
choice
```

`## The choice of 2 factors is supported by 5 (out of 10; 50%) methods (Parallel Analysis, Eigenvalues (Kaiser Criterion), BIC, Sample Size Adjusted BIC, VSS Complexity 1).`

`summary(choice)`

```
## # A tibble: 6 x 4
##   n.Factors n.Methods Eigenvalues Cum.Variance
##       <int>     <dbl>       <dbl>        <dbl>
## 1         1         3       2.43         0.406
## 2         2         5       1.17         0.601
## 3         3         0       0.997        0.768
## 4         4         0       0.790        0.899
## 5         5         1       0.407        0.967
## 6         6         0       0.198        1
```

`plot(choice)`

On the plot, which displays the summary, the number of methods is shown in yellow, the eigenvalues as a red line, and the cumulative proportion of explained variance as a blue line.

According to this result, we can keep the first two dimensions of the PCA.
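One of the methods listed in the output, the Kaiser criterion, is simple enough to check by hand: it keeps the components whose eigenvalue is greater than 1. Here is a small sketch using the eigenvalues copied from the `pca$eig` table earlier in this post. The names `ev`, `n_kaiser` and `pct_kept` are mine, and this is only an illustration of that single rule, not of what `n_factors()` does internally.

```r
# Eigenvalues copied from the pca$eig output earlier in this post
ev <- c(2.3175145, 1.1646466, 0.9830994, 0.7904881, 0.4347422, 0.3095092)

# Kaiser criterion: keep the components whose eigenvalue exceeds 1
n_kaiser <- sum(ev > 1)
n_kaiser

# Percentage of total variance carried by the retained components
pct_kept <- 100 * sum(ev[seq_len(n_kaiser)]) / sum(ev)
pct_kept
```

Here the rule keeps 2 components, which together carry about 58% of the variance. The third eigenvalue (0.98) falls just under the threshold, which is a good reason not to rely on this criterion alone.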

# Extraction of the variables

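`dimdesc()` from **FactoMineR** gives the correlation between each variable and each dimension, along with its *p*-value.

X, defined further down, is the new data set built from the PCA: the coordinates of the individuals on the first two dimensions.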

`dimdesc(pca, axes = 1:2)`

```
## $Dim.1
## $Dim.1$quanti
##         correlation      p.value
## Temp      0.8657470 3.027143e-47
## Ozone     0.8283780 7.735036e-40
## Month     0.4466436 7.164874e-09
## Solar.R   0.3851781 8.816862e-07
## Wind     -0.7145176 3.380623e-25
##
##
## $Dim.2
## $Dim.2$quanti
##         correlation      p.value
## Month     0.5579040 6.798713e-14
## Day       0.5418723 4.714049e-13
## Wind     -0.1779546 2.775569e-02
## Solar.R  -0.7203875 9.044341e-26
```
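To make these correlations less of a black box, here is a base R sketch of the same idea with `prcomp()`: correlate each original variable with the scores of the individuals on a component. It assumes mean imputation of the missing values, and the names `D_imp`, `pc` and `cors` are mine. The sign of a component is arbitrary, so the signs may be flipped compared with the **FactoMineR** output.

```r
# Assumption: NAs are replaced by the column mean before the analysis
D_imp <- airquality
for (v in names(D_imp)) {
  D_imp[[v]][is.na(D_imp[[v]])] <- mean(D_imp[[v]], na.rm = TRUE)
}

# PCA on standardized variables
pc <- prcomp(D_imp, scale. = TRUE)

# Correlation of every variable with the first two components
cors <- cor(D_imp, pc$x[, 1:2])
round(cors, 3)

# p-value of the correlation test for one variable/component pair
cor.test(D_imp$Temp, pc$x[, 1])$p.value
```

For standardized variables, each correlation is also the loading multiplied by the standard deviation of the component, that is `pc$rotation[j, k] * pc$sdev[k]`.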

```
# Keep the coordinates of the individuals on the first two dimensions
X <- pca$ind$coord[, 1:2] %>% set_colnames(c("PC1", "PC2"))
head(X)
```

```
##          PC1        PC2
## 1 -0.5697737 -1.5388946
## 2 -0.6628665 -0.9220601
## 3 -1.5357042 -1.2459632
## 4 -1.5359488 -2.4670249
## 5 -2.1908721 -1.6677619
## 6 -1.9484779 -1.5487626
```
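The whole point of the exercise was to obtain uncorrelated variables, so it is worth checking. Here is a sketch in base R with `prcomp()`, again assuming mean imputation of the missing values. The names `D_imp`, `X_base` and `r12` are mine, and the scores may differ in sign or scale from the **FactoMineR** coordinates.

```r
# Assumption: NAs are replaced by the column mean before the analysis
D_imp <- airquality
for (v in names(D_imp)) {
  D_imp[[v]][is.na(D_imp[[v]])] <- mean(D_imp[[v]], na.rm = TRUE)
}

# Scores of the individuals on the first two principal components
X_base <- prcomp(D_imp, scale. = TRUE)$x[, 1:2]

# Correlation between the two new variables: zero up to rounding error
r12 <- cor(X_base[, "PC1"], X_base[, "PC2"])
round(r12, 10)
```

The original variables were correlated with each other, while the two new ones are not, by construction.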

And you, how do you choose the number of factors to keep from a PCA?
