Unsupervised learning: principal component analysis


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Quick notes on modeling

Be aware of confounding variables

A confounding variable (or lurking variable) is one that is associated with both the response and predictor variables

  • It can be tempting to immediately jump to conclusions when faced with a strong association between the variables
  • However, it is important to stop and think about whether there are confounding variables which could be explaining the association instead

Be aware of confounding variables

In studying the relationship between a predictor \(X\) and a response variable \(Y\),

  • we should adjust for confounding variables that can influence that relationship because they are associated both with \(X\) and with \(Y\)

  • If the variable is a potential confounder, including it in the model may help to reduce bias in estimating relevant effects of key explanatory variables

Be aware of confounding variables

Example: Examine the impact of fertilizer on the yield of an apple orchard

  • Suppose researchers are studying the relationship between the amount of fertilizer used and the yield of apple orchards
  • They fit a model with the yield as the response and amount of fertilizer used as the single predictor
  • But, there are many possible confounding variables that might be associated with both the amount of fertilizer used and the yield (e.g., quality of soil, weather, etc.)
  • These variables, if included in the data, should also be adjusted for in the model
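
To make the idea concrete, here is a minimal sketch on simulated data (the variables soil_quality, fertilizer, and yield are hypothetical), where soil quality drives both fertilizer use and yield:

library(tidyverse)
set.seed(1)
n <- 200
orchard <- tibble(
  soil_quality = rnorm(n),
  fertilizer = 2 + 0.8 * soil_quality + rnorm(n, sd = 0.5),
  yield = 5 + 0.2 * fertilizer + 1.5 * soil_quality + rnorm(n)
)
# unadjusted model: the fertilizer coefficient absorbs part of the soil effect
lm(yield ~ fertilizer, data = orchard) |> coef()
# adjusted model: including the confounder reduces that bias
lm(yield ~ fertilizer + soil_quality, data = orchard) |> coef()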

A quote from Essays on Data Analysis

I have observed people who explore the data extensively, making quick plots, sketches, and summaries, and then immediately draw a conclusion like “X causes Y” or “A is associated with B” […] This is problematic because often in the exploratory process we only look at rough sketches, often just bivariate plots or 2x2 tables which reveal strong relationships. Making inferences from this kind of exploratory work can be misleading without carefully considering second order factors like confounding […]. Often, those issues are more easily dealt with when using formal models or more detailed visualization.

Be aware of multicollinearity

But be careful… models can suffer from multicollinearity

  • A set of predictors exhibits multicollinearity when one or more of the predictors is strongly correlated with some combination of the other predictors in the set
  • Correlations among predictors could make it seem that no one variable is important when all the others are in the model
  • A variable may seem to have little effect because it overlaps considerably with others in the model, itself being predicted well by the other predictors
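
One common way to check for this is with pairwise correlations and variance inflation factors; below is a sketch on simulated data (the vif() function comes from the car package, which is not used elsewhere in these notes):

library(car) # for vif()
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.7 * x1 + 0.7 * x2 + rnorm(n, sd = 0.1) # x3 is nearly a combination of x1 and x2
y <- 1 + x1 + x2 + rnorm(n)
collinear_data <- data.frame(y, x1, x2, x3)

# pairwise correlations among the predictors
cor(collinear_data[, c("x1", "x2", "x3")])

# variance inflation factors: large values (e.g., > 5 or 10) flag multicollinearity
vif(lm(y ~ x1 + x2 + x3, data = collinear_data))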

Make sure to justify the model choice

Why is the modeling technique appropriate for the problem?

  • For simplicity and explainability purposes? (Why do you prefer a simpler model over a black box?)

  • For complexity and flexibility purposes? (Why do you prefer a black box over a simpler model?)

How do you evaluate your model?

  • Did you compare it with different approaches?

  • Did you compare it with alternative specifications, such as different sets of predictors?

  • How do you choose the predictors?

    • domain knowledge
    • context
    • extensive EDA
    • model assessment based on holdout predictions

Make sure to quantify uncertainty for your estimates of interest

Don’t just say something like “we find an association between Y and X…”

  • Be more specific!

  • What do those coefficient estimates tell you?

  • How large/small is the effect size?

  • What’s the margin of error?

  • Did the model also adjust for other potential meaningful predictors?
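
For instance, here is a minimal sketch on simulated data (hypothetical predictors x and z) showing the kinds of quantities worth reporting:

set.seed(3)
n <- 150
example_data <- data.frame(x = rnorm(n), z = rnorm(n))
example_data$y <- 2 + 0.5 * example_data$x + 0.3 * example_data$z + rnorm(n)
fit <- lm(y ~ x + z, data = example_data)

summary(fit)$coefficients # effect sizes (coefficient estimates) with standard errors
confint(fit, level = 0.95) # 95% confidence intervals (half-width ~ margin of error)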

Background on dimension reduction

Unsupervised learning

  • No response variable (i.e. data are not labeled)

  • Only given a set of features measured on a set of observations

  • Objective: understand the variation and grouping structure of a set of unlabeled data

    • e.g., discover subgroups among the variables or among the observations
  • Difficult to know how “good” you are doing

Dimension reduction: the big picture

Key question: How do we visualize the structure of high-dimensional data?

Example: What if you’re given a dataset with 50 variables and are asked to make one visualization that best represents the data? What do you do?

Tedious task: Make a series of all \(\displaystyle \binom{50}{2} = 1225\) pairs of plots? Or make a giant correlation heatmap?

Intuition: Take high-dimensional data and represent it in 2-3 dimensions, then visualize those dimensions

Dimension reduction: the big picture

  • It’s usually really hard to visualize many dimensions at the same time
  • Often, it makes a lot of sense to choose 2-3 of the “most important dimensions” and just plot those
  • PCA is a very common way to define “most important dimensions”
  • PCA provides the linear combinations of variables that capture the most variation in the data
  • It’s common to plot the first two principal components in a scatterplot
  • It’s very useful to plot principal components with a biplot

    • Adds interpretability to the principal components, and helps reveal relationships among the variables

What is the goal of dimension reduction?

We have \(p\) variables (columns) for \(n\) observations (rows) BUT which variables are interesting?

Can we find a smaller number of dimensions that captures the interesting structure in the data?

  • Could examine all pairwise scatterplots of each variable - tedious, manual process

  • Clustering of variables based on correlation

  • Can we find a combination of the original \(p\) variables?

Dimension reduction:

  • Focus on reducing the dimensionality of the feature space (i.e., number of columns)

  • Retain most of the information / variability in a lower dimensional space (i.e., reducing the number of columns)

  • The process: (big) \(n \times p\) matrix \(\longrightarrow\) dimension reduction method \(\longrightarrow\) (smaller) \(n \times k\) matrix

Principal components analysis (PCA)

TL;DR

PCA replaces the original \(p\) explanatory variables by fewer linear combinations of them (the “principal components”) that are uncorrelated while also accounting for most of their variability

\[ \begin{pmatrix} & & \text{really} & & \\ & & \text{wide} & & \\ & & \text{matrix} & & \end{pmatrix} \rightarrow \text{matrix algebra razzmatazz} \rightarrow \begin{pmatrix} \text{much} \\ \text{thinner} \\ \text{matrix} \end{pmatrix} \]

  • PCA explores the covariance between variables, and combines variables into a smaller set of uncorrelated variables called principal components (PCs)

    • Turn an \(n \times p\) matrix of correlated variables into an \(n \times k\) matrix of uncorrelated variables

Principal components analysis (PCA)

\[ \begin{pmatrix} & & \text{really} & & \\ & & \text{wide} & & \\ & & \text{matrix} & & \end{pmatrix} \rightarrow \text{matrix algebra razzmatazz} \rightarrow \begin{pmatrix} \text{much} \\ \text{thinner} \\ \text{matrix} \end{pmatrix} \]

  • Each of the \(k\) columns in the right-hand matrix is a principal component (PC), and the PCs are all uncorrelated with each other

    • PCs are weighted, linear combinations of the original variables

    • Weights reveal how different variables are loaded into the PCs

  • First column accounts for most variation in the data, second column for second-most variation, and so on

Intuition: we want a small number of PCs (first few PCs) to account for most of the information/variance in the data

What are principal components?

Assume \(\boldsymbol{X}\) is an \(n \times p\) matrix that is centered and standardized

Total variation \(= p\), since \(\text{Var}(\boldsymbol{x}_j) = 1\) for all \(j = 1, \dots, p\) (due to standardization)

PCA will give us \(p\) principal components that are \(n\)-length columns, denoted by \(Z_1, \dots, Z_p\)

  • The first principal component is the linear combination that has the largest possible variance

  • Each succeeding component has the largest possible variance under the constraint that it is uncorrelated with the preceding components

  • A small number of principal components often explains a high percentage of the original variability

First principal component

\[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p\]

  • \(\phi_{j1}\) are the weights indicating the contributions of each variable \(j \in 1, \dots, p\)

  • Weights are normalized \(\displaystyle \sum_{j=1}^p \phi_{j1}^2 = 1\)

  • \(\phi_{1} = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})\) is the loading vector for \(\text{PC}_1\)

  • \(Z_1\) is a linear combination of the \(p\) variables that has the largest variance

Second principal component

\[Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p\]

  • \(\phi_{j2}\) are the weights indicating the contributions of each variable \(j \in 1, \dots, p\)

  • Weights are normalized \(\displaystyle \sum_{j=1}^p \phi_{j2}^2 = 1\)

  • \(\phi_{2} = (\phi_{12}, \phi_{22}, \dots, \phi_{p2})\) is the loading vector for \(\text{PC}_2\)

  • \(Z_2\) is a linear combination of the \(p\) variables that has the largest variance

    • Subject to constraint it is uncorrelated with \(Z_1\)

Repeat this process to create \(p\) principal components

  • Uncorrelated: each pair of components (\(Z_j, Z_{j'}\)), \(j \neq j'\), is uncorrelated

  • Ordered Variance: \(\text{Var}(Z_1) > \text{Var}(Z_2) > \dots > \text{Var}(Z_p)\)

  • Total Variance: \(\displaystyle \sum_{j=1}^p \text{Var}(Z_j) = p\)
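
These three properties can be checked numerically; below is a small sketch on simulated data (the object names are illustrative):

set.seed(4)
n <- 500
p <- 4
# simulate data with correlated columns
X_sim <- matrix(rnorm(n * p), n, p) %*% matrix(runif(p * p), p, p)
Z_sim <- prcomp(X_sim, center = TRUE, scale. = TRUE)$x

round(cor(Z_sim), 3) # ~ identity matrix: PCs are uncorrelated
apply(Z_sim, 2, var) # variances in decreasing order
sum(apply(Z_sim, 2, var)) # total variance equals p (here 4)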

Visualizing PCA in two dimensions

Key idea: provide low-dimensional linear surfaces that are closest to the observations

  • This is the “minimizing projection residuals” viewpoint of PCA
  • There’s another viewpoint: maximizing variance

    • if we project all points onto the line defined by the first PC, we maximize the variance of the resulting projected points across all possible lines
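
The maximizing-variance viewpoint can be illustrated with a two-variable sketch (simulated data): projecting onto the first PC direction gives a larger variance than projecting onto either original axis.

set.seed(5)
n <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.4)
X_2d <- scale(cbind(x1, x2)) # centered and standardized 2-D data

v1 <- prcomp(X_2d)$rotation[, 1] # unit vector giving the first PC direction
var(as.numeric(X_2d %*% v1)) # variance of projections onto the PC1 direction
var(as.numeric(X_2d %*% c(1, 0))) # smaller: variance of projections onto the x1 axis (= 1 after scaling)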

Examples

Data: nutritional information of Starbucks drinks

library(tidyverse)
theme_set(theme_light())
starbucks <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv"
) |>
  # convert columns to numeric that were saved as character
  mutate(trans_fat_g = as.numeric(trans_fat_g), fiber_g = as.numeric(fiber_g))
glimpse(starbucks)
Rows: 1,147
Columns: 15
$ product_name    <chr> "brewed coffee - dark roast", "brewed coffee - dark ro…
$ size            <chr> "short", "tall", "grande", "venti", "short", "tall", "…
$ milk            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, …
$ whip            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ serv_size_m_l   <dbl> 236, 354, 473, 591, 236, 354, 473, 591, 236, 354, 473,…
$ calories        <dbl> 3, 4, 5, 5, 3, 4, 5, 5, 3, 4, 5, 5, 3, 4, 5, 5, 35, 50…
$ total_fat_g     <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,…
$ saturated_fat_g <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ trans_fat_g     <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ cholesterol_mg  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,…
$ sodium_mg       <dbl> 5, 10, 10, 10, 5, 10, 10, 10, 5, 5, 5, 5, 5, 5, 5, 5, …
$ total_carbs_g   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, …
$ fiber_g         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ sugar_g         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, …
$ caffeine_mg     <dbl> 130, 193, 260, 340, 15, 20, 25, 30, 155, 235, 310, 410…

Implement PCA

Use the prcomp() function (based on SVD) for PCA on centered and scaled data

feat <- starbucks |> 
  select(serv_size_m_l:caffeine_mg)
starbucks_pca <- prcomp(feat, center = TRUE, scale. = TRUE)
summary(starbucks_pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7
Standard deviation     2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413
Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177
Cumulative Proportion  0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894
                           PC8     PC9    PC10    PC11
Standard deviation     0.28123 0.16874 0.08702 0.04048
Proportion of Variance 0.00719 0.00259 0.00069 0.00015
Cumulative Proportion  0.99657 0.99916 0.99985 1.00000

Computing principal components

Extract the matrix of principal components (dimension will match original data)

starbucks_pc_matrix <- starbucks_pca$x
head(starbucks_pc_matrix)
           PC1        PC2        PC3           PC4         PC5         PC6
[1,] -3.766852 -1.0023657  0.2482698 -0.1521871448  0.24739830 -0.11365847
[2,] -3.633234 -0.6946439  1.2059943 -0.3720566566  0.06052789 -0.06406410
[3,] -3.518063 -0.3981399  2.2165170 -0.5967175941 -0.13122572 -0.01937237
[4,] -3.412061 -0.1067045  3.3741594 -0.8490378243 -0.26095965 -0.00899485
[5,] -3.721426 -0.9868147 -1.0705094  0.0949330091 -0.27181508  0.17491809
[6,] -3.564899 -0.6712499 -0.7779083 -0.0003019903 -0.72054963  0.37005543
             PC7         PC8        PC9       PC10        PC11
[1,] -0.02812472 0.006489978 0.05145094 0.06678083 0.019741873
[2,]  0.05460952 0.021148978 0.07094211 0.08080545 0.023480029
[3,]  0.09050806 0.031575955 0.08901403 0.09389227 0.028669251
[4,]  0.11585507 0.037521689 0.11287190 0.11582260 0.034691142
[5,]  0.07009414 0.037736197 0.02892317 0.03631676 0.005775410
[6,]  0.20236484 0.068154160 0.03705252 0.03497690 0.002469611

Columns are uncorrelated, with \(\text{Var}(Z_1) > \text{Var}(Z_2) > \dots > \text{Var}(Z_p)\)
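
As a quick sanity check (using feat and starbucks_pca from the chunks above; scale() uses the same centering and scaling as prcomp() here):

# the scores are the centered-and-scaled data times the loading matrix, Z = XV
Z_manual <- scale(feat) %*% starbucks_pca$rotation
all.equal(unname(Z_manual), unname(starbucks_pca$x)) # TRUE, up to numerical error

# and the column variances decrease from PC1 to PC11
round(apply(starbucks_pc_matrix, 2, var), 3)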

Visualizing first two principal components

starbucks <- starbucks |> 
  mutate(pc1 = starbucks_pc_matrix[,1], 
         pc2 = starbucks_pc_matrix[,2])
starbucks |> 
  ggplot(aes(x = pc1, y = pc2)) +
  geom_point(alpha = 0.5) +
  labs(x = "PC 1", y = "PC 2")
  • Principal components themselves are not directly interpretable

  • Make a biplot, with arrows showing how the original variables contribute to the principal components (and how they relate to one another)

Making PCs interpretable with biplots

Biplot displays both the space of observations and the space of variables

Check out the factoextra package

library(factoextra)
# fviz_pca_var(): projection of variables
# fviz_pca_ind(): display observations with first two PCs
starbucks_pca |> 
  fviz_pca_biplot(label = "var",
                  alpha.ind = 0.25,
                  alpha.var = 0.75,
                  col.var = "darkblue",
                  repel = TRUE)
  • Arrow direction: “as the variable increases…”

  • Arrow angles: correlation

    • \(90^{\circ}\): uncorrelated
    • \(< 90^{\circ}\): positively correlated
    • \(> 90^{\circ}\): negatively correlated
  • Arrow length: strength of relationship with PCs
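
To tie the arrows back to numbers, you can inspect the loadings and the feature correlations directly (a small check using the objects defined earlier):

# arrows correspond to the variable loadings on PC1 and PC2
round(starbucks_pca$rotation[, 1:2], 3)

# the angle interpretation can be corroborated with the correlation matrix of the features
round(cor(feat), 2)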

How many principal components to use?

Intuition: Additional principal components will add smaller and smaller variance

  • Keep adding components until the added variance drops off
summary(starbucks_pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7
Standard deviation     2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413
Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177
Cumulative Proportion  0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894
                           PC8     PC9    PC10    PC11
Standard deviation     0.28123 0.16874 0.08702 0.04048
Proportion of Variance 0.00719 0.00259 0.00069 0.00015
Cumulative Proportion  0.99657 0.99916 0.99985 1.00000

Create a scree plot (or elbow plot)

starbucks_pca |> 
  fviz_eig(addlabels = TRUE) +
  geom_hline(yintercept = 100 * (1 / ncol(starbucks_pca$x)), linetype = "dashed", color = "darkred")
  • Rule of thumb: horizontal line at \(1/p\)
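
The same rule of thumb can be applied numerically (a small check using the prcomp object from above; for standardized data this is equivalent to keeping components with eigenvalue at least 1):

# components whose proportion of variance exceeds the 1/p reference line
prop_var <- starbucks_pca$sdev^2 / sum(starbucks_pca$sdev^2)
which(prop_var > 1 / length(prop_var))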

PCA output

# str(starbucks_pca)

Examine the output after running prcomp()

  • starbucks_pca$sdev: standard deviations of the PCs (\(\sqrt{\lambda_j}\), proportional to the singular values)
  • starbucks_pca$rotation: loading matrix (\(V\))
  • starbucks_pca$x: principal component scores matrix (\(Z=XV\))

Can use the broom package for tidying prcomp()

tidy(starbucks_pca, matrix = "eigenvalues") # equivalent to starbucks_pca$sdev
tidy(starbucks_pca, matrix = "rotation") # equivalent to starbucks_pca$rotation
tidy(starbucks_pca, matrix = "scores") # equivalent to starbucks_pca$x

Proportion of variance explained

library(broom)
starbucks_pca |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(x = PC, y = percent)) +
  geom_line() + 
  geom_point() +
  geom_hline(yintercept = 1 / ncol(feat), color = "darkred", linetype = "dashed") +
  scale_x_continuous(breaks = 1:ncol(starbucks_pca$x))

Cumulative proportion of variance explained

library(broom)
starbucks_pca |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(x = PC, y = cumulative)) +
  geom_line() + 
  geom_point() +
  scale_x_continuous(breaks = 1:ncol(starbucks_pca$x))

Remember the spelling…

Appendix

PCA: singular value decomposition (SVD)

\[ X = U D V^T \]

  • \(U\) and \(V\): matrices containing the left and right singular vectors of scaled matrix \(X\)

  • \(D\): diagonal matrix of the singular values

  • SVD simplifies matrix-vector multiplication as rotate, scale, and rotate again

\(V\): loading matrix for \(X\) with \(\phi_{j}\) as columns

  • \(Z = X V\): PC matrix

Bonus: Eigenvalue decomposition (or spectral decomposition)

  • \(V\): eigenvectors of \(X^TX\) (covariance matrix, \(^T\): transpose)

  • \(U\): eigenvectors of \(XX^T\)

  • The singular values (diagonal of \(D\)) are square roots of the eigenvalues of \(X^TX\) or \(XX^T\)

  • Meaning that \(Z = UD\)
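
A quick check of these identities on the Starbucks features (using feat and starbucks_pca from earlier; the comparisons use absolute values because singular vectors are only defined up to sign):

X_scaled <- scale(feat) # centered and scaled feature matrix
sv <- svd(X_scaled)

# V matches the loading matrix, up to sign flips of the columns
all.equal(abs(unname(sv$v)), abs(unname(starbucks_pca$rotation)))

# Z = XV = UD matches the score matrix, again up to sign flips
all.equal(abs(sv$u %*% diag(sv$d)), abs(unname(starbucks_pca$x)))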

Eigenvalues guide dimension reduction

We want to choose \(p^* < p\) such that we are explaining variation in the data

Eigenvalues \(\lambda_j\) for \(j \in 1, \dots, p\) indicate the variance explained by each component

  • \(\displaystyle \sum_{j=1}^p \lambda_j = p\), so \(\lambda_j \geq 1\) indicates that \(\text{PC}_j\) accounts for at least one variable’s worth of variability

  • \(\displaystyle \frac{\lambda_j}{p}\): proportion of variance explained by \(\text{PC}_j\)

  • Arranged in descending order so that \(\lambda_1\) is largest eigenvalue and corresponds to PC1

  • Compute the cumulative proportion of variance explained (CVE) with \(p^*\) components \[\text{CVE}_{p^*} = \sum_{j=1}^{p^*} \frac{\lambda_j}{p}\]

  • Use a scree plot of the eigenvalues to guide the choice of \(p^* < p\) by looking for the “elbow” (where the decrease changes from rapid to slow)
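
For example, the eigenvalues and CVE can be computed directly from the prcomp() object used earlier:

lambda <- starbucks_pca$sdev^2 # eigenvalues (variances of the PCs)
sum(lambda) # equals p (here 11, since the data were standardized)
round(cumsum(lambda) / sum(lambda), 3) # CVE; matches the "Cumulative Proportion" row of summary()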