Unsupervised learning: principal component analysis


SURE 2025

Department of Statistics & Data Science
Carnegie Mellon University

Background

Motivating example: Starbucks drinks


Unsupervised learning

  • No response variable (i.e., data are not labeled)
  • Only given a set of features measured on a set of observations
  • Unsupervised learning is more subjective than supervised learning (difficult to tell how “good” you are doing)
  • There is no simple goal for the analysis, such as prediction of a response in supervised learning
  • Unsupervised learning can be useful as a pre-processing step for supervised learning

Think of unsupervised learning as an extension of EDA—there’s no unique right answer!

Fundamental problems in unsupervised learning

  1. Dimension reduction: reduce the original feature dimension of the data to something smaller so we can explore/visualize the data
  • Methods: PCA (this lecture), ICA, t-SNE, UMAP,…
  2. Clustering: group the observations in the data into different clusters
  • Methods: hard clustering (\(k\)-means, hierarchical clustering,…), soft clustering (mixture model)
  3. Density estimation: estimate the underlying distribution of the data (in a way that makes sense)

Dimension reduction: the big picture

Key question: How do we visualize the structure of high-dimensional data?

Example: What if you’re given a dataset with 50 variables and are asked to make one visualization that best represents the data? What do you do?

Tedious task: Make a series of all \(\displaystyle \binom{50}{2} = 1225\) pairs of plots? Or make a giant correlation heatmap?
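To get a sense of scale, here is a small sketch of the tedious route (wide_data is a hypothetical 50-column numeric data frame, and the heatmap code assumes the tidyverse is loaded; both are for illustration only):

# number of pairwise scatterplots for p = 50 variables
choose(50, 2)   # 1225

# one alternative: a single (very crowded) correlation heatmap
# cor(wide_data) |>
#   as.data.frame() |>
#   rownames_to_column("var1") |>
#   pivot_longer(-var1, names_to = "var2", values_to = "correlation") |>
#   ggplot(aes(x = var1, y = var2, fill = correlation)) +
#   geom_tile()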

Intuition: Take high-dimensional data and represent it in 2-3 dimensions, then visualize those dimensions

Dimension reduction: the big picture

  • It’s usually really hard to visualize many dimensions at the same time
  • Often, it makes a lot of sense to choose 2-3 of the “most important dimensions” and just plot those
  • PCA is a very common way to define “most important dimensions”
  • PCA provides the linear combinations of variables that capture the most variation in the data
  • It’s common to plot the first two principal components in a scatterplot
  • It’s very useful to plot principal components with a biplot

    • Adds interpretability to the principal components, and helps reveal relationships among the variables

What is the goal of dimension reduction?

Given a dataset with \(p\) variables (columns) for \(n\) observations (rows), can we find a smaller number of dimensions that captures the interesting structure in the data?

Dimension reduction:

  • Focus on reducing the dimensionality of the feature space (i.e., the number of columns)

  • Retain as much of the information / variability in the data as possible in that lower-dimensional space

Principal components analysis (PCA)

  • PCA produces a low-dimensional representation of a dataset

  • PCA finds a set of variables that are linear combinations of the features that have maximal variance and are mutually uncorrelated

  • These uncorrelated variables are called the principal components (PCs)

Principal components

  • PCs are weighted, normalized linear combinations of the original variables

  • Weights reveal how different variables are loaded into the PCs

  • The first PC accounts for the most variation in the data, the second PC for the second-most, and so on

Principal components (a little more formal)

  • Assume \(\boldsymbol{X}\) is an \(n \times p\) matrix that is centered and standardized
  • PCA will give us \(p\) principal components that are \(n\)-length columns, denoted by \(Z_1, \dots, Z_p\)

    • The first PC is the linear combination that has the largest possible variance

    • Each succeeding component has the largest possible variance under the constraint that it is uncorrelated with the preceding components

    • A small number of principal components often explains a high proportion of the variability in the data

First principal component

\[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p\]

  • \(\phi_{11}, \phi_{21}, \dots, \phi_{p1}\) are the loadings of the first PC

  • These are weights indicating the contributions of each feature

  • Weights are normalized: \(\displaystyle \sum_{j=1}^p \phi_{j1}^2 = 1\) (otherwise the variance could be made arbitrarily large)

  • Together, the loadings make up the loading vector for the first PC, \(\phi_{1} = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})\)

  • \(Z_1\) is the normalized linear combination of the \(p\) variables with the largest possible variance

Second principal component

\[Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p\]

  • \(\phi_{12}, \phi_{22}, \dots, \phi_{p2}\) are the loadings of the second PC

  • Weights are normalized \(\displaystyle \sum_{j=1}^p \phi_{j2}^2 = 1\)

  • Together, the loadings make up the loading vector for the second PC, \(\phi_{2} = (\phi_{12}, \phi_{22}, \dots, \phi_{p2})\)

  • \(Z_2\) is a linear combination of the \(p\) variables that has the largest variance, subject to the constraint that it is uncorrelated with \(Z_1\)

Obtaining the principal components

Repeat the same process to create \(p\) principal components

  • Uncorrelated: each pair \((Z_j, Z_{j'})\), \(j \neq j'\), is uncorrelated

  • Ordered Variance: \(\text{Var}(Z_1) > \text{Var}(Z_2) > \dots > \text{Var}(Z_p)\)

  • Total Variance: \(\displaystyle \sum_{j=1}^p \text{Var}(Z_j) = p\)
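A quick numerical check of these properties, using simulated toy data (the Starbucks example comes later; all object names here are made up for illustration):

set.seed(1)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)       # toy feature matrix
toy_pca <- prcomp(X, center = TRUE, scale. = TRUE)
Z <- toy_pca$x                        # principal components Z_1, ..., Z_p
round(cor(Z), 2)                      # (near) identity matrix: PCs are uncorrelated
apply(Z, 2, var)                      # variances appear in decreasing order
sum(apply(Z, 2, var))                 # total variance equals p (here, 5)
colSums(toy_pca$rotation^2)           # each loading vector is normalized to length 1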

PCA can be

  • formulated as an optimization problem

  • solved via linear algebra techniques (e.g. eigen decomposition or singular value decomposition (SVD))

Geometry of PCA

  • The loading vector \(\phi_1\) with elements \(\phi_{11}, \phi_{21}, \dots, \phi_{p1}\) defines a direction in feature space along which the data vary the most

  • If we project the observations in our dataset onto this direction, the projected values are called the principal component scores
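Continuing the toy example from above (same made-up object names), projecting the standardized data onto the first loading vector reproduces the first column of PC scores:

phi_1 <- toy_pca$rotation[, 1]                   # loading vector of the first PC
scores_1 <- scale(X) %*% phi_1                   # project observations onto phi_1
all.equal(as.numeric(scores_1), toy_pca$x[, 1])  # TRUE: these are the PC1 scores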

Geometry of PCA in 2D


Visualizing PCA in two dimensions

Key idea: provide low-dimensional linear surfaces that are closest to the observations

  • This is the “minimizing projection residuals” view of PCA: find the low-dimensional surface with the smallest total squared distance to the observations

  • There is an equivalent view, maximizing variance: among all directions we could project the points onto, the first PC direction is the one that maximizes the variance of the projected points
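A small sketch of the maximizing-variance view, again with the toy objects from above (the random direction u is made up for illustration): the variance of the projections onto the first loading vector is at least as large as the variance along any other unit-length direction.

set.seed(2)
u <- rnorm(ncol(X))
u <- u / sqrt(sum(u^2))                    # a random unit-length direction
var(scale(X) %*% toy_pca$rotation[, 1])    # variance along the first PC direction
toy_pca$sdev[1]^2                          # same value, straight from prcomp()
var(scale(X) %*% u)                        # no larger than the variance above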

Examples

Data: nutritional information of Starbucks drinks

library(tidyverse)
theme_set(theme_light())
starbucks <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv"
) |>
  # convert columns to numeric that were saved as character
  mutate(trans_fat_g = as.numeric(trans_fat_g), fiber_g = as.numeric(fiber_g))
glimpse(starbucks)
Rows: 1,147
Columns: 15
$ product_name    <chr> "brewed coffee - dark roast", "brewed coffee - dark ro…
$ size            <chr> "short", "tall", "grande", "venti", "short", "tall", "…
$ milk            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, …
$ whip            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ serv_size_m_l   <dbl> 236, 354, 473, 591, 236, 354, 473, 591, 236, 354, 473,…
$ calories        <dbl> 3, 4, 5, 5, 3, 4, 5, 5, 3, 4, 5, 5, 3, 4, 5, 5, 35, 50…
$ total_fat_g     <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,…
$ saturated_fat_g <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ trans_fat_g     <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
$ cholesterol_mg  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,…
$ sodium_mg       <dbl> 5, 10, 10, 10, 5, 10, 10, 10, 5, 5, 5, 5, 5, 5, 5, 5, …
$ total_carbs_g   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, …
$ fiber_g         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ sugar_g         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, …
$ caffeine_mg     <dbl> 130, 193, 260, 340, 15, 20, 25, 30, 155, 235, 310, 410…

Implementing PCA

Use the prcomp() function (based on SVD) for PCA on centered and scaled data

starbucks_feat <- starbucks |> 
  select(serv_size_m_l:caffeine_mg)
starbucks_pca <- prcomp(starbucks_feat, center = TRUE, scale. = TRUE)
summary(starbucks_pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7
Standard deviation     2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413
Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177
Cumulative Proportion  0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894
                           PC8     PC9    PC10    PC11
Standard deviation     0.28123 0.16874 0.08702 0.04048
Proportion of Variance 0.00719 0.00259 0.00069 0.00015
Cumulative Proportion  0.99657 0.99916 0.99985 1.00000

Computing principal components

Extract the matrix of principal component scores (same dimensions as the feature matrix, \(n \times p\))

Note: the columns are uncorrelated and ordered by variance, with \(\text{Var}(Z_1) > \text{Var}(Z_2) > \dots > \text{Var}(Z_p)\)

starbucks_pc_matrix <- starbucks_pca$x
head(starbucks_pc_matrix)
           PC1        PC2        PC3           PC4         PC5         PC6
[1,] -3.766852 -1.0023657  0.2482698 -0.1521871448  0.24739830 -0.11365847
[2,] -3.633234 -0.6946439  1.2059943 -0.3720566566  0.06052789 -0.06406410
[3,] -3.518063 -0.3981399  2.2165170 -0.5967175941 -0.13122572 -0.01937237
[4,] -3.412061 -0.1067045  3.3741594 -0.8490378243 -0.26095965 -0.00899485
[5,] -3.721426 -0.9868147 -1.0705094  0.0949330091 -0.27181508  0.17491809
[6,] -3.564899 -0.6712499 -0.7779083 -0.0003019903 -0.72054963  0.37005543
             PC7         PC8        PC9       PC10        PC11
[1,] -0.02812472 0.006489978 0.05145094 0.06678083 0.019741873
[2,]  0.05460952 0.021148978 0.07094211 0.08080545 0.023480029
[3,]  0.09050806 0.031575955 0.08901403 0.09389227 0.028669251
[4,]  0.11585507 0.037521689 0.11287190 0.11582260 0.034691142
[5,]  0.07009414 0.037736197 0.02892317 0.03631676 0.005775410
[6,]  0.20236484 0.068154160 0.03705252 0.03497690 0.002469611

Extracting loading matrix

starbucks_pca$rotation
                        PC1         PC2         PC3          PC4         PC5
serv_size_m_l    0.20078297  0.44103545  0.35053466 -0.117331692 -0.71633828
calories         0.39488151  0.10314156 -0.01048587  0.055814030  0.11487225
total_fat_g      0.35254969 -0.31687231  0.06598414  0.046196797  0.07677253
saturated_fat_g  0.33929914 -0.30565133  0.05310592 -0.003731227  0.16145662
trans_fat_g      0.29974182 -0.39855899  0.01855869 -0.092804122 -0.35695525
cholesterol_mg   0.33049434 -0.37077805  0.01219867 -0.105617624 -0.18815364
sodium_mg        0.33573598  0.24647412 -0.09107538 -0.083512068  0.34969486
total_carbs_g    0.34858318  0.34483762 -0.09623296  0.002842153  0.12386718
fiber_g          0.11351848  0.04137855  0.17814948  0.956078124 -0.04719036
sugar_g          0.34234584  0.35100839 -0.13314389 -0.109371714  0.12108189
caffeine_mg     -0.03085327 -0.01056235  0.89572768 -0.167846419  0.35265479
                        PC6         PC7         PC8         PC9         PC10
serv_size_m_l    0.30806678  0.13668394  0.04039275  0.01194522  0.001076764
calories        -0.01331210 -0.18521073  0.09836135 -0.45551398 -0.744248239
total_fat_g      0.37698224 -0.03833030  0.03871096 -0.58859673  0.518643989
saturated_fat_g  0.57285718 -0.06553378 -0.26369346  0.56257742 -0.209355859
trans_fat_g     -0.50043224  0.15197176 -0.58086994 -0.05398876 -0.032105721
cholesterol_mg  -0.26449384 -0.04594580  0.74615325  0.27703165  0.032124871
sodium_mg       -0.06228905  0.82317144  0.06292570  0.04230447  0.037304757
total_carbs_g   -0.17619489 -0.34490217 -0.08588651  0.12501079  0.148886253
fiber_g         -0.11365528  0.06192955  0.01207815  0.10654914  0.061378250
sugar_g         -0.16729497 -0.33345131 -0.10758116  0.14408095  0.321644156
caffeine_mg     -0.19600402 -0.06671121 -0.02122274  0.01530108  0.020691492
                         PC11
serv_size_m_l   -0.0053899973
calories         0.1070327163
total_fat_g     -0.0489644534
saturated_fat_g  0.0152794817
trans_fat_g     -0.0069417249
cholesterol_mg   0.0004710159
sodium_mg       -0.0185545403
total_carbs_g   -0.7347049650
fiber_g          0.0730283725
sugar_g          0.6635335478
caffeine_mg      0.0094861578

Visualizing first two principal components

starbucks <- starbucks |> 
  mutate(pc1 = starbucks_pc_matrix[,1], 
         pc2 = starbucks_pc_matrix[,2])
starbucks |> 
  ggplot(aes(x = pc1, y = pc2)) +
  geom_point(alpha = 0.5) +
  labs(x = "PC 1", y = "PC 2")

  • On their own, the principal component scores are hard to interpret

  • A biplot adds arrows for the original variables, showing how each variable relates to the PCs (and how the variables relate to each other)

Making PCs interpretable with biplots

Biplot displays both the space of observations and the space of variables

Check out the factoextra package

library(factoextra)
# fviz_pca_var(): projection of variables
# fviz_pca_ind(): display observations with first two PCs
starbucks_pca |> 
  fviz_pca_biplot(label = "var",
                  alpha.ind = 0.25,
                  alpha.var = 0.75,
                  labelsize = 5,
                  col.var = "darkblue",
                  repel = TRUE)

Interpreting a biplot

  • Arrow direction

  • Arrow angles

  • Arrow length

Interpreting a biplot

Interpreting the direction of a particular arrow

  • For example, the far-left arrow for caffeine_mg suggests that, as caffeine_mg increases, \(Z_1\) and \(Z_2\) tend to decrease

    • in other words, the loadings for caffeine_mg in both \(Z_1\) and \(Z_2\) are negative
  • In contrast, the arrow for serv_size_m_l points to the upper right, indicating that as serv_size_m_l increases, both \(Z_1\) and \(Z_2\) also increase

Interpreting a biplot

The angle between the vectors indicates the correlation between the corresponding variables

  • \(90^{\circ}\) (right angle): uncorrelated (e.g. serv_size_m_l and saturated_fat_g)

  • \(< 90^{\circ}\) (vectors in similar directions):
    positively correlated
    (e.g. sugar_g and total_carbs_g)

  • \(> 90^{\circ}\) (vectors in different directions):
    negatively correlated
    (e.g. caffeine_mg and calories)
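As a quick sanity check of the angle interpretation, we can compute the corresponding sample correlations directly and compare their signs with the angles in the biplot (keeping in mind the biplot only reflects the correlations as captured by the first two PCs):

starbucks_feat |>
  summarize(
    size_vs_sat_fat  = cor(serv_size_m_l, saturated_fat_g),  # roughly a right angle
    sugar_vs_carbs   = cor(sugar_g, total_carbs_g),          # acute angle
    caffeine_vs_cals = cor(caffeine_mg, calories)            # obtuse angle
  )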

Interpreting a biplot

The arrow length indicates how strongly each variable is associated with the principal components being plotted

  • For example, serv_size_m_l has a fairly long arrow because it has relatively large loadings on the first two PCs (see the loadings extracted below)

  • Meanwhile, caffeine_mg has a relatively short arrow because its loadings on the first two PCs are close to zero
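To connect arrow length back to the loadings, we can pull the first two loadings for these two variables out of the fitted PCA object:

starbucks_pca$rotation[c("serv_size_m_l", "caffeine_mg"), 1:2]
# serv_size_m_l: relatively large loadings on PC1 and PC2 (long arrow)
# caffeine_mg: loadings near zero on PC1 and PC2 (short arrow)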

Intuition on how many PCs to use

summary(starbucks_pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7
Standard deviation     2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413
Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177
Cumulative Proportion  0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894
                           PC8     PC9    PC10    PC11
Standard deviation     0.28123 0.16874 0.08702 0.04048
Proportion of Variance 0.00719 0.00259 0.00069 0.00015
Cumulative Proportion  0.99657 0.99916 0.99985 1.00000
  • PC1 accounts for the most variation in our data, PC2 accounts for the next most, and so on
  • Each time we add a new PC dimension, we capture a higher proportion of the variability/information in the data
  • But that increase in proportion decreases for each new dimension we add (i.e., additional PCs will add smaller and smaller variance)
  • It is recommended to keep adding PCs until the marginal gain levels off (i.e. decreases to the point that it isn’t too beneficial to add another dimension to the data)

Create a scree plot (or elbow plot)

starbucks_pca |> 
  fviz_eig(addlabels = TRUE) +
  geom_hline(
    yintercept = 100 * (1 / ncol(starbucks_pca$x)), 
    linetype = "dashed", 
    color = "darkred"
  )

  • Shows proportion of variation explained by each PC

  • Visual rule-of-thumb: look for the “elbow” (where the proportion of variation starts to become flat)

  • A heuristic rule of thumb: horizontal line at \(1/p\)

  • Based on this plot, we can make a strong argument to use the first 3 PCs (but maybe go up to the first 5 PCs, where there is another substantial drop in the elbow)

Discussion

starbucks_pca |> 
  fviz_eig(addlabels = TRUE) +
  geom_hline(
    yintercept = 100 * (1 / ncol(starbucks_pca$x)), 
    linetype = "dashed", 
    color = "darkred"
  )

  • Let’s say we decide to use the first 3 PCs (to capture a satisfactory amount of the variation in the data)

  • Our biplot only shows the first 2 PCs, so it hides some information that we would ideally display in some other way

    • Specifically, we are hiding about 30% of the variation (i.e., the total amount captured by PCs 3, 4, …, 11)

    • About a third of that remaining variation is captured by the third PC, the one we are not plotting

  • This means that, ideally, we should display the third PC as well, e.g., by mapping it to point size or transparency, or by using a 3D scatterplot

  • Alternatively, we could make three scatterplots, one for each pair of the first three PCs (see the sketch below)
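A minimal sketch of that alternative, reusing the PC score matrix from earlier (arranging the panels with the patchwork package is an assumption; it is not used elsewhere in these slides):

starbucks_pcs <- as_tibble(starbucks_pc_matrix) |> 
  select(PC1:PC3)
p12 <- ggplot(starbucks_pcs, aes(x = PC1, y = PC2)) + geom_point(alpha = 0.25)
p13 <- ggplot(starbucks_pcs, aes(x = PC1, y = PC3)) + geom_point(alpha = 0.25)
p23 <- ggplot(starbucks_pcs, aes(x = PC2, y = PC3)) + geom_point(alpha = 0.25)
# arrange the three pairwise views side by side:
# library(patchwork)
# p12 + p13 + p23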

Remember the spelling…

Appendix

PCA output

# str(starbucks_pca)

Examine the output after running prcomp()

  • starbucks_pca$sdev: standard deviations of the PCs (\(\sqrt{\lambda_j}\), proportional to the singular values)

  • starbucks_pca$rotation: loading matrix (\(V\))

  • starbucks_pca$x: principal component scores matrix (\(Z=XV\))

Can use the broom package for tidying prcomp()

library(broom)
tidy(starbucks_pca, matrix = "eigenvalues") # variance explained by each PC (from starbucks_pca$sdev)
tidy(starbucks_pca, matrix = "rotation")    # tidy version of starbucks_pca$rotation
tidy(starbucks_pca, matrix = "scores")      # tidy version of starbucks_pca$x

Proportion of variance explained

library(broom)
starbucks_pca |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(x = PC, y = percent)) +
  geom_line() + 
  geom_point() +
  geom_hline(yintercept = 1 / ncol(starbucks_feat), color = "darkred", linetype = "dashed") +
  scale_x_continuous(breaks = 1:ncol(starbucks_pca$x))

Cumulative proportion of variance explained

library(broom)
starbucks_pca |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(x = PC, y = cumulative)) +
  geom_line() + 
  geom_point() +
  scale_x_continuous(breaks = 1:ncol(starbucks_pca$x))

Singular value decomposition (SVD)

\[ X = U D V^T \]

  • \(U\) and \(V\): matrices containing the left and right singular vectors of scaled matrix \(X\)

  • \(D\): diagonal matrix of the singular values

  • SVD decomposes the action of a matrix on a vector into a rotation, a scaling, and another rotation

\(V\): loading matrix for \(X\) with \(\phi_{j}\) as columns

  • \(Z = X V\): PC matrix
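A minimal sketch verifying these relationships with svd() on the Starbucks features from the examples (X_scaled and sv are names made up here):

X_scaled <- scale(as.matrix(starbucks_feat))      # centered and standardized X
sv <- svd(X_scaled)                               # X = U D V^T
# columns of sv$v match the loading matrix (up to possible sign flips)
max(abs(abs(sv$v) - abs(starbucks_pca$rotation)))
# Z = XV (equivalently UD) reproduces the PC score matrix, again up to signs
max(abs(abs(X_scaled %*% sv$v) - abs(starbucks_pca$x)))
# the singular values relate to prcomp()'s sdev by a factor of sqrt(n - 1)
all.equal(sv$d / sqrt(nrow(X_scaled) - 1), starbucks_pca$sdev)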

Bonus: Eigenvalue decomposition (or spectral decomposition)

  • \(V\): eigenvectors of \(X^TX\) (proportional to the covariance matrix; \(^T\) denotes transpose)

  • \(U\): eigenvectors of \(XX^T\)

  • The singular values (diagonal of \(D\)) are square roots of the eigenvalues of \(X^TX\) or \(XX^T\)

  • Meaning that \(Z = UD\)

Eigenvalues guide dimension reduction

We want to choose \(p^* < p\) such that we still explain a substantial share of the variation in the data

Eigenvalues \(\lambda_j\) for \(j \in 1, \dots, p\) indicate the variance explained by each component

  • \(\displaystyle \sum_{j=1}^p \lambda_j = p\), so \(\lambda_j \geq 1\) indicates that \(\text{PC}_j\) contains at least one variable’s worth of variability

  • \(\displaystyle \frac{\lambda_j}{p}\): proportion of variance explained by \(\text{PC}_j\)

  • Arranged in descending order so that \(\lambda_1\) is largest eigenvalue and corresponds to PC1

  • Compute the cumulative proportion of variance explained (CVE) with \(p^*\) components \[\text{CVE}_{p^*} = \sum_{j=1}^{p^*} \frac{\lambda_j}{p}\] Use a scree plot of the eigenvalues to guide the choice of \(p^* < p\) by looking for the “elbow” (where the decrease goes from rapid to slow)
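As a worked check of these quantities on the Starbucks PCA: the eigenvalues are the squared standard deviations returned by prcomp(), and the CVE matches the “Cumulative Proportion” row of summary(starbucks_pca):

eigenvalues <- starbucks_pca$sdev^2      # lambda_j for the standardized data
sum(eigenvalues)                         # equals p = 11
eigenvalues / sum(eigenvalues)           # proportion of variance explained by each PC
cumsum(eigenvalues) / sum(eigenvalues)   # cumulative proportion (CVE) for p* = 1, ..., p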