Simulation


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Why simulation?

  • We’re in the 21st century!

  • Simulations can often be

    • easier than hand calculations

    • made more realistic than hand calculations

  • R provides excellent built-in tools for statistical simulation (more so than many other languages)

Sampling from a given vector

  • To sample from a given vector, use sample()
# letters: a built-in vector in base R containing the lowercase English alphabet
sample(letters, size = 5) # sample without replacement, by default
[1] "t" "r" "x" "j" "p"
sample(c(0, 1), size = 7, replace = TRUE) # sample with replacement
[1] 0 0 1 1 0 0 0
# 5 (independent) coin tosses
coin <- c("H", "T")
sample(coin, 5, replace = TRUE)
[1] "T" "T" "T" "H" "H"
sample(1:100, 1) # sample a random integer between 1 and 100
[1] 57
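
  • sample() also has a prob argument for weighted draws; for instance (a small sketch, with an arbitrary 0.8/0.2 weighting):

# a biased coin: heads with probability 0.8
sample(c("H", "T"), size = 5, replace = TRUE, prob = c(0.8, 0.2))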

Probability distributions

A distribution is a mathematical function \(f(x \mid \theta)\) where

  • \(x\) may take on continuous or discrete values over the domain (i.e. all possible inputs) of \(f(x \mid \theta)\)
  • \(\theta\) is a set of parameters governing the shape of the distribution
    • e.g. \(\theta = \{\mu, \sigma ^2 \}\) for a normal (Gaussian) distribution
  • the \(\mid\) symbol means that the shape of the distribution is conditional on the values of \(\theta\)

We call \(f\) the distribution's

  • probability density function (PDF) if \(x\) is continuous
  • probability mass function (PMF) if \(x\) is discrete

Note:

  • \(f(x \mid \theta) \geq 0\) for all \(x\)
  • \(\displaystyle \sum_x f(x \mid \theta) = 1\) (discrete case) or \(\displaystyle \int_x f(x \mid \theta) \, dx = 1\) (continuous case)
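
These properties are easy to check numerically in R (a quick sanity check, not a proof):

# a PMF sums to 1 over its support, e.g. Binomial(20, 0.3)
sum(dbinom(0:20, size = 20, prob = 0.3))
# a PDF integrates to 1 over its domain, e.g. the standard normal
integrate(dnorm, lower = -Inf, upper = Inf)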

Normal distribution

  • PDF: \(\displaystyle f(x \mid \mu, \sigma^2)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}\); \(x \in (- \infty, \infty)\)

  • We write \(X \sim N(\mu, \sigma^2)\)

  • Standard normal distribution: \(N(0, 1)\)

library(tidyverse)
theme_set(theme_light())
tibble(x = c(-5, 5)) |> 
  ggplot(aes(x)) +
  stat_function(fun = dnorm, color = "blue",
                args = list(mean = 0, sd = 1)) +
  stat_function(fun = dnorm, color = "red",
                args = list(mean = -2, sd = sqrt(0.5)))

Binomial distribution

  • PMF: \(\displaystyle f(x \mid n, p)= \binom{n}{x} p^{x}(1-p)^{n-x}\); \(x = 0, 1, \dots, n\)

  • Model for the probability of \(x\) successes in \(n\) independent trials, each with success probability \(p\)

  • We write \(X \sim \text{Binomial}(n,p)\)

tibble(x = 0:20) |>
  mutate(binom1 = dbinom(x, size = 20, prob = 0.5),
         binom2 = dbinom(x, size = 20, prob = 0.1)) |>
  ggplot(aes(x)) + 
  geom_point(aes(y = binom1), color = "blue", size = 4) +
  geom_point(aes(y = binom2), color = "red", size = 4)

Poisson distribution

  • PMF: \(\displaystyle f(x \mid \lambda)= \frac{ e^{-\lambda} \lambda^x}{x!}\); \(x = 0, 1, 2, \dots\) and \(\lambda > 0\)

  • Model for the count of events occurring in a fixed period of time, with rate-of-occurrence parameter \(\lambda\)

  • We write \(X \sim \text{Poisson}(\lambda)\)

tibble(x = 0:10) |> 
  mutate(y = dpois(x, 1)) |> 
  ggplot(aes(x, y)) +
  geom_point(size = 4)

Random number generation

Example: Sampling from a normal distribution

  • rnorm(): generate normal random variables

  • pnorm(): normal cumulative distribution function

  • dnorm(): normal density function

  • qnorm(): normal quantile function

Note: Replace “norm” with the name of another distribution and the same four functions (r, p, d, q) apply.

  • E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, “unif”, etc.

See this manual for more details
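
For instance, with the standard normal (a quick sketch; the values in the comments are approximate):

dnorm(0)     # density at 0: about 0.399
pnorm(1.96)  # P(Z <= 1.96): about 0.975
qnorm(0.975) # the quantile function inverts pnorm: about 1.96
ppois(3, lambda = 1) # same pattern for other distributions, e.g. Poisson CDF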

Random number generation

# these are the defaults for mean and sd
z <- rnorm(1000, mean = 0, sd = 1)
# check: the sample mean is approximately 0
mean(z) 
[1] -0.0209367
# check: the sample sd is approximately 1
sd(z)
[1] 1.001932

Revisiting ECDF

Recall that we can use the ECDF to estimate the true cumulative distribution function

z_ecdf <- ecdf(z)
z_ecdf(0) # should get close to 1/2
[1] 0.513
class(z_ecdf)
[1] "ecdf"     "stepfun"  "function"
normal_tbl <- tibble(z = sort(z)) |> 
  mutate(empirical = z_ecdf(z),
         true = pnorm(z))
normal_tbl
# A tibble: 1,000 × 3
       z empirical    true
   <dbl>     <dbl>   <dbl>
 1 -3.03     0.001 0.00122
 2 -2.99     0.002 0.00140
 3 -2.94     0.003 0.00164
 4 -2.74     0.004 0.00307
 5 -2.66     0.005 0.00388
 6 -2.56     0.006 0.00527
 7 -2.52     0.007 0.00582
 8 -2.51     0.008 0.00609
 9 -2.49     0.009 0.00637
10 -2.44     0.01  0.00744
# ℹ 990 more rows
normal_tbl |> 
  pivot_longer(!z, 
               names_to = "method", 
               values_to = "val") |> 
  ggplot(aes(x = z, y = val, color = method)) +
  geom_line()

Stick breaking problem

If a stick of unit length is broken in two at random, what is the average ratio of the smaller length to the larger?

First… any guesses? (Also, any guess on what the average length of the smaller (or larger) piece is?)

x <- runif(10000)
smaller <- ifelse(x < 0.5, x, 1 - x)
ratio <- smaller / (1 - smaller)
# to view the full distribution: hist(ratio)
mean(ratio) # exact answer: 2 * log(2) - 1
[1] 0.3851465

How would you do this by hand…?
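
One way to get the exact answer by hand: the smaller piece \(U = \min(X, 1 - X)\) of a break point \(X \sim \text{Uniform}(0, 1)\) is uniform on \((0, 1/2)\) with density 2, so

\(\displaystyle E\left[\frac{U}{1-U}\right] = \int_0^{1/2} \frac{u}{1-u} \cdot 2 \, du = 2 \left[ -u - \ln(1-u) \right]_0^{1/2} = 2 \ln 2 - 1 \approx 0.386\)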

Estimating \(\pi\) using Monte Carlo simulation

  • Monte Carlo methods rely on repeated random sampling to obtain numerical results

  • Use randomness to solve problems that might be deterministic in principle

  • Example: Estimating \(\pi\)

    • Simulate random \((x, y)\) points uniformly over a square of side \(2r\) centered at the origin

    • Consider the circle of radius \(r\) inscribed in the square

    • Calculate the ratio of the number of points inside the circle to the total number of points generated (Why? The ratio estimates the area ratio \(\displaystyle \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}\), so multiplying it by 4 estimates \(\pi\))

Estimating \(\pi\) using Monte Carlo simulation

n_points <- 10000
x <- runif(n_points, -1, 1)
y <- runif(n_points, -1, 1)
inside <- ifelse(x^2 + y^2 <= 1, 1, 0)
4 * mean(inside)
[1] 3.122
tibble(x, y, inside) |> 
  ggplot(aes(x, y, color = factor(inside))) +
  geom_point(show.legend = FALSE) +
  coord_equal()

Pseudorandomness and seeds

Same command, different results?

Not surprisingly, we get different sample draws each time we call rnorm()

mean(rnorm(100))
[1] 0.005849515
mean(rnorm(100))
[1] 0.01915132
mean(rnorm(100))
[1] 0.2726232

Is it really random?


Random numbers generated in R (or any language) are not truly random; they are what we call pseudorandom

  • These are numbers generated by computer algorithms that very closely mimic truly random numbers

  • The default algorithm in R is called the Mersenne Twister
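
You can inspect (and change) the generator in use with the base R function RNGkind():

RNGkind() # in recent R versions: "Mersenne-Twister" "Inversion" "Rejection"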

Setting the random seed

  • All pseudorandom number generators depend on what is called a seed value

  • This puts the random number generator in a well-defined state, so that the numbers it generates, from then on, will be reproducible

  • The seed is just an integer, and can be set with set.seed()

  • The reason we set it: so that when someone else runs our simulation code, they can see the same (albeit still random) results that we do

  • Note: set.seed() will be helpful later on for things like cross-validation, \(k\)-means clustering, etc. — basically anything that involves random sampling of the data

Setting the random seed

Same seed, same results

set.seed(1999)
rnorm(3)
[1]  0.73267249 -0.03782971  1.20300914
set.seed(1999)
rnorm(3)
[1]  0.73267249 -0.03782971  1.20300914
set.seed(1999)
rnorm(3)
[1]  0.73267249 -0.03782971  1.20300914

Iteration and simulation

Example: drug effect model

Suppose we have a model for the way a drug affects certain patients

  • All patients will undergo chemotherapy. We believe those who aren't given the drug experience a reduction in tumor size of percentage \(X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\text{mean} = R)\), where \(R \sim \mathrm{Uniform}(0,1)\)

  • And those who were given the drug experience a reduction in tumor size of percentage \(X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\text{mean} = 2)\) (note that the exponential distributions here are parameterized by their mean, matching rate = 1/mean in rexp() below)

Suppose some scientist collaborators are wondering how many patients we would need in each group (drug, no drug) in order to reliably see that the average reduction in tumor size is large…

What would you do?

  • Before: get out your pen and paper, make some approximations
  • Now: just simulate from the model, no approximations

Example: drug effect model

# suppose each group has 50 subjects
set.seed(100)
n_subjects <- 50 
mean_drug <- 2
mean_nodrug <- runif(n_subjects, 0, 1)
x_drug <- 100 * rexp(n_subjects, 1 / mean_drug) 
x_nodrug <- 100 * rexp(n_subjects, 1 / mean_nodrug)

tibble(x_drug, x_nodrug) |> 
  pivot_longer(everything(),
               names_to = "group",
               names_prefix = "x_",
               values_to = "reduction") |> 
  ggplot(aes(x = reduction, y = after_stat(density), 
             color = group)) +
  geom_histogram(aes(fill = group), 
                 alpha = 0.5, color = "black",
                 position = "identity") +
  geom_density(aes(color = group))
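
To actually address the collaborators' sample-size question, one natural sketch (the helper sim_difference() and the candidate group sizes below are ours, purely for illustration) wraps a single simulated study in a function, then repeats it across group sizes:

# one simulated study: difference in average reduction (drug minus no drug)
sim_difference <- function(n_subjects) {
  mean_nodrug <- runif(n_subjects, 0, 1)
  x_drug <- 100 * rexp(n_subjects, 1 / 2)
  x_nodrug <- 100 * rexp(n_subjects, 1 / mean_nodrug)
  mean(x_drug) - mean(x_nodrug)
}

# for each group size, estimate how often the drug group shows the larger reduction
set.seed(101)
tibble(n = c(10, 50, 250)) |> 
  mutate(prob_drug_larger = map_dbl(n, \(n_i) mean(replicate(500, sim_difference(n_i)) > 0)))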

Good practices

Repetition and reproducibility

  • A single simulation run is not always trustworthy (it depends on the situation, of course)

  • In general, simulations should be repeated and aggregate results reported — requires iteration!

  • To make random number draws reproducible, we must set the seed with set.seed()

  • More than this, to make simulation results reproducible, we need to follow good programming practices (see this for example)

  • Gold standard: any time you show a simulation result (a figure, a table, etc.), you have code that can be run (by anyone) to produce exactly the same result

Iteration and simulation

  • Writing a function to complete a single run of your simulation/analysis is often very helpful

  • This allows the simulation itself to be intricate (e.g., many steps, several simulation parameters) while keeping it simple to run

  • Then you can use iteration to run your simulation/analysis over and over again

Iteration and simulation

Example: Revisiting \(k\)-means clustering with the gapminder data, computing the total within-cluster variation for different values of \(k\)

library(dslabs)
clean_gapminder <- gapminder |>
  filter(year == 2011, !is.na(gdp)) |>
  mutate(std_log_gdp = as.numeric(scale(log(gdp), center = TRUE, scale = TRUE)),
         std_life_exp = as.numeric(scale(life_expectancy, center = TRUE, scale = TRUE)))
# function to perform clustering for each value of k
gapminder_kmeans <- function(k) {
  kmeans_results <- clean_gapminder |>
    select(std_log_gdp, std_life_exp) |>
    kmeans(centers = k, nstart = 30)
  
  kmeans_out <- tibble(clusters = k,
                       total_wss = kmeans_results$tot.withinss)
  return(kmeans_out)
}

# number of clusters to search over
n_clusters_search <- 2:12

# iterate over each cluster value to compute total wss
kmeans_search <- n_clusters_search |> 
  map(gapminder_kmeans) |> 
  bind_rows()
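
To choose a value of \(k\), you might then look for an “elbow” in the total within-cluster variation (a quick sketch of that plot):

kmeans_search |> 
  ggplot(aes(x = clusters, y = total_wss)) +
  geom_line() +
  geom_point(size = 3)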

Pre-allocation

Example: When 100 coins are tossed, what is the probability that exactly 50 are heads?

library(tictoc)
n_runs <- 500000
# slow: start with an empty vector and grow it at each iteration
a <- c()
tic()
for (i in 1:n_runs) {
  tosses <- sample(0:1, size = 100, replace = TRUE)
  a[i] <- sum(tosses)
}
toc()

# fast: pre-allocate a vector of the final length, then fill it in
b <- rep(NA, n_runs)
tic()
for (i in 1:n_runs) {
  tosses <- sample(0:1, size = 100, replace = TRUE)
  b[i] <- sum(tosses)
}
toc()

# exact: (factorial(100) / (factorial(50) * factorial(50))) * (1 / 2) ^ 100
mean(b == 50)
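
As an aside, this particular simulation needs no loop at all: the number of heads in 100 fair tosses is \(\text{Binomial}(100, 0.5)\), so a vectorized one-liner (a sketch) gives the same estimate much faster:

mean(rbinom(n_runs, size = 100, prob = 0.5) == 50)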

Pre-allocation

  • Not only do computations take time, memory allocations do too

  • Changing the size of a vector takes just about as long as creating a new vector does

  • Each time the size changes, R needs to reconsider its allocation of memory to the object

  • Never grow a vector element by element inside a loop; pre-allocate it to its final size instead

Exporting and importing data

Reading/writing from/to a file

Sometimes simulations/analyses take a long time to run, and we want to save intermediate or final output, for quick reference later

Introducing the readr package (part of tidyverse; automatically loaded)

  • write_*() functions: exporting data

  • read_*() functions: importing data

Reading/writing from/to a file

  • write_csv() / read_csv(): export / import single R data frames or tibbles in .csv format

  • write_rds() / read_rds(): export / import single R objects (like a vector, matrix, list, data frame, etc.) in .rds format

Note that by default, the file will be written to the working directory (i.e. if you just specify the file name)

example_df <- tibble(x = rnorm(100), y = rnorm(100))
write_csv(example_df, "INSERT PATH/example_df.csv")
df <- read_csv("INSERT PATH/example_df.csv")

example_obj <- matrix(rnorm(25), 5, 5)
write_rds(example_obj, "INSERT PATH/example_obj.rds")
obj <- read_rds("INSERT PATH/example_obj.rds")

Example: saving \(k\)-means clustering results from earlier

# saving to a folder named "data" in the working directory
write_csv(kmeans_search, "data/kmeans_search_results.csv")

File path and working directory

  • To read in a file, you need to use the correct path, which should be relative, NOT absolute (nor a path pointing to a location outside of the project directory) (read more here)

  • The key to getting paths to work is understanding the working directory. In R, use the function getwd() to print it

    • Note: NEVER use setwd() to change working directory. It’s a bad practice. (Here’s why)

File path and working directory

  • Special paths

    • . is the working directory

    • ~ is the home directory (e.g., on Quang’s laptop: /Users/qntkhvn)

    • .. is the parent directory. (e.g., ../steve.csv refers to a file called steve.csv in the directory that is one level above the working directory)

  • Common issue: By default, the working directory for an R Markdown or Quarto document is the directory in which that document is stored. This is NOT necessarily the working directory of your current R session.

  • Use list.files() to see what files are available in the working directory (or any other directory)

list.files()
list.files("~")
