R Lab (day 3): Data exploration, Principal Component Analysis

Download datasets here or from Canvas.

R script: Code

Lab Lecture: Slides

Exercise 1: Food

In the first exercise, we explore a low-dimensional dataset, Food.txt.

(This is not a genomics example, but it is useful for illustrating important concepts in data exploration and dimensionality reduction via PCA.)

Food.txt contains data on the consumption of a variety of food categories in European countries. First, load the dataset.

food <- read.table('data/Food.txt', header=T)
# we change the name from pulses to a more common name, legume
colnames(food)[7] <- 'Legume'
head(food) # print first 6 lines 
               Meat Pigs Eggs Milk Fish Cereals Legume Fruit
Albania        10.1  1.4  0.5  8.9  0.2    42.3    5.5   1.7
Austria         8.9 14.0  4.3 19.9  2.1    28.0    1.3   4.3
Belg.Lux.      13.5  9.3  4.1 17.5  4.5    26.6    2.1   4.0
Bulgaria        7.8  6.0  1.6  8.3  1.2    56.7    3.7   4.2
Czechoslovakia  9.7 11.4  2.8 12.5  2.0    34.3    1.1   4.0
Denmark        10.6 10.8  3.7 25.0  9.9    21.9    0.7   2.4

Explore the dataset

The first thing to do after loading the dataset is to get to know its content better.

Try to find out the following information about the data:

  • the dimension (numbers of columns and rows)
  • the column names, and row names

We can also look at one food in particular, Fish: find the mean, maximum and minimum values for this variable. This is a way of investigating a single column of the data.

dim(food)
[1] 25  8
colnames(food)
[1] "Meat"    "Pigs"    "Eggs"    "Milk"    "Fish"    "Cereals" "Legume" 
[8] "Fruit"  
rownames(food)
 [1] "Albania"        "Austria"        "Belg.Lux."      "Bulgaria"      
 [5] "Czechoslovakia" "Denmark"        "East.Germany"   "Finland"       
 [9] "France"         "Greece"         "Hungary"        "Ireland"       
[13] "Italy"          "Netherlands"    "Norway"         "Poland"        
[17] "Portugal"       "Romania"        "Spain"          "Sweden"        
[21] "Switzerland"    "United.Kingdom" "USSR"           "West.Germany"  
[25] "Yugoslavia"    
# explore one variable
mean(food$Fish)
[1] 4.284
max(food$Fish)
[1] 14.2
min(food$Fish)
[1] 0.2

Now we investigate the data by row. Each row is a country. We can extract ('filter') the data for Norway, for example.

It is also easy to get data for several countries at once, such as Norway, Denmark and Sweden.

# subsetting
food[rownames(food) == 'Norway',]
       Meat Pigs Eggs Milk Fish Cereals Legume Fruit
Norway  9.4  4.7  2.7 23.4  9.7      23    1.6   2.7
food[rownames(food) %in% c('Norway', 'Denmark', 'Sweden'),]
        Meat Pigs Eggs Milk Fish Cereals Legume Fruit
Denmark 10.6 10.8  3.7 25.0  9.9    21.9    0.7   2.4
Norway   9.4  4.7  2.7 23.4  9.7    23.0    1.6   2.7
Sweden   9.9  7.8  3.5 24.7  7.5    19.5    1.4   2.0

Now we can make some simple visualizations. We can look at the distribution for each variable by making a histogram.

hist(food$Fish, main = 'Histogram of Fish')

hist(food$Fish, breaks = 20, main = 'Histogram of Fish')

hist(food$Meat, main = 'Histogram of Meat')

A box plot is another commonly used tool for presenting the distributions of several variables together. This way you can compare the different foods.

# las rotates the axis text 90 degrees
boxplot(food, las = 2,
        main = 'Food consumption across European countries')

We can look at pairs of foods to see whether there is some kind of relationship between them. For example, do countries that consume more fish also consume more meat? What about cereals and fruit?

A scatter plot with one food on its x-axis and another on the y-axis can be useful.

We can add the country names on top of the scatter plot to provide more information.

# two variables
plot(food$Fish, food$Meat, pch=20, main = 'Fish vs Meat')

# to add country: run this line as well
# text(food$Fish, food$Meat, labels = rownames(food))

# choose another pair of foods, add country labels
plot(food$Cereals, food$Fruit, main = 'Cereals vs Fruit')
text(food$Cereals, food$Fruit, labels = rownames(food))

You can make pair-wise scatter plots for all the food pairs. However, when you have many variables, these plots become harder to read.

# pair-wise scatter
plot(food, main = 'Pair-wise scatter plot')
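
If you prefer a numeric summary of these pairwise relationships, a correlation matrix is a compact companion to the pair-wise scatter plot (this is an extra step of ours, not required by the exercise):

# pair-wise correlations between foods, rounded to 2 digits
round(cor(food), digits = 2)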

Principal component analysis

We can use principal component analysis (PCA) to explore the dataset and reveal structure beyond the original data points.

The command we use is prcomp(data, scale).

  • data is the dataset on which to carry out PCA. Sometimes you need some preprocessing first, such as centering and scaling.
  • scale asks the program to scale the data for us automatically. We set it to TRUE.

After running the command, you can print out the results using summary.

# need to scale the data
pc_food <- prcomp(food, scale=TRUE)
# pc_food
summary(pc_food)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.9251 1.2073 1.0595 0.9315 0.57322 0.52889 0.35617
Proportion of Variance 0.4632 0.1822 0.1403 0.1085 0.04107 0.03497 0.01586
Cumulative Proportion  0.4632 0.6454 0.7857 0.8942 0.93527 0.97024 0.98609
                           PC8
Standard deviation     0.33354
Proportion of Variance 0.01391
Cumulative Proportion  1.00000
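
(Optional) To convince yourself what scale = TRUE does, you can compare with running prcomp() on data you have scaled yourself using scale(); the standard deviations of the components should agree. This check is our addition, not part of the exercise.

# optional: manual scaling should give the same result as scale = TRUE
pc_food_manual <- prcomp(scale(food))
all.equal(pc_food_manual$sdev, pc_food$sdev)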

Examine the loading (rotation). This is a \(p \times n_{pc}\) matrix where each row corresponds to one feature of the original data, and each column corresponds to one principal component (PC).

loading_food <- pc_food$rotation
# print out the result, but only keep 2 digits
round(loading_food, digits = 2)
          PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8
Meat    -0.33 -0.05 -0.20 -0.72 -0.48 -0.20 -0.12 -0.21
Pigs    -0.31  0.15  0.68  0.20 -0.06 -0.06 -0.37 -0.48
Eggs    -0.44 -0.07  0.23 -0.26  0.27  0.53  0.57 -0.06
Milk    -0.41  0.15 -0.36 -0.03  0.70 -0.38 -0.15 -0.13
Fish    -0.13 -0.67 -0.32  0.40 -0.13  0.02  0.11 -0.49
Cereals  0.45  0.29  0.05 -0.13  0.06 -0.32  0.52 -0.56
Legume   0.43 -0.17 -0.05 -0.36  0.34  0.46 -0.46 -0.32
Fruit    0.13 -0.62  0.45 -0.25  0.24 -0.46  0.06  0.23

Scores (projections) are produced by multiplying the (centered and scaled) \(n \times p\) data matrix with the \(p \times n_{pc}\) loading matrix. The result is an \(n \times n_{pc}\) matrix whose rows correspond to the rows of the original data (country names), and whose columns are the principal components.

You can visualize the points using a few selected PCs rather than the original data (next section).

scores_food <- pc_food$x
round(scores_food, digits = 2)
                 PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8
Albania         2.94  1.40 -1.60 -0.52 -1.02  0.22 -0.73  0.52
Austria        -1.61  0.65  1.60  0.30  0.41  0.11  0.19 -0.03
Belg.Lux.      -1.43 -0.13  0.18 -0.70 -0.46  0.27  0.28 -0.08
Bulgaria        2.70  1.02  0.32 -0.10 -0.50 -0.63  0.69 -0.26
Czechoslovakia -0.23  0.78  1.09  0.36 -0.80 -0.38  0.15  0.17
Denmark        -2.36 -0.34 -0.72  1.25 -0.11  0.07  0.08 -0.73
East.Germany   -1.03 -0.04  1.01  1.07 -0.82  0.55  0.38  0.20
Finland        -1.49  0.92 -2.26  0.91  0.87 -0.55  0.00  0.21
France         -1.46 -1.17  0.27 -1.73 -0.78 -1.10 -0.28 -0.40
Greece          1.97 -1.50 -0.65 -1.44  1.16  0.21 -0.10 -0.52
Hungary         1.50  0.86  1.84  0.26  0.44  0.83 -0.45 -0.36
Ireland        -2.46  0.84 -0.06 -0.92  0.28  0.30  0.20  0.03
Italy           1.22 -0.88  0.39 -0.70  0.43 -0.24  0.37  0.53
Netherlands    -1.73  0.69  0.94  0.40  0.47 -0.02 -0.62  0.03
Norway         -0.94 -0.66 -1.85  1.17 -0.02 -0.05  0.12  0.10
Poland          0.22 -0.20  1.22  0.45  0.75 -1.10  0.18  0.34
Portugal        1.83 -3.70  0.15  1.49 -0.64 -0.15 -0.35 -0.19
Romania         2.68  1.32 -0.10  0.24  0.04  0.19 -0.16 -0.30
Spain           1.71 -2.26  0.19 -0.27  0.44  0.77  0.23  0.63
Sweden         -1.81  0.04 -1.17  0.95  0.12  0.40 -0.05  0.03
Switzerland    -1.24  0.18  0.26 -0.79  0.21 -0.69 -0.61  0.22
United.Kingdom -1.78 -0.21 -0.96 -2.03 -0.38  0.78  0.31 -0.08
USSR            1.23  0.86 -0.90 -0.05 -0.12 -0.25  0.32 -0.06
West.Germany   -2.01  0.20  0.88  0.03 -0.22  0.41 -0.29  0.24
Yugoslavia      3.57  1.35 -0.05  0.37  0.26  0.07  0.13 -0.24

(Optional) You can double check that the scores are indeed the product of the original data (after scaling) with the loadings. Take Albania, for example (the first row of the data). Its score for PC1 is 2.94. Let us try to reproduce this number from the formula.

food[1,] 
        Meat Pigs Eggs Milk Fish Cereals Legume Fruit
Albania 10.1  1.4  0.5  8.9  0.2    42.3    5.5   1.7
# scale the data
food_s <- scale(food)
round(food_s[1,], digits = 2) # after centering and scaling
   Meat    Pigs    Eggs    Milk    Fish Cereals  Legume   Fruit 
   0.08   -1.83   -2.24   -1.16   -1.20    0.92    1.22   -1.35 
# loading for PC1 is the first column
round(loading_food[,1], digits = 2)
   Meat    Pigs    Eggs    Milk    Fish Cereals  Legume   Fruit 
  -0.33   -0.31   -0.44   -0.41   -0.13    0.45    0.43    0.13 
# let us multiply element by element, then sum together
prod_by_element <- as.numeric(food_s[1,]) * loading_food[, 1]
round(prod_by_element, digits = 2)
   Meat    Pigs    Eggs    Milk    Fish Cereals  Legume   Fruit 
  -0.03    0.58    1.00    0.47    0.15    0.42    0.53   -0.17 
# sum all elements together
sum(prod_by_element)
[1] 2.943309
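
The same check works for the whole dataset at once: the matrix product of the scaled data and the loading matrix should reproduce the full score matrix up to numerical precision. This is a small sketch of ours, not part of the original script.

# matrix version of the check above; the difference should be ~0
max(abs(food_s %*% loading_food - scores_food))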

Another important result from the prcomp() command is the proportion of variance explained (PVE). Together, all PCs explain 100% of the variance; the first PC explains the largest proportion (46%), the second PC the second largest (18.2%), and so on.

You can find the variances (the squared standard deviations, sdev) using the following commands. Compare with the results from summary() to see whether they match.

# variance explained by each PC
pc_food_var <- pc_food$sdev^2
# proportion
pc_food_pve <- pc_food_var/sum(pc_food_var)
# print out, keep 3 digits
round(pc_food_pve, digits = 3)
[1] 0.463 0.182 0.140 0.108 0.041 0.035 0.016 0.014
# cumulative of 1st, 2nd, ... 8th PC
cumsum(pc_food_pve)
[1] 0.4632302 0.6454168 0.7857372 0.8941976 0.9352708 0.9702365 0.9860940
[8] 1.0000000

Visualize PCA results

A biplot of the principal components displays two PCs together with their labels. By default, it plots the first two PCs, but you can also select, for example, PC1 and PC3 with the choices = c(1,3) argument.

The country labels show where the original data points are projected onto PC1 and PC2: the closer together two points are, the more similar those countries are. The arrows, pointing in different directions, give additional information about in which respects the countries are similar.

# by default, pc1 pc2
# set x-axis limit
# same as biplot(pc_food, choices = c(1,2)) 
biplot(pc_food, xlim = c(-0.5, 0.5))

# choose pc1 and pc3
biplot(pc_food, choices = c(1,3)) 

Now we can visualize the variance explained by the PCs and compare with the variances of the original data features (columns). We use bar plots of the variances already computed in the previous section.

It can be seen that the first PC corresponds to the largest variance explained, the second PC corresponds to the second largest variance explained, etc.

On the other hand, the original data features have different variances, with no obvious pattern.

par(mfrow = c(1, 2))
# variance of the PCs
barplot(pc_food_var, las=2,
        main='Principal components', ylab='Variances',
        names.arg = paste('PC ',1:length(pc_food_var),sep=''))

# variance of the original data variables
# apply command here computes sd by column
barplot(apply(food, 2, sd)^2, las=2, main='Original Variables', ylim=c(0,150), ylab='Variances')

Lastly, you can also visualize the cumulative variance explained.

par(mfrow = c(1,1))
# PVE
plot(cumsum(pc_food_pve), type='b', axes=F, xlab='number of components',
     ylab='contribution to total variance', ylim=c(0,1))
abline(h=1, col='blue')
abline(h=0.8, lty=2, col='blue')
box()
axis(2,at=0:10/10,labels=0:10/10)
axis(1,at=1:ncol(food),labels=1:ncol(food),las=2)

Exercise 2: NCI60

Now we move to a typical large-scale biological data set.

We have already seen the NCI60 cancer cell line microarray data set, consisting of 6830 gene expression measurements on 64 cancer cell lines.

library(ISLR)
nci.labs <- NCI60$labs # Sample labels (tissue type)
nci.data <- NCI60$data # Gene expression data set

nci.data contains the gene expression levels measured for 6830 genes. We can find the dimensions with the dim() command.

nci.labs contains the labels for the cancer types. Note that some cancer types have more than one cell line; we can get a quick overview using table(), which counts how many times each cancer type appears in the labels.

Since the focus of today's topic is unsupervised learning, we do not need to pay too much attention to the labels. However, we can use the labels at the end to check whether PCA and clustering produce meaningful results.

dim(nci.data)
[1]   64 6830
table(nci.labs)
nci.labs
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
          7           5           7           1           1           6 
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
          1           1           8           9           6           2 
      RENAL     UNKNOWN 
          9           1 
# what if I would like to compute the mean of each gene within each tissue type?
tissue.means <- apply(nci.data, 2, function(x){tapply(x, nci.labs, mean)})
dim(tissue.means)
[1]   14 6830
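
To see what this nested apply()/tapply() call does, you can check one entry by hand; BREAST and gene (column) 1 below are just an arbitrary choice for illustration.

# the BREAST entry for gene 1 should equal the plain mean over BREAST cell lines
tissue.means["BREAST", 1]
mean(nci.data[nci.labs == "BREAST", 1])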

You can and should carry out some exploratory analysis on the data, such as finding out the means and extreme values; this is left for home practice.
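
As a starting point for that home practice, a minimal sketch could look like the following (the particular summaries chosen here are our suggestion, not prescribed by the exercise):

# overall range of the expression values
range(nci.data)
# distribution of the per-gene means
gene_means <- colMeans(nci.data)
hist(gene_means, main = 'Mean expression per gene')
# per-sample (cell line) means
summary(rowMeans(nci.data))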

Carry out PCA

This dataset is high-dimensional: the number of columns (features) is much larger than the number of rows (measurements). It is impractical to search for hidden structure by looking at each feature individually, and PCA is a useful tool to move forward.

We use prcomp() to produce the PCA results. Set scale = TRUE to make sure the data is standardized, so that each column has standard deviation equal to 1.

# PCA analysis after scaling the variables to standard deviation one:
pr.out <- prcomp(nci.data, scale=TRUE)

Print the summary output with summary(). Note that the result is quite long, so we do not print everything here; you can execute the line yourself to see what it does.

summary(pr.out)

Now we investigate the variance explained by the PCs. The y-axis in the plots below shows the percentage of variance explained (from 0 to 100%).

pr.var <- pr.out$sdev^2
pve <- pr.var/sum(pr.var)
pve <- 100*pve # display percentage

par(mfrow=c(1,2))
plot(pve,  type="o", ylab="PVE (%)", xlab="Principal Component", col="blue")
plot(cumsum(pve), type="o", ylim = c(0, 100),ylab="Cumulative PVE (%)", xlab="Principal Component", col="brown3")

How many PCs to keep?

The maximum number of PCs computed in principal component analysis is \(\min\{n, p\}\). This means that, for a high-dimensional problem like this one, we have at most 64 PCs.
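
You can confirm this directly from the dimensions of the score matrix:

# prcomp() returns min(n, p) PCs; here that is 64
dim(pr.out$x)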

For the purpose of dimensionality reduction, we do not need to keep all the PCs. We can check how many PCs are needed to explain at least a certain amount of the variance. For example, it turns out that the first 24 PCs and the first 32 PCs explain 70% and 80% of the variance, respectively. We can mark these thresholds on the plots from before.

mysel70 <- which(cumsum(pve) > 70)[1] # explains 70% 
mysel80 <- which(cumsum(pve) > 80)[1] # explains 80% 
c(mysel70, mysel80)
[1] 24 32
par(mfrow=c(1,2)) # plot contains two smaller plots next to each other
plot(pve,  type="o", ylab="PVE", xlab="Principal Component", col="blue")
abline(v = mysel80)
abline(v = mysel70, col=3)
plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="brown3")
abline(v = mysel80)
abline(h = 80)
abline(v = mysel70, col=3)
abline(h = 70, col=3)

If we decide to keep only the principal components that together explain 70% of the variance, we end up with 24 components, which we can analyse further to better understand the relationships between the variables. For simplicity, we only look at the first few components.

We plot the first few principal component score vectors to visualize the results. The observations (cell lines) corresponding to a given cancer type will be plotted in the same colour.

Cols=function(vec){
  cols=rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}

# Plot the first vs second and first vs third principal component score vectors,
# with colors associated to labels (using the Cols() helper function)
par(mfrow=c(1,2))
plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19,xlab="PC 1",ylab=" PC 2")
plot(pr.out$x[,c(1,3)], col=Cols(nci.labs), pch=19,xlab="PC 1",ylab=" PC 3")
legend('topleft', col=rainbow(length(unique(nci.labs))), legend=unique(nci.labs), bty='n', lwd=2, cex=.6)

Why not use biplot?

You might have realized that the plots above are essentially what biplot did in the first exercise: plotting PC1 and PC2 against each other. The reason we do not use it here is simple: by default, biplot also displays the loading vectors (rotation). Given that we have 6830 genes, one loading vector per gene would make the figure unreadable.

You can try to remove the loading vectors from biplot as an exercise.
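
One hint if you attempt it: biplot() passes extra arguments on to biplot.default(), whose var.axes argument suppresses the arrows. Note that the variable labels are still drawn, so with 6830 genes the score plots above remain the more readable option.

# arrows suppressed; the gene labels are still drawn, though
biplot(pr.out, var.axes = FALSE)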

Exercise 3: Gene expression data

(CH12Ex13 from statistical learning)

Consider again the gene expression data set "Ch12Ex13.csv" (which can also be found on the book website), consisting of 40 tissue samples with measurements on 1,000 genes.

The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

Load in the data using read.csv(). You will need to set header=FALSE.

Alternatively: load in the data using “Import dataset” in the upper right window, and click “no” on the “Heading” option.

Perform a PCA of these data and visualize the results.

Note: remember to check whether the variables (genes) are in the columns of the dataset before running the PCA. If they are not, use t() to transpose the dataset.

# set the path to your own!
exp.data <-  read.csv("data/Ch12Ex13.csv",header=FALSE)

# I want each row to represent a sample, and each column a gene
exp.data <- t(exp.data)
dim(exp.data)
[1]   40 1000
# should have n=40 samples/rows, and 1000 columns --> OK!
groups <- c(rep(1,20), rep(2,20)) # group variable

Carry out PCA

# PCA
pr.exp <- prcomp(exp.data, scale=TRUE)

# Plot proportion of variance explained
pr.var <- pr.exp$sdev^2
pve <- pr.var/sum(pr.var)
pve <- 100*pve
par(mfrow=c(1,2))
plot(pve, type="o", ylab="PVE", xlab="Principal Component", col="blue")
plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="red")

It looks like most of the principal components are needed to explain the data well. Maybe we can decide to keep 25-30 components?
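
You can make that choice precise in the same way as in Exercise 2, by asking how many components are needed to pass a given threshold (the 70% and 80% cutoffs below simply reuse the ones from before):

# number of components needed to explain 70% and 80% of the variance
c(which(cumsum(pve) > 70)[1], which(cumsum(pve) > 80)[1])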

We can also plot some of the first principal components.

# Remember to use the helper function to get colours
Cols=function(vec){
  cols=rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}

par(mfrow=c(1,2)) # plot-window has two small plots next to each other
plot(pr.exp$x[,1:2], col=Cols(groups), pch=19, xlab="PC 1", ylab=" PC 2")
plot(pr.exp$x[,c(1,3)], col=Cols(groups), pch=19,xlab="PC 1",ylab=" PC 3")
legend('topleft', col=rainbow(length(unique(groups))), legend=paste('group ',unique(groups),sep=''), bty='n', lwd=2, cex=.6)