Self Organizing Maps (SOM): Example using RNAseq reads

Part 2: Running PCA and SOM

Principle Component Analysis

When running SOM, it is sometimes helpful to first run PCA to see the general spread of your dataset. Later you can map your SOM cluster results back onto the PCA and see if your PCA clusters can be further defined by the gene expression patterns resulting from you SOM results.

Required Libraries

library(ggplot2)
library(reshape)
library(kohonen)
library(RColorBrewer)

Read in data

First read in file that came from part 1 allGeneList.csv. This is a list of genes from all DE analysis in WT that were performed previously. They were all cancatenated, then duplicate genes were removed. In addition the mean was calculated from the replicates of each type.

The first step is to get the data into the right format. First column being the genes, while the subsequent columns are the different libraries (type).

mostDEgenes <- read.csv("../data/allGeneList_WTonly.csv")
head(mostDEgenes)
##     type genotype N       mean         sd         se               gene
## 2   Ambr       wt 3  17.193379  21.125010 12.1965304 Solyc00g005070.1.1
## 3 Aother       wt 5   4.523554   1.835113  0.8206874 Solyc00g005070.1.1
## 4   Bmbr       wt 4  12.645609  18.655600  9.3277998 Solyc00g005070.1.1
## 5 Bother       wt 4   3.461582   1.789494  0.8947468 Solyc00g005070.1.1
## 6   Cmbr       wt 6   4.180826   2.818555  1.1506704 Solyc00g005070.1.1
## 7 Cother       wt 3 105.966982 168.647913 97.3689180 Solyc00g005070.1.1
mostDEgenes <- mostDEgenes[c(7, 1, 4)] #keep only needed columns (gene, type, mean)

#Change from long to wide data format
mostDEgene.long <- cast(mostDEgenes, gene ~ type, value.var = mean, fun.aggregate = "mean")  
## Using mean as value column.  Use the value argument to cast to override this choice
mostDEgene.long <- as.data.frame(mostDEgene.long) #transformation. 
names(mostDEgene.long) #check
## [1] "gene"   "Ambr"   "Aother" "Bmbr"   "Bother" "Cmbr"   "Cother"
# set up for PCA
scale_data <- as.matrix(t(scale(t(mostDEgene.long[c(2:7)]))))

#Principle Component Analysis
pca <- prcomp(scale_data, scale=TRUE) 

summary(pca) 
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5      PC6
## Standard deviation     1.3989 1.1164 1.0575 0.9617 0.8679 9.21e-16
## Proportion of Variance 0.3262 0.2077 0.1864 0.1541 0.1255 0.00e+00
## Cumulative Proportion  0.3262 0.5339 0.7203 0.8745 1.0000 1.00e+00
pca.scores <- data.frame(pca$x)

I put all the information from each analysis back together throughout the process. Makes visualiztion easier down the line.

# Add back to original so everything is together.
data.val <- cbind(mostDEgene.long, scale_data, pca.scores) 
head(data.val)
##                 gene         Ambr      Aother        Bmbr     Bother
## 1 Solyc00g005070.1.1   17.1933786   4.5235545  12.6456091   3.461582
## 2 Solyc00g005080.1.1   16.0214526  10.1276589   7.5621826   8.495561
## 3 Solyc00g005840.2.1   19.7457984  14.0743283  11.4830294  83.513288
## 4 Solyc00g005870.1.1    0.1335935   0.6413574   3.1240810   2.778641
## 5 Solyc00g006470.1.1 2455.5110955 605.8620487 108.0946960 360.481814
## 6 Solyc00g006670.2.1   66.9760013   5.9946868   0.6022608   7.070824
##         Cmbr     Cother       Ambr     Aother        Bmbr     Bother
## 1   4.180826 105.966982 -0.1857292 -0.5008022 -0.29882308 -0.5272113
## 2   8.413141  25.028796  0.5010396 -0.3641074 -0.74069236 -0.6036822
## 3  15.811238  13.981084 -0.2380746 -0.4399334 -0.53216284  2.0315363
## 4   1.780475  12.461880 -0.7372084 -0.6255713 -0.07971921 -0.1556676
## 5 499.447526 390.760289  2.0023986 -0.1524161 -0.73230794 -0.4382806
## 6  11.065297   4.677613  2.0225935 -0.4000473 -0.61427544 -0.3572950
##         Cmbr     Cother        PC1        PC2         PC3        PC4
## 1 -0.5093251  2.0218909  0.4123820  1.9720828  0.62320725 0.23135121
## 2 -0.6157807  1.8232230  0.4094212  1.4972020  1.14022326 0.75216342
## 3 -0.3781133 -0.4432521 -0.9687199 -0.9405744 -0.45944789 0.47066745
## 4 -0.3751246  1.9732912  0.2002039  2.0773175  0.08514379 0.01114138
## 5 -0.2763875 -0.4030066  1.1163748 -1.0607219  0.54496613 0.93810521
## 6 -0.1986042 -0.4523715  1.3269062 -1.0423991  0.32411614 0.97528268
##          PC5           PC6
## 1  0.1378721 -1.110223e-15
## 2 -0.1591747 -1.332268e-15
## 3  1.6558976  1.498801e-15
## 4  0.5230315 -9.992007e-16
## 5 -0.4747755 -5.551115e-16
## 6 -0.2106054 -1.665335e-16

Visualizing the PCA

By eye you can see my data is clustered into three major clusters.

p <- ggplot(data.val, aes(PC1, PC2)) 
p + geom_point()