Self Organizing Maps (SOM): Example using RNAseq reads

Part 2b: Super Self Organizing Maps

Super self organizing maps allows you to add another level to your analysis. For example in this project, I ran my RNAseq analysis on two different genotypes, tf-2 and wild type. Instead of clustering each gene one cluster, it forces the same gene from each genotype into the same cluster. Therefore, each gene is represented twice in this superSOM analysis (one for each genotype).

Required Libraries

library(ggplot2)
library(reshape)
library(kohonen)
library(RColorBrewer)

Read in data and clean up.

mostDEgenes <- read.csv("../data/allGeneListBothGenotypes.csv")

#keep only needed columns (gene, genotype, type, mean)
mostDEgenes <- mostDEgenes[c(7, 2, 1, 4)] 

#Change from long to wide data format
mostDEgene.long <- cast(mostDEgenes, genotype + gene ~ type, value.var = mean, fun.aggregate = "mean")  
## Using mean as value column.  Use the value argument to cast to override this choice
# to keep attributes associated
mostDEgene.long <- as.data.frame(mostDEgene.long)

PCA

mostDEgene.long <- as.data.frame(mostDEgene.long) 
names(mostDEgene.long)
## [1] "genotype" "gene"     "Ambr"     "Aother"   "Bmbr"     "Bother"  
## [7] "Cmbr"     "Cother"
scale_data <- as.matrix(t(scale(t(mostDEgene.long[c(3:8)]))))
head(scale_data) 
##         Ambr     Aother       Bmbr     Bother       Cmbr      Cother
## 1  0.6682306 -1.5249843 -0.8142005  1.0677854  0.6500786 -0.04690986
## 2 -0.3039380 -0.3362605  2.0337657 -0.4956094 -0.3798599 -0.51809787
## 3  0.4553713 -0.8054291  1.6148298 -0.5755183  0.3508907 -1.04014435
## 4 -0.5191322  1.6854463 -0.9561593 -0.1061458 -0.7444119  0.64040284
## 5 -0.1488347 -0.5336037  2.0228620 -0.4697011 -0.4084674 -0.46225509
## 6 -0.4345752 -0.4641784  2.0365771 -0.4264810 -0.2745967 -0.43674578
#Principle Component Analysis
pca <- prcomp(scale_data, scale=TRUE) 

summary(pca) 
## Importance of components:
##                           PC1    PC2   PC3    PC4    PC5       PC6
## Standard deviation     1.3056 1.1248 1.036 1.0207 0.9563 2.895e-15
## Proportion of Variance 0.2841 0.2109 0.179 0.1736 0.1524 0.000e+00
## Cumulative Proportion  0.2841 0.4949 0.674 0.8476 1.0000 1.000e+00
pca.scores <- data.frame(pca$x)

data.val <- cbind(mostDEgene.long, scale_data, pca.scores) 

Visualizing the PCA

Looks to be three major clusters.

p <- ggplot(data.val, aes(PC1, PC2)) 
p + geom_point()