When running SOM, it is sometimes helpful to first run PCA to see the general spread of your dataset. Later you can map your SOM cluster results back onto the PCA and see if your PCA clusters can be further defined by the gene expression patterns resulting from you SOM results.
library(ggplot2)
library(reshape)
library(kohonen)
library(RColorBrewer)
First read in file that came from part 1 allGeneList.csv
. This is a list of genes from all DE analysis in WT that were performed previously. They were all cancatenated, then duplicate genes were removed. In addition the mean was calculated from the replicates of each type.
The first step is to get the data into the right format. First column being the genes, while the subsequent columns are the different libraries (type).
mostDEgenes <- read.csv("../data/allGeneList_WTonly.csv")
head(mostDEgenes)
## type genotype N mean sd se gene
## 2 Ambr wt 3 17.193379 21.125010 12.1965304 Solyc00g005070.1.1
## 3 Aother wt 5 4.523554 1.835113 0.8206874 Solyc00g005070.1.1
## 4 Bmbr wt 4 12.645609 18.655600 9.3277998 Solyc00g005070.1.1
## 5 Bother wt 4 3.461582 1.789494 0.8947468 Solyc00g005070.1.1
## 6 Cmbr wt 6 4.180826 2.818555 1.1506704 Solyc00g005070.1.1
## 7 Cother wt 3 105.966982 168.647913 97.3689180 Solyc00g005070.1.1
mostDEgenes <- mostDEgenes[c(7, 1, 4)] #keep only needed columns (gene, type, mean)
#Change from long to wide data format
mostDEgene.long <- cast(mostDEgenes, gene ~ type, value.var = mean, fun.aggregate = "mean")
## Using mean as value column. Use the value argument to cast to override this choice
mostDEgene.long <- as.data.frame(mostDEgene.long) #transformation.
names(mostDEgene.long) #check
## [1] "gene" "Ambr" "Aother" "Bmbr" "Bother" "Cmbr" "Cother"
# set up for PCA
scale_data <- as.matrix(t(scale(t(mostDEgene.long[c(2:7)]))))
#Principle Component Analysis
pca <- prcomp(scale_data, scale=TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.3989 1.1164 1.0575 0.9617 0.8679 9.21e-16
## Proportion of Variance 0.3262 0.2077 0.1864 0.1541 0.1255 0.00e+00
## Cumulative Proportion 0.3262 0.5339 0.7203 0.8745 1.0000 1.00e+00
pca.scores <- data.frame(pca$x)
I put all the information from each analysis back together throughout the process. Makes visualiztion easier down the line.
# Add back to original so everything is together.
data.val <- cbind(mostDEgene.long, scale_data, pca.scores)
head(data.val)
## gene Ambr Aother Bmbr Bother
## 1 Solyc00g005070.1.1 17.1933786 4.5235545 12.6456091 3.461582
## 2 Solyc00g005080.1.1 16.0214526 10.1276589 7.5621826 8.495561
## 3 Solyc00g005840.2.1 19.7457984 14.0743283 11.4830294 83.513288
## 4 Solyc00g005870.1.1 0.1335935 0.6413574 3.1240810 2.778641
## 5 Solyc00g006470.1.1 2455.5110955 605.8620487 108.0946960 360.481814
## 6 Solyc00g006670.2.1 66.9760013 5.9946868 0.6022608 7.070824
## Cmbr Cother Ambr Aother Bmbr Bother
## 1 4.180826 105.966982 -0.1857292 -0.5008022 -0.29882308 -0.5272113
## 2 8.413141 25.028796 0.5010396 -0.3641074 -0.74069236 -0.6036822
## 3 15.811238 13.981084 -0.2380746 -0.4399334 -0.53216284 2.0315363
## 4 1.780475 12.461880 -0.7372084 -0.6255713 -0.07971921 -0.1556676
## 5 499.447526 390.760289 2.0023986 -0.1524161 -0.73230794 -0.4382806
## 6 11.065297 4.677613 2.0225935 -0.4000473 -0.61427544 -0.3572950
## Cmbr Cother PC1 PC2 PC3 PC4
## 1 -0.5093251 2.0218909 0.4123820 1.9720828 0.62320725 0.23135121
## 2 -0.6157807 1.8232230 0.4094212 1.4972020 1.14022326 0.75216342
## 3 -0.3781133 -0.4432521 -0.9687199 -0.9405744 -0.45944789 0.47066745
## 4 -0.3751246 1.9732912 0.2002039 2.0773175 0.08514379 0.01114138
## 5 -0.2763875 -0.4030066 1.1163748 -1.0607219 0.54496613 0.93810521
## 6 -0.1986042 -0.4523715 1.3269062 -1.0423991 0.32411614 0.97528268
## PC5 PC6
## 1 0.1378721 -1.110223e-15
## 2 -0.1591747 -1.332268e-15
## 3 1.6558976 1.498801e-15
## 4 0.5230315 -9.992007e-16
## 5 -0.4747755 -5.551115e-16
## 6 -0.2106054 -1.665335e-16
By eye you can see my data is clustered into three major clusters.
p <- ggplot(data.val, aes(PC1, PC2))
p + geom_point()
Clustering is performed using the som()
function on the scaled gene expression values.
# Check where the scaled gene expression values are.
names(data.val)
## [1] "gene" "Ambr" "Aother" "Bmbr" "Bother" "Cmbr" "Cother"
## [8] "Ambr" "Aother" "Bmbr" "Bother" "Cmbr" "Cother" "PC1"
## [15] "PC2" "PC3" "PC4" "PC5" "PC6"
#Subset for SOM in a matrix.
#som() only works on matrices NOT dataframes
#subset only the scaled gene expression values
som.data <- as.matrix(data.val[,c(8:13)])
# Set seed, just make sure you keep the same.
# Has to do with the randomization process.
set.seed(2)
#This is where you change the size of the map
som <- som(data=som.data, somgrid(6,6,"hexagonal"))
summary(som)
## som map of size 6x6 with a hexagonal topology.
## Training data included; dimension is 2249 by 6
## Mean distance to the closest unit in the map: 0.4130783
This shows a hundred iterations. Training decreases with iterations and plateaus at around 40 iterations. Ideally you want the the training to reach a minimum plateau.
plot(som, type ="changes")
The the code plot shows each cluster and the node wieght vectors or “codes” associated with each node. These are made up of the original normalized values of the original values used to generate the map. You should see patterns of clustering.
The fan chart in the center of the clusters reveals the characteristics that define how the genes were clustered into each particular cluster. For instance if one cluster has only one large fan piece, say for Bother, this is telling us that most of the genes in this cluster were grouped because of similar normalized gene count value of the Bother region. We do not know the degree, it could mean all these genes are up-regulated or down-regulated in the Bother region, but we do not know which at this point.
plot(som, type = "codes")
This tells you how many genes are in each of the clusters. The count plot can be used as a quality check. Ideally you want a uniform distribution. If there are some peaks in certain areas, this means you should likely increase the map size. If you have empty nodes you should decrease the map size [1].
plot(som, type = "counts")
This is sometimes called the “U-Matrix”, it can help identify further clustering. Areas of low neighbour distance indicate groups of nodes that are similar and the further apart nodes indicate natural “borders” in the map.
plot(som, type="dist.neighbours")
This shows the distribution of each type
#changed to dataframe to extract column names easier.
som$data <- data.frame(som$data)
#This is just a loop that plots the distribution of each tissue type across the map.
for (i in 1:6){
plot(som, type = "property", property = som$codes[,i], main=names(som$data)[i])
print(plot)
}
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x7f9a4b87be78>
## <environment: namespace:graphics>
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x7f9a4b87be78>
## <environment: namespace:graphics>
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x7f9a4b87be78>
## <environment: namespace:graphics>
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x7f9a4b87be78>
## <environment: namespace:graphics>
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x7f9a4b87be78>
## <environment: namespace:graphics>
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x7f9a4b87be78>
## <environment: namespace:graphics>
You can further split the clusters into a smaller set of clusters using hierarchical clustering.
## use hierarchical clustering to cluster the codebook vectors
som_cluster <- cutree(hclust(dist(som$codes)), 5)
# plot these results:
plot(som, type="mapping", bgcol = som_cluster, main = "Clusters")
add.cluster.boundaries(som, som_cluster)
# Attach the hierchal cluster to the larger dataset data.val.
som_clusterKey <- data.frame(som_cluster)
som_clusterKey$unit.classif <- c(1:36)
data.val <- cbind(data.val,som$unit.classif,som$distances)
head(data.val)
## gene Ambr Aother Bmbr Bother
## 1 Solyc00g005070.1.1 17.1933786 4.5235545 12.6456091 3.461582
## 2 Solyc00g005080.1.1 16.0214526 10.1276589 7.5621826 8.495561
## 3 Solyc00g005840.2.1 19.7457984 14.0743283 11.4830294 83.513288
## 4 Solyc00g005870.1.1 0.1335935 0.6413574 3.1240810 2.778641
## 5 Solyc00g006470.1.1 2455.5110955 605.8620487 108.0946960 360.481814
## 6 Solyc00g006670.2.1 66.9760013 5.9946868 0.6022608 7.070824
## Cmbr Cother Ambr Aother Bmbr Bother
## 1 4.180826 105.966982 -0.1857292 -0.5008022 -0.29882308 -0.5272113
## 2 8.413141 25.028796 0.5010396 -0.3641074 -0.74069236 -0.6036822
## 3 15.811238 13.981084 -0.2380746 -0.4399334 -0.53216284 2.0315363
## 4 1.780475 12.461880 -0.7372084 -0.6255713 -0.07971921 -0.1556676
## 5 499.447526 390.760289 2.0023986 -0.1524161 -0.73230794 -0.4382806
## 6 11.065297 4.677613 2.0225935 -0.4000473 -0.61427544 -0.3572950
## Cmbr Cother PC1 PC2 PC3 PC4
## 1 -0.5093251 2.0218909 0.4123820 1.9720828 0.62320725 0.23135121
## 2 -0.6157807 1.8232230 0.4094212 1.4972020 1.14022326 0.75216342
## 3 -0.3781133 -0.4432521 -0.9687199 -0.9405744 -0.45944789 0.47066745
## 4 -0.3751246 1.9732912 0.2002039 2.0773175 0.08514379 0.01114138
## 5 -0.2763875 -0.4030066 1.1163748 -1.0607219 0.54496613 0.93810521
## 6 -0.1986042 -0.4523715 1.3269062 -1.0423991 0.32411614 0.97528268
## PC5 PC6 som$unit.classif som$distances
## 1 0.1378721 -1.110223e-15 30 0.02231733
## 2 -0.1591747 -1.332268e-15 36 0.29893594
## 3 1.6558976 1.498801e-15 1 0.06933208
## 4 0.5230315 -9.992007e-16 29 0.24555383
## 5 -0.4747755 -5.551115e-16 21 0.05962820
## 6 -0.2106054 -1.665335e-16 21 0.04081959
#Merge data.val with som_clusterKey
##change data.val to match som_cluster key
names(data.val)[20] <- "unit.classif"
data.val <- merge(data.val, som_clusterKey, by.x = "unit.classif" ) #ignore warning, this is what you want. You are essentially filling in the empties with the value from som_clusterKey
## Warning in merge.data.frame(data.val, som_clusterKey, by.x =
## "unit.classif"): column names 'Ambr', 'Aother', 'Bmbr', 'Bother', 'Cmbr',
## 'Cother' are duplicated in the result
# Write out your data at the end to save your results of the SOM
#write.table(data.val, file="../data/analysis1.som.data.txt")
# read in data
# data.val <- read.table("../data/analysis1.som.data.txt",header=TRUE)
names(data.val)
## [1] "unit.classif" "gene" "Ambr" "Aother"
## [5] "Bmbr" "Bother" "Cmbr" "Cother"
## [9] "Ambr" "Aother" "Bmbr" "Bother"
## [13] "Cmbr" "Cother" "PC1" "PC2"
## [17] "PC3" "PC4" "PC5" "PC6"
## [21] "som$distances" "som_cluster"
dim(data.val)
## [1] 2249 22
p <- ggplot(data.val, aes(PC1, PC2, colour=factor(unit.classif)))
p + geom_point() + theme_bw()
Notice I am plot.data$unit.classif, which is the clusters generated when we used som()
. You can also use the assignments made from the hierarchical clustering.
p <- ggplot(data.val, aes(PC1, PC2, colour=factor(som_cluster)))
p + geom_point() + theme_bw()
The size of the map is something that may cause differences in the genes that are clustered. The only way to see how this is affects what we see is to compare the clusters of a small and large map.
Make a smaller SOM map using this data and visualize the differences on the PCA.
Using a small map size (3,2), I found they cluster according to tissue type. This makes the interpretation of the results pretty straight forward. My only worry is that the map might not be large enough, considering the Kohenen Package Manual Vignette [1] suggests that you pick the size of the map based on count distribution, the goal being an even distribution, with no “peak” counts in any one cluster while also having no empty clusters.
References: