Clustering was not intuitive to me when I first started hearing about it. Now it is a go-to tool of mine for understanding my data. It only really made sense once I learned how to visualize my data. Here I present a workflow for approaching a clustering problem using the kohonen package.
Clustering: Intuitively, clustering is the problem of partitioning a finite set of points in a multidimensional space into classes (called clusters) so that (1) points belonging to the same class are similar and (2) points belonging to different classes are dissimilar.
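To make those two conditions concrete, here is a minimal sketch on made-up data. It uses base R's kmeans() as a stand-in clustering method (not the SOM used later in this post), and the points, seed, and variable names are all assumptions for illustration.

```r
set.seed(8)
# Two made-up groups of 2-D points, centered near (0, 0) and (5, 5)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# Partition the points into two clusters
fit <- kmeans(pts, centers = 2, nstart = 10)

# Pairwise distances between all points, and a matrix flagging
# which pairs ended up in the same cluster
d <- as.matrix(dist(pts))
same <- outer(fit$cluster, fit$cluster, "==")

# Condition 1: points in the same cluster are close together
within_mean <- mean(d[same & upper.tri(d)])
# Condition 2: points in different clusters are far apart
between_mean <- mean(d[!same & upper.tri(d)])

c(within = within_mean, between = between_mean)
```

If the partition is a good one, the average within-cluster distance should come out much smaller than the average between-cluster distance.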
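# Load libraries (install any that are missing first)
library(RCurl) # to access datafile
library(dplyr) # for %>% and summarise_all()
library(ggplot2) # for the plots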
RCurl is a useful tool for accessing data files on GitHub.
Tip: If you plan on using this tool again, make sure you use the “raw” data URL.
x <- getURL("")
titanic <- read.csv(text = x)
## Write out to save the dataset, so you can access it without internet.
# write.csv(titanic, "./titanic.csv", row.names = FALSE)
# titanic <- read.csv("./titanic.csv")
You can read about the data here.
## pclass survived name sex
## 1 1 1 Allen, Miss. Elisabeth Walton female
## 2 1 1 Allison, Master. Hudson Trevor male
## 3 1 0 Allison, Miss. Helen Loraine female
## 4 1 0 Allison, Mr. Hudson Joshua Creighton male
## 5 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female
## 6 1 1 Anderson, Mr. Harry male
## age sibsp parch ticket fare cabin embarked boat body
## 1 29.0000 0 0 24160 211.3375 B5 S 2 NA
## 2 0.9167 1 2 113781 151.5500 C22 C26 S 11 NA
## 3 2.0000 1 2 113781 151.5500 C22 C26 S NA
## 4 30.0000 1 2 113781 151.5500 C22 C26 S 135
## 5 25.0000 1 2 113781 151.5500 C22 C26 S NA
## 6 48.0000 0 0 19952 26.5500 E12 S 3 NA
## home.dest
## 1 St Louis, MO
## 2 Montreal, PQ / Chesterville, ON
## 3 Montreal, PQ / Chesterville, ON
## 4 Montreal, PQ / Chesterville, ON
## 5 Montreal, PQ / Chesterville, ON
## 6 New York, NY
## 'data.frame': 1310 obs. of 14 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : Factor w/ 1308 levels "","Abbing, Mr. Anthony",..: 23 25 26 27 28 32 47 48 52 56 ...
## $ sex : Factor w/ 3 levels "","female","male": 2 3 2 3 2 3 2 3 2 3 ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 930 levels "","110152","110413",..: 189 51 51 51 51 126 94 17 78 827 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
## $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
## $ boat : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
## Engineer new feature
titanic$FamilySize <- 1 + titanic$sibsp + titanic$parch
## What kind of NAs we got going on?
## Remove columns that could be trouble or that you feel will not add to
## cluster identity.
rem_cols <- c("body", "ticket", "name", "cabin", "boat", "home.dest", "age", "sibsp", "parch")
titanic <- titanic %>%
  select(-all_of(rem_cols))
## check
names(titanic)
## [1] "pclass" "survived" "sex" "fare" "embarked"
## [6] "FamilySize"
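For the NA check mentioned above, a minimal base-R sketch might look like the following. The toy data frame and its values are made up for illustration. One thing worth noticing from the str() output earlier is that several factor columns have an empty-string level (""), and those blanks are not counted by is.na(), so the sketch counts them separately.

```r
# Toy data frame standing in for the Titanic data (values are made up)
df <- data.frame(
  fare     = c(7.25, NA, 71.28, 8.05),
  embarked = factor(c("S", "", "C", "S")),
  body     = c(NA, NA, 135, NA)
)

# Count true NAs per column
na_counts <- colSums(is.na(df))

# Empty strings are not NA, so count them separately
blank_counts <- sapply(df, function(col) {
  sum(as.character(col) == "", na.rm = TRUE)
})

na_counts
blank_counts
```

A column that is mostly NA or blank is a candidate for removal before clustering.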
First, what are my data types again?
sapply(titanic, typeof)
## pclass survived sex fare embarked FamilySize
## "integer" "integer" "integer" "double" "integer" "double"
Right away I notice that two columns I want to include (survived and pclass) are being treated as a data type that is not appropriate. When you read data into R, columns of numbers are stored as numeric types, even when the numbers are really category labels. This can cause many problems, so I usually spot check right away.
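Here is a small sketch of why this matters, using made-up values rather than the real columns. It also shows a common trap when converting in the other direction: as.numeric() on a factor returns the positions of the levels, not the original labels.

```r
# Made-up passenger class codes read in as integers
pclass <- c(1L, 3L, 2L, 3L)
mean(pclass)            # R will average the codes, which is meaningless here

# Convert the codes to a factor so R treats them as categories
pclass_f <- as.factor(pclass)
levels(pclass_f)        # the distinct classes
is.numeric(pclass_f)    # FALSE: no longer treated as a number

# Going back: as.numeric() on a factor gives level positions, not labels
fare_band <- factor(c("10", "50", "10"))
as.numeric(fare_band)                 # positions of the levels
as.numeric(as.character(fare_band))   # the original values
```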
## Set numbered factor columns to factor data type
titanic$pclass <- as.factor(titanic$pclass)
titanic$survived <- as.factor(titanic$survived)
## Make sure fare and FamilySize are numeric. as.numeric() leaves doubles
## unchanged, so the ticket price is not rounded.
titanic$fare <- as.numeric(titanic$fare)
titanic$FamilySize <- as.numeric(titanic$FamilySize)
Once the data types are all set, I like to just be curious about the data and visualize a few things to understand the dataset. Let’s play around a bit; we earned it!
## [1] "pclass" "survived" "sex" "fare" "embarked"
## [6] "FamilySize"
## What is the male/female breakdown of survival?
titanic %>%
  ggplot(., aes(x = sex, fill = survived)) +
  geom_bar()
## What is the male/female breakdown of survival in each class?
titanic %>%
  ggplot(., aes(x = sex, fill = survived)) +
  geom_bar() +
  facet_wrap(~ pclass)
## What kind of fare was everyone paying anyway?
ggplot(titanic, aes(pclass, fare)) +
  geom_jitter(alpha = .2)
The kohonen::supersom() function accepts multilayered data sets: a named list in which each element is a matrix, all with the same number of observations (rows). The function lets us specify which layers to include and lets us choose a different distance measure for each layer.
But first, we need to deal with the factors. Super-organized maps (and neural networks in general) cannot work with categorical (factor) data directly, so you have to be creative about how you input them. In this case we are going to use the kohonen::classvec2classmat() function, which turns a factor column vector into a class matrix (its companion, kohonen::classmat2classvec(), does the reverse). Unlike functions that convert a factor column into dummy variables, which typically drop one reference level, this function adds one binary column for every factor level, i.e. one-hot encoding.
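To see the difference between a class matrix and dummy variables without needing the kohonen package, here is a base-R sketch on a made-up factor. This is an illustration of the encoding only, not how kohonen implements it.

```r
# Made-up factor with three levels
pclass <- factor(c("1", "3", "2", "1"))

# One-hot (class matrix): one binary column per level
one_hot <- sapply(levels(pclass), function(lv) as.numeric(pclass == lv))

# Dummy coding: the first level is the reference and gets no column
dummy <- model.matrix(~ pclass)[, -1, drop = FALSE]

one_hot
dummy
```

Every row of the one-hot matrix contains exactly one 1, while in the dummy version a passenger in the reference class is a row of all zeros.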
## let's start by giving our dataset a fresh name.
data_val <- titanic
## identify which columns are numeric and which are factors
is_num = sapply(data_val, is.numeric)
factors = names(data_val)[!is_num]
numerics = names(data_val)[is_num]
## Set up for loop
data_list = list()
distances = vector()
## This takes each factor column and makes it into a matrix. Then you will have a list of matrices.
for (fac in factors){
  data_list[[fac]] = kohonen::classvec2classmat( data_val[[fac]] )
  distances = c(distances, 'tanimoto')
}
## This scales all the numeric data and adds it as one more layer.
data_list[['numerics']] = scale(data_val[,numerics])
distances = c(distances, 'euclidean')
## Check out the data
## List of 5
## $ pclass : num [1:1310, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "1" "2" "3"
## $ survived: num [1:1310, 1:2] 0 0 1 1 1 0 0 1 0 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "0" "1"
## $ sex : num [1:1310, 1:3] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "" "female" "male"
## $ embarked: num [1:1310, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:4] "" "C" "Q" "S"
## $ numerics: num [1:1310, 1:2] 3.44 2.28 2.28 2.28 2.28 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "fare" "FamilySize"
## ..- attr(*, "scaled:center")= Named num [1:2] 33.3 1.88
## .. ..- attr(*, "names")= chr [1:2] "fare" "FamilySize"
## ..- attr(*, "scaled:scale")= Named num [1:2] 51.76 1.58
## .. ..- attr(*, "names")= chr [1:2] "fare" "FamilySize"
## 1 2 3
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 1 0 0
## [4,] 1 0 0
## [5,] 1 0 0
## [6,] 1 0 0
## [1] "pclass" "survived" "sex" "embarked" "numerics"
## num [1:1310, 1:2] 3.44 2.28 2.28 2.28 2.28 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "fare" "FamilySize"
## - attr(*, "scaled:center")= Named num [1:2] 33.3 1.88
## ..- attr(*, "names")= chr [1:2] "fare" "FamilySize"
## - attr(*, "scaled:scale")= Named num [1:2] 51.76 1.58
## ..- attr(*, "names")= chr [1:2] "fare" "FamilySize"
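The scaled:center and scaled:scale attributes in the output above come from scale(). As a quick sketch of what it does, here it is on made-up fares and family sizes. Scaling matters for the Euclidean layer because otherwise fare, with its much larger range, would dominate the distance.

```r
# Made-up fares and family sizes on very different scales
num <- cbind(fare = c(7.25, 71.28, 8.05, 211.34),
             FamilySize = c(1, 2, 1, 4))

scaled <- scale(num)

# Each column now has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# The original means and standard deviations are kept as attributes
attr(scaled, "scaled:center")
attr(scaled, "scaled:scale")
```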
Now we finally get to performing the SOM.
How do you know what dimensions the grid should have?
How do you know how many iterations you need?
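There is no single right answer to the first question. One commonly cited rule of thumb, which is not a kohonen default and is not what the 8 x 8 grid below is based on, is to use roughly 5 * sqrt(N) nodes for N observations. A minimal sketch, assuming that heuristic and a square grid:

```r
# Number of observations in the Titanic data shown earlier
n_obs <- 1310

# Heuristic (an assumption here, not a kohonen default): about 5 * sqrt(N) nodes
n_nodes <- 5 * sqrt(n_obs)

# Side length of a square grid with at least that many nodes
side <- ceiling(sqrt(n_nodes))
side
```

Treat the result as a starting point rather than a rule; a smaller grid simply means each node represents more passengers. For the number of iterations, the training progress plot at the end of this post is the practical check.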
## Setting the map dimension and iterations.
map_dimension = 8
n_iterations = 1000
## Set up SOM grid
## This is the actual grid on which the clustering will occur
som_grid = kohonen::somgrid(xdim = map_dimension,
ydim = map_dimension,
topo = "hexagonal")
## I always set my seed to 8, because it is my favorite number
set.seed(8)
m = kohonen::supersom(data_list,
grid = som_grid,
rlen = n_iterations,
alpha = 0.05,
whatmap = c(factors, 'numerics'),
dist.fcts = distances,
maxNA.fraction = .5)
The Kohonen plotting functions are really amazing. They are simple to use and are extremely informative.
As the SOM training iterations progress, the distance from each node’s weights to the samples represented by that node is reduced. Ideally, this distance should reach a minimum plateau. This plot option shows the progress over time. If the curve is still decreasing at the end of training, more iterations are required.
plot(m, type = "changes")