TP 6: Cluster analysis

Cluster analysis is commonly used to search for groups in data; it is most effective when the groups are not already known. Here you will get a light introduction to cluster analysis in R by clustering tumor samples from the publicly available data set analyzed by . As usual, you should always make sure you read the help documentation for each function you do not already know.

Load the cluster package into R:

library(cluster)

Download the Bittner et al. data here. The original images are not publicly available, but image-processed signals are. Read this dataset into R and check that you have the right number of rows and columns (there should be 8067 rows (spots) and 38 columns (arrays)).

mel<-matrix(scan("melanoma.txt"),ncol=38,byrow=TRUE)
dim(mel)

In order to replicate their results, you will need to do the same preprocessing steps described in the paper (they used only melanoma samples, not controls; they truncated ratios at 0.02 and 50; log2; global (across an entire slide) median normalization):

mel.bittner <- mel[1:3613,1:31]
mel.bittner[mel.bittner < 0.02] <- 0.02
mel.bittner[mel.bittner > 50] <- 50
mel.data <- sweep(data.matrix(mel.bittner),2,mel.bittner.med)

You could check that the median for each of the 31 experiments is 0 as follows:

apply(mel.data,2,median)

Now you are ready to perform some clustering. Cluster the melanoma samples (arrays) using the (hierarchical) You should try varying between-cluster dissimilarity measures by modifying the 'method' argument in hclust. Also try different dissimilarities by modifying the 'method' argument in the function dist. Some examples (read the help information ?hclust and ?dist so that you can try others - Ward's method can be especially useful):

Correlation distance with average linkage (what Bittner et al. did):

clust.cor.average <- hclust(as.dist(1-cor(mel.data)), method = "average")
plot(clust.cor.average)

Correlation distance with complete linkage:

clust.cor.complete <- hclust(as.dist(1-cor(mel.data)), method = "complete")
plot(clust.cor.complete)

Euclidean distance with single linkage:

clust.euclid.single <- hclust(dist(t(mel.data)), method = "single")
plot(clust.euclid.single)

Now repeat the above analysis with a partitioning clustering algorithm kmeans). For this algorithm you will need to specify the number of clusters; keep it small here, from 2 to 5 say. Find the cluster assignments of the 31 melanoma samples from each clustering. Here's one example:

clust.km.2 <- kmeans(t(mel.data),2)
clust.km.2$cluster

You can also try out the partitioning clustering algorithm PAM (Partitioning Around Medioids). See ?pam and ?plot.partition for information about using the pam function and making a cluster plot or silhouette plot.

Now do the clustering with the controls as well. The easiest way to set up the data is to change the number of columns above from 31 to 38, then do all the other steps. Are the controls clustering together, and separately from the melanoma samples?

Filter the genes according to some criterion of your choice, so that you end up with about 50 – 100 genes. This is so that you can practice making a heat diagram. You might choose select those with most variability across the samples, those with highest standardized difference between the 2 groups that Bittner et al. distinguished or 2 clusters that you determine yourself, there are many possibilities here.

Choose one of the clusterings that you carried out above and redo it using only the selected (50 – 100) genes. As an illustration, here's a simple example using just the first 100 genes:

mel.newdata <- mel.data[1:100,]
samples.cor.av <- as.dendrogram (hclust (as.dist (1-cor (mel.newdata)), method = "average"))
genes.cor.av <- as.dendrogram (hclust (as.dist (1-cor (t(mel.newdata))), method = "average"))
heatmap(mel.newdata,genes.cor.av,samples.cor.av)

The 31 samples are clustered along the columns, and the 100 genes are in the rows. You will notice that the default color scheme is not very nice. You can change this though, by using a different palette or making your own. If you are lucky, you might find an interesting pattern!

heatmap(mel.newdata, genes.cor.av, samples.cor.av, col=topo.colors(32))

You can experiment making heat diagrams for different clusterings, and explore some of the heatmap options. The function heatmap.2 in the gplots package provides many extensions to the standard heatmap function, you might like to try using that as well.

(Optional) It might also be interesting to explore different ways of estimating the number of clusters using the cluster.stats function from the fpc package. To use this function, you cannot use all the genes, you should do some gene filtering first.

Based on all of the above, what do you think about the reported discovery of a new subclass of melanoma?

For practice (remember that exam??), you can send me a short report including the purpose of the investigation, an explanation of the clustering method you finally decide on along with graphical displays (heatmap, dendrogram, silhouette, ... depending on what is appropriate for the choices you made). This report should not be more than about 4 pages.

As usual, your report can be in English or French. Please send your report as a pdf file, and follow the naming convention: lastname06.pdf (e.g. my report would be goldstein06.pdf). Please upload your report on moodle by Tuesday 30 April (any time).