Tutorial for hybridHclust

The package hybridHclust provides R functions for carrying out hybrid hierarchical clustering using "mutual clusters". Such clustering is useful when different levels of clustering resolution are required; for example, interest may focus both on tightly clustered small sets of observations and on larger, broader clusters.

  1. You first need to install a recent version of the R statistical package. R is free and can be found at The R Project. Under Download, click http://cran.r-project.org/mirrors.html and pick a mirror close to you. You probably want a pre-compiled version; for Windows, choose the Windows (95 or later) link, select base, and download SetupR.exe. Installation takes 5-10 minutes. R is a great package, and is worth knowing about!
  2. hybridHclust can be installed either directly from CRAN (see the command at the end of these instructions) or can be downloaded and installed as described below.
  3. Download the appropriate Unix/Linux version or Windows version for your platform from the hybridHclust distribution site. Often, browsers offer you a choice of opening the file or saving it. Elect to save it and remember the name of the folder where you saved it.
  4. For Unix/Linux type
R CMD INSTALL -l mylib hybridHclust_1.0-0.tar.gz

For Unix/Linux, start R and type

library(hybridHclust,lib.loc="mylib")

In Windows, while in R, pull down the Packages menu and select Load package..., then select hybridHclust. In Windows you will also find it helpful to go to the Misc menu and turn off buffered output; this forces R output to be written to the console immediately.
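If you chose the CRAN route mentioned in step 2, the package can instead be installed from inside R with the standard CRAN installation command (an internet connection is required):

install.packages("hybridHclust")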

You are now ready to use hybridHclust.

Note: The R code in the demo below can be executed by typing demo(hybridHclust) in R once the library has been loaded.

First, load the package and data:

library(hybridHclust)
data(sorlie)
data(sorlielabels)

We will cluster 85 samples (columns, each with 456 genes), using a correlation distance measure defined as 1 minus the correlation. This "correlation" distance, commonly used in clustering microarray data, is equivalent (up to a constant factor) to squared Euclidean distance once each sample has been scaled to have mean 0 and sd 1.
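As a concrete illustration, the distance matrix dmat referred to below could be built as follows. This is a sketch, not necessarily the exact code used in demo(hybridHclust), and it assumes sorlie is stored with genes as rows and samples as columns:

x <- sorlie          # 456 genes (rows) by 85 samples (columns)
dmat <- 1 - cor(x)   # 85 x 85 correlation-distance matrix between samples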

Currently the code relies on integers being used as the row and column names of the distance matrix, so these names are set explicitly with dimnames(dmat).
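A sketch of the corresponding assignment (assuming dmat was built as above); these integer labels are the ones that appear in the mutual-cluster output below:

dimnames(dmat) <- list(1:nrow(dmat), 1:ncol(dmat))   # integer row and column names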

Identify the mutual clusters. The plot=TRUE option displays a bottom-up dendrogram with the mutual clusters identified.
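A hedged sketch of this step; the argument names used here (distances, plot) are assumptions, so check ?mutualCluster for the exact interface:

mc1 <- mutualCluster(distances = dmat, plot = TRUE)   # mc1 stores the mutual clusters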

This displays a dendrogram with the mutual clusters marked and prints the following output, listing the members of each mutual cluster:
1 : 43 44
2 : 45 46
3 : 24 25
4 : 16 20
5 : 7 8
6 : 10 12
7 : 37 60
8 : 35 36
9 : 29 30
10 : 31 32 33 34
11 : 50 51
12 : 52 55
13 : 54 56
14 : 82 83
15 : 78 79
16 : 63 64
17 : 65 66

Note that the mutualCluster function can take as an input either the original data matrix x (in which case Euclidean distance is used) or a distance matrix (as above). As indicated previously, this demo uses a correlation-based distance, which is equivalent to squared Euclidean distance after scaling each observation.

Next, show distances between points belonging to mutual clusters:

get.distances(mc1)

The printed output below shows the distances between the points within each mutual cluster. In all cases but one the mutual cluster is a pair of points; the exception (cluster 10) contains four points.

[[1]]
          44
43 0.6573144

[[2]]
          46
45 0.6722525

[[3]]
          25
24 0.4642499

[[4]]
          20
16 0.5953205

[[5]]
           8
7  0.4090045

[[6]]
          12
10 0.4578701

[[7]]
          60
37 0.6275236

[[8]]
          36
35 0.4341527

[[9]]
          30
29 0.3366078

[[10]]
          31        32        33
32 0.2992227
33 0.1934151 0.2052296
34 0.2491986 0.2662368 0.1512860

[[11]]
          51
50 0.6134367

[[12]]
          55
52 0.7042852

[[13]]
          56
54 0.7609525

[[14]]
          83
82 0.5660623

[[15]]
          79
78 0.6416045

[[16]]
          64
63 0.5237544

[[17]]
          66
65 0.554218

Hybrid hierarchical clustering:

Fit the hybrid hierarchical model. The matrix x is transposed because we want to cluster its columns, and hybridHclust clusters rows. The trace option is used to provide reassurance that something is happening.

The fitted tree is stored in an object called hyb1; it is then plotted and cut into five clusters.
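A hedged reconstruction of these commands (the argument name themc is an assumption; demo(hybridHclust) contains the exact code):

hyb1 <- hybridHclust(t(x), themc = mc1, trace = TRUE)   # fit the hybrid tree to the 85 samples
plot(hyb1)                                              # draw the dendrogram
hyb1.labels <- cutree(hyb1, k = 5)                      # cut the tree into 5 clusters
table(hyb = hyb1.labels, sorlie = sorlielabels)         # compare with the published labels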

The plot command generates a dendrogram of the hybrid tree. The last step "cuts" the tree so that 5 clusters are generated, and compares the resulting cluster labels with the labels found in Sorlie et al. (2001). Note that some clusters agree reasonably well (e.g. hyb2 with sorlie3, hyb3 with sorlie4) while others disagree more (e.g. sorlie5 is mostly split across hyb4 and hyb5).

   sorlie
hyb  1  2  3  4  5
  1 14  9  0  1  0
  2  0  1 11  0  1
  3  0  1  1 13  2
  4  0  0  1  1 20
  5  0  0  0  0  9

Comparison of different hierarchical clustering methods

Calculate and store the bottom-up clustering object using distances found in "dmat". We'll use this later for a comparison of top-down, bottom-up, and hybrid methods.
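A sketch of one way to do this (the linkage method used in the actual demo may differ; hclust's default is complete linkage):

bu1 <- hclust(as.dist(dmat))   # bottom-up (agglomerative) tree from the correlation distances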

Fit a top-down model for the sake of comparison. We transpose x because we want to cluster columns (i.e. samples) and tsvq clusters rows by default. The trace=TRUE option traces out the steps as they are executed; this can be reassuring when working on large problems.
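A hedged sketch of the call; tsvq may require additional arguments (for example the number of leaves), so consult ?tsvq:

td1 <- tsvq(t(x), trace = TRUE)   # top-down (divisive) tree for the 85 samples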

Comparative analysis of the three dendrograms produced above. First we put the trees into a single object, and then extract a set of nested partitions for each of the three trees. Comparisons will be made in terms of the similarity of partitions of equal size, and within-group sums of squares.

The combined object is called treelist; each element holds one tree ($tr) together with a descriptive name ($name).
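A hedged reconstruction, using the object names from the sketches above; the $tr and $name components are the ones referenced by the comparison code at the end of this section:

treelist <- list(list(tr = bu1,  name = 'Bottom-up'),
                 list(tr = td1,  name = 'Top-down'),
                 list(tr = hyb1, name = 'Hybrid'))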
Four "helper functions" are defined:
make.tree.partitions thenames list(parts,thenames,prange) > calc.partition.dists tnames dpart return(d/choose(length(p1),2)) > calc.sum.dist sum.d[k] sum.d >
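Of the four helpers, dpart is the one called directly at the end of this tutorial, and its logic can be sketched as follows (the demo's own definition may differ in detail): count the pairs of observations that the two partitions treat differently and divide by the total number of pairs.

dpart <- function(p1, p2) {
  # distance between two partitions p1 and p2: the proportion of pairs (i, j)
  # that one partition puts in the same cluster and the other separates
  n <- length(p1)
  d <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if ((p1[i] == p1[j]) != (p2[i] == p2[j])) d <- d + 1
    }
  }
  return(d / choose(n, 2))
}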

Now use these helper functions to compare the resulting trees, via comparison of partitions and within-group sums of distances.

The nested partitions for each tree are stored in an object called partitions.
The demo then plots the Hamming distances between the partitions of the 3 trees against the number of clusters, one set of distances for each pair of trees.

Now plot the within-group distances against the number of clusters. The demo computes the total within-group distance from the cluster means (tot.dist.from.mean) for each method and draws the results in a two-panel figure (par(mfrow=c(1,2))), with line types set via mylty.

Comparison with sorlie labels:

for (i in 1:3) cat(treelist[[i]]$name, dpart(sorlielabels,cutree(treelist[[i]]$tr,5)),'\n')
This produces the following text output:
Bottom-up 0.1327731
Top-down 0.1686275
Hybrid 0.1703081