Performing Complete Linkage Clustering

Quick Overview
  1. Dereplicate sequences
  2. Calculate the distance between sequences
  3. Group sequences into clusters by the complete linkage clustering method
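
The three steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not mcClust's implementation: the function names are hypothetical, the distance is a simple uncorrected mismatch fraction, and the gap handling (skipping columns where both sequences have a gap) is an assumption.

```python
from itertools import combinations


def dereplicate(seqs):
    """Collapse identical aligned sequences; returns {sequence: [ids]}."""
    unique = {}
    for seq_id, seq in seqs.items():
        unique.setdefault(seq, []).append(seq_id)
    return unique


def distance(a, b):
    """Fraction of mismatched columns, skipping columns where both are gaps
    (an assumed convention for this sketch)."""
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def complete_linkage(seqs, cutoff):
    """Repeatedly merge the two closest clusters, where the distance between
    clusters is the LARGEST pairwise distance between their members, until
    no pair of clusters is within the cutoff."""
    clusters = [[s] for s in seqs]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = max(distance(a, b) for a in clusters[i] for b in clusters[j])
            if d <= cutoff and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]


# Toy aligned input (hypothetical ids and sequences).
reads = {"r1": "ACGT", "r2": "ACGT", "r3": "ACGA", "r4": "TTTT"}
unique = dereplicate(reads)
otus = complete_linkage(list(unique), cutoff=0.25)
```

Because the linkage is complete, every pair of sequences inside a resulting cluster is within the cutoff of each other, which is what distinguishes it from single or average linkage.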

Input:

Aligned sequence file(s) in FASTA format

(Sample input files)

Output:

A dereplicated sequence file in FASTA format, with its id and sample files

A folder named "clustering" that contains the actual *.clust output file and for each sample:

  1. A summary text file listing cluster distance cutoff vs. clusters (OTUs)
  2. A graphical representation of the summary data

(Sample output files)

The Process:

The RDP mcClust complete linkage clustering tool (from the FunGene Pipeline) works for both nucleotide and protein sequences, whereas the cluster tool on RDP's Pyrosequencing Pipeline site only works for nucleotide sequences. This tutorial uses mcClust to illustrate how the cluster tools work.

This complete linkage clustering tool builds a cluster file from one or more aligned sequence files, such as the output of the RDP Infernal Aligner or HMMER3 Aligner. Each sequence file must be an aligned FASTA file. If a submission contains multiple aligned files, they should all be aligned to the same model. As with the RDP Aligner, multiple files may be compressed (zipped) into a single file before submission.

If you are following the tutorial, you will have four aligned files to upload. To cluster these FASTA files together, first compress them into a single archive and then upload that archive.
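
Any zip utility can be used for this step. As one possible approach, the archive could be built with Python's standard `zipfile` module; the helper name and the file names in the usage comment are hypothetical placeholders, not the tutorial's actual files.

```python
import zipfile
from pathlib import Path


def bundle_alignments(fasta_paths, archive_path):
    """Write the given aligned FASTA files into a single zip archive.

    Files are stored under their base names so the archive has no
    nested directories.
    """
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in fasta_paths:
            archive.write(path, arcname=Path(path).name)
    return archive_path


# Example call (hypothetical file names):
# bundle_alignments(["sample_a_aligned.fasta", "sample_b_aligned.fasta"],
#                   "aligned_files.zip")
```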

Users may set the maximum distance (for example, enter 0.03 to specify a 3% distance) and the step size (the increment between successive cluster distance cutoffs) for a clustering run by entering values in the boxes. The tutorial example uses a Distance Cutoff of 0.1 and a Step of 0.01. Users can also choose whether to cluster all submitted FASTA files together or separately. Once a clustering job is finished, the compressed results file is emailed to the address provided.
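
To make the relationship between the two settings concrete, the short sketch below lists the distance levels implied by a given cutoff and step. It assumes the levels start at 0 and include the maximum distance, which is an assumption about the tool's behavior rather than something stated here; `cutoff_levels` is a hypothetical helper.

```python
def cutoff_levels(max_distance, step):
    """Return the distance levels 0, step, 2*step, ... up to max_distance.

    Values are rounded to suppress floating-point noise such as
    0.07000000000000001.
    """
    n_steps = round(max_distance / step)
    return [round(i * step, 10) for i in range(n_steps + 1)]


# Tutorial settings: Distance Cutoff 0.1, Step 0.01.
levels = cutoff_levels(0.1, 0.01)
percent_levels = [round(level * 100) for level in levels]
```

Under these assumptions, the tutorial settings give eleven levels, from 0% to 10% distance in 1% increments.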

A clustering job in progress.

The results file contains:

  • A summary text file (*.txt) listing cluster distance cutoff vs. clusters (OTUs)
  • A graphical representation of the summary data in png format (see below)
  • The actual clustering results (*.clust)
  • A file in minimal dense BIOM format (version 1.0) for each distance level (*.biom)

Output files from clustering run.

Clustering results graph for Native_1_2_A sample

The *.clust file may be submitted to other RDP tools to:

  • calculate the Shannon and Chao1 indices,
  • calculate the Jaccard and/or Sorensen indices,
  • perform rarefaction, or
  • convert it to an R-compatible community data matrix
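
For readers unfamiliar with the first two measures, the sketch below computes them from a list of cluster (OTU) sizes for one sample. It uses the standard textbook definitions, with the natural-log Shannon index and the bias-corrected form of Chao1; whether the RDP tools use exactly these variants is an assumption, and the function names and example sizes are hypothetical.

```python
import math
from collections import Counter


def shannon_index(cluster_sizes):
    """Shannon index H' = -sum(p_i * ln(p_i)) over non-empty clusters."""
    total = sum(cluster_sizes)
    return -sum((n / total) * math.log(n / total) for n in cluster_sizes if n > 0)


def chao1(cluster_sizes):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton clusters."""
    counts = Counter(cluster_sizes)
    observed = sum(1 for n in cluster_sizes if n > 0)
    f1, f2 = counts[1], counts[2]
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical cluster sizes (number of sequences per OTU) at one cutoff.
sizes = [5, 3, 2, 1, 1, 1]
diversity = shannon_index(sizes)
richness = chao1(sizes)
```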

