Classifier Help

From Ribosomal Database Project Wiki

Taxonomic Hierarchy Model

The RDP naïve Bayesian Classifier now offers two hierarchy models, one for 16S rRNA genes and one for fungal LSU genes. The hierarchy model used by the 16S rRNA Classifier follows the one proposed in the new phylogenetically consistent higher-order bacterial taxonomy, with minor changes for lineages with few cultivated members.

The Fungal LSU Classifier (large subunit rRNA gene) hierarchy model and training set were provided by Andrea Porras-Alfaro, Gary Xie, and Cheryl Kuske (supported through a DOE Science Focus Area grant to Los Alamos National Laboratory). The fungal training set consists of 8506 high-quality public Sanger sequences spanning the first 1400 bp of the LSU gene. Both the dataset and the taxonomic hierarchy were hand-curated for taxonomic accuracy. Their results showed that the Fungal LSU Classifier was over 460-fold faster than BLASTN and provided equal or superior classification accuracy [K-L. Liu, C. R. Kuske, A. Porras-Alfaro, S. Eichorst, G. Xie. 2011. Accurate, rapid taxonomic classification of fungal large subunit rRNA genes].

From the highest to the lowest, the major formal taxonomic ranks are: domain, phylum, class, order, family and genus. There are occasional intermediate ranks such as “subclass” and “suborder”.

Classification Algorithm

The classification algorithm is described in the publication Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.

Each rRNA query sequence is assigned to a set of hierarchical taxa using a naïve Bayesian rRNA classifier. The classifier is trained on the known type strain 16S sequences (and a small number of other sequences representing regions of bacterial diversity with few named organisms). The frequencies of all 65,536 (4^8) possible eight-base subsequences (words) are calculated for the training set sequences in each of the approximately 880 genera. The probabilities are adjusted slightly by the addition of a 'prior' so that no probability value is exactly 0% or 100%.
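
The training step above can be sketched in a few lines of Python. This is an illustration only, not the Classifier's source code. This page does not give the exact form of the prior, so the smoothing used here, (n + 0.5) / (N + 1) for the overall word prior and (m + prior) / (M + 1) within a genus, is an assumption, and the function names are hypothetical.

```python
from collections import Counter

WORD_SIZE = 8  # eight-base words, as described above


def words(seq, k=WORD_SIZE):
    """Return the distinct k-base words in a sequence (A, C, G, T only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= set("ACGT")}


def train(training):
    """training: list of (genus, sequence) pairs.

    Returns (prior, per_genus, sizes) for use with word_prob().
    """
    n_total = len(training)
    overall = Counter()   # number of training sequences containing each word
    per_genus = {}        # the same count, kept separately for each genus
    sizes = Counter()     # number of training sequences in each genus
    for genus, seq in training:
        found = words(seq)
        overall.update(found)
        per_genus.setdefault(genus, Counter()).update(found)
        sizes[genus] += 1

    def prior(word):
        # Assumed smoothing: the value is always strictly between 0 and 1.
        return (overall[word] + 0.5) / (n_total + 1)

    return prior, per_genus, sizes


def word_prob(word, genus, prior, per_genus, sizes):
    """Smoothed probability of seeing `word` in a member of `genus`."""
    return (per_genus[genus][word] + prior(word)) / (sizes[genus] + 1)
```

With this kind of smoothing, a word never seen in a genus still receives a small non-zero probability, so one unseen word cannot drive the joint probability for that genus to zero.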

When a query sequence is submitted, the joint probability of observing all the words in the query can be calculated separately for each genus from the training set probability values. Under the naïve Bayesian assumption, the query is most likely a member of the genus with the highest probability. In the actual analysis, we randomly select only a subset of the words to include in the joint probability calculation, and the random selection and probability calculation are repeated for 100 trials. The number of times a genus is the most likely one out of the 100 bootstrap trials gives an estimate of the confidence in the assignment to that genus. For higher-order assignments, we sum the results for all genera under each taxon.
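
The bootstrap procedure described above might look roughly like the following sketch. Several details are assumptions made for illustration: the subset size (one eighth of the query's words), sampling with replacement, and all function and parameter names. Joint probabilities are computed as sums of logarithms to avoid numerical underflow.

```python
import math
import random
from collections import Counter


def classify(query_words, genus_word_prob, genera, trials=100,
             subset_fraction=1 / 8, rng=None):
    """Estimate genus-level confidence by bootstrapping.

    query_words: list of words found in the query sequence.
    genus_word_prob(word, genus): smoothed probability strictly in (0, 1).
    Returns {genus: fraction of trials in which that genus scored highest}.
    """
    rng = rng or random.Random()
    # Assumed subset size; the page only says "a subset of the words".
    subset_size = max(1, int(len(query_words) * subset_fraction))
    wins = Counter()
    for _ in range(trials):
        sample = [rng.choice(query_words) for _ in range(subset_size)]

        def log_joint(genus):
            # Sum of logs equals the log of the joint probability.
            return sum(math.log(genus_word_prob(w, genus)) for w in sample)

        wins[max(genera, key=log_joint)] += 1
    return {genus: wins[genus] / trials for genus in genera}


def roll_up(genus_confidence, parent_of):
    """Sum genus-level confidence under each higher-rank taxon."""
    totals = Counter()
    for genus, confidence in genus_confidence.items():
        totals[parent_of[genus]] += confidence
    return dict(totals)
```

Because confidence at a higher rank is the sum over the genera below it, it can never be lower than the confidence of any single genus in that taxon.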

Input Format

Currently FASTA, GenBank and EMBL formats are accepted. Sequences may be in uppercase or lowercase, and FASTA files may be multiline and contain multiple records.

Input Sequence Limitation

The Classifier needs at least 50 bases in a sequence to give a good classification result. There are three ways to run the RDP Classifier: (1) the interactive web version, which accepts no more than 100,000 query sequences; (2) batch submission after logging in to RDPTools, which accepts up to 500,000 sequences; and (3) the command-line version, which you can download (http://sourceforge.net/projects/rdp-classifier/), install and run on your own computer with no limit on the number of sequences. Please contact rdpstaff@msu.edu if you have questions.
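
Before submitting a large file, it can be useful to check which records meet the 50-base minimum. The sketch below reads multiline, multi-record FASTA input and keeps only sequences that are long enough. It is a hypothetical pre-check written for this page, not part of the Classifier, and it handles FASTA only (not GenBank or EMBL).

```python
MIN_BASES = 50  # minimum length for a good classification result


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            # Accept either case; normalise to uppercase.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def long_enough(records, min_bases=MIN_BASES):
    """Keep only records with at least `min_bases` bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_bases]
```

For a file on disk, `long_enough(read_fasta(open(path)))` would return the records worth submitting, where `path` is a placeholder for your own file name.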

View Results in Hierarchy

After the sequences are classified, the classification results are shown in a taxonomic hierarchy. The lineage will display the ancestors of the current root taxon, starting from the highest to the lowest rank.

The hierarchy view displays, in hierarchical order, all the taxon nodes that have sequences assigned to them. Each line contains the taxon rank, name, and the number of sequences assigned to that taxon with an estimated confidence above the confidence cutoff value. The top taxon is the current root taxon. If you click the “show details” link on that line, a detailed classification result for the sequences assigned to that taxon will be displayed. Clicking any other taxon node makes that node the root and updates the hierarchy view.

Each taxon may have an unclassified node below it.

Confidence Threshold

For each rank assignment, the Classifier automatically estimates the classification reliability using bootstrapping. Ranks where sequences could not be assigned with a bootstrap confidence estimate above the threshold are displayed under an artificial 'unclassified' taxon. The default threshold is 80%. For partial sequences shorter than 250 bp (but longer than 50 bp), a bootstrap cutoff of 50% was shown to be sufficient to classify sequences accurately at the genus level, and to provide genus-level assignments for a higher percentage of sequences.

[Figure: Classifier bootstrap]

(Reference: Comparative Analysis of Pyrosequencing and a Phylogenetic Microarray for Exploring Microbial Community Structures in the Human Distal Intestine).
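
As an illustration of how the threshold acts on a single query, the sketch below cuts a lineage at the first rank whose confidence falls below the cutoff. The function names, the "unclassified_<parent>" label and the use of "at or above" rather than strictly "above" are assumptions made for this example.

```python
def suggested_threshold(seq_length):
    """Cutoff suggested above: 50% for sequences under 250 bp, else 80%."""
    return 0.5 if seq_length < 250 else 0.8


def displayed_lineage(assignments, threshold):
    """Cut a lineage at the first rank below the confidence threshold.

    assignments: list of (rank, taxon, confidence), highest rank first.
    Ranks below the threshold are replaced by one 'unclassified' entry.
    """
    kept = []
    for rank, taxon, confidence in assignments:
        if confidence < threshold:
            parent = kept[-1][1] if kept else "Root"
            # Hypothetical label for the artificial taxon.
            kept.append((None, f"unclassified_{parent}", None))
            break
        kept.append((rank, taxon, confidence))
    return kept
```

Stopping at the first failing rank is enough here because confidence at a higher rank is a sum over the genera beneath it, so values cannot increase going down the lineage.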

Display Depth

Display depth controls the number of ranks displayed in the hierarchy. With the default "Auto" setting, the program automatically adjusts the depth to display a reasonable number of lines in the browser. Increase the depth to see more ranks at the same time.

View Details

To view the detailed classification results for individual queries assigned to a taxon, click the link “[show details]” beside the current root taxon. Note that the symbol "-" after a sequence name means the results were obtained using the reverse complement of that query sequence. In other words, the query sequence was submitted in the reverse orientation.
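
For reference, the reverse complement of a sequence can be computed as in this short sketch. It is not the Classifier's own code, and how the Classifier decides which orientation to use is not covered here.

```python
# Map each base to its complement; N (unknown base) stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence.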

Classifier Limitation

For regions of less-well-studied bacterial diversity, query classification is often not well supported, even for higher taxonomic ranks. We have found that a high percentage of sequences from some environmental clone libraries are classified with less than 80% confidence, even at the phylum level. Such low confidence classification results may identify sequences where a thorough phylogenetic analysis is warranted.

Download the allrank/fixrank Result

On the assignment detail page, click the button "download allrank result" or "download fixrank result" to download the results to your local hard disk. You can save the results as a semicolon-delimited text file and import it into a spreadsheet program such as Excel. The "allrank" format outputs the results for every rank assigned to each sequence. The "fixrank" format outputs the results only for a fixed list of ranks, in the following order: domain, phylum, class, order, family and genus. If a rank is missing from the lineage, the bootstrap value and the taxon name from the immediate lower rank are reported in its place. This eliminates the gaps in the lineage, but it also introduces taxon names and ranks that do not exist in the hierarchy. Interpret the "fixrank" results with caution.
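
The gap-filling rule for "fixrank" can be illustrated with the sketch below. It works on an in-memory lineage rather than on the downloaded file, whose exact column layout is not described here. The function name is hypothetical, and "immediate lower rank" is read as the nearest lower rank in the fixed list that is present, which is an assumption.

```python
FIXED_RANKS = ["domain", "phylum", "class", "order", "family", "genus"]


def to_fixrank(allrank):
    """Convert an allrank lineage into the six fixed ranks.

    allrank: list of (rank, taxon, confidence), highest rank first.
    A missing fixed rank takes the taxon and confidence of the nearest
    lower fixed rank that is present; intermediate ranks are dropped.
    """
    by_rank = {rank: (taxon, conf) for rank, taxon, conf in allrank}
    row = [None] * len(FIXED_RANKS)
    lower = None  # most recent assignment seen while walking upward
    for i in range(len(FIXED_RANKS) - 1, -1, -1):
        rank = FIXED_RANKS[i]
        if rank in by_rank:
            lower = by_rank[rank]
        # Stays None if no lower rank is available to copy from.
        row[i] = (rank, lower[0], lower[1]) if lower else None
    return row
```

In a lineage with no order, the "order" column would repeat the family name and its bootstrap value, which is why a fixrank row can contain a name at a rank where that taxon does not exist.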

Try our new "Using the RDP Classifier" process tutorial
