Classifier Help

Help Topics:

:: Taxonomic Hierarchy Model :: Input Sequence Limitation :: Display Depth
:: Classification Algorithm :: View Results in Hierarchy :: View Details
:: Input Format :: Confidence Threshold :: Classifier Limitation
:: Download allrank/fixrank Result :: Gene Copy Number Adjustment

 

:: Taxonomic Hierarchy Model

The RDP naïve Bayesian Classifier now offers multiple hierarchy models for 16S rRNA, Fungal LSU, and Fungal ITS genes.

The current hierarchy model used by the 16S rRNA Classifier is based on the one proposed in the new phylogenetically consistent higher-order bacterial taxonomy, with some minor changes for lineages with few cultivated members. From the highest to the lowest, the major formal taxonomic ranks are: domain, phylum, class, order, family and genus. There are occasional intermediate ranks such as “subclass” and “suborder”.

The Fungal LSU Classifier (large subunit rRNA gene) hierarchy model and training set were provided by Andrea Porras-Alfaro, Gary Xie, and Cheryl Kuske (supported through a DOE Science Focus Area grant to Los Alamos National Laboratory). The fungal training set consists of 8506 high-quality public Sanger sequences spanning the first 1400 bp of the LSU gene. This dataset and taxonomic hierarchy were hand-curated for taxonomic accuracy. Their results showed the Fungal LSU Classifier was computationally faster (over 460 fold) than BLASTN and provided equal or superior classification accuracy [K-L. Liu, C. R. Kuske, A. Porras-Alfaro, S. Eichorst, G. Xie. 2012. Accurate, rapid taxonomic classification of fungal large subunit rRNA genes. Appl. Environ. Microbiol. 78(5): 1523-1533].

Two Fungal ITS training sets are provided. Warcup is a version from an active curatorial effort kindly provided by Paul Greenfield, Vinita Deshpande and colleagues of the Australian CSIRO [V. Deshpande, Q. Wang, P. Greenfield, M. Charleston, A. Porras-Alfaro, C. R. Kuske, J. R. Cole, D. J. Midgley, and N. Tran-Dinh. 2015. Fungal identification using a Bayesian Classifier and the 'Warcup' training set of Internal Transcribed Spacer sequences. Mycologia (In press)]. UNITE is a set consisting of UNITE core sequences for each dynamic species hypothesis, provided by Kessy Abarenkov of UNITE. See RDP's technical report Comparison of Three Fungal ITS Reference Sets for a detailed analysis.

:: Classification Algorithm

The classification algorithm was published in Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.

Each rRNA query sequence is assigned to a set of hierarchical taxa using a naïve Bayesian rRNA classifier. The classifier is trained on the known type strain 16S sequences (and a small number of other sequences representing regions of bacterial diversity with few named organisms). The frequencies of all 65,536 (4^8) possible eight-base subsequences (words) are calculated for the training set sequences in each of the approximately 880 genera. (Strictly, the probabilities are modified slightly by the addition of a 'prior' so that no probability value is exactly 0% or 100%.)
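As a rough illustration of the word-frequency step, the sketch below collects the distinct eight-base words from a few toy training sequences for one genus and applies a simple additive prior. The function names, the toy sequences, and the smoothing form (count + 0.5) / (M + 1) are assumptions made for this example; the published algorithm defines its own prior.

```python
WORD_SIZE = 8  # eight-base words, as described above


def words(seq, k=WORD_SIZE):
    """Return the set of distinct k-base words in seq, skipping ambiguous bases."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if set(w) <= set("ACGT"):
            found.add(w)
    return found


def genus_word_probs(training_seqs, prior=0.5):
    """Build a function giving the smoothed probability that a sequence
    from this genus contains a given word.

    The additive prior (an assumed form) keeps every value strictly
    between 0 and 1, even for words never seen or always seen.
    """
    counts = {}
    for s in training_seqs:
        for w in words(s):
            counts[w] = counts.get(w, 0) + 1
    m_total = len(training_seqs)

    def prob(word):
        return (counts.get(word, 0) + prior) / (m_total + 1)

    return prob
```

With two toy training sequences, a word present in both gets probability 2.5/3 and an unseen word gets 0.5/3, so neither reaches 0% or 100%.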

When a query sequence is submitted, the joint probability of observing all the words in the query can be calculated separately for each genus from the training set probability values. Under the naïve Bayesian assumption, the query is most likely a member of the genus with the highest probability. In the actual analysis, we randomly select only a subset of the words to include in the joint probability calculation, and the random selection and probability calculation are repeated for 100 trials. The number of times a genus is most likely out of the 100 bootstrap trials gives an estimate of the confidence in the assignment to that genus. For higher-order assignments, we sum the results for all genera under each taxon.
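The bootstrap procedure described above can be sketched roughly as follows. This is a toy illustration, not the Classifier's actual code: the function names, the log-probability callback, the default subset size (one eighth of the query words), and sampling without replacement are all assumptions made for the example.

```python
import random


def bootstrap_confidence(query_words, genus_log_prob, genera,
                         trials=100, subset_size=None, rng=None):
    """Estimate per-genus confidence as the fraction of trials won.

    genus_log_prob(genus, word) is a hypothetical callback returning the
    log of the smoothed word probability for that genus.
    """
    rng = rng or random.Random(0)
    query_words = list(query_words)
    if subset_size is None:
        subset_size = max(1, len(query_words) // 8)  # assumed subset fraction
    wins = {g: 0 for g in genera}
    for _ in range(trials):
        sample = rng.sample(query_words, subset_size)
        # Joint probability under the naive assumption = sum of log probabilities.
        best = max(genera,
                   key=lambda g: sum(genus_log_prob(g, w) for w in sample))
        wins[best] += 1
    return {g: wins[g] / trials for g in genera}


def taxon_confidence(genus_conf, genus_to_taxon):
    """Sum genus-level results under each higher-order taxon."""
    totals = {}
    for genus, conf in genus_conf.items():
        taxon = genus_to_taxon[genus]
        totals[taxon] = totals.get(taxon, 0.0) + conf
    return totals
```

Because higher-rank values are sums over the genera below them, confidence can only stay equal or increase when moving from genus toward domain.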

:: Input Format

Currently Fasta, GenBank and EMBL formats are accepted. Sequences may be in uppercase or lowercase.

:: Input Sequence Limitation

The Classifier needs a sequence of at least 50 bases to give a good classification result. The number of query sequences is limited to 100000. If you need more than 100000 sequences classified, please contact rdpstaff@msu.edu.
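Before uploading, it may help to check a Fasta file against these two limits. The sketch below is a hypothetical local pre-check, not part of the Classifier; the function names and the returned dictionary layout are assumptions for illustration.

```python
MIN_BASES = 50        # minimum length stated above
MAX_QUERIES = 100000  # maximum number of query sequences stated above


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of Fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def precheck(lines):
    """Report the sequence count, names shorter than MIN_BASES, and whether
    the count exceeds MAX_QUERIES."""
    records = list(read_fasta(lines))
    return {
        "count": len(records),
        "too_short": [n for n, s in records if len(s) < MIN_BASES],
        "over_limit": len(records) > MAX_QUERIES,
    }
```

For a file on disk, `precheck(open("queries.fasta"))` would work the same way, where the file name is a placeholder.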

:: View Results in Hierarchy

After the sequences are classified, the classification results are shown in a taxonomic hierarchy.

The lineage will display the ancestors of the current root taxon, starting from the highest to the lowest rank.

The hierarchy view displays, in hierarchical order, all the taxon nodes that have sequences assigned to them. Each line contains the taxon rank, name and the number of sequences assigned to that taxon with an estimated confidence above the confidence cutoff value. The top taxon is the current taxon root. If you click the “show details” link on that line, a detailed classification result of the sequences that are assigned to that taxon will be displayed. Clicking any other taxon node will make that node the root and will update the hierarchy view.

Each taxon may have an unclassified node below it. See “Confidence Threshold” for details.

:: Confidence Threshold

For each rank assignment, the Classifier automatically estimates the classification reliability using bootstrapping. Ranks where sequences could not be assigned with a bootstrap confidence estimate above the threshold are displayed under an artificial 'unclassified' taxon. The default threshold is 80%.
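A minimal sketch of how a threshold could be applied to one sequence's lineage is shown below. The function name, the tuple layout, the "unclassified_" label format, and treating a value equal to the threshold as passing are assumptions for illustration only.

```python
def apply_threshold(assignments, threshold=0.80):
    """Relabel ranks whose bootstrap confidence falls below the threshold.

    assignments: list of (rank, taxon, confidence) ordered from the highest
    rank to the lowest, with confidence expressed as a fraction (0 to 1).
    """
    result = []
    parent = None  # last taxon that passed the threshold
    for rank, taxon, conf in assignments:
        if conf >= threshold:
            result.append((rank, taxon, conf))
            parent = taxon
        else:
            # Assumed label format: hang the sequence under an artificial
            # 'unclassified' node below the last well-supported taxon.
            label = "unclassified_" + parent if parent else "unclassified"
            result.append((rank, label, conf))
    return result
```

Lowering the threshold, for example to 0.5 for short reads, keeps more of the lower ranks at the cost of less certain assignments.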

For partial sequences between 50 and 250 bp long, a bootstrap cutoff of 50% was shown to be sufficient to accurately classify sequences at the genus level, and to provide genus-level assignments for a higher percentage of sequences. See table "Fractions of variable regions that were correctly classified by the RDP Classifier" below.

[Table image not reproduced: "Fractions of variable regions that were correctly classified by the RDP Classifier"]
(Reference: Comparative Analysis of Pyrosequencing and a Phylogenetic Microarray for Exploring Microbial Community Structures in the Human Distal Intestine).

:: Display Depth

Controls the number of ranks displayed in the hierarchy. With the default "Auto" setting, the program automatically adjusts the depth to display a reasonable number of lines in the browser. Increase the depth to see more ranks at the same time.

:: View Details

To view the detailed classification results for individual queries assigned to a taxon, click the link “[show details]” beside the current root taxon. Note that the symbol "-" after a sequence name means the results were obtained using the reverse complement of that query sequence, that is, the query was submitted in the reverse orientation.
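For reference, the reverse complement of a sequence can be computed as in the short sketch below. The helper name is hypothetical, and the handling of U and lowercase letters is a choice made for this example.

```python
# Map each base to its complement; U is treated like T.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying this to a query marked with "-" gives the orientation the Classifier actually used for that result.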

:: Classifier Limitation

For regions of less-well-studied bacterial diversity, query classification is often not well supported even for higher taxonomic ranks. We have found that a high percentage of sequences from some environmental clone libraries are classified with less than 80% confidence, even at the phylum level. Such low confidence classification results may identify sequences where a thorough phylogenetic analysis is warranted.

:: Download allrank/fixrank Result

On the assignment detail page, click the button "download allrank result" or "download fixrank result" to download the results to your local hard disk. You can save the results as a semicolon-delimited text file and import it into a spreadsheet program such as Excel. The "allrank" format outputs the results for all ranks that apply to each sequence. The "fixrank" format only outputs the results for a fixed list of ranks in the following order: domain, phylum, class, order, family and genus. If a rank is missing from the lineage, the bootstrap value and the taxon name from the immediate lower rank are reported in its place. This eliminates the gaps in the lineage, but it also introduces taxon name and rank combinations that do not exist. Users should interpret the "fixrank" results with caution.
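The gap-filling rule for "fixrank" can be sketched as follows. This is an illustration of the rule as described here, not the Classifier's own code; the function name, the tuple layout, and the example taxon names are assumptions.

```python
FIXED_RANKS = ["domain", "phylum", "class", "order", "family", "genus"]


def to_fixrank(lineage):
    """Project a lineage onto the six fixed ranks.

    lineage: list of (rank, taxon, confidence), highest rank first. It may
    skip some fixed ranks and may contain intermediate ranks such as
    'subclass', which are dropped.
    """
    by_rank = {rank: (taxon, conf) for rank, taxon, conf in lineage}
    filled = {}
    lower = None
    # Walk upward from genus so each missing rank borrows the taxon name
    # and bootstrap value from the immediate lower rank.
    for rank in reversed(FIXED_RANKS):
        if rank in by_rank:
            lower = by_rank[rank]
        filled[rank] = lower
    return [(rank,) + filled[rank]
            for rank in FIXED_RANKS if filled[rank] is not None]
```

In a lineage with no family, the family column would carry the genus name and its bootstrap value, which is exactly the kind of non-existing rank and name pairing to watch for.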

:: Gene Copy Number Adjustment

Without adjustment, the assignment count displayed next to a taxon reflects the number of sequences assigned to that taxon. For 16S gene sequences, this count can be adjusted based on the 16S gene copy number for that taxon to better estimate relative species abundance. When enabled, each sequence is weighted as "1 / (mean gene copy number)" of the lowest-rank taxon with confidence above the threshold.
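The weighting can be illustrated with a small sketch. The function name, the input layout, and the copy-number values in the usage note are hypothetical.

```python
def adjusted_counts(assigned_taxa, mean_copy_number):
    """Sum per-sequence weights of 1 / (mean gene copy number).

    assigned_taxa: for each sequence, the lowest-rank taxon whose
    confidence is above the threshold.
    mean_copy_number: mapping from taxon name to mean 16S copy number.
    """
    totals = {}
    for taxon in assigned_taxa:
        weight = 1.0 / mean_copy_number[taxon]
        totals[taxon] = totals.get(taxon, 0.0) + weight
    return totals
```

For instance, two sequences assigned to a taxon with a mean copy number of 2 contribute an adjusted count of 1, while one sequence from a taxon with a mean copy number of 4 contributes 0.25.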

The precompiled Classifier was trained with the 16S gene copy number data from bacterial and archaeal genome sequences provided by rrnDB (Nucleic Acids Research 2014; doi: 10.1093/nar/gku1201). For each taxon, the mean gene copy number (if available) of the immediate child taxa was used as the mean copy number for that taxon. For any taxon without copy number data, the mean copy number of its parent was used for that taxon.
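One way to read the two rules above is as a bottom-up pass followed by a top-down pass over the taxonomy. The sketch below follows that reading; the function name, the dictionary-based tree, and the toy values are assumptions, and the actual training code may differ.

```python
def fill_copy_numbers(children, known, root):
    """Assign a mean copy number to every taxon reachable from root.

    children: mapping from taxon to a list of its immediate child taxa.
    known: mapping from taxon to a mean copy number where data exist.
    """
    means = dict(known)

    def bottom_up(taxon):
        # A taxon without its own value takes the mean of its children's
        # means, using only the children that have one.
        child_means = [bottom_up(c) for c in children.get(taxon, [])]
        child_means = [m for m in child_means if m is not None]
        if taxon not in means and child_means:
            means[taxon] = sum(child_means) / len(child_means)
        return means.get(taxon)

    def top_down(taxon, parent_mean):
        # A taxon still lacking a value inherits its parent's mean.
        if taxon not in means and parent_mean is not None:
            means[taxon] = parent_mean
        for c in children.get(taxon, []):
            top_down(c, means.get(taxon))

    bottom_up(root)
    top_down(root, None)
    return means
```

With a family whose genera have means 2 and 4 and a third genus with no data, the family gets 3, and the third genus inherits 3 from the family.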

The Classifier can be trained with user-provided gene copy number data. See How to Train the Classifier on RDP GitHub.

 

Questions/comments: rdpstaff@msu.edu
Creative Commons License: Attribution-ShareAlike
