Help Topics:
:: Taxonomic Hierarchy Model
The hierarchy model currently used by the naïve Bayesian rRNA classifier is derived from the one proposed in the new phylogenetically consistent higher-order bacterial taxonomy, with some minor changes for lineages with few cultivated members. From the highest to the lowest, the major formal taxonomic ranks are: domain, phylum, class, order, family, genus and species. There are occasional intermediate ranks such as “subclass” and “suborder”.
:: Classification Algorithm
The classification algorithm has been published in “Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy”.
Each rRNA query sequence is assigned to a set of hierarchical taxa using a naïve Bayesian rRNA classifier. The classifier is trained on the known type strain 16S sequences (and a small number of other sequences representing regions of bacterial diversity with few named organisms). The frequencies of all 65,536 (4^8) possible eight-base subsequences (words) are calculated for the training set sequences in each of the approximately 880 genera. (Strictly speaking, the probabilities are modified slightly by the addition of a 'prior' so that no probability value is exactly 0% or 100%.)
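The word-frequency step can be illustrated with a minimal Python sketch. This is not the classifier's own code: the function names and the pseudo-count of 0.5 are assumptions made for illustration, since the exact form of the 'prior' is not given here.

```python
from collections import Counter

WORD_SIZE = 8  # eight-base words; there are 4**8 = 65,536 possible words


def words(seq, k=WORD_SIZE):
    """Return the set of distinct k-base words in a sequence (case-insensitive).

    Words containing anything other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    }


def word_probabilities(genus_seqs, pseudo=0.5):
    """Build a function giving a smoothed P(word | genus).

    The probability is the fraction of the genus's training sequences that
    contain the word. The pseudo-count (an assumed stand-in for the 'prior')
    keeps every value strictly between 0 and 1.
    """
    n_seqs = len(genus_seqs)
    counts = Counter()
    for seq in genus_seqs:
        counts.update(words(seq))

    def prob(word):
        return (counts[word] + pseudo) / (n_seqs + 2 * pseudo)

    return prob
```

With two training sequences that both contain a given word, the sketch returns (2 + 0.5) / (2 + 1) for that word and 0.5 / (2 + 1) for a word seen in neither, so no word ever gets a probability of exactly 0 or 1.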
When a query sequence is submitted, the joint probability of observing all the words in the query can be calculated separately for each genus from the training set probability values. Using the naïve Bayesian assumption, the query is most likely a member of the genus with the highest probability. In the actual analysis, we randomly select only a subset of the words to include in the joint probability calculation, and the random selection and probability calculation are repeated for 100 trials. The number of times a genus is the most likely one out of the 100 bootstrap trials gives an estimate of the confidence in the assignment to that genus. For higher-order assignments, we sum the results for all genera under each taxon.
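The bootstrap procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the classifier's implementation: the function names, the pseudo-count of 0.5, and the fraction of words sampled per trial (1/8 here) are all assumed, because this page does not give those details.

```python
import math
import random
from collections import Counter


def kmers(seq, k=8):
    """List every k-base word in the sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(training, k=8):
    """training: dict mapping genus name -> list of training sequences.

    Returns genus -> (Counter of sequences containing each word, sequence count).
    """
    model = {}
    for genus, seqs in training.items():
        counts = Counter()
        for seq in seqs:
            counts.update(set(kmers(seq, k)))
        model[genus] = (counts, len(seqs))
    return model


def log_joint(word_list, counts, n_seqs, pseudo=0.5):
    """Log of the joint probability of the words under one genus (assumed smoothing)."""
    return sum(
        math.log((counts[w] + pseudo) / (n_seqs + 2 * pseudo)) for w in word_list
    )


def classify(query, model, trials=100, fraction=1 / 8, k=8, seed=0):
    """Return genus -> fraction of bootstrap trials in which that genus scored highest."""
    query_words = kmers(query, k)
    if not query_words:
        raise ValueError("query is shorter than the word size")
    rng = random.Random(seed)
    sample_size = max(1, int(len(query_words) * fraction))
    wins = Counter()
    for _ in range(trials):
        subset = rng.sample(query_words, sample_size)
        best = max(model, key=lambda g: log_joint(subset, *model[g]))
        wins[best] += 1
    return {genus: wins[genus] / trials for genus in model}


def rollup(confidence, parent_of):
    """Sum genus-level confidences under each parent taxon."""
    totals = Counter()
    for genus, value in confidence.items():
        totals[parent_of[genus]] += value
    return dict(totals)
```

Logarithms are summed rather than multiplying raw probabilities, which avoids numerical underflow when many words are combined. The confidence for a higher rank is obtained by calling rollup with a genus-to-parent mapping for that rank.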
:: Input Format
Currently, FASTA, GenBank and EMBL formats are accepted. Sequences may be in either uppercase or lowercase.
:: Input Sequence Limitation
A query sequence should have at least 200 bases for the classifier to give a good classification result. The number of query sequences is limited to 40,000. If you need more than 40,000 sequences classified, please contact rdpstaff@msu.edu.
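Before submitting, a batch of queries can be checked against these two limits. The helper below is a hypothetical pre-submission check written for illustration; it is not part of the classifier.

```python
MIN_BASES = 200      # minimum length for a good classification result
MAX_QUERIES = 40000  # maximum number of query sequences per submission


def check_queries(seqs):
    """seqs: dict mapping sequence name -> sequence string.

    Raises ValueError if there are too many queries; otherwise returns the
    names of sequences shorter than MIN_BASES.
    """
    if len(seqs) > MAX_QUERIES:
        raise ValueError(f"{len(seqs)} queries exceeds the limit of {MAX_QUERIES}")
    return [name for name, seq in seqs.items() if len(seq) < MIN_BASES]
```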
:: View Results in Hierarchy
After the sequences are classified, the classification results are shown in a taxonomic hierarchy.
The lineage will display the ancestors of the current root taxon, starting from the highest to the lowest rank.
The hierarchy view displays all the taxon nodes with sequences assigned to them in the hierarchical order. Each line contains the taxon rank, name and the number of sequences assigned to that taxon with the estimated confidence above the confidence cutoff value. The top taxon is the current taxon root. If you click the “show details” link on that line, a detailed classification result of the sequences that are assigned to that taxon will be displayed. Clicking any other taxon node will make that node display as the root and will update the hierarchy view.
Each taxon may have an unclassified node below it. See “Confidence Threshold” for details.
:: Confidence Threshold
For each rank assignment, the Classifier automatically estimates the classification reliability using bootstrapping. Ranks where sequences could not be assigned with a bootstrap confidence estimate above the threshold are displayed under an artificial 'unclassified' taxon. The default threshold is 80%.
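The effect of the threshold on a single query can be sketched as below. The function name and the tuple layout are assumptions for illustration, as is the choice to treat a confidence exactly equal to the threshold as passing.

```python
def apply_threshold(lineage, threshold=0.80):
    """lineage: list of (rank, taxon name, bootstrap confidence), highest rank first.

    Returns (rank, label) pairs in which any rank whose confidence falls
    below the threshold is labelled 'unclassified'.
    """
    return [
        (rank, name if confidence >= threshold else "unclassified")
        for rank, name, confidence in lineage
    ]


# Illustrative values only, not real classifier output.
example = [
    ("domain", "Bacteria", 1.00),
    ("phylum", "Firmicutes", 0.95),
    ("genus", "Bacillus", 0.62),
]
print(apply_threshold(example))
```

In this example the domain and phylum assignments are kept, while the genus assignment falls below the default 80% threshold and is reported as unclassified.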
:: Display Depth
Controls the number of ranks displayed in the hierarchy. With the default "Auto" setting, the program automatically adjusts the depth to display a reasonable number of lines in the browser. Increase the depth to see more ranks at the same time.
:: View Details
To view the detailed classification results for individual queries assigned to a taxon, click the “[show details]” link beside the current root taxon. Note that the symbol "-" after a sequence name means the results were obtained using the reverse complement of that query sequence. This indicates that the query sequence is in the reverse orientation.
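For reference, the reverse complement of a sequence can be computed as in this short Python sketch. The function is illustrative only; how the classifier itself decides on orientation is not described here.

```python
# Map each base to its complement; U is treated like T for RNA input.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```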
:: Classifier Limitation
For regions of less-well-studied bacterial diversity, query classification is often not well supported, even for higher taxonomic ranks. We have found that a high percentage of sequences from some environmental clone libraries are classified with less than 80% confidence, even at the phylum level. Such low confidence classification results may identify sequences where a thorough phylogenetic analysis is warranted.
:: Download as Text File
On the assignment detail page, click the "download as text file" button to save the results to your local hard disk. The results are saved as a semicolon-delimited text file, which can be imported into a spreadsheet program such as Excel.
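The downloaded file can also be read programmatically. The sketch below uses Python's standard csv module; the column layout is not documented on this page, so each row is returned as a plain list of fields without interpreting them, and the function name is hypothetical.

```python
import csv


def parse_results(lines):
    """Split semicolon-delimited lines into lists of fields, skipping blank lines.

    lines: any iterable of strings, such as an open file handle.
    """
    return [row for row in csv.reader(lines, delimiter=";") if row]
```

For a saved file, pass an open handle, for example: with open("results.txt", newline="") as handle: rows = parse_results(handle). The file name here is only a placeholder.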

