Seq Match Help
Taxonomy Choices
Select the taxonomy in which sequences will be displayed. This option is available only on the start page.
The Nomenclatural taxonomy displays sequences in a hierarchy based on a schema closely matching the one proposed in the new phylogenetically consistent higher-order bacterial taxonomy. Sequences are assigned using a naïve Bayesian classifier trained on sequences from known type strains.
NCBI displays sequences as classified in the NCBI taxonomy. This information is obtained directly from the sequence's GenBank record.
Data Set Options
Selecting different data sets alters the types of sequences available for browsing. You may select one or both options to restrict the kinds of sequences displayed.
- Strain
- Selecting Type restricts the display to only sequences of known type strains.
- Source
- Select Uncultured to restrict the display to only sequences of environmental samples. Selecting Isolates restricts the display to only sequences of isolates.
- Size
- Select >1200 Bases to restrict the display to only near-full-length sequences.
- Quality
- You can view only good quality sequences, only suspect quality sequences, or both. Suspect quality sequences are flagged with an asterisk (*).
Display Depth
Display depth controls the number of ranks displayed in the hierarchy. With the default "Auto" setting, the program automatically adjusts the depth to display a reasonable number of lines in the browser. Increase the depth to see more ranks at the same time.
Summary
The summary page shows the taxon to which each query sequence belongs, as agreed on by all the match sequences for that query.
The lineage will display the ancestors of the current root taxon, starting from the highest to the lowest rank. Each taxon is followed by the number of query sequences under that taxon.
The hierarchy view displays all the taxon nodes that have sequences assigned to them, in hierarchical order, starting from a root taxon. Each line contains the taxon rank, the name, and a short description of its status. The top taxon is the current root taxon. Clicking any other taxon node makes that node the root and updates the hierarchy view.
To change to a different data set, see "Data Set Options". After making the appropriate selections, click the Refresh button to update the view.
Selecting Matches
To select or deselect all sequences below a taxon, click the select or deselect icon in front of that taxon. You can also select or deselect a single sequence by checking or unchecking the checkbox in front of it. A partial-selection icon indicates that only a subset of the sequences below a taxon is selected; click the icon once to select all sequences. As you browse, the total number of selections in all data sets is displayed at the top of the page. Click the "Save selection and return to summary" button to go back to the summary page. The selections are also saved in the Sequence Cart. You can download the selections for local use.
Printer Friendly Results
From the summary page, click the "show printer friendly result" link next to the root taxon to display all the match results under that root taxon. The results for each query sequence are shown in hierarchical order.
Input Formats
Currently, FASTA, GenBank and EMBL formats are accepted. Sequences may be in uppercase or lowercase. The number of query sequences is limited to 2000. If you need more than 2000 sequences classified, please contact rdpstaff@msu.edu.
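Before submitting, it can help to check a FASTA file against the 2000-sequence limit. The following is a minimal sketch, not part of the RDP tools: the function names are hypothetical, and it assumes plain FASTA input in which every record starts with a ">" header line (GenBank and EMBL files are not handled).

```python
MAX_QUERIES = 2000  # query limit stated on this help page


def count_fasta_records(text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def within_query_limit(text, limit=MAX_QUERIES):
    """Return True if the FASTA text holds no more than `limit` records."""
    return count_fasta_records(text) <= limit
```

For example, `within_query_limit(open("queries.fasta").read())` would report whether a local file (the file name here is only an illustration) fits within the limit.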
Viewing Sequences
Individual sequences can be viewed by clicking on the RDP sequence identifier link (e.g., S000002414).
Starting a New Match
Clicking the "new match" link at the top right of each page removes all the selections you made during the previous browsing session and takes you back to the start page.
Result Format
Each match result line contains six elements, from left to right:
- 1. A short ID used to uniquely identify the RDP sequence. A click will return the simple entry, including the sequence.
- 2. The orientation of the query sequence when the match is performed. "-" means the query sequence has been reverse-complemented. A top match hit with "-" orientation usually indicates the query sequence is a minus strand.
- 3. A similarity score. SeqMatch reports the percent sequence identity over all pairwise comparable positions when run with aligned myRDP sequences. (Comparable positions are aligned positions containing a base in both sequences.) Note that the rank order may differ between S_ab and pairwise identity scores, but the top 20 S_ab scores will contain the closest sequence by pairwise identity about 95% of the time (Cole et al.). If two sequences do not overlap, the similarity between them is displayed as "?".
- 4. A seqmatch score (S_ab). This is the number of unique 7-base oligomers shared between your sequence and a given RDP sequence, divided by the lowest number of unique oligomers in either of the two sequences.
- 5. The number of uniquely occurring oligomers within a given sequence (Olis). If the same oligomer occurs more than once, it is counted only once; thus this number only approximately reflects the sequence length. Counting only unique oligomers compensates somewhat for composition bias (for example, inserts tend to be GC-rich, so the same GC-rich oligomers are very likely to occur several times; counting these only once makes this artifact less severe).
- 6. Full name. The definition line from the RDP distribution, often the same as the genus/species/strain name and accession number.
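The similarity score, S_ab score, and Olis count described above can be illustrated with a short sketch. This is not the SeqMatch implementation; it only follows the definitions given on this page. The function names are hypothetical, and treating "-" and "." as gap characters in aligned sequences is an assumption.

```python
def unique_oligos(seq, k=7):
    """Return the set of distinct k-base oligomers in a sequence (the Olis count is its size)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def s_ab(query, target, k=7):
    """Shared unique oligomers divided by the smaller of the two unique-oligomer counts."""
    a, b = unique_oligos(query, k), unique_oligos(target, k)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return None  # at least one sequence is shorter than k bases
    return len(a & b) / smaller


def pairwise_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity over aligned positions that contain a base in both sequences.

    Returns None when no positions are comparable (displayed as "?" in the results).
    """
    comparable = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x not in gap_chars and y not in gap_chars
    ]
    if not comparable:
        return None
    matches = sum(1 for x, y in comparable if x == y)
    return 100.0 * matches / len(comparable)
```

Because `unique_oligos` returns a set, a repeated oligomer contributes only once, which is why the Olis value only approximates the sequence length.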
Downloading Selected Sequences
The selection can be downloaded from the seqCART on the main tool menu.
Display KNN Matches
Display KNN matches controls the number of matches (k) displayed per sequence. The same number of matches is used to classify queries by unanimous vote. The maximum value for k is 20.
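One way to picture classification by unanimous vote is as the deepest part of the lineage on which all k matches agree. The sketch below works under that assumption; this page does not spell out the exact procedure, and the function name and the list-of-lineages input are hypothetical.

```python
def unanimous_lineage(match_lineages, k=20):
    """Return the longest lineage prefix shared by all of the top-k matches.

    `match_lineages` is a list of lineages, best match first; each lineage is a
    list of taxon names ordered from the highest to the lowest rank.
    """
    top = match_lineages[:k]
    if not top:
        return []
    shared = []
    for taxa_at_rank in zip(*top):
        # Stop at the first rank where the matches disagree.
        if any(taxon != taxa_at_rank[0] for taxon in taxa_at_rank):
            break
        shared.append(taxa_at_rank[0])
    return shared
```

Under this reading, a smaller k tends to give a deeper assignment, because fewer matches have to agree.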