Taxomatic is a tool for visually displaying similarity matrices. There are three ways to supply source data: sequences selected from RDP, an aligned fasta file, or a DNADist file. When sequences selected from RDP are used, only model positions are taken into account when calculating the distance matrix. When either RDP sequences or an uploaded fasta file are used as the source data, a similarity matrix is computed for the selection using uncorrected pairwise gene frequencies. Uploaded fasta files should contain only comparable positions.
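As a rough illustration of an uncorrected pairwise calculation, the sketch below computes the plain proportion of differing positions between two aligned sequences. Reading "uncorrected" this way (often called p-distance) is an assumption, as are the gap characters that are skipped and the function names `p_distance` and `distance_matrix`; none of them come from the Taxomatic implementation.

```python
from typing import Dict, Tuple

GAP_CHARS = "-."  # assumption: these characters mark non-comparable positions


def p_distance(a: str, b: str) -> float:
    """Fraction of comparable aligned positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        # The two sequences do not overlap, so they cannot be compared.
        raise ValueError("sequences share no comparable positions")
    return mismatches / compared


def distance_matrix(seqs: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """All-against-all distances, keyed by (id, id) pairs."""
    ids = list(seqs)
    return {(i, j): p_distance(seqs[i], seqs[j]) for i in ids for j in ids}
```

Under this reading, a similarity value is simply one minus the distance, for example `1 - p_distance("ACGT", "ACGA")`.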
When displaying the distance matrix in Taxomatic, you may either use the RDP taxonomy or supply your own.
To use the RDP taxonomy, use sequences from your Sequence Cart or upload a fasta/DNADist file with RDP Sequence IDs. If you are using sequences from your myRDP account, you do NOT have to supply your own taxonomy; when your sequences are uploaded, they are classified for you. If you choose to upload your own taxonomy, it must be in the specified XML format (see below for an example). An uploaded taxonomy overrides any taxonomy information from the RDP, so there must be a <Sequence> tag for every sequence that you want to use from the data source; sequences without taxonomy information will be ignored.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<taxonomy xmlns="http://rdp.cme.msu.edu/taxomatic" name="RDP User download">
  <Taxon id="780" rank="family" name="Chlorobiaceae">
    <Taxon id="792" rank="genus" name="Chlorobaculum">
      <Sequence seqid="S000779102" name="uncultured Chlorobaculum sp.; 16/3-110"/>
      <Sequence seqid="S000779116" name="uncultured Chlorobaculum sp.; i9-114"/>
      <Sequence seqid="S000779113" name="uncultured Chlorobaculum sp.; i9-9"/>
      <Sequence seqid="S000779099" name="uncultured Chlorobaculum sp.; 16/3-102"/>
      <Sequence seqid="S000779110" name="uncultured Chlorobaculum sp.; 16/3-167"/>
      <Sequence seqid="S000779120" name="uncultured Chlorobaculum sp.; i9-111"/>
      <Sequence seqid="S000779109" name="uncultured Chlorobaculum sp.; 16/3-165"/>
      <Sequence seqid="S000779101" name="uncultured Chlorobaculum sp.; 16/3-108"/>
      <Sequence seqid="S000779104" name="uncultured Chlorobaculum sp.; 16/3-119"/>
    </Taxon>
  </Taxon>
</taxonomy>
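Before uploading a custom taxonomy, it can help to check which sequences it actually covers, since sequences without a <Sequence> entry are ignored. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `sequence_lineages` and the shortened `SAMPLE` document are illustrative assumptions, and the namespace URI is taken from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <taxonomy> element in the example file.
NS = {"t": "http://rdp.cme.msu.edu/taxomatic"}

# Shortened copy of the example taxonomy, used only for illustration.
SAMPLE = """<taxonomy xmlns="http://rdp.cme.msu.edu/taxomatic" name="RDP User download">
  <Taxon id="780" rank="family" name="Chlorobiaceae">
    <Taxon id="792" rank="genus" name="Chlorobaculum">
      <Sequence seqid="S000779102" name="uncultured Chlorobaculum sp.; 16/3-110"/>
      <Sequence seqid="S000779116" name="uncultured Chlorobaculum sp.; i9-114"/>
    </Taxon>
  </Taxon>
</taxonomy>"""


def sequence_lineages(xml_text):
    """Map each seqid to its list of (rank, name) taxa, outermost first."""
    root = ET.fromstring(xml_text)
    lineages = {}

    def walk(node, lineage):
        # Descend into nested taxa, extending the lineage at each level.
        for taxon in node.findall("t:Taxon", NS):
            walk(taxon, lineage + [(taxon.get("rank"), taxon.get("name"))])
        # Record every sequence listed directly under this node.
        for seq in node.findall("t:Sequence", NS):
            lineages[seq.get("seqid")] = lineage

    walk(root, [])
    return lineages


lineages = sequence_lineages(SAMPLE)
# IDs present in the data source but absent from the taxonomy would be ignored.
source_ids = {"S000779102", "S000779116", "S000779113"}
missing = source_ids - set(lineages)
```

Here `missing` collects the IDs from the hypothetical data source that have no taxonomy entry and would therefore be dropped.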
What is SOSCC
SOSCC is an experimental distance matrix optimization algorithm that can be used to detect misplaced sequences. Before viewing on Taxomatic, you have the option to preprocess the matrix with SOSCC.
What you can control
After passing through the core of the SOSCC algorithm, the provided taxonomy data is reapplied to the distance matrix. This is done using a scoring algorithm with the following steps:
1. Find where the archetype sequence for the group ends up.
2. Look at the adjacent sequences:
   - if the sequence is a member of the group, add the hit score to the current score;
   - if the sequence is not a member of the group, add the miss score to the current score.
3. Repeat step 2 for every sequence in the matrix (moving toward the edges of the matrix).
4. The position where the score reaches its maximum is treated as the new bound for that group.
Manipulating the hit and miss scores changes how sensitive the taxonomy rebuild algorithm is to misplaced sequences. A larger hit score makes it less sensitive; a larger miss score makes it more sensitive.
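The scoring steps and the effect of the two scores can be sketched as follows. This is one possible reading, not the SOSCC code: it assumes the optimized matrix yields a linear ordering of sequences, that the miss score acts as a penalty (it is subtracted here), that each direction from the archetype is scored independently, and that ties go to the position nearest the archetype. The function name `group_bounds` and the sample IDs are hypothetical.

```python
def group_bounds(order, members, archetype, hit=1.0, miss=2.0):
    """Return (left, right) inclusive indices bounding the group in `order`."""
    start = order.index(archetype)

    def extend(step):
        # Walk from the archetype toward one edge, tracking the running score.
        score = 0.0
        best_score = 0.0
        best = start
        pos = start + step
        while 0 <= pos < len(order):
            # Assumption: a miss lowers the score, a hit raises it.
            score += hit if order[pos] in members else -miss
            if score > best_score:
                best_score, best = score, pos
            pos += step
        return best

    return extend(-1), extend(+1)


order = ["x1", "a1", "a2", "A", "a3", "y1", "a4", "y2", "y3"]
members = {"a1", "a2", "A", "a3", "a4"}

strict = group_bounds(order, members, "A", hit=1.0, miss=2.0)
lenient = group_bounds(order, members, "A", hit=3.0, miss=2.0)
```

With the lower hit score the group ends before `y1`, leaving `a4` outside the bounds as a candidate misplaced sequence; with the larger hit score the bound extends past `y1` to include `a4`, which matches the described loss of sensitivity.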
Note: Bootstrapping is only available when a fasta file is supplied or RDP sequences are used.
To provide statistical evidence for any taxonomic changes the SOSCC algorithm makes, you may opt to bootstrap (currently fixed at 100 bootstraps). SOSCC then generates alignments using random selection with replacement and performs the SOSCC optimization on each alignment, keeping track of the taxonomy of every sequence in each bootstrap. Once the bootstraps are complete, statistics are computed from the resulting taxonomy of each sequence across all bootstraps. The user-supplied confidence level specifies the percentage of bootstraps in which a sequence must end up in a group for the move to be considered supported. If a move is not supported, the sequence is reverted to its original taxonomic group (sometimes this cannot be done, most notably if the group disappeared during optimization).
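The two halves of the bootstrap procedure can be sketched as below. The text does not say what is randomly selected, so resampling alignment columns (the usual phylogenetic bootstrap) is an assumption, as is treating support equal to the confidence level as sufficient. The names `resample_columns` and `supported_group` are hypothetical, and the case where the original group has disappeared is not handled.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """Build one bootstrap alignment by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    # The same sampled column indices are applied to every sequence.
    cols = [rng.randrange(length) for _ in range(length)]
    return {sid: "".join(seq[c] for c in cols) for sid, seq in alignment.items()}


def supported_group(original, proposed, replicate_groups, confidence):
    """Keep `proposed` only if enough bootstrap replicates agree with it."""
    support = Counter(replicate_groups)[proposed] / len(replicate_groups)
    # Assumption: support equal to the confidence level counts as supported.
    return proposed if support >= confidence else original


rng = random.Random(0)
alignment = {"s1": "ACGT", "s2": "AGGT"}
replicate = resample_columns(alignment, rng)

# A move seen in 96 of 100 replicates is kept at a 95% confidence level;
# one seen in 90 of 100 is reverted to the original group.
kept = supported_group("G1", "G2", ["G2"] * 96 + ["G1"] * 4, 0.95)
reverted = supported_group("G1", "G2", ["G2"] * 90 + ["G1"] * 10, 0.95)
```

In a full run, each replicate alignment would be passed through the distance calculation and SOSCC optimization, and the group assigned to each sequence would be appended to its `replicate_groups` list.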
Notes, Considerations, Limitations
Currently, Taxomatic/SOSCC is limited to 2,000 sequences, and sequences must overlap by at least one base to be comparable. With SOSCC, bootstrapping can take a long time (up to 5 hours for 2,000 sequences when the server is under heavy load).
Bootstrapping is not available for uploaded DNADist matrices, but you can still optimize the matrix using SOSCC. Uploaded fasta files should contain only comparable positions.
Original SOSCC implementation provided by Scott Harrison.