Frameshift Correction and Closest Match Assignment by RDP FrameBot
The FrameBot tool main page can be found at http://fungene.cme.msu.edu/FunGenePipeline/framebot/form.spr
RDP FrameBot (Wang et al., 2013. mBio 4:e00592-13; doi: 10.1128/mBio.00592-13) is a tool for correcting frameshift errors caused by insertions and deletions in DNA sequences. Given a set of known protein reference sequences for a gene, FrameBot will take in nucleotide reads and return frameshift-corrected nucleotide and protein sequences and an optimal protein pairwise alignment. FrameBot checks the query DNA sequence in both forward and reverse directions and returns the results in the forward orientation.
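To illustrate what checking a read "in both forward and reverse directions" means, here is a minimal Python sketch that builds the reverse complement of a read. It is not FrameBot's code: FrameBot aligns the read against protein references, whereas this only shows the two orientations being considered. The function names are hypothetical.

```python
# Illustration only: the two orientations of a nucleotide read.
# Function names are hypothetical and not part of FrameBot.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def both_orientations(seq: str) -> dict:
    """Return the read as given and its reverse complement."""
    return {"forward": seq, "reverse": reverse_complement(seq)}


read = "ATGCCG"
orientations = both_orientations(read)
print(orientations["forward"])
print(orientations["reverse"])
```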
RDP currently maintains a set of reference sequences for about twenty genes. These may be selected from the dropdown box on the FrameBot webtool’s input page. For any other gene, the user must supply a file containing protein sequences for the target gene. FrameBot is computationally intensive because it performs an all-against-all comparison between the query DNA sequences and the target protein sequences; we therefore recommend limiting the number of protein target sequences to 200.
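Before uploading a custom reference file, it can help to check how many protein sequences it contains. The sketch below counts FASTA header lines and compares the count with the recommended limit of 200. The helper names and the example records are hypothetical.

```python
# Minimal sketch: count records in a protein FASTA reference set and
# compare against the recommended maximum of 200 target sequences.
# Helper names and example data are hypothetical.
from typing import Iterable

MAX_REFERENCES = 200  # recommended limit stated in the text


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def within_reference_limit(lines: Iterable[str], limit: int = MAX_REFERENCES) -> bool:
    """Return True if the number of records does not exceed the limit."""
    return count_fasta_records(lines) <= limit


example = [">ref1", "MRQCAIYGKG", ">ref2", "MRQIAFYGKG"]
print(count_fasta_records(example), within_reference_limit(example))
```

To check a file on disk, pass an open file handle instead of the list, for example `count_fasta_records(open("my_refs.fasta"))`, where the file name is a placeholder.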
The choice of percent identity is important because FrameBot filters out sequences based on this value, which is entered as a fraction between 0 and 1. Each query sequence is matched to its best-matching reference sequence, and the percent identity is calculated from their alignment. If this value is below the specified threshold, the sequence is filtered out. For a very stringent filter, use a value around 0.80; for more relaxed settings, stick with the default of 0.40. If you're following the tutorial, keep the default settings. Your screen should look like this after uploading the sample input and selecting nifH:
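The filtering rule can be sketched in Python as follows. This assumes identity is the fraction of identical residues over aligned columns that contain no gap; the exact definition FrameBot uses is not given here and may differ. The function names and the example alignment are hypothetical.

```python
# Minimal sketch of an identity-threshold filter.
# Assumption: identity = identical residues / ungapped aligned columns.
# FrameBot's own calculation may differ; names here are hypothetical.

def percent_identity(aligned_query: str, aligned_ref: str) -> float:
    """Fractional identity of two gapped, equal-length alignment strings."""
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for q, r in zip(aligned_query, aligned_ref):
        if q == "-" or r == "-":
            continue  # skip columns containing a gap
        compared += 1
        if q == r:
            matches += 1
    return matches / compared if compared else 0.0


def passes_filter(aligned_query: str, aligned_ref: str, threshold: float = 0.40) -> bool:
    """A sequence is kept unless its identity is below the threshold."""
    return percent_identity(aligned_query, aligned_ref) >= threshold


identity = percent_identity("MKQ-AIY", "MRQCAIY")
print(round(identity, 3), passes_filter("MKQ-AIY", "MRQCAIY"))
```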
When the job is done, the download starts automatically if the web page is still open, and an email with a link to the results is sent to you.
Submitting your sequences to FrameBot in this way not only performs frameshift correction and translation of DNA to protein but also dereplicates the sequences. In the main output directory you will find the dereplicated sequence file and the corresponding sample and id files.
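As a rough illustration of what dereplication means, the sketch below collapses identical sequences and records which read IDs share each one. It is a generic example, not RDP's implementation, and it does not reproduce the format of the sample and id files. The function name and the example reads are hypothetical.

```python
# Generic dereplication sketch: group read IDs by identical sequence.
# Not RDP's implementation; names and data are hypothetical.
from typing import Dict, Iterable, List, Tuple


def dereplicate(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each unique sequence (case-insensitive) to the IDs that carry it."""
    groups: Dict[str, List[str]] = {}
    for seq_id, seq in records:
        groups.setdefault(seq.upper(), []).append(seq_id)
    return groups


reads = [("r1", "ATGC"), ("r2", "atgc"), ("r3", "ATGA")]
groups = dereplicate(reads)
for seq, ids in groups.items():
    print(seq, len(ids), ids)
```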
FrameBot has six output files:
- all_seqs_derep_framebot.txt - The alignment of each query to its closest match, with the percent identity value.
- all_seqs_derep_nucl_corr.fasta and all_seqs_derep_prot_corr.fasta - The frameshift-corrected nucleotide and protein sequences.
- all_seqs_derep_failed_framebot.txt - A list of query sequences that fall below the percent identity threshold set by the user.
- all_seqs_derep_nucl_failed.fasta - A FASTA file containing the nucleotide sequences that fail FrameBot's percent identity filter.
- all_seqs_derep_stdout.txt - FrameBot's standard output. Any errors will appear here.
The framebot.txt file contains the pairwise alignment and many important statistics for each sequence.
In the graphic above, a deletion and an insertion are highlighted. In the nucleotide sequence, each is replaced by the number of nucleotides that were present before the correction was made. The STATS line above each alignment contains the values for percent identity, length, score, and number of frameshifts.
The FrameBot nearest neighbor assignments can be used to group reads by nearest match and compare their relative abundances, or to view differences among samples using ordination analysis.
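Grouping reads by nearest match can be sketched as a simple tally. The example below assumes you have already extracted (sample, nearest match) pairs from the FrameBot results; that input layout, the function name, and the example labels are hypothetical.

```python
# Minimal sketch: per-sample relative abundance of nearest-match assignments.
# The (sample, nearest_match) input layout is an assumption for illustration.
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def relative_abundance(
    assignments: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Return {sample: {nearest_match: fraction of that sample's reads}}."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sample, match in assignments:
        counts[sample][match] += 1
    result: Dict[str, Dict[str, float]] = {}
    for sample, counter in counts.items():
        total = sum(counter.values())
        result[sample] = {m: n / total for m, n in counter.items()}
    return result


assignments = [("s1", "refA"), ("s1", "refA"), ("s1", "refB"), ("s2", "refB")]
abundance = relative_abundance(assignments)
print(abundance)
```

A table of this kind, with samples as rows and nearest matches as columns, is the usual starting point for an ordination analysis.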
PCA Analysis Using FrameBot Nearest Neighbor Assignments:
Relative abundances of NEON reads grouped by nearest matches at the phylum and class levels: