16S Unsupervised Workflow
Unsupervised processing and analysis of 16S rRNA data starts with the initial processing tool. You may follow this tutorial using your own sequences, or you may download and use the input files provided in the Pipeline Initial Processing tutorial. Prepare your input: a sequence file in FASTA or SFF format, a primer sequence, and a tag file. Then follow the Pipeline Initial Processing tutorial to begin processing your data.
The sequence files produced by the initial processing tool need to be checked for chimeras. For this task we recommend using one of two software packages freely available online.
First and easiest to use is the web-based version of the Decipher chimera detection tool at http://decipher.cee.wisc.edu/FindChimeras.html. If your sequence file is under 10 MB, you may submit the file to the Decipher web tool, making sure to check the “Short-length sequences” option. The results will be emailed to you, usually within a couple of hours.
The Decipher web tool accepts only sequence files under 10 MB. If the sequence files for individual samples from initial processing are larger than that and you still want to run Decipher, you will have to submit the file in pieces over multiple jobs, or download Decipher and install and run it locally. RDP’s internal tests have shown that on a single workstation the command line version of Decipher can be slow for large jobs, so in this case we recommend using the UCHIME software package.
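If you choose to submit a large file in pieces, the split has to fall between records so that no sequence is cut in half. The following is a minimal sketch of one way to do that with awk. The function name split_fasta, the output naming scheme and the default limit of 9,000,000 bytes are illustrative assumptions, not part of the RDP or Decipher tools.

```shell
#!/usr/bin/env bash
# Split a FASTA file into pieces of at most about MAX_BYTES each, never
# breaking a record (header line plus its sequence lines) across two files.
# Hypothetical helper; usage: split_fasta INPUT.fasta OUTPUT_PREFIX [MAX_BYTES]
split_fasta() {
    local input="$1" prefix="$2" max_bytes="${3:-9000000}"
    awk -v max="$max_bytes" -v prefix="$prefix" '
        function flush_record() {
            if (rec == "") return
            # Open a new piece if this record would push the current one over the limit.
            # A single record larger than the limit still gets a piece of its own.
            if (part == 0 || (size > 0 && size + length(rec) > max)) {
                if (part > 0) close(out)
                part++
                out = sprintf("%s.part%03d.fasta", prefix, part)
                size = 0
            }
            printf "%s", rec > out
            size += length(rec)
            rec = ""
        }
        /^>/ { flush_record() }      # a new header ends the previous record
        { rec = rec $0 "\n" }        # accumulate the current record
        END { flush_record() }       # write the final record
    ' "$input"
}
```

For example, split_fasta Native_1_4_A.fasta Native_1_4_A would write Native_1_4_A.part001.fasta, Native_1_4_A.part002.fasta and so on, each of which can be submitted as a separate job. The ID lists returned for the pieces can then be concatenated into a single ID file.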
UCHIME is a faster, more accurate alternative to Decipher for those who are comfortable with a command line interface and compiling software from source code. RDP's testing of chimera checking tools suggests that UCHIME has a higher sensitivity to chimeras and lower false positive rate when compared to Decipher. UCHIME can be obtained from http://drive5.com/uchime/ as source code or a precompiled Linux binary. For tips on usage refer to http://drive5.com/uchime/uchime_quickref.pdf and http://drive5.com/uchime/practical_uchime.pdf.
After running Decipher or UCHIME, you will need to generate a text file that lists the IDs of chimeric sequences one per line. Decipher’s email output will provide this list for you. Simply save it in a text file. UCHIME will give you a detailed output file with sequence IDs in the second column and chimeric status in the last column. If you are using a command line interface you can use the command:
egrep '\?$|Y$' Native_1_4_A.uchime.txt | cut -f2 > ids.txt
to create a text file listing the IDs. Once you have generated the ID file, you just need to retrieve the original FASTA file that was checked for chimeras. The sequence file and the ID file are the inputs for the web tool that produces a FASTA file containing only non-chimeric sequences. You can access the FASTA sequence selection tool. Make sure to check the ‘exclude sequences’ box to get a sequence file free of chimeras. If you also want the set of chimeric sequences, you can save another FASTA file with the ‘exclude sequences’ box unchecked.
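For readers who prefer to stay on the command line, the same selection can be approximated locally. This is a minimal sketch, not the RDP web tool itself: the function name filter_fasta and its include/exclude argument are hypothetical, and it assumes each sequence ID is the text between ">" and the first whitespace on the header line, matching the IDs written to ids.txt.

```shell
#!/usr/bin/env bash
# Keep or drop FASTA records according to a list of IDs (one per line).
# Hypothetical helper; usage: filter_fasta IDS.txt INPUT.fasta [exclude|include]
#   exclude (default): print records whose ID is NOT listed (non-chimeric set)
#   include:           print only records whose ID IS listed (chimeric set)
filter_fasta() {
    local ids="$1" fasta="$2" mode="${3:-exclude}"
    awk -v mode="$mode" '
        # While reading the ID file, remember every listed ID.
        FILENAME == ARGV[1] { if ($1 != "") listed[$1] = 1; next }
        # On each header line, decide whether the whole record is kept.
        /^>/ {
            id = substr($1, 2)
            keep = (mode == "exclude") ? !(id in listed) : (id in listed)
        }
        # Print the header and its sequence lines only for kept records.
        keep
    ' "$ids" "$fasta"
}
```

With the files from the example above, filter_fasta ids.txt Native_1_4_A.fasta > Native_1_4_A.nonchimeric.fasta would correspond to checking the ‘exclude sequences’ box, and passing include as the third argument would correspond to leaving it unchecked. The input and output file names here are placeholders.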
The next step is to align the newly obtained non-chimeric sequences. Take your sequence files and head over to the Aligner tutorial. Make sure to align each sequence file you upload as a separate sample, so that the data can later be clustered and analyzed by sample.
Next we will take the output from the aligner and submit all of the aligned files to the clustering tool as in the clustering tutorial. When given the option, choose to cluster the sequence files as separate samples.
The output from clustering, a .clust file, is useful for generating a wide range of statistical information about a set of samples. Once your .clust file is ready, you have the option of submitting it to multiple tools on RDP’s website and/or importing it into R for further analysis such as ordination, heat maps, etc. RDP provides four tools that can be used at this point for analysis:
- Rarefaction tool
- Shannon and Chao index calculator
- Jaccard and Sorensen based sample abundance calculator
Information on the tools and their use can be found in the clustering results tutorial.
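To give a sense of what the Shannon and Chao indices measure, the sketch below computes them from a plain list of OTU abundances for one sample. It does not parse a .clust file: the input format (one count per line) and the function name diversity_indices are assumptions made for illustration, and the RDP calculator may use a different variant of the Chao1 estimator. The formulas used are the textbook ones: Shannon H' = -sum(p_i * ln p_i), and Chao1 = S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers of singleton and doubleton OTUs.

```shell
#!/usr/bin/env bash
# Compute observed richness, Shannon index and Chao1 from OTU abundances.
# Hypothetical helper; usage: diversity_indices COUNTS.txt
# COUNTS.txt holds one positive integer per line (sequences per OTU, one sample).
diversity_indices() {
    awk '
        $1 > 0 {
            n[++s] = $1            # abundance of the s-th OTU
            total += $1
            if ($1 == 1) f1++      # singletons
            if ($1 == 2) f2++      # doubletons
        }
        END {
            if (total == 0) { print "no counts found"; exit 1 }
            for (i = 1; i <= s; i++) {
                p = n[i] / total
                h -= p * log(p)    # natural log, so H is in nats
            }
            # Classic Chao1 when doubletons exist, bias-corrected form otherwise.
            chao = (f2 > 0) ? s + (f1 * f1) / (2 * f2) : s + f1 * (f1 - 1) / 2
            printf "S_obs=%d shannon=%.4f chao1=%.2f\n", s, h, chao
        }
    ' "$1"
}
```

A sample dominated by a few large OTUs gives a low Shannon value, while many singletons relative to doubletons push the Chao1 estimate well above the observed OTU count, suggesting that deeper sequencing would uncover more OTUs.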