RDP MultiClassifier -- a command line tool
Assign bacterial and archaeal 16S rRNA or fungal LSU sequences to the new phylogenetically consistent higher-order bacterial and fungal taxonomy
Single or multiple sequence files in FASTA, GenBank or EMBL format (Sample input files)
Compressed results folder containing the sequence count for each taxon in the hierarchy and assignment details on a sequence-by-sequence basis (Sample output files)
The command line RDP MultiClassifier uses the RDP naïve Bayesian Classifier to classify single or multiple files containing 16S rRNA or fungal LSU gene sequences. It outputs the assignment count for each taxon in hierarchy order, with one column per sample. This hierarchical count output is similar to an OTU table from clustering results and can be imported into statistical packages such as EstimateS or R for sample comparison.
Download the program from SourceForge . . . You can download the latest version, rdp_multiclassifier_x.x.zip, from SourceForge. No installation is necessary provided you have Java installed. At the time of writing, the latest version of MultiClassifier is 1.1.
Download the sample input files . . . The files are stored in the sample classifier_input zip file, which contains four sequence files from the output of the initial processing tutorial.
The main file used in running MultiClassifier is MultiClassifier.jar, located in the rdp_multiclassifier_1.1 folder downloaded from SourceForge. To run MultiClassifier, replace PATH in the commands shown below with the path to the directory containing MultiClassifier.jar on your system, so that /PATH/MultiClassifier.jar becomes, for example, /home/Downloads/rdp_multiclassifier_1.1/MultiClassifier.jar.
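Before running anything, it can help to confirm that both prerequisites (Java and the jar file) are in place. The following is only a convenience sketch and not part of MultiClassifier: `have_cmd` and `jar_exists` are hypothetical helper names, and /PATH is the same placeholder used throughout this tutorial.

```shell
# Hypothetical helpers, not part of MultiClassifier.

# Succeeds if the named program can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the given path is an existing regular file.
jar_exists() {
    [ -f "$1" ]
}

# Replace /PATH with the directory that holds MultiClassifier.jar.
JAR=/PATH/MultiClassifier.jar

if have_cmd java && jar_exists "$JAR"; then
    echo "Ready to run MultiClassifier"
else
    echo "Java or $JAR is missing" >&2
fi
```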
Example commands . . . From a terminal, run the following command to classify these sequences:
java -Xmx1g -jar /PATH/MultiClassifier.jar --conf=0.5 --hier_outfile=classification_hier.txt --assign_outfile=classification_detail.txt multiclassifier_input/*
An example command for running MultiClassifier
or list the file names to be classified:
java -Xmx1g -jar /PATH/MultiClassifier.jar --conf=0.5 --hier_outfile=classification_hier.txt --assign_outfile=classification_detail.txt multiclassifier_input/Native_1_2_A_trimmed.fasta multiclassifier_input/USGA_1_7_A_trimmed.fasta
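When the same command is run repeatedly with different inputs, a small wrapper can reduce retyping. The sketch below assembles the command shown above from a confidence cutoff and a list of files. The names `run_multiclassifier`, `JAR` and `DRY_RUN` are assumptions made for this illustration; only the java options themselves come from the tutorial. With `DRY_RUN=1` (the default here) the command is printed rather than executed, so it can be inspected first.

```shell
# Hypothetical wrapper; JAR, DRY_RUN and run_multiclassifier are names
# chosen for this sketch, not part of MultiClassifier.
JAR=${JAR:-/PATH/MultiClassifier.jar}
DRY_RUN=${DRY_RUN:-1}

# Usage: run_multiclassifier <confidence_cutoff> <file> [<file> ...]
run_multiclassifier() {
    conf=$1
    shift
    # Rebuild the argument list as the full java command line,
    # keeping the input files ("$@") at the end.
    set -- java -Xmx1g -jar "$JAR" --conf="$conf" \
        --hier_outfile=classification_hier.txt \
        --assign_outfile=classification_detail.txt "$@"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"      # print the command only
    else
        "$@"           # actually run MultiClassifier
    fi
}

# Print the command for every file in the sample input folder.
run_multiclassifier 0.5 multiclassifier_input/*
```

Setting `DRY_RUN=0` in the environment runs the printed command as is. The dry-run output joins arguments with spaces, so file names containing spaces will look ambiguous in the printed form even though they are passed correctly when executed.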
More general usage information . . .
java -Xmx1g -jar /PATH/MultiClassifier.jar [--gene=<16srrna|fungallsu>] [--train_propfile=<file>] [--assign_outfile=<file>] [--hier_outfile=<file>] [--shortseq_outfile=<file>] [--conf=<confidence_cutoff>] [--minWords=<min_words_per_bootstrap>] [--bootstrap_out=<file>] [--format=allrank,fixrank,db] sample_fasta_file[,dupCountInfile]...
[Required] At least one sample_fasta_file is required. Multiple sequence files are separated by spaces.
[Options]
- --gene= 16srrna or fungallsu. The MultiClassifier provides two training models: 16S rRNA and fungal LSU genes. The default training model is 16srrna. This option can be overridden by the --train_propfile option.
- --train_propfile= specifies the property file that contains the mapping of the training files. Note: the training files and the property file should be in the same directory. The default property file is set to data/classifier/rRNAClassifier.properties.
- --assign_outfile= specifies the output file containing the assignment details for each sequence. Default is null.
- --hier_outfile= specifies the output file containing the assignment count in the hierarchical format for each taxon. Default is standard output.
- --conf= specifies the assignment confidence cutoff used to determine the assignment count in the hierarchical format. Range [0-1]. Default is 0.8. For sequences shorter than 250 base pairs, a confidence threshold of 0.5 (50%) is recommended to improve classification coverage.
- --minWords= specifies the minimum number of words for each bootstrap trial. Default is 1/8 of the words. Minimum is 5.
- --bootstrap_out= specifies the output file containing the number of matching assignments out of 100 bootstraps for major ranks. Default is null.
- --format= allRank, fixRank or dbformat. Default is allRank. The "allRank" format outputs the results for all ranks that apply to each sequence. "fixRank" only outputs the results for major ranks: domain, phylum, class, order, family and genus. If a rank is missing from the lineage, the bootstrap value and the taxon name from the immediate lower rank are reported instead. This eliminates gaps in the lineage, but it also introduces taxon names and ranks that do not actually exist. "dbformat" outputs the seqname, trainset_no, tax_id and conf, which is convenient for storing results in a database.
- --shortseq_outfile= specifies the output file containing the sequence names that are too short to be classified.
- dupCountInfile specifies the input file containing the duplicate sequence count mapping. Default is null. Each line of a dupCountInfile has the form order_no<TAB>seqname1,seqname2,..., where seqname1 is the id used in the input file.
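The dupCountInfile layout can be checked with a short awk filter before the file is handed to MultiClassifier. This is a sketch that assumes each line is exactly an order number, a tab, and a comma-separated list of names; `dup_counts` is a hypothetical helper name and the sequence names are made up. For each line it prints the first name (the id present in the input file) and how many sequence names that line groups together.

```shell
# Hypothetical helper: reads a dupCountInfile on standard input and prints
# "<first seqname><TAB><number of names on that line>" for each line.
# Assumes the layout order_no<TAB>seqname1,seqname2,...
dup_counts() {
    awk -F '\t' 'NF >= 2 {
        n = split($2, names, ",")   # n = number of comma-separated names
        print names[1] "\t" n
    }'
}

# Made-up example: seqA stands for three sequences, seqD for one.
printf '1\tseqA,seqB,seqC\n2\tseqD\n' | dup_counts
# prints:
# seqA	3
# seqD	1
```

Following the usage line, a mapping file is attached to its sequence file with a comma, for example `sample.fasta,sample_dups.txt` (both file names are placeholders).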
The results are . . . two text files:
- Sequence by sequence assignment details which include the confidence value (0 to 1) for assignment at each level of the hierarchy
- Sequence count at each level of hierarchy separated by sample
The output files . . . can be used to generate:
- A graph showing percentage of assignment using the hierarchical count output (see sample in online Classifier tutorial)
- Ordination analysis using classification_hier.txt from the sample output files
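As a starting point for the percentage graph, the hierarchical counts can be converted to percentages at a single rank. The sketch below rests on assumptions that should be checked against the header of your own classification_hier.txt: the file is taken to be tab-delimited with one header row, and the positions of the rank, taxon name and sample count columns are passed in by the caller rather than hard-coded. `rank_percent` is a hypothetical helper name and the data in the example are invented.

```shell
# Hypothetical helper: reads a tab-delimited count table on standard input
# and prints "<taxon name><TAB><percent>" for rows at one rank.
# Usage: rank_percent <rank> <rank_column> <name_column> <count_column>
# Percentages are relative to the sum of the counts of the selected rows.
rank_percent() {
    awk -F '\t' -v rank="$1" -v rc="$2" -v nc="$3" -v cc="$4" '
        NR > 1 && $rc == rank {      # skip the header, keep matching rows
            name[++n] = $nc
            count[n] = $cc
            total += $cc
        }
        END {
            if (total == 0) exit     # nothing at this rank
            for (i = 1; i <= n; i++)
                printf "%s\t%.1f\n", name[i], 100 * count[i] / total
        }'
}

# Invented table: rank in column 4, name in column 3, one sample in column 5.
printf 'taxid\tlineage\tname\trank\tS1\n1\tx\tBacteria\tdomain\t10\n2\tx\tFirmicutes\tphylum\t6\n3\tx\tProteobacteria\tphylum\t4\n' |
    rank_percent phylum 4 3 5
# prints:
# Firmicutes	60.0
# Proteobacteria	40.0
```

Working at one rank at a time avoids double counting, since counts at higher ranks already include the sequences assigned to the taxa beneath them.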