Align Protein Sequences Using HMMER3 Aligner
The aligner tool main page can be found at http://fungene.cme.msu.edu/FunGenePipeline/aligner/form.spr
The FunGene Pipeline uses HMMER3 (http://hmmer.janelia.org) for protein sequence alignments. HMMER3 has a more accurate scoring system than older alignment and search tools such as BLAST, at little to no extra computational cost. HMMER3 relies on profile HMMs: per-gene models that record how conserved each position is and which amino acids are common at that position.
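To make the idea of "how conserved each position is" concrete, here is a minimal Python sketch that tabulates per-column amino acid frequencies from a toy alignment. It is only an illustration: real profile HMMs built by HMMER3 also model insertions and deletions and apply priors, and the names column_profile, conservation and toy_alignment are made up for this example.

```python
from collections import Counter

GAP_CHARS = "-."  # assumption: both characters mark gaps in the alignment


def column_profile(aligned_seqs):
    """Return one {residue: frequency} dict per alignment column."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(
            s[col].upper() for s in aligned_seqs if s[col] not in GAP_CHARS
        )
        total = sum(counts.values())
        profile.append(
            {aa: n / total for aa, n in counts.items()} if total else {}
        )
    return profile


def conservation(profile):
    """Frequency of the most common residue per column (1.0 = fully conserved)."""
    return [max(col.values()) if col else 0.0 for col in profile]


# Hypothetical four-sequence alignment, not real nifH data.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MK-L"]
profile = column_profile(toy_alignment)
scores = conservation(profile)
print(scores[1])     # 0.75, since K appears in 3 of 4 sequences
print(profile[1])    # {'K': 0.75, 'R': 0.25}
```

Columns 1 and 4 of the toy alignment are fully conserved, while column 2 is dominated by K with a minority R, which is the kind of position-specific information a profile stores.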
RDP currently maintains models for about twenty genes. Models are built from a set of seed sequences obtained from trusted sources. The seed sequences are first aligned by another method, such as Clustal, and this alignment is fed into HMMER3 to build an initial model. The model is then refined by repeatedly aligning the seed sequences to the HMM and rebuilding the model from the new alignment. This cycle is repeated until the model parameters stop changing and the model is stable.
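The refinement cycle described above can be sketched in Python as follows. This is not RDP's actual build script. It assumes the HMMER3 programs hmmbuild and hmmalign are on the PATH, that the initial alignment is in a format hmmbuild can read (for example Stockholm), and it uses "the alignment text no longer changes" as a stand-in for "the model parameters stop changing". The function name refine_model, the file names and max_rounds are hypothetical.

```python
import subprocess
from pathlib import Path


def hmmbuild(hmm_path, alignment_path):
    # Usage: hmmbuild <hmmfile_out> <msafile>
    subprocess.run(
        ["hmmbuild", str(hmm_path), str(alignment_path)],
        check=True,
        stdout=subprocess.DEVNULL,
    )


def hmmalign(hmm_path, seeds_path, out_path):
    # Usage: hmmalign -o <outfile> <hmmfile> <seqfile>
    subprocess.run(
        ["hmmalign", "-o", str(out_path), str(hmm_path), str(seeds_path)],
        check=True,
    )


def refine_model(seeds_path, initial_alignment, workdir,
                 max_rounds=10, build=hmmbuild, align=hmmalign):
    """Rebuild the model and realign the seeds until the alignment is stable.

    Returns (path_to_model, rounds_used).
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    current = Path(initial_alignment)
    hmm = workdir / "model.hmm"
    for round_no in range(1, max_rounds + 1):
        build(hmm, current)                      # model from current alignment
        new_alignment = workdir / f"round_{round_no}.sto"
        align(hmm, seeds_path, new_alignment)    # realign seeds to the model
        # Assumption: an unchanged alignment means the model has converged.
        if new_alignment.read_text() == current.read_text():
            return hmm, round_no
        current = new_alignment
    return hmm, max_rounds                       # gave up after max_rounds
```

A call such as refine_model("seeds.fasta", "seeds_clustal.sto", "model_build") would leave the final model in model_build/model.hmm. The build and align parameters exist only so the loop can be exercised without HMMER installed.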
If you want to use the HMMER3 aligner for a gene that is not already in the database, you will need a set of confirmed sequences for your gene of interest. Contact RDP at email@example.com to send us your sequences, and we will build a model and make it available for use online.
If your sequence reads do not cover the complete amplicon or the same gene region, or if the quality drops at the distal end, you need to trim the reads so that they all cover the identical gene region before continuing with downstream analysis (such as clustering). Determining the best trimming position can be difficult; feel free to contact RDP staff for help.
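One simple way to think about trimming is in terms of alignment columns: find the column range that every aligned read covers and cut all reads to that range. The Python sketch below does this for a toy set of aligned reads. It is a heuristic for illustration, not the pipeline's own trimming procedure; the function names and example reads are hypothetical, and it assumes "-" and "." both mark gaps.

```python
GAP_CHARS = "-."  # assumption: both characters mark gaps in the alignment


def covered_span(aligned):
    """Return (start, end) of the columns a read covers, or None if all gaps."""
    cols = [i for i, ch in enumerate(aligned) if ch not in GAP_CHARS]
    if not cols:
        return None
    return cols[0], cols[-1] + 1  # end is exclusive


def trim_to_shared_region(aligned_seqs):
    """Trim every aligned read to the column range covered by all reads.

    aligned_seqs maps read name -> aligned sequence (all the same length).
    Returns (start, end, trimmed) where trimmed maps name -> trimmed sequence.
    """
    spans = {name: covered_span(seq) for name, seq in aligned_seqs.items()}
    spans = {name: s for name, s in spans.items() if s is not None}
    if not spans:
        raise ValueError("no read contains any residues")
    start = max(s[0] for s in spans.values())   # latest first residue
    end = min(s[1] for s in spans.values())     # earliest last residue
    if start >= end:
        raise ValueError("reads share no common region")
    trimmed = {name: aligned_seqs[name][start:end] for name in spans}
    return start, end, trimmed


# Hypothetical aligned reads with ragged ends.
reads = {
    "r1": "--MKVLAG",
    "r2": "ATMKVL--",
    "r3": "-TMKVLAG",
}
start, end, trimmed = trim_to_shared_region(reads)
print(start, end)      # 2 6
print(trimmed["r2"])   # MKVL
```

A single very short read shrinks the shared region for every read, so in practice it is often better to drop such reads first or to pick a fixed column range by inspecting the alignment.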
To align the sample input files, upload the files, select "nifh" from the gene name drop-down box, and click Align. An email will be sent to the address you provide when the process finishes. You can either let the results download automatically in your browser or follow the link in the email.
When sequences are submitted to the aligner, they are first dereplicated. The dereplication creates three files, which appear at the top level of the results directory (all_seqs_derep.fasta, all_seqs.ids and all_seqs.samples). The sample and id files are used to map the dereplicated sequences back to the original sequences when the sequences are expanded after alignment. The results folder also contains a folder called "alignment" that holds only the dereplicated, aligned sequences.
Finally, the expanded_mappings folder contains all of the original sequences in aligned form (*_prot_corr.fasta).
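The dereplicate, align, then expand workflow can be illustrated with a short Python sketch. It keeps the mapping in an in-memory dictionary; the actual layout of all_seqs.ids and all_seqs.samples is not described here, so the data structures and function names below are assumptions chosen for the example.

```python
def dereplicate(records):
    """Collapse identical sequences.

    records is an iterable of (seq_id, sequence) pairs.
    Returns (derep, id_map): derep is a list of (representative_id, sequence)
    and id_map maps each representative id to all ids sharing that sequence.
    """
    representative = {}  # sequence -> id of the first read seen with it
    id_map = {}          # representative id -> list of all matching ids
    for seq_id, seq in records:
        key = seq.upper()
        rep = representative.setdefault(key, seq_id)
        id_map.setdefault(rep, []).append(seq_id)
    derep = [(rep, seq) for seq, rep in representative.items()]
    return derep, id_map


def expand(aligned_unique, id_map):
    """Copy each aligned representative back to every original read id."""
    expanded = []
    for rep, aligned in aligned_unique:
        for seq_id in id_map[rep]:
            expanded.append((seq_id, aligned))
    return expanded


# Hypothetical reads; "a" and "c" are identical.
reads = [("a", "MKV"), ("b", "MKL"), ("c", "MKV")]
derep, id_map = dereplicate(reads)
print(derep)    # [('a', 'MKV'), ('b', 'MKL')]
print(id_map)   # {'a': ['a', 'c'], 'b': ['b']}

# Pretend the aligner returned these alignments for the unique sequences.
aligned_unique = [("a", "MK-V"), ("b", "MK-L")]
expanded = expand(aligned_unique, id_map)
print(expanded)  # [('a', 'MK-V'), ('c', 'MK-V'), ('b', 'MK-L')]
```

Only the unique sequences need to be aligned, which saves time when many reads are identical, and the mapping restores one aligned copy per original read afterwards.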