What is the Functional Gene Pipeline/Repository (FGPR)?
FGPR is a web-based resource from RDP for analyzing functional genes of ecological importance. It consists of a database of protein and coding sequences organized by gene family, an interactive interface for displaying and accessing them, and a pipeline for processing and analyzing NGS sequence data. It is intended to support functional genomics studies, especially environmental ones, and is updated monthly.
Where does the search result data come from?
The data for each FGPR gene family are compiled from an HMM search. A protein model is built from a set of diverse, well-characterized "training sequences" submitted by experts, and the Hidden Markov Model (HMM) search program uses that model to search the NCBI non-redundant protein database and WGS. This is the same program used to create the PFAM database of protein motifs. Searches can be repeated with the same models whenever the protein database is updated.
Each gene is also searched for common protein motifs using the PFAM database, and the scores for these conserved motifs are included in the FGPR output. These scores can help distinguish unrelated "hits" that merely share a common protein motif with the gene of interest from related but highly diverged sequences.
For each "hit," the corresponding protein and nucleic acid records are retrieved. The protein "hits" are aligned using the HMM, and the nucleic acid sequences are aligned by back-translating from the protein alignment. Source organism, reference information, and other details extracted from the records are linked into the FGPR output.
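The back-translation step above can be illustrated with a minimal sketch. It assumes each aligned protein uses "-" as its only gap character and that the coding sequence starts in frame at the first aligned residue. The function name `thread_codons` and the example sequences are hypothetical, and this is not FGPR's actual code.

```python
def thread_codons(aligned_protein: str, coding_dna: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto an aligned protein sequence.

    Each residue in the protein alignment consumes one codon (3 nt) of the
    coding sequence; each gap column becomes three gap characters.
    Assumption: the coding sequence is in frame and starts at the first residue.
    """
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(coding_dna) < 3 * residues:
        raise ValueError("coding sequence is too short for the aligned protein")

    pieces = []
    pos = 0  # index of the next unused nucleotide
    for ch in aligned_protein:
        if ch == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(coding_dna[pos:pos + 3])
            pos += 3
    return "".join(pieces)


# Hypothetical two-sequence protein alignment and the matching coding sequences.
aligned = {"hitA": "MK-LV", "hitB": "MKTLV"}
coding = {"hitA": "ATGAAACTGGTT", "hitB": "ATGAAAACCCTGGTG"}

nucleotide_alignment = {
    name: thread_codons(aligned[name], coding[name]) for name in aligned
}
for name, row in nucleotide_alignment.items():
    print(name, row)
```

Because every protein column maps to exactly three nucleotide columns, the resulting nucleic acid alignment keeps codons intact and has three times the width of the protein alignment.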
How do HMM searches compare to BLAST?
Because HMMs are built from a set of training sequences, they carry much more information than the single query sequence used in BLAST. The training set helps define which regions are more conserved and which changes are most common. It has been shown mathematically that the statistical test used in BLAST is essentially equivalent to a type of HMM search with a single training sequence. BLAST is much faster than HMM searches because it uses a heuristic to filter out sequences that are unlikely to match.
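To make the point about conserved regions concrete, here is a simplified sketch of position-specific scoring built from a small aligned training set. It is not a full profile HMM: it has no insert or delete states, and it assumes a uniform background and a fixed pseudocount. The training sequences and the names `build_profile` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(training, pseudocount=1.0):
    """Return per-column log-odds scores (bits) against a uniform background.

    Assumption: training sequences are already aligned and of equal length.
    """
    length = len(training[0])
    if any(len(seq) != length for seq in training):
        raise ValueError("training sequences must be aligned to the same length")

    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in training if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, seq):
    """Sum the column scores for an ungapped sequence of profile length."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Hypothetical training set: column 0 is fully conserved, column 1 is variable.
training = ["MKLV", "MRLV", "MKIV", "MQLV"]
profile = build_profile(training)

cost_conserved = score(profile, "MKLV") - score(profile, "AKLV")
cost_variable = score(profile, "MKLV") - score(profile, "MALV")
print(f"mismatch cost at conserved column: {cost_conserved:.2f} bits")
print(f"mismatch cost at variable column:  {cost_variable:.2f} bits")
```

A substitution in the fully conserved column costs more than one in the variable column, which is information a single query sequence cannot supply.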
How do I use the FGPR?
Video tutorials are here.
For each search, you are first shown a list of "hits" ordered by score. The starting "training sequences" are shown in color. Jump to the bottom of the list to change the ordering or to filter the results by score, size, or source (environmental clone vs. isolated organism). Hint: after you have set the filters and ordering to your preference, you can save the page as a bookmark in your browser.
For searches that return a large number of results, the score filter is preset to exclude the less meaningful ones. To display the excluded results, change the filter value. You can choose to display only non-redundant protein hits or to include redundant entries. (For example, NCBI sometimes treats a well-known training sequence as a redundant entry if an identical protein sequence is available.)
Protein or nucleic acid alignments can be downloaded for any subset of hits. Analysis tools are being added; current tools include a neighbor-joining phylogenetic tree builder and a primer/probe tester.
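If you download hit data and want to reproduce the ordering and filtering offline, a sketch like the following may help. The `Hit` record, its field names, the source labels, and the example values are all hypothetical and do not reflect an FGPR export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    pid: str        # protein identifier (made-up values below)
    score: float    # HMM score in bits
    size_aa: int    # protein length
    source: str     # assumed labels: "isolate" or "environmental"


def filter_hits(hits, min_score=0.0, min_size=0, source=None, sort_key="score"):
    """Keep hits passing all filters, sorted in descending order of sort_key."""
    kept = [
        h for h in hits
        if h.score >= min_score
        and h.size_aa >= min_size
        and (source is None or h.source == source)
    ]
    return sorted(kept, key=lambda h: getattr(h, sort_key), reverse=True)


hits = [
    Hit("P1", 512.3, 296, "isolate"),
    Hit("P2", 87.5, 120, "environmental"),
    Hit("P3", 430.0, 290, "environmental"),
    Hit("P4", 15.2, 60, "isolate"),
]

top = filter_hits(hits, min_score=50.0)
env = filter_hits(hits, min_score=50.0, source="environmental")
print([h.pid for h in top])
print([h.pid for h in env])
```

Lowering `min_score` brings back the low-scoring hits, in the same way that changing the filter value does on the results page.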
What are the columns in the FGPR display?
There are 12 columns in the FGPR display:
- Select: A checkbox to select the "hit" for download or further analysis.
- Score: The score from the HMM search, in bits saved. Directly analogous to the bit score in BLAST.
- PID, NID: Protein and nucleic acid identifiers with links. NID links are only to the gene coding portion of the nucleic acid record. Some protein hits were not translated from the nucleic acid and do not have a corresponding NID.
- Definition: From the NCBI protein record.
- Organism: From the NCBI protein record.
- Occ.: Occurrence, the number of HMM matches found in the protein. Should normally be 1. Any other number may indicate a false hit.
- % of HMM Coverage: Percentage of the HMM model that matches the hit protein sequence.
- % of HMM Identity: Percent identity between the hit protein sequence and the HMM consensus sequence.
- Size(aa): The length of the protein.
- Reference: The first reference listed in the NCBI protein record. For those references abstracted by PubMed, a link is provided.
- Motif(n): Hits are scored against PFAM-A HMMs for common protein motifs present in the gene of interest. Links to the corresponding PFAM records are given at the top of the table.
- Notes and View/Edit: A place for members to add short notes about a particular "hit."
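The coverage and identity columns described above can be approximated as follows. This sketch assumes the hit has been aligned to the model's consensus positions with "-" marking model positions the hit does not cover. The exact denominators FGPR uses are not stated here, so treat these definitions as one plausible reading. The function name and the example strings are hypothetical.

```python
def coverage_and_identity(consensus: str, aligned_hit: str, gap: str = "-"):
    """Return (% of model positions covered, % identity over covered positions).

    Assumption: both strings have one character per HMM model position.
    """
    if not consensus or len(consensus) != len(aligned_hit):
        raise ValueError("consensus and aligned hit must be non-empty and equal in length")

    # Pair up only the model positions where the hit has a residue.
    covered = [(c, h) for c, h in zip(consensus, aligned_hit) if h != gap]
    coverage = 100.0 * len(covered) / len(consensus)

    identical = sum(1 for c, h in covered if c.upper() == h.upper())
    identity = 100.0 * identical / len(covered) if covered else 0.0
    return coverage, identity


# Hypothetical 8-position model; the hit covers the first 5 positions.
cov, ident = coverage_and_identity("MKTLVAGG", "MKSLV---")
print(f"coverage: {cov:.1f}%  identity: {ident:.1f}%")
```

In this example the hit covers 5 of 8 model positions (62.5%) and matches the consensus at 4 of those 5 (80.0%). A low coverage value combined with an Occ. value other than 1 would be a reason to inspect the hit more closely.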
References and Support
1. R. Durbin, S. Eddy, A. Krogh, G. Mitchison. (1998) The theory behind profile HMMs. In: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press.
2. A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, S.R. Eddy. The Pfam Protein Families Database. Nucleic Acids Res. (2004) Database Issue 32:D138-D141.
3. D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, D.L. Wheeler. GenBank: update. Nucleic Acids Res. (2004) Database Issue 32:D23-D26.