Running the Defined Community Analysis

Quick Overview
Tasks:

  1. Compute a global alignment between each read and each of the reference sequences
  2. Identify the source organism, substitutions, indels and associated quality scores using the pairwise alignment that produces the highest alignment score
  3. Summarize the types of errors found in the input reads

Input:

  1. Sequence file (FASTQ)
  2. Reference nucleotide sequence file
  3. Quality file (optional)

(Sample input files)

Output:

A folder for each tag file containing:

  1. a text file (*_pairwise.aln) containing the pairwise alignments
  2. a tab-delimited file (*_mismatch.txt) containing each mismatch error
  3. a tab-delimited file (*_indel.txt) containing each insertion and deletion error
  4. a summary file (*_error_summary.txt) that can be imported into Excel to calculate error rates and make plots
  5. a tab-delimited file (*_qual.txt) containing the read Q score for each sequence, if a quality file is provided

(Sample output files)


The Process:

The Defined Community Analysis tool main page can be found at http://fungene.cme.msu.edu/FunGenePipeline/error_analysis/form.spr.

This tool compares input nucleotide reads to the set of known sequences for amplification targets in the sequenced DNA. It determines the numbers and types of errors present in the reads. It may also help determine appropriate quality filters for the dataset from the same sequencing run.
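The comparison step can be illustrated with a minimal sketch of global (Needleman-Wunsch) alignment scoring, followed by selection of the highest-scoring reference. This is not the tool's own code: the scoring values (match, mismatch, gap), the linear gap penalty, the function names and the toy sequences are all assumptions made for illustration.

```python
def global_alignment_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty.

    The scoring values are illustrative assumptions, not the tool's settings.
    """
    # prev[j] holds the best score for aligning the read prefix seen so far
    # against the first j bases of the reference.
    prev = [j * gap for j in range(len(ref) + 1)]
    for i in range(1, len(read) + 1):
        curr = [i * gap]
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            diagonal = prev[j - 1] + step   # align the two bases
            up = prev[j] + gap              # gap in the reference
            left = curr[j - 1] + gap        # gap in the read
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


def closest_reference(read, references):
    """Return (reference ID, score) for the highest-scoring reference."""
    scored = ((ref_id, global_alignment_score(read, seq))
              for ref_id, seq in references.items())
    return max(scored, key=lambda pair: pair[1])


# Toy sequences, invented for this example.
references = {"refA": "ACGTACGT", "refB": "ACGTTTTT"}
best_id, best_score = closest_reference("ACGTACGA", references)
print(best_id, best_score)
```

Here the read differs from refA by a single substitution, so refA is chosen as the closest reference. A real run would also need the traceback to report where each mismatch and indel falls.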

It requires a sequence file and a reference sequence file. The sequence file, obtained from the sequencing center, can be in FASTA format or FASTQ format (which contains both the sequence and quality information). The defined community reads and the reference sequences should cover the same region of the gene. If they do not, you can trim the reference sequences to the amplicon region by using the Initial Processing Tool with the corresponding forward and reverse primers.
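To show how a FASTQ file carries both sequence and quality information, here is a minimal sketch that reads four-line FASTQ records and computes an average Q score per read. It assumes Phred+33 quality encoding and unwrapped four-line records, and it uses the arithmetic mean as the read Q score; the tool's own definition of the read Q score is not stated on this page, so treat that as an assumption. The record shown is invented.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        # Assumption: Phred+33 encoding (ASCII code minus 33).
        scores = [ord(ch) - 33 for ch in quality]
        read_id = header[1:].split()[0]  # drop '@' and any description
        yield read_id, sequence, scores


def mean_q(scores):
    """Arithmetic mean of per-base Q scores (an assumed definition)."""
    return sum(scores) / len(scores)


# Invented record: 'I' is Q40, '5' is Q20, '?' is Q30 under Phred+33.
example = io.StringIO("@read1 sample\nACGT\n+\nII5?\n")
records = list(read_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, mean_q(scores))
```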

It is necessary to check your input reads for chimeras (with UCHIME) and for contamination (with SeqMatch or BLAST). See our publication "FunGene: the functional gene pipeline and repository".


Download the sample input files for this tutorial. The sample input zip file includes the following files:
* a FASTQ sequence file (mid01_trimmed.fastq)
* a reference sequence file (nifH_control_refseq_nucl_slice.fa)


Uploading your data to the web interface.

Output . . . contains the following files (download sample output zip file):

  • a text file (*_pairwise.aln) containing the pairwise alignment between each read and its closest reference sequence
  • a tab-delimited file (*_mismatch.txt) containing each mismatch error in the following format: read ID, closest reference sequence ID, mismatch position in the alignment, the expected base, observed base, position in the read, position in the reference and Q score of the base
  • a tab-delimited file (*_indel.txt) containing each insertion and deletion error in the following format: read ID, closest reference sequence ID, indel position in the alignment, expected homopolymer length, observed homopolymer length, indel base, indel position in the read, indel position in the reference and Q score of the base if it’s an insertion
  • a tab-delimited file (*_qual.txt) containing the read Q score for each sequence if quality scores are provided
  • a summary file (*_error_summary.txt) including total mismatches and indels, the number of reads per target reference and Q score, and errors summarized by type, reference and Q score

Some sequences may be identified as chimeras or contaminants. These sequences can be removed using the parseErrorAnalysis.py script from https://github.com/rdpstaff/fungene_pipeline.

The summary file can be imported into an Excel spreadsheet to calculate error rates and make plots (see sample_errorsummary.xlsx).
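As an alternative to a spreadsheet, the tab-delimited files can be read with a short script. The sketch below tallies substitution types from a *_mismatch.txt file using the column order listed above. The column names are hypothetical labels chosen for this example, the file is assumed to have no header row, and the two rows are invented data.

```python
import csv
import io
from collections import Counter

# Hypothetical names for the eight columns, in the order described above.
MISMATCH_COLUMNS = [
    "read_id", "ref_id", "alignment_pos", "expected_base",
    "observed_base", "read_pos", "ref_pos", "q_score",
]


def count_substitutions(handle):
    """Count (expected base, observed base) pairs in a mismatch file."""
    counts = Counter()
    # Assumption: the file has no header row.
    reader = csv.DictReader(handle, fieldnames=MISMATCH_COLUMNS, delimiter="\t")
    for row in reader:
        counts[(row["expected_base"], row["observed_base"])] += 1
    return counts


# Invented rows, for illustration only.
sample = io.StringIO(
    "read1\trefA\t12\tA\tG\t12\t12\t30\n"
    "read2\trefA\t40\tA\tG\t39\t40\t18\n"
    "read2\trefA\t55\tC\tT\t54\t55\t22\n"
)
counts = count_substitutions(sample)
for (expected, observed), n in counts.most_common():
    print(f"{expected}->{observed}: {n}")
```

If the real file does start with a header line, skip it before passing the handle to the function.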

    Plots made with result data:

    1. Error rate of read by read Q score:
      In this example, a read Q score cutoff of 25 effectively removes sequences with a high number of errors.
    2. Percentage of sequences with certain number of errors:
      In this example, the error rate per base is 0.13%, with 96.3% of sequences passing the read Q score cutoff of 25.
    3. Percentage of sequences matching each defined community organism by read Q score:
      No obvious taxonomic bias is observed using the read Q score cutoff of 25. So for this example, a read Q score cutoff of 25 can be used to process the dataset from the same run.
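The two numbers quoted in these plots, the per-base error rate and the percentage of sequences passing a cutoff, can be computed along the following lines. This is a sketch under assumptions: a read is taken to pass when its read Q score is greater than or equal to the cutoff, the error rate is taken as total errors divided by total bases among passing reads, and the four reads are invented. The tool's summary file may define these quantities differently.

```python
def summarize(reads, cutoff):
    """Return (errors per base, fraction of reads passing) for a Q score cutoff.

    reads: list of (read_q_score, error_count, read_length) tuples.
    Assumption: a read passes when read_q_score >= cutoff.
    """
    passing = [r for r in reads if r[0] >= cutoff]
    total_errors = sum(errors for _, errors, _ in passing)
    total_bases = sum(length for _, _, length in passing)
    error_rate = total_errors / total_bases if total_bases else 0.0
    fraction_passing = len(passing) / len(reads) if reads else 0.0
    return error_rate, fraction_passing


# Invented reads: (read Q score, number of errors, read length).
reads = [(30, 0, 100), (28, 1, 100), (20, 5, 100), (35, 0, 100)]
error_rate, fraction_passing = summarize(reads, cutoff=25)
print(f"error rate per base: {error_rate:.4%}")
print(f"reads passing: {fraction_passing:.1%}")
```

Running the same function over a range of cutoffs gives the kind of trade-off curve used to choose a cutoff such as 25.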

[Figure: Error rate of read by read Q score]

[Figure: Percentage of sequences with certain number of errors]

[Figure: Percentage of sequences matching each defined community organism by read Q score]
