Initial Process: Paired End Reads Assembler

Quick Overview
Tasks:

  1. Assemble paired-end reads
  2. Filter out low-quality reads by base-call quality, length, and ambiguity code ('N')

Input:

  1. FASTQ format sequence file
  2. Filters: min and max length, Q-value cutoff, assemble paired-end reads

(Sample input files)

Output Folders containing:

  1. Assembled sequences in FASTQ format and assembly statistics file
  2. Quality and length statistics for all sequences combined, in text and graphical formats, in the "NoTag" folder
  3. Summary statistics text file

(Sample output files)


The Process:

The initial processing tool main page can be found at https://pyro.cme.msu.edu/init/form.spr

For paired-end reads, Initial Process requires at least one pair of input FASTQ files containing the paired-end reads with sequence and quality information. It processes the reads in two stages. Assembly is performed in the first stage. In stage 2, the assembled reads are sorted and filtered in the same way as described in the Initial Process for single-stranded reads. If the gene chosen is 16S rRNA or Fungal 28S rRNA, the orientation of each sequence is checked and the sequence is reverse complemented if needed.
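The reverse-complement step mentioned above can be illustrated with a minimal Python sketch. It does not reproduce how the tool decides whether a read is in the wrong orientation; it only shows the operation applied once that decision is made. The function names are hypothetical, and the sketch assumes that the quality string is reversed along with the sequence so that each score stays attached to its base.

```python
# Illustrative sketch only: not the pipeline's own code.
# Maps each base to its complement; 'N' stays 'N'.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_record(seq, qual):
    """Reverse complement a FASTQ record (hypothetical helper).

    The quality string is reversed so each score still lines up
    with the base it was called for.
    """
    return reverse_complement(seq), qual[::-1]


if __name__ == "__main__":
    print(reverse_complement_record("AAGC", "IIH5"))
```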

Our new Assembler for paired-end reads (Cole et al., 2014. doi: 10.1093/nar/gkt1244) is an extended program based on the original PANDAseq (Masella et al., 2012. BMC Bioinformatics, 13:31). It uses a new statistical model to compute quality scores in the overlap region and handle more complex overlap layouts.


Download the sample input files . . .
for this tutorial. The sample input .tgz file contains four files.

Inside the input .tgz file . . . there are two pairs of FASTQ sequence files (Mock**.fastq).

region 1 tag file

The sequence file . . . is a larger file that contains FASTQ formatted nucleotide reads:

sequence reads


Uploading your data . . .
Make sure to check the box "Assemble paired end reads". Forward and reverse primers are not required. You can choose the gene name and adjust the maximum forward or reverse primer edit distance, the number of N's, the minimum and maximum sequence length, and the minimum average exponential quality score (average Read Q score). The average Read Q score is defined as -10 × log10(E), where E is the average predicted error rate for the read. Based on our results using MiSeq defined community datasets, a Read Q score cutoff of 25 to 27 is recommended to filter out low-quality assembled reads. See our manuscript (in preparation) for more details.
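The Read Q score definition above can be worked through in a short Python sketch. It assumes Phred+33 encoded quality strings (the usual encoding for current Illumina FASTQ files) and the standard Phred relation between a base quality Q and its error probability, 10^(-Q/10). The function names and the filter parameters are hypothetical and only mirror the options on the form; this is not the pipeline's own code.

```python
import math


def read_q_score(quality_string, offset=33):
    """Average Read Q score: -10 * log10(E), E = mean per-base error rate.

    Assumes Phred-encoded qualities with the given ASCII offset.
    """
    if not quality_string:
        raise ValueError("quality string is empty")
    # Convert each quality character to a predicted error probability.
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def passes_filters(seq, qual, min_length, max_length, max_n, min_q):
    """Hypothetical filter combining length, N count, and Read Q score."""
    if not (min_length <= len(seq) <= max_length):
        return False
    if seq.upper().count("N") > max_n:
        return False
    return read_q_score(qual) >= min_q


if __name__ == "__main__":
    # One Q20 base ('5') and one Q40 base ('I'): the plain mean of the
    # Q values is 30, but the Read Q score is dominated by the worse base.
    print(round(read_q_score("5I"), 2))
```

Because the errors are averaged before taking the logarithm, a few low-quality bases lower the Read Q score much more than they would lower a simple mean of the per-base Q values.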

For our sample data the initial processing form looks like this:

SCREENSHOT of input form

Output . . . for each tag is a directory which contains the following files (download sample output zip file):

  • Your specified initial parameters (input_params.txt)
  • The folder "assembled_paired_end_sequences" contains the assembled paired sequences in FASTQ format and the assembly analysis results (FASTQ.LOG) output directly by the Assembler (first stage), one for each pair of input FASTQ files.
  • Quality and length statistics for all sequences combined, in text and graphical formats (*_qual_stats.txt, *_quality.png, *_length_stats.txt, *_length_histo.png)
  • Since no tag file was uploaded for this example, the results (from stage 2) are combined in a folder called "NoTag". You can use the results from the directory "assembled_paired_end_sequences" instead if you need to work with sequences from only one pair of input files.

contents of output folder

The summary stat file . . . (Mock**.fastq.log) contains:

  • Number of sequence reads
  • Average assembled length of sequences
  • Average overlap length and bits
  • Average Read Q-score

The summary also shows how many sequence pairs failed assembly and for what reason. If no overlapping region could be found between the two reads of a pair, the pair is marked as "NOALIGN". In this example, 82 pairs were assembled but had a Read Q score below the specified cutoff, so they were labelled as "LOWQ" reads. Only the sequences labelled as "OK" are kept and passed to stage 2.

summary stats file

To process single-stranded reads, go to Pipeline Initial Process.