Running the Pipeline Initial Process

Quick Overview
TASKS

  1. Sort and bin the reads to samples tagged by Multiplexing Identifiers (MIDs) or barcodes
  2. Match the reads to the primer(s) to exclude non-target reads and trim the primer regions off the reads (optional)
  3. Filter out low-quality reads by base-call quality, length, and ambiguity code ('N')

INPUT

  1. RDPipeline initial process supports sequence files in FASTA, FASTQ, or SFF format
  2. Primer sequence(s)
  3. Tag file
  4. Filters: length, number of 'N's, Q-value cutoff

(Sample input files)

OUTPUT

Folders for each tag file containing:

  1. Trimmed sequences in FASTQ format. These sequences are ready for downstream analysis, such as classification or alignment
  2. Quality and length statistics in text and graphical formats
  3. A text file that lists dropped sequences and the reason for their failure
  4. Summary statistics text file

(Sample output files)


The Process:

The initial processing tool main page can be found at https://pyro.cme.msu.edu/init/form.spr

The RDP Pipeline initial processing steps are: matching the raw reads to experimental samples, trimming off the tag and primer portions, and removing low-quality sequences. If the gene chosen is 16S rRNA, the orientation of each sequence is checked and the sequence is reverse complemented if needed.
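As an illustration of the reverse-complement step mentioned above, here is a minimal Python sketch. The function name is hypothetical and this is not the pipeline's own code. It shows only the reverse-complement operation, not how the tool decides that a read is in the wrong orientation.

```python
# Complement table for the four DNA bases plus the ambiguity code N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    # Complement every base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]
```

When a read carries quality scores, the quality string would need to be reversed along with the sequence so that each score stays with its base.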

Initial processing requires a sequence file and at least one forward primer. The sequence file, obtained from the sequencing center, can be in FASTA, FASTQ, or SFF format (SFF contains both the sequence and quality information).


Download the sample input files . . .
for this tutorial -- the sample input zip file contains the following three files:
a FASTQ sequence file (1.TCA.454Reads.fastq)
a tag file (region1_tag.txt)
a primer file (primer.txt)

Uncompress the input .zip file . . . It contains one FASTQ sequence file, one primer text file, and one tag text file.

The sequence file . . . is a file that contains FASTQ formatted nucleotide reads:

sequence reads
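To inspect a FASTQ file such as the sample one before uploading it, a small reader can help. The sketch below is hypothetical, not part of the pipeline. It assumes the common four-lines-per-record layout (header, sequence, '+' line, quality string) with no line wrapping.

```python
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.strip()
        if not header:  # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # The read ID is the first word after the leading '@'.
        yield header[1:].split()[0], seq, qual
```

For example, `with open("1.TCA.454Reads.fastq") as fh: n = sum(1 for _ in read_fastq(fh))` would count the reads in the sample file.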

The tag file . . . organizes samples based on user-defined nucleotide tag sequences. It is a tab-delimited text file in which each line holds a tag sequence followed by a sample name. In sample names, avoid periods and spaces. You may use the underscore character to separate words if desired.

region 1 tag file
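A tag file can be checked against the rules above before it is uploaded. The following is a minimal sketch under those rules only (tab-delimited, tag then sample name, no periods or spaces in sample names). The function name and the duplicate-tag check are assumptions, not documented behavior of the tool.

```python
from typing import Dict, Iterable


def parse_tag_file(lines: Iterable[str]) -> Dict[str, str]:
    """Map each tag sequence to its sample name, validating every line."""
    tags: Dict[str, str] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():  # skip empty lines
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 'tag<TAB>sample name'")
        tag, sample = fields[0].strip(), fields[1].strip()
        if not tag or not sample:
            raise ValueError(f"line {number}: empty tag or sample name")
        if "." in sample or " " in sample:
            raise ValueError(
                f"line {number}: sample name {sample!r} contains a period or space"
            )
        if tag in tags:  # assumption: each tag should identify one sample
            raise ValueError(f"line {number}: duplicate tag {tag!r}")
        tags[tag] = sample
    return tags
```

Calling `parse_tag_file(open("region1_tag.txt"))` on the sample tag file would return a dictionary of tag sequences to sample names, or raise an error naming the offending line.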

The primer file . . . contains the primer sequences to be pasted into the corresponding text boxes. A forward primer is at the proximal end of the sequencing process. The orientation of the forward primer is the same as the orientation of the amplicon sequence. The primer sequences you enter are for the target region only. They do not include barcode and adaptor regions of the primers.

primer file
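The input form lets you set a maximum edit distance for primer matching. To picture what that setting means, the sketch below computes the smallest edit distance (substitutions, insertions, deletions) between a forward primer and the start of a read, then trims the matched region. It is an illustration only: the function names and the default threshold are hypothetical, the tag is assumed to have been removed already, and ambiguity codes in primers are not handled. It does not claim to reproduce the tool's own matching algorithm.

```python
from typing import Optional, Tuple


def match_primer_prefix(primer: str, read: str) -> Tuple[int, int]:
    """Return (distance, end): the smallest edit distance between the primer
    and any prefix read[:end] of the read."""
    # Row for the empty primer: matching read[:j] costs j insertions.
    prev = list(range(len(read) + 1))
    for i in range(1, len(primer) + 1):
        cur = [i]  # primer[:i] against an empty read prefix: i deletions
        for j in range(1, len(read) + 1):
            cost = 0 if primer[i - 1] == read[j - 1] else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    # Best prefix end; ties go to the shortest prefix.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


def trim_forward_primer(read: str, primer: str, max_distance: int = 2) -> Optional[str]:
    """Return the read with the primer region removed, or None if no match."""
    # Only the first len(primer) + max_distance bases can take part in a match.
    window = read[: len(primer) + max_distance].upper()
    distance, end = match_primer_prefix(primer.upper(), window)
    if distance > max_distance:
        return None
    return read[end:]
```

A read that returns `None` here corresponds to a read dropped for failing the primer match; a larger `max_distance` tolerates more sequencing errors in the primer region at the cost of admitting more non-target reads.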


Uploading your data . . .
including the sequence and tag files, is done through the initial processing web interface. The primers are entered manually into separate text boxes. Finally, the user controls the quality of the output sequences by setting the maximum forward primer edit distance, maximum reverse primer edit distance, number of N's, minimum sequence length, and minimum average exponential quality score. For our sample data the initial processing page looks like this:

SCREENSHOT of input form
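The quality-related settings on the form can be pictured as a sequence of simple checks on each trimmed read. The sketch below is hypothetical and makes several assumptions: quality strings use the Phred+33 encoding, the "average exponential quality score" is taken to mean the Q score of the mean per-base error probability, and the default thresholds are placeholders, not the tool's defaults.

```python
import math
from typing import Optional


def exp_average_quality(qual: str, offset: int = 33) -> float:
    """Average quality computed on error probabilities rather than on Q scores."""
    if not qual:
        return 0.0
    # Convert each Phred score Q to its error probability 10^(-Q/10).
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_read(seq: str, qual: str, max_n: int = 0,
                min_length: int = 150, min_avg_q: float = 20.0) -> Optional[str]:
    """Return None if the read passes, otherwise the reason it is dropped."""
    if seq.upper().count("N") > max_n:
        return "too many Ns"
    if len(seq) < min_length:
        return "too short"
    if exp_average_quality(qual) < min_avg_q:
        return "low average quality"
    return None
```

Averaging error probabilities penalizes a few very poor bases more heavily than a plain arithmetic mean of Q scores would, which is why the two can disagree on borderline reads.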

Output Files . . . for each tag are placed in a directory that contains the following files (download sample output zip file):

  • Your specified initial parameters (input_params.txt)
  • Trimmed sequences in FASTQ format -- these sequences are ready for downstream analysis, such as classification or alignment (*_trimmed.fastq)
  • Quality and length statistics in text and graphical formats (*_qual_stats.txt, *_quality.png, *_length_stats.txt, *_length_histo.png)
  • A text file that lists dropped sequences and the reason for their failure (*_dropped_seqs.txt)
  • Summary statistics text file (*_summary.txt)

contents of output folder

The summary stats file . . . contains 5 summary lines:

Line 1: Tag sample name
Line 2: Total sequences matching the tag
Line 3: Number of sequences that passed filtering and trimming
Line 4: Average length of sequences after filtering and trimming
Line 5: Standard deviation of the length of the sequences after filtering and trimming

This is followed by a summary of how many sequences are removed by each filter (primer match, N count, minimum length, etc.).

summary stats file
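The count, average length, and standard deviation reported in the summary file can be recomputed from the trimmed sequences as a sanity check. This is a minimal sketch with a hypothetical function name. It uses the sample standard deviation; whether the tool reports the sample or the population value is not stated here, so small differences are possible.

```python
import statistics
from typing import Iterable, Tuple


def length_summary(seqs: Iterable[str]) -> Tuple[int, float, float]:
    """Return (count, mean length, standard deviation of length)."""
    lengths = [len(s) for s in seqs]
    if not lengths:
        return 0, 0.0, 0.0
    mean = statistics.mean(lengths)
    # stdev needs at least two values; report 0.0 for a single sequence.
    std = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    return len(lengths), float(mean), float(std)
```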

Go to paired end assembling initial process: Assemble Paired End Reads
