trimmomatic-nf¶

The trimmomatic workflow trims reads to remove poor-quality sequences and technical sequences such as adapters. It should be used with high-coverage genomic DNA. Do not use the trimmomatic workflow on low-coverage NIL or RIL data.

Usage¶

Prerequisites¶

You have downloaded FASTQ data to a subdirectory within a raw directory. For wild isolates this will be /projects/b1059/data/fastq/WI/dna/raw/<folder_name>
FASTQs must end in a .fq.gz extension for the pipeline to work.
You have modified FASTQ names if necessary to add strain names or other identifying information.
You have installed the software requirements listed below, preferably using the andersen-lab-env. You can learn how to install the environment here.

Software requirements

trimmomatic
fastqc
multiqc

Note

All FASTQs should end with _1.fq.gz or _2.fq.gz. You can rename FASTQs using the rename command:

rename --dry-run --subst .fastq.gz .fq.gz --subst _R1_001 _1 --subst _R2_001 _2 *.fastq.gz

The --dry-run flag will output how files will be renamed. Review the output and remove the flag when you are ready.
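
After renaming, it can help to confirm that every file follows the convention before starting the pipeline. The following is a minimal sketch, not part of the pipeline: `check_fastq_names` is a hypothetical helper that lists any .fastq.gz or .fq.gz file in a directory whose name does not end in _1.fq.gz or _2.fq.gz.

```shell
# Hypothetical helper (not part of trimmomatic-nf): report FASTQs whose
# names do not end in _1.fq.gz or _2.fq.gz. Returns non-zero if any are found.
check_fastq_names() (
    dir="${1:-.}"                          # directory to check, default: current
    status=0
    for f in "$dir"/*.fastq.gz "$dir"/*.fq.gz; do
        [ -e "$f" ] || continue            # skip glob patterns that matched nothing
        case "$f" in
            *_1.fq.gz|*_2.fq.gz) ;;        # name follows the convention
            *) echo "needs renaming: $f"; status=1 ;;
        esac
    done
    exit "$status"                         # body runs in a subshell, so exit is safe
)
```

Running `check_fastq_names .` from the raw sequence directory prints nothing and returns 0 when all names are ready.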

Running the pipeline¶

First, cd to the directory containing the raw FASTQs. This directory should sit within a raw parent directory. When you run the pipeline, it creates a sequence directory of the same name in an existing or newly created processed directory and writes the trimmed FASTQs there.

Unlike all other pipelines, the trimmomatic-nf pipeline is run directly from the git repo:

nextflow run andersenlab/trimmomatic-nf -latest --email <your email address>

Note

The pipeline is designed to be non-destructive. Trimming reads FASTQs from the raw parent directory and writes the trimmed FASTQs to the processed parent directory, as shown below.

/projects/b1059/data/fastq/WI/dna/raw/<folder_name>/S_1.fq.gz

-- trimming -->

/projects/b1059/data/fastq/WI/dna/processed/<folder_name>/S_1P.fq.gz

Parameters¶

--email - Specify an email address to be notified when the pipeline succeeds or fails.

Overview¶

Output¶

The trimmed FASTQs are written to the processed directory, which sits alongside the raw parent directory. For example:

FASTQs are originally deposited in this directory:

/projects/b1059/data/fastq/WI/dna/raw/new_wi_seq

You run the pipeline while sitting in the same directory:

/projects/b1059/data/fastq/WI/dna/raw/new_wi_seq

And results are output in the following directory:

/projects/b1059/data/fastq/WI/dna/processed/new_wi_seq
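
The mapping from raw to processed directory in the example above can be sketched as a small shell function. This only illustrates the path convention described here; `processed_dir_for` is a hypothetical name, and the pipeline itself may derive the path differently.

```shell
# Hypothetical helper: given a raw sequence directory such as
# .../dna/raw/new_wi_seq, print the matching .../dna/processed/new_wi_seq path.
processed_dir_for() (
    raw_dir="${1:-$PWD}"                                # default: current directory
    seq_name=$(basename "$raw_dir")                     # e.g. new_wi_seq
    data_root=$(dirname "$(dirname "$raw_dir")")        # strip <seq_name> and raw
    printf '%s/processed/%s\n' "$data_root" "$seq_name"
)
```

For example, `processed_dir_for /projects/b1059/data/fastq/WI/dna/raw/new_wi_seq` prints the processed path shown above.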

Report output

The trimmomatic-nf pipeline outputs four report files, all located in the report folder of the processed directory. Continuing with the example above, report files will be located here:

/projects/b1059/data/fastq/WI/dna/processed/new_wi_seq/report

The report output files are:

md5sum.txt - md5 hashes of all the untrimmed FASTQs. These can be used to verify the integrity of the files.
trimming_log.txt - A summary of the pipeline run and software environment.
multiqc_report_pre.html - Aggregated FASTQC report before trimming.
multiqc_report_post.html - Aggregated FASTQC report after trimming.
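
As a rough illustration of how the hashes in md5sum.txt might be used, the sketch below checks the untrimmed FASTQs against them. It assumes md5sum.txt uses the standard `md5sum` output format (hash, two spaces, file name) with file names relative to the raw sequence directory; that format is an assumption, so inspect the file before relying on this. `verify_md5` is a hypothetical helper name.

```shell
# Hypothetical helper: verify raw FASTQs against a checksum file.
# Assumes the checksum file is in standard `md5sum` format with file names
# relative to the raw directory; pass the checksum file as an absolute path.
verify_md5() (
    raw_dir="$1"                     # directory holding the untrimmed FASTQs
    checksum_file="$2"               # absolute path to md5sum.txt
    cd "$raw_dir" || exit 1
    # --check compares each listed file to its hash; --quiet hides "OK" lines
    md5sum --check --quiet "$checksum_file"
)
```

A zero exit status means every listed file matched its recorded hash; any mismatch is printed and the function returns non-zero.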

Additionally, the raw/seq and processed/seq directories will each have a fastqc/ folder containing the original, unaggregated FASTQC reports.

Cleanup¶

If you have triple-checked everything and are satisfied with the results, the original, raw sequence data can be deleted.

Poor quality data¶

If you observe poor-quality sequence data, you should remove it.

Backup¶

Once you have completed the trimmomatic-nf pipeline, you should back up the FASTQs. More information on this is available in the backup