trimmomatic-nf¶
The trimmomatic workflow performs trimming to remove poor quality sequences and technical sequences such as adapters. It should be used with high-coverage genomic DNA. You should not use the trimmomatic workflow on low-coverage NIL or RIL data.
Usage¶
Prerequisites¶
- You have downloaded FASTQ Data to a subdirectory within a raw directory. For wild isolates this will be
/projects/b1059/data/fastq/WI/dna/raw/<folder_name>
- FASTQs must end in a
.fq.gz
extension for the pipeline to work.. - You have modified FASTQ names if necessary to add strain names or other identifying information.
- You have installed software-requirements, preferably using the
andersen-lab-env
. You can learn how to install the environment here.
Software requirements
- trimmomatic
- fastqc
- multiqc
Note
All FASTQs should end with a _1.fq.gz
or a _2.fq.gz
. You can rename FASTQs using the rename command:
rename --dry-run --subst .fastq.gz .fq.gz --subst _R1_001 _1 --subst _R2_001 _2 *.fastq.gz
The --dry-run
flag will output how files will be renamed. Review the output and remove
the flag when you are ready.
Running the pipeline¶
First you will need to cd
to the directory containing the raw FASTQs. This directory will be downloaded into a raw
parent directory.
When you run the pipeline it will create a sequence directory of the same name in an existing or newly created processed
directory and dump FASTQs there.
Unlike all other pipelines, the trimmomatic-nf
pipeline is run directly from the git repo
nextflow run andersenlab/trimmomatic-nf -latest --email <your email address>
Note
The pipeline is designed to not be destructive. Trimming creates from the raw
parent directory to the processed parent directory as
shown below.
/projects/b1059/data/fastq/WI/dna/raw/<seq_folder>/S_1.fq.gz
-- trimming -->
/projects/b1059/data/fastq/WI/dna/processed/<folder_name>/S_1P.fq.gz
Parameters¶
- --email - Specify an email address to be notified when the pipeline succeeds or fails.
Overview¶
Output¶
The resulting trimmed FASTQs will be output in the processed
directory located up one level from the current directory. For example:
FASTQs are originally deposited in this directory
/projects/b1059/data/fastq/WI/dna/raw/new_wi_seq
You run the pipeline while sitting in the same directory:
/projects/b1059/data/fastq/WI/dna/raw/new_wi_seq
And results are output in the following directory:
/projects/b1059/data/fastq/WI/dna/processed/new_wi_seq
Report output
The trimmomatic-nf
pipeline outputs four files, all of which are located in the processed directory. Continuing with the example above, report files will be located here:
/projects/b1059/data/fastq/WI/dna/processed/new_wi_seq/report
The report output files are:
- md5sum.txt - md5 hashes of all the untrimmed FASTQs. These can be used to verify the integrity of the files.
- trimming_log.txt - A summary of the pipeline-run and software environment.
- multiqc_report_pre.html - Aggregated FASTQC report before trimming.
- multiqc_report_post.html - Aggregated FASTQC report after trimming.
Additionally, the raw/seq
and processed/seq
will have fastqc/
folders containing the original, unaggregatred FASTQC reports.
Cleanup¶
If you have triple-checked everything and are satisfied with the results, the original, raw sequence data can be deleted.
Poor quality data¶
If you observe poor quality sequence data you should remove it.
Backup¶
Once you have completed the trimmomatic-nf pipeline you should backup the FASTQs. More information on this is available in the backup