The trimmomatic workflow performs trimming to remove poor quality sequences and technical sequences such as adapters. It should be used with high-coverage genomic DNA. You should not use the trimmomatic workflow on low-coverage NIL or RIL data.
- You have downloaded FASTQ data to a subdirectory within a raw directory. For wild isolates this will be
- FASTQs must end in a .fq.gz extension for the pipeline to work.
- You have modified FASTQ names if necessary to add strain names or other identifying information.
- You have installed the software requirements, preferably using the andersen-lab-env. You can learn how to install the environment here.
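The naming prerequisite can be checked before launching anything. The following is a minimal sketch, not part of the pipeline: `check_fastq_names` is a hypothetical helper, and it assumes paired-end files whose names should end in _1.fq.gz or _2.fq.gz. It runs on empty placeholder files in a temporary directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of trimmomatic-nf): report files in a
# directory whose names do not end in _1.fq.gz or _2.fq.gz.
check_fastq_names() {
    bad=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue                  # skip subdirectories and empty globs
        case "${f##*/}" in
            *_1.fq.gz|*_2.fq.gz) ;;              # name is acceptable
            *) echo "unexpected name: ${f##*/}"; bad=1 ;;
        esac
    done
    return "$bad"                                # non-zero if any name was rejected
}

# Demonstration with empty placeholder files (names are made up)
demo_dir="$(mktemp -d)"
touch "$demo_dir/strainA_1.fq.gz" "$demo_dir/strainA_2.fq.gz" \
      "$demo_dir/strainB_R1_001.fastq.gz"
check_fastq_names "$demo_dir" || echo "fix the names above before running the pipeline"
```

In practice you would point the function at the directory holding your raw FASTQs instead of the temporary one.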
All FASTQs should end with _1.fq.gz or _2.fq.gz. You can rename FASTQs using the rename command:
rename --dry-run --subst .fastq.gz .fq.gz --subst _R1_001 _1 --subst _R2_001 _2 *.fastq.gz
The --dry-run flag will output how files will be renamed. Review the output and remove the flag when you are ready.
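If the rename utility is not available on your system, the same three substitutions can be approximated with standard tools. This is a sketch under assumptions: `rename_fastqs` is a hypothetical function, the sample file names are invented, and unlike the command above it only rewrites .fastq.gz at the end of a name. It defaults to a dry run and only renames when called with `apply`.

```shell
#!/bin/sh
# Hypothetical helper (not part of trimmomatic-nf).
# Usage: rename_fastqs <dir> [apply]
# Without "apply" it only prints the planned renames (dry run).
rename_fastqs() {
    dir="$1"
    mode="${2:-dry-run}"
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue                   # nothing matched the glob
        base="${f##*/}"
        # .fastq.gz -> .fq.gz, _R1_001 -> _1, _R2_001 -> _2 (first match only)
        new="$(printf '%s\n' "$base" | sed -e 's/\.fastq\.gz$/.fq.gz/' \
                                           -e 's/_R1_001/_1/' \
                                           -e 's/_R2_001/_2/')"
        if [ "$mode" = "apply" ]; then
            mv -- "$f" "$dir/$new"
        else
            echo "$base -> $new"
        fi
    done
}

# Demonstration with empty placeholder files (names are made up)
demo_dir="$(mktemp -d)"
touch "$demo_dir/strainA_R1_001.fastq.gz" "$demo_dir/strainA_R2_001.fastq.gz"
rename_fastqs "$demo_dir"            # dry run: prints the planned renames
rename_fastqs "$demo_dir" apply      # performs the renames
ls "$demo_dir"
```

As with the rename command, review the dry-run output before applying the changes to real data.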
Running the pipeline
First you will need to cd to the directory containing the raw FASTQs. This directory will have been downloaded into a raw parent directory.
When you run the pipeline, it will create a sequence directory of the same name in an existing or newly created processed directory and write the trimmed FASTQs there.
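To make the directory mapping concrete, the sketch below derives the expected output location from a raw sequence directory. It assumes the processed directory sits next to the raw directory under the same parent, which may not match your setup. `processed_dir_for` and the example path are hypothetical and are not provided by the pipeline.

```shell
#!/bin/sh
# Hypothetical helper (not part of trimmomatic-nf).
# Assumed layout: <parent>/raw/<seq>  ->  <parent>/processed/<seq>
processed_dir_for() {
    raw_seq_dir="${1%/}"                  # drop a trailing slash, if any
    seq_name="${raw_seq_dir##*/}"         # name of the sequence directory
    raw_parent="${raw_seq_dir%/*}"        # should be the raw directory
    case "${raw_parent##*/}" in
        raw) printf '%s/processed/%s\n' "${raw_parent%/*}" "$seq_name" ;;
        *)   echo "error: $1 is not inside a raw directory" >&2; return 1 ;;
    esac
}

# Example path is made up for illustration
processed_dir_for /tmp/example/raw/seq_run_01
```

Calling it with `"$PWD"` from inside a raw sequence directory would show where to look for the trimmed FASTQs after the run, under the stated layout assumption.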
Unlike all other pipelines, the trimmomatic-nf pipeline is run directly from the git repo:
nextflow run andersenlab/trimmomatic-nf -latest --email <your email address>
The pipeline is designed to be non-destructive. Trimming reads FASTQs from the raw parent directory and writes the trimmed FASTQs to the processed parent directory:
raw -- trimming --> processed
- --email - Specify an email address to be notified when the pipeline succeeds or fails.
The resulting trimmed FASTQs will be output in the
processed directory located up one level from the current directory. For example:
FASTQs are originally deposited in this directory
You run the pipeline while sitting in the same directory:
And results are output in the following directory:
The trimmomatic-nf pipeline outputs four report files, all of which are located in the processed directory. Continuing with the example above, report files will be located here:
The report output files are:
- md5sum.txt - md5 hashes of all the untrimmed FASTQs. These can be used to verify the integrity of the files.
- trimming_log.txt - A summary of the pipeline run and software environment.
- multiqc_report_pre.html - Aggregated FastQC report before trimming.
- multiqc_report_post.html - Aggregated FastQC report after trimming.
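The md5 hashes can be checked with `md5sum -c`. The sketch below assumes that md5sum.txt uses the standard md5sum output format (hash, then file name) and that the listed names resolve relative to the directory you run the check in. Neither assumption is confirmed by this page, so inspect the file first. `verify_md5` is a hypothetical helper, and the demonstration generates its own checksum file for a placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not part of trimmomatic-nf): verify the files in a
# directory against the md5sum.txt stored in that same directory.
verify_md5() {
    ( cd "$1" && md5sum -c md5sum.txt )
}

# Demonstration on a placeholder file (not real sequence data)
demo_dir="$(mktemp -d)"
printf 'placeholder read data\n' > "$demo_dir/sample_1.fq.gz"
( cd "$demo_dir" && md5sum sample_1.fq.gz > md5sum.txt )

verify_md5 "$demo_dir" && echo "all files match"
```

A non-zero exit status from `md5sum -c` means at least one file is missing or differs from its recorded hash.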
processed/seq will have fastqc/ folders containing the original, unaggregated FastQC reports.
If you have triple-checked everything and are satisfied with the results, the original, raw sequence data can be deleted.
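One possible check before deleting anything is to confirm that every raw FASTQ has a non-empty counterpart in the processed directory. The sketch below assumes trimmed files keep the same file names as the raw files, which this page does not state, so adjust the comparison if your output names differ. `missing_trimmed` and the demonstration paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of trimmomatic-nf).
# Usage: missing_trimmed <raw_seq_dir> <processed_seq_dir>
# Assumes trimmed files keep the raw file names.
missing_trimmed() {
    raw_dir="$1"
    processed_dir="$2"
    missing=0
    for f in "$raw_dir"/*.fq.gz; do
        [ -e "$f" ] || continue                    # nothing matched the glob
        name="${f##*/}"
        if [ ! -s "$processed_dir/$name" ]; then   # absent or zero bytes
            echo "missing or empty: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration with placeholder files in a temporary directory
demo="$(mktemp -d)"
mkdir -p "$demo/raw/seq" "$demo/processed/seq"
touch "$demo/raw/seq/strainA_1.fq.gz" "$demo/raw/seq/strainA_2.fq.gz"
printf 'trimmed placeholder\n' > "$demo/processed/seq/strainA_1.fq.gz"
missing_trimmed "$demo/raw/seq" "$demo/processed/seq" || echo "do not delete the raw data yet"
```

This only confirms that output files exist. It does not replace reviewing the reports described above.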
Poor quality data
If you observe poor quality sequence data you should remove it.
Once you have completed the trimmomatic-nf pipeline, you should back up the FASTQs. More information on this is available in the backup