wi-nf

The wi-nf pipeline aligns, calls variants, and performs analysis from wild isolate sequence data. The output of the wi-nf pipeline can be uploaded to google storage as a new release for the CeNDR website.

Usage


     ▄         ▄  ▄▄▄▄▄▄▄▄▄▄▄                         ▄▄        ▄  ▄▄▄▄▄▄▄▄▄▄▄ 
    ▐░▌       ▐░▌▐░░░░░░░░░░░▌                       ▐░░▌      ▐░▌▐░░░░░░░░░░░▌
    ▐░▌       ▐░▌ ▀▀▀▀█░█▀▀▀▀                        ▐░▌░▌     ▐░▌▐░█▀▀▀▀▀▀▀▀▀ 
    ▐░▌       ▐░▌     ▐░▌                            ▐░▌▐░▌    ▐░▌▐░▌          
    ▐░▌   ▄   ▐░▌     ▐░▌           ▄▄▄▄▄▄▄▄▄▄▄      ▐░▌ ▐░▌   ▐░▌▐░█▄▄▄▄▄▄▄▄▄ 
    ▐░▌  ▐░▌  ▐░▌     ▐░▌          ▐░░░░░░░░░░░▌     ▐░▌  ▐░▌  ▐░▌▐░░░░░░░░░░░▌
    ▐░▌ ▐░▌░▌ ▐░▌     ▐░▌           ▀▀▀▀▀▀▀▀▀▀▀      ▐░▌   ▐░▌ ▐░▌▐░█▀▀▀▀▀▀▀▀▀ 
    ▐░▌▐░▌ ▐░▌▐░▌     ▐░▌                            ▐░▌    ▐░▌▐░▌▐░▌          
    ▐░▌░▌   ▐░▐░▌ ▄▄▄▄█░█▄▄▄▄                        ▐░▌     ▐░▐░▌▐░▌          
    ▐░░▌     ▐░░▌▐░░░░░░░░░░░▌                       ▐░▌      ▐░░▌▐░▌          
     ▀▀       ▀▀  ▀▀▀▀▀▀▀▀▀▀▀                         ▀        ▀▀  ▀           

    parameters              description                    Set/Default
    ==========              ===========                    =======

    --debug                 Set to 'true' to test          ${params.debug}
    --cores                 Regular job cores              ${params.cores}
    --out                   Directory to output results    ${params.out}
    --fqs                   fastq file (see help)          ${params.fqs}
    --fq_file_prefix        fastq prefix                   ${params.fq_file_prefix}
    --reference             Reference Genome               ${params.reference}
    --annotation_reference  SnpEff annotation              ${params.annotation_reference}
    --bamdir                Location for bams              ${params.bamdir}
    --tmpdir                A temporary directory          ${params.tmpdir}
    --email                 Email to be sent results       ${params.email}

    HELP: http://andersenlab.org/dry-guide/pipeline-wi/

Pipeline Overview

Pipeline-overview

Usage

Docker File

andersenlab/wi-nf is the docker image used within the wi-nf pipeline. If Quest ever supports singularity, it can be converted to a singularity image and used with Nextflow.

pyenv environments

The pipeline uses two python environments due to clashing dependencies. The two environments are:

  • vcf-kit - A python 2.7.14 environment with vcf-kit and cyvcf installed.
  • multiqc - A python 3.6.0 environment with multiqc installed.

If you are not using the docker container you must install these virtual environments by running the setup_pyenv.sh script:

bash setup_pyenv.sh

If you require one of these environments for a process within the pipeline, add the following as the first line of your script:

source init_pyenv.sh && pyenv activate <environment name>

Profiles and Running the Pipeline

The nextflow.config file included with this pipeline contains three profiles. These set up the environment for testing local development, testing on Quest, and running the pipeline on Quest.

  • local - Used for local development. Uses the docker container.
  • quest_debug - Runs a small subset of available test data. Should complete within a couple of hours. For testing/diagnosing issues on Quest.
  • quest - Runs the entire dataset.

Running the pipeline locally

When running locally, the pipeline will run using the andersenlab/wi-nf docker image. You must have docker installed.

nextflow run main.nf -profile local -resume

Debugging the pipeline on Quest

When running on Quest, you should first run the quest debug profile. The Quest debug profile will use a test dataset and sample sheet which runs much faster and will encounter errors much sooner should they need to be fixed. If the debug dataset runs to completion it is likely that the full dataset will as well.

nextflow run main.nf -profile quest_debug -resume

Running the pipeline on Quest

The pipeline can be run on Quest using the following command:

nextflow run main.nf -profile quest -resume

Configuration

Most configuration is handled using the -profile flag and nextflow.config; If you want to fine tune things you can use the options below.

--cores

The number of cores to use during alignments and variant calling.

--out

A directory in which to output results. By default it will be WI-YYYY-MM-DD where YYYY-MM-DD is todays date.

--fqs (FASTQs)

When running the wi-nf pipeline you must provide a sample sheet that tells it where fastqs are and which samples group into isotypes. By default, this is the sample sheet in the base of the wi-nf repo (SM_sample_sheet.tsv), but can be specified using --fqs if an alternative is required. The sample sheet provides information on the isotype, fastq_pairs, library, location of fastqs, and sequencing folder.

More information on the sample sheet and adding new sequence data in the Sample sheet section.

--fqs_file_prefix

A prefix path for FASTQs defined in a sample sheet. The sample sheet designed for usage on Quest (SM_sample_sheet.tsv) uses absolute paths so no FASTQ prefix is necessary. Additionally, there is no need to set this option as it is set for you when using the -profile flag. This option is only necessary (maybe) with a custom dataset where you are not using absolute paths to reference FASTQs.

--reference

A fasta reference indexed with BWA. On Quest, the reference is available here:

/projects/b1059/data/genomes/c_elegans/WS245/WS245.fa.gz

--tmpdir

A directory for storing temporary data.

Sample Sheet

The sample sheet defines which FASTQs belong to which isotypes, set FASTQ IDs, library, and more. The sample sheet is constructed using the scripts/construct_SM_sheet.sh script. When new sequence data is added, this needs to be modified to add information about the new FASTQs. Updates to this script should be committed to the repo.

Important

You have to assign isotypes using the concordance script before they can be processed using wi-nf.

When you run the scripts/construct_SM_sheet.sh, it will output the SM_Sample_sheet.tsv file in the base of the repo. You should carefully examine the diff of this file using Git to ensure it is modifying the sample sheet correctly. Errors can be disasterous. Note that the output uses absolute paths to FASTQ files.

Sample sheet structure

AB1 BGI1-RET2-AB1   RET2    /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI2-RET2-AB1   RET2    /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI3-RET2b-AB1  RET2b   /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-1P.fq.gz    /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-2P.fq.gz    original_wi_set

Notice that the file does not include a header. The table with corresponding columns looks like this.

isotype fastq_pair_id library fastq-1-path fastq-2-path sequencing_folder
AB1 BGI1-RET2-AB1 RET2 /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI2-RET2-AB1 RET2 /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI3-RET2b-AB1 RET2b /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-2P.fq.gz original_wi_set

The columns are detailed below:

  • isotype - The name of the isotype. It is used to group FASTQ-pairs into BAMs which are treated as individuals.
  • fastq_pair_id - This must be unique identifier for all individual FASTQ pairs. There is (unfortunately) no standard for this.
  • library - A name identifying the DNA library. If the FASTQ-pairs for a strain were sequenced using different library preps they should be assigned different library names. Likewise, if they were the same DNA library they should have the same library name. Keep in mind that within an isotype the library names for each strain must be independent.
  • fastq-1-path - The absolute path of the first fastq.
  • fastq-2-path - The absolute path of the second fastq.

Output

The output from the pipeline is structured to enable it to easily be integrated with CeNDR. The final output directory looks like this:

├── log.txt
├── alignment
│   ├── isotype_bam_idxstats.tsv
│   ├── isotype_bam_stats.tsv
│   ├── isotype_coverage.full.tsv
│   └── isotype_coverage.tsv
├── cegwas
│   ├── kinship.Rda
│   └── snps.Rda
├── isotype
│   ├── tsv
│   │   ├── <isotype>.(date).tsv.gz
│   │   └── <isotype>.(date).tsv.gz
│   └── vcf
│       ├── <isotype>.(date).vcf.gz
│       └── <isotype>.(date).vcf.gz.tbi
├── phenotype
│   ├── MT_content.tsv
│   ├── kmers.tsv
│   └── telseq.tsv
├── popgen
│   ├── WI.(date).tajima.bed.gz
│   ├── WI.(date).tajima.bed.gz.tbi
│   └── trees
│       ├── I.pdf
│       ├── I.png
│       ├── I.tree
│       ├── ...
│       ├── genome.pdf
│       ├── genome.png
│       └── genome.tree
├── report
│   ├── multiqc.html
│   └── multiqc_data
│       ├── multiqc_bcftools_stats.json
│       ├── multiqc_data.json
│       ├── multiqc_fastqc.json
│       ├── multiqc_general_stats.json
│       ├── multiqc_picard_AlignmentSummaryMetrics.json
│       ├── multiqc_picard_dups.json
│       ├── multiqc_picard_insertSize.json
│       ├── multiqc_samtools_idxstats.json
│       ├── multiqc_samtools_stats.json
│       ├── multiqc_snpeff.json
│       └── multiqc_sources.json
├── tracks
│   ├── (date).HIGH.bed.gz
│   ├── (date).HIGH.bed.gz.tbi
│   ├── (date).LOW.bed.gz
│   ├── (date).LOW.bed.gz.tbi
│   ├── (date).MODERATE.bed.gz
│   ├── (date).MODERATE.bed.gz.tbi
│   ├── (date).MODIFIER.bed.gz
│   ├── (date).MODIFIER.bed.gz.tbi
│   ├── phastcons.bed.gz
│   ├── phastcons.bed.gz.tbi
│   ├── phylop.bed.gz
│   └── phylop.bed.gz.tbi
└── variation
    ├── WI.(date).soft-filter.vcf.gz
    ├── WI.(date).soft-filter.vcf.gz.csi
    ├── WI.(date).soft-filter.vcf.gz.tbi
    ├── WI.(date).soft-filter.stats.txt
    ├── WI.(date).hard-filter.vcf.gz
    ├── WI.(date).hard-filter.vcf.gz.csi
    ├── WI.(date).hard-filter.vcf.gz.tbi
    ├── WI.(date).hard-filter.stats.txt
    ├── WI.(date).hard-filter.genotypes.tsv
    ├── WI.(date).hard-filter.genotypes.frequency.tsv
    ├── WI.(date).impute.vcf.gz
    ├── WI.(date).impute.vcf.gz.csi
    ├── WI.(date).impute.vcf.gz.tbi
    ├── WI.(date).impute.stats.txt
    ├── sitelist.tsv.gz
    └── sitelist.tsv.gz.tbi

log.txt

A summary of the nextflow run.

alignment/

Alignment statistics by isotype

  • isotype_bam_idxstats.tsv - A summary of mapped and unmapped reads by sample.
  • isotype_bam_stats.tsv - BAM summary at the sample level
  • isotype_coverage.full.tsv - Coverage at the sample level
  • isotype_coverage.tsv - Simple coverage at the sample level.

cegwas/

  • kinship.Rda - A kinship matrix constructed from WI.(date).impute.vcf.gz
  • snps.Rda - A mapping snp set generated from WI.(date).hard-filter.vcf.gz

isotype

This directory contains files that integrate with the genome browser on CeDNR.

isotype/tsv/

This directory contains tsv's that are used to show where variants are in CeNDR.

isotype/vcf/

This direcoty contains the VCFs of isotypes.

phenotype/

  • MT_content.tsv - Mitochondrial content (MtDNA cov / Nuclear cov).
  • kmers.tsv - 6-mers for each isotype.
  • telseq.tsv - Telomere length for each isotype ~ split out by read group.

popgen/

  • WI.(date).tajima.bed.gz - Tajima's D bedfile for use on the report viewer of CeNDR.
  • WI.(date).tajima.bed.gz.tbi - Tajima's D index.
  • trees/ - Phylogenetic trees for each chromosome and the entire genome.
    • I.(pdf|png|tree)
    • ...
    • genome.(pdf|png|tree)

report/

  • multiqc.html - A comprehensive report of the sequencing run.
  • multiqc_data/
    • multiqc_bcftools_stats.json
    • multiqc_data.json
    • multiqc_fastqc.json
    • multiqc_general_stats.json
    • multiqc_picard_AlignmentSummaryMetrics.json
    • multiqc_picard_dups.json
    • multiqc_picard_insertSize.json
    • multiqc_samtools_idxstats.json
    • multiqc_samtools_stats.json
    • multiqc_snpeff.json
    • multiqc_sources.json

track/

  • (date).LOW.bed.gz(+.tbi) - LOW effect mutations bed track and index.
  • (date).MODERATE.bed.gz(+.tbi) - MODERATE effect mutations bed track and index.
  • (date).HIGH.bed.gz(+.tbi) - HIGH effect mutations bed track and index.
  • (date).MODIFIER.bed.gz(+.tbi) - MODERATE effect mutations bed track and index.
  • phastcons.bed.gz(+.tbi) - Phastcons bed track and index.
  • phylop.bed.gz(+.tbi) - PhyloP bed track and index

Variation/

  • WI.(date).soft-filter.vcf.gz(+.csi|+.tbi) -
  • WI.(date).soft-filter.stats.txt -
  • WI.(date).hard-filter.vcf.gz(+.csi|+.tbi) -
  • WI.(date).hard-filter.stats.txt -
  • WI.(date).hard-filter.genotypes.tsv -
  • WI.(date).hard-filter.genotypes.frequency.tsv -
  • WI.(date).impute.vcf.gz(+.csi|+.tbi) -
  • WI.(date).impute.stats.txt -
  • sitelist.tsv.gz(+.tbi) - Union of all sites identified in original SNV variant calling round.