Writing Nextflow pipelines
Check out the Nextflow documentation for help getting started!
Note
Learning to script with Nextflow definitely has a steep learning curve. Don't get discouraged! Start with something small and simple. Maybe convert a current script you have that uses a large for loop into a Nextflow pipeline to start getting the hang of things!
When should my pipeline be a Nextflow script?
Not every analysis needs to be a Nextflow pipeline. For smaller analyses (especially those with few inputs) it might be easier to write a bash/shell script. However, there are many advantages to Nextflow:
- When you are running many parallel tasks
- Nextflow takes care of all the job submissions and is really good at running the same basic script across 6 chromosomes, 500 strains, 1000 permutations, whatever you need!
- When your analysis consists of several different steps that are sequential and/or can be run simultaneously.
- Because Nextflow is great at parallelization, it knows which steps rely on other steps and can speed up your script by running independent steps in parallel.
- Also, you can take advantage of the -resume function for scripts that take a long time to run because Nextflow caches results (which means if there is an error you can fix it but you don't have to start over from the beginning!)
- You want to be able to easily run your script on different computing platforms (e.g. QUEST, local machine, GCP...)
- This can be done by creating different profiles for each platform. Check out the nextflow documentation on profiles.
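As a rough illustration of the profiles idea, a nextflow.config might define one profile per platform. This is a minimal sketch rather than a tested configuration: the profile name cluster, the SLURM executor, and the queue name are assumptions to replace with the settings of your own platform.

```nextflow
// nextflow.config (sketch; profile names and settings are hypothetical)
profiles {
    // default profile: run every task on the local machine
    standard {
        process.executor = 'local'
    }
    // submit each task as a cluster job (assumes a SLURM scheduler)
    cluster {
        process.executor = 'slurm'
        process.queue = 'my_queue'   // hypothetical queue name
    }
}
```

A profile is then selected at run time with the -profile option, for example nextflow run main.nf -profile cluster.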
The basics
Channels
All input and output files in Nextflow are piped through "channels". You can create a channel with an input file or parameter and then feed these channels to a "process" (or step). Channels can be merged, split, or rearranged. Check out the Nextflow documentation for channels to learn more. Also check out this cheatsheet for useful operators.
Miscellaneous channels tips:
- Channel.from("A.txt") will put A.txt as is into the channel.
- Channel.fromPath("A.txt") will add a full path (usually current directory) and put /path/A.txt into the channel.
- Channel.fromPath("folder/A.txt") will add a full path (usually current directory) and put /path/folder/A.txt into the channel.
- Channel.fromPath("/path/A.txt") will put /path/A.txt into the channel.
- In other words, Channel.fromPath will only add a full path if there isn't already one and ensure there is always a full path in the resulting channel.
- This goes hand in hand with input: path("A.txt") inside the process, where Nextflow actually creates a symlink named A.txt (note the path from first / to last / is stripped) linking to /path/A.txt in the working directory, so it can be accessed within the working directory by the script cat A.txt without specifying a path.
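To see the difference between the channel factories for yourself, a small throwaway script along these lines can help. It is a sketch under a few assumptions: A.txt is a placeholder file name, and Channel.of is used in place of Channel.from, which newer Nextflow releases deprecate.

```nextflow
// Sketch: print what each channel factory emits (A.txt is a placeholder name)
workflow {
    // emits the plain string "A.txt", unchanged
    Channel.of("A.txt").view { v -> "of: ${v}" }

    // emits a file path resolved against the directory the run was launched from
    Channel.fromPath("A.txt").view { v -> "fromPath: ${v}" }
}
```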
Processes
Each chunk of code in a nextflow script is broken up into distinct "processes" that you can think of as "steps" or even small/large "functions". A process can have one line of code or hundreds. It can be done in bash or by running an R or python script. An example process is shown below:
// first process in linkagemapping-nf
process split_pheno {
    input:
    file('infile')

    output:
    file('*.tsv')

    """
    Rscript --vanilla ${workflow.projectDir}/bin/split_pheno.R ${infile} ${params.thresh}
    """
}
Each process is defined by process <name> {} and has three basic parts: 1) input, 2) output, 3) script. You can dictate the pipeline by generating a workflow that acts like a protocol or recipe for Nextflow: which inputs go to which process and what order to do things in. Note, this is different from DSL1, which is not actively used anymore. To learn more about processes, check out the nextflow docs.
// example of a simple workflow (at the top of a nextflow script)
workflow {
    // create a channel from the input file and give it to the split_pheno process
    Channel.fromPath(params.in) | split_pheno
    // take the output from split_pheno and "flatten it" and send to the mapping process
    split_pheno.out.flatten() | mapping
}
Miscellaneous notes on input paths for process:
- With input: path("A.txt") one can refer to the file in the script as A.txt. Side note: A.txt doesn't have to be the same name as in channel creation; it can be anything, e.g. input: path("B.txt"), input: path("n"), etc.
- With input: path(A) one can refer to the file in the script as $A, and the value of $A will be the original file name (without path, see section above).
- input: path("A.txt") and input: path "A.txt" generally both work. Occasionally we had errors that required the following (tip from @danielecook):
- If not in a tuple, use input: path "A.txt"
- If in a tuple, use input: tuple path("A.txt"), path("B.txt")
- The same goes for output.
- From @pditommaso: path(A) is almost the same as file(A); however, the first interprets a value of type string as the input file path (i.e. the location in the file system where it's stored), while the latter interprets a value of type string and materialises it to a temporary file. The use of path is recommended since it's less ambiguous and fits better in most use cases.
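The two input styles above can be compared side by side. The following is a minimal sketch: the process names are hypothetical, and it assumes the pipeline is launched with an --in parameter pointing at an existing text file.

```nextflow
// Sketch: quoted vs. unquoted path inputs (process names are hypothetical)
process fixed_name {
    input:
    path "A.txt"        // the file is staged as A.txt, whatever it was called originally

    script:
    """
    cat A.txt
    """
}

process variable_name {
    input:
    path A              // the file keeps its original name, available as ${A}

    script:
    """
    cat ${A}
    """
}

workflow {
    // in DSL2 the same channel can feed both processes
    ch_in = Channel.fromPath(params.in)
    fixed_name(ch_in)
    variable_name(ch_in)
}
```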
The working directory
One of the main distinctions of Nextflow is that each execution of a process happens in its own temporary working directory. This is important for several reasons:
- You do not need to name temporary files dynamically (i.e. with strain or trait name) to avoid overwriting files, because you can repeat the same process with a different trait in a different directory.
- This means you can call a temporary file strain.bam instead of DL238.bam and CB4856.bam. This can make for simpler coding.
- However, if you provide strain name as a value in the input channel, it is easy to name files dynamically with ${strain}.bam which will output DL238.bam or CB4856.bam
- If there is an error, you can go into the working directory to see all input and output files (sometimes in the form of symlinks) for that specific process. You can also find the script (.command.sh) that was run and try to reproduce the error manually. If there was an error, the message is recorded in errlog.txt
- If there is an error, the Nextflow error output will point you to the working directory for that specific process and might look something like /projects/b1042/AndersenLab/work/katie/4c/4d9c3b333734a5b63d66f0bc0cfcdc
- You can also find the working directory from the hash shown next to a running/completed process. For example [4c/4d9c3b] corresponds to the working directory above.
- See the running nextflow page for creating a function to automatically cd into the working directory given that hash.
- You can also find the working directory in the .nextflow.log file or in the report.html if one is generated.
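The dynamic naming mentioned in the list above could look roughly like the sketch below. The process name, the raw BAM file names, and the cp command are all hypothetical placeholders; the point is only that the strain name travels through the channel together with its file.

```nextflow
// Sketch: name outputs per strain by passing the strain name along with the file.
// Process name, file names, and the copy command are hypothetical.
process rename_bam {
    input:
    tuple val(strain), path("input.bam")   // staged under a fixed name in the work directory

    output:
    tuple val(strain), path("${strain}.bam")

    script:
    """
    cp input.bam ${strain}.bam
    """
}

workflow {
    // each tuple pairs a strain name with its (placeholder) BAM file
    Channel.of(
        ["DL238", file("DL238_raw.bam")],
        ["CB4856", file("CB4856_raw.bam")]
    ) | rename_bam

    rename_bam.out.view()
}
```

Because each task runs in its own working directory, the fixed staging name input.bam never collides between strains.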
Miscellaneous tips on the working directory:
- Note that with publishDir "path", mode: 'move', the output file will be moved outside of the working directory and Nextflow will not be able to use it as input for another process, so only use it when there is not a following process that uses the output file.
- Be mindful that if the """ (script section) """ involves changing directory, such as cd or rmarkdown::render( knit_root_dir = "folder/" ), Nextflow will still only search the working directory for output files.
- Run nextflow clean -f in the execution folder to clean up the working directories.
- In Nextflow scripts (.nf files), one can use:
- ${workflow.projectDir} to refer to where the project is located (usually the folder of main.nf). For example: publishDir "${workflow.projectDir}/output", mode: 'copy' or Rscript ${workflow.projectDir}/bin/task.R.
- ${workflow.launchDir} to refer to where the script is called from.
- $baseDir usually refers to the same folder as ${workflow.projectDir} but it can also be used in the config file, where ${workflow.projectDir} and ${workflow.launchDir} are not accessible.
- They are much more reliable than $PWD or $pwd.
Note
The standard name of a nextflow script is main.nf but it doesn't have to be! If you just call nextflow run andersenlab/nemascan it will automatically choose the main.nf script. It is best practice to always write out the script name though.
Debugging with print
- To print a channel, use .view(). It's especially useful to resolve WARN: Input tuple does not match input set cardinality declared by process. (Don't forget to remove .view() after debugging)
channel_vcf
    .combine(channel_index)
    .combine(channel_chr)
    .view()
- To print from the script section inside the processes, add echo true (recent Nextflow releases rename this directive to debug true).
process test {
    echo true // this will print the stdout from the script section on Terminal

    input: path(vcf)

    """
    head $vcf
    """
}
Notes on transition to DSL2
If you are new to nextflow or don't know anything about DSL1 or DSL2, you can disregard this section and use DSL2 syntax!
- Moving to DSL2 is a one-way street. It's so intuitive with clean and readable code.
- In DSL1, each queue channel can only be used once.
- In DSL2, a channel can be fed into multiple processes
- In DSL2, each process can only be called once. The solution is either to .concat() the input channels so they run as parallel processes, or to put the process in a module and import it multiple times from the module. (One may be able to call a process in different workflows, haven't tested yet).
- DSL2 also enforces that all inputs need to be combined into 1 channel before they go into a process. See the cheatsheet for useful operators.
- Simple steps to convert from original syntax to DSL2
- Deprecated operators.
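The module approach for reusing a process might look like the sketch below. It assumes a hypothetical module file ./modules/mapping.nf that defines a process named mapping; the alias names and input globs are placeholders too.

```nextflow
// main.nf (sketch). Assumes a hypothetical module file ./modules/mapping.nf
// that defines a process named "mapping".
include { mapping as mapping_a } from './modules/mapping.nf'
include { mapping as mapping_b } from './modules/mapping.nf'

workflow {
    // channel contents are placeholders
    ch_a = Channel.fromPath("set_a/*.tsv")
    ch_b = Channel.fromPath("set_b/*.tsv")

    // the same process definition is invoked twice under different aliases
    mapping_a(ch_a)
    mapping_b(ch_b)
}
```

The .concat() alternative would instead merge the inputs first, as in mapping(ch_a.concat(ch_b)), so that a single call handles both sets.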
Run reports
nextflow main.nf -with-report -with-timeline -with-dag
- -with-report: The Nextflow html report contains resource usage for each process, and details (most useful being the status and working directory) for each process.
- -with-timeline: How much wait time and run time each process took for the run. Very useful reference for optimizing resource allocation and improving run time.
- -with-dag: Make a flowchart to show the relationship of channels and processes.
- Software dependencies to use these features. Note the differences on Mac and Linux.
- Or, set this up in the nextflow.config file for a pipeline to ensure they are generated each time the script is run:
import java.time.*
Date now = new Date()

params {
    tracedir = "pipeline_info"
    timestamp = now.format("yyyyMMdd-HH-mm-ss")
}

timeline {
    enabled = true
    file = "${params.tracedir}/${params.timestamp}_timeline.html"
}

report {
    enabled = true
    file = "${params.tracedir}/${params.timestamp}_report.html"
}
How to require users to specify a parameter value
- There are 2 types of parameters: (a) one with no actual value (b) one with actual values.
- (a) If a parameter is specified but no value is given, it is implicitly considered true. So one can use this to run debug mode with nextflow main.nf --debug
if (params.debug) {
    // ... (set parameters for debug mode)
} else {
    // ... (set parameters for normal use)
}
- or to print a help message with nextflow main.nf --help
if (params.help) {
    println """
    ... (help msg here)
    """
    exit 0
}
- (b) For parameters that need to contain a value, Nextflow recommends setting a default and letting users override it as needed. However, if you want to require it to be specified by the user:
params.reference = null // no quotes. this line is optional, since without initialising the parameter it will default to null.
if (params.reference == null) error "Please specify a reference genome with --reference"
- Below works as long as the user always appends a value: --reference=something. It will not print the error message with: nextflow main.nf --reference (without specifying a value) because this will set params.reference to true (see point (a)) and !params.reference will be false.
if (!params.reference) error "Please specify a reference genome with --reference"
Resources
- Nextflow documentation
- Nextflow cheatsheet
- Nextflow gitter
- Awesome Nextflow pipeline examples - Repository of great nextflow pipelines.
- Official Nextflow patterns
- Google group