nil-ril-nf pipelines all make use of sample sheets. Sample sheets specify which fastqs belong to a given strain or isotype.
Creating sample sheets¶
wi-nf and concordance-nf pipelines¶
concordance-nf pipelines, sample-sheets are generated using the file located (in each of these repos) in the
scripts/construct_sample_sheet.sh. Importantly, these scripts are almost identical except that the
concordance-nf pipeline constructs a sample sheet for strains whereas the
wi-nf sample sheet is for isotypes.
When adding new sequence data you need to update these scripts. See Adding Sequencing Data for more information.
The nomenclature regarding sample sheets and scripts was changed in March of 2018 to make it clearer. You may encounter older files with the following names that correspond to the newer names
For the nil-ril-nf pipelines you must manually create the sample sheets according to the format below.
The sample sheet defines which FASTQs belong to which strain/isotype and specifies additional information regarding a sample. Additional information specfieid are the FASTQ ID (a unique identifier for a FASTQ-pair), Sequencing POOL (which defines the group of samples that were sequenced together), the locations of the FASTQs, and the sequencing folder.
Internally, the 'sequencing pool' information as treated as the DNA-library identifier by BWA (
Our lab processes sequence data such that the pool name uniquely identifies DNA-libraries for each sample.
Sample sheet structure
All columns are required.
- Sample Identifier - How FASTQs should be grouped in the pipeline. Usually this is by strain or isotype.
- FASTQ ID - A unique ID for the FASTQ pair. It must be unique for all sequencing runs defined in the sample sheet.
- Sequencing pool - The sequencing pool is often defined arbitrarily. It refers to the set of strains that were sequenced together. It acts as an identifer of the DNA library within the pipelines.
- FASTQ1 - A relative or absolute path to the first FASTQ.
- FASTQ2 - A relative or absolute path to the second FASTQ.
- Sequencing Folder - This column is provided for informational purposes. It generally refers to the name of the folder containing the FASTQs.
AB1 BGI1-RET2-AB1 RET2 /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-2P.fq.gz original_wi_set AB1 BGI2-RET2-AB1 RET2 /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-2P.fq.gz original_wi_set AB1 BGI3-RET2b-AB1 RET2b /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-2P.fq.gz original_wi_set
Notice that the file does not include a header. The table with corresponding header included below look like this:
|Sample Identifeir||FASTQ ID||Sequencing Pool||fastq-1-path||fastq-2-path||sequencing_folder|
Absolute vs. relative paths¶
When constructing the sample sheet for the
concordance-nf pipelines you are required to use the absolute paths to each FASTQ. The
nil-ril-nf pipeline can use relative paths to FASTQs by specifying the
--fq_file_prefix option to the parent directory containing FASTQs.