Sample Sheets

The wi-nf, concordance-nf, and nil-ril-nf pipelines all make use of sample sheets. Sample sheets specify which fastqs belong to a given strain or isotype.

Creating sample sheets

wi-nf and concordance-nf pipelines

For the wi-nf and concordance-nf pipelines, sample-sheets are generated using the file located (in each of these repos) in the scripts/construct_sample_sheet.sh. Importantly, these scripts are almost identical except that the concordance-nf pipeline constructs a sample sheet for strains whereas the wi-nf sample sheet is for isotypes.

When adding new sequence data you need to update these scripts. See Adding Sequencing Data for more information.

Note

The nomenclature regarding sample sheets and scripts was changed in March of 2018 to make it clearer. You may encounter older files with the following names that correspond to the newer names

  • SM_sample_sheet --> sample_sheet.tsv
  • construct_SM_sheet.sh --> construct_sample_sheet.tsv

nil-ril-nf

For the nil-ril-nf pipelines you must manually create the sample sheets according to the format below.

Sample-Sheet Format

The sample sheet defines which FASTQs belong to which strain/isotype and specifies additional information regarding a sample. Additional information specfieid are the FASTQ ID (a unique identifier for a FASTQ-pair), Sequencing POOL (which defines the group of samples that were sequenced together), the locations of the FASTQs, and the sequencing folder.

Note

Internally, the 'sequencing pool' information as treated as the DNA-library identifier by BWA (LB). Our lab processes sequence data such that the pool name uniquely identifies DNA-libraries for each sample.

Sample sheet structure

All columns are required.

  • Sample Identifier - How FASTQs should be grouped in the pipeline. Usually this is by strain or isotype.
  • FASTQ ID - A unique ID for the FASTQ pair. It must be unique for all sequencing runs defined in the sample sheet.
  • Sequencing pool - The sequencing pool is often defined arbitrarily. It refers to the set of strains that were sequenced together. It acts as an identifer of the DNA library within the pipelines.
  • FASTQ1 - A relative or absolute path to the first FASTQ.
  • FASTQ2 - A relative or absolute path to the second FASTQ.
  • Sequencing Folder - This column is provided for informational purposes. It generally refers to the name of the folder containing the FASTQs.

Example

AB1 BGI1-RET2-AB1   RET2    /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI2-RET2-AB1   RET2    /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI3-RET2b-AB1  RET2b   /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-1P.fq.gz    /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-2P.fq.gz    original_wi_set

Notice that the file does not include a header. The table with corresponding header included below look like this:

Sample Identifeir FASTQ ID Sequencing Pool fastq-1-path fastq-2-path sequencing_folder
AB1 BGI1-RET2-AB1 RET2 /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI1-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI2-RET2-AB1 RET2 /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI2-RET2-AB1-trim-2P.fq.gz original_wi_set
AB1 BGI3-RET2b-AB1 RET2b /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-1P.fq.gz /projects/b1059/data/fastq/WI/dna/processed/original_wi_set/BGI3-RET2b-AB1-trim-2P.fq.gz original_wi_set

Absolute vs. relative paths

When constructing the sample sheet for the wi-nf and concordance-nf pipelines you are required to use the absolute paths to each FASTQ. The nil-ril-nf pipeline can use relative paths to FASTQs by specifying the --fq_file_prefix option to the parent directory containing FASTQs.