Command line¶
Bash is the default Unix shell on most Linux operating systems and was the default on macOS until zsh replaced it in macOS Catalina. Many bioinformatics programs are run from the command line, so becoming familiar with Bash is important.
Start with this introduction to Bash. Also check out this cheatsheet.
Basic Commands¶
You should familiarize yourself with the following commands.
- alias - create a shortcut for a command
- cat - concatenate files
- zcat - concatenate gzip-compressed files
- cd - change directories
- curl - download files
- echo - print strings
- export - add a variable to the environment so that it gets passed on to child processes
- grep - filter by pattern
- egrep - filter by extended regex (equivalent to grep -E)
- rm - delete files
- sudo - run as an administrator
- sort - sort the lines of a file
- source - run the commands in a file in the current shell
- ssh - connect to servers
- which - locate files on your PATH
- uniq - get unique lines. File must be sorted.
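As a minimal sketch of how a few of these commands combine, the snippet below uses made-up strain names and a hypothetical variable name, `STRAIN_SET` (neither comes from this guide). It shows why `uniq` needs sorted input and how `export` makes a variable visible to a child process.

```shell
# Made-up input: strain names with a non-adjacent duplicate
strains=$(printf 'N2\nCB4856\nN2\n')

# uniq only collapses adjacent duplicates, so unsorted input keeps both N2 lines
unsorted_count=$(printf '%s\n' "$strains" | uniq | wc -l | tr -d ' ')

# Sorting first places duplicates next to each other
sorted_count=$(printf '%s\n' "$strains" | sort | uniq | wc -l | tr -d ' ')

# grep -c counts the lines that match a pattern
n2_lines=$(printf '%s\n' "$strains" | grep -c '^N2$')

# A plain shell variable is not visible to child processes until it is exported
STRAIN_SET="wild_isolates"
before_export=$(bash -c 'echo "${STRAIN_SET:-unset}"')
export STRAIN_SET
after_export=$(bash -c 'echo "$STRAIN_SET"')

# prints: 3 2 2 unset wild_isolates
echo "$unsorted_count $sorted_count $n2_lines $before_export $after_export"
```

The `before_export` value assumes `STRAIN_SET` is not already set in your environment.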
More Advanced¶
You should learn these once you have the basics down.
- git - version control
- awk - file manipulation; Filtering; Rearranging columns
- sed - quick find/replace
Good Guides¶
Below are some good guides for various bash utilities.
grep¶
awk¶
- awk guide
- awk by example - hundreds of examples
Rearranging columns¶
cat example.tsv | awk -v OFS="\t" '{ print $2, $3, $1 }'
The line above will print the second column, the third column and finally the first column.
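To see the column reordering concretely, here is a small self-contained sketch. The two rows are made-up stand-ins for `example.tsv`. `-v OFS='\t'` sets awk's output field separator to a tab, and `-F '\t'` (an addition, not part of the command above) sets the input separator so that fields containing spaces stay intact.

```shell
# Made-up stand-in for example.tsv: strain, chromosome, position
rearranged=$(printf 'N2\tIII\t100\nCB4856\tIV\t250\n' \
  | awk -F '\t' -v OFS='\t' '{ print $2, $3, $1 }')

# Each row now reads: chromosome, position, strain
printf '%s\n' "$rearranged"
```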
Filtering based on criteria¶
Print only lines that start with a comment (#) character:
cat example.tsv | awk '$0 ~ "^#" { print }'
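The same pattern can be negated to drop comment lines instead, which is often useful before parsing a table. A minimal sketch with made-up rows (the header text is purely illustrative):

```shell
# Made-up input: two comment lines followed by one data row
input=$(printf '#source=example\n#CHROM\tPOS\nIII\t100\n')

# Keep only lines that start with '#'
comments=$(printf '%s\n' "$input" | awk '$0 ~ "^#" { print }')

# Negate the match with !~ to keep everything else
data=$(printf '%s\n' "$input" | awk '$0 !~ "^#" { print }')

printf 'comments:\n%s\ndata:\n%s\n' "$comments" "$data"
```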
bcftools¶
bcftools view
- bcftools view - view VCF
- bcftools view -h - view only header of VCF
- bcftools view -H - view VCF without header
- bcftools view -s CB4856,XZ1516,ECA701 - subset VCF for only these three samples
- bcftools view -S sample_file.txt - subset VCF for only samples listed in sample_file.txt
- bcftools view -r III:1-800000 - subset VCF for a region of interest; can also just use -r III to get the entire chromosome
- bcftools view -R regions.txt - subset VCF for the region(s) of interest listed in the regions.txt file
bcftools query
- bcftools query -l - print out list of samples in VCF
- Print out contents of VCF in a specified format (e.g. tsv):
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' <vcf> > out.tsv
Each row of the output holds the chromosome, position, reference allele, and alternate allele, followed by one tab-separated SAMPLE=GT field per sample.
- bcftools query -i 'GT=="alt"' - keep rows that match an expression (like a filter)
- bcftools query -e 'GT=="ref"' - remove rows that match an expression
Note
bcftools query -i and -e are not necessarily opposites. For example, if you have three genotype options (REF, ALT, or NA), including only ALT calls is different from excluding only REF calls, because excluding REF still keeps the NA calls.
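The difference can be illustrated without bcftools. The sketch below is only an analogy: it runs awk on a made-up two-column table of hypothetical sample names and genotype classes, not real bcftools expressions, to show that keeping ALT and dropping REF give different results once a third class (NA) is present.

```shell
# Made-up genotype classes for three hypothetical samples
calls=$(printf 'sampleA\tREF\nsampleB\tALT\nsampleC\tNA\n')

# Analogous to including only ALT calls
include_alt=$(printf '%s\n' "$calls" | awk -F '\t' '$2 == "ALT" { print $1 }')

# Analogous to excluding only REF calls: the NA sample is retained too
exclude_ref=$(printf '%s\n' "$calls" | awk -F '\t' '$2 != "REF" { print $1 }')

echo "include ALT: $(echo $include_alt)"
echo "exclude REF: $(echo $exclude_ref)"
```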
For more, check out the bcftools manual and this cheatsheet.
Screen¶
Screen can be used to run things in the background. It is extremely useful if you need to run things on Quest without worrying that they will be terminated if you log out or get kicked off. This is essential when running Nextflow, because pipelines can run for many hours and it's likely that you will be kicked off or lose your connection in that time.