Data storage on Rockfish

In 2023, the Andersen lab began moving its computing resources from QUEST at Northwestern to Rockfish at Johns Hopkins. The goal was a seamless transition, achieved by keeping the file system and structure on Rockfish as similar as possible to QUEST.

The primary file system that the Andersen lab uses on Rockfish is called VAST. The VAST partition purchased by the lab has a storage quota of 120 TB. In addition, the lab has access to two secondary partitions, data_eande106 and scr4-eande106, with quotas of 10 TB and 1 TB, respectively. Once granted access to the Andersen lab security group and allocation, you will find symbolic links for the vast_eande106, data_eande106, and scr4-eande106 partitions in your home directory (e.g. /home/<user>/vast_eande106). Symbolic links act like shortcuts that allow you to access all partitions without having to use the intricate file paths that point to the true locations of the partitions in the Rockfish file system.
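For example, you can confirm where each link points with `ls` or `readlink` (a sketch assuming the default link locations; `<user>` is a placeholder for your Rockfish username):

```shell
# Show the lab partition links in your home directory and their targets
# (<user> is a placeholder for your Rockfish username)
ls -l /home/<user>/vast_eande106 /home/<user>/data_eande106 /home/<user>/scr4-eande106

# Resolve a single link to the true path it points to
readlink -f /home/<user>/vast_eande106
```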

/vast/eande106 (vast_eande106)

vast_eande106 is where the vast (hehe, get it?) majority of files will exist. A breakdown of the directories in this partition is described below. Files stored in VAST cannot contain special characters (e.g. !@#$^&*()?:;), and a few directory names are reserved and not allowed (e.g. aux, prn, con).

data (vast_eande106)

This directory contains all the lab data -- split by species. The goal is to create the same file structure across all species. This not only makes it easy to find the file you are looking for, but also makes it easy to script file locations. A general example of a species' file structure is below:

├── genomes
│   ├── genetic_map
│   │   └──  {chr}.map
│   ├── {project_ID}
│   │   └──  {WS_build}
│   │       ├──  *.genome.fa*
│   │       ├──  csq
│   │       │    ├──  *.AA_Length.tsv
│   │       │    ├──  *.AA_Scores.tsv
│   │       │    └──  *.csq.gff3.gz
│   │       ├──  lcr
│   │       │    ├──  *.dust.bed.gz
│   │       │    └──  *.repeat_masker.bed.gz
│   │       └──  snpeff
│   │            ├──  {species}.{project}.{ws_build}
│   │            │   ├──  genes.gtf.gz
│   │            │   ├──  sequences.fa
│   │            │   └──  snpEffectPredictor.bin
│   │            └──  snpEff.config
│   └──  WI_PacBio_assemblies
├── RIL
│   ├── alignments
│   │   └──  {strain}.bam
│   ├── fastq 
│   │   └──  {strain}.*.fastq.gz
│   └── variation
│       └──  {project_analysis}
├── NIL
│   ├── alignments
│   │   └──  {strain}.bam
│   ├── fastq 
│   │   └──  {strain}.*.fastq.gz
│   └── variation
│       └──  {project_analysis}
├── WI
│   ├── alignments
│   │   ├──  _bam_not_for_cendr
│   │   └──  {strain}.bam
│   ├── concordance
│   │   └──  {cendr_release_date}
│   ├── divergent_regions
│   │   └──  {cendr_release_date}
│   ├── fastq 
│   │   ├──  dna
│   │   │   ├──  _fastq_not_for_cendr   
│   │   │   └──  {strain}.*.fastq.gz
│   │   └──  rna
│   │       └──  {project}
│   ├── haplotype
│   │   └──  {cendr_release_date}
│   ├── tree
│   │   └──  {cendr_release_date}
│   └── variation
│       └──  {cendr_release_date}
│           ├──  tracks   
│           └──  vcf
│               ├──  strain_vcf
│               │   └──  {strain}.{date}.vcf.gz
│               └──  WI.{date}.*.vcf.gz
└── {other - i.e. BSA, MUTANT, HiC, PacBio}
        └── *currently could look like anything in here - organized by project*
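Because the layout is consistent across species, file locations can be built programmatically. A minimal shell sketch, assuming the structure above (the species and strain values are hypothetical placeholders):

```shell
# Build the path to a wild isolate's alignment from the structure above.
# The species and strain values are hypothetical placeholders.
species="c_elegans"
strain="N2"
bam="/vast/eande106/data/${species}/WI/alignments/${strain}.bam"
echo "$bam"   # prints /vast/eande106/data/c_elegans/WI/alignments/N2.bam
```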

Note

This restructure is still not 100% complete and things might continue to change as we expand our genomes and data types. If you see something that is not organized well, let Erik or Mike know.

docs (vast_eande106)

This directory should probably be removed and the docs inside stored elsewhere. Currently, it documents the process for fastq SRA submission from 2019 and again from 2021.

projects (vast_eande106)

Projects is where most people will spend most of their time. Each user with access to Rockfish should create a user-specific folder (named with their name) in the projects directory. Inside your folder, you can do all of your work: analyzing data, writing new scripts, temporary storage, etc.

Note

Please be aware of how much space you are using in your personal folder. Of course, some projects require more space than others, and some require a lot of temporary space that can be deleted once the project is complete. However, if you find that you are using more than 500-1000 GB in your folder, please check whether any unused data or scripts can be removed. Either way, it is good practice to clean out your folder every few months to avoid storage pile-up. You can check how much space you are using with du -hs * or ask Katie.

singularity (vast_eande106)

This directory contains Docker images suitable for specific Nextflow workflows, as well as images created for specific past analyses.

to_be_deleted (vast_eande106)

This directory is a temporary holding place for large files/data that can be deleted. Think of it as a soft trash bin. If data has lived in to_be_deleted for a few months without being missed, it can likely be deleted for good. This was a temporary solution for the large restructure in 2021 and can likely be removed in the future.

workflows (vast_eande106)

This directory is being phased out. Workflows are now hosted on GitHub, and any lab member who wishes to run a shared workflow should either run it remotely (if it is a Nextflow script) or clone the repo into their personal folder.

/data/eande106 (data_eande106)

data_eande106 is where software and pipelines live. Software and pipeline files often contain special characters (e.g. packages with :: delimiters) or directories with protected names, which are not allowed on VAST. Any future software must be installed here. A breakdown of the directories in this partition is described below.

analysis (data_eande106)

This directory contains non-data output and analyses from general lab pipelines (e.g. wi-gatk or alignment-nf). It is organized by pipeline and then by analysis type and date. If you are running these pipelines (including nil-ril-nf), it is important that you move your analysis folder here once complete (and out of your personal folder) so that everyone has access to the results.

software (data_eande106)

General

This is a great location to install any software tools or packages that are not already available on Rockfish and that you cannot create a conda environment for. Installing here is especially important for software packages that other people in the lab might also use, as it is a shared space.

Conda environments

Inside the conda_envs directory you can find all the shared lab conda environments necessary for running certain Nextflow pipelines. If you create a shared conda environment, it is important that you update the README.md with the code you used to create the environment, for reproducibility. You can create a conda environment in this directory with:

conda create -p /home/<user>/data_eande106/software/conda_envs/<name_of_env>

You can also update your ~/.condarc file to point to this directory so that you can easily load conda environments just by the name (i.e. source activate nf23_env instead of source activate /home/<user>/data_eande106/software/conda_envs/nf23_env):

channels:
    - conda-forge
    - bioconda
    - defaults

auto_activate_base: false

envs_dirs:
    - ~/.conda/envs/
    - /home/<user>/data_eande106/software/conda_envs/

Important

It is very important that you do not update any packages or software while running a shared conda environment. This is especially an issue with Nextflow and the nf23_env environment. Updating Nextflow while running this environment will update the version in the environment, and it needs to be v23.10 for many of the pipelines to run successfully.

If you want to see everything that is loaded in a particular conda environment, or re-create an environment, you can use:

# make an exact copy 
conda create --clone py35 --name py35-2

# list all packages and versions
conda list --explicit > bio-env.txt

# list a history of revisions
conda list --revisions

# go back to a previous revision
conda install --revision 2

# create environment from file
conda env create --file bio-env.txt

Also, check out this cheat sheet or our conda page for more.

Note

Most of the nextflow pipelines are written using module load anaconda3/2022.05; source activate nf23_env, so if you are having trouble running with conda, try loading the right environment first.

R libraries

Unfortunately, R doesn't seem to work very well with conda environments, and making sure everyone has the same version of several different R packages (specifically Tidyverse) can be a nightmare for software development. One way we have gotten around this is by installing the proper versions of R packages to a shared location (/data/eande106/software/R_lib_3.6.0). Check out the Using R on Rockfish section of the R page to learn more about installing a package to this folder and using it in a script.
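One common way to point R at this shared library in a script, assuming the path above, is to set the R_LIBS environment variable before calling Rscript (R prepends it to the library search path); see the R page for the lab's preferred approach.

```shell
# Point R at the shared library (path from above) before launching R
export R_LIBS=/data/eande106/software/R_lib_3.6.0

# Any Rscript call in this shell now searches the shared library first;
# printing .libPaths() confirms the search order
Rscript -e '.libPaths()'
```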

Important

It is very important that you do not update any packages in this location unless absolutely necessary and proceed with extreme caution (especially if it is tidyverse!!!). Updating packages could break pipelines that rely on them.

/scratch4/eande106 (scr4-eande106)

scr4-eande106 is a small partition primarily used for scratch space. It currently contains a series of folders named with two-digit hexadecimal numbers that hold Nextflow working files. Otherwise, it should only be used to hold small test files and scripts, with the aim of eventually deleting those test files and reusing the space.

If you need to use the scratch partition for something other than running Nextflow, create a directory under your own name (~/scr4-eande106/<your name>).

Mapping VAST to your local computer

In order to make accessing and transferring data easier, you can create a link to vast on your computer as if it were a network hard-drive. To do this, use the following procedure.

Windows machines

Open a file explorer window and right-click on the "Network" icon on the left-hand side of the window. Select "Map Network Drive". This will bring up a dialog box for you to enter the address of VAST. Put in the address \\vast.rockfish.jhu.edu\bio-andersen$. You will also get to select a drive letter to map it to. I suggest "V". Also make sure that the check box for reconnecting to the server upon restart is checked. When you click OK, you will be prompted for credentials. Instead of your usual JHED ID alone, use WIN\<JHED>, where <JHED> is your JHED ID. Use your normal JHED password. Check the box to remember credentials and press OK. A brief window will appear while connecting, and then the dialog box will close. You can now find VAST under "Network" when you expand it in the explorer window. I suggest right-clicking on the VAST folder and pinning it to your Quick access menu.

Mac and Linux machines

Open Finder and under the "Go" drop-down menu at the top, select connect to server. This will open a dialog box with a line for you to enter an address for VAST. Use the address smb://vast.rockfish.jhu.edu/bio-andersen$ and click "Connect". You will be prompted for credentials. Use your JHED ID and password. This should create an icon for VAST under "Locations" in your Finder windows. You can also drag and drop the VAST folder to your favorites section of the Finder window for quick access.

For Linux machines, use the Nautilus file browser (or whatever file manager your distro uses) and follow the directions as described for Macs.