General Guidelines for Organizing Data and Analysis

Erick Samera edited this page Mar 4, 2023 · 1 revision

This document provides guidelines for organizing data and analyses.

Current Locations of Data

Data generated by the instruments are eventually stored in three separate places.

  1. Instrument local storage: The SeqStudio, QuantStudio, GelDoc, MiSeq, and iScan all have computers integrated or attached to the instrument that ingest the data directly from the instrument.
  2. Z drive: The Z drive is network-attached storage that is only exposed to computers directly connected to the local area network of the lab. All data generated from instruments are eventually transferred from the local storage to the Z drive.
  3. Teams OneDrive: Each project has a Microsoft Teams team, which is allocated a certain amount of cloud storage. Data files are eventually transferred from the instruments or the Z drive to the Teams OneDrive. Files resulting from analysis are also primarily stored in the Teams OneDrive. The Teams OneDrive is the most accessible form of storage.

Types of Data that are Handled Often in the Lab

The data types handled most often in the lab are listed below by instrument.

Table 1. Data files generated by the SeqStudio

| Data type | Description | Software to work with it |
| --- | --- | --- |
| .ab1 | Raw chromatogram files produced by the sequencer, from which basecalling is done. They also contain metadata on how the run went and other relevant parameters. Software for processing these files is mostly proprietary. | ApE, Biopython, ThermoFisher Cloud apps |
| .csv | Contains metadata on each sample that was run on the sequencer. | Excel, text editor |
| .pdf | PDF version of the metadata in the .csv file. | PDF reader |
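
As a rough illustration of working with .ab1 files outside the proprietary tools, the sketch below reads a trace with Biopython's `SeqIO.read(path, "abi")` and summarizes the basecall qualities. It assumes Biopython is installed and that the trace carries Phred quality values; the helper names `load_ab1` and `summarize_trace` and the default cutoff of 20 are hypothetical choices for this example, not lab conventions.

```python
from statistics import mean


def load_ab1(path):
    """Read one .ab1 trace file with Biopython (must be installed separately)."""
    # Imported here so summarize_trace can be used without Biopython installed.
    from Bio import SeqIO

    return SeqIO.read(path, "abi")


def summarize_trace(record, min_quality=20):
    """Return basic statistics for a basecalled trace record.

    min_quality is an illustrative Phred cutoff, not a lab standard.
    """
    quals = record.letter_annotations["phred_quality"]
    return {
        "id": record.id,
        "length": len(record.seq),
        "mean_quality": mean(quals) if quals else 0.0,
        "bases_at_or_above_min": sum(q >= min_quality for q in quals),
    }
```

For a real file this would be called as `summarize_trace(load_ab1("sample.ab1"))`, where `sample.ab1` is a placeholder file name.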

Table 2. Data files generated by the QuantStudio

| Data type | Description | Software to work with it |
| --- | --- | --- |
| .eds | Stands for experiment document single. Used to set up a qPCR experiment and its run parameters; the run data is also written to this file afterwards. Proprietary. | QuantStudio Software, ThermoFisher Cloud apps |
| .xlsx | Data from the .eds file can be exported to .xlsx for easier processing. | Excel |

Table 3. Data files generated by the GelDoc

| Data type | Description | Software to work with it |
| --- | --- | --- |
| .scn | Contains gel imaging data as well as annotation data. | ImageLab |
| .jpg/.png/.tiff | Possible output formats for the gel image. | Image viewing and editing software (e.g., Paint, Paint.net, Photoshop, Photopea) |
| .xlsx | Can be exported to contain the quantification of bands on the gel image based on reference bands on the gel. Requires annotation of the gel image using ImageLab. | Excel |

Table 4. Data files generated by the MiSeq

| Data type | Description | Software to work with it |
| --- | --- | --- |
| .csv | Contains sampling information and run metadata. | Excel, text editor |
| .fastq(.gz) | FASTQ results from the sequencing run. | Mostly CLI (e.g., fastqc, multiqc) |
| .xml | Detailed run parameters for debugging purposes. | Text editor |
| .bcl | Raw basecall data. | CLI only (e.g., bcl2fastq) |
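
For a quick sanity check of a .fastq(.gz) file before running the CLI tools, a minimal sketch using only the Python standard library might look like the following. It assumes the common four-line FASTQ layout (sequences not wrapped across lines); the function names are hypothetical.

```python
import gzip


def fastq_stats(lines):
    """Count reads and bases from an iterable of FASTQ lines.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    n_reads = 0
    n_bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            n_reads += 1
            n_bases += len(line.strip())
    mean_length = n_bases / n_reads if n_reads else 0.0
    return {"reads": n_reads, "bases": n_bases, "mean_length": mean_length}


def fastq_file_stats(path):
    """Open a plain or gzip-compressed FASTQ file and summarize it."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return fastq_stats(handle)
```

This is only a rough count; tools such as fastqc remain the better choice for real quality control.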

File Hierarchy for Instrument Local Storage

# TODO

File Hierarchy for Z Drive

Fig. This is what Lyndsey has in her organization pptx, although we may want to change it to better reflect new conventions.

root

The root folder of the Z drive contains the following folders:

  1. Final-project-files: Reports and completed analyses for each project go here.
  2. Other-files: Catch-all folder for files that don't fit in any of the other categories.
  3. Raw-data: Contains all of the data generated by the instruments.
  4. Ext-data: Contains all the data that was provided to us by external sources, i.e., data that we did not generate ourselves with our instruments.

Final-project-files

The Final-project-files folder is organized by project.

Other-files

The Other-files folder currently contains the following:

  1. The Applied Genomics Centre General Manual pdf
  2. Files related to conferences that we've attended
  3. Various documentation of instruments
  4. Material safety data sheets (MSDS) for every kit and reagent that we have in the lab
  5. Photos that have been taken for promotional purposes

Raw-data

The Raw-data folder is organized by instrument:

  1. GCMS
  2. Gel-Images (Gel Doc)
  3. HPLC
  4. NGS (IonTorrent, MiSeq)
  5. QuantStudio
  6. QuantStudio-3D
  7. SeqStudio
  8. Tapestation
  9. iScan

(There is an additional directory, File Transfer Log, which is used to log the status of backups.)

In the Gel-Images and QuantStudio directories, folders are organized by project. In the SeqStudio and NGS directories, folders are organized by sequencing run, because a single run often multiplexes samples from several projects. Organization within project-specific folders on the Z drive is up to those working on the projects.
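
The rule above can be sketched as a small helper that returns the destination folder for a new batch of raw data. The root path `Z:/Raw-data` is an assumption about how the Z drive is mounted, and `raw_data_dir` is a hypothetical name; for instruments whose layout is not described here, the sketch simply returns the instrument folder.

```python
from pathlib import PurePosixPath

# Layout rules taken from the text: some instruments file by project, others by run.
BY_PROJECT = {"Gel-Images", "QuantStudio"}
BY_RUN = {"SeqStudio", "NGS"}
RAW_DATA_ROOT = PurePosixPath("Z:/Raw-data")  # assumed mount point, adjust as needed


def raw_data_dir(instrument, project=None, run=None, root=RAW_DATA_ROOT):
    """Return the folder under Raw-data where new instrument data should go."""
    base = PurePosixPath(root) / instrument
    if instrument in BY_PROJECT:
        if not project:
            raise ValueError(f"{instrument} data is filed by project; pass project=...")
        return base / project
    if instrument in BY_RUN:
        if not run:
            raise ValueError(f"{instrument} data is filed by sequencing run; pass run=...")
        return base / run
    # The guidelines do not specify a layout for other instruments.
    return base
```

For example, `raw_data_dir("SeqStudio", run="run-001")` gives `Z:/Raw-data/SeqStudio/run-001`, where `run-001` is a placeholder run name.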

Ext-data

This directory is organized by the type of data received and the project that the data belongs to.

File Hierarchy for Teams OneDrive

Each Microsoft Teams team is allocated a certain amount of cloud storage. Most of the working data and analyses will live in the Teams OneDrive. Every Teams OneDrive has the following directory structure:

  1. ARCHIVE: Old data and analyses that are currently not relevant to the project, but kept in case they need to be referred to. This is often data that has been superseded by new data or data that is no longer usable because of changes in procedures.
  2. Experimental Design and SOPs: Contains documents related to how certain experiments are laid out, as well as project-specific SOPs. Not all projects place their documentation and SOP in this folder; some may use OneNote instead.
  3. Reference and Literature: Contains data and documentation that was not generated within the lab (e.g. retrieved from online databases, from industry partners, from peers, etc.).
  4. Results: Contains data that was generated by us as part of experiments. Often accompanied by the related analysis files.
  5. Summary Documents: Contains documents that summarize the significant findings of the project as well as the current progress.
  6. Update Meeting Presentations: Contains files related to presentations that are given to the industry partners.

The organization of files within each of these folders is up to those working on the projects.
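
If the Teams OneDrive is synced to a local folder, the standard directory structure could be checked or created with a short script. This is a minimal sketch; the function names are hypothetical, and the path passed in would be wherever the team's files are synced.

```python
from pathlib import Path

# Top-level folders expected in every Teams OneDrive, as listed above.
TEAM_FOLDERS = [
    "ARCHIVE",
    "Experimental Design and SOPs",
    "Reference and Literature",
    "Results",
    "Summary Documents",
    "Update Meeting Presentations",
]


def missing_team_folders(root):
    """Return the standard folders that do not yet exist under root."""
    root = Path(root)
    return [name for name in TEAM_FOLDERS if not (root / name).is_dir()]


def create_team_folders(root):
    """Create any standard folders that are missing under root."""
    for name in missing_team_folders(root):
        (Path(root) / name).mkdir(parents=True, exist_ok=True)
```

Running `missing_team_folders` first gives a report without changing anything, which is safer on a shared drive.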

Fig. Diagram showing the file hierarchy for Project-based Teams

File Hierarchy for Analyses

Although projects vary in how they organize their analyses, the following directory structure is recommended. It is used mainly for bioinformatics analyses but can be adopted for other kinds of analysis. The goals of this directory structure are the following:

  1. There is a clear distinction between the raw data (e.g., data received off of the instruments) and any intermediate and final files that are generated during the analysis.
  2. Scripts, bioinformatics tools, and other procedures that were used in this specific analysis are documented and available if the need to reproduce the analysis arises.
  3. The origin of files after data analysis can be easily determined.
  4. Relative file paths are maintained, allowing for ease of reproducibility.

20XX-XX-XX_project-name_description/
├─ bin/
│  ├─ handler_script.sh
│  └─ program_script.py
├─ data/
│  └─ raw_data.ext
├─ results/
│  ├─ 1_intermediate_step/
│  │  └─ intermediate_data.ext
│  ├─ 2_intermediate_step/
│  │  └─ intermediate_data.ext
│  └─ final_result/
│     └─ final_data.ext
└─ README.md
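
The layout above can be scaffolded with a few lines of Python. This is a sketch under the assumption that `20XX-XX-XX` stands for an ISO date (YYYY-MM-DD); the function names are hypothetical.

```python
from datetime import date
from pathlib import Path


def analysis_dir_name(project, description, on=None):
    """Build a folder name of the form YYYY-MM-DD_project-name_description."""
    on = on or date.today()
    return f"{on.isoformat()}_{project}_{description}"


def scaffold_analysis(parent, project, description, on=None):
    """Create the bin/, data/, results/ folders and an empty README.md."""
    root = Path(parent) / analysis_dir_name(project, description, on)
    for sub in ("bin", "data", "results"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch()
    return root
```

Existing folders are left untouched, so the function can be re-run safely on a partially created analysis directory.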

The bin folder contains the code and scripts that are used to carry out the analysis. Often, bin also includes a special handler script that contains all of the terminal commands needed to execute the entire analysis pipeline, so replicating the analysis is as easy as executing the handler script.

The data folder contains only the raw (input) data. This folder can be organized however you wish. For input data sets that contain large files (e.g., an entire reference genome), you may not want to duplicate those files across every similar analysis. In these cases, the README.md file should indicate where the user can find the large data file. Such files should not be stored only on the local machine on which the analysis is carried out.

The results folder contains all intermediate files and final result files generated in the analysis. To organize intermediate files, use subdirectories prefixed with a number indicating the order in which they were generated.
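
One way to keep the numeric prefixes consistent is to derive the next number from the folders that already exist. The sketch below assumes the `N_label` naming shown in the directory tree; `next_step_name` and `make_step_dir` are hypothetical helper names.

```python
import re
from pathlib import Path

STEP_PATTERN = re.compile(r"^(\d+)_")


def next_step_name(existing_names, label):
    """Return the next numbered folder name, e.g. '3_label' after '2_...'."""
    numbers = [int(m.group(1)) for name in existing_names if (m := STEP_PATTERN.match(name))]
    return f"{max(numbers, default=0) + 1}_{label}"


def make_step_dir(results_dir, label):
    """Create the next numbered subdirectory inside the results folder."""
    results_dir = Path(results_dir)
    existing = [p.name for p in results_dir.iterdir() if p.is_dir()]
    step = results_dir / next_step_name(existing, label)
    step.mkdir()
    return step
```

Folders without a numeric prefix, such as `final_result`, are ignored when computing the next number.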

The documentation for the entire analysis lives in the README.md file in the root of the project folder. This file should be comprehensive and, at the bare minimum, list the following items:

  1. All scripts that have been used in the project
  2. The input data and their sources
  3. The final results files
  4. The date that this analysis was carried out
  5. A summary of all of the processing steps that were carried out
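
A README.md covering these five items could be generated from a template so that none of them is forgotten. The sketch below is one possible layout; the section headings and the `readme_skeleton` function are hypothetical, not an established lab template.

```python
from datetime import date


def _bullets(items):
    """Format items as a markdown bullet list, with a placeholder if empty."""
    return [f"- {item}" for item in items] or ["- (none listed)"]


def readme_skeleton(title, scripts, inputs, results, steps, on=None):
    """Return README.md text listing scripts, inputs, results, date, and steps.

    inputs maps each input file name to a description of its source.
    """
    on = on or date.today()
    lines = [f"# {title}", "", f"Date of analysis: {on.isoformat()}", ""]
    sections = [
        ("Scripts", _bullets(scripts)),
        ("Input data and sources", _bullets(f"{name}: {src}" for name, src in inputs.items())),
        ("Final results", _bullets(results)),
        ("Processing steps", [f"{i}. {step}" for i, step in enumerate(steps, 1)]),
    ]
    for heading, body in sections:
        lines += [f"## {heading}", "", *body, ""]
    return "\n".join(lines)
```

The returned text can be written to `README.md` in the root of the analysis folder and then filled in by hand with more detail.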