GitHub - jessicalumian/ep-streaming

This is a comparison of the streaming and nonstreaming versions of the Eel Pond mRNASeq Protocol. This page contains workflows for running the streaming and nonstreaming pipelines on both a subset and the full set of the Nematostella data (from Tulin et al.).

Analysis commands:

Streaming, full dataset
Nonstreaming, full dataset one, two, three
Streaming, subset
Nonstreaming, subset one, two, three

For data subset:

Start by launching an Amazon EC2 instance (m3.xlarge for the data subset). Instructions on setting up an EC2 instance are here.

For full data set:

Boot up an m4.4xlarge Amazon EC2 instance. Under "Add Storage", add 600 GB on the root volume. Alternatively, use an m4.large for a smaller machine.

Start tmux and mount data:

tmux
lsblk # list all block devices and identify the data volume
mkdir data/
sudo mount /dev/xvdf data/ # replace xvdf with the correct device name. Note: mounting hides anything already in the directory, so use an empty one
df
ls
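
Because mounting hides whatever the mount point already contains, it can help to confirm that the directory is empty first. A minimal sketch; the function name is_empty_dir is hypothetical and not part of the protocol:

```shell
# Hypothetical helper: succeed only if the directory exists and has no
# entries at all (hidden files included).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example use before mounting (xvdf is a placeholder device name):
# is_empty_dir data/ && sudo mount /dev/xvdf data/
```

If the check fails, the mount is skipped and nothing in the directory gets hidden.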

Continuing on as root, start at the top of the protocols and continue on with the (non)streaming specific commands for whichever pipeline is being run.

For both datasets:

Install git-core, which is needed to fetch literate-resting, the tool that extracts the commands from the khmer-protocols text.

sudo bash
apt-get update
chmod a+rwxt /mnt
apt-get -y install git-core python-dev python-pip python-virtualenv sysstat

Clone the repositories needed to extract commands from the protocols. Note that the ctb branch is nonstreaming.

For streaming (full or subset) or the nonstreaming full data set:

cd /home/ubuntu
rm -fr literate-resting khmer-protocols
git clone https://github.com/dib-lab/literate-resting.git
git clone https://github.com/dib-lab/khmer-protocols.git -b jem-streaming

cd khmer-protocols/mrnaseq

Note: For the nonstreaming data subset, use -b ctb instead of -b jem-streaming.

For all methods: Extract commands from protocols.

for i in [1-9]-*.rst
do
   /home/ubuntu/literate-resting/scan.py "$i" || break
done
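
The later steps run files such as 2-streaming-full.rst.sh, which suggests that scan.py writes a .rst.sh script next to each .rst file. Assuming that naming, a quick check that every protocol file produced a script might look like this (the function name missing_scripts is hypothetical):

```shell
# Hypothetical helper: print each numbered .rst file in a directory that has
# no matching .rst.sh script beside it. Assumes the extracted script is named
# <file>.rst.sh, as the later "bash ...rst.sh" commands suggest.
missing_scripts() {
    for rst in "$1"/[1-9]-*.rst; do
        [ -e "$rst" ] || continue        # the glob matched nothing
        [ -e "$rst.sh" ] || echo "$rst"
    done
}

# Example: missing_scripts .
```

No output means every protocol file has a generated script.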

Run sar alongside the pipeline to monitor resources. Rather than opening another ssh session, use tmux to have multiple windows within the same ssh session. Note: ctrl+b means pressing the control key and b at the same time; this key combination can't be copied and pasted.

Now create a new tmux window in which sar will run:

ctrl+b c

Start running sar:

sar -u -r -d -o times.dat 1

Change windows to be able to run the pipeline commands:

ctrl+b n

When you log back in to the instance, reattach to the tmux session with:

tmux attach

FULL DATASET (mounted manually)

Streaming:

bash 2-streaming-full.rst.sh

Nonstreaming:

for i in [3-5]-*.rst.sh
do
   bash $i
done
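
The loop above keeps going even if one script fails, so a later step may run on incomplete output. A sketch of a stricter variant that stops at the first failure; the function name run_in_order is hypothetical:

```shell
# Hypothetical helper: run each script in the order given and stop at the
# first one that exits non-zero, reporting which script failed.
run_in_order() {
    for script in "$@"; do
        echo "running $script"
        bash "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Example: run_in_order [3-5]-*.rst.sh
```

The same function works for the data subset by passing the [1-3]-*.rst.sh scripts instead.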

DATA SUBSET

Streaming:

bash 1-streaming-subset.rst.sh

Nonstreaming: (from ctb branch instead of jem-streaming)

for i in [1-3]-*.rst.sh
do
   bash $i
done

After the pipeline finishes, use the following commands in the tmux window where sar was started to extract disk, CPU, and RAM information from the sar log:

sar -d -p -f times.dat > disk.txt
sar -u -f times.dat > cpu.txt
sar -r -f times.dat > ram.txt
gzip *.txt
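
For a quick sanity check before downloading, the CPU report can be summarised on the instance. The sketch below averages the %idle column; it assumes the default "sar -u" layout, where each sample row has "all" in the CPU column and %idle is the last field. The function name mean_idle is hypothetical:

```shell
# Hypothetical helper: read a "sar -u" text report on stdin and print the
# mean of the last column (%idle) over all sample rows. The "Average:"
# summary row is skipped; the "all" test covers both 24-hour and AM/PM
# timestamp formats.
mean_idle() {
    awk '$1 != "Average:" && ($2 == "all" || $3 == "all") { sum += $NF; n++ }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Example: gzip -dc cpu.txt.gz | mean_idle
```

A mean %idle close to 100 would suggest the pipeline spent most of its time waiting rather than computing.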

Use scp to transfer files to your local computer (Cyberduck also works, but scp is quicker). Replace the paths and the placeholders in < > brackets with your own values.

On the local computer, from the directory where you want to save the assembly, run:

scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/mnt/work/trinity_out_dir/Trinity.fasta .

Also copy times.dat and the disk, CPU, and RAM files by running the same kind of command on the local computer:

scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/home/ubuntu/khmer-protocols/mrnaseq/times.dat .
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/home/ubuntu/khmer-protocols/mrnaseq/*.txt.gz .
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/home/ubuntu/times.out .
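
Since the same key file and host appear in every scp command, a small wrapper may save retyping. This is only a sketch: the function name fetch and the variables KEY, HOST, and DRY_RUN are hypothetical, and HOST must be set to the instance's Public DNS.

```shell
# Hypothetical settings: point KEY at your key file and HOST at the Public DNS.
KEY="${KEY:-$HOME/Downloads/amazon-streaming-current.pem}"
HOST="${HOST:-example-public-dns}"

# Hypothetical wrapper: copy one remote file into the current directory.
# With DRY_RUN=1 the scp command is printed instead of being executed.
fetch() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo scp -i "$KEY" "ubuntu@$HOST:$1" .
    else
        scp -i "$KEY" "ubuntu@$HOST:$1" .
    fi
}

# Example: fetch /home/ubuntu/khmer-protocols/mrnaseq/times.dat
```

Running with DRY_RUN=1 first shows the exact command before anything is transferred.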

To parse the sar output, run "./extract.py xvdf" to get the log.out file (the argument specifies the disk of interest). The parsing script can be found in this repository: https://github.com/jessicamizzi/sartre-jessica. Note: the sartre-jessica extract script only works for full data sets that have been mounted. For data subsets, clone and use https://github.com/ctb/sartre.

If readstats.py was run:

scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/mnt/work/readstats.out .

Install Transrate:

cd
curl -O -L https://bintray.com/artifact/download/blahah/generic/transrate-1.0.1-linux-x86_64.tar.gz
tar xzf transrate-1.0.1-linux-x86_64.tar.gz

export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64
echo 'export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64' >> ~/.bashrc
export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64/bin
echo 'export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64/bin' >> ~/.bashrc

transrate --install-deps ref

Make a working directory:

mkdir /mnt/transrate
cd /mnt/transrate

Copy assembly over, rename it, run sed to fix formatting problems, and run transrate

cp /mnt/work/trinity_out_dir/Trinity.fasta .
sed 's_|_-_g' Trinity.fasta > Trinity.fixed.fa
transrate --assembly Trinity.fixed.fa
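
The sed command uses _ as its delimiter and replaces every | in the file with -. Its effect can be seen on a made-up record; the header below is hypothetical and the real IDs in Trinity.fasta will differ:

```shell
# Hypothetical Trinity-style record piped through the same sed expression.
printf '>TR1|c0_g1_i1 len=300\nACGT\n' | sed 's_|_-_g'
# prints:
# >TR1-c0_g1_i1 len=300
# ACGT
```

Sequence lines contain no | characters, so only the headers change.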

Then, download your beautiful assemblies.csv stats file!

scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/mnt/transrate/transrate_results/assemblies.csv .

tmux commands cheat sheet:

tmux
ctrl+b c # create new window
ctrl+b n # move to next window
ctrl+b p # move to previous window
tmux attach # attach back to tmux after logging back on
