Difference between revisions of "MSblender TACC"

From Marcotte Lab

Latest revision as of 15:39, 29 June 2015


Before you start

  • To use this setup, your TACC account must be allocated to our lab project ('A-cm10'). If you don't have an account, create one at https://portal.tacc.utexas.edu/. Then ask Edward to add your account as a member of the lab project.
  • This document is for 'stampede'.
  • Currently, in most cases I use three search engines: comet, X!Tandem, and MS-GF+.
  • You don't need to run the 'MSblender' modeling step on TACC, because it does not take that long. I normally run all searches at TACC, then transfer the output to my local machine to run MSblender. So this page covers only the 'search' part. For running MSblender, please see the MSblender page.

Install MSblender (and comet, MSGF+, X!Tandem)

$ cd ~
$ mkdir git
$ cd git
$ git clone https://github.com/marcottelab/MSblender.git

Prepare a working space

$ module load python
$ cd $SCRATCH
$ mkdir myProject
$ cd myProject
$ mkdir mzXML
$ mkdir DB
$ mkdir comet
$ mkdir MSGF+
$ mkdir tandemK
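
The same layout can also be created in one step. The following is a minimal sketch, not part of MSblender; the function name make_workspace is hypothetical, and it only assumes the five directory names listed above.

```shell
# Hypothetical helper: create the project layout used on this page.
make_workspace() {
  # $1 = project root, e.g. "$SCRATCH/myProject"
  for d in mzXML DB comet MSGF+ tandemK; do
    # -p creates missing parents and does not fail if the directory exists
    mkdir -p "$1/$d" || return 1
  done
}
```

For example, make_workspace "$SCRATCH/myProject" can be run again later without errors, because existing directories are left untouched.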

Prepare database

  • You can run this process on any computer. If it takes longer than a minute, run it somewhere other than a TACC login node (otherwise your account may be locked).
$ python $HOME/git/MSblender/pre/fasta-reverse.py my_seq.fa
$ cat my_seq.fa.* > my_seq.combined.fa
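
Before building any index, it can help to confirm that the combined file really contains both target and decoy entries. The sketch below counts FASTA headers; it assumes that decoy headers start with the prefix 'rv_', which may not match the output of your version of fasta-reverse.py, so check a decoy header first. The function name count_fasta_entries is hypothetical.

```shell
# Count target and decoy entries in a combined FASTA file.
# Assumption: decoy headers start with ">rv_"; pass another prefix as $2 if yours differs.
count_fasta_entries() {
  fasta="$1"
  prefix="${2:-rv_}"
  total=$(grep -c '^>' "$fasta")          # all sequence headers
  decoy=$(grep -c "^>$prefix" "$fasta")   # headers carrying the decoy prefix
  echo "target=$((total - decoy)) decoy=$decoy"
}
```

Running count_fasta_entries my_seq.combined.fa should report equal target and decoy counts if every sequence was reversed exactly once.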

DB setup for X!Tandem

$ $HOME/git/MSblender/extern/fasta_pro.exe my_seq.combined.fa

You may see a message like the one below:

$ ~/git/MSblender/extern/fasta_pro.exe my_seq.combined.fa
fasta_pro file conversion utility, v. 2006.09.15
input path = my_seq.combined.fa
output path = my_seq.combined.fa.pro
db type = plain

DB setup for comet

You don't need to do anything for this.

DB setup for MSGF+

Building the index uses a significant amount of computing resources (in particular, memory), so it may not be suitable to run on a login node.

$ module load jdk64
$ java -Xmx4000M -cp $HOME/git/MSblender/extern/MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d my_seq.combined.fa -tda 0

Prepare mzXML files

Copy your mzXML files to this directory ($SCRATCH/myProject/mzXML).

Run comet

$ cd $SCRATCH/myProject/comet
$ ~/git/MSblender/extern/comet.linux.exe -p

Edit the 'comet.params.new' file. Typically, you need to change the following lines.

num_threads = 16

peptide_mass_tolerance = 20.0
peptide_mass_units = 2

search_enzyme_number = 2   ## See the end of param file for the type of enzymes

output_txtfile = 1
output_pepxmlfile = 0
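
If you prepare many projects, these edits can be scripted instead of made by hand. The following is a rough sketch under two assumptions: each parameter sits on its own line in the form 'key = value', and it is acceptable to lose any trailing comment on the edited line. The function name set_comet_param is hypothetical and is not part of comet.

```shell
# Set "key = value" in a parameter file, replacing the whole line.
# Note: any trailing comment on that line is dropped; a .bak copy of the file is kept.
set_comet_param() {
  # $1 = key, $2 = new value, $3 = parameter file
  sed -i.bak "s|^$1 *=.*|$1 = $2|" "$3"
}
```

For example, set_comet_param num_threads 16 comet.params.new followed by set_comet_param output_txtfile 1 comet.params.new. Compare the result against comet.params.new.bak before submitting a job.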

Then, create the launcher script (called 'stampede-comet.sh') as below.

#!/bin/bash
#SBATCH -J "cmt"
#SBATCH -n 16
#SBATCH -p normal
#SBATCH -t 24:00:00
#SBATCH -o cmt.o%j
# Keep all #SBATCH lines above the first command; sbatch ignores any that come later.

COMET="$HOME/git/MSblender/extern/comet.linux.exe"

DB="../DB/my_seq.combined.fa"
DBNAME=$(basename "$DB")
DBNAME=${DBNAME/.fa/}

PARAM="./comet.params.new"

# Search every mzXML file against the combined database.
for MZXML in ../mzXML/*.mzXML
do
  OUT=$(basename "$MZXML")
  OUT=${OUT/.mzXML/}"."$DBNAME".comet"
  time "$COMET" -P"$PARAM" -D"$DB" -N"$OUT" "$MZXML"
done

Then, submit the job by typing 'sbatch stampede-comet.sh'
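
To see which output name the launcher will pass to comet for a given input, the naming rule can be tried on its own. This is a small sketch of that rule only; the function name comet_out_name and the file name sample1.mzXML are hypothetical.

```shell
# Build the output base name "<run>.<database>.comet" used by the launcher.
comet_out_name() {
  # $1 = path to an mzXML file, $2 = path to the FASTA database
  run=$(basename "$1" .mzXML)   # strip directory and the .mzXML suffix
  db=$(basename "$2" .fa)       # strip directory and the .fa suffix
  echo "$run.$db.comet"
}
```

For example, comet_out_name ../mzXML/sample1.mzXML ../DB/my_seq.combined.fa prints sample1.my_seq.combined.comet.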

Run MSGF+

Create 'stampede-MSGF+.sh' file as below.

#!/bin/bash
#SBATCH -J "mg+"
#SBATCH -n 16
#SBATCH -p normal
#SBATCH -t 24:00:00
#SBATCH -o mg+.o%j
# Keep all #SBATCH lines above the first command; sbatch ignores any that come later.
set -x

module load jdk64

MSGFplus_JAR="$HOME/git/MSblender/extern/MSGFPlus.jar"

DB="../DB/my_seq.combined.fa"
DBNAME=$(basename "$DB")
DBNAME=${DBNAME/.fa/}

for MZXML in ../mzXML/*.mzXML
do
  OUT=$(basename "$MZXML")
  OUT=${OUT/.mzXML/}"."$DBNAME".MSGF+.mzid"
  TBL=${OUT/.mzid/.tsv}
  # Search, then convert the mzid result to a tab-separated table.
  time java -Xmx20000M -jar "$MSGFplus_JAR" -d "$DB" -s "$MZXML" -o "$OUT" -t 20ppm -tda 0 -ntt 2 -e 1 -inst 3
  time java -Xmx20000M -cp "$MSGFplus_JAR" edu.ucsd.msjava.ui.MzIDToTsv -i "$OUT" -o "$TBL" -showQValue 1 -showDecoy 1 -unroll 0
done

Then, submit the job by typing 'sbatch stampede-MSGF+.sh'
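
A job can hit the 24-hour limit before every file is searched, so it may be worth checking which inputs still lack a result table. The sketch below assumes the output naming used in the launcher above ('<run>.<database>.MSGF+.tsv'); the function name list_missing_tsv is hypothetical.

```shell
# List mzXML files that have no non-empty MS-GF+ .tsv table yet.
list_missing_tsv() {
  # $1 = mzXML directory, $2 = result directory, $3 = database name without ".fa"
  for f in "$1"/*.mzXML; do
    [ -e "$f" ] || continue                 # no mzXML files at all
    run=$(basename "$f" .mzXML)
    # -s is true only for files that exist and are not empty
    [ -s "$2/$run.$3.MSGF+.tsv" ] || echo "$f"
  done
}
```

For example, run list_missing_tsv ../mzXML . my_seq.combined from inside the MSGF+ directory; an empty result means every input has a table.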

Run X!Tandem

$ cd $SCRATCH/myProject/tandemK
$ ~/git/MSblender/search/prepare-tandemK-high.py ../mzXML/ ../DB/my_seq.combined.fa.pro

The first argument of prepare-tandemK-high.py is the directory containing the mzXML files, and the second is the .pro database generated by fasta_pro.exe as above.

You will see *.xml files matching your mzXML files (X!Tandem input), a 'tandem-taxonomy.xml' file (another X!Tandem input), and run-tandemK.sh (a script to run X!Tandem).

Make the following launcher ('stampede-tandemK.sh'), and submit it as 'sbatch stampede-tandemK.sh'.

#!/bin/bash
#SBATCH -n 16
#SBATCH -p normal
#SBATCH -t 24:00:00

#SBATCH -o tK.o%j
#SBATCH -J "tK"
set -x

bash ./run-tandemK.sh

If you have many mzXML files, you can run them in parallel by splitting run-tandemK.sh with the 'split -l' command and running the individual split scripts from 'stampede-tandemK.sh'.
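
One possible way to do the split is sketched below. It assumes that every line of run-tandemK.sh is a standalone command (no multi-line constructs), which you should verify by opening the file first. The function name run_split_parallel and the '.part.' suffix are hypothetical choices.

```shell
# Split a one-command-per-line script into chunks and run the chunks concurrently.
run_split_parallel() {
  # $1 = script to split, $2 = number of lines per chunk
  split -l "$2" "$1" "$1.part."   # writes <script>.part.aa, <script>.part.ab, ...
  for part in "$1".part.*; do
    bash "$part" &                # each chunk runs in the background
  done
  wait                            # return only after all chunks have finished
}
```

Inside 'stampede-tandemK.sh', the line 'bash ./run-tandemK.sh' would then become something like run_split_parallel ./run-tandemK.sh 4. Remove old '.part.' files before running it again, otherwise they are executed once more.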