Difference between revisions of "MSblender"

Latest revision as of 16:39, 20 November 2014

MSblender is a statistical tool for merging database search results from multiple database search engines for peptide identification based on a multivariate modeling approach. We presented this work at RECOMB-CP 2011 on March, 2011, and published in the Journal of Proteomics Research in April, 2011 (see MSblender#Citation for details).

Authors

Hyungwon Choi (hwchoi at umich d0t edu)
Taejoon Kwon (taejoon dot kwon at marcottelab d0t org)

Prerequisites

(We have tested our code under Mac OSX (10.5 Leopard) and Ubuntu Linux (10.04 and later). We don't support MS Windows platforms yet.) To run MSblender, you should install the following programs/packages on the machine.

python (2.5 or later)
gcc (we used version 4.4.3, but we believe that our ANSI-C based codes are not dependent on specific version of gcc).
GNU Scientific Library (version 1.13 or later)
- If you use ubuntu (or debian) linux, install 'gsl-bin' and 'libgsl0-*' packages.
(Optional) matplotlib (python graph library). Only required for plotting scripts written in python.

Also, you need to have search engine results to run MSblender. All searches should be conducted with same concatenated database (target + decoy). The current script recognizes 'xf_' prefix at protein ID as a decoy sequence, but you can easily modify this at 'make-msblender_in.py'.

Installation

Download source code from GitHub. Alternatively, you can download it from http://www.marcottelab.org/users/MSblender/release/ .
Enter to 'src/' directory, and execute './compile' script. You should have GNU Scientific Library before running this script. It will generate 'msblender' and 'msblender.h.gch' files at the same directory.
That's it. Now you are ready to run MSblender.

How to use

MSblender works in three stages: pre-processing, modelling and post-processing.

Pre-processing

First MSblender converts the set of peptide mass spectral search engine results into a unified tab-delimited text file called the 'hit_list' format. Then it transforms this 'hit_list' into an MSblender modelling program input file. You can see a 'test' dataset and its output at http://www.marcottelab.org/users/MSblender/test/.

Currently, MSblender supports the following search engine results (and scores).

SEQUEST, Xcorr (If you get an SRF file from Thermo BioWorks, you can convert it to pepxml directly using Mspire, developed by John T. Prince).
X!Tandem, k-score (a.k.a COMET search engine) based -log(E-value)
OMSSA, -log(E-value)
InsPecT, MQscore
MyriMatch, mvh
MSGFDB, -log(SpecProb)

For example, you can convert an X!Tandem pepxml file to logE_hit_score as shown below:

$ ../src/MSblender-20110130/pre/tandem_pepxml-to-logE_hit_list.py test.tandem_k.pepxml 
Write test.tandem_k.logE_hit_list ...

The hit_list file generated by this program looks like this:

# pepxml: test.tandem_k.pepxml
#Spectrum_id	Charge	PrecursorMz	MassDiff	Peptide	Protein	MissedCleavages	Score(-log10[E-value])
MSups_5ul.07228.07228.4	4	689.596425	0.004000	SLLSNVEGDNAVPMQHNNRPTQPLK	CAH1_HUMAN_UPS|P00915|5000|50000|260	0	1.795880
MSups_5ul.11647.11647.2	2	592.839650	0.000000	ADGLAVIGVLMK	CAH1_HUMAN_UPS|P00915|5000|50000|260	0	1.148742
MSups_5ul.06405.06405.2	2	524.279350	0.003000	DLFNAIATGK	CATA_HUMAN_UPS|P04040|5000|5000|526	0	0.327902
....

Some search engines report multiple PSMs for a single spectrum (mainly because of different charge state estimations). For example, in the default setting, MyriMatch reports all best hits for both +2 and +3 charge states, so the total number of PSMs is almost twice that of other search engine results. To correct for these effects, you can choose 'the best' PSM per each spectrum based on the score you defined. And 'select-best-PSM.py' is the script for that.

$ wc test.myrimatch.mvh_hit_list
  10888   87099 1168772 test.myrimatch.mvh_hit_list
$ ../src/MSblender-20110130/pre/select-best-PSM.py test.myrimatch.mvh_hit_list
$ wc test.myrimatch.mvh_hit_list_best 
  5516  44123 598964 test.myrimatch.mvh_hit_list_best

Then, you can compile multiple 'hit_list' files into msblender input file. You need to have a text conf file as below:

InsPect         test.inspect.MQscore_hit_list_best
MyriMatch       test.myrimatch.mvh_hit_list_best
SEQUEST         test.sequest.xcorr_hit_list_best
X!Tandem        test.tandem_k.logE_hit_list_best

Then, run 'make-msblender_in.py' script. It will make <prefix of conf file>.msblender_in file and <prefix of conf file>.prot_list.

$ ../src/MSblender-20110130/pre/make-msblender_in.py msblender.conf

'msblender.msblender_in' looks like this:

sp_pep_id decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score
MSups_5ul.00439.00439.3.ASLSNTPSIGQ 0 0.031000  NA  NA  NA
MSups_5ul.00439.00439.3.LDELRDEGK 0 NA  18.090108 0.914975  -0.832509
MSups_5ul.00444.00444.1.GQFVK 1 NA  2.598828  NA  NA
MSups_5ul.00446.00446.3.LDELRDEGK 0 NA  13.341218 0.930569  -0.579784
MSups_5ul.00461.00461.3.ADDKETCFAEEGKK  0 NA  16.846330 1.834260  -0.770852
...

And 'msblender.prot_list' looks like this:

MSups_5ul.00439.3.ASLSNTPSIGQ	XXX.CATG_HUMAN_UPS|P08311|5000|5
MSups_5ul.00439.3.LDELRDEGK	ALBU_HUMAN_UPS|P02768|5000|50000|584
MSups_5ul.00444.1.GQFVK	xf_KCRM_HUMAN_UPS|P06732|5000|500
MSups_5ul.00446.3.LDELRDEGK	ALBU_HUMAN_UPS|P02768|5000|50000|584
MSups_5ul.00461.3.ADDKETCFAEEGKK	ALBU_HUMAN_UPS|P02768|5000|50000|584

Multivariate Modeling

Feed 'msbledner_in' file to the 'msblender' executive file under 'src/' directory as below:

$ ~/git/MSblender/src/msblender msblender.msblender_in
....
29	93483.809008	0.5133
TRUE
0.514
3.328	28.868	2.830	0.898	

1.795	20.493	0.985	1.516	
20.493	502.115	24.678	37.461	
0.985	24.678	1.670	2.143	
1.516	37.461	2.143	3.484	

0.486
4.182	51.858	3.870	2.507	

1.630	14.148	0.815	1.226	
14.148	380.431	17.345	26.407	
0.815	17.345	1.302	1.646	
1.226	26.407	1.646	2.589	


30	93484.230685	0.5133
TRUE
0.486
4.182	51.858	3.870	2.507	

1.630	14.148	0.815	1.226	
14.148	380.431	17.345	26.407	
0.815	17.345	1.302	1.646	
1.226	26.407	1.646	2.589	

0.514
3.328	28.868	2.830	0.898	

1.795	20.493	0.985	1.516	
20.493	502.115	24.678	37.461	
0.985	24.678	1.670	2.143	
1.516	37.461	2.143	3.484	
$

The program will terminate when it converges. If the number of iterations reaches your initial setting (here, 100), run the script again allowing for more iterations.

Now you should see an output file named 'test.msblender_in.msblender_out' in the same directory. The file looks like this:

Spectrum  Decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score  mvScore
MSups_5ul.00439.00439.3.ASLSNTPSIGQ F 0.03        0.006
MSups_5ul.00439.00439.3.LDELRDEGK F   18.09 0.91  -0.83 1.000
MSups_5ul.00444.00444.1.GQFVK D   2.60      0.085
MSups_5ul.00446.00446.3.LDELRDEGK F   13.34 0.93  -0.58 1.000
MSups_5ul.00461.00461.3.ADDKETCFAEEGKK  F   16.85 1.83  -0.77 1.000
MSups_5ul.00590.00590.2.AAFTECCQAADK  F 4.80  34.62 3.17  1.39  1.000
...

Post-processing

'make-spcount.py' script at 'post/' directory can estimate numbers of spectral counts (and peptides) at a given FDR cutoff, based on the mvscore model.

$ ~/git/MSblender/post/make-spcount.py msblender.msblender_in.msblender_out msblender.prot_list 0.01
Read msblender.prot_list ... Done
Read msblender.msblender_in.msblender_out ... Done
$

'.spcount_FDR0010', a output with spectral counts per protein, looks like this:

#ProtID	TotalCount	MSups_5ul
ALBU_HUMAN_UPS|P02768|5000|50000|584	1300.00	1300.00
ANT3_HUMAN_UPS|P01008|5000|50|432	4.00	4.00
ANXA5_HUMAN_UPS|P08758|5000|0.5|319	1.00	1.00
CAH1_HUMAN_UPS|P00915|5000|50000|260	556.50	556.50
CAH2_HUMAN_UPS|P00918|5000|50000|259	391.00	391.00

And, '.pep_list_FDR0010' file looks like this:

ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	ETYGEMADCCAKQEPERNECFLQHK	4.00
ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	CCAAADPHECYAK	2.00
ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	YLYEIARR	2.00
ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	SLHTLFGDKLCTVATLR	90.00

Citation

T. Kwon*, H. Choi*, C. Vogel, A.I. Nesvizhskii, and E.M. Marcotte, MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines. J. Proteome Research, 10(7): 2949–2958 (2011) Link

@@ Line 1: / Line 1: @@
-MSblender is a statistical tool for merging database search results from multiple database search engines for peptide identification based on a multivariate modelling approach. '''We will present this work at [http://proteomics.ucsd.edu/recombcp2011/ RECOMB-CP 2011] in March, 2011.'''
+MSblender is a statistical tool for merging database search results from multiple database search engines for peptide identification based on a multivariate modeling approach. We presented this work at [http://proteomics.ucsd.edu/recombcp2011/ RECOMB-CP 2011] on March, 2011, and published in the ''Journal of Proteomics Research'' in April, 2011 (see [[MSblender#Citation]] for details).
 == Authors ==
-* [http://www-personal.umich.edu/~hwchoi/Main.html Hyungwon Choi] (hwchoi at umich dot edu)
+* [http://www-personal.umich.edu/~hwchoi/Main.html Hyungwon Choi] (hwchoi at umich d0t edu)
-* [[User:Taejoon|Taejoon Kwon]] (taejoon dot kwon at mail dot utexas dot edu)
+* [[User:Taejoon|Taejoon Kwon]] (taejoon dot kwon at marcottelab d0t org)
 == Prerequisites ==
-'''(We tested our codes at Mac OSX (10.5 Leopard) and Ubuntu Linux (10.04 and later). We don't support MS Windows platform yet.)''' To run MSblender, you should install the following programs/packages on the machine.
+'''(We have tested our code under Mac OSX (10.5 Leopard) and Ubuntu Linux (10.04 and later). We don't support MS Windows platforms yet.)''' To run MSblender, you should install the following programs/packages on the machine.
 * [http://www.python.org python]  (2.5 or later)
@@ Line 12: / Line 12: @@
 * [http://www.gnu.org/software/gsl/ GNU Scientific Library] (version 1.13 or later)
 ** If you use ubuntu (or debian) linux, install 'gsl-bin' and 'libgsl0-*' packages.
-* (Optional) [http://matplotlib.sourceforge.net/ matplotlib (python graph library)]. Only required for 'pre/plot-his_list.py' script.
+* (Optional) [http://matplotlib.sourceforge.net/ matplotlib (python graph library)]. Only required for plotting scripts written in python.
+Also, you need to have search engine results to run MSblender. All searches should be conducted with same concatenated database (target + decoy). The current script recognizes 'xf_' prefix at protein ID as a decoy sequence, but you can easily modify this at 'make-msblender_in.py'.
 == Installation ==
-* Download source code from [https://github.com/MarcotteLabGit/MSblender GitHub]. Alternatively, you can download it from http://www.marcottelab.org/users/MSblender/src/MSblender-current.tgz .
+* Download source code from [https://github.com/marcottelab/MSblender GitHub]. Alternatively, you can download it from http://www.marcottelab.org/users/MSblender/release/ .
-* Enter to 'c/' directory, and execute './compile' script. You should have GNU Scientific Library before running this script. It will generate 'msblender' and 'msblender.h.gch' files at the same directory.
+* Enter to 'src/' directory, and execute './compile' script. You should have GNU Scientific Library before running this script. It will generate 'msblender' and 'msblender.h.gch' files at the same directory.
 * That's it. Now you are ready to run MSblender.
 == How to use ==
-MSblender is working in three steps: pre-processing, modelling and post-processing.
+MSblender works in three stages: pre-processing, modelling and post-processing.
 === Pre-processing ===
-First MSblender converts various search engine results into a unified tab-delimited text file called 'hit_list' format. Then it transfers 'hit_list' to MSblender modelling program input file.
+First MSblender converts the set of peptide mass spectral search engine results into a unified tab-delimited text file called the 'hit_list' format. Then it transforms this 'hit_list' into an MSblender modelling program input file. You can see a 'test' dataset and its output at http://www.marcottelab.org/users/MSblender/test/.
 Currently, MSblender supports the following search engine results (and scores).
-* [http://dx.doi.org/10.1016/1044-0305(94)80016-2 SEQUEST], Xcorr (If you get SRF file from Thermo BioWorks, you can convert it to pepxml directly using [http://mspire.rubyforge.org/ Mspire], developed by [http://www.chem.byu.edu/users/jtprince John T. Prince]).
+* [http://dx.doi.org/10.1016/1044-0305(94)80016-2 SEQUEST], Xcorr (If you get an SRF file from Thermo BioWorks, you can convert it to pepxml directly using [http://mspire.rubyforge.org/ Mspire], developed by [http://www.chem.byu.edu/users/jtprince John T. Prince]).
 * [http://www.ncbi.nlm.nih.gov/pubmed/15595733 X!Tandem], [http://www.ncbi.nlm.nih.gov/pubmed/16729052 k-score (a.k.a COMET search engine)] based -log(E-value)
 * [http://www.ncbi.nlm.nih.gov/pubmed/15473683 OMSSA], -log(E-value)
@@ Line 33: / Line 35: @@
 * [http://www.ncbi.nlm.nih.gov/pubmed/20829449 MSGFDB], -log(SpecProb)
-For example, you can convert X!Tandem pepxml file to logE_hit_score as below:
+For example, you can convert an X!Tandem pepxml file to logE_hit_score as shown below:
-<pre>$ ../src/MSblender-20110130/pre/sequest_pepxml-to-xcorr_hit_list.py test.sequest.pepxml
+<pre>$ ../src/MSblender-20110130/pre/tandem_pepxml-to-logE_hit_list.py test.tandem_k.pepxml
-Write test.sequest.xcorr_hit_list ...
+Write test.tandem_k.logE_hit_list ... </pre>
-$ </pre>
-The hit_list file looks like as below:
+The hit_list file generated by this program looks like this:
-<pre># pepxml: test.sequest.pepxml
+<pre># pepxml: test.tandem_k.pepxml
-#Spectrum_id	Charge	PrecursorMz	MassDiff	Peptide	Protein	MissedCleavages	Score(Xcorr)
+#Spectrum_id	Charge	PrecursorMz	MassDiff	Peptide	Protein	MissedCleavages	Score(-log10[E-value])
-MSups_5ul.04192.04194.2	2	577.843127	0.006395	MLVVLLQANR	ANXA5_HUMAN_UPS|P08758|5000|0.5|319	0	0.524123
+MSups_5ul.07228.07228.4	4	689.596425	0.004000	SLLSNVEGDNAVPMQHNNRPTQPLK	CAH1_HUMAN_UPS|P00915|5000|50000|260	0	1.795880
-MSups_5ul.07228.07228.4	4	689.596178	0.002584	SLLSNVEGDNAVPMQHNNRPTQPLK	CAH1_HUMAN_UPS|P00915|5000|50000|260	1	2.518871
+MSups_5ul.11647.11647.2	2	592.839650	0.000000	ADGLAVIGVLMK	CAH1_HUMAN_UPS|P00915|5000|50000|260	0	1.148742
-MSups_5ul.11647.11647.2	2	592.839464	-0.000197	ADGLAVIGVLMK	CAH1_HUMAN_UPS|P00915|5000|50000|260	0	2.787324
+MSups_5ul.06405.06405.2	2	524.279350	0.003000	DLFNAIATGK	CATA_HUMAN_UPS|P04040|5000|5000|526	0	0.327902
-MSups_5ul.05651.05651.3	3	549.303576	-0.003018	VWPHKDYPLIPVGK	CATA_HUMAN_UPS|P04040|5000|5000|526	1	2.593570
+....</pre>
+Some search engines report multiple PSMs for a single spectrum (mainly because of different charge state estimations). For example, in the default setting, MyriMatch reports all best hits for both +2 and +3 charge states, so the total number of PSMs is almost twice that of other search engine results. To correct for these effects, you can choose 'the best' PSM per each spectrum based on the score you defined. And 'select-best-PSM.py' is the script for that.
+<pre>$ wc test.myrimatch.mvh_hit_list
+   87099 1168772 test.myrimatch.mvh_hit_list
+$ ../src/MSblender-20110130/pre/select-best-PSM.py test.myrimatch.mvh_hit_list
+$ wc test.myrimatch.mvh_hit_list_best
+  44123 598964 test.myrimatch.mvh_hit_list_best</pre>
+Then, you can compile multiple 'hit_list' files into msblender input file. You need to have a text conf file as below:
+<pre>InsPect         test.inspect.MQscore_hit_list_best
+MyriMatch       test.myrimatch.mvh_hit_list_best
+SEQUEST         test.sequest.xcorr_hit_list_best
+X!Tandem        test.tandem_k.logE_hit_list_best</pre>
+Then, run 'make-msblender_in.py' script. It will make <prefix of conf file>.msblender_in file and <prefix of conf file>.prot_list.
+ $ ../src/MSblender-20110130/pre/make-msblender_in.py msblender.conf
+'msblender.msblender_in' looks like this:
+<pre>sp_pep_id decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score
+MSups_5ul.00439.00439.3.ASLSNTPSIGQ 0 0.031000  NA  NA  NA
+MSups_5ul.00439.00439.3.LDELRDEGK 0 NA  18.090108 0.914975  -0.832509
+MSups_5ul.00444.00444.1.GQFVK 1 NA  2.598828  NA  NA
+MSups_5ul.00446.00446.3.LDELRDEGK 0 NA  13.341218 0.930569  -0.579784
+MSups_5ul.00461.00461.3.ADDKETCFAEEGKK  0 NA  16.846330 1.834260  -0.770852
+...</pre>
+And 'msblender.prot_list' looks like this:
+<pre>MSups_5ul.00439.3.ASLSNTPSIGQ	XXX.CATG_HUMAN_UPS|P08311|5000|5
+MSups_5ul.00439.3.LDELRDEGK	ALBU_HUMAN_UPS|P02768|5000|50000|584
+MSups_5ul.00444.1.GQFVK	xf_KCRM_HUMAN_UPS|P06732|5000|500
+MSups_5ul.00446.3.LDELRDEGK	ALBU_HUMAN_UPS|P02768|5000|50000|584
+MSups_5ul.00461.3.ADDKETCFAEEGKK	ALBU_HUMAN_UPS|P02768|5000|50000|584</pre>
+=== Multivariate Modeling ===
+Feed 'msbledner_in' file to the 'msblender' executive file under 'src/' directory as below:
+<pre>$ ~/git/MSblender/src/msblender msblender.msblender_in
 ....
-</pre>
+	93483.809008	0.5133
+TRUE
+.514
+.328	28.868	2.830	0.898
+.795	20.493	0.985	1.516
+.493	502.115	24.678	37.461
+.985	24.678	1.670	2.143
+.516	37.461	2.143	3.484
+.486
+.182	51.858	3.870	2.507
+.630	14.148	0.815	1.226
+.148	380.431	17.345	26.407
+.815	17.345	1.302	1.646
+.226	26.407	1.646	2.589
+	93484.230685	0.5133
+TRUE
+.486
+.182	51.858	3.870	2.507
+.630	14.148	0.815	1.226
+.148	380.431	17.345	26.407
+.815	17.345	1.302	1.646
+.226	26.407	1.646	2.589
+.514
+.328	28.868	2.830	0.898
+.795	20.493	0.985	1.516
+.493	502.115	24.678	37.461
+.985	24.678	1.670	2.143
+.516	37.461	2.143	3.484
+$</pre>
+The program will terminate when it converges. If the number of iterations reaches your initial setting (here, 100), run the script again allowing for more iterations.
+Now you should see an output file named 'test.msblender_in.msblender_out' in the same directory. The file looks like this:
+<pre>Spectrum  Decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score  mvScore
+MSups_5ul.00439.00439.3.ASLSNTPSIGQ F 0.03        0.006
+MSups_5ul.00439.00439.3.LDELRDEGK F   18.09 0.91  -0.83 1.000
+MSups_5ul.00444.00444.1.GQFVK D   2.60      0.085
+MSups_5ul.00446.00446.3.LDELRDEGK F   13.34 0.93  -0.58 1.000
+MSups_5ul.00461.00461.3.ADDKETCFAEEGKK  F   16.85 1.83  -0.77 1.000
+MSups_5ul.00590.00590.2.AAFTECCQAADK  F 4.80  34.62 3.17  1.39  1.000
+...</pre>
+=== Post-processing ===
+'make-spcount.py' script at 'post/' directory can estimate numbers of spectral counts (and peptides) at a given FDR cutoff, based on the mvscore model.
+<pre>$ ~/git/MSblender/post/make-spcount.py msblender.msblender_in.msblender_out msblender.prot_list 0.01
+Read msblender.prot_list ... Done
+Read msblender.msblender_in.msblender_out ... Done
+$</pre>
+'.spcount_FDR0010', a output with spectral counts per protein, looks like this:
+<pre>#ProtID	TotalCount	MSups_5ul
+ALBU_HUMAN_UPS|P02768|5000|50000|584	1300.00	1300.00
+ANT3_HUMAN_UPS|P01008|5000|50|432	4.00	4.00
+ANXA5_HUMAN_UPS|P08758|5000|0.5|319	1.00	1.00
+CAH1_HUMAN_UPS|P00915|5000|50000|260	556.50	556.50
+CAH2_HUMAN_UPS|P00918|5000|50000|259	391.00	391.00 </pre>
+And, '.pep_list_FDR0010' file looks like this:
+<pre>ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	ETYGEMADCCAKQEPERNECFLQHK	4.00
+ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	CCAAADPHECYAK	2.00
+ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	YLYEIARR	2.00
+ALBU_HUMAN_UPS|P02768|5000|50000|584	MSups_5ul	SLHTLFGDKLCTVATLR	90.00</pre>
 == Citation ==
-* T. Kwon*, H. Choi*, C. Vogel, A.I. Nesvizhskii, and E.M. Marcotte, MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines. <i>Submitted.</i>
+* T. Kwon*, H. Choi*, C. Vogel, A.I. Nesvizhskii, and E.M. Marcotte, MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines. <i>J. Proteome Research</i>, 10(7): 2949–2958 (2011) [http://pubs.acs.org/doi/full/10.1021/pr2002116 Link]
 == See also ==
+* [[MStoolbox]] Automated multiple search engine pipeline.
+* [[MSblender_TACC]] How to use MSblender at the Texas Advanced Computing Center
 * https://github.com/MarcotteLabGit/MSblender (GitHub source repository)

Difference between revisions of "MSblender"

Latest revision as of 16:39, 20 November 2014

Contents

Authors

Prerequisites

Installation

How to use

Pre-processing

Multivariate Modeling

Post-processing

Citation

See also

Navigation menu

Views

Personal tools

Navigation

Projects

Classes

Search

Toolbox