TXGP RNAseq assembly

From Marcotte Lab

Latest revision as of 01:17, 13 April 2012

Raw data used for TXGP RNA-seq assembly

{| class="wikitable" style="text-align: center;"
! Dataset !! Contributor !! Samples !! Reads !! Assembled Tx (raw) !! ''X. laevis'' genes !! ''X. tropicalis'' genes !! ''H. sapiens'' genes
|-
| Amin201106_XENLA || Nirav Amin, Frank Conlon (UNC) || 2 (no rep) || ~30M/library (75bp, single); 61M total || ~591k || 13,523 || 10,225 || 11,540
|-
| Park201106_XENLA || Tae Joo Park, Richard Harland (UC Berkeley) || 5 (no rep) || ~100M/library (50bp, single); 500M total || ~1,480k || 14,890 || 12,648 || 13,328
|-
| TXGP201107_XENLA || TXGP || 2 (1x2 rep) || ~100M/library (100bp, paired); 400M total || ~1,677k || 14,441 || 12,482 || 12,986
|-
| Chung201110_XENLA || Meii Chung, John Wallingford (UT Austin) || 4 (2x2 rep) || 16-38M/library (50bp, paired); 222M total || ~600k || 11,198 || 7,871 || 9,134
|-
| Quigley201112_XENLA || Ian Quigley, Christopher R. Kintner (Salk Institute) || 9 (unknown rep) || 23-50M/library (50bp, single); 311M total || ~647k || 13,291 || 10,790 || 11,383
|-
| Jarikji201201_XENLA || Zeina Jarikji, Marko Horb (MBL) || 9 (3x3 rep) || 39-72M/library (100bp, paired); 932M total || ~3,254k || 14,613 || 12,342 || 13,218
|-
| TeperekTkacz201202_XENLA || Marta Teperek-Tkacz, John Gurdon (Gurdon Institute) || 1 (no rep) || 94M/library (50bp, paired); 200M total || ~436k || 13,838 || 10,559 || 11,409
|}
  • "Assembled Tx (raw)" is the number of peptide query sequences used for the BLAST search.

Pre-processing

  • Filter out reads that contain no-call bases.
  • Trim the 5' or 3' end of the reads if necessary.
  • For paired-end libraries, compile the read pairs in which neither mate was filtered out.
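
The steps above can be sketched roughly as follows. This is only an illustration, not the script actually used for TXGP: the function names, the choice of "N" and "." as no-call characters, and the in-memory record layout `(name, seq, qual)` are all assumptions, and the pairing rule reflects one reading of the last bullet (a pair is kept only when both mates pass the filter).

```python
def has_no_call(seq, no_call_chars="N."):
    """Return True if the read contains any no-call base (assumed: 'N' or '.')."""
    return any(base in no_call_chars for base in seq.upper())


def trim(seq, qual, trim5=0, trim3=0):
    """Remove trim5 bases from the 5' end and trim3 bases from the 3' end."""
    end = len(seq) - trim3
    return seq[trim5:end], qual[trim5:end]


def filter_pairs(pairs, trim5=0, trim3=0):
    """Yield trimmed read pairs in which neither mate contains a no-call.

    pairs: iterable of ((name1, seq1, qual1), (name2, seq2, qual2)).
    """
    for r1, r2 in pairs:
        # Drop the whole pair if either mate has a no-call base.
        if has_no_call(r1[1]) or has_no_call(r2[1]):
            continue
        yield tuple(
            (name, *trim(seq, qual, trim5, trim3)) for name, seq, qual in (r1, r2)
        )


if __name__ == "__main__":
    demo = [
        (("a/1", "ACGT", "IIII"), ("a/2", "TTGA", "IIII")),
        (("b/1", "ACNT", "IIII"), ("b/2", "TTGA", "IIII")),  # dropped: N in mate 1
    ]
    for pair in filter_pairs(demo, trim5=1):
        print(pair)
```

For single-end libraries the same `has_no_call` and `trim` helpers could be applied read by read, without the pairing step.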

Tx Assembly

  • We currently use the Velvet + Oases pipeline with several k-mer sizes (25, 29, 33, 37, 41, 45).
  • After the first-round assembly, run a second-round assembly at k-mer 33, using the contigs from each first-round k-mer as input.
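
As a rough sketch of the two rounds, the snippet below only builds the command lines and does not run them. The exact options used for TXGP are not recorded on this page, so the flags follow the general usage described in the Velvet and Oases manuals and should be treated as assumptions; the directory names (`asm_k25`, `asm_k33_merged`, ...) and the input file name are hypothetical.

```python
KMERS = (25, 29, 33, 37, 41, 45)  # first-round k-mer sizes listed above
MERGE_K = 33                      # k-mer size of the second-round (merge) assembly


def first_round_commands(reads, paired, prefix="asm"):
    """Build velveth/velvetg/oases command lines, one assembly per k-mer.

    Assumes FASTQ input; for paired-end data, an interleaved file is assumed.
    """
    read_flag = "-shortPaired" if paired else "-short"
    cmds = []
    for k in KMERS:
        out = f"{prefix}_k{k}"
        cmds.append(["velveth", out, str(k), "-fastq", read_flag, reads])
        cmds.append(["velvetg", out, "-read_trkg", "yes"])
        cmds.append(["oases", out])
    return cmds


def second_round_commands(prefix="asm"):
    """Build the merge assembly that takes every first-round transcripts.fa as input."""
    merged = f"{prefix}_k{MERGE_K}_merged"
    contigs = [f"{prefix}_k{k}/transcripts.fa" for k in KMERS]
    return [
        ["velveth", merged, str(MERGE_K), "-long", *contigs],
        ["velvetg", merged, "-read_trkg", "yes", "-conserveLong", "yes"],
        ["oases", merged, "-merge", "yes"],
    ]


if __name__ == "__main__":
    for cmd in first_round_commands("reads.fq", paired=True) + second_round_commands():
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the options have been checked against the installed Velvet and Oases versions.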

Post-processing with orthology

  • Translate the k33_merged assembled transcripts into peptides with the standard codon table, and take the longest peptide sequence from the 6-frame translation.
  • BLAST the peptides against model organism protein sequences:
    • EnsEMBL-63: HUMAN, MOUSE, DANRE (zebrafish), XENTR (X. tropicalis)
    • XenBase: XENLA (2011-dec version)
  • Filter the BLAST hits, keeping those that meet all of the following conditions:
    • E-value < 0.01
    • Alignment fraction (aligned length / min(len(query_seq), len(target_seq))) > 0.50
    • len(query_seq) / len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
  • Group the sequences by the model organism sequence they hit (putative ortho-group).
  • Do multiple sequence alignment of each ortho-group with MUSCLE.
  • Based on the second-iteration tree generated by MUSCLE, select representative sequences per cluster (up to 6 sequences per group).
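
The translation and hit-filtering steps above might look like the sketch below. It is not the TXGP code: the function names are hypothetical, "longest peptide" is assumed to mean the longest stop-free stretch in any of the six frames (no start codon required), and the three thresholds are taken directly from the list above.

```python
from itertools import product

# Standard codon table, in the usual TCAG ordering of the three codon positions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_peptide(tx):
    """Return the longest stop-free peptide over all six reading frames."""
    tx = tx.upper()
    revcomp = tx.translate(COMPLEMENT)[::-1]
    best = ""
    for strand in (tx, revcomp):
        for frame in range(3):
            for pep in translate(strand[frame:]).split("*"):
                if len(pep) > len(best):
                    best = pep
    return best


def keep_hit(evalue, aligned_len, query_len, target_len):
    """Apply the three hit criteria listed above; lengths are in residues."""
    align_frac = aligned_len / min(query_len, target_len)
    length_ratio = query_len / target_len
    return evalue < 0.01 and align_frac > 0.50 and length_ratio > 0.50


if __name__ == "__main__":
    print(longest_peptide("ATGGCCTAAATG"))
    print(keep_hit(1e-5, aligned_len=80, query_len=100, target_len=120))
```

Hits that pass `keep_hit` can then be grouped by target sequence identifier to form the putative ortho-groups passed to MUSCLE.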

(Further steps are under development.)