Difference between revisions of "XENLA GeneModel2012"

From Marcotte Lab
(Orthologous genes)
 
(16 intermediate revisions by 3 users not shown)
Line 1: Line 1:
= Data summary =
+
'''Check [[Xenopus_Genome_Project#Assembled_transcripts]] for the latest (refined) gene model.''' [[User:TaejoonKwon|TaejoonKwon]] ([[User talk:TaejoonKwon|talk]]) 11:19, 31 January 2014 (CST)
  
 
{|style="border-style: solid; border-width: 1px"
 
{|style="border-style: solid; border-width: 1px"
Line 6: Line 6:
 
* Data users may freely download and analyze sequences posted here.  
 
* Data users may freely download and analyze sequences posted here.  
 
* Data users may use data to analyze their own data, i.e. reference database for MS/MS proteomics data, and/or RNA-seq data.
 
* Data users may use data to analyze their own data, i.e. reference database for MS/MS proteomics data, and/or RNA-seq data.
* The publication and presentation of global analysis of data with these sequences are not allowed until 'data owner' ([http://www.xenbase.org/community/person.do?method=display&personId=756 Dr. Masanori Taira]) published the paper. As soon as the paper is accepted, we will post that info on this website.
+
* Publication or presentation of global analyses based on these sequences is not allowed until the 'data owner' has published the paper. As soon as the paper is accepted, we will post that information on this website. If it is not clear whom you should contact, please contact [mailto:taejoon.kwon@marcottelab.org Dr. Taejoon Kwon].
* If you have more question about this data, please contact [http://www.xenbase.org/community/person.do?method=display&personId=756 Dr. Masanori Taira], [http://www.cm.utexas.edu/edward_marcotte Dr. Edward Marcotte], or [mailto:taejoon.kwon_at_marcottelab_dot_org Dr. Taejoon Kwon].
+
 
</font>
 
</font>
 
|}
 
|}
 +
 +
= WT data =
 +
* http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/
 +
 +
= Taira201203 data =
 +
* Contributed by Masanori Taira (Graduate School of Science, University of Tokyo), Shuji Takahashi (Komaba Organization for Educational Excellence, College of Arts and Sciences, University of Tokyo), Toshiaki Tanaka (Tokyo Institute of Technology), Atsushi Toyoda and Asao Fujiyama (National Institute of Genetics), Yutaka Suzuki (Graduate School of Frontier Sciences, University of Tokyo)
 +
* If you have more questions about this data, please contact [http://www.xenbase.org/community/person.do?method=display&personId=756 Dr. Masanori Taira], [http://www.cm.utexas.edu/edward_marcotte Dr. Edward Marcotte], or [mailto:taejoon.kwon@marcottelab.org Dr. Taejoon Kwon].
  
 
== Taira201203_XENLA_tissue ==
 
== Taira201203_XENLA_tissue ==
Collect total RNA from14 Tissue of ''Xenopus laevis'' J strain.
+
Total RNA was collected from 14 tissues of the ''Xenopus laevis'' J strain.
 
* Brain, eye, heart, intestine, kidney, liver, lung, muscle, ovary, pancreas, skin, spleen, stomach, testis
 
* Brain, eye, heart, intestine, kidney, liver, lung, muscle, ovary, pancreas, skin, spleen, stomach, testis
 
* Sons & daughters of single pair of frogs (Their mother frog was used for 1st BAC-end sequencing)
 
* Sons & daughters of single pair of frogs (Their mother frog was used for 1st BAC-end sequencing)
 
* Standard Illumina sample prep. (poly-A capture)
 
* Standard Illumina sample prep. (poly-A capture)
 
* Illumina HiSeq 2000, 2x100 bp
 
* Illumina HiSeq 2000, 2x100 bp
* 108.5 billions of nucleotide calls in total.
 
 
* 55M ~ 130M reads/tissue (27M ~ 65M pairs)
 
* 55M ~ 130M reads/tissue (27M ~ 65M pairs)
* [http://www.marcottelab.org/users/XenopusData/J/Taira201203_XENLA_stage.2012jul24.pdf Brief report for data processing]
+
 
 +
* Raw sequences (112,045 in total)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_cdna_final.fa
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_pep_final.fa
 +
 
 +
* nr_gene_list (42,890 transcripts)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_pep_final.nr_gene_list
 +
 
 +
* OrthoGeneAll (42,890 sequences)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_cdna_orthoGeneAll.fa
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_pep_orthoGeneAll.fa
 +
 
 +
* OrthoGeneOne (24,762 sequences)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_cdna_orthoGeneOne.fa
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_pep_orthoGeneOne.fa
  
 
== Taira201203_XENLA_stage ==
 
== Taira201203_XENLA_stage ==
Line 27: Line 46:
 
* Standard Illumina sample prep. (poly-A capture)
 
* Standard Illumina sample prep. (poly-A capture)
 
* Illumina HiSeq 2000, 2x100 bp
 
* Illumina HiSeq 2000, 2x100 bp
* 163.8 billions of nucleotide calls in total.
 
 
* 40M ~ 110M reads/tissue (20M ~ 55M pairs)
 
* 40M ~ 110M reads/tissue (20M ~ 55M pairs)
* [http://www.marcottelab.org/users/XenopusData/J/Taira201203_XENLA_tissue.2012jul24.pdf Brief report for data processing]
 
  
= Assembled transcripts =
+
* Raw sequences (78,546 in total)
== Raw sequences ==
+
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_stage_cdna_final.fa
* From tissue samples: [http://www.marcottelab.org/users/XenopusData/J/Taira201203_XENLA_tissue_pep_final.fa Protein FASTA] [http://www.marcottelab.org/users/XenopusData/J/Taira201203_XENLA_tissue_cdna_final.fa cDNA FASTA]
+
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_stage_pep_final.fa
* From stage samples: [http://www.marcottelab.org/users/XenopusData/J/Taira201203_XENLA_stage_pep_final.fa Protein FASTA] [http://www.marcottelab.org/users/XenopusData/J/Taira201203_XENLA_stage_pep_final.fa cDNA FASTA]
+
 
 +
* nr_gene_list (31,833 transcripts in total)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_stage_pep_final.nr_gene_list
 +
 
 +
* OrthoGeneAll (31,833 sequences)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_stage_cdna_orthoGeneAll.fa
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_stage_pep_orthoGeneAll.fa
 +
 
 +
* OrthoGeneOne (18,848 sequences)
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_cdna_orthoGeneOne.fa
 +
** http://daudin.icmb.utexas.edu/pub/annot/UTA.2012/Taira201203_XENLA_tissue_pep_orthoGeneOne.fa
  
= Annotation =
+
== Making nr_gene_list (orthologs to other species) & orthoGene ==
== Orthologous genes ==
+
<b> This process is no longer used in the annotation process. </b> [[User:TaejoonKwon|TaejoonKwon]] ([[User talk:TaejoonKwon|talk]])
We used EnsEMBL-66 as main protein sequences. For ''X. laevis'', we used protein sequences from XenBase (downloaded on Dec-2011).
+
  
* Human:
+
# Take all orthologous candidate genes from the BLASTP results (at most the top 3 hits; see [#] for the details).
[[xdata:/J/blast/Taira201203_XENLA_stage_pep_final.HUMAN_ens66_pep_longest.bp+_summary| Stage_pep --> Human]],[[xdata:/J/blast/HUMAN_ens66_pep_longest.Taira201203_XENLA_stage_pep_final.bp+_summary| Human --> Stage_pep]],  
+
# Going through the species in the order 'XENLA'->'HUMAN'->'XENTR'->'MOUSE'->'DANRE'->'CHICK'->'CAEEL'->'DROME', report the assembled transcript ID when the following conditions are met.
[[xdata:/J/blast/Taira201203_XENLA_tissue_pep_final.HUMAN_ens66_pep_longest.bp+_summary| Tissue_pep --> Human]],[[xdata:/J/blast/HUMAN_ens66_pep_longest.Taira201203_XENLA_tissue_pep_final.bp+_summary| Human --> Tissue_pep]],
+
#* The assembled transcript has orthologous candidates in the given species, both as target (database in the BLAST search) and as query.
 +
#* There is at least one overlap between the query list and the target list. That is, the same gene in the other organism must be identified among the top 3 hits in both directions of the BLAST search.
 +
#* If more than one gene overlaps, report all of them.
 +
#* Once an assembled transcript has a candidate orthologous gene in one species, stop searching and move on to the next assembled transcript. So, if a transcript has an orthologous gene satisfying these criteria in HUMAN, orthologs in the species that come later in the order (e.g. MOUSE, DANRE, CHICK) are not searched. The main reason is to remove the redundancy caused by genes that are highly conserved across all species.
  
== Micriarray ==
+
Based on the 'nr_gene_list' table, we selected transcripts/peptides as a non-redundant sequence set. The 'orthoGeneAll' set contains all sequences reported in the 'nr_gene_list' table, and the 'orthoGeneOne' set contains the longest sequence per orthologous gene group. For example, in the tissue sample set, the following three transcripts are reported as the known ''X. laevis'' rfx2 gene.
* Affymetrix microarray v.1 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL1318)
+
* Affymetrix microarray v.2 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10756)
+
  
= Contributors =
+
<pre>Taira201203_XENLA_tissue_00066978 XENLA rfx2|XB-GENE-991777,rfx6|XB-GENE-6488525
* Masanori Taira (Graduate School of Science, University of Tokyo)
+
Taira201203_XENLA_tissue_00144530 XENLA rfx2|XB-GENE-991777
* Shuji Takahashi (Komaba Organization for Educational Excellence, College of Arts and Sciences, University of Tokyo)
+
Taira201203_XENLA_tissue_00191686 XENLA rfx2|XB-GENE-991777</pre>
* Toshiaki Tanaka (Tokyo Institute of Technology)
+
* Atsushi Toyoda and Asao Fujiyama (National Institute of Genetics)
+
* Yutaka Suzuki (Graduate School of Frontier Sciences, University of Tokyo)
+
  
* Edward M. Marcotte (University of Texas at Austin)
+
In 'orthoGeneAll', all three sequences are reported, whereas in 'orthoGeneOne', Taira201203_XENLA_tissue_00144530 is not reported (it is shorter than Taira201203_XENLA_tissue_00191686). Note that, in this example, we did not pick just one of the three, because Taira201203_XENLA_tissue_00066978 has another candidate gene, rfx6, that is not present in the other two transcripts.
* John B. Wallingford (University of Texas at Austin)
+
* Taejoon Kwon (University of Texas at Austin)
+
* [http://www.tacc.utexas.edu/ Texas Advanced Computing Center (TACC)]
+
  
 
----
 
----
[[Category:Xenopus]]
+
[[Category:XenopusGenome]]

Latest revision as of 12:19, 31 January 2014

Check Xenopus_Genome_Project#Assembled_transcripts for the latest (refined) gene model. TaejoonKwon (talk) 11:19, 31 January 2014 (CST)

Disclaimer

  • Data users may freely download and analyze sequences posted here.
  • Data users may use these sequences to analyze their own data, e.g. as a reference database for MS/MS proteomics and/or RNA-seq data.
  • Publication or presentation of global analyses based on these sequences is not allowed until the 'data owner' has published the paper. As soon as the paper is accepted, we will post that information on this website. If it is not clear whom you should contact, please contact Dr. Taejoon Kwon.

WT data

Taira201203 data

  • Contributed by Masanori Taira (Graduate School of Science, University of Tokyo), Shuji Takahashi (Komaba Organization for Educational Excellence, College of Arts and Sciences, University of Tokyo), Toshiaki Tanaka (Tokyo Institute of Technology), Atsushi Toyoda and Asao Fujiyama (National Institute of Genetics), Yutaka Suzuki (Graduate School of Frontier Sciences, University of Tokyo)
  • If you have more questions about this data, please contact Dr. Masanori Taira, Dr. Edward Marcotte, or Dr. Taejoon Kwon.

Taira201203_XENLA_tissue

Total RNA was collected from 14 tissues of the Xenopus laevis J strain.

  • Brain, eye, heart, intestine, kidney, liver, lung, muscle, ovary, pancreas, skin, spleen, stomach, testis
  • Sons and daughters of a single pair of frogs (their mother was used for the first BAC-end sequencing)
  • Standard Illumina sample prep. (poly-A capture)
  • Illumina HiSeq 2000, 2x100 bp
  • 55M ~ 130M reads/tissue (27M ~ 65M pairs)

Taira201203_XENLA_stage

Total RNA was collected from embryos of the Xenopus laevis J strain at 11 different developmental stages.

  • Stage 01, 08, 09, 10.5, 12, 15, 20, 25, 30, 35, 40
  • Sons and daughters of a single pair of frogs (their mother was used for the first BAC-end sequencing)
  • Standard Illumina sample prep. (poly-A capture)
  • Illumina HiSeq 2000, 2x100 bp
  • 40M ~ 110M reads/stage (20M ~ 55M pairs)

Making nr_gene_list (orthologs to other species) & orthoGene

This process is no longer used in the annotation process. TaejoonKwon (talk)

  1. Take all orthologous candidate genes from the BLASTP results (at most the top 3 hits; see [#] for the details).
  2. Going through the species in the order 'XENLA'->'HUMAN'->'XENTR'->'MOUSE'->'DANRE'->'CHICK'->'CAEEL'->'DROME', report the assembled transcript ID when the following conditions are met.
    • The assembled transcript has orthologous candidates in the given species, both as target (database in the BLAST search) and as query.
    • There is at least one overlap between the query list and the target list. That is, the same gene in the other organism must be identified among the top 3 hits in both directions of the BLAST search.
    • If more than one gene overlaps, report all of them.
    • Once an assembled transcript has a candidate orthologous gene in one species, stop searching and move on to the next assembled transcript. So, if a transcript has an orthologous gene satisfying these criteria in HUMAN, orthologs in the species that come later in the order (e.g. MOUSE, DANRE, CHICK) are not searched. The main reason is to remove the redundancy caused by genes that are highly conserved across all species.
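
The selection rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script that was actually used: the function name pick_orthologs and the two dictionaries forward_hits and reverse_hits are hypothetical, and they stand in for the top-3 hit lists parsed from the BLASTP summaries in each direction. The toy gene and transcript names are made up for the example.

```python
# Species are searched in this fixed priority order (taken from the list above).
SPECIES_ORDER = ["XENLA", "HUMAN", "XENTR", "MOUSE", "DANRE", "CHICK", "CAEEL", "DROME"]


def pick_orthologs(transcript_id, forward_hits, reverse_hits):
    """Return (species, genes) for the first species with a reciprocal hit.

    forward_hits[species][transcript_id] -> top-3 genes hit by the transcript
    reverse_hits[species][gene]          -> top-3 transcripts hit by the gene
    Both dictionaries are hypothetical inputs, not the original file format.
    """
    for species in SPECIES_ORDER:
        candidates = forward_hits.get(species, {}).get(transcript_id, [])
        # Keep only genes that also list this transcript among their own top hits.
        reciprocal = [
            gene
            for gene in candidates
            if transcript_id in reverse_hits.get(species, {}).get(gene, [])
        ]
        if reciprocal:
            # Report every overlapping gene, then stop: later species are skipped.
            return species, reciprocal
    return None, []


# Toy data: transcript "t1" has reciprocal hits in HUMAN, so MOUSE is never examined.
forward_hits = {
    "HUMAN": {"t1": ["RFX2", "RFX6", "RFX3"]},
    "MOUSE": {"t1": ["Rfx2"]},
}
reverse_hits = {
    "HUMAN": {"RFX2": ["t1", "t2"], "RFX6": ["t1"], "RFX3": ["t9"]},
    "MOUSE": {"Rfx2": ["t1"]},
}
print(pick_orthologs("t1", forward_hits, reverse_hits))
```

Returning at the first species with any reciprocal hit is what keeps highly conserved genes from being reported once per species.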

Based on the 'nr_gene_list' table, we selected transcripts/peptides as a non-redundant sequence set. The 'orthoGeneAll' set contains all sequences reported in the 'nr_gene_list' table, and the 'orthoGeneOne' set contains the longest sequence per orthologous gene group. For example, in the tissue sample set, the following three transcripts are reported as the known X. laevis rfx2 gene.

Taira201203_XENLA_tissue_00066978	XENLA	rfx2|XB-GENE-991777,rfx6|XB-GENE-6488525
Taira201203_XENLA_tissue_00144530	XENLA	rfx2|XB-GENE-991777
Taira201203_XENLA_tissue_00191686	XENLA	rfx2|XB-GENE-991777

In 'orthoGeneAll', all three sequences are reported, whereas in 'orthoGeneOne', Taira201203_XENLA_tissue_00144530 is not reported (it is shorter than Taira201203_XENLA_tissue_00191686). Note that, in this example, we did not pick just one of the three, because Taira201203_XENLA_tissue_00066978 has another candidate gene, rfx6, that is not present in the other two transcripts.
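
A minimal Python sketch of this 'orthoGeneOne' selection is given below. It assumes that an orthologous gene group is defined by the exact species and gene-list columns of the 'nr_gene_list' table, which is consistent with the rfx2 example but is not confirmed elsewhere on this page. The function name select_ortho_gene_one is hypothetical, and the sequence lengths are invented for illustration, since the real lengths are not given here.

```python
from collections import defaultdict


def select_ortho_gene_one(nr_gene_lines, seq_lengths):
    """Keep the longest transcript for each distinct (species, gene list) annotation."""
    groups = defaultdict(list)
    for line in nr_gene_lines:
        # Each line has three tab-separated columns: transcript, species, gene list.
        transcript_id, species, genes = line.rstrip("\n").split("\t")
        groups[(species, genes)].append(transcript_id)
    # One representative per group: the transcript with the longest sequence.
    return sorted(max(ids, key=lambda t: seq_lengths[t]) for ids in groups.values())


nr_gene_lines = [
    "Taira201203_XENLA_tissue_00066978\tXENLA\trfx2|XB-GENE-991777,rfx6|XB-GENE-6488525",
    "Taira201203_XENLA_tissue_00144530\tXENLA\trfx2|XB-GENE-991777",
    "Taira201203_XENLA_tissue_00191686\tXENLA\trfx2|XB-GENE-991777",
]
# Hypothetical lengths; only their relative order matters for the example.
seq_lengths = {
    "Taira201203_XENLA_tissue_00066978": 700,
    "Taira201203_XENLA_tissue_00144530": 350,
    "Taira201203_XENLA_tissue_00191686": 720,
}
kept = select_ortho_gene_one(nr_gene_lines, seq_lengths)
print(kept)
```

With these inputs the transcript annotated with both rfx2 and rfx6 forms its own group and is kept, and of the two rfx2-only transcripts only the longer one survives, matching the behaviour described above.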