Difference between revisions of "XENLA WorldCup"

From Marcotte Lab
 
(10 intermediate revisions by one user not shown)
Line 1: Line 1:
 
Information about ''Xenopus laevis'' gene annotation released in July 2014.

Information about ''Xenopus laevis'' gene annotation released in July 2014.
  
= Browser =
+
= Genome Browser =
 
* You can browse and search the final gene models (UTA201407f) and all other sequences mentioned on this page at http://daudin.icmb.utexas.edu/XENLA_JGIv72/

* You can browse and search the final gene models (UTA201407f) and all other sequences mentioned on this page at http://daudin.icmb.utexas.edu/XENLA_JGIv72/
  
Line 9: Line 9:
 
* ChIP-seq data were kindly contributed by the following groups:

* ChIP-seq data were kindly contributed by the following groups:
 
** H3K27ac, H3K4me1, H3K4me3 - [http://web.stanford.edu/group/bakerlab/Welcome.html Rakhi Gupta/Julie Baker lab, Stanford University, USA]
 
** H3K27ac, H3K4me1, H3K4me3 - [http://web.stanford.edu/group/bakerlab/Welcome.html Rakhi Gupta/Julie Baker lab, Stanford University, USA]
** H3K27ac, H3K4me3, H3K4me2, E2f4, E2f4+Mci - [[http://www.salk.edu/faculty/kintner.html Ian Quigley/Chris Kintner lab, Salk Institute, USA]. E2f4 data is published [http://genesdev.cshlp.org/content/28/13/1461 here]
+
** H3K27ac, H3K4me3, H3K4me2, E2f4, E2f4+Mci - [http://www.salk.edu/faculty/kintner.html Ian Quigley/Chris Kintner lab, Salk Institute, USA]. E2f4 data is published [http://genesdev.cshlp.org/content/28/13/1461 here]
 
** H3K4me3 - [http://www.gurdon.cam.ac.uk/research/gurdon Marta Teperek/John Gurdon lab, Cambridge, UK]
 
** H3K4me3 - [http://www.gurdon.cam.ac.uk/research/gurdon Marta Teperek/John Gurdon lab, Cambridge, UK]
** Rfx2 - , published [http://elifesciences.org/content/3/e01439 here]
+
** Rfx2 - [http://www.bio.utexas.edu/faculty/wallingford/ Mei-I Chung/John Wallingford lab, University of Texas at Austin, USA], published [http://elifesciences.org/content/3/e01439 here]
  
 
= Raw materials =
 
= Raw materials =
* ''Xenopus laevis'' Reference sequences - http://daudin.icmb.utexas.edu/xenopus-pub/ref/
+
* ''Xenopus laevis'' Reference sequences - http://daudin.icmb.utexas.edu/xenopus-pub/ref/ (original) http://daudin.icmb.utexas.edu/xenopus-pub/tx/pub.201406 (processed)
 
** mgEST_Xl4jul2012.fa - Michael Gilchrist's assembled transcript (2012 July version)
 
** mgEST_Xl4jul2012.fa - Michael Gilchrist's assembled transcript (2012 July version)
 
** XENLA_XBv5_cdna.fa - XenBase NCBI mRNA sequences (2012 June version)
 
** XENLA_XBv5_cdna.fa - XenBase NCBI mRNA sequences (2012 June version)
Line 22: Line 22:
 
** XENTR_UG52_uniq.fa - ''X. tropicalis'' UniGene (version 52).  
 
** XENTR_UG52_uniq.fa - ''X. tropicalis'' UniGene (version 52).  
 
** XENTR_xb201405_mrna.fa - XenBase NCBI mRNA sequences (2014 May version)
 
** XENTR_xb201405_mrna.fa - XenBase NCBI mRNA sequences (2014 May version)
 +
** XENLA_MK201406_08r2_tx.fa - PHROG gene model from Kirschner lab (published [http://www.cell.com/current-biology/abstract/S0960-9822(14)00609-5 here])
  
 
* Reference species proteome sequence - http://daudin.icmb.utexas.edu/xenopus-pub/ens72/
 
* Reference species proteome sequence - http://daudin.icmb.utexas.edu/xenopus-pub/ens72/
Line 30: Line 31:
 
** HUMAN_ens72_prot_annot_longest.fa - Human
 
** HUMAN_ens72_prot_annot_longest.fa - Human
  
* JGI gene annotation data - http://daudin.icmb.utexas.edu/xenopus-pub/annot/JGI/
+
* JGI gene annotation data - http://daudin.icmb.utexas.edu/xenopus-pub/annot/JGI/ (original) http://daudin.icmb.utexas.edu/xenopus-pub/tx/pub.201406 (processed)
 +
** XENLA_JGIv10p_pub.fa.gz - JGI v1.0. ('XlaevisJGIv1.0.primaryTrs.fa' with a header like 'Xelaev10020581m')
 +
** XENLA_JGIv14pCDS_pub.fa.gz - JGI v1.4 ('XlaevisJGIv1.4.primaryTrs.fa' with a header like 'Xelaev14000002m')
 +
** XENLA_JGIv15_primTx_pub.fa.gz - JGI v1.5 ('XlaevisJGIv1.5.primaryTrs.fa' with a header like 'Xelaev15034936m')
 +
** XENLA_JGIv610pCDS_pub.fa.gz - ??? ('Xenopus_laevisJGIL6RMv1.0.primaryTrs.fa' with a header like XeXenL6RMv10000001m)
 +
** XENLA_JGIv16pa_pub.fa.gz - JGI v1.6. ('Xlaevisv1.6.primaryTrs.fa.gz' with a header like 'Xelaev16061684m')
  
 
* De novo assembled transcripts from RNA-seq - http://daudin.icmb.utexas.edu/xenopus-pub/tx/pub.201406
 
* De novo assembled transcripts from RNA-seq - http://daudin.icmb.utexas.edu/xenopus-pub/tx/pub.201406
Line 46: Line 52:
 
** Ueno201210_XENLA_stage.cdna_pub.fa, Ueno201210_XENLA_tissue.cdna_pub.fa, Ueno201302_XENLA_stage.cdna_pub.fa - Masanori Taira/Naoto Ueno/Shuji Takahashi (genome consortium)
 
** Ueno201210_XENLA_stage.cdna_pub.fa, Ueno201210_XENLA_tissue.cdna_pub.fa, Ueno201302_XENLA_stage.cdna_pub.fa - Masanori Taira/Naoto Ueno/Shuji Takahashi (genome consortium)
  
= Merge =  
+
= First merging =  
 +
 
 +
http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f_raw.fa.gz
 +
 
 
# Map on JGI ver 7.1 genome with GMAP (default setting).
 
# Map on JGI ver 7.1 genome with GMAP (default setting).
 
# Sort all transcripts by the CDS length identified by GMAP (longest to shortest). For transcripts with identical CDS length, sort them by the exon length also identified by GMAP (shortest to longest; when I did this second sort in the opposite direction, it produced so many fused genes that I decided to sacrifice long UTRs instead).

# Sort all transcripts by the CDS length identified by GMAP (longest to shortest). For transcripts with identical CDS length, sort them by the exon length also identified by GMAP (shortest to longest; when I did this second sort in the opposite direction, it produced so many fused genes that I decided to sacrifice long UTRs instead).
Line 54: Line 63:
 
# Translate the non-redundant transcripts in all six possible reading frames, using the standard codon table.

# Translate the non-redundant transcripts in all six possible reading frames, using the standard codon table.

# Search the translations against the reference species proteomes (human, mouse, zebrafish, chicken, ''X. tropicalis''; EnsEMBL ver. 72)

# Search the translations against the reference species proteomes (human, mouse, zebrafish, chicken, ''X. tropicalis''; EnsEMBL ver. 72)
# Determine the translation frame
+
# Determine the translation frame.
 +
#* If only one frame maps to the reference proteomes, take that frame.
 +
#* If multiple frames map to the reference proteomes, take the most frequent frame (based on best hits) or the frame with the longest alignment length.
 +
#* If there is no clear candidate frame, mark the sequence as 'noncoding'.
 +
 
 +
= Gene name assignment based on phylogeny =
 +
 
 +
http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f.names
 +
 
 +
* Mapping reference proteins (EnsEMBL + XenBase) to EggNOG database (v4)
 +
** Use the median-length sequence of each orthogroup defined at opiNOG (orthogroups of Opisthokonts) - http://daudin.icmb.utexas.edu/xenopus-pub/eggnog/eggnogv4_opiNOG_pep.fa.gz
 +
** Report hits with E-value < 1.0
 +
* Mapping query proteins to same database.
 +
* Group queries and references into each NOG.
 +
** If NOG does not have more than two reference sequences, discard it.
 +
** If NOG does not have a query, discard it.
 +
* Do multiple sequence alignment per orthogroup using [http://www.drive5.com/muscle/ muscle], and construct a phylogenetic tree with neighbor-joining (using Kimura distance as a measure of distance).
 +
* Calculate the distance from a query to
 +
** Closest ''X. tropicalis'' gene (either EnsEMBL or XenBase)
 +
** Closest human gene (EnsEMBL)
 +
** Closest reference gene
 +
** Those distances should be less than maximum distance in each orthogroup.
 +
* Clean Xenopus gene names to make them compatible with HGNC names: remove suffixes such as ‘XXX-a/-b’ or ‘YYY.1/2/3’
 +
* If human_name == trop_name, take that name.
 +
* If human_name != trop_name,
 +
** If human_name is ‘NA’, take trop name
 +
** If trop_name is ‘unnamed’, take human name
 +
** If they have different ‘actual’ names, compare their distances from the query and take the name of the closest gene.
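
The orthogroup filter and the distance measure used for the tree can be sketched as below. This is only an illustration, not the script that produced the released files. It assumes that "Kimura distance" means Kimura's protein correction, d = -ln(1 - p - 0.2 p^2), where p is the fraction of differing residues over ungapped alignment columns. The function names <code>kimura_distance</code> and <code>keep_orthogroup</code> are hypothetical.

```python
import math


def kimura_distance(a, b):
    """Kimura protein distance between two aligned sequences of equal length.

    Assumption: columns with a gap ('-') in either sequence are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return math.inf
    p = sum(x != y for x, y in pairs) / len(pairs)
    arg = 1.0 - p - 0.2 * p * p
    # The correction is undefined for very divergent pairs.
    return -math.log(arg) if arg > 0 else math.inf


def keep_orthogroup(n_refs, n_queries):
    """Keep a NOG only if it has more than two references and at least one query."""
    return n_refs > 2 and n_queries >= 1
```

A neighbor-joining tree would then be built from the pairwise distances within each retained orthogroup.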
 +
 
 +
== Gene name assignment on JGI ver 1.6 ==
 +
* http://daudin.icmb.utexas.edu/xenopus-pub/annot/JGIv16/XENLA_JGIv16pa.names
 +
 
 +
== Gene name assignment on de novo assembled ''X. tropicalis'' transcripts ==
 +
* http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup.XT/XENTR_UTA201407c.names
 +
 
 +
= Second merging =
 +
 
 +
* Assign gene name
 +
** http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f_cdna_all.fa.gz
 +
** http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f_prot_all.fa.gz
 +
 
 +
* Re-examine the sequences with the same rules as in the first merging step.
 +
* If sequences overlap and are assigned the same gene name, choose the sequence with the longer CDS.
 +
** http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f_cdna_longest.fa.gz
 +
** http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f_prot_longest.fa.gz
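
A minimal sketch of the second merging rule, under the assumption that "overlapped" means overlapping genomic intervals on the same scaffold and strand (as in the first merging step). The <code>NamedTx</code> record and <code>second_merge</code> function are hypothetical names for illustration; the actual scripts may differ.

```python
from dataclasses import dataclass


@dataclass
class NamedTx:
    tx_id: str
    gene_name: str
    scaffold: str
    strand: str
    start: int
    end: int
    cds_len: int


def second_merge(txs):
    """Among overlapping transcripts with the same gene name, keep the longest CDS."""
    kept = []
    # Visit transcripts from longest to shortest CDS so the first one seen wins.
    for t in sorted(txs, key=lambda t: -t.cds_len):
        clash = any(
            k.gene_name == t.gene_name
            and k.scaffold == t.scaffold
            and k.strand == t.strand
            and k.start <= t.end
            and t.start <= k.end
            for k in kept
        )
        if not clash:
            kept.append(t)
    return kept
```

Transcripts that overlap but carry different gene names are both kept in this sketch.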
  
= Merge
+
----
 +
[[Category:XenopusGenome]]

Latest revision as of 18:49, 31 July 2014

Information about Xenopus laevis gene annotation released in July 2014.


Genome Browser

Raw materials

  • Xenopus laevis Reference sequences - http://daudin.icmb.utexas.edu/xenopus-pub/ref/ (original) http://daudin.icmb.utexas.edu/xenopus-pub/tx/pub.201406 (processed)
    • mgEST_Xl4jul2012.fa - Michael Gilchrist's assembled transcript (2012 July version)
    • XENLA_XBv5_cdna.fa - XenBase NCBI mRNA sequences (2012 June version)
    • XENLA_UG94.fa - X. laevis UniGene (version 94)
    • XENLA_xb201405_mrna.fa - XenBase NCBI mRNA sequences (2014 May version)
    • XGI_022511_TC.fa - John Quackenbush's assembled ESTs (XGI 022511 version)
    • XENTR_UG52_uniq.fa - X. tropicalis UniGene (version 52).
    • XENTR_xb201405_mrna.fa - XenBase NCBI mRNA sequences (2014 May version)
    • XENLA_MK201406_08r2_tx.fa - PHROG gene model from Kirschner lab (published here)
  • Reference species proteome sequence - http://daudin.icmb.utexas.edu/xenopus-pub/ens72/
    • CHICK_ens72_prot_annot_longest.fa - Chicken
    • DANRE_ens72_prot_annot_longest.fa - Zebrafish
    • MOUSE_ens72_prot_annot_longest.fa - Mouse
    • XENTR_ens72_prot_annot_longest.fa - X. tropicalis
    • HUMAN_ens72_prot_annot_longest.fa - Human
  • JGI gene annotation data - http://daudin.icmb.utexas.edu/xenopus-pub/annot/JGI/ (original) http://daudin.icmb.utexas.edu/xenopus-pub/tx/pub.201406 (processed)
    • XENLA_JGIv10p_pub.fa.gz - JGI v1.0. ('XlaevisJGIv1.0.primaryTrs.fa' with a header like 'Xelaev10020581m')
    • XENLA_JGIv14pCDS_pub.fa.gz - JGI v1.4 ('XlaevisJGIv1.4.primaryTrs.fa' with a header like 'Xelaev14000002m')
    • XENLA_JGIv15_primTx_pub.fa.gz - JGI v1.5 ('XlaevisJGIv1.5.primaryTrs.fa' with a header like 'Xelaev15034936m')
    • XENLA_JGIv610pCDS_pub.fa.gz - ??? ('Xenopus_laevisJGIL6RMv1.0.primaryTrs.fa' with a header like XeXenL6RMv10000001m)
    • XENLA_JGIv16pa_pub.fa.gz - JGI v1.6. ('Xlaevisv1.6.primaryTrs.fa.gz' with a header like 'Xelaev16061684m')

First merging

http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f_raw.fa.gz

  1. Map on JGI ver 7.1 genome with GMAP (default setting).
  2. Sort all transcripts by the CDS length identified by GMAP (longest to shortest). For transcripts with identical CDS length, sort them by the exon length also identified by GMAP (shortest to longest; when I did this second sort in the opposite direction, it produced so many fused genes that I decided to sacrifice long UTRs instead).
  3. Choose the longest transcript for each genome scaffold region and direction of transcription.
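
The sorting and selection in steps 2 and 3 can be sketched as follows. This is a hedged illustration only: it assumes that a "region" is defined by overlapping GMAP alignment coordinates on the same scaffold and strand, and that selection is greedy in sorted order. The Mapped record and first_merge function are hypothetical names, not part of GMAP or of the released pipeline.

```python
from dataclasses import dataclass


@dataclass
class Mapped:
    tx_id: str
    scaffold: str
    strand: str
    start: int
    end: int
    cds_len: int   # CDS length reported by the aligner
    exon_len: int  # total exon length reported by the aligner


def overlaps(a, b):
    """True if two alignments share scaffold, strand, and at least one base."""
    return (a.scaffold == b.scaffold and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def first_merge(mapped):
    # Longest CDS first; ties broken by shortest exon length
    # (favoring short UTRs to avoid fused gene models).
    ranked = sorted(mapped, key=lambda m: (-m.cds_len, m.exon_len))
    kept = []
    for m in ranked:
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return kept
```

With this ordering, a transcript with the same CDS length but longer UTRs loses to the more compact one when both cover the same region.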

Translation

  1. Translate the non-redundant transcripts in all six possible reading frames, using the standard codon table.
  2. Search the translations against the reference species proteomes (human, mouse, zebrafish, chicken, X. tropicalis; EnsEMBL ver. 72)
  3. Determine the translation frame.
    • If only one frame maps to the reference proteomes, take that frame.
    • If multiple frames map to the reference proteomes, take the most frequent frame (based on best hits) or the frame with the longest alignment length.
    • If there is no clear candidate frame, assign the sequence as 'noncoding'.
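
The translation and frame-selection steps above can be sketched as below. This is an illustration under stated assumptions, not the original script: frames are labeled +1..+3 and -1..-3, each hit is a (frame, alignment length) pair taken from the best hit against one reference proteome, and the "most frequent or longest" rule is read as "most frequent, with ties broken by alignment length". The names six_frames and choose_frame are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(seq):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for sign, strand in ((1, seq), (-1, revcomp(seq))):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases are translated as 'X'.
            frames[sign * (offset + 1)] = "".join(CODON_TABLE.get(c, "X") for c in codons)
    return frames


def choose_frame(hits):
    """hits: list of (frame, align_len), one best hit per reference proteome."""
    if not hits:
        return "noncoding"
    counts = Counter(frame for frame, _ in hits)
    top = max(counts.values())
    candidates = [f for f, n in counts.items() if n == top]
    if len(candidates) == 1:
        return candidates[0]
    # Tie on frequency: fall back to the longest alignment.
    longest = {f: max(l for g, l in hits if g == f) for f in candidates}
    best = max(longest.values())
    winners = [f for f, l in longest.items() if l == best]
    return winners[0] if len(winners) == 1 else "noncoding"
```

A sequence with no hits, or with frames that tie on both criteria, is labeled 'noncoding' in this sketch.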

Gene name assignment based on phylogeny

http://daudin.icmb.utexas.edu/xenopus-pub/annot/201407_WorldCup/XENLA_UTA201407f.names

  • Mapping reference proteins (EnsEMBL + XenBase) to EggNOG database (v4)
  • Mapping query proteins to same database.
  • Group queries and references into each NOG.
    • If NOG does not have more than two reference sequences, discard it.
    • If NOG does not have a query, discard it.
  • Do multiple sequence alignment per orthogroup using muscle, and construct a phylogenetic tree with neighbor-joining (using Kimura distance as a measure of distance).
  • Calculate the distance from a query to
    • Closest X. tropicalis gene (either EnsEMBL or XenBase)
    • Closest human gene (EnsEMBL)
    • Closest reference gene
    • Those distances should be less than maximum distance in each orthogroup.
  • Clean Xenopus gene names to make them compatible with HGNC names: remove suffixes such as ‘XXX-a/-b’ or ‘YYY.1/2/3’
  • If human_name == trop_name, take that name.
  • If human_name != trop_name,
    • If human_name is ‘NA’, take trop name
    • If trop_name is ‘unnamed’, take human name
    • If they have different ‘actual’ names, compare their distances from the query and take the name of the closest gene.
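
The name-cleaning and name-selection rules above can be sketched as follows. This is a hedged illustration: the regular expressions are one guess at what "remove ‘XXX-a/-b’ or ‘YYY.1/2/3’" means, names are compared as given (no case normalization), and a tie in distance is resolved in favor of the human name, which the rules above do not specify. The functions clean_name and pick_name are hypothetical.

```python
import re


def clean_name(name):
    """Strip Xenopus-specific suffixes so the name can match an HGNC symbol."""
    name = re.sub(r"-[ab]$", "", name)  # 'XXX-a' / 'XXX-b' homeolog suffix
    name = re.sub(r"\.\d+$", "", name)  # 'YYY.1' / 'YYY.2' / 'YYY.3' copy suffix
    return name


def pick_name(human_name, human_dist, trop_name, trop_dist):
    """Choose a gene name from the closest human and X. tropicalis genes."""
    trop_name = clean_name(trop_name)
    if human_name == trop_name:
        return human_name
    if human_name == "NA":
        return trop_name
    if trop_name == "unnamed":
        return human_name
    # Both have 'actual' names that disagree: take the closer gene in the tree.
    return human_name if human_dist <= trop_dist else trop_name
```

For example, a query whose closest X. tropicalis gene is ‘sox2-b’ and whose closest human gene is ‘sox2’ would simply be named ‘sox2’.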

Gene name assignment on JGI ver 1.6

Gene name assignment on de novo assembled X. tropicalis transcripts

Second merging