XENLA WorldCup

From Marcotte Lab
Revision as of 18:16, 31 July 2014 by TaejoonKwon (Talk | contribs)

Information about the Xenopus laevis gene annotation released in July 2014.



Raw materials

  • Xenopus laevis Reference sequences - http://daudin.icmb.utexas.edu/xenopus-pub/ref/
    • mgEST_Xl4jul2012.fa - Michael Gilchrist's assembled transcripts (2012 July version)
    • XENLA_XBv5_cdna.fa - XenBase NCBI mRNA sequences (2012 June version)
    • XENLA_UG94.fa - X. laevis UniGene (version 94)
    • XENLA_xb201405_mrna.fa - XenBase NCBI mRNA sequences (2014 May version)
    • XGI_022511_TC.fa - John Quackenbush's assembled ESTs (XGI 022511 version)
    • XENTR_UG52_uniq.fa - X. tropicalis UniGene (version 52)
    • XENTR_xb201405_mrna.fa - XenBase NCBI mRNA sequences (2014 May version)
  • Reference species proteome sequences - http://daudin.icmb.utexas.edu/xenopus-pub/ens72/
    • CHICK_ens72_prot_annot_longest.fa - Chicken
    • DANRE_ens72_prot_annot_longest.fa - Zebrafish
    • MOUSE_ens72_prot_annot_longest.fa - Mouse
    • XENTR_ens72_prot_annot_longest.fa - X. tropicalis
    • HUMAN_ens72_prot_annot_longest.fa - Human



  1. Map all transcripts to the JGI ver. 7.1 genome with GMAP (default settings).
  2. Sort all transcripts by the CDS length identified by GMAP (longest to shortest). For transcripts with identical CDS length, sort by the exon length also identified by GMAP (shortest to longest; when I did this second sort the opposite way, many fused genes were produced, so I decided to sacrifice long UTRs instead).
  3. Choose the longest transcript per given genome scaffold region and direction of transcription.
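
The sorting and selection steps above can be sketched roughly as follows. This is only an illustration, not the script actually used: the `Hit` record, its field names, and the greedy overlap rule (keep a transcript only if it does not overlap an already kept transcript on the same scaffold and strand) are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record for one GMAP alignment; the field names are illustrative.
Hit = namedtuple("Hit", "name scaffold strand start end cds_len exon_len")

def pick_representatives(hits):
    """Greedy selection: longest CDS first, shorter exon length breaks ties."""
    ordered = sorted(hits, key=lambda h: (-h.cds_len, h.exon_len))
    kept = []
    for h in ordered:
        # Assumed rule: a region is "taken" if a kept transcript overlaps it
        # on the same scaffold and the same strand.
        overlaps = any(
            k.scaffold == h.scaffold and k.strand == h.strand
            and h.start <= k.end and k.start <= h.end
            for k in kept
        )
        if not overlaps:
            kept.append(h)
    return kept

# Toy coordinates, made up for the example.
hits = [
    Hit("tx1", "Scaffold1", "+", 100, 900, 600, 800),
    Hit("tx2", "Scaffold1", "+", 150, 1500, 600, 1300),  # same CDS, longer exons
    Hit("tx3", "Scaffold1", "-", 200, 700, 300, 500),    # opposite strand
    Hit("tx4", "Scaffold1", "+", 2000, 2600, 450, 600),  # separate region
]
selected = [h.name for h in pick_representatives(hits)]
print(selected)
```

In this toy input, tx1 and tx2 have the same CDS length and overlap on the same strand, so the one with the shorter exon length (tx1) is kept, which mirrors the choice to sacrifice long UTRs rather than risk fused genes.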


  1. Translate the non-redundant transcripts in all six possible reading frames, using the standard codon table.
  2. Search the translations against the reference species proteomes (human, mouse, zebrafish, chicken, X. tropicalis; EnsEMBL ver. 72).
  3. Determine the translation frame.
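
As an illustration of the six-frame translation in step 1, here is a minimal sketch in plain Python. The function names (`translate`, `six_frames`) and the frame labels are made up for this example, and ambiguous codons are simply written as 'X', which is an assumption rather than the behaviour of the tool actually used.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate from the first base; codons with non-ACGT bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frames(seq):
    """Return a dict mapping frame labels ('+1'..'+3', '-1'..'-3') to proteins."""
    seq = seq.upper()
    revcomp = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(revcomp[offset:])
    return frames

frames = six_frames("ATGGCCTAA")
for label in sorted(frames):
    print(label, frames[label])
```

The frame whose translation gives the best hit against the reference proteomes would then be taken as the translation frame in step 3; that search is not shown here.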

Gene name assignment based on phylogeny


  • Map reference proteins (EnsEMBL + XenBase) to the EggNOG database (v4).
  • Map query proteins to the same database.
  • Group queries and references by NOG.
    • If a NOG does not have more than two reference sequences, discard it.
    • If a NOG does not have a query, discard it.
  • Do a multiple sequence alignment per orthogroup using MUSCLE, and construct a phylogenetic tree by neighbor-joining (using Kimura distance as the distance measure).
  • Calculate the distance from each query to:
    • Closest X. tropicalis gene (either EnsEMBL or XenBase)
    • Closest human gene (EnsEMBL)
    • Closest reference gene
    • These distances should be less than the maximum distance in each orthogroup.
  • Clean Xenopus gene names to make them compatible with HGNC names: remove suffixes as in ‘XXX-a/-b’ or ‘YYY.1/2/3’.
  • If human_name == trop_name, take that name.
  • If human_name != trop_name,
    • If human_name is ‘NA’, take trop_name.
    • If trop_name is ‘unnamed’, take human_name.
    • If they have different ‘actual’ names, compare their distances from the query and take the name of the closest gene.
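
The name-cleaning and name-selection rules above could be written roughly like this. It is a sketch under several assumptions: the regular expression is only one reading of the ‘-a/-b’ and ‘.1/2/3’ suffix rule, names are compared exactly as given (no case normalization), and a tie in distance goes to the human name. The function names are hypothetical.

```python
import re

def clean_name(name):
    """Strip a trailing '-a'/'-b' or '.<number>' suffix (assumed pattern)."""
    return re.sub(r"(-[ab]|\.\d+)$", "", name)

def choose_name(human_name, trop_name, human_dist, trop_dist):
    """Pick a gene name from the closest human and X. tropicalis genes."""
    trop_name = clean_name(trop_name)
    if human_name == trop_name:
        return human_name
    if human_name == "NA":
        return trop_name
    if trop_name == "unnamed":
        return human_name
    # Both have 'actual' but different names: take the closer gene.
    # Assumption: ties are resolved in favour of the human name.
    return human_name if human_dist <= trop_dist else trop_name

# Toy calls with made-up names and tree distances.
print(choose_name("ABC", "ABC", 0.10, 0.05))
print(choose_name("NA", "xyz.1", 0.90, 0.05))
print(choose_name("ABC", "unnamed", 0.10, 0.05))
print(choose_name("ABC", "DEF", 0.30, 0.10))
```

The distances passed in would be the query-to-gene distances taken from the neighbor-joining tree of the orthogroup.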

Second merging