XENLA Oktoberfest

From Marcotte Lab
Revision as of 16:29, 18 October 2012 by Taejoon

This page describes the integrated gene models of Xenopus laevis released in October 2012. "Oktoberfest" is the name of the dataset. (The beer glass logo is from http://www.webdesignhot.com/free-vector-graphics/lifelike-beer-glasses-and-beer-bubbles-vector-graphic/.)


Input data

  • JGIv6 scaffold
  • Reference cDNA/EST
    • From XenBase (GenBank accession)
    • Mike Gilchrist's EST collection (mgEST*)
    • John Quackenbush's EST collection (TC*)
    • JGI's cDNA collection (XeXen*)
  • Assembled transcripts (14 different sets)

Analysis procedures

  1. Remove JGIv6 scaffolds shorter than 10,000 bp (the remaining scaffolds are called JGIv6_lt10k scaffolds hereafter).
  2. Map cDNA/EST/assembled transcripts to JGIv6_lt10k scaffolds using BLAT.
  3. Set the align ratio cutoff, where the align ratio is defined as '(align_len - mismatches - gap_bases) / query_len', at the value where fewer than 1% of the retained hits are 'second best' hits. The cutoff was roughly 90% for the reference set and 95% for the assembled transcripts.
  4. Cluster the mapped regions (longest stretches). Each cluster is called a 'genomic hit' hereafter.
    • All of these sequences are used for the mapping figure.
  5. Select representative sequences for each genomic hit. Sometimes multiple genes are clustered together, so I selected (1) the longest and the second-longest transcripts, and (2) the third-longest transcript if it is not covered by the first two transcripts AND its length is more than 20% of the longest transcript. As a result, about 2-4 representative cDNA sequences are selected per genomic hit.
  6. Perform 6-frame translation of the representative sequences.
  7. Run BLASTP against known protein sequences of other species (CHICK, MOUSE, HUMAN, XENTR and DANRE from EnsEMBL 66; XENLA_v5 and XENTR_v5 from XenBase Aug. 2012).
  8. Select the top 3 hits by bit score (not E-value).
  9. Remove a representative sequence if it has multiple candidate reading frames.
  10. Perform multiple sequence alignment of the proteins in each genomic hit with MUSCLE. The input consists of all top-3 model organism proteins and the protein sequences translated from the representative cDNA sequences.
    • Alignment results (CLW format) and tree2 info (Newick format) are stored.
  11. Generate an ASCII tree figure from the tree2 output.
  12. Calculate the distances between nodes on the tree. For each representative protein sequence, find the closest model organism protein and fetch its name (I converted gene names so that they contain only capital letters, numbers, and underscores ('_')). For zebrafish, names containing '(n of M)' are converted to the form '_nOFm_'.
  13. Assign this name to the representative sequence.
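
Steps 3, 5, and 12 above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the function names and the (name, start, end) tuple representation are hypothetical, "covered" is read as lying entirely inside one of the first two transcripts, and characters other than letters, digits, and '_' are assumed to be replaced by '_'.

```python
import re


def align_ratio(align_len, mismatches, gap_bases, query_len):
    """Step 3: (align_len - mismatches - gap_bases) / query_len."""
    return (align_len - mismatches - gap_bases) / query_len


def select_representatives(transcripts, min_fraction=0.2):
    """Step 5: pick representative transcripts for one genomic hit.

    `transcripts` is a list of (name, start, end) tuples in scaffold
    coordinates (a hypothetical representation, not the original one).
    """
    # Rank by mapped length, longest first.
    ranked = sorted(transcripts, key=lambda t: t[2] - t[1], reverse=True)
    chosen = ranked[:2]  # (1) the longest and second-longest transcripts
    if len(ranked) >= 3:
        _, start, end = ranked[2]
        longest_len = ranked[0][2] - ranked[0][1]
        # Assumption: "covered" means lying entirely inside one of the first two.
        covered = any(s <= start and end <= e for _, s, e in chosen)
        # (2) keep the third only if uncovered AND longer than 20% of the longest.
        if not covered and (end - start) > min_fraction * longest_len:
            chosen.append(ranked[2])
    return chosen


def normalize_gene_name(name):
    """Step 12: restrict a gene name to capital letters, digits, and '_'."""
    # A zebrafish-style '(n of M)' suffix becomes '_nOFm_'.
    name = re.sub(r"\s*\((\d+)\s+of\s+(\d+)\)", r"_\g<1>OF\g<2>_", name,
                  flags=re.IGNORECASE)
    # Assumption: every other disallowed character is replaced by '_'.
    return re.sub(r"[^A-Z0-9_]", "_", name.upper())
```

For example, align_ratio(950, 20, 30, 1000) gives 0.9, which sits at the approximate cutoff reported for the reference set, and normalize_gene_name("hoxa1 (1 of 2)") gives HOXA1_1OF2_ under the assumptions above.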

Known issues

  • The orientation in the cDNA/EST mapping figure may be incorrect. Protein-coding translation is checked only for the representative sequences (2-4 sequences per genomic hit), so some transcripts that still support the gene structure may be drawn in the opposite direction. This will be fixed in the next release.
  • Some genomic hits do not have tree figures because of an error in Newick Utilities. This will be fixed in the next release.


  • The next release is planned around Thanksgiving 2012. If it is delivered on schedule, it will be called 'Thanksgiving' (of course, it may become 'Christmas' or something else :-)).
    • All known issues will be addressed.
    • The assembled transcripts part (esp. the protein translation step) will be revised.
    • Phylogenetic analysis of duplicated genes (alloalleles) will be added.