XENLA Oktoberfest

From Marcotte Lab
Revision as of 11:01, 18 October 2012 by Taejoon (Talk | contribs)

This page describes the integrated gene models of Xenopus laevis released in October 2012. "Oktoberfest" is the name of this dataset.

Input data

  • JGIv6 scaffold
  • Reference cDNA/EST
    • From XenBase (GenBank accession)
    • Mike Gilchrist's EST collection (mgEST*)
    • John Quackenbush's EST collection (TC*)
    • JGI's cDNA collection (XeXen*)
  • Assembled transcripts

Analysis procedures

  1. Remove JGIv6 scaffolds shorter than 10,000 bp (the resulting set is called the JGIv6_lt10k scaffolds hereafter).
  2. Map cDNA/EST/assembled transcripts to JGIv6_lt10k scaffolds using BLAT.
  3. Set an alignment-ratio cutoff, with the ratio defined as '(align_len-mismatches-gap_bases)/query_len', such that fewer than 1% of the retained hits are 'second best' hits. The cutoff was roughly 90% for the reference set and 95% for the assembled transcripts.
  4. Cluster the mapped regions (longest stretches). Each cluster is called a 'genomic hit' hereafter.
    • All of these sequences are used for the mapping figure.
  5. Select representative sequences for each genomic hit. Because multiple genes are sometimes clustered together, I selected (1) the longest and second-longest transcripts, and (2) the third-longest transcript if it is not covered by the first two transcripts AND it is longer than 20% of the longest transcript.
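
Steps 3 to 5 can be illustrated with a minimal Python sketch. This is not the script used to build the release. The Hit record layout, the function names, and the example values are hypothetical. The sketch also assumes that "clustering" means merging hits whose scaffold intervals overlap, and that "covered" means the third transcript's scaffold interval lies entirely inside the merged intervals of the first two. Transcript length is taken as query_len.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One transcript-to-scaffold alignment (hypothetical record layout)."""
    name: str
    query_len: int   # full length of the transcript
    align_len: int   # length of the aligned part
    mismatches: int
    gap_bases: int
    start: int       # scaffold start, 0-based
    end: int         # scaffold end, exclusive


def align_ratio(hit: Hit) -> float:
    """(align_len - mismatches - gap_bases) / query_len, as defined in step 3."""
    return (hit.align_len - hit.mismatches - hit.gap_bases) / hit.query_len


def cluster_hits(hits: List[Hit]) -> List[List[Hit]]:
    """Group hits with overlapping scaffold intervals (assumed meaning of step 4)."""
    clusters: List[List[Hit]] = []
    cluster_end = -1
    for hit in sorted(hits, key=lambda h: h.start):
        if clusters and hit.start < cluster_end:
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            clusters.append([hit])
            cluster_end = hit.end
    return clusters


def is_covered(hit: Hit, others: List[Hit]) -> bool:
    """True if hit's interval lies inside the merged intervals of others."""
    pos = hit.start
    for other in sorted(others, key=lambda h: h.start):
        if other.start > pos:
            break  # uncovered gap before the next interval
        pos = max(pos, other.end)
    return pos >= hit.end


def select_representatives(cluster: List[Hit], min_fraction: float = 0.2) -> List[Hit]:
    """Longest two transcripts, plus the third if it adds new coverage (step 5)."""
    ranked = sorted(cluster, key=lambda h: h.query_len, reverse=True)
    reps = ranked[:2]
    if len(ranked) >= 3:
        third = ranked[2]
        long_enough = third.query_len > min_fraction * ranked[0].query_len
        if long_enough and not is_covered(third, reps):
            reps.append(third)
    return reps


# Made-up example values, for illustration only.
example_hits = [
    Hit("a", 1000, 990, 5, 5, 0, 3000),
    Hit("b", 800, 790, 10, 0, 2500, 5000),
    Hit("c", 400, 400, 0, 0, 4800, 7000),
    Hit("d", 300, 200, 0, 0, 100, 900),      # low align ratio, filtered out
    Hit("e", 500, 495, 0, 0, 20000, 21000),  # separate genomic hit
]
kept = [h for h in example_hits if align_ratio(h) >= 0.90]  # reference-set cutoff
genomic_hits = cluster_hits(kept)
for members in genomic_hits:
    print([h.name for h in select_representatives(members)])
```

The cutoff of 0.90 is the approximate reference-set value from step 3. For assembled transcripts it would be 0.95.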

Known issues

  • The orientation of sequences in the cDNA/EST mapping figure may be incorrect. Protein-coding translation is checked only for the representative sequences (2-3 sequences per genomic region), so some transcripts that still support the gene structure may be shown in the opposite orientation. This will be fixed in the next release.
  • Some genomic regions do not have tree figures because of an error in Newick Utilities. This will be fixed in the next release.

Plan

  • The next release is planned for around Thanksgiving 2012.