Revision as of 07:15, 19 October 2012

This page describes the integrated gene models of Xenopus laevis released in October 2012. The dataset is named "Oktoberfest".

Result

Search: http://daudlin.icmb.utexas.edu/ or http://xenopus.marcottelab.org (not working yet).
Download sequences: http://daudlin.icmb.utexas.edu/pub/
- All: all representative sequences
- Longest: the longest sequence per genomic hit AND gene name.

Total genomic hits: 28,084
Genomic hits without associated protein sequences: 3,626 (24,458 genomic hits in protein level analysis)
Genomic hits with model organism reference sequences: 24,372 (86 hits dropped)
Genomic hits without gene name: 7,300
Genomic hits with gene name: 20,788
- Total number of 'longest representative' sequences (unique gene name & genomic hit location; some genomic hits have more than one putative gene model): 25,537
- Total number of 'representative' sequences (all): 47,282
Associated gene names: 13,249
- Names with more than four gene models: 1,199

Reference cDNA/EST
- From XenBase (GenBank accession)
- Mike Gilchrist's EST collection (mgEST*)
- John Quackenbush's EST collection (TC*)
- JGI's cDNA collection (XeXen*)
Assembled transcripts (14 different sets, including a large-scale J-strain RNA-seq set)

Remove JGIv6 scaffolds shorter than 10,000 bp (the remaining scaffolds are called JGIv6_lt10k scaffolds hereafter).
Map cDNA/EST/assembled transcripts to JGIv6_lt10k scaffolds using BLAT.
Set the alignment ratio cutoff, with the ratio defined as '(align_len-mismatches-gap_bases)/query_len', at the value where fewer than 1% of the retained hits are 'second best' hits. This was roughly 90% for the reference set and 95% for the assembled transcripts.
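A minimal sketch of this filter, assuming each BLAT hit has already been reduced to the four quantities in the definition. How align_len and gap_bases are derived from the BLAT output is not stated here, so the dictionary keys below are hypothetical, and "second best" is interpreted as any passing hit beyond the first one for the same query.

```python
def align_ratio(align_len, mismatches, gap_bases, query_len):
    """Alignment ratio: (align_len - mismatches - gap_bases) / query_len."""
    if query_len <= 0:
        raise ValueError("query_len must be positive")
    return (align_len - mismatches - gap_bases) / query_len


def second_best_fraction(hits, cutoff):
    """Fraction of hits passing the cutoff that are not the only hit of their query.

    `hits` is a list of dicts with hypothetical keys:
    query, align_len, mismatches, gap_bases, query_len.
    """
    passing_queries = [
        h["query"]
        for h in hits
        if align_ratio(h["align_len"], h["mismatches"],
                       h["gap_bases"], h["query_len"]) >= cutoff
    ]
    if not passing_queries:
        return 0.0
    # Every passing hit beyond one per query counts as a 'second best' hit.
    extra = len(passing_queries) - len(set(passing_queries))
    return extra / len(passing_queries)
```

Scanning candidate cutoffs with second_best_fraction and taking the lowest one that stays under 0.01 would be one way to arrive at values like the 90% and 95% reported above.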
Cluster the mapped regions (longest stretches). Each cluster is called a 'genomic hit' hereafter.
- Use all of these sequences for the mapping figure.
- In total, 41,635 genomic hit candidates were identified from all datasets, but only 67% of them (28,084 hits) are supported by more than one line of evidence (distribution of singletons: XenBase=35, XGI=1806, mgEST=334, JGI=8118, J.oTx=1212, WT.oTx=2046).
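The clustering step can be sketched as interval merging per scaffold. This is an assumption for illustration: the page does not say how regions are clustered, so the overlap rule, the tuple layout, and the definition of a singleton as a cluster with exactly one mapped sequence are all hypothetical.

```python
from collections import defaultdict


def cluster_regions(regions):
    """Merge overlapping mapped regions on each scaffold into clusters.

    `regions` is an iterable of (scaffold, start, end, source) tuples.
    Returns a list of dicts, one per cluster ('genomic hit' candidate).
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end, source in regions:
        by_scaffold[scaffold].append((start, end, source))

    clusters = []
    for scaffold in sorted(by_scaffold):
        current = None
        for start, end, source in sorted(by_scaffold[scaffold]):
            if current is not None and start <= current["end"]:
                # Overlaps the open cluster: extend it.
                current["end"] = max(current["end"], end)
                current["sources"].add(source)
                current["n"] += 1
            else:
                current = {"scaffold": scaffold, "start": start, "end": end,
                           "sources": {source}, "n": 1}
                clusters.append(current)
    return clusters


def drop_singletons(clusters):
    """Keep only clusters supported by more than one mapped sequence."""
    return [c for c in clusters if c["n"] > 1]
```

With the counts reported above, the retained fraction is 28,084 / 41,635, which is about 0.67, and the singleton counts sum to the difference, 13,551.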

Select representative sequences per genomic hit. Because multiple genes are sometimes clustered together, select (1) the longest and the second longest transcripts, and (2) the third longest transcript if it is not covered by the first two AND is longer than 20% of the longest transcript. As a result, about 2-4 representative cDNA sequences are selected per genomic hit.
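The selection rule can be sketched as follows. The record layout is hypothetical, and "not covered" is assumed here to mean that the third transcript's genomic interval is not fully contained in the interval of either of the first two; the page does not define coverage precisely.

```python
def select_representatives(transcripts, min_fraction=0.2):
    """Pick representative transcripts for one genomic hit.

    `transcripts` is a list of dicts with hypothetical keys:
    name, length, start, end (start/end are genomic coordinates).
    """
    ranked = sorted(transcripts, key=lambda t: t["length"], reverse=True)
    selected = ranked[:2]  # the longest and the second longest
    if len(ranked) >= 3:
        third = ranked[2]
        # Assumed coverage test: full containment in one selected interval.
        covered = any(
            s["start"] <= third["start"] and third["end"] <= s["end"]
            for s in selected
        )
        long_enough = third["length"] > min_fraction * ranked[0]["length"]
        if not covered and long_enough:
            selected.append(third)
    return selected
```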
Do 6-frame translation for those representative sequences.
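For illustration, a six-frame translation can be written in plain Python with the standard genetic code. This is a sketch, not the tool used for this release; the frame labels (+1 to +3, -1 to -3) are a naming choice made here.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame label to protein string ('*' marks stops)."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```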
Run BLASTP against known protein sequences of other species (CHICK, MOUSE, HUMAN, XENTR and DANRE from EnsEMBL 66; XENLA_v5 and XENTR_v5 from XenBase Aug. 2012).
Select the top 3 hits by bit score (not E-value).
Remove a representative sequence if it has multiple frame candidates.
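The hit selection and frame check can be sketched as below. The record keys are hypothetical, and "multiple frame candidates" is assumed to mean that the top hits of one representative cDNA come from more than one translated frame.

```python
def top_hits(hits, n=3):
    """Return the n best hits ranked by bit score (not E-value)."""
    return sorted(hits, key=lambda h: h["bit_score"], reverse=True)[:n]


def has_single_frame(hits):
    """True if all given hits come from the same translated frame."""
    return len({h["frame"] for h in hits}) == 1


def filter_representatives(hits_by_cdna, n=3):
    """Keep top hits only for cDNAs whose top hits agree on one frame.

    `hits_by_cdna` maps a cDNA name to a list of dicts with hypothetical
    keys: frame, subject, bit_score.
    """
    kept = {}
    for cdna, hits in hits_by_cdna.items():
        best = top_hits(hits, n)
        if best and has_single_frame(best):
            kept[cdna] = best
    return kept
```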
Do multiple sequence alignment of proteins per genomic hit with MUSCLE, using all top-3 model organism proteins together with the representative protein sequences translated from the representative cDNA sequences.
- Alignment results (CLW format) and tree2 info (Newick format) are stored.
Generate an ASCII tree figure from the tree2 output.
Calculate distances between nodes on the tree. For each representative protein sequence, find the closest model organism protein and fetch its name (all gene names are normalized to capital letters, numbers, and underscores ('_')). For zebrafish, names containing '(n of M)' are converted to '_nOFm_'.
Assign this name to representative sequence.
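A sketch of the naming step, assuming tree distances from one representative protein to each reference protein are already available as a dictionary. The exact normalization rules are not given beyond the description above, so the regular expressions and the treatment of other characters (replaced by '_') are assumptions.

```python
import re

# Matches a suffix such as " (1 of 3)".
N_OF_M = re.compile(r"\s*\((\d+)\s+of\s+(\d+)\)", re.IGNORECASE)


def normalize_gene_name(name):
    """Convert '(n of M)' to '_nOFm_' and restrict to A-Z, 0-9 and '_'."""
    name = N_OF_M.sub(lambda m: f"_{m.group(1)}OF{m.group(2)}_", name)
    # Assumption: any remaining disallowed character becomes an underscore.
    return re.sub(r"[^A-Z0-9_]", "_", name.upper())


def closest_reference(distances):
    """Return the reference protein with the smallest tree distance.

    `distances` maps a reference protein identifier to its distance from
    one representative protein (hypothetical input layout).
    """
    return min(distances, key=distances.get)
```

For example, normalize_gene_name applied to a name with a "(1 of 3)" suffix yields a name ending in "_1OF3_".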

The orientation of transcripts in the cDNA/EST mapping figure may be incorrect. Protein-coding translation is checked only for representative sequences (2-4 sequences per genomic hit), so some transcripts that support the gene structure may be drawn in the opposite orientation. This will be fixed in the next release.
Some genomic hits do not have tree figures because of a Newick Utilities error. This will be fixed in the next release.

@@ Line 24: / Line 24: @@
 = Input data =
-* JGIv6 scaffold (From Danial Rokhsar & Richard Harland, UC Berkeley)
+* JGIv6 scaffold (From Daniel Rokhsar & Richard Harland, UC Berkeley)
 * Reference cDNA/EST