Difference between revisions of "XENLA Oktoberfest"
From Marcotte Lab
Revision as of 07:15, 19 October 2012
- Search: http://daudlin.icmb.utexas.edu/ or http://xenopus.marcottelab.org (not working yet).
- Download sequences: http://daudlin.icmb.utexas.edu/pub/
- All: all representative sequences
- Longest: longest sequences per genomic hit AND name.
- Total genomic hits: 28,084
- Genomic hits without associated protein sequences: 3,626 (24,458 genomic hits in protein level analysis)
- Genomic hits with model organism reference sequences: 24,372 (86 hits dropped)
- Genomic hits without gene name: 7,300
- Genomic hits with gene name: 20,788
- Total number of 'longest representative' sequences (unique gene name & genomic hit location; some genomic hits have more than one putative gene model): 25,537
- Total number of 'representative' sequences (all): 47,282
- Associated gene names: 13,249
- Names with one gene model: 1,365
- Names with two gene models: 4,740
- Names with three gene models: 1,835
- Names with four gene models: 4,110
- Names with more than four gene models: 1,199
- JGIv6 scaffold (From Daniel Rokhsar & Richard Harland, UC Berkeley)
- Reference cDNA/EST
- From XenBase (GenBank accession)
- Mike Gilchrist's EST collection (mgEST*)
- John Quackenbush's EST collection (TC*)
- JGI's cDNA collection (XeXen*)
- Assembled transcripts (14 different sets, including a large-scale J-strain RNA-seq set)
- Remove JGIv6 scaffolds shorter than 10,000 bp (called JGIv6_lt10k scaffolds afterward).
- Map cDNA/EST/assembled transcripts to JGIv6_lt10k scaffolds using BLAT.
- Set an alignment ratio cutoff (alignment ratio defined as '(align_len-mismatches-gap_bases)/query_len') at which fewer than 1% of the retained hits are 'second best' hits. The cutoff was roughly 90% for the reference set and 95% for assembled transcripts.
- Cluster mapped regions (longest stretches). Each cluster is called a 'genomic hit' hereafter.
- Use all of these sequences for the mapping figure.
- In total, 41,635 genomic hit candidates were identified across all datasets, but only 67% of them (28,084 hits) are supported by multiple lines of evidence (distribution of singletons: XenBase=35, XGI=1806, mgEST=334, JGI=8118, J.oTx=1212, WT.oTx=2046).
- Select representative sequences per genomic hit. Because multiple genes are sometimes clustered together, I selected (1) the longest and the second longest transcripts, and (2) the third longest transcript if it is not covered by the first two transcripts AND it is longer than 20% of the longest transcript. As a result, about 2-4 representative cDNA sequences are selected per genomic hit.
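The alignment ratio and the representative selection rule above can be sketched in Python as follows. This is only an illustration, not the pipeline's actual code: the function names, the `(name, start, end)` tuple layout, and the reading of "covered" as "lies entirely within the union of the first two transcripts' mapped intervals" are all assumptions.

```python
def align_ratio(align_len, mismatches, gap_bases, query_len):
    """Alignment ratio as defined above: (align_len-mismatches-gap_bases)/query_len."""
    return (align_len - mismatches - gap_bases) / query_len


def is_covered(region, others):
    """True if the half-open interval `region` lies inside the union of `others`.

    Assumption: this is one possible reading of "covered by the first two transcripts".
    """
    start, end = region
    pos = start
    for s, e in sorted(others):
        if s > pos:
            break  # gap before the next interval, so coverage stops here
        pos = max(pos, e)
    return pos >= end


def select_representatives(transcripts, min_fraction=0.2):
    """Pick the two longest transcripts, plus the third under the stated rule.

    `transcripts` is a list of (name, start, end) tuples mapped to one
    genomic hit (hypothetical layout).
    """
    ranked = sorted(transcripts, key=lambda t: t[2] - t[1], reverse=True)
    reps = ranked[:2]
    if len(ranked) >= 3:
        third = ranked[2]
        longest_len = ranked[0][2] - ranked[0][1]
        third_len = third[2] - third[1]
        first_two = [(s, e) for _, s, e in reps]
        if (not is_covered((third[1], third[2]), first_two)
                and third_len > min_fraction * longest_len):
            reps.append(third)
    return [name for name, _, _ in reps]


if __name__ == "__main__":
    hit = [("t1", 0, 1000), ("t2", 100, 900), ("t3", 950, 1300), ("t4", 0, 50)]
    print(select_representatives(hit))
    print(align_ratio(950, 20, 30, 1000))
```

In this toy example, `t3` extends past the end of `t1` and is longer than 20% of it, so it is kept as a third representative.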
- Do 6-frame translation of those representative sequences.
- Run BLASTP against known protein sequences of other species (CHICK, MOUSE, HUMAN, XENTR and DANRE from EnsEMBL 66; XENLA_v5 and XENTR_v5 from XenBase Aug. 2012).
- Select the top 3 hits by bit score (not E-value).
- Remove a representative sequence if it has multiple frame candidates.
- Do multiple sequence alignment of proteins per genomic hit with MUSCLE, using all top-3 model organism proteins and the representative protein sequences translated from representative cDNA sequences.
- Alignment results (CLW format) and tree2 info (Newick format) are stored.
- Generate an ASCII tree figure from the tree2 output.
- Calculate distances between nodes on the tree. For each representative protein sequence, find the closest model organism protein and fetch its name (I converted all gene names to uppercase letters, numbers, and underscores ('_')). For zebrafish, names containing '(n of M)' are converted to '_nOFm_'.
- Assign this name to the representative sequence.
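The top-3 hit selection and the gene name conversion described above could look roughly like this in Python. This is a minimal sketch under stated assumptions: the function names and the `(subject_id, bit_score)` tuple layout are hypothetical, and replacing every disallowed character with an underscore (rather than dropping it) is a guess, since the text only states which characters remain.

```python
import re

# Matches Ensembl-style zebrafish suffixes such as " (1 of 2)".
_N_OF_M = re.compile(r"\s*\((\d+)\s+of\s+(\d+)\)\s*", re.IGNORECASE)


def top_hits(hits, n=3):
    """Return the n best hits ranked by bit score (not E-value).

    `hits` is a list of (subject_id, bit_score) tuples (hypothetical layout).
    """
    return sorted(hits, key=lambda h: h[1], reverse=True)[:n]


def normalize_gene_name(name):
    """Convert a gene name to uppercase letters, digits, and underscores."""
    # '(n of M)' becomes '_nOFm_', following the pattern given in the text.
    name = _N_OF_M.sub(lambda m: f"_{m.group(1)}OF{m.group(2)}_", name)
    name = name.upper()
    # Assumption: any remaining disallowed character is replaced by '_'.
    return re.sub(r"[^A-Z0-9_]", "_", name)


if __name__ == "__main__":
    print(normalize_gene_name("sox9 (1 of 2)"))
    print(top_hits([("a", 50.1), ("b", 210.0), ("c", 99.5), ("d", 180.2)]))
```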
- The orientation in the cDNA/EST mapping figure may be incorrect. Protein-coding translation is only checked for representative sequences (2-4 sequences per genomic hit), so some transcripts that still support the gene structure may be shown in the opposite orientation. This will be fixed in the next release.
- Some genomic hits do not have tree figures because of a Newick Utilities error. This will be fixed in the next release.
- The next release is planned for around Thanksgiving 2012. If it is delivered on schedule, it will be called 'Thanksgiving' (of course, it could become 'Christmas' or something else :-)).
- All known issues will be addressed.
- The assembled transcripts part (esp. the protein translation step) will be revised.
- Phylogenetic analysis of duplicated genes (alloalleles) will be added.
- Synteny structure of duplicated genome hits.
- Exon-intron boundaries for Morpholino design.
- Scaffold coordinate information (i.e. GTF or GFF3 file). This will be released soon, before the next release.