Difference between revisions of "Texas Xenopus Genome Project/Species Identification"

From Marcotte Lab
(Selection procedure)
Line 8: Line 8:
 
** [[:xdata:ID/XENTR_mRNA.xenbase20091127.fasta.gz]] 17 MB, gzipped.
 
  
* Download CHORI-216 sequences (from XenBase) and CHORI-219 sequences (from NCBI GenBank).  
+
* Download CHORI-219 sequences (from NCBI GenBank).  
** [[:xdata:ID/XENTR_CH216.fasta.gz]] 1.2 MB, gzipped. (CHORI-216 sequences. 160 BAC sequences from ''X. tropicalis'' genome)
+
 
** [[:xdata:ID/XENLA_CH219.fasta.gz]] 6.5 MB, gzipped. (CHORI-219 sequences. 29 BAC sequences from ''X. laevis'' genome)
  
 
* Run BLAT (version 3.4, with default options) against the known CHORI BAC sequences.
 
** [[:xdata:ID/XENTR_mRNA.XENLA_CH219.blat_pslx.gz]] 1.2 MB, gzipped.  
 
** [[:xdata:XENTR_mRNA.XENTR_CH216.blat_pslx.gz]] 20 MB, gzipped.
+
:<pre> blat XENLA_CH219.fasta XENTR_mRNA.xenbase20091127.fasta XENTR_mRNA.XENLA_CH219.blat_pslx -out=pslx</pre>
:<pre> blat XENTR_CH216.fasta XENTR_mRNA.xenbase20091127.fasta XENTR_mRNA.XENTR_CH216.blat_pslx -out=pslx</pre>
+
  
 
* Parse two BLAT output files with the following criteria.  
 
 
*# From ''X. tropicalis'' mRNA, only RefSeq sequences (accessions starting with 'NM_') are considered.
*# Select ''X. tropicalis'' mRNA sequences that hit both CHORI-219 and CHORI-216 (a minimum match length of 200 bp is required to call a 'hit'). For CHORI-219 hits, only the 10 BACs already known to be available are considered ('74I8','204L9','197E3','71P23','36I4','35I18','262A22','20I13','206K7','166K18').
+
*# Select ''X. tropicalis'' mRNA sequences that hit CHORI-219 (a minimum match length of 200 bp is required to call a 'hit'). Only the 10 CHORI-219 BACs already known to be available are considered ('74I8','204L9','197E3','71P23','36I4','35I18','262A22','20I13','206K7','166K18').
*# Survey each hit block. If the same mRNA fragment hits both CHORI-219 and CHORI-216, report three sequences: the query sequence from ''X. tropicalis'' mRNA, the target sequence from CHORI-219 BACs (''X. laevis'') and the target sequence from CHORI-216 BACs (''X. tropicalis''). One hit block is reported.
+
*# Survey each hit block. If a hit block is shorter than 200 bp, discard it. 42 hit blocks from 8 mRNAs are selected.
<pre>
+
*#* NM_001004837 Unnamed, predicted gene MGC69309 [http://www.ncbi.nlm.nih.gov/nuccore/52345577|NCBI][http://www.xenbase.org/gene/showgene.do?method=displayGeneSummary&geneId=5903347|XenBase]
>XENTR_NM_001142220_0 gi|213983084|ref|NM_001142220|
+
*#* NM_001007499 paired-like homeodomain 1 (pitx-1) [http://www.ncbi.nlm.nih.gov/nuccore/55926079|NCBI][http://www.xenbase.org/gene/showgene.do?method=displayGeneSummary&geneId=485440|XenBase]
ttatttgtgccctgggtacccctggaactatagcggggtgactgttaccccaatgtttctatatatct
+
*#* NM_001011405 Homeobox A5 (hoxa5) [http://www.ncbi.nlm.nih.gov/nuccore/58332665|NCBI][http://www.xenbase.org/gene/showgene.do?method=displayGeneSummary&geneId=486060|XenBase]
gtaaccttgttatgggctaagggggcccagcctgaaggccagttagggggggatttggggtgagtgc
+
*#* NM_001035121
ttatttgtgccctgggtacccctggaactatagcagggtgactgttaccccaatgtttctatatatct
+
*#* NM_001113032
gtaaccttgttatgggctaagggggcccagcctgaaggccagttagggggggatttggggtgagtgc
+
*#* NM_001127429
ttatttgtgccctgggtacccctggaactatagcagggtgac
+
*#* NM_001129937
>XENLA_CH219-20I13_0
+
*#* NM_001142220
ttatttgtgccctggatacccctggaactatagcagggtgactgttaccccaatgtttctatatatct
+
 
gtaaccttgttattagctaagggggcccagtctgaaggtcagttagggggagatttggggtgagggc
+
 
ttatttgtaccctgggtacccctggaactatagcagggtgactgttaccccaatgtttctatatatct
+
gtaaccttgttatgagctaagggggcccagtctgaaggccagttagggggagatatggggtgagtgt
+
ttatttgtgccctggttacccctggaactatagcagggtgac
+
>XENTR_CH216-2E23_0
+
tcaccccaaatccccccctaactggccttcaggctgggcccccttagctcataacaaggttacagatatatagaaacattggggtaacagtca
+
ccccgctatagttccaggggtacccagggcacaaataagcactcaccccaaatcatcccctaactggccttcaggctgggcccccttagccca
+
taacaaggttacagatatatagaaacattggggtaacagtcaccccgctatagttccaggggtacccagggcacaaataagcactcaccccaa
+
atc
+
  
</pre>
 
  
 
* Run MUSCLE (version 4.0, with default options) for multiple sequence alignment. Interestingly, the sequence from CHORI-216 differs from both the XENTR_mRNA and CHORI-219 fragments.

Revision as of 12:09, 9 December 2009

Target gene

Selection procedure

  • Download CHORI-219 sequences (from NCBI GenBank).
 blat XENLA_CH219.fasta XENTR_mRNA.xenbase20091127.fasta XENTR_mRNA.XENLA_CH219.blat_pslx -out=pslx
  • Parse two BLAT output files with the following criteria.
    1. From X. tropicalis mRNA, only RefSeq sequences (accessions starting with 'NM_') are considered.
    2. Select X. tropicalis mRNA sequences that hit CHORI-219 (a minimum match length of 200 bp is required to call a 'hit'). Only the 10 CHORI-219 BACs already known to be available are considered ('74I8','204L9','197E3','71P23','36I4','35I18','262A22','20I13','206K7','166K18').
    3. Survey each hit block. If a hit block is shorter than 200 bp, discard it. 42 hit blocks from 8 mRNAs are selected.
      • NM_001004837 Unnamed, predicted gene MGC69309 [1][2]
      • NM_001007499 paired-like homeodomain 1 (pitx-1) [3][4]
      • NM_001011405 Homeobox A5 (hoxa5) [5][6]
      • NM_001035121
      • NM_001113032
      • NM_001127429
      • NM_001129937
      • NM_001142220
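
The selection criteria above can be sketched as a small filter over BLAT's tab-separated PSL/pslx rows. This is only an illustration, not the script used for this page. It assumes that "match length" means the PSL `matches` column, that the RefSeq accession can be recovered from the query name with the pattern `NM_\d+`, and that the BAC ID (for example `20I13`) appears as a separate token in the target name. The names `bac_id` and `select_hit_blocks` are hypothetical.

```python
import re

# The 10 available CHORI-219 BACs listed in the criteria above.
AVAILABLE_BACS = {"74I8", "204L9", "197E3", "71P23", "36I4",
                  "35I18", "262A22", "20I13", "206K7", "166K18"}
MIN_LEN = 200  # minimum match length and minimum block size, in bp


def bac_id(target_name):
    """Return the available BAC ID found in a target name, or None.

    Assumption: the BAC ID is a separate alphanumeric token of the name.
    """
    for token in re.split(r"[^0-9A-Za-z]+", target_name):
        if token in AVAILABLE_BACS:
            return token
    return None


def select_hit_blocks(psl_lines):
    """Yield (accession, bac, block_size) for blocks passing all criteria."""
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the PSL header and anything that is not an alignment row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches, q_name, t_name = int(fields[0]), fields[9], fields[13]
        accession = re.search(r"NM_\d+", q_name)   # criterion 1: RefSeq only
        bac = bac_id(t_name)                       # criterion 2: available BACs
        if accession is None or bac is None or matches < MIN_LEN:
            continue
        # Criterion 3: keep only blocks of at least MIN_LEN bp.
        for size in fields[18].rstrip(",").split(","):
            if int(size) >= MIN_LEN:
                yield accession.group(0), bac, int(size)
```

Usage would look like `with open("XENTR_mRNA.XENLA_CH219.blat_pslx") as fh: blocks = list(select_hit_blocks(fh))`. If the real FASTA headers are formatted differently, the two name-parsing assumptions are the parts to adjust.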



  • Run MUSCLE (version 4.0, with default options) for multiple sequence alignment. Interestingly, the sequence from CHORI-216 differs from both the XENTR_mRNA and CHORI-219 fragments.
 $ mus4 -i XENTR_CHORI.fasta -o XENTR_CHORI.muscle 
XENLA_CH219-20I1   1 + ttattt----------------------gtgccctggatacccctggaactatagcagggtgac 42
XENTR_NM_0011422   1 + ttattt----------------------gtgccctgggtacccctggaactatagcggggtgac 42
XENTR_CH216-2E23   1 + tcaccccaaatccccccctaactggccttcaggctgggcccccttag-ctcataacaaggttac 63
                       *.*...                      .....****...***.*.**...***.*..***.**

XENLA_CH219-20I1  43 + tgttaccccaatgtttctatatatctgtaaccttgttattagct-aagggggcccagtctgaag 105
XENTR_NM_0011422  43 + tgttaccccaatgtttctatatatctgtaaccttgttatgggct-aagggggcccagcctgaag 105
XENTR_CH216-2E23  64 + agatatatagaaacattggggtaacagtcaccccgctatagttccaggggtacccagggc---- 123
                       .*.**.....*....*.....**.*.**.***..*.*** .... *.***..***** ..****

XENLA_CH219-20I1 106 + gtcagttagggggagatttggggtgagggcttatttg-----taccctgggtacccctggaact 164
XENTR_NM_0011422 106 + gccagttagggggggatttggggtgagtgcttatttg-----tgccctgggtacccctggaact 164
XENTR_CH216-2E23 124 + -acaaataagcactcaccccaaatcatcccctaactggccttcaggctgggcccc-cttagccc 185
                       * **..**.*... .*.......*.*. .*.**..**     ....*****..*****....*.

XENLA_CH219-20I1 165 + atagcagggtgactgttaccccaatgtttctatatatctgtaaccttgttatgagctaa-gggg 227
XENTR_NM_0011422 165 + atagcagggtgactgttaccccaatgtttctatatatctgtaaccttgttatgggctaa-gggg 227
XENTR_CH216-2E23 186 + ataacaaggttacagatatatagaaacattggggtaacagtcaccccgctatagttccaggggt 249
                       ***.**.***.**.*.**.....*....*.....**.*.**.***..*.***......* ***.

XENLA_CH219-20I1 228 + gcccagtctgaaggccagttagggggagatatggggtgagtgtttatttgtgccctggttaccc 291
XENTR_NM_0011422 228 + gcccagcctgaaggccagttagggggggatttggggtgagtgcttatttgtgccctgggtaccc 291
XENTR_CH216-2E23 250 + acccagggca---------------caaataagcact----------------------caccc 276
                       .***** ...***************...**..*...****** *************** .****

XENLA_CH219-20I1 292 + ctggaactatagcagggtgac 312(341)
XENTR_NM_0011422 292 + ctggaactatagcagggtgac 312(341)
XENTR_CH216-2E23 277 + c---------------aaatc 282(341)
                       ****************....*
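
When a target fragment looks unlike the query in a multiple alignment, one thing worth checking is strand orientation, since BLAT can report hits on the minus strand. The sketch below takes the first 68 bases of XENTR_NM_001142220_0 and the first two lines of XENTR_CH216-2E23_0 from the sequence listing of the earlier revision and tests whether the reverse complement of the mRNA piece occurs in the CHORI-216 piece. The helper name `reverse_complement` is illustrative only.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (a, c, g, t only)."""
    complement = str.maketrans("acgtACGT", "tgcaTGCA")
    return seq.translate(complement)[::-1]


# First line of the XENTR_NM_001142220_0 fragment.
mrna_piece = ("ttatttgtgccctgggtacccctggaactatagcggggtgac"
              "tgttaccccaatgtttctatatatct")

# First two lines of the XENTR_CH216-2E23_0 fragment, joined.
ch216_piece = (
    "tcaccccaaatccccccctaactggccttcaggctgggcccccttagctcataacaaggttac"
    "agatatatagaaacattggggtaacagtca"
    "ccccgctatagttccaggggtacccagggcacaaataagcactcaccccaaatcatcccctaa"
    "ctggccttcaggctgggcccccttagccca"
)

# True if the mRNA piece matches the CHORI-216 piece on the opposite strand.
opposite_strand = reverse_complement(mrna_piece) in ch216_piece
print(opposite_strand)
```

For these two pieces the check is true, which suggests that at least part of the apparent difference comes from the CHORI-216 fragment being reported on the opposite strand; reverse-complementing it before running MUSCLE would be a reasonable next step.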
  • Run MUSCLE again with only the XENTR_mRNA and CHORI-219 sequences.
XENLA_CH219-20I1   1 + ttatttgtgccctggatacccctggaactatagcagggtgactgttaccccaatgtttctatat 64
XENTR_NM_0011422   1 + ttatttgtgccctgggtacccctggaactatagcggggtgactgttaccccaatgtttctatat 64
                       *************** ****************** *****************************

XENLA_CH219-20I1  65 + atctgtaaccttgttattagctaagggggcccagtctgaaggtcagttagggggagatttgggg 128
XENTR_NM_0011422  65 + atctgtaaccttgttatgggctaagggggcccagcctgaaggccagttagggggggatttgggg 128
                       *****************  *************** ******* *********** *********

XENLA_CH219-20I1 129 + tgagggcttatttgtaccctgggtacccctggaactatagcagggtgactgttaccccaatgtt 192
XENTR_NM_0011422 129 + tgagtgcttatttgtgccctgggtacccctggaactatagcagggtgactgttaccccaatgtt 192
                       **** ********** ************************************************

XENLA_CH219-20I1 193 + tctatatatctgtaaccttgttatgagctaagggggcccagtctgaaggccagttagggggaga 256
XENTR_NM_0011422 193 + tctatatatctgtaaccttgttatgggctaagggggcccagcctgaaggccagttaggggggga 256
                       ************************* *************** ******************* **

XENLA_CH219-20I1 257 + tatggggtgagtgtttatttgtgccctggttacccctggaactatagcagggtgac 312(1)
XENTR_NM_0011422 257 + tttggggtgagtgcttatttgtgccctgggtacccctggaactatagcagggtgac 312(1)
                       * *********** *************** **************************
  • However, the sequences turn out to be highly repetitive (a repeat unit of ~135 bp). Compare the 1st, 3rd and 5th lines (or the 2nd and 4th lines) of each sequence.
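
The repeat unit can also be estimated without reading the lines by eye, by sliding the fragment against itself and recording the shift with the highest identity. The sketch below does this for the 312 bp XENTR_NM_001142220_0 fragment; the function name `best_period` and the `min_shift` cutoff are illustrative choices, not part of the original procedure.

```python
def best_period(seq, min_shift=20):
    """Return (shift, identity) of the best self-overlap of seq.

    Shifts from min_shift up to half the sequence length are tried, so that
    every comparison covers at least half of the sequence.
    """
    best_shift, best_identity = 0, 0.0
    for shift in range(min_shift, len(seq) // 2 + 1):
        overlap = len(seq) - shift
        same = sum(a == b for a, b in zip(seq, seq[shift:]))
        identity = same / overlap
        if identity > best_identity:
            best_shift, best_identity = shift, identity
    return best_shift, best_identity


# The five lines of the XENTR_NM_001142220_0 fragment, joined.
xentr_fragment = (
    "ttatttgtgccctgggtacccctggaactatagcggggtgactgttaccccaatgtttctatatatct"
    "gtaaccttgttatgggctaagggggcccagcctgaaggccagttagggggggatttggggtgagtgc"
    "ttatttgtgccctgggtacccctggaactatagcagggtgactgttaccccaatgtttctatatatct"
    "gtaaccttgttatgggctaagggggcccagcctgaaggccagttagggggggatttggggtgagtgc"
    "ttatttgtgccctgggtacccctggaactatagcagggtgac"
)

period, identity = best_period(xentr_fragment)
print(period, round(identity, 3))
```

On this fragment the best shift is 135 bp, in line with the ~135 bp unit noted above.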