Difference between revisions of "Xenopus reference"

From Marcotte Lab
m (moved TXGP Xenbase Data to TXGP reference: Better name after having Gilchrist's EST data)

Revision as of 10:58, 12 October 2011

All of these files are derived from XenBase data downloaded on May 1, 2011.


Version 2. Non-redundant cDNA

  1. Cluster the ref.v1 sequences with usearch (version 4.2.66) at >80% identity to remove redundancy. This yields 8,164 non-redundant cDNA sequences (XENLA_cDNA_ref.v1.uc080_fasta).
  2. From the xlaevisMRNA.fasta file, collect all sequences that were not included in ref.v1. These are mainly sequences that are (1) annotated as RefSeq but have no designated gene name, or (2) not annotated as RefSeq (sequences submitted to GenBank).
  3. Use the non-redundant ref.v1 sequences from step 1 as the database and run usearch in database search + clustering mode. Each sequence from step 2 is then either (1) clustered with a ref.v1 sequence if their identity is greater than 80%, or (2) placed in a new, independent cluster if no ref.v1 sequence matches it at >80% identity.

XENLA_cDNA_ref.v2.fasta (9,943 sequences)
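
Steps 2 and 3 above can be illustrated with a small Python sketch. This is not the usearch algorithm: `toy_identity` is a stand-in similarity based on `difflib`, not usearch's alignment-based %identity, and the function names (`leftover_sequences`, `assign_to_clusters`) are hypothetical. It only shows the logic of "join an existing ref.v1 cluster above the threshold, otherwise start a new cluster".

```python
from difflib import SequenceMatcher


def leftover_sequences(all_seqs, ref_ids):
    """Step 2: keep the sequences whose id is not already in ref.v1."""
    return {sid: seq for sid, seq in all_seqs.items() if sid not in ref_ids}


def toy_identity(a, b):
    """Stand-in similarity in [0, 1]; NOT usearch's alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def assign_to_clusters(seeds, queries, threshold=0.80):
    """Step 3: attach each query to its best-matching seed if the score
    exceeds the threshold; otherwise the query becomes a new seed."""
    seeds = dict(seeds)  # copy so the caller's dict is not modified
    members = {sid: [] for sid in seeds}
    for qid, qseq in queries.items():
        best_id, best_score = None, 0.0
        for sid, sseq in seeds.items():
            score = toy_identity(qseq, sseq)
            if score > best_score:
                best_id, best_score = sid, score
        if best_id is not None and best_score > threshold:
            members[best_id].append(qid)
        else:
            # No seed is similar enough: start an independent cluster.
            seeds[qid] = qseq
            members[qid] = []
    return members
```

With this sketch, the number of keys in the returned dictionary corresponds to the number of clusters, which is the kind of count reported for the v2 file.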

Version 1. RefSeq of cDNA & protein

Data processing method

  1. Download 'NcbiMrnaXenbaseGene_laevis.txt' and 'xlaevisMRNA.fasta' from XenBase (downloaded on May 1, 2011).
  2. Read the gene name for each NCBI id from the 'Ncbi...' file. Filter out genes with 'unnamed' in the gene name field.
  3. Read all sequences from the '.fasta' file. Convert all sequence characters to upper case.
  4. If a sequence has a '>gi|<gi number>|ref|<genbank accession>' header (meaning it is a RefSeq entry), report it.
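
The four steps above could be sketched in Python roughly as follows. The layout of the 'Ncbi...' file is an assumption here (tab-separated, NCBI accession in the first column, gene name in the second), as is the choice to report only RefSeq entries that have a retained gene name. All function names are hypothetical and this is not the script used to build the released files.

```python
def read_gene_names(lines, id_col=0, name_col=1):
    """Map NCBI id -> gene name, dropping 'unnamed' genes.
    Column positions are assumptions about the mapping file."""
    names = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(id_col, name_col):
            continue  # skip short or blank lines
        ncbi_id, gene = fields[id_col], fields[name_col]
        if "unnamed" in gene.lower():
            continue
        names[ncbi_id] = gene
    return names


def read_fasta(lines):
    """Yield (header, sequence) pairs with the sequence upper-cased."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()


def refseq_accession(header):
    """Return the accession of a 'gi|<gi>|ref|<accession>' header, else None."""
    parts = header.split("|")
    if len(parts) >= 4 and parts[0] == "gi" and parts[2] == "ref":
        return parts[3]
    return None


def select_named_refseq(fasta_lines, gene_names):
    """Yield (gene, accession, sequence) for RefSeq entries with a gene name."""
    for header, seq in read_fasta(fasta_lines):
        acc = refseq_accession(header)
        if acc is None:
            continue
        # Assumption: the mapping may list accessions with or without a version.
        gene = gene_names.get(acc) or gene_names.get(acc.split(".")[0])
        if gene:
            yield gene, acc, seq
```

In practice the two inputs would be open file handles for the mapping file and the FASTA file; both functions accept any iterable of lines.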

Files

  • cDNA: XENLA_cDNA.v1_all.fasta (8,879 sequences; previously called 'XENLA_cDNA_ref.v1.fasta')
  • proteins: XENLA_prot.v1_all.fasta (8,878 sequences; 'taf5' is not annotated as RefSeq in protein, although its corresponding mRNA sequence is annotated as RefSeq.)