Difference between revisions of "Xenopus reference"

From Marcotte Lab
All of these files are derived from XenBase (downloaded on May 1, 2011).
+
= Version 3. With Gilchrist EST set =
 +
Courtesy of [http://www.nimr.mrc.ac.uk/research/mike-gilchrist/ Mike Gilchrist] at the National Institute for Medical Research, UK. You can directly access his assembled EST data (both ''X. laevis'' and ''X. tropicalis'') at http://genomics.nimr.mrc.ac.uk/online/xenopus/
  
== Version 2. Non-redundant cDNA ==
+
== Data processing method ==
# Cluster ref.v1 sequences with [http://www.drive5.com/usearch/ usearch](version 4.2.66) with %identity>80%, to remove redundancy. As a result, we got 8,164 non-redundant cDNA sequences. ([[xdata:/release/XENLA_cDNA_ref.v1.uc080_fasta|XENLA_cDNA_ref.v1.uc080_fasta]])
+
# From xlaevisMRNA.fasta file, collect all sequences that were not included in ref.v1. Mainly they are (1) annotated as RefSeq, but there is no designated gene name, or (2) not annotated as RefSeq (GenBank submitted sequences).
+
# Use non-redundant ref.v1 sequences (step 1) as database, and run 'usearch' with DB search + clustering mode. So non-RefSeq sequences in step 2 would be (1) clustered with ref.v1 sequence if their %id is greater than 80%, or (2) clustered as independent cluster if there is no sequences available in ref.v1 with %id>80%.
+
  
[[xdata:/release/XENLA_cDNA_ref.v2.fasta|XENLA_cDNA_ref.v2.fasta]] (9,943 sequences)
 
  
= Version 1. RefSeq of cDNA & protein =
+
= Version 2. With non-RefSeq cDNA =
 +
 
 +
== Data processing method ==
 +
# Cluster the v1_all cDNA sequences with [http://www.drive5.com/usearch/ usearch] (version 4.2.66) at %identity>80% to remove redundancy (see [[TXGP_ens63_reference]] for more information about this %id cutoff). This yields 8,164 non-redundant cDNA sequences (called 'v1_nr080'). We also built a non-redundant set at the more stringent %identity>90% cutoff ('v1_nr090'), although it is not used in further analysis.
 +
# From the xlaevisMRNA.fasta file used in version 1, collect all sequences that were not included in v1_all because (1) they have no designated gene name, or (2) they are not annotated as RefSeq (GenBank submitted sequences). This set contains 16,220 sequences (called 'v2_new').
 +
# Combine 'v2_new' with the 'v1_all' sequences to make the 'v2_all' set (25,099 sequences).
 +
# Sort the v2_all sequences by length.
 +
# Use the non-redundant v1_nr080 sequences as the database, and run 'usearch' in DB search + clustering mode. As a result, each 'v2_all' sequence is either (1) clustered with a v1_nr080 sequence if their %id is greater than 80%, or (2) placed in an independent cluster if no v1_nr080 sequence matches it with %id>80%.
 +
# Make the v2 sequences by combining all sequences in v1_nr080 (they are all included in v2, even those with no hits in the search) with the seed sequence (the longest sequence) of each cluster ('v2_nr080' for %id>80% and 'v2_nr090' for %id>90%).
 +
 
 +
== Files ==
 +
* v2_nr080 cDNA: [[xdata:/release/XENLA_cDNA.v2_nr080.fasta|XENLA_cDNA.v2_nr080.fasta]] (9,941 non-redundant cDNA sequences)
 +
 
 +
== Supplement ==
 +
* v1_nr080 cDNA: [[xdata:/release/XENLA_cDNA.v1_nr080.fasta|XENLA_cDNA.v1_nr080.fasta]] (8,164 non-redundant cDNA sequences from v1.)
 +
* v1_nr090 cDNA: [[xdata:/release/XENLA_cDNA.v1_nr090.fasta|XENLA_cDNA.v1_nr090.fasta]] (8,573 non-redundant cDNA sequences from v1. Not used in further analysis.)
 +
* v2_new cDNA: [[xdata:/release/XENLA_cDNA.v2_new.fasta|XENLA_cDNA.v2_new.fasta]] (16,220 sequences newly added in v2.)
 +
* v2_all cDNA: [[xdata:/release/XENLA_cDNA.v2_all.fasta|XENLA_cDNA.v2_all.fasta]] (25,099 sequences. Combination of 'v2_new' and 'v1_all')
 +
* v2-on-v1 usearch output (%id>80%): [[xdata:/release/XENLA_cDNA.v2-on-v1.uc080|XENLA_cDNA.v2-on-v1.uc080]]
 +
* v2-on-v1 usearch output (%id>90%): [[xdata:/release/XENLA_cDNA.v2-on-v1.uc090|XENLA_cDNA.v2-on-v1.uc090]]
 +
* v2_nr090 cDNA: [[xdata:/release/XENLA_cDNA.v2_nr090.fasta|XENLA_cDNA.v2_nr090.fasta]] (10,520 non-redundant cDNA sequences)
 +
 
 +
= Version 1. RefSeq =
 
== Data processing method ==
 
 
# Download 'NcbiMrnaXenbaseGene_laevis.txt' and 'xlaevisMRNA.fasta' from XenBase (downloaded on May 1, 2011).
 

Revision as of 11:43, 12 October 2011


Version 3. With Gilchrist EST set

Courtesy of Mike Gilchrist at the National Institute for Medical Research, UK. You can directly access his assembled EST data (both X. laevis and X. tropicalis) at http://genomics.nimr.mrc.ac.uk/online/xenopus/

Data processing method

Version 2. With non-RefSeq cDNA

Data processing method

  1. Cluster the v1_all cDNA sequences with usearch (version 4.2.66) at %identity>80% to remove redundancy (see TXGP_ens63_reference for more information about this %id cutoff). This yields 8,164 non-redundant cDNA sequences (called 'v1_nr080'). We also built a non-redundant set at the more stringent %identity>90% cutoff ('v1_nr090'), although it is not used in further analysis.
  2. From the xlaevisMRNA.fasta file used in version 1, collect all sequences that were not included in v1_all because (1) they have no designated gene name, or (2) they are not annotated as RefSeq (GenBank submitted sequences). This set contains 16,220 sequences (called 'v2_new').
  3. Combine 'v2_new' with the 'v1_all' sequences to make the 'v2_all' set (25,099 sequences).
  4. Sort the v2_all sequences by length.
  5. Use the non-redundant v1_nr080 sequences as the database, and run 'usearch' in DB search + clustering mode. As a result, each 'v2_all' sequence is either (1) clustered with a v1_nr080 sequence if their %id is greater than 80%, or (2) placed in an independent cluster if no v1_nr080 sequence matches it with %id>80%.
  6. Make the v2 sequences by combining all sequences in v1_nr080 (they are all included in v2, even those with no hits in the search) with the seed sequence (the longest sequence) of each cluster ('v2_nr080' for %id>80% and 'v2_nr090' for %id>90%).
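
The bookkeeping around the usearch run (steps 2 to 4 and step 6) can be sketched in Python as follows. This is only an illustration, not the script actually used: the function names, the in-memory dictionaries (sequence id to sequence), and the `cluster_of` mapping (sequence id to the label of the independent cluster it fell into, which would have to be derived from the usearch output) are all assumptions. Sorting longest-first is also an assumption, since the steps above do not state the sort direction.

```python
def collect_new(all_mrna, v1_all):
    """Step 2: sequences in the full mRNA set that are absent from v1_all."""
    return {sid: seq for sid, seq in all_mrna.items() if sid not in v1_all}


def sort_by_length(seqs):
    """Step 4: order records by length (longest first, assumed), ties by id."""
    return sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))


def build_v2(v1_nr, v2_all, cluster_of):
    """Step 6: keep every v1_nr sequence plus the longest member of each
    independent cluster.

    cluster_of is a hypothetical mapping from sequence id to cluster label,
    covering only sequences that did NOT cluster with a v1_nr sequence.
    """
    result = dict(v1_nr)
    longest = {}  # cluster label -> id of the longest member seen so far
    for sid, seq in v2_all.items():
        if sid in v1_nr:
            continue
        cluster = cluster_of.get(sid)
        if cluster is None:
            continue  # absorbed into a cluster seeded by a v1_nr sequence
        current = longest.get(cluster)
        if current is None or len(seq) > len(v2_all[current]):
            longest[cluster] = sid
    for sid in longest.values():
        result[sid] = v2_all[sid]
    return result


# Toy data only; the ids and sequences are made up for illustration.
all_mrna = {"a": "ACGTACGT", "b": "ACG", "c": "ACGTAC", "d": "AC"}
v1_all = {"a": "ACGTACGT"}
v2_new = collect_new(all_mrna, v1_all)
v2_all = {**v1_all, **v2_new}          # step 3
ordered = sort_by_length(v2_all)       # step 4
v2 = build_v2(v1_all, v2_all, {"b": "new1", "c": "new1"})
```

The counts reported above are consistent with this kind of union: 8,879 v1_all sequences plus 16,220 v2_new sequences give the 25,099 sequences of v2_all.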

Files

Supplement

Version 1. RefSeq

Data processing method

  1. Download 'NcbiMrnaXenbaseGene_laevis.txt' and 'xlaevisMRNA.fasta' from XenBase (downloaded on May 1, 2011).
  2. Read the gene name for each NCBI id from the 'NcbiMrnaXenbaseGene_laevis.txt' file. Filter out genes with 'unnamed' in the gene name field.
  3. Read all sequences from the 'xlaevisMRNA.fasta' file. Convert all sequence characters to upper case.
  4. If a sequence has a '>gi|<gi number>|ref|<genbank accession>' header (meaning it is a RefSeq entry), report it.
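
A minimal Python sketch of these four steps is given below. It is not the original script. It assumes that the mapping file is tab-separated with the NCBI id in the first column and the gene name in the second, and that the NCBI id matches the accession in the FASTA header; neither detail is stated above. The function names are hypothetical.

```python
import re

# Step 4: RefSeq records have a 'gi|<gi number>|ref|<accession>' header.
REFSEQ_HEADER = re.compile(r"^gi\|(\d+)\|ref\|([^|\s]+)")


def read_gene_names(lines, id_col=0, name_col=1):
    """Step 2: map NCBI id -> gene name, skipping 'unnamed' genes.
    The column positions are assumptions about the mapping file layout."""
    names = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(id_col, name_col):
            continue
        ncbi_id, gene = fields[id_col], fields[name_col]
        if "unnamed" in gene.lower():
            continue
        names[ncbi_id] = gene
    return names


def read_fasta(lines):
    """Step 3: yield (header, sequence) pairs with upper-cased sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()


def build_v1(mapping_lines, fasta_lines):
    """Report RefSeq records whose accession has a named gene."""
    names = read_gene_names(mapping_lines)
    kept = []
    for header, seq in read_fasta(fasta_lines):
        match = REFSEQ_HEADER.match(header)
        if match is None:
            continue  # not a RefSeq entry
        accession = match.group(2)
        gene = names.get(accession)
        if gene is None:
            continue  # unnamed or unmapped gene
        kept.append((accession, gene, seq))
    return kept


# Toy input only; these ids and sequences are made up for illustration.
mapping = ["NM_001.1\tactb\n", "NM_002.1\tunnamed\n"]
fasta = [
    ">gi|1|ref|NM_001.1| actb mRNA\n", "acgt\n", "ttaa\n",
    ">gi|2|ref|NM_002.1| x\n", "gggg\n",
    ">gi|3|gb|BC000.1| y\n", "cccc\n",
]
v1_records = build_v1(mapping, fasta)
```

With real data, the two line lists would be replaced by open file handles for 'NcbiMrnaXenbaseGene_laevis.txt' and 'xlaevisMRNA.fasta'.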

Files

  • cDNA: XENLA_cDNA.v1_all.fasta (8,879 sequences; previously called 'XENLA_cDNA_ref.v1.fasta')
  • proteins: XENLA_prot.v1_all.fasta (8,878 sequences; 'taf5' is not annotated as RefSeq in protein, although its corresponding mRNA sequence is annotated as RefSeq.)