XENLA Oktoberfest

From Marcotte Lab
Latest revision as of 10:43, 15 January 2014

This page describes integrated gene models of Xenopus laevis, released in October 2012. "Oktoberfest" is the name of the dataset. (The beer glass logo is from http://www.webdesignhot.com/free-vector-graphics/lifelike-beer-glasses-and-beer-bubbles-vector-graphic/.)

Result

  • Search: http://daudin.icmb.utexas.edu/ or http://xenopus.marcottelab.org (not working yet).
  • Download sequences: http://daudin.icmb.utexas.edu/pub/
    • All: all representative sequences
    • Longest: longest sequences per genomic hit AND name.

Statistics

  • Total genomic hits: 28,084
  • Genomic hits without associated protein sequences: 3,626 (24,458 genomic hits in protein level analysis)
  • Genomic hits with model organism reference sequences: 24,372 (86 hits dropped)
  • Genomic hits without gene name: 7,300
  • Genomic hits with gene name: 20,788
    • Total number of 'longest representative' sequences (unique gene name & genomic hit location; some genomic hits have more than one putative gene model): 25,537
    • Total number of 'representative' sequences (all): 47,282
  • Associated gene names: 13,249
    • Names with one gene model: 1,365
    • Names with two gene models: 4,740
    • Names with three gene models: 1,835
    • Names with four gene models: 4,110
    • Names with more than four gene models: 1,199

Input data

  • JGIv6 scaffold (From Daniel Rokhsar & Richard Harland, UC Berkeley)
  • Reference cDNA/EST
    • From XenBase (GenBank accession)
    • Mike Gilchrist's EST collection (mgEST*)
    • John Quackenbush's EST collection (TC*)
    • JGI's cDNA collection (XeXen*)
  • Assembled transcripts (14 different sets, including a large-scale J-strain RNA-seq set)

Analysis procedures

  1. Remove JGIv6 scaffolds shorter than 10,000 bp (the remaining scaffolds are called JGIv6_lt10k scaffolds afterward).
  2. Map cDNA/EST/assembled transcripts to JGIv6_lt10k scaffolds using BLAT.
  3. Set a cutoff on the align ratio (defined as '(align_len-mismatches-gap_bases)/query_len') such that the retained hits contain less than 1% of 'second best' hits. The cutoff was roughly 90% for the reference set and 95% for assembled transcripts.
  4. Cluster mapped regions (longest stretches). Each cluster is called a 'genomic hit' afterward.
    • Use all these sequences for the mapping figure.
    • In total, 41,635 genomic hit candidates were identified from all datasets, but only 67% of them (28,084 hits) are supported by multiple lines of evidence. The remaining 13,551 are singletons (distribution: XenBase=35, XGI=1806, mgEST=334, JGI=8118, J.oTx=1212, WT.oTx=2046).
  5. Select representative sequences per genomic hit. Sometimes multiple genes are clustered together, so I selected (1) the longest and the second longest transcripts, and (2) the third longest transcript if it is not covered by the first two transcripts AND it is longer than 20% of the longest transcript. As a result, about 2-4 representative cDNA sequences are selected per genomic hit.
  6. Do 6-frame translation for those representative sequences.
  7. Run BLASTP with known protein sequences of other species (CHICK, MOUSE, HUMAN, XENTR and DANRE from EnsEMBL 66; XENLA_v5 and XENTR_v5 from XenBase Aug. 2012).
  8. Select the top 3 hits according to bit score (not E-value).
  9. Remove a representative sequence if it has multiple frame candidates.
  10. Do multiple sequence alignment of proteins per genomic hit by MUSCLE. Use all top-3 model organism proteins and representative protein sequences translated from representative cDNA sequences.
    • Alignment results (CLW format) and tree2 info (Newick format) are stored.
  11. Generate ASCII tree figure from tree2 output.
  12. Calculate distances between nodes on the tree. For each representative protein sequence, find the closest model organism protein and fetch its name (gene names are normalized to capital letters, numbers, and underscores ('_')). For zebrafish, all names containing '(n of M)' are converted to '_nOFm_'.
  13. Assign this name to the representative sequence.
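
Steps 3, 5, and 12 above can be sketched in Python as follows. This is only an illustrative sketch, not the pipeline's actual code: the function names, the (name, start, end) tuple layout, the reading of "covered" as "fully contained within one of the first two transcripts", and the exact placement of underscores in normalized names are all assumptions made for the example.

```python
import re


def align_ratio(align_len, mismatches, gap_bases, query_len):
    """Step 3: (align_len - mismatches - gap_bases) / query_len."""
    return (align_len - mismatches - gap_bases) / query_len


def select_representatives(transcripts, min_fraction=0.2):
    """Step 5: pick representative transcripts for one genomic hit.

    `transcripts` is a list of (name, start, end) tuples on the scaffold
    (a hypothetical layout chosen for this sketch).
    """
    # Rank by mapped length, longest first.
    ranked = sorted(transcripts, key=lambda t: t[2] - t[1], reverse=True)
    chosen = ranked[:2]  # (1) longest and second longest
    if len(ranked) > 2:
        _, start, end = ranked[2]
        longest_len = ranked[0][2] - ranked[0][1]
        # Assumption: "covered" means fully contained in one chosen transcript.
        covered = any(s <= start and end <= e for _, s, e in chosen)
        # (2) third longest, if not covered AND longer than 20% of the longest.
        if not covered and (end - start) > min_fraction * longest_len:
            chosen.append(ranked[2])
    return chosen


def normalize_gene_name(name):
    """Step 12: keep only capital letters, numbers, and underscores.

    Assumption: '(n of M)' suffixes become '_nOFm_' and any other
    disallowed character is replaced by an underscore.
    """
    name = re.sub(r"\s*\((\d+) of (\d+)\)", r"_\1OF\2_", name)
    return re.sub(r"[^A-Z0-9_]", "_", name.upper())
```

For example, a reference cDNA hit would pass the roughly 90% cutoff when align_ratio(...) >= 0.90, while an assembled transcript would need >= 0.95.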

Known issues

  • Orientation in the cDNA/EST mapping figure may be incorrect. Translation to protein is only checked for representative sequences (2-4 sequences per genomic hit), so some transcripts that still support the gene structure may be drawn in the opposite direction. This will be fixed in the next release.
  • Some genomic hits do not have tree figures because of a Newick Utilities error. This will be fixed in the next release.

Plan

  • The next release is planned for around Thanksgiving, 2012. If it is delivered on schedule, it will be called 'Thanksgiving' (of course, it may become 'Christmas' or something else :-)).
    • All known issues will be addressed.
    • Assembled transcripts part (esp. protein translation step) will be revised.
    • Phylogenetic analysis of duplicated genes (alloalleles) will be added.
    • Synteny structure of duplicated genomic hits will be added.
    • Exon-intron boundaries for Morpholino design will be added.
    • Scaffold coordinate information (i.e., a GTF or GFF3 file) will be released soon, ahead of the next release.