Submission Deadline for WS238

With the International Worm Meeting 2013 coming up soon, I would like to remind you that the submission deadline for the WS238 release of WormBase is the end of April. If you have a large or complex dataset, please tell us well ahead of time to allow for processing of the data.

WS238 will most likely not be live during the meeting, but it will be released soon after.

Legacy allele information in WormBase

As of WS236, published and unpublished allele phenotypes from the books C. elegans I and II will be available on gene and allele pages, displayed in the phenotype widget. This focused effort added 4,819 phenotype annotations to 1,431 genes. Until now, these data had been hidden in legacy information from C. elegans I and II and were available in WormBase only as a block of text under the concise description in the overview widget of the gene page. As these descriptions constitute valuable gene function information, we are pleased to have these data available for query and computational analysis.

New gene expression data

Interested in seeing the expression profile of your favorite gene during development? On the page for your gene of interest, open the expression widget and click on the object whose description begins with ‘Developmental gene expression time-course’. On the graph page, clicking on a graph opens a larger image. We have recently incorporated graphs for over 19,000 genes from the study published in Developmental Cell (Levin et al., 2012). Check out tbx-34 as an example.

C. elegans Genome Reference Sequence Changes

WormBase has made several changes to the genomic reference sequence of C. elegans N2 for WormBase release WS235.

The changes are based on sequence error sites identified in three papers [1,2,3]. Each paper compared the reference sequence with short-read sequences from N2 laboratory lines and various outgroup strains, and each tried to resolve the differences between the lines to produce a set of sites where the reference genome sequence differs from a majority of the lines investigated.

There are three principal reasons why differences should be found between the reference N2 genome sequence and newly sequenced N2 laboratory lines:

  • The original sequencing project [4] estimated that the error rate of almost all the sequence is less than 10^−4 (one error per 10,000 bases). This would give a potential 10,000 errors in the roughly 100 Mb genome. This appears to be an over-estimate of the actual error rate.
  • The reference genome was originally constructed from a mosaic of clones, YACs and fosmids derived from different laboratory lines of N2. There may have been variation between the genomes of these lines when they were used in the original sequencing project.
  • The laboratory lines of N2 used in the papers will have diverged from each other since the ancestral Bristol isolate was taken from a mushroom compost heap in 1947 by L.N. Staniland [5]. The papers identify those potential error sites where a majority of the lines investigated differ from the reference sequence.
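
As a rough check on the first point, the upper bound on errors follows from multiplying the quoted error rate by the genome size. The sketch below assumes a genome size of about 100 Mb (a rounded figure, not the exact count from the sequencing paper); the function and constant names are illustrative only.

```python
# Back-of-the-envelope bound on errors implied by the quoted error rate.
GENOME_SIZE_BP = 100_000_000  # assumption: genome rounded to ~100 Mb
MAX_ERROR_RATE = 1e-4         # upper bound quoted from the sequencing project [4]


def expected_errors(genome_size_bp: int, error_rate: float) -> int:
    """Return the number of erroneous bases implied by a per-base error rate."""
    return round(genome_size_bp * error_rate)


print(expected_errors(GENOME_SIZE_BP, MAX_ERROR_RATE))  # prints 10000
```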

Merging the potential error sites from the three papers resulted in 2225 unique sites, comprising 404 deletions, 900 insertions and 920 substitutions. Only 208 of the 920 substitutions appear in all three papers, and only 757 of the 1304 indels appear in both of the papers that examined indels.

The papers indicate that there is substantial variation between the laboratory lines of N2 investigated, and they disagree on which potential error sites are most likely to be correct. A multi-stage process was therefore agreed upon to find the error sites that are most likely to reflect the genomes of N2 lines commonly held in laboratories and that would most affect the structure and composition of coding genes.
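
To illustrate what this kind of merging involves, here is a minimal sketch that pools several call sets and counts how many of them report each site. The Site record, its fields, and the toy data are hypothetical; they are not the format used by the papers or by WormBase.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Site(NamedTuple):
    """Hypothetical record for one proposed correction."""
    chrom: str
    pos: int
    kind: str    # "substitution", "insertion" or "deletion"
    change: str  # e.g. "A>G", "+T", "-C"


def merge_sites(call_sets: Iterable[Set[Site]]) -> Counter:
    """Map each unique site to the number of call sets that report it."""
    support = Counter()
    for sites in call_sets:
        support.update(sites)  # a set contributes at most 1 per site
    return support


# Toy data only; real call sets would hold thousands of sites.
paper1 = {Site("I", 100, "substitution", "A>G"), Site("II", 250, "insertion", "+T")}
paper2 = {Site("I", 100, "substitution", "A>G"), Site("X", 42, "deletion", "-C")}
paper3 = {Site("I", 100, "substitution", "A>G"), Site("II", 250, "insertion", "+T")}

support = merge_sites([paper1, paper2, paper3])
in_all_three = [site for site, n in support.items() if n == 3]
print(len(support), len(in_all_three))  # prints: 3 1
```

Keeping the per-site support count, rather than only the union, makes it straightforward to ask later how many sources agree on a given site.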

The sites identified by any of the three papers were inspected using the Ensembl Variant Effect Predictor software [6] to find those that would have a potential impact on the translation or splicing of curated CDS models. Each of these sites was manually inspected by two curators, and a judgement was made whether to accept or reject the proposed change, based on (a) all available alignment evidence (RNASeq reads from all available libraries, ESTs and proteins); (b) the impact on the gene model; and (c) the degree of consensus among the three projects. Additionally, all indel sites not overlapping genes were manually inspected to find those that would affect the structure of nearby genes by allowing (for example) novel exons to be added before or after the existing structure.

For substitutions with no putative consequence according to the Variant Effect Predictor, an automatic strategy was used to determine whether to accept or reject each site. Specifically, a proposed change was rejected if either (a) it was only proposed by one of the three groups, or (b) it was covered by at least 100 RNASeq reads and more than 10% of reads supported the reference base call.
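
The automatic rule above can be written as a small predicate. This is only a sketch of the stated criteria; the Substitution record and its field names are hypothetical, and it should not be read as the actual WormBase pipeline.

```python
from dataclasses import dataclass

MIN_READS = 100  # coverage threshold stated in the text


@dataclass
class Substitution:
    """Hypothetical summary of the evidence for one proposed substitution."""
    n_groups: int     # how many of the three groups proposed the change
    total_reads: int  # RNASeq reads covering the site
    ref_reads: int    # reads supporting the reference base call


def accept(site: Substitution) -> bool:
    """Apply the two rejection criteria; accept the site if neither fires."""
    # (a) proposed by only one of the three groups
    if site.n_groups <= 1:
        return False
    # (b) at least 100 reads, and more than 10% of them support the reference
    #     (integer arithmetic avoids floating-point edge cases at exactly 10%)
    if site.total_reads >= MIN_READS and site.ref_reads * 10 > site.total_reads:
        return False
    return True


print(accept(Substitution(n_groups=3, total_reads=200, ref_reads=50)))  # False
print(accept(Substitution(n_groups=2, total_reads=200, ref_reads=10)))  # True
```

Note that a site with fewer than 100 covering reads is never rejected by criterion (b), whatever fraction of those reads supports the reference.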

This process resulted in 558 insertions, 230 deletions and 614 substitutions being accepted.

There were 824 clones changed, plus a further 19 clones that were changed because of a change at a corresponding position in an overlapping clone region.

The sequence corrections enabled 87 improvements to gene structures. Most of these correct a poor structure near a frameshift, but they also include the following notable changes:

  • C02H6.3 – New gene.
  • C50C3.19 – New gene.
  • T28B8.4 – Converted from a Pseudogene to a CDS.
  • K02A2.7 – Converted from a Pseudogene to a CDS.
  • K03A1.1 – Converted from a Pseudogene to a CDS.
  • ZK418.6 – Split to make ZK418.6 and ZK418.13.
  • Y46D2A.2 and Y46D2A.5 – Merged.
  • ZK616.8 and ZK616.7 – Merged.
  • C10A4.2 – Five new exons have been added.
  • R13A1.10 – One new exon has been added.
  • F22B7.2 – Now matches the START codon given in the mRNA AY941160.
  • F59E11.8b – Converted this isoform into a non-coding isoform.

References

1. McGrath PT, Xu Y, Ailion M, Garrison JL, Butcher RA, Bargmann CI.
Nature. 2011 Aug 17;477(7364):321-5. doi: 10.1038/nature10378.
“Parallel evolution of domesticated Caenorhabditis species targets pheromone receptor genes.”

2. Weber KP, De S, Kozarewa I, Turner DJ, Babu MM, de Bono M.
PLoS One. 2010 Nov 11;5(11):e13922.
“Whole genome sequencing highlights genetic changes associated with laboratory domestication of C. elegans.”

3. Doitsidou M, Poole RJ, Sarin S, Bigelow H, Hobert O.
PLoS One. 2010 Nov 8;5(11):e15435.
“C. elegans mutant identification with a one-step whole-genome-sequencing and SNP mapping strategy.”

4. C. elegans Sequencing Consortium.
Science. 1998 Dec 11;282(5396):2012-8.
“Genome sequence of the nematode C. elegans: a platform for investigating biology.”

5. Nicholas WL, Dougherty EC, Hansen EL.
Ann. N.Y. Acad. Sci. 1959;77:218–236.
“Axenic cultivation of C. briggsae (Nematoda: Rhabditidae) with chemically undefined supplements; comparative studies with related nematodes.”

6. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F.
Bioinformatics. 2010 Aug 15;26(16):2069-70.
“Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.”

An interactive virtual worm

Would you like to browse a virtual 3D interactive atlas of the worm? Check out the ‘virtual worm’ at the Virtual Worm Project, a computer-generated 3D model of the adult hermaphrodite of Caenorhabditis elegans at cellular resolution. Browse the model interactively in the 3D graphics program Blender, or in any WebGL-enabled web browser via the Open Worm Browser. See this page for a list of supported browsers. The Virtual Worm site includes several video tutorials of about 5 minutes each that explain how to use and view the model in Blender.

Currently, WormBase is using this model to generate expression pattern images based on curated expression pattern objects in the database. The Virtual Worm image can be viewed by opening the expression widget on the WormBase gene page. If the gene has post-embryonic spatial expression information, a Virtual Worm image depicts the cells and tissues in which the gene is expressed. In the future we hope to embed the browsable Virtual Worm model directly into the WormBase website, with added anatomy meta-data and hyperlinks to relevant anatomy pages. We also plan to develop 3D models of embryonic and larval stages as well as the adult male. For any questions, comments, or suggestions, please contact Chris Grove at [email protected].