C. elegans Genome Reference Sequence Changes

WormBase has made several changes to the genomic reference sequence of C. elegans N2 for WormBase release WS235.

The changes are based on candidate sequence error sites identified in three papers [1,2,3]. Each paper compared the reference sequence with short-read sequences from N2 laboratory lines and various outgroup strains, and each tried to resolve the differences between the lines to produce a set of sites where the reference genome sequence differs from a majority of the lines investigated.

There are three principal reasons why differences should be found between the reference N2 genome sequence and newly sequenced N2 laboratory lines:

  • The original sequencing project [4] estimated that the error rate of almost all of the sequence is < 10^−4, that is, fewer than one error per 10,000 bases. Across the roughly 100 Mb genome, this would give a potential 10,000 errors. This appears to be an over-estimate of the actual error rate.
  • The reference genome was originally constructed from a mosaic of clones, YACs and fosmids derived from different laboratory lines of N2. There may have been variation between the genomes of these lines when they were used in the original sequencing project.
  • The laboratory lines of N2 used in the papers will have diverged from each other since the ancestral Bristol isolate was taken from a mushroom compost heap in 1947 by L.N. Staniland [5]. The papers identify those potential error sites where a majority of the lines investigated differ from the reference sequence.
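
The arithmetic behind the first point can be checked with a short sketch. The genome size of 100 Mb is a rounded assumption, and the constant and function names are illustrative only.

```python
# Assumption: the C. elegans genome is rounded to 100 Mb for this estimate.
GENOME_SIZE_BP = 100_000_000

# An error rate of 10^-4 is one error per 10,000 bases.
BASES_PER_ERROR = 10_000


def max_expected_errors(genome_size_bp: int, bases_per_error: int) -> int:
    """Upper bound on errors implied by a rate of one error per `bases_per_error` bases."""
    return genome_size_bp // bases_per_error


def implied_error_rate(n_sites: int, genome_size_bp: int) -> float:
    """Per-base error rate implied if `n_sites` positions were all genuine errors."""
    return n_sites / genome_size_bp


print(max_expected_errors(GENOME_SIZE_BP, BASES_PER_ERROR))
```

Integer division is used so that the bound is computed exactly rather than through floating-point multiplication.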

Merging the potential error sites from the three papers resulted in 2225 unique sites, comprising 404 Deletions, 900 Insertions and 920 Substitutions. Only 208 of the 920 Substitutions appear in all three papers, and only 757 of the 1304 indels appear in both of the papers that looked at indels.
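
The merge described above can be pictured as a set union that also records how many papers support each site. The following is a minimal sketch, not the WormBase pipeline: it assumes each site is a hypothetical (chromosome, position, kind, allele) tuple, and the coordinates are invented toy data.

```python
from collections import Counter


def merge_sites(*papers):
    """Map each unique site to the number of papers that report it."""
    support = Counter()
    for sites in papers:
        # set() guards against a site being listed twice within one paper.
        support.update(set(sites))
    return support


# Invented toy data: (chromosome, position, kind, allele)
paper_a = [("I", 1000, "substitution", "A>G"), ("II", 2500, "insertion", "T")]
paper_b = [
    ("I", 1000, "substitution", "A>G"),
    ("II", 2500, "insertion", "T"),
    ("X", 42, "deletion", "C"),
]
paper_c = [("I", 1000, "substitution", "A>G")]

support = merge_sites(paper_a, paper_b, paper_c)

unique_sites = len(support)                                  # size of the union
in_all_three = [s for s, n in support.items() if n == 3]     # full consensus
by_kind = Counter(site[2] for site in support)               # tally per change type

print(unique_sites, len(in_all_three), dict(by_kind))
```

Keeping the per-site support count, rather than only the union, is what allows later steps to weigh the degree of consensus among the papers.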

The papers indicate that there is substantial variation between the laboratory lines of N2 investigated, and the three papers disagreed on which potential error sites were most likely to be correct. A multi-stage process was therefore agreed to find the error sites that are most likely to reflect the genomes of N2 lines commonly held in laboratories and that would most affect the structure and composition of coding genes.

The sites identified by any of the three papers were inspected using the Ensembl Variant Effect Predictor software [6] to find those that would have a potential impact on the translation or splicing of curated CDS models. Each of these sites was manually inspected by two curators, and a judgement was made whether to accept or reject the proposed change, based on (a) all available alignment evidence (RNASeq reads from all available libraries, ESTs and proteins); (b) the impact on the gene model; and (c) the degree of consensus among the three projects. Additionally, all indel sites not overlapping genes were manually inspected to find those that would affect the structure of nearby genes by allowing (for example) novel exons to be added before or after the existing structure.

For substitutions with no putative consequence according to the Variant Effect Predictor, an automatic strategy was used to determine whether to accept or reject each site. Specifically, a proposed change was rejected if either (a) it was only proposed by one of the three groups, or (b) it was covered by at least 100 RNASeq reads and more than 10% of reads supported the reference base call.
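
The automatic rule for these substitutions can be written as a small predicate. This is a sketch under stated assumptions rather than the code WormBase ran: the `SubstitutionSite` fields and the function name are hypothetical, and only the two thresholds (100 reads, 10%) come from the text.

```python
from dataclasses import dataclass


@dataclass
class SubstitutionSite:
    # Hypothetical record layout, used only for this illustration.
    n_supporting_papers: int         # how many of the three papers proposed the change
    rnaseq_depth: int                # RNASeq reads covering the site
    reads_supporting_reference: int  # reads agreeing with the current reference base


def accept_substitution(site, min_depth=100, max_ref_fraction=0.10):
    """Return True if the proposed change passes both rejection tests."""
    # (a) Reject changes proposed by only one of the three groups.
    if site.n_supporting_papers <= 1:
        return False
    # (b) Reject if the site is well covered and the reads still back the reference.
    if site.rnaseq_depth >= min_depth:
        ref_fraction = site.reads_supporting_reference / site.rnaseq_depth
        if ref_fraction > max_ref_fraction:
            return False
    return True


print(accept_substitution(SubstitutionSite(3, 500, 10)))   # well supported change
print(accept_substitution(SubstitutionSite(2, 200, 50)))   # reads favour the reference
```

Note that test (b) only applies at sites with at least 100 reads, so a poorly covered site proposed by two or more groups is accepted regardless of its read support.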

This process resulted in 558 Insertions, 230 Deletions and 614 Substitutions being accepted.

A total of 824 clones were changed, plus a further 19 clones that were changed because of a change at a corresponding position in an overlapping clone region.

The sequence corrections enabled 87 improvements to gene structures. Most of these were changes to correct a poor structure near a frameshift, but they also include the following notable ones:

  • C02H6.3 – New gene.
  • C50C3.19 – New gene.
  • T28B8.4 – Converted from a Pseudogene to a CDS.
  • K02A2.7 – Converted from a Pseudogene to a CDS.
  • K03A1.1 – Converted from a Pseudogene to a CDS.
  • ZK418.6 – Split to make ZK418.6 and ZK418.13.
  • Y46D2A.2 and Y46D2A.5 – Merged.
  • ZK616.8 and ZK616.7 – Merged.
  • C10A4.2 – Five new exons have been added.
  • R13A1.10 – One new exon has been added.
  • F22B7.2 – Now matches the START codon given in the mRNA AY941160.
  • F59E11.8b – Converted this isoform into a non-coding isoform.

References

1. McGrath PT, Xu Y, Ailion M, Garrison JL, Butcher RA, Bargmann CI.
Nature. 2011 Aug 17;477(7364):321-5. doi: 10.1038/nature10378.
“Parallel evolution of domesticated Caenorhabditis species targets pheromone receptor genes.”

2. Weber KP, De S, Kozarewa I, Turner DJ, Babu MM, de Bono M.
PLoS One. 2010 Nov 11;5(11):e13922.
“Whole genome sequencing highlights genetic changes associated with laboratory domestication of C. elegans.”

3. Doitsidou M, Poole RJ, Sarin S, Bigelow H, Hobert O.
PLoS One. 2010 Nov 8;5(11):e15435.
“C. elegans mutant identification with a one-step whole-genome-sequencing and SNP mapping strategy.”

4. C. elegans Sequencing Consortium.
Science. 1998 Dec 11;282(5396):2012-8.
“Genome sequence of the nematode C. elegans: a platform for investigating biology.”

5. Nicholas WL, Dougherty EC, Hansen EL.
Ann. N.Y. Acad. Sci. 1959;77:218–236.
“Axenic cultivation of C. briggsae (Nematoda: Rhabditidae) with chemically undefined supplements; comparative studies with related nematodes.”

6. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F.
Bioinformatics. 2010 Aug 15;26(16):2069-70.
“Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.”
