WormBase updates in WS277 – new protein schematic images!

We have released the 277th version of WormBase! As always, for a detailed report please look at the WS277 release notes.

New features: We’ve updated the protein schematic image in the homology widget on protein pages (see, for example, the UNC-2 isoform a page). This image displays protein domains and exons mapped to amino acid coordinates, making it easy to see which regions of a transcript correspond to specific features. The image is driven by JBrowse, the genome browser at WormBase, so users can click through it to an interactive view in amino acid coordinates. From that familiar interface, one can scroll, export sequence, get additional information on specific features, and zoom in to single amino acid resolution, with residues color-coded by chemical property. The previous static protein schematic image remains available via the “Legacy Protein Schematics” link, but will be removed after the WS278 release.
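To illustrate what mapping exons to amino acid coordinates involves, here is a minimal Python sketch. It is not the code behind the WormBase or JBrowse display; the function name exons_to_aa_ranges and the input format (the coding length of each exon in nucleotides, in transcript order) are assumptions made for this example.

```python
def exons_to_aa_ranges(cds_lengths):
    """Map coding exon segments to 1-based, inclusive amino acid ranges.

    cds_lengths: coding length of each exon in nucleotides, in transcript
    order (a hypothetical input format chosen for this sketch).
    A codon split across two exons is reported in both exons' ranges.
    """
    ranges = []
    offset = 0  # coding nucleotides consumed by earlier exons
    for length in cds_lengths:
        if length <= 0:
            raise ValueError("each coding segment must have a positive length")
        first_aa = offset // 3 + 1
        last_aa = (offset + length - 1) // 3 + 1
        ranges.append((first_aa, last_aa))
        offset += length
    return ranges


# Three exons contributing 10, 5 and 6 coding nucleotides (21 nt, 7 codons).
example_ranges = exons_to_aa_ranges([10, 5, 6])
print(example_ranges)  # [(1, 4), (4, 5), (6, 7)]
```

In this example residue 4 appears in the first two ranges because its codon straddles the boundary between exon 1 and exon 2. Whether a stop codon is counted depends on the lengths supplied; the sketch makes no assumption either way.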

New data sets:

C. elegans: Additional Nanopore transcript data has been added.

C. elegans VC2010: The VC2010 strain data includes gene annotation which has been manually curated to improve the coding gene structures lifted over from the N2 strain annotation. This process is substantially complete, with some work still to be done on chromosome X. The number of coding genes which do not map correctly has been reduced so far from around 400 down to 50 genes which cannot yet be located and 10 which appear to be pseudogenes in VC2010. There are 20 genes which appear to be duplicated in N2 and have disappeared in VC2010. There have been 39 novel coding genes created.

New gene descriptions for the WS268 release of WormBase

The WS268 release of WormBase features new automated gene descriptions (displayed in the ‘Overview’ widget at the top of the gene page). In addition to the code being entirely rewritten, we have also added gene descriptions for a new worm species – Trichuris muris. These automated gene descriptions are highly structured and are based on curated data such as orthology data, Gene Ontology (GO) annotations, Disease Ontology (DO) annotations, gene expression data, etc., in WormBase.

The following data types are included in the description of a given gene, when available: orthology to human genes, molecular function, biological processes, sub-cellular localization and tissue expression. A new addition to the gene descriptions is human disease relevance data in cases where the gene is used experimentally to study a disease or if its human orthologs are implicated in a disease.

Data content changes: For lesser-known, data-poor genes, we’ve included:
1. Protein domain data.
2. Orthologous human gene function data drawn from the Alliance of Genome Resources.
3. Perturbation by other genes and/or chemicals, and tissue enrichment data, based on large-scale data such as microarray, tiling array and RNA-seq data.

Data display changes: The ‘Overview’ widget now displays only the automated gene description. Our legacy manual gene descriptions, where they exist, are collapsed but remain available for viewing.

Note that while the automated gene descriptions are not generated directly from the literature, most of the annotations on which they are based are manually curated from the primary literature by WormBase curators.

WS264 release

C. elegans sORFs

sORFs.org is a public repository of small open reading frames (sORFs) identified by ribosome profiling (RIBO-seq).

It contains predicted sORF regions for several species, including C. elegans.

We have annotated 118 predicted sORF regions as coding (CDS) isoforms of the existing genes. It is likely that in the next release, where these isoforms do not overlap with existing isoforms, these sORF regions will be changed to be individual genes and not isoforms.

Fifty-two of these annotated sORF regions do not start with the canonical methionine (AUG) initiation codon. It is possible that they use a non-canonical initiation codon. Some of these non-canonical initiation codons are not the expected non-canonical isoleucine codon, but instead code for residues such as valine.
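As a rough illustration of how start codons like those described above can be sorted into groups, the Python sketch below counts how many bases of the first codon differ from AUG. The function name classify_start_codon and the three labels are assumptions made for this example; this is not the procedure used by sORFs.org or WormBase. Codons one base away from AUG include the isoleucine codons AUU, AUC and AUA and the valine codon GUG.

```python
def classify_start_codon(cds_sequence):
    """Label the first codon of a coding sequence relative to ATG/AUG.

    Returns "canonical" for ATG, "near-cognate" for codons that differ
    from ATG at exactly one position, and "other" for everything else.
    These labels are a convention chosen for this sketch.
    """
    codon = cds_sequence[:3].upper().replace("U", "T")
    if len(codon) < 3:
        raise ValueError("sequence is shorter than one codon")
    mismatches = sum(1 for base, ref in zip(codon, "ATG") if base != ref)
    if mismatches == 0:
        return "canonical"
    if mismatches == 1:
        return "near-cognate"
    return "other"


# Made-up sequences, used only to exercise the function.
for sequence in ("ATGAAA", "GTGAAA", "auuaaa", "CCCAAA"):
    print(sequence, classify_start_codon(sequence))
```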

Trichuris muris

This release sees the integration of the Edinburgh strain of Trichuris muris, version TMUE3.1. This species has been fully integrated as a core species, meaning that it has stable IDs and tracking and is included in all additional pipelines and analyses.
The genome assembly and gene annotation have been taken directly from the Pathogen Genomics group at the WTSI. Additional mapping of gene mergers, splits and transfer of IDs from TMUE2.2 has been done to allow users to identify their genes of interest.

Caenorhabditis nigoni

This release includes the Caenorhabditis nigoni genome assembly and gene set described in “Rapid genome shrinkage in a self-fertile nematode reveals sperm competition proteins” by Da Yin et al. (Science 359, 55-61, 2018) as a non-core species set.
This species should be of special interest, due to its phylogenetic closeness to C. briggsae and its differences in sexual reproduction.

The data is available as files on the WormBase FTP site, as well as in the JBrowse genome browser.

Gene Transfer Format (GTF) files now available

WormBase now provides the canonical gene set for each species in Gene Transfer Format (GTF, http://mblab.wustl.edu/GTF22.html). These files can be used directly by a number of popular sequence analysis tools (e.g. Cufflinks).

The GTF files are available from the WormBase FTP site. For example, the GTF file for C. elegans, c_elegans.PRJNA13758.WS253.canonical_geneset.gtf.gz, is available here.
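For readers who want to inspect a GTF file without a dedicated tool, the following is a minimal Python sketch of a reader. It assumes the standard nine tab-separated GTF columns and quoted attribute values such as gene_id "..."; the function names parse_gtf_line and iter_gtf_records are made up for this example, and the sample line is invented rather than taken from a WormBase file.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame")
# Matches attribute pairs written as: key "value"
ATTRIBUTE_PATTERN = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Turn one GTF data line into a dict with an 'attributes' sub-dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
    return record


def iter_gtf_records(lines):
    """Yield parsed records, skipping blank lines and '#' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_gtf_line(line)


# Invented example input; real files would be read from disk instead.
example_lines = [
    "# made-up example\n",
    'chrI\texample\texon\t100\t250\t.\t+\t.\t'
    'gene_id "GENE1"; transcript_id "GENE1.t1";\n',
]
records = list(iter_gtf_records(example_lines))
print(records[0]["feature"], records[0]["attributes"]["gene_id"])
```

To read a downloaded, gzip-compressed file, the same generator can be given an open text handle, for instance iter_gtf_records(gzip.open(path, "rt")) after importing gzip from the standard library, where path is the local file name.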