Data explained: gene descriptions

WormBase writes and displays short summaries about genes in the ‘Overview’ widget at the top of gene pages. When we realized we could not keep up with both updating existing gene descriptions and writing new ones, we developed an automated gene descriptions pipeline that uses primary data from the most recent WormBase release to write gene descriptions for the next release (e.g., the gene descriptions for the WS262 release of WormBase are based on the WS261 data release). The data we currently include in a gene description are: orthology to human for C. elegans genes, and orthology to C. elegans for genes of non-elegans species (such as C. briggsae); biological process, molecular function and cellular localization (based on Gene Ontology (GO) annotations); and tissue expression data. For poorly studied genes with no functional data, we include summaries of expression and regulation data from large-scale experiments such as microarray, tiling array and RNA sequencing. For every new release, scripts add to the gene descriptions any new data in the above categories that was curated between releases.

We currently have over 140,000 gene descriptions for nine species. The descriptions for non-elegans genes (C. briggsae, C. japonica, etc.) can be found in the ‘Overview’ widget on their respective gene pages. In addition, for each release we make a file with all the gene descriptions for a given species available for download on our FTP site; e.g., for C. elegans, the file is ‘c_elegans.PRJNA13758.WS262.functional_descriptions.txt’. Files for other species can be found by following a similar directory structure in the WS262 release directory.

Specify genome locations with 30 nucleotides of flanking sequence!

Many of us have tried to reconstruct what someone else has done and been frustrated trying to find the exact sequence. Relative coordinates do not last: gene models often change, so that “Leu234” in a protein is no longer there, and our knowledge of the genome sequence changes (or we are working with a different strain), so the EcoRI site 5’ to your favorite gene is not there. There is an easy solution: always specify a location by sequence. Thirty nucleotides is sufficient in essentially all cases to uniquely locate the site. The simple effort of specifying a genome location by sequence when you are writing a paper will make experiments easily reproducible, and will help WormBase in curating such studies.
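To illustrate why a flanking sequence is a durable way to specify a location, here is a minimal Python sketch that searches a genome sequence for a 30-nucleotide flank on both strands and reports every match, so that uniqueness can be checked. The function names (reverse_complement, find_all, locate_site) and the toy sequences are assumptions made for this example; they are not part of any WormBase tool, and a real search would run against a full chromosome sequence loaded from a FASTA file.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def find_all(genome: str, query: str) -> list[int]:
    """Return the 0-based start of every match of query, overlaps included."""
    positions = []
    start = genome.find(query)
    while start != -1:
        positions.append(start)
        start = genome.find(query, start + 1)
    return positions


def locate_site(genome: str, flank: str) -> list[tuple[int, str]]:
    """Find a flanking sequence on either strand of the genome.

    Returns (start, strand) pairs, where start is the 0-based position on
    the forward strand. Exactly one hit means the flank locates the site
    unambiguously; zero or several hits mean it does not.
    """
    genome = genome.upper()
    flank = flank.upper()
    hits = [(pos, "+") for pos in find_all(genome, flank)]
    rc = reverse_complement(flank)
    if rc != flank:  # skip the second search for palindromic flanks
        hits += [(pos, "-") for pos in find_all(genome, rc)]
    return hits


if __name__ == "__main__":
    # Toy data: a made-up 30-nt flank embedded in a short made-up "genome".
    flank = "ATGCGTACCGTTAGCTAGGCTTAACGGATC"
    genome = "GGGG" + flank + "CCCC"
    hits = locate_site(genome, flank)
    if len(hits) == 1:
        print("unique hit:", hits[0])
    else:
        print(f"{len(hits)} hits; the flank does not identify a single site")
```

Because the search is by sequence rather than by coordinate, the same flank can be located again after a genome assembly update or in a different strain, as long as the sequence itself is still present.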

Call to authors: provide complete information about genetic entities and reagents

WormBase requests that authors provide complete information about genetic entities such as strains, alleles, transgenes, etc., in published papers. Providing a clear list of the experimental genetic entities used in the paper, along with complete information about them, would make curating your paper easier and quicker, saving the time and effort that curators spend searching for such information and/or writing to authors. For example, if you use strains, please provide the complete list of strain names and their genotypes, including transgenes, markers and any additional components. Papers with incomplete information about genetic entities and reagents can be curated only partially or not at all, making valuable data about worm models of biology and disease unavailable to both the worm and biomedical research communities.