WormBase writes and displays short summaries about genes in the ‘Overview’ widget at the very top of gene pages. When we realized we couldn’t keep up with both updating existing gene descriptions and writing new ones, we developed an automated gene descriptions data pipeline that uses primary data from the most recent WormBase release to write gene descriptions for the next release (e.g., the gene descriptions for the WS262 release of WormBase are based on the WS261 data release). The data we currently include in a gene description are: orthology to human for C. elegans genes, and orthology to C. elegans for genes of non-elegans species (such as C. briggsae); biological process, molecular function and cellular localization (based on Gene Ontology (GO) annotations); and tissue expression data. For poorly studied genes with no functional data, we include expression and regulation data summaries from large-scale experiments such as microarray, tiling array and RNA sequencing. For every new release, scripts add data in the above categories that was curated between releases to the gene descriptions. We currently have over 140,000 gene descriptions for nine species. The descriptions for non-elegans genes, such as those of C. briggsae and C. japonica, can be found in the ‘Overview’ widget on their respective gene pages. In addition, we make a file with all the gene descriptions for a given species and release available for download on our FTP site; e.g., for C. elegans, the ‘c_elegans.PRJNA13758.WS262.functional_descriptions.txt’ file is available here. Files for other species can be found by going down a similar directory structure in the WS262 release directory.
If you are looking for interaction data for C. elegans, look no further than our FTP site. At every release, WormBase deposits data files on the FTP site under the relevant release number and species directories. The files are named after the data they contain. The interactions file for C. elegans for the WS261 release can be found here and is called c_elegans.PRJNA13758.WS261.interactions.txt.gz.
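The file names quoted in these posts appear to share one pattern: species, BioProject accession, release, data type, and extension, joined by dots. The sketch below builds and splits such names. The pattern is inferred only from the three example names mentioned here, so treat the regular expression and the helper names (`build_file_name`, `parse_file_name`, `previous_release`) as assumptions rather than an official WormBase convention or API.

```python
import re

# Pattern inferred from the example file names in these posts (an assumption):
#   <species>.<bioproject>.<release>.<data type>.<extension>
_NAME_RE = re.compile(
    r"^(?P<species>[a-z]_[a-z]+)\."
    r"(?P<bioproject>PRJ[A-Z]+\d+)\."
    r"(?P<release>WS\d+)\."
    r"(?P<data_type>[A-Za-z_]+)\."
    r"(?P<extension>.+)$"
)


def build_file_name(species, bioproject, release, data_type, extension):
    """Join the five parts with dots, e.g. extension='txt.gz'."""
    return ".".join([species, bioproject, release, data_type, extension])


def parse_file_name(name):
    """Split a file name into its parts; raise ValueError if it does not fit."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    return match.groupdict()


def previous_release(release):
    """Return the release before the given one, e.g. 'WS262' gives 'WS261'."""
    if not re.fullmatch(r"WS\d+", release):
        raise ValueError(f"unrecognized release: {release!r}")
    return f"WS{int(release[2:]) - 1}"


if __name__ == "__main__":
    print(build_file_name("c_elegans", "PRJNA13758", "WS261",
                          "interactions", "txt.gz"))
    print(parse_file_name(
        "c_elegans.PRJNA13758.WS262.functional_descriptions.txt"))
    print(previous_release("WS262"))
```

Names for other species would need their own BioProject accession, which is best looked up on the FTP site rather than guessed.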
WormBase now provides the canonical gene set for each species in Gene Transfer Format (GTF, http://mblab.wustl.edu/GTF22.html). These files can be used directly by a number of popular sequence analysis tools (e.g. Cufflinks).
The GTF files are available from the WormBase FTP site; for example, the GTF file for C. elegans, c_elegans.PRJNA13758.WS253.canonical_geneset.gtf.gz, is available here.
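For readers who want to inspect a GTF file without a dedicated tool, here is a minimal sketch of a reader. It relies only on the general GTF layout (nine tab-separated columns, with the last column holding `key "value";` attribute pairs). The names `GtfRecord`, `parse_gtf_line` and `iter_gtf` are hypothetical, and the line in the usage example is made up for illustration; it is not taken from a WormBase file.

```python
import gzip
import re
from typing import Dict, Iterator, NamedTuple, Optional

# Matches attribute pairs of the form: key "value";
# Simplification: unquoted values are ignored by this sketch.
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


class GtfRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> Optional[GtfRecord]:
    """Parse one GTF line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    attributes = dict(_ATTR_RE.findall(fields[8]))
    return GtfRecord(fields[0], fields[1], fields[2],
                     int(fields[3]), int(fields[4]),
                     fields[5], fields[6], fields[7], attributes)


def iter_gtf(path: str) -> Iterator[GtfRecord]:
    """Yield records from a plain or gzip-compressed GTF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            record = parse_gtf_line(line)
            if record is not None:
                yield record


if __name__ == "__main__":
    # Illustrative line with placeholder identifiers.
    example = ('I\tWormBase\texon\t100\t250\t.\t+\t.\t'
               'gene_id "GENE_A"; transcript_id "GENE_A.1";')
    print(parse_gtf_line(example))
```

After downloading a file, `iter_gtf("c_elegans.PRJNA13758.WS253.canonical_geneset.gtf.gz")` would stream its records one at a time, which keeps memory use low on genome-sized files.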
We would like to remind users that our FTP site provides access to various data files. WormBase has recently improved the organization of its FTP site so that users can easily browse and find different data files. We have made nearly every file directly accessible without needing to know what the current version of WormBase is. For example, the following link will always point to the most current release of WormBase:
WormBase maintains a public FTP site where you can find many commonly requested files and datasets, the WormBase software and prepackaged databases. DNA sequence data for the genomes of C. elegans, C. briggsae, C. remanei, etc., are available in FASTA format, as are protein data. Microarray data, such as the up-to-date mapping of microarray probes to WormBase genes for Affymetrix, Agilent, Washington University Genome Sequencing Center and Stanford Microarray Database (SMD) chips, are also made available. For C. elegans, the following files are downloadable from the FTP site: confirmed_genes, which lists curated C. elegans genes that have been confirmed by transcriptional data; wormpep, FASTA-format files containing predicted and confirmed protein translations; and many other files.
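Since the sequence and protein files are in FASTA format, a small reader is often all that is needed to get started. The sketch below assumes only the general FASTA layout (a header line starting with `>` followed by one or more sequence lines). The function name `read_fasta` and the records in the usage example are hypothetical and are not taken from wormpep.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading '>'; sequence lines
    belonging to one record are concatenated.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [">protein_1 example\n", "MKT\n", "LLV\n", ">protein_2\n", "MA\n"]
    for name, seq in read_fasta(demo):
        print(name, len(seq))
```

Because the function accepts any iterable of lines, it works equally well on an open file handle, for example `read_fasta(open(path))` for an uncompressed download.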
Take a look at our FTP site at ftp://ftp.wormbase.org/pub/wormbase/. Be sure to look at the README file in each directory for a listing of the contents of that directory.