Short WormBase downtime yesterday

WormBase went down at 2:54 pm ET yesterday. Ironically, the problem was caused by a recently introduced fault recovery mechanism, which takes faulty servers out of service and replaces them with healthy ones. The process involves two components: the detection of faulty servers and the creation of new ones.

Two simultaneous problems, one in each component, led to the failure. A false positive in fault detection caused the server to be taken out of service, and a missing hard drive snapshot prevented new servers from being started.

To avoid such occurrences in the future, we are looking into ways to make the fault detection more specific and to prevent important disk snapshots from being deleted by automated processes.
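As a rough illustration of what "more specific" detection could mean, the sketch below flags a server only after several consecutive failed health checks, so that a single spurious failure does not take it out of service. This is not the actual WormBase implementation; the `HealthTracker` class, the `should_replace` helper, and the threshold of 3 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Flags a server as faulty only after several consecutive failed checks."""

    failure_threshold: int = 3  # hypothetical value, not a WormBase setting
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health-check result; return True if the server should be replaced."""
        if check_passed:
            # Any successful check resets the streak, so isolated blips are ignored.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def should_replace(results, failure_threshold=3):
    """Return True if any point in the sequence of check results triggers replacement."""
    tracker = HealthTracker(failure_threshold=failure_threshold)
    return any([tracker.record(passed) for passed in results])
```

With this approach, a sequence such as pass, fail, pass, fail would not trigger a replacement, while three failures in a row would. The trade-off is that a genuinely faulty server stays in service slightly longer before it is replaced.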

Most of the web services were restored around 5 pm ET. Apologies for any inconvenience caused.
