Surviving a RADOS outage

Hi ~okeanos users,

The ~okeanos service recently experienced some disruptions, as we informed you in a previous blog post. Let's shed some light on what caused them.

The ~okeanos storage backend (Archipelago) is backed by Ceph, an open-source, distributed, software-defined storage solution. On Friday, September 9, at 9 p.m., Object Storage Daemons (OSDs) of the RADOS cluster (Ceph's underlying object store) became unstable, and the cluster began erroneously marking OSDs as down. This led to an I/O freeze and left the cluster malfunctioning. To ensure data integrity, we had to resort to special handling and tooling, and we made the cluster fully functional again on Tuesday, September 13.

If you are interested in more technical details, our NOC team has written an analysis of the incident in this blog post.

We would like to thank you for your support while we handled this unexpected outage. We will use the lessons learned from this incident to better tune and monitor the service, in line with our commitment to high levels of service availability.

the ~okeanos team