Monday 15th April 2019

Cloud Platform: Short outage

One of our Ceph nodes began blocking requests at 2019-04-15 16:01:06 CEST.

At 2019-04-15 16:09:18 CEST I issued a reboot on that node. From that moment on, requests were processed by the other nodes in the cluster. Around 16:15:00 the rebooted node was back online and Ceph recovery began. It took some time until everything was back to normal, including the Ceph recovery of the rebooted node.

On March 29th we experienced a very similar issue. After comparing the two incidents, we found that the issue is related to our update process. We have since improved our update process and will keep monitoring both the current situation and upcoming updates.