Degraded performance for some On Demand Small customers. July 2nd
Yesterday, Litium experienced degraded performance. The issue was limited to one of our web servers and affected some of our OnDemand Small customers.
Litium would like to apologize for the impact this has had on the affected customers. With this postmortem we would like to explain what happened and the steps we have taken to ensure it doesn’t happen again.
The problem was related to a high volume of disk reads and writes occurring while a backup job was running. Because of the stress on the backend storage, the backup job also took much longer than normal to complete. When the job was done, we rebooted the server to free up resources. The reboot, together with the completion of the backup job, returned performance to normal.
The effect of this incident was that customers on this server may have experienced slower sites than usual. In addition, there was a service interruption of about 10 minutes while we rebooted the server.
We have taken a few steps to ensure that this will not happen again.
To begin with, we have moved the server's storage to faster SSD storage. As an additional safeguard, our long-term plan is to move some of the workload off the server.