Degraded performance for some On Demand Small customers.
Incident Report for Litium AB
Postmortem

Post-incident report

Regarding incident:
Degraded performance for some On Demand Small customers. July 2nd

Yesterday Litium experienced degraded performance. The issue was limited to one of our web servers and affected some of our On Demand Small customers.

Litium would like to apologize for the impact this has had on the affected customers. With this postmortem we would like to explain what happened and the steps we have taken to ensure it doesn’t happen again.

What happened

The problem was related to a high volume of disk reads and writes occurring while a backup job was running. Because of the stress on the backend storage, the job also took much longer than normal to complete. When the job was done, we rebooted the server to free up resources. This, together with the completion of the backup job, brought performance back to normal.

As a result of this incident, customers on this server may have experienced slower sites than usual. In addition, there was a service interruption of about 10 minutes while we rebooted the server.

Steps taken

We have taken a few steps to ensure that this will not happen again.

First, we have moved the server's storage to faster SSD storage. As an extra precaution, the long-term plan is also to move some of the workload off the server.

Posted Jul 04, 2019 - 16:03 CEST

Resolved
We have monitored the situation since the fix was implemented last night, and it has been stable.
More information will come in the postmortem.
Posted Jul 04, 2019 - 14:35 CEST
Monitoring
A fix has been implemented. We will actively monitor the situation for the rest of the evening and night.
Posted Jul 03, 2019 - 17:39 CEST
Update
As part of the solution, we will have to restart the affected server.
We expect a brief downtime of about 15 minutes.
Posted Jul 03, 2019 - 17:05 CEST
Identified
We have identified the problem and are working on a solution. We will post further updates when we have more information.
Posted Jul 03, 2019 - 14:57 CEST
Investigating
Degraded performance affecting some of our On Demand Small customers.
We are investigating the issue and will update when we have more information.
Posted Jul 03, 2019 - 09:28 CEST
This incident affected: Cloud (Web servers).