Posted by James on April 14, 2014
Incident Report - 14th April 2014
This article was originally published on the ShareLaTeX blog and is reproduced here for archival purposes.
On Sunday the 13th of April, between approximately 4.15am and 11pm (UTC), ShareLaTeX was down for the longest time since it was launched. We're sorry. The end result was that we had to restore some user accounts and projects back to how they were on Saturday morning (12th April, at 6am UTC) in order to bring the site back up. If your account was one of those affected, you can still access the latest versions of your projects via https://www.sharelatex.com/restore (update 20/02/2019: archival link, no longer active), and we have no reason to think there was any significant data loss - it's just a bit of a mess, sorry!
The purpose of the rest of this post is to let you know what happened in more detail and to explain what we're going to do to prevent something like this in future.
At approximately 4am UTC, there was a power cut in the data center that contains our servers, and ShareLaTeX was effectively turned off at the wall. The servers that power ShareLaTeX slowly came back online over the next 6 hours, and everything was powered on by 10am.
Unfortunately, when we brought our database servers back online, we discovered that the abrupt power cut had corrupted one of our servers. We run our database servers in a replica set so that if one fails, another one should be able to take over. However, for a reason that we are still investigating, the corruption either replicated to all of the database servers, or put the replica set into a state where it wasn't happy.
Over the next 8-10 hours, we attempted to repair the database. This took a lot longer than expected, since we first had problems bringing up a duplicate of our database server to attempt the repair on. This was to do with an unfortunate (and seemingly unrelated) hardware problem with our hosting company. We then tried to repair the database twice, each time taking a few hours and both unfortunately failing.
At this point we decided to instead restore any corrupted data from our latest backup. The latest backup that we had was from Saturday morning (around 6am UTC). The power cut happened just before our Sunday morning backup would have been taken. From around 8pm to 11pm, we built up a new database from the backup to replace the corrupted parts of our running database.
This leaves us in the situation we are in now, where some projects and user accounts may be in an older state from Saturday morning. Fortunately we also have another, more up-to-date set of project backups which are unrelated to the database. These cannot easily be added directly into ShareLaTeX, but you can get the latest versions of your projects from https://www.sharelatex.com/restore (update 20/02/2019: archival link, no longer active).
We plan to make a few changes in our policies in response to this incident.
- We will be taking more regular database backups, preferably every hour, but certainly more regularly than daily.
- We are investigating why the entire replica set was affected by a fault with one database server, and how we can prevent this problem in future.
- We should have begun the procedure of restoring the backup as soon as we knew there was a problem with the database. We might not have needed it, but having it ready could have saved us a few hours of downtime.
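To put the first of these changes in numbers, here is a rough sketch of how much recent work is exposed when a failure hits between backups. The function names are made up for illustration, the 4.15am failure time is only approximate, and it assumes backups run on a fixed schedule starting from the Saturday 6am backup. It is not our actual backup tooling.

```python
from datetime import datetime, timedelta, timezone


def last_backup_before(failure, first_backup, interval):
    """Time of the most recent scheduled backup at or before `failure`."""
    if failure < first_backup:
        raise ValueError("failure happened before the first backup")
    # Dividing one timedelta by another with // gives a whole number of intervals.
    completed = (failure - first_backup) // interval
    return first_backup + completed * interval


def exposure(failure, first_backup, interval):
    """How much recent work is not covered by the latest backup."""
    return failure - last_backup_before(failure, first_backup, interval)


# Times taken from this report (both approximate, UTC).
saturday_backup = datetime(2014, 4, 12, 6, 0, tzinfo=timezone.utc)
outage_start = datetime(2014, 4, 13, 4, 15, tzinfo=timezone.utc)

daily = exposure(outage_start, saturday_backup, timedelta(days=1))
hourly = exposure(outage_start, saturday_backup, timedelta(hours=1))

print("daily backups: ", daily)
print("hourly backups:", hourly)
```

With daily backups the gap is most of a day, because the failure landed just before the next backup was due. With hourly backups the gap can never be more than an hour.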
Please accept our heartfelt apologies for this incident. Sunday was a very stressful day for us (and for lots of you with deadlines!), but some of your kind comments and words of encouragement were really helpful, thank you. Having done a PhD myself, I know the pain and fear of not being able to access important work, and I am beyond frustrated with our large downtime in this case. We will do everything we can to stop this happening again.
James and Henry