Thank god that's over!
So we're back, with no loss of time on the site (no r-word), as promised.
You may have already read the explanation of what happened in our downtime message, but if you missed it (or want to know more), I'll go ahead and explain it here.
For the past few months, one of the biggest threats to our stability has been the gradual corruption of our database file (think of someone pulling files out of a filing cabinet and tossing them all over the floor). We're currently talking to MySQL, the company that makes our database software, to determine whether this is a problem with how we use it or (more likely) a bug in the database program itself. If it's a bug, we'll help them track it down and fix it.
The temporary solution is a complex procedure in which we dump the entire database and "re-import" it, creating a brand-new database file free of any corruption. We can do this transparently, with only minimal downtime, but the procedure takes about a day.
Unfortunately, we were too late: the site crashed while we were in the middle of dumping the database. Both copies of the database were so corrupt that they crashed immediately upon being loaded.
Once we realized we had this problem, our first choice was to restore from a backup made earlier that day and then "replay" all of the events that had happened on Subeta since then (from about 10:30 am until the site crashed at 7:45 pm). This is possible because the database logs every change, and we can use those logs to replay events.
However, this method of recovery didn't work: replaying yesterday's events against the backup simply brought the database back to the same corrupt state it had been in before.
That left us no choice but to finish dumping and re-importing the data (from the backup made yesterday morning). Once that finished (around 1:15 this afternoon), we began replaying yesterday's events until the site had caught up completely, leaving us with a copy of the database free of corruption.
There is an upside, however: dumping and re-importing the database also optimizes it (ours shrank by about 66% as a result), and an optimized database is a faster database.
So yeah, that's the whole story. We're working with MySQL to fix the corruption issue, and if that takes too long or doesn't turn out the way we want, we'll look into alternative database systems. One way or another, we'll find a solution to the problem.
--Alex