Hi everyone. Unfortunately, as many of you noticed, we had 2 moderate (more than 15 minutes) periods of downtime today, one this morning and one this evening. I wanted to let you guys know what happened, and what needs to happen as a result. :)
This morning, our database server locked up (which happens to us occasionally; you could call it "growing pains"). This morning's lockup was particularly bad: all 3 web servers became overloaded with people trying to load the site, and they got so slow that it took forever for me to log into any of them (for example, to take the site down and stop people from hammering the web servers xD).
To further complicate matters, the database became corrupted as a result of the lockup, and refused to start. At that point we switched to our backup database, which is synchronized with the main one, and brought the site back up.
The downtime this evening was the result of the backup database (db2) locking up, which was corrected with a restart of the database.
This left our primary database server (db1) offline though, so Subeta will be taken down Tuesday morning at 3 am (that is, overnight Monday into Tuesday) in order to synchronize the two and return to operating on db1.
In case you're curious why the database locks up like this: Subeta uses caching to save the answers to frequently asked questions that would otherwise be asked of the database (for example: "give me a random avatar to show on the homepage"). The cache expires after a certain time, however, and if there are lots of users on the site when that happens, hundreds of them suddenly ask the database the same question at once. You'd think this wouldn't cause a problem, but the database really doesn't know that lots of people are asking the same thing, so it dutifully tries to answer the same question hundreds of times. Meanwhile, all the normal Subeta queries get backed up while 200 database threads compete over access to the same information, and eventually it gets so bogged down that it only answers 100 or so questions/second, as opposed to the 5,000 we normally run.
There are ways around this, of course, and we've been implementing them whenever we come across this problem. For the moment, though, whenever we fix one of these problems (like the random HA on the home page), another one steps up to take its place. Rest assured that each time this happens, progress is being made!
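For the curious, here is a rough sketch of one common way around the "everyone asks at once" problem: when a cached answer expires, let only one request go to the database while the others wait for its result. This is not Subeta's actual code (the post doesn't say which fix was used or what language the site runs on); it's Python for illustration only, and every name in it (SingleFlightCache, slow_query, "random_avatar", demo) is made up.

```python
import threading
import time


class SingleFlightCache:
    """TTL cache in which only one thread recomputes an expired key."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._values = {}               # key -> (value, expiry time)
        self._locks = {}                # key -> lock guarding recomputation
        self._guard = threading.Lock()  # protects the _locks dict itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key, compute):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]             # fast path: cached and not expired
        with self._lock_for(key):
            # Re-check: another thread may have refilled the cache while
            # we were waiting for the lock.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = compute()           # only one thread per key gets here
            self._values[key] = (value, time.monotonic() + self.ttl)
            return value


def demo(n_threads=200):
    """Simulate many users hitting an empty cache at the same moment."""
    calls = []

    def slow_query():                   # stand-in for the real database query
        calls.append(1)
        time.sleep(0.05)
        return "avatar_42"

    cache = SingleFlightCache(ttl_seconds=60)
    results = []

    def worker():
        results.append(cache.get("random_avatar", slow_query))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results


if __name__ == "__main__":
    n_calls, results = demo()
    print(f"{len(results)} requests served by {n_calls} database call(s)")
```

The trade-off is that the waiting requests are held up for as long as that one query takes, but the database sees a single question instead of hundreds. A real site with several web servers would need the lock to live somewhere shared between them, which this sketch doesn't attempt.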
Anyway, in summary: the site will be down Tuesday from 3 am to 4:30 am to resync the database servers.
-Alex

thank you for the explanation, you keep a lot of people from pulling out their hair ^^