Replies

Oct 23, 2014 Official
Keith
is sweet
Eradication

We were sleeping, literally.

When we woke up this morning and were made aware of the site being down, we took to fixing it. There was no one on staff who could post to the Twitter or Facebook accounts (not everyone has those permissions, nor should they) or who could actually comment on the status of the servers.

When we have information for you, you'll get it. Posting "The site is down!" on Twitter tells you exactly as much as the site itself being obviously, visibly down.

[edit]

And, this wasn't a DDoS attack. Battling overnight killed the cache server (battling is the most resource-intensive thing we have on the site), which in turn makes the site run REALLY SLOW because everything has to go through the database instead of hitting cache. Unfortunately, the cache server still showed itself as up (a green dot on our dashboard), and because the web servers were technically still able to make calls to the database, they were considered to be running. That means no alert was sent to my phone to let me know that the servers were offline at 4AM, or whatever.

I'm going to look into how we can solve that particular problem going forward; a rough sketch of the kind of check I have in mind is below. Other than that, and the DDoS attack the other day, we've had pretty good uptime during this event, and very little lag.
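For the technically curious: the trick is to stop trusting the green dot and have the monitor do a real write/read round trip against the cache itself, so a wedged-but-listening server counts as down. A minimal sketch of that kind of check, assuming Python with the pymemcache library; the hostname and the alerting hook are placeholders, not our actual setup:

```python
# End-to-end cache health check: don't just ask "is the process up?",
# actually write a value to memcached and read it back.
import time

from pymemcache.client.base import Client

CACHE_HOST = ("cache.internal.example", 11211)  # placeholder, not a real host


def cache_is_healthy(timeout_s: float = 2.0) -> bool:
    """Return True only if a full set/get round trip succeeds."""
    try:
        client = Client(CACHE_HOST, connect_timeout=timeout_s, timeout=timeout_s)
        token = str(time.time()).encode()
        client.set("healthcheck", token, expire=30)
        return client.get("healthcheck") == token
    except Exception:
        # Connection refused, timeout, wedged server: all count as down.
        return False


if __name__ == "__main__":
    if not cache_is_healthy():
        # Hook this into whatever actually wakes a human at 4AM:
        # an SMS gateway, a pager service, etc.
        print("ALERT: memcached failed the round-trip check")
```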

💖 ✨ 🤗

Oct 27, 2014 Official
Keith
is sweet
Eradication

I'm not really going to read through the pages that I've missed since the last time I checked in, because this thread took a weird turn, but I will say some things, and this will probably be my last response here.

I spent the last part of last week building out more internal tools for staff members, with a few features that I think will really come in handy.

  • Social Media: Some staff members have been given the ability to tweet / post to Facebook. Tumblr is a little bit harder (their API is dumb), so we need to figure that out. Just adding accounts to the Tumblr isn't really the right way to do it (just like giving out the password to the Twitter account wouldn't be), because we need to be able to revoke access easily. So these abilities have been built into a panel.
  • Server administration: Some staff members now have the ability to add additional server capacity to the site, cycle servers that seem to be having issues, and, most importantly (because it was what prompted the crash that prompted this thread), restart the memcache server specifically. A rough sketch of what that kind of panel action might look like is below.
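
To give a feel for what that second bullet means in practice, here's a loose sketch of that kind of panel action in Python; the role name, the service name, and the audit logging are all illustrative, not our actual panel code:

```python
# Hypothetical permission-gated "restart memcache" button: staff get a
# panel action with an audit trail instead of shell access.
import subprocess
from datetime import datetime, timezone


def restart_memcache(staff_member: str, roles: set[str]) -> bool:
    """Restart the cache service if the staff member has the right role."""
    if "server_admin" not in roles:
        raise PermissionError(f"{staff_member} cannot cycle servers")

    # Record who pushed the button before anything happens.
    stamp = datetime.now(timezone.utc).isoformat()
    print(f"[{stamp}] {staff_member} is restarting memcached")

    # In reality this would go through an orchestration layer or SSH to
    # the cache host; a local service restart stands in for that here.
    result = subprocess.run(
        ["systemctl", "restart", "memcached"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```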

Things I'm spending the start of this week working on:

  • Some sort of user reporting tool, either automatic or manual, that tracks how long the (server-side) parts of the page take to load, and alerts us in Slack (our staff chat program) if that gets beyond 10s. A sketch of the idea follows this list.
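
The shape of that tool is pretty simple: time the server-side work, and post to a Slack incoming webhook when it crosses the threshold. A minimal Python sketch; the webhook URL, names, and threshold wiring are placeholders, not our actual code:

```python
# Time a page's server-side render and alert Slack when it's slow.
import json
import time
from urllib.request import Request, urlopen

# Placeholder URL; real ones are issued by Slack's incoming-webhook setup.
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
THRESHOLD_S = 10.0


def alert_slack(message: str) -> None:
    """Post a plain-text message to a Slack incoming webhook."""
    req = Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urlopen(req, timeout=5)


def timed_render(page_name: str, render):
    """Run a page's server-side render and report it if it was slow."""
    start = time.monotonic()
    try:
        return render()
    finally:
        elapsed = time.monotonic() - start
        if elapsed > THRESHOLD_S:
            alert_slack(f"{page_name} took {elapsed:.1f}s server-side")
```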


We're listening to you. There are some things that are absolutely out of our control (DDoS attacks), where the best we can do is try to mitigate them when they start. Typically we'd be protected from them (thanks to excellent support from Cloudflare and our own server-side protections), but in that case the attackers scoped out Subeta and found pages where no data was cached (mostly by finding the pages that took the longest to load on a "cold" refresh) and attacked those pages - which is a little different from a typical DDoS attack, and it took us a lot longer to figure out wtf was going on. We've patched up those pages, but I'm sure there are other flaws like that which are very difficult for us to find on our end.
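
For anyone wondering what "patching up those pages" amounts to: the general fix is the standard cache-aside pattern, so even a stone-cold page pays the expensive database cost once, and everyone after that is served from cache until the entry expires. A sketch in Python with placeholder host and key names, not our actual code:

```python
# Cache-aside: look in memcached first, compute and store on a miss.
import json

from pymemcache.client.base import Client

cache = Client(("cache.internal.example", 11211))  # placeholder host


def cached(key: str, ttl_s: int, compute):
    """Serve key from cache, falling back to one expensive compute on a miss."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = compute()  # the slow database work a cold page would otherwise do
    cache.set(key, json.dumps(value).encode(), expire=ttl_s)
    return value
```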

And there are some things that are absolutely within our control, like keeping the site up the rest of the time, and we're always working on that.

💖 ✨ 🤗

Oct 27, 2014 Official
Keith
is sweet
Eradication

I'd like to clarify: just because I didn't read pages 4-7 of this thread doesn't mean that other staff members didn't. We're in a chat room with each other 24/7, we communicate about threads like this (and others all over the site), and we decide where each of us should respond. The information got to me, and in the meantime I was coding the very response mechanisms that I talked about and that are still being coded. Just because I don't respond to something (or don't read it directly) doesn't mean that I'm not being told what is said in the thread and given a task list of things to program based on the feedback from it.

We have staff members who work through the night (UAs answering tickets) who now have the ability to cycle servers or bring up additional ones if the site needs it. Those staff members also have access to post to Twitter and Facebook saying that's what they're doing, as well as to bring the site down or add a bar to the top of the pages announcing it.


This wouldn't have been possible two weeks ago, before the infrastructure changes that were posted about in the site announcement forum. We're always trying to move forward and keep the site stable.

I don't think that staff communication is the problem this thread is making it out to be. We've been using the site feedback group, and I (and other staff members) have been incredibly communicative about changes and what is happening. We've been responding to the feedback, suggestion, and bug forums very frequently. This one time the site went down for a long span and there was no response, but just the week before, when the site was having problems for the same amount of time, there were responses to it all over the site. Obviously we need to get better at making sure this happens in every situation; this time the ball was dropped, and we've made changes to try to prevent that in the future.

💖 ✨ 🤗

Oct 27, 2014 Official
Keith
is sweet
Eradication

COMPLETELY OFF TOPIC, but yes omg, the Tiny Speck people are AWESOME. They're really great, and Slack is an amazing product built out of love for what they do. Anyone with a company or team should look into it. ❤

💖 ✨ 🤗

Oct 28, 2014 Official
Carol
had too many

Quote by JESSYTA

Not trying to be nitpicky but...
We're in a chat room with each other 24/7
Making statements like that, which any intelligent player can see is inaccurate, as it is simply not possible, does more harm than good. It contradicts the "we were all sleeping" reasoning we were given for the extended downtime earlier in the thread. Please... don't blow smoke up our asses.

Unless I'm misunderstanding something (very possible), it's 100% true. We used to use IRC for our staff chat, which required clients, logging in and out, etc., and didn't allow us to be in constant contact. We now use Slack, which has totally changed the way we're able to communicate with one another. Even if one of us isn't at a computer, we're connected via our phones and tablets. You can set up your notifications in Slack so that being pinged actually alerts your phone that someone is looking for you. The admin group all have that set up, as well as old-fashioned phone numbers. Now, we try to be considerate of one another's time, since we all work long hours and some of us have families, multiple jobs, etc., so we hesitate to go crazy with the pings unless it's truly an emergency.

In terms of what happened last Thursday/Friday overnight, it was basically a perfect storm of "whatever can go wrong, will go wrong" on our end. The site went down after the last of our admin group had logged off for the night, and since I'm several hours ahead of a large portion of our staff, I was the first one to notice a major issue: I logged on at 7:00 am to start my day and realized that I was not able to bring the site back up despite taking the typical measures. The very small group of people we did have checking in overnight assumed that other people already knew about or were working on the issue, and it wasn't escalated as it should have been. As soon as I realized it wasn't going to be something I could manage on my own, I posted on our Facebook page apologizing for the downtime. That was at approximately 9:30 am Subeta time, or 8:30 am where I am located. Keith was actively working on the issue from that point forward, and it did take him some time to correct the issue with the cache servers that caused the outage. As soon as the site came back up, about 2 hours after that, I posted on Facebook again that we were up.

SO, to sum up (because TL;DR), here's where we dropped the ball:

  1. We did not have staff who were empowered to alert the proper people about the outage overnight when it occurred.
  2. Those of us who were on early in the morning did not have the technical capabilities to fix the issues on our own, and did not escalate the issue to "emergency" status as quickly as we should have.
  3. While I updated Facebook as soon as I knew it was a big problem, I did not have access to the Twitter account at the time, so I couldn't push updates there as well.

What we have DONE since then to make sure we're improving:

  1. Identified staff members who we know are on overnight hours, and given them both the permissions and the access they need to either troubleshoot server issues OR alert the right people if it's something major or out of their control.
  2. Improved our admin panel with additional options, ones that staff members who are NOT programmers can understand and use, to control not only uptime but also site speed AND communication (this gives more people the ability to try initial measures before having to contact an actual programmer).
  3. Set up a protocol for taking the site down, including posting a warning alert when we're able to. (We did this on Friday night when we took the site down for ~5 minutes to cycle the servers, which is the last time we've had to do so.)
  4. Clarified with each admin staff member their preferred form of communication during non-working hours, and when it's appropriate to contact them in an emergency.

Keith also mentioned several measures that he is personally working on, all of which are meant to improve our processes and our abilities to keep the site running smoothly with the number of people we have to manage it.

This site is never going to be 100% perfect - it's just not technically possible. Even the largest websites and services (Facebook, Twitter, Gmail, Ticketmaster, etc.) have lag, downtime, and unexpected outages. The difference between them and us is obviously the sheer number of people they have available to correct and communicate about issues, versus the small crew we have working 24/7 to make this game work for you guys, and the even smaller crew who are able to technically manage the back end of the site and the servers.

That does not mean that it's not important to each and every one of us; it just means that we don't always handle it in the best way or as quickly as we'd like to. We are constantly working to improve that - unfortunately, it's not just a switch we're able to flip. What we can absolutely continue to improve is our communication when issues occur, and hopefully you guys have seen, and will continue to see, the measurable, actionable steps we have implemented over the last several months in order to do so.
