Thursday, August 25, 2016
April Linden Explains Tuesday's Grid Problems, "We're Sorry About This Outage"
On Tuesday, Linden Lab posted an "Unscheduled Maintenance" notice at 11:03 AM about the Grid problems, asking residents to temporarily stop making purchases inworld and buying Lindens.
We are aware that some users are currently being logged out, and all users are currently unable to log in to Second Life. Additionally, please refrain from transacting on the LindeX or in-world as purchases are likely to fail. Some regions are also unavailable at this time, and teleports to other regions may fail currently. We have identified the issue causing this situation and are currently working to resolve it. Please monitor this blog for additional updates.
It was 2:49 PM when they finally gave the all-clear. A few residents I talked to later mentioned the downtime. One told me that adjustments she had made while going through several sets of clothes had been undone, and she wondered whether the Grid problems had something to do with it.
April Linden decided to make a statement the following day explaining what happened.
Shortly after 10:30am, the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.
A few minutes before 11:30am we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method - turning on about half of the servers at once. Normally this works out as a throttle pretty well, but in this case, we were well into a very busy part of the day. Demand to login was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.
Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.
We tried again at roughly 12:30pm, doing a third of the login hosts at a time, but this too was too much. We had to stop on that attempt and shut down all logins again around 1:00pm.
On our third attempt, which started once the system cooled down again, we took it really slowly, and brought up each login host one at a time. This worked, and everything was back to normal around 2:30pm.
My team is trying to figure out why we had to turn the login servers back on much more slowly than in the past. We’re still not sure. It’s a pretty interesting challenge, and solving hard problems is part of the fun of running Second Life.
Voice services also went down around this time, but for a completely unrelated reason. It was just bad luck and timing.
We did have one bright spot! Our status blog handled the load of thousands of Residents checking it all at once much better. We know it wasn’t perfect, but it showed much improvement over the last central database failure, and we’ll keep getting better.
My team takes the stability of Second Life very seriously, and we’re sorry about this outage. We now have a new challenging problem to solve, and we’re on it.
"We're sorry." Not words that we hear from Linden Lab every day. But this time, it's on the record that they spoke them.
Hat Tip: Inara Pey