More Details On What Caused the Sidekick Disaster

Want more geeky details on what happened at Microsoft/Danger? The short of it is that the SAN took a nose-dive and took the parity drives that could have rebuilt the data down with it. A total of 800TB of data was lost. There was an off-site tape backup, a reasonable backup measure, but with the way the SAN died the entire RAID array needed rebuilding, and 800TB is a lot of data to restore from tape. There are more details on the server moves in the info below.
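For a sense of scale, here's a back-of-envelope restore estimate. The tape generation and drive counts below are our assumptions for illustration, not confirmed hardware: LTO-4, current in 2009, held roughly 800GB native per cartridge at roughly 120MB/s per drive.

```python
# Rough restore-time estimate for 800TB from offsite tape.
# ASSUMPTIONS (not confirmed hardware): LTO-4 cartridges,
# ~800 GB native capacity, ~120 MB/s sustained per drive.

TB, GB, MB = 10**12, 10**9, 10**6

data_bytes = 800 * TB
tape_capacity = 800 * GB
drive_speed = 120 * MB  # bytes/second, streaming

tapes_needed = data_bytes // tape_capacity

def restore_days(num_drives):
    """Days to stream everything back with num_drives working in parallel."""
    return data_bytes / (drive_speed * num_drives) / 86400

print(tapes_needed)              # 1000 cartridges
print(round(restore_days(1)))    # ~77 days with a single drive
print(round(restore_days(16)))   # ~5 days with a 16-drive library
```

Even before tape handling, seeks, and verification overhead, the raw streaming math alone lands in the multi-day ballpark the reports describe.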

The following is reportedly from “someone close to the action”:

“Here’s the actual scoop, from someone involved in the recovery:

Danger, purchased by Microsoft, was moved into a Verizon Business datacenter in Kent, WA a short while ago. While this had to do with the MS assimilation, it was done as a one for one move from Danger to a DC that MS uses heavily. (MS didn’t re-write, port, migrate to winblows, etc.) The backend service uses a variety of hardware, load balancers, firewalls, web and application servers, and an EMC SAN (Storage Area Network, think huge drive array connected with fiber.)

Well, last Tuesday the EMC SAN took a dump on itself. What I mean by that is the backplane let the magic blue smoke out. Usually, in the heavy iron class of datacenter products like an EMC SAN, this means you fail over to the redundant backplane and life continues on. Not this time, folks. In the process of dying, it took out the parity drives. What does that mean? It means the fancy RAID lost its ability to actually be a RAID. How much data got eaten by this mega-oops? 800TB. Why wasn’t it backed up? It was, to offsite tape, like it’s supposed to be. But when the array is toast, you can’t just start copying shit back.

Apparently EMC has been on site since Tuesday, but didn’t actually inform Danger/MS that their data is in the crapper until Friday afternoon. On top of that, EMC has done nothing to bring in replacement equipment between Tuesday and Friday. (In the Enterprise support world, that’s fucking retarded, multi-million dollar support contracts are that expensive for a reason.)

So what’s being done? Well, the good news is that the complex was slated to be migrated into the Verizon Business cloud services (not MS’s cloud per se, but it’s MS’s effort), and as part of that migration a newer, shinier SAN array was in the process of being implemented. But space isn’t ready for it on the datacenter floor, and you can’t just toss the EMC array and drop this one in its place; it’s from a different vendor and takes two racks instead of one. This means it’s being shoehorned into a different part of the datacenter than was originally planned, one that doesn’t have the necessary 3-phase power installed. So there’s a bit of work to be done, not to mention the restoral of 800TB of backup data from offsite tape.

Time to restoral? Looking like Wednesday at the earliest with techs working all weekend.”
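For readers wondering why losing the parity drives is fatal: single-parity RAID is just XOR across a stripe, which can reconstruct exactly one missing block. A toy sketch using byte-sized "drives" for illustration, not the actual EMC layout:

```python
# Toy model of single-parity (RAID-5-style) reconstruction.
# Parity is the XOR of all data blocks in a stripe; XOR-ing the
# parity with the survivors recovers exactly ONE missing block.
from functools import reduce
from operator import xor

def parity(blocks):
    return reduce(xor, blocks)

def rebuild(survivors, parity_block):
    return reduce(xor, survivors, parity_block)

stripe = [0x12, 0x34, 0x56, 0x78]   # four data "drives"
p = parity(stripe)                   # the parity "drive"

# One data drive dies: the stripe is recoverable.
assert rebuild([0x12, 0x34, 0x78], p) == 0x56

# Lose the parity drive AND a data drive, and there is nothing
# left to XOR against -- the stripe is gone, which is why the
# array stopped being "a RAID" the moment the parity drives died.
```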

Sounds like they know what they’re talking about, but since we haven’t been able to confirm this directly ourselves, we’re keeping it labeled as a rumor.

UPDATE (2009-10-15 01:02 PST): We’ve confirmed that Danger does indeed have servers in a Verizon Business Data Center, however it appears to be one in California, NOT Kent, WA. If you want to confirm, do a traceroute on one of Danger’s web proxies and you’ll find it ends up at danger-gw.customer.alter.net (157.130.202.122), an IP owned by Verizon (MCI) that appears to be around the San Jose/Santa Clara area. It’s possible (although unlikely) that the web proxy servers are kept separate from the user data servers though.

19 Responses to “More Details On What Caused the Sidekick Disaster”

  1. richmond Says:

    So the Sidekick franchise is moving to Verizon?

  2. TheMassiveBri Says:

    So this is what I actually do for a living, and yes, unless you have everything prepped and have actually simulated disasters, it is this hard. I’ll be impressed if they are back up by Wednesday. Even if they were using really nice new tape drives, they could still be looking at 800+ tapes to work through. I would not want to be these guys.

  3. Mikey Says:

    A chance of all our data coming back? Hey, I’m game.

  4. Cathy P Says:

    Errr…. Imma lil slow lol. So considering there are backup tapes, once all the computer stuff is replaced/repaired can they put the backup info in and restore our contacts? I originally thought there was no backup copy. So if that’s true I’ll be happy. Granted it would take a long time, but I don’t mind the wait if I can get my contacts back. A lot of business networking info went “poof” and I’d love to get it back.

  5. Lina Inverse Says:

    There are several issues with this account:

    No one at Microsoft/Danger cared enough to closely question EMC through 3 days of (as this is reported) total inaccessibility of the data? All while the clock is running on the damages owed to T-Mobile under their SLA?

    (Does this crash even feel like a sudden total loss of the data?)

    Microsoft/Danger told T-Mobile it was hopeless when it’s just a matter of time till they try to recover the data? That’s quite a reputation hit.

    Microsoft/Danger is attempting a drastically sped up move of the datacenter (which will take a fair amount of configuration time, very possibly more than restoring nearly a petabyte), instead of restoring in place?

    Possible, especially given poor enough management. But you would think that several people high up in Microsoft, especially the ones about to roll out their Azure cloud computing venture, would be trying to arrange better damage control, at least after T-Mobile dropped their “no hope” bombshell blaming Microsoft/Danger for the mess.

  6. stenny Says:

    I hate MS. I was pissed when they bought Danger. Everyone is jumping ship, and I doubt the Sidekick will ever be the same.

  7. Some Guy Says:

    “No one at Microsoft/Danger cared enough to closely question EMC for 3 days of (as this is reported) total inaccessibility of the data?”

    How do we know they didn’t? The first thing that would happen would be the engineers looking at stuff, figuring out what went wrong, what can be fixed, putting an action plan together, etc. It takes some time to go from that to management and PR folks deciding exactly what to say and when. Also, if the rumors about MS needing to hire back ex-employees are true, it could have taken a couple of days to get from “Oh ****.” at the realization of the scope of things to having an idea of what can and can’t be fixed (and based on the official periodic statements, that has been a moving target anyways).

    “(Does this crash even feel like a sudden total loss of the data?)”

    Yes. Keep in mind that phone/SMS/MMS kept on working. The data connection (the G/3G) is a connection to Danger’s servers, not T-Mobile’s data network. So a total loss of the servers in the data center WOULD result in a sudden loss of data connectivity, and that’s what we saw when the outage started.

    “Microsoft/Danger told T-Mobile it was hopeless when it’s just a matter of time till they try to recover the data?”

    Can’t argue with that. Though I suspect it’s more an issue of bad PR than anything.

    “Microsoft/Danger is attempting a drastically sped up move of the datacenter (which will take a fair amount of configuration time, very possibly more than restoring nearly a petabyte), instead of restoring in place?”

    The individual quoted in the blog post said that they moved the data center first, before the crash happened. They are restoring in place. I think the key thing to point out here is that there wasn’t a change to the hardware/software/etc. on the backend (though according to past rumors, MS may have made some poor choices regarding personnel. If true, that could have affected how long it’s taken to get things under control).

    Overall, I think the previous poster hit the nail on the head that the PR was handled horribly for all of this. The only way left to save any face at all is, after everything is fixed and running, for TMo/MS/Danger to come out and explain exactly what happened, why it happened, why the fix took as long as it did and exactly what is being done to prevent it from happening again.

  8. Lina Inverse Says:

    “No one at Microsoft/Danger cared enough to closely question EMC for 3 days of (as this is reported) total inaccessibility of the data?”

    “How do we know they didn’t?”

    Good point; this paints EMC as the bad actor from beginning to end (ignoring the PR mistakes), and if Microsoft/Danger asks, EMC doesn’t *have* to give them a real reply.

    “Microsoft/Danger is attempting a drastically sped up move of the datacenter (which will take a fair amount of configuration time, very possibly more than restoring nearly a petabyte), instead of restoring in place?”

    “The individual quoted in the blog post said that they moved the data center first, before the crash happened….”

    No, reread, e.g. “Well the good news is that the complex was slated to be migrated…”, and the author then adds how they don’t have enough space and need to get 3-phase power to it.

    [ We also agree that the PR has been very poor. ]

  9. Some Guy Says:

    Ah, yeah, I read that last bit too quickly. Still, I have more faith in the actual engineers working on the best/quickest fix than I do the management at MS or TMo or the PR folks handling the situation. I just hope they don’t end the whole thing with just a “Everything’s working again, here’s your $100 credit if you lost your data. Sorry ’bout that.” type of comment. If they do, they’re going to have a hard time getting people to trust them with their data again.

    I think the next OTA needs to include an option to backup/restore data to the memory card. USUALLY, having the data on the servers is a boon to keeping all your data safe in the event of a hardware failure or whatnot (as evidenced by the number of people who never gave backing up their data a second thought for years, despite changing hardware from one sidekick to the next over that time). But a SINGLE point of failure is always going to have the potential for data loss. If we could do our own periodic backups, then we could have the best of both worlds… And it would go a long way towards making people feel safer about their data on the sidekick going into the future.

  10. AG Says:

    Kent? Freaking KENT?! As in “Kent, currently planning for a flood-plain event engineers say is pretty much a lock for this winter?” Kent as in “we’re not actually worried about the overbuilt berm we call a dam bursting ’cause we’ll release water into the flood plain long before that happens?”

    ( http://www.seattlepi.com/local/411029_flood13.html , if you’re not from around here; plenty more where that came from if you run a search on either the P-I or the Times)

    Brilliant. As if EMC’s behavior doesn’t add enough poo to the pasta, now Sidekick users get to watch the weather south of Seattle and wonder if their server has water wings. People are scrambling to move their data OUT of south King County right now, not to it. Sheesh.

  11. TMOBILEISRETARTED Says:

    Microsoft destroys another thing, not surprised

  12. someone on the inside Says:

    No EMC gear at a Verizon site in Kent, WA. One of those three things (vendor, site owner, or location) is wrong.

  13. Secret Squirrel Says:

    I’m a storage admin and this explanation gets to the bottom of A LOT of the situation. I’ve never found any storage vendor to be “BAD,” just incompetent techs + bad communication from management! And this situation was FULL of that. I understand you don’t want to set expectations high, but “you lost your data” would NOT have been my first communication to anyone up the chain. I think someone somewhere gave up too fast. Regardless, by some rough calculations it would take 5 days minimum to recover from more than 100 backup tapes! Makes sense for the recommendation of warm (NON battery removed) reboots/power cycles to begin re-synching! =)

  14. Steve Says:

    Their setup may sound like a Mickey Mouse ‘all the eggs in one basket’ setup, but there are MANY companies doing exactly the same thing!! Big names too, very big names. Companies you use every day, in fact. The phone company, the power company, some banks, just to name a few. I bet Secret Squirrel does it. They put all their data in one big SAN (an EMC Symmetrix DMX) and they think that by backing up off site they are covered. Surprise! They’re not! Sure, you have a backup, but it will take a week to recover!

    Now, Google on the other hand uses cheap desktop drives in millions of servers. The email you mistakenly sent to the boss about what an idiot he is was copied hundreds of times across the globe. Sorry, that email isn’t going to be lost due to a crappy EMC backplane crash taking out the parity drive that turns a RAID into a JBOD!!
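Steve's replication point can be sketched with a toy key-value store (node and replica counts here are invented for illustration): every object is written to several independent machines, so no single dead box loses data.

```python
# Toy replicated key-value store: every value is written to
# `replicas` distinct nodes, so any single node failure is survivable.
# Node and replica counts are made up for illustration.
import random

class ReplicatedStore:
    def __init__(self, num_nodes=10, replicas=3):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def put(self, key, value):
        # Write the value to `replicas` distinct nodes in the cluster.
        for node in random.sample(self.nodes, self.replicas):
            node[key] = value

    def get(self, key):
        # Any surviving replica can serve the read.
        for node in self.nodes:
            if key in node:
                return node[key]
        return None  # only possible if ALL replicas are lost

    def fail_node(self, i):
        self.nodes[i].clear()   # simulate one dead server

store = ReplicatedStore()
store.put("email:42", "boss rant")
store.fail_node(0)              # one machine dies...
assert store.get("email:42") == "boss rant"   # ...the data survives
```

Contrast with a single array: one backplane failure takes everything at once, and the recovery path is a week of tape.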

  15. Alessandra Says:

    800TB of data?! WOW! That will take a while to restore, and from just reading that I don’t think we’ll be seeing the DI anytime soon. But if they can restore all of our data then I’m happy; let’s not get our hopes up too high. That’s A LOT of work to do, and then after that they’ll probably work on an OTA for the Sidekick LX ’09.

  16. Lina Inverse Says:

    With Microsoft/Danger now saying “Never mind, we’re restoring most if not all of your data”, this rumor is now the best one out there (assuming Microsoft is not lying now, at least at some level).

    The only question remaining is that Sunday “Repent, Sinners, for your data is lost forever!” communication … from T-Mobile.

    Did they get wrong info from Microsoft/Danger? I doubt they wanted to gratuitously stick it to Microsoft, since the fallout for T-Mobile is so bad.

    One possibility is that Microsoft decided it would cost too much to do a full restore (I suspect problems with their backups) or the people Roz Ho had access to didn’t have what it took, so T-Mobile sent out that message to light a fire under them (with Azure’s release only months away). T-Mobile has 100 million reasons ($100 times a million Sidekick users) to do anything they can to get a good outcome, so maybe they got desperate and threw down the gauntlet and then Microsoft decided to bite the bullet and do the hard work.

  17. shannon Says:

    Just an FYI to the first comment: just because the server(s) are located in a Verizon data centre does not necessarily mean Verizon is taking over; I’m pretty sure the phone and data companies are separate.

    Plus Verizon uses CDMA which is useless

  18. Lina Inverse Says:

    shannon: Yes, Verizon and Cellco Partnership doing business as Verizon Wireless are two separate companies; the latter is a 55-45% joint venture between Verizon and Vodafone Group.

    I wouldn’t say CDMA is useless (in fact it’s very clever, superior technology, and it’s no accident that Verizon Wireless, with its emphasis on quality of service, chose it), but for a variety of reasons it has a small share of the world’s networks, and with Qualcomm being such a junkyard dog it’s going to stay that way.

  19. shannon Says:

    CDMA is essentially sunset technology now anyway, with carriers building either LTE (preferred) or WiMAX networks, so it’d be pointless having a CDMA Sidekick (there were plans back in the SK Color days).

    But like I said, there’d be nothing to worry about. I still find it strange that only T-Mobile USA was significantly affected in this outage; T-Mobile UK and Telstra AU generally had no problems.


