Archive for the 'Microsoft' Category

Patching A Database Back Together

Wednesday, October 21st, 2009
Broken Danger Royal

How is Microsoft recovering all the user data without having a backup to turn to? Inside sources tell Daniel Eran Dilger that Sun (the storage vendor) and Oracle (makers of the database software) have sent in their best people to attempt to stitch the database back together. Why is it taking so long?

“The first thing to do is wheel in a big pile of new disk space, and copy the individual disks so there is a raw backup. This is like making a copy of a jigsaw puzzle one piece at a time. Then they would assemble the puzzle using the copied pieces, in case any pieces need to be re-made from the original.

“This is very hard, requires detailed inside knowledge of how SAN addresses and volume manager layouts fit together with Oracle tables. Finally, they need to start up the database on top of the assembled puzzle, and Oracle will do its own clean up to get into a consistent state.

“The next thing you do is a fresh backup (several days), before you allow any users access to it. So it’s not surprising that this would take over a week, even after it was possible to say that the data is recoverable.”
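For the curious, here is a rough idea of what that first “copy every piece” step could look like in miniature. This is only a sketch in Python, assuming each disk can be read as an ordinary file. The names raw_copy, sha256_of and copy_and_verify are made up for illustration and have nothing to do with whatever tooling the recovery team actually used. It copies a source image chunk by chunk and compares SHA-256 checksums, so you know the copy matches before anyone works on it.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time (an arbitrary choice)


def raw_copy(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src to dst byte for byte and return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Re-read a file from disk and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_path, dst_path):
    """Return True only if the copy on disk hashes the same as the source."""
    source_hash = raw_copy(src_path, dst_path)
    return source_hash == sha256_of(dst_path)
```

The checksum serves the same purpose as in the quote: you only start assembling the puzzle from the copy once you can show it is identical, and the original stays untouched in case a piece has to be re-made.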

I’m assuming that Microsoft considers the data “recovered” if they’re letting users access it. But we’re hearing reports of people missing contacts, some phone numbers for individual contacts, and other weird behavior including not even being able to download their info from the Desktop Interface. How is the data recovery working out for you guys?

Steve Ballmer Speaks Regarding the Sidekick Debacle

Wednesday, October 21st, 2009
Steve Ballmer Eats Sidekick

PCWorld has an article on Microsoft’s CEO, Steve Ballmer, and his attitude towards the Sidekick disaster. He was quoted as saying:

“It is something we are going to have to address and explain to customers our method and process and quality approach and what went wrong in that case and how we are making sure that it does not happen again”

“It is not clear there was data loss,” he said. “Initially we thought there was. We are working hard to get all the users’ data back in the Sidekick case. I think we believe we will get all user data back at this juncture.”

So rest assured, even the CEO of Microsoft is noticing this mess-up.

Good News, Everyone! Contact Recovery

Tuesday, October 20th, 2009

Good News, Everyone!

T-Mobile has posted instructions on how to download your contacts and restore them to your Sidekick. Basically, you log in to your account at my.t-mobile.com and a new link will appear on the front page: “Restore your contacts”.


Once you click on the link to restore your contacts, it’s a two-step process: download the contacts to your computer, then upload the downloaded file to your Desktop Interface. It will probably take a few minutes for them to appear on your device after this.

This is only the first phase of recovery, as T-Mobile, Danger and Microsoft have stated that the remaining PIM data (Calendar, Notes and ToDo’s) as well as Photos saved to the backend and not your SD card will be returning shortly. Perhaps through a similar method.

Let us know in the comments how the recovery goes and if you were in the group that lost all their contacts, or some of them. We’re curious to know!

Sidekick Backup Problems Blamed On Management

Monday, October 19th, 2009

I reported early on that the cause of the Sidekick disaster was a SAN upgrade that failed without proper back-ups in place. (see also: More Details On What Caused the Sidekick Disaster) Daniel Eran Dilger, who has chimed in several times on this situation, has confirmed this through another source as well. He has an excellent article detailing many of the factors involved in all of this mess. If you have the time, I recommend reading the article.

But for those that don’t have time, here’s a quick summary:
Why weren’t there backups?
– Confirmation from a source that the loss of data was caused by stopping a 6-day backup by order of Microsoft / Roz Ho against the recommendations of Danger engineers. Because the backup was started, they had to remove a backup from a couple months ago in order to make space available. So they were left with an incomplete backup, since it was only run for 2 of the 6 days necessary. The SAN upgrade proceeded and things apparently went wrong.

No big deal, Microsoft says they recovered “most if not all” of our data
– Daniel Eran Dilger is skeptical that this is actually the case. He believes that it is possibly just Microsoft in denial. He writes:

If the company has stumbled upon a novel recovery avenue or some unknown backup that somehow remained missing for nearly two weeks, then this is great news for Sidekick users and helps to wipe some of the egg from the company’s cloud computing services face, although the situation still remains as the worst datacenter failure to ever impact mobile users as well as one of the most absurd responses pertaining to lost data as well.

However, Microsoft is also well known for advertising bullshit it can’t deliver.

If Microsoft strings along users long enough, it will be able to pat itself on the back with a “mission accomplished” even if it ultimately never actually delivered anything. It’s like saying you’ll call somebody back after a date and then just waiting until they figure out that you’re not really interested. After two weeks, the party on the other end begins blaming itself for waiting around.

Final thoughts
I’m paraphrasing here:
– If Microsoft can deliver even most of most users’ data, that’s awesome but doesn’t make up for the fact this happened in the first place.
– If Microsoft can’t deliver the data and this is just public relations BS to keep the negative press away until mainstream media forgets about the whole issue, and things fade into the background, then this is ridiculous.

Further, with this announcement, even if the company has no real data to recover, it will have erected a plausible story for denying anything significant ever happened. Know somebody who actually lost their important Sidekick data? You’ll be able to write them off as “one of the few who didn’t benefit from Microsoft’s miraculous data recovery.” It will be their word against Microsoft’s PR. Nobody will have records of who was impacted and whose data was recovered apart from Microsoft and probably T-Mobile, and the provider will likely have its records sealed by court order when it gets its big SLA settlement from Microsoft.

His closing paragraph makes it very clear that we need to keep this issue documented and not let Microsoft weasel their way out of another situation. I’ve opened up SKFail.com as a forum for this. Please head over there and leave posts detailing your experience, whether you lost data, if you’ve had it restored, etc.

Danger Microsoft Able to Recover “Most” Data

Thursday, October 15th, 2009
u can haz ur datas back now

What started out as “the worst loss of consumer data” will probably now be called “Microsoft’s Worst PR Disaster”.

An update from Microsoft’s own Roz Ho was posted early this morning on the T-Mobile Sidekick forum. She says that they have recovered “most, if not all, customer data” and that it will all soon be restored according to plan. She also confirms that it was indeed a system failure that wiped out the database and back-ups.

So here’s hoping that the Sidekick Disaster will soon be over and everything will be back to normal. I’m still waiting for confirmation of the technical details on how this all happened and why Microsoft was so quick to say early on that all the data “almost certainly has been lost”. Oh, and one last handy tip to Microsoft/Danger: Make an application so that Sidekick users can ACTUALLY back up their data themselves.

The full post from Microsoft:

Updated: 10/15/2009 1:00 AM PDT

Microsoft Confirms Data Recovery for Sidekick Users

Data Restoration to Begin as Soon as Possible for Affected Customers

Dear T-Mobile Sidekick customers,

On behalf of Microsoft, I want to apologize for the recent problems with the Sidekick service and give you an update on the steps we have taken to resolve these problems.

We are pleased to report that we have recovered most, if not all, customer data for those Sidekick customers whose data was affected by the recent outage. We plan to begin restoring users’ personal data as soon as possible, starting with personal contacts, after we have validated the data and our restoration plan. We will then continue to work around the clock to restore data to all affected users, including calendar, notes, tasks, photographs and high scores, as quickly as possible.

We now believe that data loss affected a minority of Sidekick users. If your Sidekick account was among those affected, please continue to log into these forums for the latest updates about when data restoration will begin, and any steps you may need to take. We will work with T-Mobile to post the next update on data restoration timing no later than Saturday.

We have determined that the outage was caused by a system failure that created data loss in the core database and the back-up. We rebuilt the system component by component, recovering data along the way. This careful process has taken a significant amount of time, but was necessary to preserve the integrity of the data.

We will continue working closely with T-Mobile to restore user data as quickly as possible. We are eager to deliver the level of reliable service that our incredibly loyal customers have become accustomed to, and we are taking immediate steps to help ensure this does not happen again. Specifically, we have made changes to improve the overall stability of the Sidekick Service and initiated a more resilient backup process to ensure that the integrity of our database backups is maintained.

Once again, we apologize for this situation and the inconvenience that it has created. Please know that we are working all-out to resolve this situation and restore the reliability of the service.

Sincerely,
Roz Ho
Corporate Vice President
Premium Mobile Experiences, Microsoft Corporation

Class Action Lawsuits Filed Over Sidekick Disaster

Wednesday, October 14th, 2009
Sidekick LX 2009 fail

We all knew it wouldn’t be long before lawsuits were filed over the Sidekick outage and data loss. As Microsoft/Danger still struggles to restore data and get things back up and stable, people are already filing lawsuits, “claiming negligence and false claims.”

A suit filed for a Bakersfield, CA man and “all others similarly situated” says that Danger failed to handle Sidekick users’ data and advertised in a misleading manner. He’s asking for monetary damages as well as for the court to order Microsoft to fix the Sidekick service or offer a full refund. The attorney handling the case was quoted as saying: “We are hopeful that T-Mobile and the rest of the defendants will do the right thing, use this as an opportunity to redesign the system as a new standard for cloud computing storage, and provide full compensation for the data loss.”

Another class action lawsuit (PDF of filing) was filed for Maureen Thompson and again “all others similarly situated” against T-Mobile/Danger/Microsoft for the outage and loss of data. Same sort of thing.

And there’s yet another suit (PDF) filed by Oren Rosenthal against T-Mobile for the negligence, breach of contract, blah blah blah.

Should be interesting to see how these play out. If you hear of any others, let us know over on skfail.com’s Lawsuit forum.

image via kcoury

More Details On What Caused the Sidekick Disaster

Wednesday, October 14th, 2009

Want more geeky details on what happened at Microsoft/Danger? The short of it is that the SAN took a nose-dive and took out the drives that could have repaired the data with it. A total of 800TB of data was lost. There was an off-site tape backup, a reasonable backup measure, but with the way the SAN died the entire RAID array needed rebuilding and 800TB of data is a lot of data to rebuild. There are more details on server moves in the info below.

The following is reportedly from “someone close to the action”:

“Here’s the actual scoop, from someone involved in the recovery:

Danger, purchased by Microsoft, was moved into a Verizon Business datacenter in Kent, WA a short while ago. While this had to do with the MS assimilation, it was done as a one for one move from Danger to a DC that MS uses heavily. (MS didn’t re-write, port, migrate to winblows, etc.) The backend service uses a variety of hardware, load balancers, firewalls, web and application servers, and an EMC SAN (Storage Area Network, think huge drive array connected with fiber.)

Well last Tuesday, the EMC SAN took a dump on itself. What I mean by that is the backplane let the magic blue smoke out. While usually in the heavy iron class of datacenter products like an EMC SAN this means you fail over to the redundant backplane and life continues on. Not this time folks. In the process of dying, it took out the parity drives. What does that mean? It means the fancy RAID lost its ability to actually be a RAID. How much data got eaten by this mega-oops? 800TB. Why wasn’t it backed up? It was, to offsite tape, like it’s supposed to. But when the array is toast, can’t just start copying shit back.

Apparently EMC has been on site since Tuesday, but didn’t actually inform Danger/MS that their data is in the crapper until Friday afternoon. On top of that, EMC has done nothing to bring in replacement equipment between Tuesday and Friday. (In the Enterprise support world, that’s fucking retarded, multi-million dollar support contracts are that expensive for a reason.)

So what’s being done? Well the good news is that the complex was slated to be migrated into the Verizon Business cloud services (not MS’s cloud per se, but it’s MS’s effort.) And as a part of that migration a newer shinier SAN array was in process of being implemented. But space isn’t ready for it on the datacenter floor, and you can’t just toss the EMC raid and place this one in its place, it’s a different vendor and is 2 racks instead of one. This means it’s being shoehorned into a different part of the datacenter than was originally planned, one that doesn’t have the necessary 3 phase power installed. So there’s a bit of work to be done. Not to mention the restoral of 800TB of backup data from offsite tape.

Time to restoral? Looking like Wednesday at the earliest with techs working all weekend.”

Sounds like they know what they’re talking about, but since we haven’t been able to confirm this directly ourselves, we’re keeping it labeled as a rumor.
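If the “parity drives” part is fuzzy, a toy example may help. The sketch below is not how an EMC array works internally, and we don’t know the actual RAID layout involved. It assumes simple single-parity striping in the style of RAID 5, with made-up 8-byte blocks and hypothetical function names, and shows how XOR parity lets you rebuild one lost block from the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three made-up data blocks, one per imaginary disk (all the same length).
data = [b"contacts", b"calendar", b"todolist"]

# The parity block is simply the XOR of all the data blocks.
parity = xor_blocks(data)

# Pretend the disk holding data[1] died; rebuild it from the other two + parity.
rebuilt = rebuild_missing([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Note that the rebuild needs the parity block plus every surviving data block. If a data block and the parity block are both gone, there is nothing left to XOR against, and you are back to restoring from backup.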

UPDATE (2009-10-15 01:02 PST): We’ve confirmed that Danger does indeed have servers in a Verizon Business Data Center, however it appears to be one in California, NOT Kent, WA. If you want to confirm, do a traceroute on one of Danger’s web proxies and you’ll find it ends up at danger-gw.customer.alter.net (157.130.202.122), an IP owned by Verizon (MCI) that appears to be around the San Jose/Santa Clara area. It’s possible (although unlikely) that the web proxy servers are kept separate from the user data servers though.

Data Syncing Is Back

Wednesday, October 14th, 2009

T-Mobile reports:

Microsoft/Danger have now made the necessary fixes to their network to restore the ability to sync your data. This means that your contacts, calendar entries, to-do lists, tasks, etc. will now sync on the network, just like they did prior to this data disruption. However, you must power cycle your device following the steps below in order to begin the synching process.

But they do add the disclaimer that things are still “unstable” and that you should back up all your data. Note that a power cycle means going to Menu -> Power Off, letting your Sidekick turn off on its own, and then powering it back up. Do NOT do a hard reset, as your data will be lost.

New Uncensored Forum to Discuss the Sidekick

Tuesday, October 13th, 2009

After hearing reports about people having posts deleted and accounts banned on the official T-Mobile Sidekick forum and on PoweredByDanger.com, we saw a pressing need for a completely uncensored forum where users can discuss how this Sidekick disaster has impacted them and the experiences they’ve had with T-Mobile’s customer service reps. Enter the appropriately named Skfail.com.

We want to collect users’ experiences as well as news articles, statements from T-Mobile, etc. in one place that people can go. So head on over to the forums and speak up!

T-Mobile To Give Out $100 To Eligible Customers

Monday, October 12th, 2009
$100 bill Sidekick Theme

T-Mobile released their Monday statement. The short of it? If you lost your data, you’re getting $100. The long of it? Read below:

Updated: 10/12/2009 5:15 PM PDT
T-MOBILE STATUS UPDATE ON SIDEKICK DATA DISRUPTION, MON., OCT. 12

Dear valued T-Mobile Sidekick customers:

We are thankful for your continued patience as Microsoft/Danger continues to work on preserving platform stability and restoring all services for our Sidekick customers. We have made significant progress this past weekend, restoring services to virtually every customer. Microsoft/Danger has teams of experts in place who are working around-the-clock to ensure this stability is maintained.

Regarding those of you who have lost personal content, T-Mobile and Microsoft/Danger continue to do all we can to recover and return any lost information. Recent efforts indicate the prospects of recovering some lost content may now be possible. We will continue to keep you updated on this front; we know how important this is to you.

In the event certain customers have experienced a significant and permanent loss of personal content, T-Mobile will be sending these customers a $100 customer appreciation card. This will be in addition to the free month of data service that already went to Sidekick data customers. This card can be used towards T-Mobile products and services, or a customer’s T-Mobile bill. For those who fall into this category, details will be sent out in the next 14 days – there is no action needed on the part of these customers. We however remain hopeful that for the majority of our customers, personal content can be recovered.

via the Sidekick Forums

The interesting point is that T-Mobile knows who to send these $100 customer appreciation cards to. That means they know who lost data, which means they know what data was lost on the server. So the hope is that they were able to recover some users’ data and it wasn’t totally lost.