If you're like one particular service that I started using a couple of months ago, you ruin all of your good will and hard work inside of a week. I'm not about to call out who this service is, but I will gladly tell you exactly how they went about this epic failure of standard practices. So let's get started.
Step One: Plan a maintenance window
You've got to keep these customers happy with new features, not to mention, you've got to make room for the unexpected growth you've seen.
So let's do it all at once:
- Lots of new customer facing features? Check
- Lots of backend features to handle unexpected growth? Check
- Move to new hardware? Check
- Database updates across billion-row tables? Check
- Test plan using mirrored copy of current data set? Ain't nobody got time for dat.
- Backout plan in case something goes wrong? Pshaw, we won't need one!
- Timetable for how long this operation will take? Eh, can't take too long, right?
Here's the thing folks; never, and I mean never, run a maintenance window that involves multiple moving parts. If you can't perform these actions individually, then you've messed up somewhere in building this thing out from the start. If your upgrades require moving to new hardware, then do it separate from the rest.
Move to new hardware with the existing application and data. Don't mix features that aren't interdependent and make sure to test any database migrations on a full backup data set (or at least a good portion of it) before doing it on the live data set.
Next, always have a way to go back. If you're upgrading a database, have some way to revert back to the original in an instant. Whether it's a backup, a snapshot, or whatever. Don't depend on being able to back out the change you've made (i.e. running more SQL commands on the live data). You want a pristine place to draw from.
But now that you've screwed that up, let's continue on.
Step Two: Don't ever take the blame
Now that you've pushed this "New and Improved" version of your service in the most obscenely unprofessional way, you will definitely have something go wrong. It's not an if or when, it's just going to happen. First off, don't bother checking to make sure everything went well. Just go to bed and pat yourself on the back.
When you wake up in the morning and see things aren't working as you expect, don't bother replying to the customers' cries for help just yet. Let them know who's boss and who runs this joint. You, that's right. After awhile, give a little update. Remember what your english teacher taught you; less is more. Something like this will do just fine:
We're working hard to make our service better. Please bear with us while we continue to do so.
Some companies forget that their customers are not stupid. We know when something's wrong and in cases like this, we know you screwed up somehow. Don't make it worse by glossing it over. Gives us the straight poop, so to speak. The main word here is transparency. Most companies have this knee-jerk reaction of trying to make it look like "oh, we just had some bad luck, we didn't do anything wrong."
Believe me, even if you aren't transparent, the fact that you aren't is pretty transparent to us. Just like when my kids were 3 years old, I could tell when they were lying ("But dad, I swear, it was the dog that ate all the cookies mommy just baked"...as cookie crumbs fell from his face).
Now that you're busy ignoring the flames on the customer front, let's take care of this problem.
Step Three: Take your time
Who's in a hurry? Not this guy! Amiright? It's already broken, you have no capability to revert all of this crap, and you have to get these new features to the masses else what was it all for? Just keep pushing forward like an angry crowd at a music festival.
What you need to remember is that all of this work is for naught if your customers all leave. You have to be able to suck it up and back all of this out to get back to a stable base and come back at it later. The particular failure that I saw this past week may have been able to revert all of the mess they started. I don't know (they didn't talk much). I can only assume that a) they couldn't revert (bad planning) or b) they were so consumed with making this work, they decided to push forward.
It's hard to say which scenario is worse, but if you do find yourself in a position where an upgrade has broken things and you are able to revert back to a known good state, don't let your ego get the best of you. Just REVERT, go back to the drawing board, perform some postmortem, and start up again on a fresh day.
Lastly, above all else, talk to your customers regularly through the process. Waiting 4-15 hours between updates is a super bad idea. Making these status updates vague and absolving yourself of any responsibility is mistake number two.
Just remember, if you own a particularly new market, the only time a competitor will even think of jumping in is when you lose the trust of your customers. They will capitalize on your mistakes.