This isn’t from the point of view of a furious subscriber to a service, but that of a peer in a similar industry.
From Lord Of The Rings Online:
As the final stage of our datacenter move, all Turbine games will be offline on Tuesday February 23, from 4:00AM – 4:00PM Eastern Time (-5 GMT). Websites, including myaccount.turbine.com, forums, wikis, and social networks will be available, but players may be unable to log in or access their account information during this time. We thank you for your patience while we complete the move!
If I went to my boss and told her, “We need to do a datacenter move, which, after consulting with every other operations team, will require 12 hours of downtime,” I think she’d laugh in my face and go talk to HR about my continued employment.
I’ve worked in providing internet access to business travelers, at a dial-up ISP, at a web-hosting ISP, at what was then the world’s largest Tier 1 network provider, and now in providing internet-based services. I’m amazed that a business can provide this kind of inept service to its customer base.
It just goes to show you that the real players in the internet-services space know how to build redundancy, scale, and resilience into their products.
Redundancy: How about multiple datacenters, guys? I understand the need for centralized shards and back-end database servers, but when your entire product goes offline because you’ve got a single point of failure somewhere, it shows that you need to push the data closer to the front-end servers.
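To make the multiple-datacenters point concrete, here is a minimal sketch of client-side failover across login endpoints. The hostnames and the `is_reachable` probe are my own hypothetical stand-ins, not anything Turbine actually runs; the point is simply that a single-site outage shouldn’t lock everyone out.

```python
# Hypothetical login endpoints in two datacenters (assumed names).
ENDPOINTS = ["login-east.example.com", "login-west.example.com"]

def pick_endpoint(is_reachable, endpoints=ENDPOINTS):
    """Return the first reachable datacenter, or None if all are down.

    `is_reachable` is injected so this sketch stays testable; a real
    client would use a TCP connect or a health-check probe instead.
    """
    for host in endpoints:
        if is_reachable(host):
            return host
    return None

# The east-coast datacenter is mid-move; the west-coast one answers.
down = {"login-east.example.com"}
print(pick_endpoint(lambda h: h not in down))  # → login-west.example.com
```

The same idea applies server-side with DNS or anycast; the key design choice is that the failure of one site degrades capacity rather than availability.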
Scale: If you’ve got at least 11 shards, that’s probably 10 too many. I understand the need to lower latency, really, I completely understand; jitter is my enemy. But if none of those shards survives because your login server or front-end access servers can’t scale beyond a certain number of concurrent logged-in users, you need to look at doing it differently, especially when all the graphics, all the maps, all the physics, and so on are handled by the 11 GB client installed on your users’ computers. At the core of each shard is essentially a long-term storage database, a short-term storage database, and tens of thousands of UDP updates that can be highly localized, so that only the information a character could ever see is sent their way. Maybe look at different hardware (Sun has some highly threaded servers now that can handle the amazing number of UDP packets required, should you need to support 20,000 users with 20 ms update packetization) to break out of the norm.
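The “only the information a character could ever see” idea is usually called area-of-interest filtering. A minimal sketch, with a made-up view radius and player layout, assuming each shard knows every connected character’s position:

```python
import math

VIEW_RANGE = 120.0  # hypothetical visibility radius in world units

def players_to_notify(event_pos, players):
    """Return only the players close enough to ever see this event.

    `players` maps player_id -> (x, y) position. The shard would send
    the UDP update only to this subset instead of broadcasting the
    packet to every connected client.
    """
    ex, ey = event_pos
    return [
        pid
        for pid, (px, py) in players.items()
        if math.hypot(px - ex, py - ey) <= VIEW_RANGE
    ]

# A mob dies at (100, 100); only nearby characters get the packet.
players = {"alice": (90, 105), "bob": (900, 40), "carol": (150, 160)}
print(players_to_notify((100, 100), players))  # → ['alice', 'carol']
```

A real implementation would replace the linear scan with a spatial grid or quadtree, but the payoff is the same: per-event fan-out scales with local population, not total shard population.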
Resiliency: If you have single points of failure that take down your entire system, then you need to look at developing a system that allows for degraded operation should that single point go down. For example, if your huge honking 32-processor Sun/Oracle database server dies, can your customers still subscribe to your service, and use it in a normal or degraded state? Yes, somewhere something has to track that these six characters defeated Kranluk, but does that need to be stored centrally, or can it make its way to the central DB eventually?
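The “make its way to the central DB eventually” pattern is a write-behind queue: each shard records events locally and drains them whenever the central database is reachable. A sketch under those assumptions (the class names and the `save`/`available` interface are illustrative, not any real Turbine or Oracle API):

```python
from collections import deque

class WriteBehindLog:
    """Shard-local event log that drains to a central DB when it can,
    so a central-DB outage only delays bookkeeping instead of taking
    the whole game down."""

    def __init__(self, central_db):
        # central_db is assumed to expose .save(event) and .available
        self.central_db = central_db
        self.pending = deque()

    def record(self, event):
        # Always succeeds locally, even mid-outage.
        self.pending.append(event)
        self.flush()

    def flush(self):
        # Drain in order while the central DB will take writes.
        while self.pending and self.central_db.available:
            self.central_db.save(self.pending.popleft())

class FakeCentralDB:
    def __init__(self):
        self.available = False
        self.rows = []
    def save(self, event):
        self.rows.append(event)

db = FakeCentralDB()
log = WriteBehindLog(db)
log.record({"kill": "Kranluk", "party": 6})  # central DB is down; queued locally
db.available = True
log.flush()                                  # outage over; event drains centrally
print(db.rows)  # → [{'kill': 'Kranluk', 'party': 6}]
```

The trade-off is eventual rather than immediate consistency for that bookkeeping, which is exactly the point: a kill record can arrive centrally minutes late without anyone noticing, whereas twelve hours of total downtime everyone notices.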
I interviewed for a job that had the change-management and maintenance window anyone in an operations group would dream of: at 5pm on Friday night, they shut down their service. From that time until 8am on Sunday morning, they had free rein to re-install servers, update router software, make firewall changes, and so on. But it had to be up at 5pm on Sunday, or millions of dollars in transactions would be lost per minute. As much as the cellphone industry has allowed other companies to provide the same level of poor service, the financial industry tolerates no such lapse. I think Turbine should aspire to the loftier goal of a no-outage service like those run by financial companies, rather than be like those who manage cellular networks.