Archive for Work

Clear DNS Cache in Windows and Linux

As a reminder to myself, the proper way to clear the DNS cache on a Windows machine (be it Windows 7, Vista, or XP) is to run:

ipconfig /flushdns
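If you want to confirm the flush actually emptied the cache, you can dump what Windows currently has cached before and after:

ipconfig /displaydns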

In some versions of Linux (confirmed on CentOS 5.6), the name service cache daemon (nscd) is usually installed to manage the DNS cache. To clear it, restart the daemon:

/etc/init.d/nscd restart
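If you'd rather not bounce the whole daemon (nscd also caches passwd and group lookups, which keep working through this), you can invalidate just the hosts table:

nscd -i hosts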

And… if you happen to have installed bind9 as a caching name server, use:
rndc flush
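And if only one domain's records have gone stale, rndc can flush a single name rather than the whole cache:

rndc flushname example.com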


I’m in the Wrong Business.

This isn’t from the point of view of a furious subscriber to a service, but that of a peer in a similar industry.

From Lord Of The Rings Online:

As the final stage of our datacenter move, all Turbine games will be offline on Tuesday February 23, from 4:00AM – 4:00PM Eastern Time (-5 GMT). Websites, including myaccount.turbine.com, forums, wikis, and social networks will be available, but players may be unable to log in or access their account information during this time. We thank you for your patience while we complete the move!

If I went to my boss and told her, “We need to do a datacenter move, which, after consulting with every other operations team, will require a 12-hour downtime,” I think she’d laugh in my face and go talk to HR about my further employment.

I’ve worked in providing internet access to the business traveler, at a dial-up ISP, at a web-hosting ISP, and at the world’s largest Tier 1 network provider (at the time), and now I provide internet-based services. I’m amazed when a business can provide this kind of inept service to its customer base.

It just goes to show you that the real players in the internet-services space know how to build redundancy, scale, and resilience into their product.

Redundancy: How about multiple datacenters, guys? I understand the need to have centralized shards and back-end database servers, but when your entire product goes offline because you’ve got a single point of failure somewhere, that shows you need to push the data closer to the front-end servers.

Scale: If you’ve got at least 11 shards, that’s probably 10 too many. I understand the need to lower latency, really, I completely understand; jitter is my enemy. However, if none of those shards survives because your login server or front-end access servers can’t scale beyond a certain number of concurrently logged-in users, you need to look at doing it differently. Especially when all the graphics, all the maps, all the physics, and all the rest are handled by the 11 GB client installed on your users’ computers: at the core of each shard is essentially a long-term storage database, a short-term storage database, and tens of thousands of UDP updates that can be highly localized, so that only the information a character would ever see is sent their way. Maybe look at different hardware (Sun has some highly threaded servers now that can handle the amazing number of UDP packets required, should you need to handle 20,000 users with 20 ms update packetization) to break out of the norm.
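To make the “highly localized” idea concrete, here’s a minimal sketch in Python (the Player type, flat coordinates, and view radius are all made up for illustration): an update about an event only gets queued for the players whose characters could actually see it, and everyone else never receives the packet at all.

import math
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    x: float
    y: float

VIEW_RADIUS = 100.0  # how far a character can see, in world units (invented)

def subscribers(players, event_x, event_y):
    # Only the players near the event ever get the UDP update.
    return [
        p for p in players
        if math.hypot(p.x - event_x, p.y - event_y) <= VIEW_RADIUS
    ]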

Resiliency: If you have single points of failure that take down your entire system, then you need to develop a system that allows for degraded operation should that single point go down. For example, if your huge honking 32-processor Sun/Oracle database server dies, can your customers still subscribe to your service and use it in a normal or degraded state? Yes, somewhere, something has to track that these 6 characters defeated Kranluk, but does that need to be stored centrally, or can it make its way to the central DB eventually?
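As a sketch of what “make its way to the central DB eventually” could look like (Python again; the central_db client and its write() method are hypothetical stand-ins for the big Oracle box), here’s a write-behind queue: gameplay writes always succeed locally, and a background thread drains them to the central database whenever it’s reachable.

import queue
import threading
import time

class WriteBehindStore:
    def __init__(self, central_db):
        self.central_db = central_db   # hypothetical central DB client
        self.pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, event):
        # Always succeeds locally, even while the central DB is down.
        self.pending.put(event)

    def _drain(self):
        while True:
            event = self.pending.get()
            try:
                self.central_db.write(event)   # assumed API on the stand-in
            except Exception:
                self.pending.put(event)        # keep it for a later retry
                time.sleep(5)

If the central DB dies, players keep playing and kills keep getting recorded; the history just arrives late.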

I interviewed for a job which had the kind of change management and maintenance window anyone in an operations group would dream of: at 5pm on Friday night, they shut down their service. From that time until 8am on Sunday morning, they had free rein to re-install servers, update router software, make firewall changes, etc. But it had to be up at 5pm on Sunday, or millions of dollars in transactions would be lost per minute. As much as the cellphone industry has allowed other companies to get away with the same level of poor service, the financial industry tolerates no such lapses. I think Turbine should aspire to the loftier goal of a no-outage service, like those who run financial companies, rather than be like those who manage cellular networks.


Inadvertent Randomness

A friend related to me a story about an issue he had with a script he wrote.

It’s a web-based program for resetting users’ passwords, using the standard Debian-provided dictionary file from the “dictionaries-common” package. In essence, it takes two random words from the dictionary file, joins them with a punctuation mark and a number, and randomly capitalizes any easy-to-see letters (that is, it will capitalize a Z or G but not an I, L, or O).

So, someone from his MIS department was using this tool to reset a woman’s password today, and it came up with “dUmb%4bitCh”.

Needless to say, he’s been tasked with creating a package, based on the Debian one, that removes certain words.
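A minimal sketch of that scheme in Python (I never saw the actual script, so the function names, the punctuation set, and the 30% capitalization rate are my own guesses), including the kind of word blacklist he’s now having to add:

import random

AMBIGUOUS = set("iIlLoO")  # easy-to-misread letters stay lowercase

def load_words(path="/usr/share/dict/words", blacklist=()):
    with open(path) as f:
        words = {w.strip().lower() for w in f if w.strip().isalpha()}
    return sorted(words - set(blacklist))

def generate_password(words):
    first, second = random.sample(words, 2)
    raw = first + random.choice("!@#$%^&*.") + str(random.randint(0, 9)) + second
    # Randomly capitalize only the letters that stay legible
    # (a Z or G, never an I, L, or O).
    return "".join(
        c.upper() if c not in AMBIGUOUS and random.random() < 0.3 else c
        for c in raw
    )

words = load_words(blacklist={"dumb", "bitch"})
print(generate_password(words))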

Checking my server’s own /usr/share/dict/words file, I came up with some other printable splices that could cause HR nightmares:

  • large!3member
  • auto^8cunnilingus
  • vomiting*9vagina
  • sexless@1nerd
  • infected.1gonorrhea

etc…etc…etc…

Which is why you never expose a tool like this directly to end users, and why you always have an admin/tech do these things for the end user.


Always on Call

It’s the curse of the high-level network engineer… to always be on call. Because your job requires you to engineer mission-critical, company-wide networks, no matter how well you document your network (or how thoroughly you forget to document it, given your extreme lack of time), something will always go wrong, and you will get a call/page. So the objective is to engineer the network to the point that it’s simple to maintain. I call it the Drunken Master network engineering technique.

Imagine: you’re over at a friend’s house party, which happens to feature 12 different brands of Vodka and whatever you wish to mix with them. (My friend Matt says if you mix anything more than ice with it, it’s an insult to the Vodka, but I’m all about making my Vodka angry with me; an angry Vodka tends to get one drunker faster.) Now, you’ve had quite a few drinks, and at about 1:45am, you get a page… The network is down and it’s impacting customers. What do you do? You know you can’t call on anyone else; they can’t fix anything and will probably make things worse. You can’t call your boss; he’ll just keep asking, “Is it fixed yet? How about now?” This is why you engineered your network with the Drunken Master technique. Your network is so simple, you can fix and troubleshoot it three sheets to the wind.

You get on the router, fix the OSPF issue (damn, you forgot to set the DR priority correctly), and go back to having your brain smashed out by a golden brick wrapped in a lemon.
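For reference, on a Cisco IOS box that fix is a per-interface priority (the interface name here is hypothetical); and since OSPF DR election is non-preemptive, the new value only takes effect once the adjacency resets, e.g. with clear ip ospf process:

interface GigabitEthernet0/0
 ip ospf priority 255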

Woe be to the network engineer who makes his network unnecessarily complicated; no partying for you!
