A Bad SAN


September 23, 2009 by Mike Hillwig

Several years ago, I was working on a project at a utility company, deploying their new customer billing system. After the system went live, I was asked if I would work the night shift, doing all of the operational work: importing the payments and meter readings, generating the bills, and so on. Since the system was new, these things hadn’t been automated yet. Part of my job was to lay the groundwork for automating the whole process.

It wasn’t too long before we started noticing some serious performance problems. Stuff was taking twice as long as we had predicted.

Our database server was a high-end system, connected to a high-end SAN. We also had three other boxes used for testing, development, and the like.  One was a high-end server with local disks. We also had two low-end servers, one attached to the same SAN and one with local disks. The bad sketch below should give you an idea of what I’m talking about.

Further testing showed that we frequently got much better performance from the high-end server with local disks. What ensued was potentially the biggest pissing and finger-pointing contest I’ve ever seen in my career. The corporate DBA team blamed the application design. The project DBA blamed the infrastructure. The infrastructure team blamed the vendor. I had a hunch that it was a problem with the SAN. I was the guy who ran everything overnight, and the performance problems had the biggest impact on my job.

One night, I asked my DBA to help me prove a hunch. After the 5PM backup of the production server, he restored that exact same database to the other three servers. He had no idea what I was about to do, but he realized I was onto something.

I ran the exact same processes on all four servers and compared the execution times. What did I find? The high-end server with local disks performed the best. The others performed in the following order: low-end server with local disks, high-end server on SAN, low-end server on SAN. Armed with the facts, I made sure the director I worked for found this nugget of information in her e-mail when she got in the following morning.
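The comparison itself is simple enough to sketch. The Python below is only an illustration under assumptions: the server labels and the stand-in jobs are hypothetical, and a real run would replace each stand-in with something that kicks off the actual batch process against that box.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_job(job: Callable[[], None]) -> float:
    """Return the wall-clock seconds taken by one run of job."""
    start = time.perf_counter()
    job()
    return time.perf_counter() - start


def rank_servers(timings: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (server, seconds) pairs from fastest to slowest."""
    return sorted(timings.items(), key=lambda item: item[1])


def compare(jobs: Dict[str, Callable[[], None]]) -> List[Tuple[str, float]]:
    """Run each server's job once and rank the measured times."""
    return rank_servers({name: time_job(job) for name, job in jobs.items()})


if __name__ == "__main__":
    # Hypothetical stand-ins: each sleep simulates the same batch job
    # running on a different server. These durations are made up.
    jobs = {
        "high-end, local disks": lambda: time.sleep(0.01),
        "low-end, local disks": lambda: time.sleep(0.02),
        "high-end, SAN": lambda: time.sleep(0.03),
        "low-end, SAN": lambda: time.sleep(0.04),
    }
    for name, seconds in compare(jobs):
        print(f"{name}: {seconds:.3f}s")
```

The point of restoring the same backup everywhere is that the data and the workload stay constant, so the only thing left to explain a difference in the timings is the hardware and storage underneath.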

When I walked in the next day, my director practically met me at the door. She wasn’t exactly happy that I did this testing without consulting her first, and rightfully so. At the same time, she confessed that I had pretty much proved her suspicions, too. We had a bad SAN configuration. Of course, the director who managed the infrastructure claimed that the databases had to be different. Obviously that wasn’t the case. Then he said it had to be that the primary server had users attached to it. Right, all two of them at that hour. This problem and the finger-pointing contest persisted for several months.

About two months after my contract ended, I got an e-mail from my old director. She said that they had purchased a new SAN because they were outgrowing the old one. While they were implementing the new one, the vendor took a look at the old one because some users were complaining about performance problems (like I had been). The vendor immediately found a configuration problem. Nightly processing immediately went from twelve hours down to five.

I told my director that I hoped she rubbed it in so hard it left stains.

Yes, SANs can be an amazing thing on so many levels. But never underestimate the hell a poorly configured SAN can put you through.