Two UPSs and a CX4

One of the big things I had to get done by mid-September was a data migration project: decommissioning an EMC CX4-960 array after moving all of its data to a VNX 7500 system. This took many months, for several reasons:

  • At the beginning, we wanted to do a full set of testing of different VNX drive configurations since the VNX gave us more options than were available on the old CX4 (and we wanted to be quite certain of performance before committing).
  • It took some time to move the data as non-intrusively as possible. There was only one point at which several production servers needed reboots (to upgrade HBA drivers and multipathing software) before they could connect to the VNX.
  • A whole lotta little daily projects and firefights came up along the way.

Fortunately, I work with a great bunch of people who were involved in the data migration, and once a date was set for the CX4 to be powered down, we worked together to make it happen. And it was a great feeling to turn that sucker off… well, after the lurking, incriminating doubts subsided (“Wait, wait, am I absolutely SURE there’s nothing else connected to the array?!?”). I learned a lot about EMC’s array-based migration tools, SAN Copy and MirrorView, and especially about how to automate them with NaviCLI scripting! That will come in handy in the near future, since I’m currently working on the second part of this VNX migration project: a second, older CX4 that will be migrated to a second new VNX 7500. Fun times!
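
For the curious, here’s roughly what I mean by automating this with NaviCLI. This is only a sketch, not the actual scripts we used: the SP address and session names are made up, and the sancopy switches are from memory, so check them against naviseccli’s own help for your FLARE/VNX OE release before relying on them.

```python
#!/usr/bin/env python3
"""Rough sketch of driving SAN Copy sessions with naviseccli.

Assumptions (not from the post itself): naviseccli is installed and on
PATH, credentials are already stored via Navisphere's security file,
and the SP address and session names below are hypothetical. The exact
sancopy switches can vary by release, so verify them first.
"""
import subprocess

SP_ADDRESS = "10.0.0.10"                   # hypothetical storage-processor IP
SESSIONS = ["vmfs_lun_12", "db_lun_30"]    # hypothetical SAN Copy session names


def navicli(*args: str) -> str:
    """Run one naviseccli command against the SP and return its output."""
    cmd = ["naviseccli", "-h", SP_ADDRESS, *args]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def start_sessions() -> None:
    """Kick off each pre-created SAN Copy session by name."""
    for name in SESSIONS:
        # "sancopy -start -name <session>" is how I remember the switch;
        # confirm against your release's CLI reference before trusting it.
        print(navicli("sancopy", "-start", "-name", name))


def report_session_status() -> None:
    """Dump the status of all SAN Copy sessions for a quick eyeball check."""
    print(navicli("sancopy", "-info", "-all"))


if __name__ == "__main__":
    start_sessions()
    report_session_status()
```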

---

The other huge project in September involved removing two big UPS systems from our computer room and replacing them with newer ones. It started as a meeting with the electricians, the UPS company rep, and several of us from the company. Since the outage was going to take several days, we decided early on to do it over a weekend to minimize the impact, and we scheduled it a few months out. Because of how our computer room is powered, we were going to lose power to half of the hardware in the room twice… once while the power feed was cut over to our generator, then a second time to bring power back over to the UPS lines. My group was responsible for identifying the systems that would have to be gracefully powered down before their power was cut. Wow, that took some time, because we haven’t been the best at, er, keeping internal documentation up-to-date and the power sockets all labeled (but it’s much better now! :D).

Once the hardware was identified, all the other groups at work that used those systems had to go over the list and decide the order in which they would be brought down. To date, this was probably our biggest internal change, affecting more people in our office than anything else (save the times we’ve physically moved from one office space to another). Our Change Management group took on this part of the project and did a great job organizing the sequence of events and all the inter-dependencies. It took MANY meetings and much explanation and clarification along the way to be sure everyone was fully aware of and understood the extent of this outage.

All of that organization was worth it. Starting Friday morning, non-essential systems were taken offline. This progressed through the day with orderly shutdowns of the affected components, and all power circuits for the UPSs were shut off by 7pm. The electricians moved those circuits over to generator power, and we got everything started up again in reverse order. Of course there were a few minor hiccups, but it went very smoothly. One thing kept going through my mind, over and over: “No surprises!” If I had missed something major early in the project, like misidentifying which hardware was on which circuit, we would have had an UNplanned outage… so I was breathing a lot easier by this point.

By Sunday morning the electricians and UPS engineers were ready for us to cut our power over from the generator to the new UPSs. They were about eight hours ahead of schedule, so people got called in early and the whole shutdown/startup process happened again. Again, there were probably a few hiccups, but it seemed to go swimmingly.

It was quite a feeling of accomplishment to be involved with so many people working in such coordination to make a huge project come together. There were some who put in a lot more hours than me on this, and I hope they got the credit they deserved. I feel that sense of accomplishment every time we have worked on things like this in the past… but this one felt an order of magnitude larger. I’m certain there will be more like it on the horizon, but it’s sure nice that they don’t happen very frequently!


Blame the SAN

Over the years, I wish I had kept track of all the times that my SAN at work was blamed for causing problems. It’s on my mind today after some work we did…

In our main facility, we have a Cisco MDS 9513 Director-class chassis with eight internal switch modules. At a high level, the switches work much like Ethernet switches: they allow connections between the end devices plugged into their ports. Connections are controlled by zones, which can be compared to network VLANs in that a zone allows its member devices, or end points, to talk to each other. A basic zone, for example, could be configured to include Server A and the ports associated with Storage Array B so that the server could access the LUNs on that array. On my switches, and probably as a common setting on most others, a server that is not included in a zone can’t talk to anything else. Good for security as well as sanity!
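
To make that VLAN comparison concrete, here is a toy model of the access rule in Python. The WWPNs and zone names are invented, and it models only the behavior (members of a common zone can talk to each other; everything else is blocked), not the actual MDS configuration syntax.

```python
"""Toy model of FC zoning: two endpoints can talk only if some zone
contains both of them. The WWPNs and zone names are made up."""

# Each zone is a set of member WWPNs (roughly like a VLAN's member ports).
zones = {
    "z_serverA_arrayB": {
        "10:00:00:00:c9:aa:aa:01",   # Server A HBA
        "50:06:01:60:bb:bb:00:01",   # Storage Array B front-end port
        "50:06:01:68:bb:bb:00:01",   # Storage Array B front-end port
    },
}


def can_communicate(wwpn_a: str, wwpn_b: str) -> bool:
    """True only if at least one zone contains both endpoints."""
    return any(wwpn_a in members and wwpn_b in members
               for members in zones.values())


# Server A can reach Array B...
print(can_communicate("10:00:00:00:c9:aa:aa:01", "50:06:01:60:bb:bb:00:01"))  # True
# ...but an unzoned server can't talk to anything.
print(can_communicate("10:00:00:00:c9:cc:cc:02", "50:06:01:60:bb:bb:00:01"))  # False
```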

So, this morning a couple of FC-attached tape drives were installed, and I connected them up to the MDS. Once they were powered up and the switch ports were activated, I configured them just like normal and zoned them with the several OpenVMS servers that would be using the drives for backup. After the servers scanned for new devices, they were drawing a blank. Why weren’t the drives showing up? It must be a problem with the SAN. I double-checked my end a couple of times and nothing was amiss. The configuration was similar to other FC-attached tape drives we have had online for years, so I was highly doubtful that some aspect of it would suddenly be failing.

It turned out that there were some OS-specific device-scanning steps that needed to be run before the servers could recognize the new drives, so all was well in the end. And it only took a few hours to get to that point.

I am not writing this to vent or to complain, because I believe everything we do, right or wrong, is a learning experience for those involved. I am not trying to put the blame back on any other system administrator, because I too have probably been guilty of the mentality that says it can’t be a problem with my stuff, it has to be yours. I do know from many years of experience, though, that fellow workers often get very defensive when a problem comes up and are quick to point fingers at others, only to find later that it was their own issue. And I do know that my own systems, like the FC switches, have worked without issue for a very long time, so I trust that a change similar to one I’ve made dozens of times in the past is going to work just like normal.

I also know that I am willing to do what I can to help someone figure out an issue, especially if it involves my hardware. I may not know everything (well, of course not!) but I’ll give you what I can. Please, don’t just keep saying it’s the SAN… what are you doing on your end to help figure it out?


Data migrations

By the middle of last year, I was up against a data crunch. I have three older-model HP EVA 8x00 arrays completely full of Fibre Channel drives (one rack each, no expansions), and the two oldest were around 95% full of data (yes… the growth got away from me, with several major requests for LUN space). The third was my safety valve, but it was also filling up faster than I liked. Decisions… buy expansion racks, or maybe try to migrate in higher-capacity drives? Either option was a lot of work, and given the cost in time, effort, and ongoing maintenance, not to mention the ill feelings about investing in older technology, it was pretty obvious that I needed to do some shopping for new hardware.

Working with our HP vendor, we arranged for a new EVA P6550 that was installed a couple of months ago. With newer 2.5″ SAS drives, and only half populated, it will easily hold all the data from those two old EVAs, with room to spare! Using the built-in Continuous Access software, we started a data migration plan that will be complete by next month. The data replication groups were VERY easy to configure, took only a few days to sync up, and run in the background without any issues. It does, unfortunately, require a reboot of every server connected to the EVAs (HP AlphaServers and Itaniums running OpenVMS… hmmm, that’s worth another post…) to convert from the old array connections to the new P6550, but this has worked out well in coordination with our regularly scheduled maintenance. I am also relieved that I can add more drives to accommodate 100% growth, all in that single-rack system.
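
As an aside, the Continuous Access side of this can also be scripted with HP’s SSSU command-line utility if you want a quick, repeatable check on the DR groups. Below is a rough sketch only: the Command View host, credentials, and array name are made up, and the exact LS DR_GROUP syntax and output vary by Command View release, so verify it against your own environment before trusting it.

```python
#!/usr/bin/env python3
"""Rough sketch: poll EVA Continuous Access DR groups via HP's SSSU.

Assumptions (not from the post): sssu is installed locally, and the
Command View host, credentials, and array name below are hypothetical.
Output formats differ by release, so the parsing is intentionally loose.
"""
import os
import subprocess
import tempfile

CV_HOST = "cveva-mgmt"     # hypothetical Command View EVA server
USERNAME = "monitor"       # hypothetical read-only account
PASSWORD = "secret"        # don't hard-code this in real life
ARRAY = "EVA-P6550-01"     # hypothetical array name


def run_sssu(commands: list[str]) -> str:
    """Feed a short command script to sssu and return whatever it prints."""
    script = "\n".join(commands) + "\nEXIT\n"
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        # sssu takes a command file as `sssu "FILE <name>"` on the versions
        # I've used; adjust the invocation if yours differs.
        result = subprocess.run(["sssu", f"FILE {path}"],
                                capture_output=True, text=True)
        return result.stdout
    finally:
        os.unlink(path)


if __name__ == "__main__":
    output = run_sssu([
        f"SELECT MANAGER {CV_HOST} USERNAME={USERNAME} PASSWORD={PASSWORD}",
        f'SELECT SYSTEM "{ARRAY}"',
        "LS DR_GROUP FULL",   # list every DR group; check the exact form
    ])
    # Loose filter: surface lines that look like state or log-disk fields.
    for line in output.splitlines():
        if "state" in line.lower() or "log" in line.lower():
            print(line.strip())
```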

Our first “big” SAN array, way back in the day, was an HP EVA5000. It was a bit buggy at the time, but each iteration of the EVA we’ve implemented has only gotten better, and they’ve been a reliable, solid bedrock on which our company has built its storage infrastructure.
