Background:
Current Environment
Primary datacenter in New Orleans, La., running VMware 2.5.4 with an EMC CX3-80 backend.
Backup datacenter in Dallas, Tx., running VMware 2.5.4 with an EMC CX3-80 backend.
45 Mb point-to-point connection between the sites.
All VMware LUNs are set up to replicate using MirrorView/A with a schedule policy of "time finished + 60 minutes".
60 LUNs with 8 TB of data in use, mirrored between the sites.
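For a sense of scale, here is a rough back-of-the-envelope calculation of what the link can carry. It assumes that "45 Mb" means 45 megabits per second, uses decimal units, and ignores protocol overhead and any compression, so real figures would differ. The function name is made up for this sketch.

```python
# Rough link-capacity arithmetic. Assumptions: 45 megabits per second,
# decimal units (1 TB = 10**12 bytes), no protocol overhead or compression.
LINK_MBIT_PER_S = 45
DATA_IN_USE_TB = 8


def hours_to_transfer(num_bytes, link_mbit_per_s):
    """Hours needed to push num_bytes over the link at full line rate."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    return num_bytes / bytes_per_second / 3600


gb_per_hour = (LINK_MBIT_PER_S * 1_000_000 / 8) * 3600 / 10**9
full_copy_hours = hours_to_transfer(DATA_IN_USE_TB * 10**12, LINK_MBIT_PER_S)

print(f"Line rate: about {gb_per_hour:.2f} GB per hour")
print(f"Full copy of {DATA_IN_USE_TB} TB: about {full_copy_hours / 24:.1f} days")
```

At line rate the link moves roughly 20 GB per hour, so a full copy of all 8 TB would take more than two weeks. Under these assumptions, the final sync before a move is only practical because the mirrors transfer changes rather than full images.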
Based on our corporate disaster recovery policy, we begin moving our environment from New Orleans to Dallas once Level X is reached. Level X is defined as a hurricane with winds of X and landfall expected within X hours.
Once Level X is reached, we move our servers to our secondary site using the following high-level steps.
Run a VBScript that updates the IP addresses of all current VMs to a network range that is already configured at the secondary site.
Manually shut down all VMs in a specific order, with our virtual SQL server last.
On each ESX server, run our "copy config" script, which copies all currently registered VM information to that server's partner ESX server at the other location.
Manually remove the LUNs from the New Orleans EMC VMware storage group.
Manually sync each LUN's remote mirror group.
Manually set each mirror to the manual sync policy so that the primary location stays "pristine" until we have successfully restarted at the secondary site and tested, before replicating the changes in the reverse direction.
Once each mirror is synced, promote the secondary image.
Manually add each LUN to the Dallas EMC VMware storage group.
On each ESX server, run the "register vm" script to register the VMs based on the information that was copied in one of the previous steps.
Reboot each ESX server so that the VMs can auto-restart, rebooting the one with SQL first.
With this configuration, there are three major issues.
Since we are changing the IP addresses of the VMs and updating DNS, some of our remote sites may not see the updated addresses for anywhere from 4 to 6 hours.
LUNs were not laid out to allow for tier rollover. All LUNs have to finish replicating before we can start any servers. We can still get our servers restarted in less than 4 hours, but all servers are down until all data is replicated.
SAN work requires a lot of manual steps.
About 2 weeks ago we started the build-out of our new VMware 3.5 environment. We had the following high-level goals.
No IP address changes.
Roll servers based on a priority design.
Remove as many manual steps as possible.
Ability to run production systems at both sites, with the ability to move either site if necessary.
Single script regardless of site or direction of move.
Configuration of New Environment
VMware 3.5
EMC CX3-80
Tiered configuration, with each tier assigned its own LUNs.
A New Orleans production set of servers and a Dallas set of servers, each with its own set of tiers, LUNs and subnet.
Single script that can move a set in priority order to either site, with minimal user steps.
Still requires manually moving the subnet, but the required steps take about 5-10 minutes to complete. (No VM IP address changes needed.)
Attached files
Business Continuity Script Assumptions.doc - This file contains all assumptions of our design and some script requirements.
SAN_Information.csv - This file is an input to the script and documents the EMC configuration for each VMware datastore. This information must be accurate for the script to work. The script includes a menu option, 99, which allows testing the "move" of a LUN before it is used in production. This allows us to validate the information in the csv file.
Business Continuity.ps1 - The actual script file. It includes a number of functions, described below.
Run_Navi_Cmds - Input is the primary SAN location and the datastore that the script is currently working with. It performs the following actions.
Move the LUN associated with this datastore out of the primary SAN's storage group.
Sync the mirror associated with this LUN with the secondary image at the secondary site.
Once synced, promote the secondary image to primary.
Once promoted, add the new primary to the ESX storage group at the secondary site.
Determine_Tier - used by other functions to determine which tier is being actively worked on.
Determine_Move_Direction - Lets the user define which datacenter is primary and which group of servers is being moved. It also creates the files holding each tier's VM and datastore information.
Display_Tier_Datastores - Verifies that the datastores in use by a given tier are not being used by VMs in another tier.
Delete_VMs - Lets the user pick the tier being worked on. Using the VM tier file, it tries to shut down any powered-on VM and, once the VM is powered off, removes it from inventory. Once all VMs are removed, it uses the tier's datastore file to run the Run_Navi_Cmds function for each datastore in use.
Register_VMs - Lets the user pick the tier being worked on. Using the VM tier file, it tries to register all VMs on the appropriate ESX server at the other site.
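The order of operations described for Display_Tier_Datastores, Delete_VMs and Run_Navi_Cmds can be sketched roughly as follows. This is not the attached PowerShell script. It is a minimal Python illustration of the sequencing only, and every name in it (the step labels, roll_tier, find_shared_datastores, the VM and datastore names) is hypothetical. The run_step callback stands in for whatever actually talks to the ESX servers and the SAN.

```python
from collections import defaultdict

# The four SAN actions described for Run_Navi_Cmds, in order. These are
# hypothetical labels for this sketch, not Navisphere CLI commands.
SAN_STEPS = (
    "remove_from_source_storage_group",
    "sync_mirror",
    "promote_secondary",
    "add_to_target_storage_group",
)


def find_shared_datastores(tier_datastores):
    """Return {datastore: [tiers]} for every datastore used by more than one tier."""
    tiers_by_datastore = defaultdict(set)
    for tier, datastores in tier_datastores.items():
        for datastore in datastores:
            tiers_by_datastore[datastore].add(tier)
    return {ds: sorted(tiers) for ds, tiers in tiers_by_datastore.items() if len(tiers) > 1}


def roll_tier(tier, tier_vms, tier_datastores, run_step):
    """Shut down and unregister a tier's VMs, then move each of its datastores."""
    shared = find_shared_datastores(tier_datastores)
    conflicts = {ds: tiers for ds, tiers in shared.items() if tier in tiers}
    if conflicts:
        # Moving a shared LUN would pull storage out from under another tier.
        raise ValueError(f"{tier} shares datastores with other tiers: {conflicts}")
    for vm in tier_vms[tier]:
        run_step("shutdown_vm", vm)
        run_step("unregister_vm", vm)
    # SAN work starts only after every VM in the tier has been removed.
    for datastore in tier_datastores[tier]:
        for step in SAN_STEPS:
            run_step(step, datastore)


# Example with made-up names; run_step only records what would be done.
log = []
tier_vms = {"tier1": ["mail01", "sql01"], "tier2": ["web01"]}
tier_datastores = {"tier1": ["ds_t1_a"], "tier2": ["ds_t2_a"]}
roll_tier("tier1", tier_vms, tier_datastores, lambda step, target: log.append((step, target)))
for entry in log:
    print(entry)
```

Passing the actions in as a callback keeps the ordering logic testable without touching a real SAN, which is the same idea as the script's menu option 99.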
This is an active, ongoing project, and our new VMware 3.5 environment has not been moved to production. We have successfully tested all aspects of our new design without issues. However, we have not finished our testing and will likely add features to this procedure.
For example, we will probably add to the Run_Navi_Cmds function a routine to set all mirrors to manual (no auto syncing). As a policy, we do not replicate from the new primary site to the original primary site until after we have successfully restarted all machines. This gives us a rollback option if we have startup issues. We will also add a menu option to put a tier's mirrors back in auto once we have successfully restarted the VMs at the new primary site.
Based on our preliminary testing, we should see a big improvement in our failover process. We expect our Tier 1 servers to be down no more than 30 minutes once we start our roll process. This alone will give our upper management access to email, BlackBerrys and other critical data 3 hours faster than our current setup. During a hurricane event, 3 hours can be significant.
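One way to picture that planned rollback policy is as a small gate: mirrors go to manual when a tier starts its roll, and they may only return to auto once every VM in the tier has been confirmed running at the new primary site. The sketch below is an illustration under that assumption, not part of the attached script. The class, its methods and the VM names are all hypothetical.

```python
class TierMirrorPolicy:
    """Tracks whether a tier's mirrors may be put back into auto sync."""

    def __init__(self, tier, vm_names):
        self.tier = tier
        self.vm_names = set(vm_names)
        self.verified = set()
        self.sync_mode = "auto"

    def begin_roll(self):
        # Stop automatic replication so the original site stays untouched.
        self.sync_mode = "manual"
        self.verified.clear()

    def mark_verified(self, vm_name):
        # Record that a VM has restarted successfully at the new primary site.
        if vm_name not in self.vm_names:
            raise KeyError(f"{vm_name} is not in {self.tier}")
        self.verified.add(vm_name)

    def resume_auto_sync(self):
        # Refuse to replicate in reverse while a rollback may still be needed.
        pending = self.vm_names - self.verified
        if pending:
            raise RuntimeError(f"VMs not yet verified: {sorted(pending)}")
        self.sync_mode = "auto"


# Example with made-up VM names.
policy = TierMirrorPolicy("tier1", ["mail01", "sql01"])
policy.begin_roll()
policy.mark_verified("sql01")
try:
    policy.resume_auto_sync()
except RuntimeError as err:
    print(err)
policy.mark_verified("mail01")
policy.resume_auto_sync()
print(policy.sync_mode)
```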
Use at your own risk. This script was tailored for a very specific setup and having a script manipulate your SAN can cause serious issues.