Stretched clustering is one of the most challenging topics that comes up when I meet with customers. Many customers think that stretched clustering is the ultimate disaster recovery solution and that it makes SRM obsolete. They assume that HA will solve all their DR problems and that they still get vMotion for workload mobility between two data centers.
This is NOT true. BUT a stretched cluster does make a very good disaster avoidance solution!
A vSphere Metro Storage Cluster (vMSC) is a typical solution that still needs a good DR solution alongside it (most of the time).
Disaster Avoidance
This is a process that allows proactive behavior to avoid an impending outage to services. Disasters tend to affect an entire site or have an impact on the services of the entire site even if only a partial site failure is sustained. Disaster avoidance technologies allow for configuration of a vSphere host, cluster or an entire site in such a fashion that irrespective of disaster, the services being provided will continue with minimum interruption. In most cases, disaster avoidance involves brief outages to services at a site followed by an orderly restart at a recovery site. A minimum outage sustained under controlled circumstances is typically considered acceptable as an alternative to sustaining an uncontrolled and extended outage associated with a true disaster.
Downtime Avoidance
Downtime avoidance differs from disaster avoidance as the former migrates the workloads between systems or sites with no downtime and no loss of data. vSphere technologies such as vMotion and Storage vMotion facilitate moving virtual machines or virtual machine storage with no interruption of the services they provide. Configuring vMotion and Storage vMotion requires that vSphere hosts are managed within a single VMware vCenter Server datacenter object and are configured with shared access to storage and network segments.
Disaster Recovery
This process assists rapid recovery from unplanned outages that bring down services in a fashion that makes local recovery within an acceptable time unlikely. In disaster recovery scenarios the goal is to rapidly return to operational status of the services, usually in a different datacenter in a safe location. Disaster recovery solutions will help automate return to operations of services that have stopped due to catastrophic failure of infrastructure.
Host Level
- Disaster avoidance = vMotion to avoid disaster and outage (non-disruptive)
- Disaster recovery = HA restarts VMs (disruptive)
Site Level
- Disaster avoidance = vMotion over distance to avoid disaster and outage (non-disruptive)
- Disaster recovery = SRM or scripted register/power-on of VMs at recovery site (disruptive)
Types of vSphere Metro Storage Cluster (vMSC) Implementations
Single stretched vSphere cluster
- Intra-cluster vMotions are parallelized
- vMotion network requirements = 622 Mbps / 5 ms RTT (10 ms with vSphere 5 Enterprise Plus/Metro vMotion), L2 equivalence for VMkernel (support requirement) and VM network traffic (operational requirement). This is round-trip time without factoring in replication traffic.
Multiple vSphere clusters
- Inter-cluster vMotions are serialized
- vMotion network requirements = 622 Mbps / 5 ms RTT (10 ms with vSphere 5 Enterprise Plus/Metro vMotion), L2 equivalence for VMkernel (support requirement) and VM network traffic (operational requirement). This is round-trip time without factoring in replication traffic.
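The bandwidth and latency figures above can be turned into a quick sanity check for an inter-site link. This is only a minimal sketch: the `InterSiteLink` type and the `vmotion_link_ok` function are hypothetical names made up for illustration, and the measured values would have to come from your own network tests.

```python
from dataclasses import dataclass

# Thresholds taken from the requirements listed above.
MIN_BANDWIDTH_MBPS = 622
MAX_RTT_MS = 5                  # standard vMotion
MAX_RTT_MS_METRO_VMOTION = 10   # vSphere 5 Enterprise Plus / Metro vMotion


@dataclass
class InterSiteLink:
    """Measured link values (hypothetical type, for illustration only)."""
    bandwidth_mbps: float
    rtt_ms: float  # round-trip time, measured without replication traffic


def vmotion_link_ok(link: InterSiteLink, metro_vmotion: bool = False) -> bool:
    """Return True if the link meets the bandwidth and RTT figures above."""
    rtt_limit = MAX_RTT_MS_METRO_VMOTION if metro_vmotion else MAX_RTT_MS
    return link.bandwidth_mbps >= MIN_BANDWIDTH_MBPS and link.rtt_ms <= rtt_limit


if __name__ == "__main__":
    link = InterSiteLink(bandwidth_mbps=1000, rtt_ms=8)
    print(vmotion_link_ok(link))                      # too slow for plain vMotion
    print(vmotion_link_ok(link, metro_vmotion=True))  # within the 10 ms limit
```

Note that this only covers the numeric requirements; the L2 equivalence for VMkernel and VM networks still has to be verified separately.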
My Experience
On a previous project we implemented a stretched cluster solution at a greenfield container terminal. A typical use case for a vSphere Metro Storage Cluster (vMSC) solution! We built two datacenters 7 km apart and established a very low RPO and RTO. The need for these two datacenters to be close to the quay cranes (7 km apart) makes this a perfect fit for stretched clustering.
The question that came to my mind was: what happens when there is a big disaster and we lose the quay cranes? Then no operation is possible whatsoever!
If the complete port is gone, we can allow for a much longer RTO (Recovery Time Objective), but we still can’t afford to lose much data (RPO, Recovery Point Objective).
This allowed us to base the DR solution on replicated backups to a second port 100 km away, and to use stretched clustering on the site itself to stay very flexible and get a very good RPO and RTO for smaller disasters (let’s say a fire in one datacenter or the loss of the building in which one of the datacenters is located).
“Sidedness / preferred side” and other tips
If the dedicated connectivity between the VPLEX Metro clusters is lost but both clusters are still up, there is a very real possibility of split brain.
To prevent this split-brain scenario and ensure that only one side of the Metro cluster continues to allow writes to the stretched LUN, VPLEX introduces the concepts of preferred LUNs and sidedness.
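The idea behind a preferred side can be sketched as a small decision function. This is a deliberately simplified model and not VPLEX's actual logic: the function name, the site labels and the rule set are assumptions for illustration, and it ignores any witness or arbitration component.

```python
def writable_sites(preferred: str, sites_up: set, inter_site_link_up: bool) -> set:
    """Return the sites that may keep accepting writes to a stretched LUN.

    Simplified preference rule (an assumption, not the vendor algorithm):
    - link up: every surviving site keeps writing;
    - link down: only the preferred site continues, the other side suspends
      I/O because it cannot tell a link failure from a site failure.
    """
    if inter_site_link_up:
        return set(sites_up)
    return {preferred} & set(sites_up)


if __name__ == "__main__":
    # Both sites alive, inter-site link lost: only the preferred side writes.
    print(writable_sites("site-a", {"site-a", "site-b"}, inter_site_link_up=False))
    # Preferred site lost: the surviving side suspends in this simple model.
    print(writable_sites("site-a", {"site-b"}, inter_site_link_up=False))
```

The second case shows why the choice of preferred side matters: in this model, losing the preferred site leaves the LUN without a writable side until someone intervenes.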
Also, without running VMs on a preferred side, VMs in one site could be accessing storage in the other site, which adds latency to every I/O operation (in the case of cross-connect).
With sidedness, VMs run on their preferred side and storage is accessed locally.
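A simple way to audit sidedness is to compare the site each VM runs in with the preferred site of its datastore. The sketch below assumes you have exported that inventory into plain dictionaries; all names (`cross_site_vms`, the VM, datastore and site labels) are hypothetical.

```python
def cross_site_vms(vm_site: dict, vm_datastore: dict, datastore_preferred_site: dict) -> list:
    """Return VMs whose host site differs from their datastore's preferred site.

    These VMs would send every I/O across the inter-site link.
    """
    return sorted(
        vm
        for vm, site in vm_site.items()
        if datastore_preferred_site[vm_datastore[vm]] != site
    )


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    vm_site = {"vm-app01": "site-a", "vm-db01": "site-b", "vm-web01": "site-b"}
    vm_datastore = {"vm-app01": "ds-a1", "vm-db01": "ds-a1", "vm-web01": "ds-b1"}
    datastore_preferred_site = {"ds-a1": "site-a", "ds-b1": "site-b"}

    print(cross_site_vms(vm_site, vm_datastore, datastore_preferred_site))
```

In this example only vm-db01 is flagged: it runs in site-b while its datastore prefers site-a.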
Prior to and including vSphere 4.1, you can’t control HA/DRS behavior for “sidedness”.
There is no supported way to control VMware HA primary/secondary node selection with vSphere 4.x. This limits the cluster size to 8 hosts (4 in each site). Methods for increasing the number of primary nodes are also not supported by VMware.
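The 8-host limit follows from a counting argument. The sketch below assumes the commonly documented vSphere 4.x HA behaviour of at most five primary nodes (a figure not stated in this post) and checks by brute force whether every possible set of primaries spans both sites. The function name and site labels are hypothetical.

```python
from itertools import combinations

PRIMARY_NODES = 5  # assumption: vSphere 4.x HA elects at most five primaries


def primaries_always_span_sites(hosts_per_site: dict) -> bool:
    """Return True if every possible set of primaries has a node in 2+ sites."""
    hosts = [(site, i) for site, count in hosts_per_site.items() for i in range(count)]
    k = min(PRIMARY_NODES, len(hosts))  # small clusters: every host is a primary
    return all(
        len({site for site, _ in chosen}) > 1
        for chosen in combinations(hosts, k)
    )


if __name__ == "__main__":
    print(primaries_always_span_sites({"site-a": 4, "site-b": 4}))  # 4 + 4 hosts
    print(primaries_always_span_sites({"site-a": 5, "site-b": 5}))  # 5 + 5 hosts
```

With four hosts per site, five primaries can never all land in one site, so a site failure always leaves a primary behind to coordinate restarts. With five hosts per site, all primaries could end up in the failed site.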
As of vSphere 5.x, you can use DRS host affinity rules to control HA/DRS behavior.
The new HA implementation in vSphere 5 changes things: it no longer relies on the primary/secondary node concept.
You’ll need to use multiple isolation addresses in your VMware HA configuration, at least one on each side.
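A quick check that each side has its own isolation address could look like the sketch below. It assumes the HA advanced options have been exported as a dictionary using the `das.isolationaddress0`, `das.isolationaddress1`, ... keys, and that you maintain your own mapping of addresses to sites. The function name, IP addresses and site labels are hypothetical.

```python
def sites_without_isolation_address(advanced_options: dict, address_site: dict, sites: list) -> list:
    """Return the sites that have no HA isolation address configured."""
    configured = {
        value
        for key, value in advanced_options.items()
        if key.startswith("das.isolationaddress")
    }
    covered = {address_site[addr] for addr in configured if addr in address_site}
    return sorted(set(sites) - covered)


if __name__ == "__main__":
    # Hypothetical configuration: only a site-a gateway is set as isolation address.
    advanced_options = {"das.isolationaddress0": "10.0.1.1"}
    address_site = {"10.0.1.1": "site-a", "10.0.2.1": "site-b"}

    print(sites_without_isolation_address(advanced_options, address_site, ["site-a", "site-b"]))
```

Here site-b is reported, because no isolation address located in that site has been configured yet.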
Downside: it needs smart people… What if you’re the smartest person in the room, you’re caught up in the disaster yourself, and your organization requires operational simplicity? SRM is an easy push-button mechanism.
Downside 2: stretched HA/DRS clusters (and inter-cluster vMotion too) require a stretched Layer 2 network, which complicates the network infrastructure.
The network lacks site awareness, so stretched clusters introduce new networking challenges!
I have collected a lot of documents and links to share.
Victor van der Berg created a nice presentation about “vSphere – What option do you choose for Disaster Recovery” for the Dutch VMUG.
vSphere Metro Stretched Cluster whitepapers
VMware whitepapers about vSphere Metro Storage Cluster (stretched cluster):
- Whitepaper: VMware vSphere Metro Storage Cluster Case Study
- Whitepaper: Stretched Clusters and VMware vCenter Site Recovery Manager Understanding the Options and Goals
- VMware Hardware Compatibility Guide for Storage; select Metro Cluster Storage under Array Test Configuration.
EMC VPLEX:
- VMware KB Article 2007545: Implementing vSphere Metro Storage Cluster (vMSC) using EMC VPLEX
- Whitepaper: Using VMware vSphere with EMC VPLEX
- Understanding vSphere Stretched Clusters, Disaster Recovery
Hitachi Data Systems:
HP LeftHand:
- VMware KB Article 2020097: Implementing vSphere Metro Storage Cluster using HP LeftHand Multi-Site
- Whitepaper: Implementing VMware vSphere Metro Storage Cluster with HP LeftHand Multi-Site storage
NetApp MetroCluster:
- VMware KB Article 1001783: VMware support with NetApp MetroCluster
- Whitepaper: A Continuous-Availability Solution for VMware vSphere and NetApp
- Whitepaper: Best Practices for MetroCluster Design and Implementation
Below are some good articles on All Paths Down (APD) and Permanent Device Loss (PDL); you need this information when working with stretched clusters. Also take a look at the differences between vSphere 5.0 Update 1 and vSphere 5.1.
- vSphere 5.0 Storage Features Part 8 – Handling the All Paths Down (APD) condition by Cormac Hogan on the VMware vSphere Blog
- vSphere 5.1 Storage Enhancements – Part 4: All Paths Down (APD), also by Cormac Hogan on his own blog.
Site Recovery Manager
For Site Recovery Manager the following whitepapers are available:
VMware Whitepapers:
- Evaluation guide: VMware vCenter Site Recovery Manager 5.0
- Technical documentation for SRM 5.1
VMware SRM Storage Replication Adapters:
- SRA Compatibility guide is available here.
- VMware vCenter Site Recovery Manager Storage Partner Compatibility Matrix
- Download the SRAs through the VMware website (login required)
SRM implementation guides: (Note: this list is not complete but includes some popular solutions)
- Dell: Disaster Recovery with Dell Equallogic PS Series SANs and VMware vSphere Site Recovery Manager 5
- EMC: Using EMC SRDF Adapter for VMware vCenter Site Recovery Manager 5.0
- HDS: Deploying VMware Site Recovery Manager 5.0 with VMware vSphere 5.0 on Hitachi Virtual Storage Platform
- IBM: IBM Storwize 7000 Unified, Sonas, and VMware Site Recovery Manager
- NetApp: Deploying VMware vCenter Site Recovery Manager 5.1 with NetApp FAS/V-Series Storage Systems