VMware announced this week the availability of stretched availability zone clusters in VMware Cloud on AWS. This has stirred up a lot of interest in the offering, and customers are jumping on it. To me it is very reminiscent of the old days of metro-clusters. When we announced metro-clusters a few years back, people were scrambling to implement them, whether they really needed them or not. It was cool and shiny and fancy: all the things we techies love! But as architects we need to take a step back and look at the business requirements and whether the complexity is warranted.
Along with this new offering we have Site Recovery as-a-Service, which has been available for some time in VMware Cloud. This is very familiar to many of us as it's pretty much SRM in the cloud. It offers a robust Disaster Recovery solution for on-prem to cloud or even cloud to cloud.
I’m not going to get into the technical nitty gritty of these solutions, as some of our fantastic VMware resources have already done this. To get up to speed on the solutions I recommend reading the following posts:
What I really want to discuss is the difference between the two and the use cases for each.
Stretched Clusters:
- Use Cases:
  - Very low RTO/RPO application requirements (sub-5-minute)
  - For use with high availability application architectures
  - Fully automated protection from AWS AZ outages intra-region
- Pros:
  - Fully automated
  - Very low RTO/RPO
  - Low latency spanned L2
  - VPN will automatically come up at the available site; depending on which AZ fails, connections may drop and need to be reconnected
- Cons:
  - Application startup order is random; HA restart priorities are not currently available (roadmap item)
  - Data egress charges: billed at a lower rate than normal egress charges (1c/GB), but this can still be quite costly depending on workload
  - Additional cluster charge for stretched-AZ feature
  - Requires enough capacity at each site for all workloads
  - Possible performance degradation due to writes being committed on both sides
  - Does not protect from data corruption, nor offer multiple recovery points
  - Only same region supported
  - Requires minimum 6 hosts
  - Hosts must be added 2 at a time
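To get a feel for the data egress charge listed above, here is a rough back-of-the-envelope sketch in Python. It assumes the 1c/GB rate quoted in this post and that every gigabyte written crosses the AZ boundary exactly once; the function name and the sample write volume are made up for illustration, so check your actual bill and current pricing before relying on the number.

```python
from decimal import Decimal

# Rate quoted in this post for stretched-cluster traffic (assumption: still current).
CROSS_AZ_RATE_PER_GB = Decimal("0.01")


def estimate_monthly_egress_cost(write_gb_per_day, days=30):
    """Rough estimate in dollars, assuming each written GB crosses AZs once."""
    return Decimal(str(write_gb_per_day)) * days * CROSS_AZ_RATE_PER_GB


# Hypothetical workload writing 2,000 GB per day.
monthly_cost = estimate_monthly_egress_cost(2000)
print(f"Estimated cross-AZ charge: ${monthly_cost} per month")
```

Even at the reduced rate, a write-heavy workload adds up quickly, which is why this belongs in the sizing conversation.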
Site Recovery as-a-Service:
- Use Cases:
  - Site to site recovery (either on-prem to VMC or VMC to VMC)
  - Orchestrated recovery
  - Planned migrations
- Pros:
  - Recovery plans are testable. This is a HUGE value; to me, you don’t have DR unless you test it.
  - Does not require synchronous storage or spanned L2
  - Orchestrated failover plans ensure application availability
  - Protects from region failures
  - Multiple point-in-time restore options
  - Recovery site can be used for non-critical workloads that are automatically powered off in the recovery plan
  - Can utilize a low number of hosts in the recovery site and scale up extremely fast in the event of a disaster
- Cons:
  - Requires manual failover initiation: someone needs to hit the button
  - Per VM cost
  - Lowest RPO available is 5 minutes
  - Requires enough capacity at failover site for all workloads
  - Requires minimum 4 hosts at each site
This is just a brief overview of the solutions; an in-depth discussion with your customers really needs to be had to figure out which solution is right for them. It might also not be an “or” but an “and”, as these solutions can be combined.
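As a conversation starter, the trade-offs above can be boiled down to a small Python sketch. This is my own simplification of the lists in this post, not official VMware guidance: the `Requirements` fields, the `suggest_solutions` function, and the rules inside it are all hypothetical, and they ignore cost and host minimums entirely.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    rpo_minutes: float              # maximum tolerable data loss
    needs_automatic_failover: bool  # nobody available to "hit the button"
    needs_region_protection: bool   # must survive a full region outage
    needs_point_in_time: bool       # must roll back, e.g. after data corruption


def suggest_solutions(req):
    """Return the offerings worth discussing, based only on the points in this post."""
    solutions = set()
    # Stretched clusters cover sub-5-minute RPO and fully automated failover.
    if req.rpo_minutes < 5 or req.needs_automatic_failover:
        solutions.add("stretched cluster")
    # Site Recovery covers region failures and multiple recovery points.
    if req.needs_region_protection or req.needs_point_in_time:
        solutions.add("site recovery")
    # An empty result means no hard requirement forces either choice.
    return sorted(solutions)


# Hypothetical application: near-zero data loss, but must also survive a region failure.
print(suggest_solutions(Requirements(1, True, True, False)))
```

When both offerings come back, that is the “and” case: availability and recoverability requirements exist side by side.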
To me this really boils down to the age-old discussion of “availability” versus “recoverability”, which I won't get into as it's a blog post on its own. Read my compatriot's blog post (https://cloud.vmware.com/community/2018/05/01/business-continuity-planning-basics-vmware-cloud-aws/) for that info. These concepts are very often interchanged but solve very different problems. I hope this post has highlighted those differences in our offering and will help you and your customers make a decision.