Archive for May, 2009:
VMware Fault Tolerance
vSphere was just released to general availability today, and one of the best features of this upgrade is the addition of VMware Fault Tolerance. From the VMware site:
VMware Fault Tolerance is leading edge technology that provides continuous availability for applications in the event of server failures, by creating a live shadow instance of a virtual machine that is in virtual lockstep with the primary instance. By allowing instantaneous failover between the two instances in the event of hardware failure, VMware Fault Tolerance eliminates even the smallest of data loss or disruption.
At VMworld 2008 they let us play with a demo of VMware FT, and it really is an amazing technology. It was almost like watching your first VMotion (“You mean the VM moved from this server to that server?”). VMware FT lets you run two instances of the same virtual machine at once. If you lose a host, the VM continues running with no data loss and minimal downtime (technically a couple of pings drop, but your users are unlikely to notice any disruption of service). VMware FT does this by sending the same CPU instructions to both CPUs via an FT logging NIC, which is a dedicated gigabit or better Ethernet NIC on your vSphere hosts.
As with any software that gives you that kind of power, there are some caveats and requirements to make FT work in your environment. I felt it was a good idea to start a blog post that I could update with the various requirements for using FT with vSphere. This list is by no means all-inclusive; it is simply a place where I can keep track of the needs and caveats of FT. Read more for my listing of the requirements that I’ve found thus far.
Invalid arguments: Virtual machine has no snapshots
I ran into a very interesting issue today with a client who is using Veeam Backup and Replication to keep their virtual machines replicated to a remote ESX server for disaster recovery. When Veeam starts a replication job, it takes a snapshot of the virtual machine and then replicates the main VMDK disk file to the remote site. When the job finishes, Veeam tells VMware to remove the snapshot, and the cycle repeats at the next scheduled replication. Since we are replicating our VMs across a slow WAN connection (600 Kbps, optimized with Citrix WANScalers), the replication can often time out or hang. Today I noticed that the replication had not updated since last night, so I needed to stop the replication and restart it. Since the Citrix WANScalers can cache as well as compress, restarting a failed replication job is usually pretty quick, as most of the data was previously cached on the Citrix boxes.
Here are the details of what I found, and how I fixed it…
