Wind River has been exposed to OpenStack for over five years now, and Ron Breault admits he has grown blasé about the Live Migration feature. It just does what it is supposed to do, and does it well.
Guest blog by Ron Breault.
But when you stop and think about it, OpenStack Live Migration is really an extraordinary feature. That’s especially true now, with all the recent improvements that have been made to it.
Why do you call Live Migration extraordinary?
Because of everything that happens ‘under the covers’ to make it work, and because of what Live Migration enables. With just a few clicks of the mouse in Horizon, a VM running on one physical server can be automatically moved to another physical server.
‘Automatically’ makes it sound simple, but there’s a whole lot of work going on to pull it off: replicating all the VM’s static and dynamic memory – while the VM is running; copying and establishing the VM’s complete network infrastructure on the target node; copying local block storage (if used) to the target node; and briefly pausing and then resuming the VM to complete the process. Depending on the size of the VM, the overall migration interval can be measured in seconds to minutes.
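To make the copy-while-running step concrete, here is a minimal Python sketch of an iterative pre-copy loop. It is a toy model, not the code OpenStack or the hypervisor actually runs: the function name, the parameters and the 150 ms downtime budget are all illustrative assumptions, and network and block-storage setup are left out entirely.

```python
def simulate_precopy(ram_mb, bandwidth_mb_s, dirty_rate_mb_s,
                     max_downtime_s=0.15, max_rounds=30):
    """Toy model of copying a running VM's memory in rounds.

    Returns (rounds, total_seconds, downtime_seconds), or None if the
    remaining memory never fits inside the downtime budget.
    All names and defaults are hypothetical, chosen for illustration.
    """
    remaining_mb = float(ram_mb)  # the first round has to copy all of RAM
    elapsed_s = 0.0
    for round_no in range(1, max_rounds + 1):
        copy_time_s = remaining_mb / bandwidth_mb_s
        if copy_time_s <= max_downtime_s:
            # Small enough: pause the VM, copy the rest, resume on the target.
            return round_no, elapsed_s + copy_time_s, copy_time_s
        # Otherwise copy while the VM keeps running and dirtying memory.
        elapsed_s += copy_time_s
        # Whatever the VM wrote during this round must be sent again.
        remaining_mb = min(float(ram_mb), dirty_rate_mb_s * copy_time_s)
    return None


if __name__ == "__main__":
    # A 4 GB VM on a 1000 MB/s link that dirties 100 MB/s converges quickly.
    print(simulate_precopy(4096, 1000, 100))
    # A VM that dirties memory as fast as the link can copy never converges.
    print(simulate_precopy(4096, 1000, 1000))
```

The point of the model is the ratio between the dirty rate and the link speed: each round shrinks the remaining memory by that ratio, which is why migration time depends so heavily on how busy the VM is.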
What does Live Migration enable?
All sorts of things important to the operation of an always-on production cloud. Live Migration enables physical servers to be gracefully powered off and upgraded without taking hosted virtual servers offline. In a similar way, important host security updates or bug fixes can be delivered and deployed across servers without stopping any of the hosted VMs.
For example, when using the in-service upgrade feature in Wind River’s Titanium Cloud virtualisation software products, a complete cloud infrastructure can be upgraded from one release to the next, live and in production, thanks in part to the capabilities of Live Migration.
The OpenStack Innovation Centre recently issued a report titled ‘High Availability of Live Migration.’ It makes for good reading, and details the thorough study and testing the centre performed on OpenStack’s Live Migration capability. While Wind River doesn’t want to dissuade customers from reading the full report, the key line from the summary was, for me, this statement: “In conclusion, we were able to prove that Live Migration works.”
If they had asked Wind River first, a lot of work could have been saved: Titanium Cloud has been successfully leveraging Live Migration for many years. While the vanilla OpenStack distribution still has a few kinks to work out, Titanium Cloud has Live Migration down to an art form, and it has been pushing changes upstream to make it even better.
As validated through independent third-party testing, Titanium Cloud can execute a Live Migration with under 150ms of VM downtime.
Now, with the most recent release of Titanium Cloud, which includes both upstream OpenStack work and Wind River updates, Live Migration is better than ever! Here are just two improvements that the company thinks warrant particular attention:
Performance Increase: Under the latest Titanium Cloud release, testing shows that Live Migration throughput has been significantly increased. In the labs, the company has seen throughput improved by as much as five times over prior releases. That kind of change can make a big difference with large VMs, resulting in a substantially reduced Live Migration interval. Faster migrations can mean reduced timing for planned maintenance activities – the operator simply spends less time waiting for Live Migrations to complete.
Auto-Convergence: The new Auto-Convergence feature is an especially cool innovation. Some VMs can take a long time to migrate due to heavy memory write activity: as fast as OpenStack copies the ‘dirty’ memory contents of the VM from the source to the target, the VM dirties its memory again. This means OpenStack might barely keep up or, in some cases, might never catch up, because the VM is simply too busy writing to memory. The new Auto-Convergence feature changes that by intelligently slowing down the VM’s virtual CPUs so that it can’t dirty its pages as quickly. Since its memory writes are slower, Live Migration proceeds without stalling and is able to stay ahead of the VM. This feature is optional, so if you don’t want to use it with certain VMs, it can be turned off; flexibility is key.
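The effect of throttling can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not Titanium Cloud's or the hypervisor's implementation: the function name, the 20% throttle step, the 90% ceiling and the "less than half the memory was cleared this round" trigger are hypothetical values picked to show the idea.

```python
def migrate_with_throttle(ram_mb, bandwidth_mb_s, dirty_rate_mb_s,
                          max_downtime_s=0.15, throttle_step_pct=20,
                          max_throttle_pct=90, max_rounds=50):
    """Toy pre-copy loop that slows the guest when it is not converging.

    Returns (rounds, final_throttle_pct), or None if the migration never
    fits inside the downtime budget. All names and defaults are
    hypothetical, chosen for illustration only.
    """
    remaining_mb = float(ram_mb)
    throttle_pct = 0  # percentage of vCPU time taken away from the guest
    for round_no in range(1, max_rounds + 1):
        copy_time_s = remaining_mb / bandwidth_mb_s
        if copy_time_s <= max_downtime_s:
            # Final pause-and-copy fits the budget: migration completes.
            return round_no, throttle_pct
        # A throttled guest dirties memory proportionally more slowly.
        effective_dirty = dirty_rate_mb_s * (100 - throttle_pct) / 100
        new_remaining_mb = min(float(ram_mb), effective_dirty * copy_time_s)
        # Poor progress this round: slow the guest a little more.
        if new_remaining_mb > remaining_mb / 2 and throttle_pct < max_throttle_pct:
            throttle_pct = min(max_throttle_pct, throttle_pct + throttle_step_pct)
        remaining_mb = new_remaining_mb
    return None


if __name__ == "__main__":
    # The VM dirties memory faster than the link can copy it (1200 vs 1000 MB/s).
    print(migrate_with_throttle(4096, 1000, 1200))
    # Same VM with throttling disabled: the migration never converges.
    print(migrate_with_throttle(4096, 1000, 1200, max_throttle_pct=0))
```

In this model the throttle only rises while progress is poor, so a VM that converges on its own is never slowed down. That mirrors the trade-off described above: the guest gives up some CPU time so that the migration can finish.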
There are other interesting changes as well, to name a few: the ability to dynamically update the maximum Live Migration interval (some VMs always take longer to migrate than others, and this helps to avoid timeouts); periodic logging of Live Migration throughput and estimated downtime; and a reduction of the default maximum timeout from 800s to 180s.
Taken together, these changes make Live Migration under the latest release of Titanium Cloud the best that Wind River has delivered to date. If you manage critical infrastructure using the cloud, Live Migration is an indispensable feature.