
Batch Jobs in the Cloud

James F. Carter <jimc@math.ucla.edu>, 2011-11-17

In a scientific research computing environment, various users each submit multiple long-running jobs to be executed on a plethora of hosts, often but not necessarily homogeneous and integrated into a cluster. It takes software infrastructure to make these jobs run. Our present job queueing system is several major versions out of date, and we now have a new compute cluster, so we need either to pick a modern job queueing system or to refurbish the one we have. In addition, cloud virtualization is now feasible and offers significant advantages.

Advance Conclusions

Brief conclusions follow, one for each of the sections below.
Mathnet's Users and Jobs

Our users want bulk CPU power and don't want a lot of sophisticated features that they have to learn. They will accept whatever we give them if they can just run their jobs.

Cloud Computing Using Virtualization

When we have extremely long jobs (a month), we gain a lot in flexibility, reliability and responsiveness to the users if we run such jobs in a virtual machine. It isn't clear yet whether these VM's should be created for each job or should be permanent.

Workflow on a Virtual Machine

Creating a virtual machine is nontrivial but is not a prohibitive burden, if we go with creating virtual machines on the fly.

Available Cloud Managers

Cloud manager software exists for creating a public cloud of virtual machines. I doubt that we want a public cloud. We should create our virtual machines using ad-hoc scripting. Permanent virtual machines can be considered, versus creating them on the fly for each job.

Job Queueing Systems

Sun Grid Engine as we know it is expired; if we put in Open Grid Scheduler (its credible successor) it's equivalent to a whole new queueing system. The currently hot job queueing package, possibly the market leader, is SLURM.

Hypervisors

We need to pick a hypervisor, and there are several credible choices; Xen does not win by default. Jimc sees KVM as the leading contender.

Mathnet's Users and Jobs

The users and jobs in our environment have these characteristics:

Whatever software infrastructure we pick, we as sysadmins want it to give these services to our users:

Cloud Computing Using Virtualization

In the previous millennium, several job queueing systems were developed which accept a user's job and run it on one of a collection of hosts. In the present day, any relevant hardware can run a virtual machine, and several cloud managers are popular, whose job is to make a virtual machine appear at the user's request; he then runs his job there. Our initial decision is whether to stick with the job queueing model, adopt a cloud (virtualized) model, or use some hybrid. Here is a list of advantages and disadvantages of the cloud model.

I am coming to the conclusion that we should run our users' jobs in individual virtual machines which are created on the fly for each job. An alternative is to pre-create and boot up the virtual machines, whereupon a traditional job queueing system can utilize them. But pre-created machines can be paused and checkpointed just as on-the-fly ones can, as long as the job queue manager does not become upset when it temporarily cannot contact the job shepherd process.
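
As a concrete illustration, pausing and checkpointing a guest is a one-call affair once the machines are managed through libvirt. The sketch below is only an assumption about how we might do it (the domain name and save path are made up, and it presumes a guest named job-1234 already exists), not a tested procedure:

    import libvirt

    conn = libvirt.open("qemu:///system")    # local hypervisor
    dom = conn.lookupByName("job-1234")      # hypothetical per-job guest

    dom.suspend()                            # pause in place; guest stays in RAM
    dom.resume()                             # continue where it left off

    # Or checkpoint the whole guest to disk, freeing the host entirely;
    # the job resumes later, possibly after a host reboot.
    dom.save("/var/lib/libvirt/save/job-1234.img")
    conn.restore("/var/lib/libvirt/save/job-1234.img")

    conn.close()

The queueing side only has to tolerate the job shepherd being unreachable while the guest is saved.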

Workflow on a Virtual Machine

What would the user experience be like in such a regime? And how would the job queueing infrastructure be designed? The users would go through these steps:

What's being described here is the operation of a standard job queueing system, not a public cloud. While the operations needed to create a virtual machine for each job are fairly extensive, they are well within the reach of normal scripting.
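
To give a feel for the scale of that scripting, here is a minimal sketch of a per-job virtual machine generator. The image path, job name and resource sizes are assumptions for illustration, and error handling and cleanup are omitted:

    import subprocess

    BASE_IMAGE = "/srv/vm/base-opensuse.qcow2"   # assumed read-only golden image
    JOB_ID = "job-1234"                          # hypothetical job identifier

    # 1. Thin copy-on-write overlay: guest writes land here, the base stays pristine.
    overlay = "/srv/vm/%s.qcow2" % JOB_ID
    subprocess.check_call(
        ["qemu-img", "create", "-f", "qcow2", "-b", BASE_IMAGE, overlay])

    # 2. Define and boot the guest around the overlay (KVM managed by libvirt).
    subprocess.check_call(
        ["virt-install", "--name", JOB_ID,
         "--ram", "4096", "--vcpus", "2",
         "--disk", "path=%s,format=qcow2" % overlay,
         "--network", "bridge=br0",
         "--import", "--noautoconsole"])

    # 3. The job script would then be copied in over ssh; when the shepherd
    #    reports completion, the guest is destroyed and the overlay deleted.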

Available Cloud Managers

At present there are several commercial cloud vendors, such as Amazon EC2 and Microsoft Azure. Their business model is to create a virtual machine for you and charge you proportional to the amount of time you use it (plus network communication charges). There are several commercial and open source cloud managers by which the owner of a hardware cluster can implement the same business model. But that is not exactly what we want to do with our cloud. Nonetheless, let's look at published evaluations of various cloud managers.

This article gives quite a lot of useful information about three popular cloud managers:

A Comparison and Critique of Eucalyptus, OpenNebula and Nimbus by Peter Sempolinski and Douglas Thain, Univ. of Notre Dame, dated probably 2009.

They approach these systems as cluster managers, in anticipation of picking a job queueing system for a new cluster. Here is their summary table:

Item              | Eucalyptus                     | OpenNebula                         | Nimbus
Philosophy        | Imitate Amazon EC2             | Private customizable cloud         | For scientific research
Customization     | Not too much                   | Totally                            | Many parts
Internal Security | Tight, root required           | Looser, can be raised              | Fairly tight
User Security     | Custom credentials             | Log into head node                 | Register X.509 cert
Ideal Use Case    | Many hosts, semi-trusted users | Fewer hosts, highly trusted users  | Semi-trusted or untrusted users
Network Issues    | DHCP on head node only         | Manual (fixed IP?)                 | Needs dhcpd on every node; Nimbus assigns the MAC

These three cloud managers are oriented to providing virtual machines for the users to run their jobs in, rather than running jobs on the bare metal. Their emphasis is on providing cloud computing on a local cluster. They mention Sun Grid Engine, Condor and Amazon EC2 as systems with alternative philosophies. In their setting, typical users need a fairly large number of cores, and the users know how to handle coordination between multiple virtual machines, e.g. via MPICH.

It is important to remember that a lot of infrastructure is needed in addition to the cloud manager. The authors identify these issues in particular:

The authors note, when testing cloud managers, that invariably the most frustrating aspect was to reconcile the software's assumptions about network organization with what the actual network was willing to provide.

Another useful resource is:

FermiCloud: Technology Evaluation and Pilot Service Deployment (slides from a talk, PDF or PowerPoint, dated 2011-03-22).

This slideshow describes the whole process of setting up a cloud service, in which the cloud manager is only one part of the puzzle. They also evaluated Eucalyptus, OpenNebula and Nimbus (one packed slide for each), and ended up deploying all three. As for the distribution of emphasis, the physical nodes are dual quad-core Xeons, and they put 8 on OpenNebula, 7 on Nimbus and 3 on Eucalyptus. They conclude that OpenNebula does the best at meeting their requirements and that they will focus on it in the future. Its main positive aspect is that it can be adapted easily to their needs (which differ from Mathnet's).

Here are summaries of the features of these three cloud managers:

Eucalyptus

ATS is, or recently was, investigating Eucalyptus as the cloud manager for one of its big clusters. See this Wikipedia article about Eucalyptus. Its major features are:

OpenNebula

This cloud manager is suited to a medium-sized cluster in which the users are trustworthy. Its major features are:

Nimbus

This cloud manager is advertised as being for scientific computing. See this rather skimpy Wikipedia article about Nimbus. Its major features are:

Do It Yourself

I'm afraid that all of these cloud managers are overkill for us. The steps to create a virtual machine have been detailed above, and the professional cloud managers provide a lot of services beyond this that are of little or no value to us. It would be prudent to at least install them and try them out, but we may very well decide that this layer of complexity is not worth carrying, and that we want to write our own virtual machine generator.

Job Queueing Systems

Since about 2002 we have used Sun Grid Engine (SGE) as our job queueing system on the Sixpac and Bamboo clusters, later extended to the Nemo cluster. Prior to that we used PBS on the Ficus cluster. We now have a new cluster, called Joshua, and it is time to think about upgrading our job queueing system, or at the very least, refurbishing SGE with the latest version.

The relation of the virtual cloud to the job queueing system can take two forms: the queueing system can create a virtual machine on the fly for each job, or it can dispatch jobs to permanent, pre-created virtual machines.

These are the job queueing systems extant in late 2011:

Sun Grid Engine (SGE)

This is the devil that we know. But our version is almost 10 years old, and the installation is a mess, so we really should upgrade to the latest version. However, in the aftermath of economic chaos Sun Microsystems no longer exists, and the product is now known as Oracle Grid Engine. See this Wikipedia article about SGE for the history. Due to license and ownership changes the project has forked as follows:

Portable Batch System (PBS)

We used this software on the Ficus cluster in the 1990's. Administrative experience was pretty bad. However, the command line interface of PBS has been enshrined in the POSIX standards, and in fact PBS does much of what you would want a job scheduler to do. It is not too clear how much interest there currently is in PBS, or what the license status of the various forks is, but one could consult this Wikipedia article about PBS. Apparently the fork that is currently in active use is TORQUE.

Condor

This software, from University of Wisconsin at Madison, has been around since the 1990's and is actively used and developed. It is subject to the Apache license (open source). It is documented for both cluster environments and cycle scavenging on end-users' workstations when otherwise idle, for which purpose it can do power management such as wake-on-LAN and hibernation. If jobs are linked with their hacked I/O library they can be paused and restarted, possibly on a different host. However, they have bowed to reality and can deal with non-pausable jobs.

Simple Linux Utility for Resource Management (SLURM)

This software, licensed under GPL2, is currently actively used and developed. It appears to be a no-frills queueing system but it has plugins for many extensions.
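
For a sense of what the user-facing side would look like, here is a rough sketch of wrapping a long job in an sbatch submission from Python; the command, job name and time limit are placeholders, not a recommended configuration:

    import subprocess

    def submit(command, name="mathnet-job", hours=720, ntasks=1):
        """Hand a shell command to SLURM; 720 hours covers a month-long job."""
        out = subprocess.check_output(
            ["sbatch", "--job-name", name,
             "--time", "%d:00:00" % hours,
             "--ntasks", str(ntasks),
             "--wrap", command])
        # sbatch answers with a line like "Submitted batch job 42"
        return out.decode().split()[-1]

    print(submit("./long_simulation --input data.dat"))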

Given the license situation and the development ambiguity I am inclined not to try to deal with Open Grid Scheduler; because it is ten years ahead of our present SGE version, I'm pretty sure that we would not just be upgrading SGE, we would be putting in a whole new job queueing system.

To my mind PBS (or its modern equivalent, Torque) is even less attractive given our poor experience with the original product.

Condor is intriguing, but we have no investment in this product and it may not be as flexible as we might want. I wouldn't object to investigating it, but am not advocating it either.

SLURM appears to be the up-and-coming job queueing system, and may actually be the current market leader, given the number of TOP 500 clusters that use it. Information from their website suggests that it should be easy to try out and that it is likely to meet our needs once tried. I recommend that we start our search with this product.

Hypervisors

If we are going to have virtual machines, we need a hypervisor to control them. They come in two variants. In the bare metal style the hypervisor manages the hardware and one of the virtual machines, called Dom0 in Xen, has the management role and has special privileges with the hypervisor. In the userland style there is a normal kernel and operating system, and the hypervisor runs as a normal user process, possibly with special kernel support.

Here's a list of the currently popular hypervisors.

User Mode Linux (UML)

This is the prototype of the userland hypervisors. Unfortunately it seems to have lost momentum in the face of newer, better-supported alternatives. One of its very nice features was copy-on-write: it was possible to have a readonly filesystem image, used by multiple guests, overlain by a sparse file per guest containing only the changed sectors, which therefore was much smaller.

Jimc is disappointed that UML has gone the way of the dodo.

VMWare

This was the first commercial product for virtualization. Earlier versions used the userland style, and likely this continues today. It is well regarded and widely supported by cloud managers. VMWare Workstation 8, the current version, costs $200 for a standard license, or $120 for academics. (Quantity discounts may exist, but I did not investigate.) Each guest can have up to 8 cores, 2e12 bytes (2 TB) of disc, and 2^36 bytes (64 GB) of memory. It is not clear how many simultaneously running guests your license allows. The license appears to be for one physical host, so it would cost $1920 to cover the Joshua cluster, and more in the likely case that we upgrade the job queueing system on the existing clusters.

Jimc does not expect that $1920 will be forthcoming.

Hyper-V or Viridian

This is Microsoft's hypervisor, of the bare-metal style. It requires Windows Server 2008 x86_64 to be the master instance (Dom0). It can support up to 4 cores per guest, and up to 384 guests per host. Guests can be 32-bit or 64-bit (x86_64). The guest OS can be Linux or all variants of Windows (excluding home versions). Hyper-V is a no-charge option. Windows Server 2008 is not free, but I believe that our site license covers it.

Jimc does not seriously expect Hyper-V to be adopted for our compute clusters, but is including this hypervisor for completeness.

Oracle VirtualBox

VirtualBox is widely used, particularly in small-scale projects, and is well regarded. In a 2010 survey, more than 50% of respondents who used a virtualization product used this one. Jimc has had good experience with Sun's VirtualBox. See this Wikipedia article about VirtualBox. Here are its major features:

Xen

Xen uses the bare metal style. It has the major advantage that it is included in the main OpenSuSE distro, and Mathnet has limited and occasionally rocky experience with it. See this Wikipedia article about Xen. Here are some of its features:

Kernel-Based Virtual Machine (KVM)

KVM is currently in active use on TOP 500 clusters and is supported by all of the cloud managers considered. It is provided in the main OpenSuSE distro. But nobody at Mathnet has used it so far. See this Wikipedia article about KVM.
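
Trying KVM out by hand is cheap; the sketch below checks that the node exposes /dev/kvm and boots a guest image directly, with no cloud manager in the way. The image path and resource sizes are illustrative assumptions, not a tested configuration:

    import os
    import subprocess

    assert os.path.exists("/dev/kvm"), "kvm module not loaded on this node"

    # Boot a guest straight from qemu-kvm; this blocks until the guest shuts down.
    subprocess.check_call(
        ["qemu-kvm",                 # packaged as qemu-system-x86_64 on some distros
         "-m", "4096", "-smp", "2",
         "-drive", "file=/srv/vm/job-1234.qcow2,format=qcow2",
         "-nographic"])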

Here is a summary of the hypervisors:

Name    | Style | Active | License | In Distro | Cloud Mgr Support | Used By | Used For
UML     | User  | No     | GPL     | No        | No                | jimc    | OS installation testing
VMWare  | User  | Yes    | $120    | No        | Yes               | jimc    | Evaluation
HyperV  | Bare  | Yes    | Comml   | No        | No                | none    | none
VirtBox | User  | Yes    | ~GPL    | No        | ?                 | jimc    | Win7 TurboTax
Xen     | Bare  | Yes    | GPL     | Yes       | Yes               | charlie | WinXP remote desktop
KVM     | User  | Yes    | GPL     | Yes       | Yes               | jimc    | Development, Turbo Tax

For virtualization it is very important to select a good hypervisor. Jimc is more inclined to pick the userland style, or at least to not pick Xen, because while we have not had actual incompatibilities with Xen, it feels safer to have our real kernel running on the bare metal. Also we have had odd networking effects with Xen, which may or may not be our fault and which may or may not work out better with another hypervisor.

I am familiar with VirtualBox, in the sense that it's the devil I know, and I'm pretty sure it would serve our needs. However, I mistrust ongoing support from Oracle, and if we go with one of the formal cloud managers, I'm not sure if they can handle VirtualBox.

KVM appears to be popular on major high performance clusters (as is Xen), definitely would serve our needs, and is supported by the formal cloud managers. I think we should start our evaluation of hypervisors with KVM.