When I was living in Singapore I “accidentally” got involved in maintaining a community site for expats in Singapore. To this day, once in a while I still do a bit of webmastering on the site as one of my side activities/hobbies.
As part of the activities on that site, there is a regular newsletter to all community members. Since the site was launched, approximately 350k members have signed up, but many of them have moved on to another country or changed their email address. At the same time, despite various anti-spam mechanisms, the member database is also getting polluted with email addresses that don’t exist or that belong to spammers. It is important to keep those email addresses out of the database, as sending newsletters to 350,000 members can be quite expensive, but also risky: it only takes one email address that used to belong to a spammer but has since been converted into a honeypot to get you blacklisted.
To prevent those rubbish email addresses from getting into the final list of email addresses for the newsletter, some serious scrubbing needs to be done. A lot of the scrubbing is done by Data Validation, but before the email addresses are sent to them, they go through some sanity checks first. Of course the format is checked and the domain is verified against a list of disposable email domains, but the email addresses are also verified against a database of email and IP addresses that belong to spammers.
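To give an idea of what such sanity checks can look like, here is a minimal shell sketch. It is not the actual code running on the site: the function name is_sane_email and the two flat files (one disposable domain per line, one known spammer address per line) are assumptions for illustration, and the flat spammer file merely stands in for a lookup in a real database.

```shell
# Hypothetical pre-check, meant to run before an address is handed to an
# external validation service. All names and file layouts are assumptions.
#   $1 = email address
#   $2 = file with disposable domains, one per line, lowercase
#   $3 = file with known spammer addresses, one per line, lowercase
is_sane_email() {
    addr=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    disposable_file=$2
    spammer_file=$3

    # 1. Rough format check: local@domain.tld, no whitespace, exactly one "@".
    printf '%s\n' "$addr" \
        | grep -Eq '^[^@[:space:]]+@[^@[:space:]]+\.[a-z]{2,}$' || return 1

    # 2. Reject domains on the disposable list (fixed-string, whole-line match).
    domain=${addr##*@}
    if grep -Fxq -- "$domain" "$disposable_file"; then
        return 1
    fi

    # 3. Reject addresses that are on the spammer list.
    if grep -Fxq -- "$addr" "$spammer_file"; then
        return 1
    fi

    return 0
}
```

The function returns 0 for an address that passes all three checks and 1 otherwise, so it can be used directly in an if statement while looping over an export of the member list. The format check is deliberately loose; it only filters out obvious garbage.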
Over time, I have collected a list of approximately 30mln email addresses and 5mln IP addresses that belong or belonged to spammers. I have tried to upload these email and IP addresses into regular databases such as MySQL and PostgreSQL, but unless I gave the DB a huge amount of memory, the upload process became extremely slow as the number of records in the database increased. Because of that, I have fallen back to Cassandra, as Cassandra never gave me any problems while uploading large datasets, even when I squeeze it into a relatively tiny virtual machine.
Why the unikernel approach?
To share resources as much as possible, I had previously installed Cassandra into a Docker environment, sharing resources with Jenkins, Nexus, SVN, a Cron runner and some home automation containers. I can tell you, Cassandra can be quite a jealous b*tch! When those other processes are also claiming resources and Cassandra has less memory than it expects, it will crash really badly. When this had happened several times, I decided to give Cassandra its own virtual machine on my home XenServer. The only thing about virtual machines is that they cause so much more pain than just a Docker container. When you build a virtual machine, you’ll almost always have an entire OS running in it, which needs to be updated, secured, etc.
But that has changed with unikernels, the next logical step after OS-based virtual machines and (Docker) containerization. A unikernel is basically an application compiled with everything it requires to operate, including:
- application code
- required app libraries
- the runtime (JVM)
- required system libraries in the form of a Library OS
- a RAMdisk and PV kernel
- all of it packaged as a single artifact that can be run on a hypervisor
But it doesn’t have the typical OS features such as multi-user support, multi-tasking, device drivers for all those legacy devices out there, and the many services that are a potential security risk. Instead, the historical baggage which isn’t needed on a modern system is replaced with an OS library that links API calls from the application to the virtualized devices of the hypervisor. This library is often linked into the application, so that the application actually becomes the kernel (or the other way around). Unikernels are also meant to run just one application. That application may in turn spin off several threads if the application’s runtime (e.g. the JVM) supports threads, but it may not, for example, fork() or exec().
This sounds ideal to make the smallest VM possible: No OS at all and the application is the kernel!
How to build it
While searching the web for Java-based unikernels, I stumbled on OSv, which allows you to run your Java or POSIX apps as a unikernel. It turned out that OSv even has ready-made Cassandra OVA images available that you can import straight into your hypervisor.
Of course I wanted to hack around with it, and just taking a ready-made appliance from the site wouldn’t satisfy my hacker genes. So I went ahead and created a nice little Ubuntu VM that would help me build OSv images. I started with a bare Ubuntu 14.04 server (x64) VM on my XenServer and got the build environment running with:
# Install Git and make a clone of OSv
sudo apt-get update
sudo apt-get install git
sudo git clone https://github.com/cloudius-systems/osv.git
# Enter the clone and make sure all git submodules are up-to-date
cd osv
sudo git submodule update --init --recursive
After that, there is a brand new OSv build environment in my virtual machine, all ready to rumble. So let’s build that Cassandra image:
# This will take a very long time when it's running the first time as there are
# a lot of things that need to be compiled
sudo scripts/build image=cassandra,httpserver
When the build is ready, it has produced a QCOW2 image in the build/release directory. However, my XenServer doesn’t take beef, so I had to convert that image to another flavor. As qemu-img is part of the OSv build environment, I decided to use that tool to turn the QCOW2 image into a VMDK, which is a format that my XenServer can consume.
sudo qemu-img convert -O vmdk build/release/usr.img build/release/usr.vmdk
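If you convert images more than once, the conversion and a quick check of the result can be wrapped together. The sketch below is only an illustration: the function name convert_to_vmdk is made up, and it assumes qemu-img is on the PATH. It derives the .vmdk name from the input file and then runs qemu-img info so you can confirm the reported file format before copying the file anywhere.

```shell
# Hypothetical wrapper around the conversion step above.
#   $1 = path to the source image, e.g. build/release/usr.img
convert_to_vmdk() {
    src=$1
    # Swap the file extension: build/release/usr.img -> build/release/usr.vmdk
    dst=${src%.*}.vmdk

    # Convert to VMDK; stop here if the conversion fails.
    qemu-img convert -O vmdk "$src" "$dst" || return 1

    # Print format and size details of the freshly written image.
    qemu-img info "$dst"
}

# Example (prefix with sudo if the build directory is owned by root):
#   convert_to_vmdk build/release/usr.img
```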
After copying the file /opt/osv/build/release/usr.vmdk to a Windows share, I could easily import it into my home XenServer. XenServer will ask you to confirm the VM dimensions, which I happily did, except for the available memory. Instead of the default 1024MB, I chose a bit more and changed it to 2816MB, which lets Cassandra run excellently.
You are probably wondering what it looks like when the VM boots, so I’ve included a video below. Please note that my XenServer has a far from ideal storage array. Storage is based on WD Green disks that don’t perform great, but do keep my energy bill within certain limits.
You can see that there is hardly an OS booting up. Most of the time is taken by Cassandra starting up. Great!
If you’re wondering what the httpserver in the “sudo scripts/build image=cassandra,httpserver” statement is doing: it adds a diagnostics and REST application to the virtual machine that allows you to tinker around with your new VM using a web browser on port 8000. The diagnostics section even has a special page for Cassandra, allowing you to view metrics that you would otherwise only get to see if you buy DataStax Enterprise (the commercial version of Cassandra). Please find a screenshot below.
The REST interface, which is also supplied by the httpserver module, even allows you to control and tweak settings of your application and its runtime environment.
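The REST interface can also be queried from a shell instead of a browser. The helper below is a rough sketch rather than a documented client: the name osv_get, the address 192.0.2.10 and the paths /os/version and /os/uptime are assumptions for illustration, so check the web interface on port 8000 for the paths your own build actually exposes.

```shell
# Hypothetical helper: GET a path from the OSv httpserver module on port 8000.
#   $1 = address of the OSv VM
#   $2 = API path (assumed here, verify it against your own build)
osv_get() {
    host=$1
    path=$2
    # -f: treat HTTP errors as failures, -sS: no progress bar but keep errors
    curl -fsS "http://${host}:8000${path}"
}

# Example calls (address and paths are placeholders):
#   osv_get 192.0.2.10 /os/version
#   osv_get 192.0.2.10 /os/uptime
```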
Runs faster too!
I really love the concept of the unikernel. I love the fact that it is way less greedy for resources, but who wouldn’t like side effects such as increased security, performance and scalability? Without having to support all kinds of legacy devices, and without those numerous “obscure” services running in the background, benchmarks seem to show that applications run faster. There have been several benchmarks, but please find the comparison of Cassandra on OSv vs CentOS below.
Source (containing a lot more benchmarks): http://events.linuxfoundation.org/sites/events/files/slides/OSvAModernSemiPOSIXLibraryOS.pdf
The unikernel approach is advertised as “ideal for cloud”. Although I have applied it to relieve my home XenServer of unnecessary resource usage, I do strongly believe in the cloud case as well. For a cloud OS, improved security, performance and scalability mean a lot, as addressing these issues is at the heart of cloud operations and resource sharing. As a result of using unikernel VMs, I believe the next-generation cloud can have thousands of small VMs per host, saving considerable cost in hardware, electricity, and cooling, while reducing the attack surface available to malicious hackers.
Unikernels allow for the careful management of particularly critical portions of an organization’s data and processing needs both on-premise as well as in the cloud. While it does take some extra work to build unikernel applications, it’s getting easier every day as more developers address orchestration, logging and monitoring challenges. At the same time, unikernels are no harder to package and deploy than Docker containers and compared to the OS on a VM approach, you will save time installing and maintaining a full OS.