Rendering 16,000 Schematics in the Cloud with RabbitMQ and PhantomJS

Jun 20 2012, 2:30 PM PDT

Earlier this month we released two new features that upgrade CircuitLab's schematic rendering engine: presentation-quality schematic exports (PDF, PNG, EPS, SVG), and the highly-requested connection dots on schematics. However, as we were about to deploy these updates, we had to reprocess the 16,000+ saved circuits that the CircuitLab community had built since we launched our tool earlier this year. (Editor's note: that’s now 18,000+ circuits!) In this blog post, we're going to take a quick behind-the-scenes look at how CircuitLab leveraged the power of elastic cloud computing to re-render 16,544 circuits in just 57 minutes.

How CircuitLab Renders a Schematic

First, let’s take a look at how the process works for a single circuit. When one of our users saves a circuit from inside the editor, a compact JSON-encoded representation of the circuit is passed to our servers, where it is tagged and stored. This action generates a render request, because we need to produce preview thumbnails for display in various places around our website, like our public circuit pages. The actual generation of the images from the schematic involves an automated workflow integrating a wide variety of tools: mostly our custom-built Python and CoffeeScript processing code, combined with some of Inkscape’s SVG processing code, GraphicsMagick, Ghostscript, as well as a modified version of the CircuitLab schematic editor running in a headless WebKit browser powered by PhantomJS. All in all, creating our professional-quality schematic render outputs in a variety of formats is a CPU-intensive process, typically requiring 10 or more seconds of a single modern CPU core per schematic.

(This little PNG in a blog post doesn’t do it justice -- grab a PDF.)

At 10 core-seconds per schematic, we were looking at nearly 160,000 seconds -- that’s 44 core-hours -- of CPU time to re-render our entire dataset. But who wants to wait 44 hours for results?
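The estimate above is simple enough to check in a few lines. This is only a sketch of the arithmetic, using the rounded 16,000-circuit count and the approximate 10-second figure from the text; the function name is ours.

```python
# Back-of-the-envelope estimate; both inputs are the rounded figures from the post.
CIRCUITS = 16_000            # rounded number of saved circuits
SECONDS_PER_RENDER = 10      # approximate single-core time per schematic

def core_hours(circuits: int, seconds_each: float) -> float:
    """Total CPU time, in core-hours, to render every circuit once."""
    return circuits * seconds_each / 3600

total_seconds = CIRCUITS * SECONDS_PER_RENDER
estimate = core_hours(CIRCUITS, SECONDS_PER_RENDER)
print(f"{total_seconds:,} core-seconds = {estimate:.1f} core-hours")
```

On one core, core-hours and wall-clock hours are the same number, which is why the single-machine wait comes out at about 44 hours.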

Enter the Cloud

Fortunately, re-rendering these schematics is an overwhelmingly parallel problem. In theory, if we had 16,000 CPU cores available to us, we could start each CPU working on rendering a separate circuit, and we’d be done with all of them in just about 10 seconds. In practice, contention for shared resources like storage, network, and database makes it impractical to scale linearly to this level, and the finite time and effort required to spin new cores up and down means that it simply doesn’t work that way. However, the fundamental idea of elastic cloud computing is that virtualized computing resources can be turned on and off as needed. Cloud computing shines where the workload is variable or unknown, the extreme being cases like ours where an infrequent but large computing job is needed. Somewhere between 1 core and 16,000 cores lies a solution that gets our re-rendering job done much faster without requiring a major redesign of these potential bottlenecks.

With help from the team at M5 Cloud Hosting (still in beta!), the cloud computing division of M5 Hosting, we quickly had 8 virtual machines, all on separate hypervisors, dialed up to 8 cores each and ready to go for our re-render job.

Now with 64 high-speed CPU cores at our disposal, we had to find a way to distribute the re-rendering job across them.
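To get a feel for what 64 cores could buy us, here is a rough sketch of the best-case wall time at a few core counts. It assumes perfect scaling, with no contention for storage, network, or database and no startup cost, so it is a lower bound and not a prediction. The function name is ours, and the inputs are the rounded figures used earlier.

```python
import math

def ideal_wall_seconds(circuits: int, seconds_each: float, cores: int) -> float:
    """Lower bound on wall time: every core stays busy, with no contention or startup cost."""
    batches = math.ceil(circuits / cores)   # rounds of work each core must take on
    return batches * seconds_each

for cores in (1, 64, 16_000):
    minutes = ideal_wall_seconds(16_000, 10, cores) / 60
    print(f"{cores:>6} cores: {minutes:8.1f} minutes")
```

Under these idealized assumptions, 64 cores would need about 42 minutes. Any real run should be expected to take longer.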

Enter RabbitMQ

From initial concept, the rendering system of CircuitLab was built to scale, designed with a producer-consumer message-passing architecture. We developed our system using RabbitMQ as our message broker, which we selected due to its reliability, our team’s familiarity operating Erlang services, and its support for the AMQP messaging standard, with client libraries available in a wide variety of languages.

When a circuit is saved, the web server that receives the upload inserts a message into a render queue. RabbitMQ distributes that message to one of the many consumers listening to that queue, and that lucky consumer process gets down to business. While it sounds like we’ve added a lot of complexity for what might otherwise be a simple function call, we’ve added a layer that separates the request for work from the process that does the work, meaning it can be done at a separate point in time, or even on a separate machine. (Further, the web server process is no longer tied up waiting for that rendering job to complete -- an advantage that many newer event-driven frameworks like Erlang, Node.js, Ruby’s EventMachine, and Python’s Twisted all include as core features.)
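The shape of that producer-consumer split can be sketched with nothing but Python's standard library. To be clear, this is an in-process stand-in and not our actual code: queue.Queue plays the role of the RabbitMQ queue, threads play the role of consumer processes, and the names save_circuit and render_consumer, the message format, and the "render" step are all placeholders made up for illustration.

```python
import json
import queue
import threading

# In-process stand-in for a broker queue; a real deployment would publish to RabbitMQ.
render_queue = queue.Queue()
rendered = []                       # circuit ids "rendered" by the consumers
rendered_lock = threading.Lock()

def save_circuit(circuit_id: int) -> None:
    """Producer: enqueue a render request and return without waiting for the render."""
    render_queue.put(json.dumps({"circuit_id": circuit_id}))

def render_consumer() -> None:
    """Consumer: pull requests off the queue until a None sentinel arrives."""
    while True:
        message = render_queue.get()
        if message is None:         # sentinel: shut this consumer down
            render_queue.task_done()
            break
        request = json.loads(message)
        with rendered_lock:         # placeholder for the CPU-heavy render step
            rendered.append(request["circuit_id"])
        render_queue.task_done()

NUM_CONSUMERS = 4
workers = [threading.Thread(target=render_consumer) for _ in range(NUM_CONSUMERS)]
for worker in workers:
    worker.start()

for circuit_id in range(20):
    save_circuit(circuit_id)

for _ in workers:
    render_queue.put(None)          # one sentinel per consumer
render_queue.join()                 # block until every message has been handled
for worker in workers:
    worker.join()

print(sorted(rendered) == list(range(20)))
```

The point to notice is that save_circuit returns as soon as the message is queued. Which consumer picks it up, and when, is no longer the producer's concern.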

We’ve also glossed over a lot of important details, like monitoring nodes and consumers, handling retries, communicating results, consumer message buffering, parallelizable subtasks, timeouts, access to shared resources, networking/tunnels/firewalls, and error handling, all of which are engineering aspects to address if you’re considering your own parallel architecture for distributing any sort of computing job. Almost all of that complexity exists before we move a single job to another machine!

Namespaced Environments

At CircuitLab, we have several global namespaces, which are essentially containers for all of the data within our universe. Our web servers, databases, file storage interfaces, in-memory caching, backups, and our render consumer system all are separated by these namespaces. The most obvious and important of these namespaces is simply “production”, where the live site runs. However, at any time we also have one or more “dev” namespaces, where our team does development and testing before any code is deployed to the live site. In those cases, data is neatly segregated, and no information other than source code crosses the line between “production” and “dev”. All of these are neatly separated by a mechanism appropriate to each application: separate domain names, separate database names, separate memcached key prefixes, and for RabbitMQ, separate “virtual hosts”.

However, for this re-rendering task, we had to introduce a slight cross between these otherwise-distinct groups. We needed to operate on the data from “production”, using the same set of consumer processes, but with a separate queue so that this reprocessing task with its 64 cloud consumer cores would not interfere with the operation of the live site. We again called upon RabbitMQ’s named virtual hosts to provide a parallel set of queues, “production-reprocess”. This allowed us to connect our temporary cloud of consumers only to those queues, and we could configure our message producers to use that set of queues as well, while any circuits saved in “real time” during the operation would still go to our normal render consumer farm.
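One way to picture the split is as a mapping from namespace to broker address. In an AMQP URI, the path component selects the virtual host, so the live and reprocessing queues can live on the same broker under different addresses. The sketch below is illustrative only: the host name is hypothetical, the helper names are ours, and the "-reprocess" suffix rule is inferred from the single vhost name mentioned above.

```python
from urllib.parse import quote

BROKER_HOST = "broker.example.internal"   # hypothetical host name
BROKER_PORT = 5672                        # default AMQP port

def vhost_for(namespace: str, reprocess: bool = False) -> str:
    """Virtual host name for a namespace; bulk jobs get a parallel vhost."""
    return f"{namespace}-reprocess" if reprocess else namespace

def broker_uri(namespace: str, reprocess: bool = False) -> str:
    """AMQP URI whose path component selects the virtual host."""
    vhost = quote(vhost_for(namespace, reprocess), safe="")
    return f"amqp://{BROKER_HOST}:{BROKER_PORT}/{vhost}"

live_uri = broker_uri("production")                  # normal render consumer farm
bulk_uri = broker_uri("production", reprocess=True)  # temporary cloud consumers
print(live_uri)
print(bulk_uri)
```

Consumers pointed at the second URI never see messages published to the first, which is what keeps a bulk job from starving renders for circuits saved in real time.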

The Results

With just a bit of modification to our existing codebase, we were able to quickly deploy our render consumers to our shiny new 64-core cloud. We injected 16,544 messages into a queue, each representing a circuit that needed to be re-rendered, and stood back to watch as the consumers all fired up to grab a task and get to processing. In the end, the task finished in just 57 minutes of wall time. That’s (57/60)*64 = about 61 core-hours. Not bad in comparison to our initial rough guess of 44 core-hours!
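For anyone who wants to check the numbers, here is the same arithmetic as a short sketch. Note that it counts every core as busy for the full 57 minutes, so it measures reserved core time, not time spent strictly on rendering.

```python
CIRCUITS = 16_544            # messages injected into the queue
CORES = 64                   # 8 virtual machines x 8 cores
WALL_MINUTES = 57            # wall time for the whole job
ESTIMATED_CORE_HOURS = 44    # the initial rough guess

core_hours_used = WALL_MINUTES / 60 * CORES
core_seconds_per_circuit = core_hours_used * 3600 / CIRCUITS
actual_vs_estimate = core_hours_used / ESTIMATED_CORE_HOURS

print(f"core-hours used: {core_hours_used:.1f}")
print(f"core-seconds per circuit: {core_seconds_per_circuit:.1f}")
print(f"actual / estimate: {actual_vs_estimate:.2f}")
```

That works out to roughly 13 core-seconds per circuit, compared with the 10 we guessed at the start.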

The emergence of elastic cloud computing makes it possible to slice and dice CPU time to the level of individual core-hours. While there’s no single price or performance point to compare universally, a typical price is $0.03 to $0.10 per core-hour (plus bandwidth, RAM, storage, etc.). Because we can turn the cloud machines on and off on demand, our entire 16,000+ circuit job took just a few dollars of CPU time to run!
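As a rough check on "a few dollars", the sketch below multiplies the core-hours from our run by the two ends of that typical price range. It covers CPU time only and leaves out bandwidth, RAM, and storage; the function name is ours.

```python
CORE_HOURS = 57 / 60 * 64              # wall time in hours times 64 cores
PRICE_LOW, PRICE_HIGH = 0.03, 0.10     # dollars per core-hour, typical range

def cpu_cost(core_hours: float, price_per_core_hour: float) -> float:
    """CPU-only cost in dollars; excludes bandwidth, RAM, and storage."""
    return core_hours * price_per_core_hour

low = cpu_cost(CORE_HOURS, PRICE_LOW)
high = cpu_cost(CORE_HOURS, PRICE_HIGH)
print(f"CPU cost: ${low:.2f} to ${high:.2f}")
```

That puts the CPU bill somewhere between roughly $2 and $6 for the whole job.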

The bottom line: moving even easily-parallelizable computation to the cloud presents a number of systems architecture challenges and added software and sysadmin complexity. However, if you design your systems with this possibility in mind, the flexibility provided by inexpensive on-demand computing power can enable you to tackle big workloads when you need to.

Give it a try! Take one of the sample circuits for a spin in CircuitLab, our online circuit simulator and schematic editor.


Comments

1" When a circuit is saved, the web server that receives the upload inserts a message into a render queue. RabbitMQ " Don't need to do this. When people check it,then render it,then cache it.

2 Inkscape is GPL not LGPL .You can't use it when you don't like to opensource.

by mcuhack
June 20, 2012

Hi @mcuhack #1: While we didn't describe the exact workflow our system uses, suffice it to say that the render happens when we need it to for the rest of our website to work properly. #2: As a matter of company policy I can't comment on any specific legal issues, but your understanding of these software licenses, as well as the various mechanisms by which different pieces of software can interact, is incomplete.

by mrobbins
June 20, 2012
