Engineering at Wistia: Help Wanted
I’m a big believer in having constrained resources: it forces you to focus on the right opportunities. But even I’ve started to think we’re crazy for how lean we’re running engineering over here! There are just four of us. We develop new features for the product, manage networks of hundreds of servers, and analyze massive amounts of data in real time. Last year, our video processing platform encoded over 16 years of video. And we served over half a billion videos all over the world!
It’s not so much that we’re having trouble keeping up (though that happens too, now and again). It’s more that there are so many great projects to work on, and we’re at a scale where each of these things has a huge impact.
One resource that’s being stretched particularly thin is front-end development. And that’s saying a lot, because Max Schnur (who does 99% of our front-end work) is super human.
So we’re looking for someone to join our small but impossibly productive engineering team. The way we work isn’t for everyone though. There are a number of opinions that are strongly held here:
- Do it right. We focus on the long term. We love to be productive, but deadlines are the first thing to be sacrificed in the name of quality!
- Think for yourself. Nothing’s worse than dogma — whether it’s around programming languages, data stores, development methodologies, what have you. We do what’s right for us, and what’s right for us changes all the time.
- Simplicity over everything. Maybe it’s just that my tiny pea brain can’t easily make sense of hundreds of lines of spaghetti code or some fancy new business model. Hopefully not though. It’s really hard to make things simple and elegant, but it’s completely worth the effort.
- Be humble. “Think you’re so great?” is the catch-phrase form of this that’s taken Wistia HQ by storm lately. I doubt this point needs any explanation.
Sound interesting? Take a look at this job posting for more details on what we’re looking for in a front-end developer.
Also, we’re planning to hire for a bunch of engineering positions this year. So feel free to get in touch with me directly if you have questions or just want to chat: brendan @ wistia.
Zero Downtime Datacenter Migration
Through some clever manipulation, we migrated one of our core services to a new datacenter in the middle of a weekday, during peak traffic, without downtime. This is the story of how we did it.
Why the move?
First, some back story on why we moved datacenters in the first place.
We’ve been longtime Slicehost customers here at Wistia. All our machines were in their original datacenter in St. Louis (STL-A, if you’re in the know).
Earlier this year, we learned that Rackspace (who now owns Slicehost) was planning to do away with that datacenter, and we’d be forced to move.
We were a bit nervous when we first heard this because our architecture was no longer just a few boxes, and we’re fortunate to now have lots of customers who depend on our software. No more secret 4am database juggling while watching the logs like hawks to make sure no one is using the service. People don’t stop using it now!
Which service to move first?
So what does our architecture look like? The three major components are the Wistia application itself, our video encoding platform (which we call the Bakery), and our video analytics platform (the Distillery).
They’re all extremely different in the types of resources they need and the components that make them up.
After some noodling, we decided the Bakery was the best candidate to move first. We were still working on a big upgrade to the Distillery that wasn’t quite ready yet, and the Wistia app is particularly tricky to move because it has hundreds of touch points with the customer and there are lots of little things that can go wrong (as opposed to a service consisting solely of a tight API).
The Bakery’s architecture
So what’s the Bakery look like? It’s pretty simple, really. There are three components: a database, Primes, and Breadroutes.
We have a single MySQL database that stores information about all the media customers upload. It’s nothing special. Our schema is very minimal, and there’s not any significant load on the database.
Then there’s what we call Primes. These are the main building block of the Bakery. Each Prime is a standalone Bakery in and of itself. It can accept media uploads, transcode video, store video, and serve up video. The actual pieces of software doing the work here are Nginx, Unicorns running a Rails app, and a custom task processing system written in Ruby called the Oven (keeping with the Bakery analogy, obviously).
Finally, there’s the Breadroute. This is a routing layer that sits in front of the Primes and balances traffic. It’s not a simple round-robin load balancer though. It has access to the database, so it can make smart decisions about where to route each request. For instance, if you request a video and it’s available locally on a Prime in the cluster, it will route your request to that box. In this way, the Breadroutes allow all the Prime boxes to function together as a unit. The Breadroute is made up of four Ruby proxy servers built on top of Tom Preston-Werner’s lovely proxy_machine, all sitting behind HAProxy.
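To make the “route to the box that already has the video” idea concrete, here’s a minimal Ruby sketch of that kind of database-backed routing decision. Everything here (the `MEDIA_LOCATIONS` lookup, the host names, the method name) is hypothetical for illustration, not Wistia’s actual Breadroute code:

```ruby
# Hypothetical sketch of a database-backed routing decision.
# In the real system this lookup would hit the Bakery's MySQL database;
# here it's just an in-memory hash.

MEDIA_LOCATIONS = {
  # media_id => the Prime that has this media on local disk
  "abc123" => "prime-2.internal"
}

ACTIVE_PRIMES = ["prime-1.internal", "prime-2.internal", "prime-3.internal"]

# Prefer the Prime that already holds the media locally; otherwise
# fall back to any active Prime in the pool.
def route_for(media_id)
  local = MEDIA_LOCATIONS[media_id]
  return local if local && ACTIVE_PRIMES.include?(local)
  ACTIVE_PRIMES.sample
end
```

The point is that the proxy layer is “smart” only because it can consult the same database the Primes use, so locality decisions are just a lookup away.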
Above is a beautiful rendition of how this all comes together. BR is for Breadroute, P is for Prime, and you can probably guess which one the database is. I just realized I forgot to draw the connections to the database. Well, everything is connected to the database! Spoiler alert: don’t read what’s in that red box! Details on that are in the next section.
The migration strategy
The best migrations are the ones where, at each step of the way, you can easily move both forward with the plan and backward. Through years of doing this, I’ve developed a healthy fear of migrations with a cliff: ones where there’s that one step that, once you do it, you have to go all the way — there’s no going back.
Sometimes the cliff scenario can’t be avoided, and it’s often the most efficient path. But it sure as hell is scary, and it’s something I go out of my way to avoid.
Luckily for us, this migration was a shining example of avoiding the cliff.
The key to seamlessly migrating the Bakery lay in the Breadroute. Instead of having the Breadroute boxes only routing to Prime boxes on Slicehost, we could make them route to Primes in the new datacenter as well.
Once we realized this, the rest fell into place. Here’s what we did.
Phase I: The Setup
1. Command Center in the Rocketship
Ben and I set up a command center in the downstairs conference room (dubbed the Rocketship, see photo). We made a pact not to leave the room until the migration was complete. Blast off.
2. Clone of Slicehost
Set up a rough clone at Rackspace of what we have in Slicehost. We need a bunch of Prime boxes, a few Breadroutes, and a database.
These steps were very straightforward thanks to some help from the guys at Rackspace. We were able to move an image of one of our Prime boxes from Slicehost to Rackspace. Once it was over there, we cloned it a bunch of times.
The Breadroute boxes were provisioned from scratch. We have an internal tool (called Doomcrank) that’s kind of like Puppet or Chef, and we used that to build these boxes.
And the database isn’t much more than an “apt-get install mysql-server”.
3. Master-Master MySQL replication
Enable master-master replication between the databases in both datacenters. By master-master, I mean that we can read from and write to either database, and changes will be replicated to the other.
This was my first experience with MySQL replication, and I was surprised how easy it was to set up.
Here’s my writeup of how to do master-master MySQL replication.
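In broad strokes, master-master replication just means each server is configured as both a master and a slave of the other. A minimal sketch of the moving parts — all hostnames, credentials, server IDs, and log coordinates below are placeholders, not our actual config:

```sql
-- /etc/mysql/my.cnf on server A needs roughly:
--   server-id = 1
--   log-bin   = mysql-bin
--   auto-increment-increment = 2   (avoid key collisions between masters)
--   auto-increment-offset    = 1
-- Server B mirrors this with server-id = 2 and offset = 2.

-- Then, on server A, point it at server B (and vice versa on B):
CHANGE MASTER TO
  MASTER_HOST     = 'db-b.internal',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;
```

The staggered auto-increment settings are the one non-obvious piece: they keep the two masters from handing out the same primary key when both are accepting writes.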
Phase II: The Transition
This is where we started to actually shift traffic from Slicehost to Rackspace.
1. Slowly allow Slicehost Breadroutes to also route to Rackspace Primes.
Because the Breadroutes are database-backed, we have the ability to easily control where they route their traffic. Normally they’re proxying to Primes on the local private network, but they can proxy over the public internet just the same!
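Because the pool of routable Primes is just rows in a database, shifting traffic is a data change rather than a deploy. A toy Ruby model of the idea — the struct, hostnames, and `active` flag are all hypothetical stand-ins for the real schema:

```ruby
# Toy model of a database-backed Prime pool. Flipping the `active` flag
# (a data change, not a code deploy) adds or removes a box from rotation.
Prime = Struct.new(:host, :datacenter, :active)

POOL = [
  Prime.new("prime-stl-1", :slicehost, true),
  Prime.new("prime-stl-2", :slicehost, true),
  Prime.new("prime-rax-1", :rackspace, false),  # new DC, not yet in rotation
]

def routable_primes
  POOL.select(&:active).map(&:host)
end

# Step 1: slowly bring Rackspace Primes into the loop...
POOL.find { |p| p.host == "prime-rax-1" }.active = true
# Step 2: ...then drain Slicehost Primes out, one at a time.
POOL.find { |p| p.host == "prime-stl-1" }.active = false
```

Each flip is independently reversible, which is exactly what made the whole migration cliff-free.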
2. Slowly take all Slicehost Primes out of the loop.
We verified that traffic was being served via Rackspace Primes and that things were looking good. Then we started taking Slicehost Primes out of the pool.
3. Move DNS for the service to point at the Rackspace Breadroutes.
Once all traffic was being handled by Primes in Rackspace (and all Slicehost Primes were out of the loop), we shifted prime.wistia.com to point at the Rackspace Breadroutes so they would handle all incoming traffic.
Before we did this though, I edited my /etc/hosts file to map prime.wistia.com to the new Breadroutes to smoke test the whole thing.
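For anyone who hasn’t used this trick: an /etc/hosts override like the one below lets you hit the new boxes via the real hostname before touching DNS. The IP here is a placeholder, not one of our actual Breadroutes:

```
# /etc/hosts — temporary override for smoke testing (placeholder IP).
# Remove this line once DNS has been switched over.
203.0.113.10  prime.wistia.com
```

Your browser and tools resolve prime.wistia.com to the new Breadroutes while the rest of the world still hits the old ones, so you can exercise the full stack with zero risk to real traffic.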
4. Triple check everything
After everything was moved over to Rackspace, we kept a really close eye on it for an hour or so. The whole migration went so eerily well that we assumed we must have done something wrong and just hadn’t caught it.
We finally convinced ourselves that everything was right, and went out for beers at the Burren right around 6pm. We were both pretty sure this whole thing was going to take us well past midnight, so finishing early was a welcome surprise!
The nice thing about this migration was that the steps were fluid. We could easily revert any change if the slightest thing went wrong. This allowed the whole process to operate at a methodical and comfortable pace, and in my experience, that’s always very welcome when doing something this important.