How And Why Swiftype Moved From EC2 To Real Hardware
MONDAY, MARCH 16, 2015 AT 8:56AM
This is a guest post by Oleksiy Kovyrin, Head of Technical
Operations at Swiftype. Swiftype currently powers search on over
100,000 websites and serves more than 1 billion queries every
month.
When Matt and Quin founded Swiftype in 2012, they chose to build the company’s infrastructure using
Amazon Web Services. The cloud seemed like the best fit because it was easy to add new servers without
managing hardware and there were no upfront costs.
Unfortunately, while some of the services (like Route53 and S3) ended up being really useful and incredibly
stable for us, the decision to use EC2 created several major problems that plagued the team during our first
year.
Swiftype’s customers demand exceptional performance and always-on availability, and our ability to provide that depends heavily on how stable and reliable our basic infrastructure is. With Amazon we experienced networking issues, hanging VM instances, unpredictable performance degradation (probably due to noisy neighbors sharing our hardware, but there was no way to know), and numerous other problems. No matter what problems we experienced, Amazon always had the same solution: pay Amazon more money by purchasing redundant or higher-end services.
The more time we spent working around the problems with EC2, the less time we could spend developing new features for our customers. We knew it was possible to make our infrastructure work in the cloud, but the effort, time, and resources it would take to do so were far greater than what migrating away would cost.
After a year of fighting the cloud, we made a decision to leave EC2 for real hardware. Fortunately, this no longer means buying your own servers and racking them up in a colo: managed hosting providers offer a good balance of physical hardware, virtualized instances, and rapid provisioning. Given our previous experience with hosting providers, we chose SoftLayer. Their excellent service and infrastructure quality, provisioning speed, and customer support made them the best choice for us.
After more than a month of hard work preparing the inter-data center migration, we were able to execute the
transition with zero downtime and no negative impact on our customers. The migration to real hardware
resulted in enormous improvements in service stability from day one, provided a huge (~2x)
performance boost to all key infrastructure components, and reduced our monthly hosting bill by
~50%.
This article will explain how we planned for and implemented the migration process, detail the performance
improvements we saw after the transition, and offer insight for younger companies about when it might make
sense to do the same.
Preparing For The Switch
Before the migration, we had around 40 instances on Amazon EC2. We would experience a serious
production issue (instance outage, networking issue, etc) at least 2-3 times a week, sometimes daily. Once
we decided to move to real hardware, we knew we had our work cut out for us because we needed to switch
data centers without bringing down the service. The preparation process involved two major steps, each of which is explained in its own section below:
1. Connecting EC2 and SoftLayer. First, we built a skeleton of our new infrastructure (the smallest
subset of servers to be able to run all key production services with development-level load) in
SoftLayer’s data center. Once the new data center was set up, we built a system of VPN tunnels
between our old and our new data centers to ensure transparent network connectivity between
components in both data centers.
2. Architectural changes to our applications. Next, we needed to make changes to our applications
to make them work both in the cloud and on our new infrastructure. Once the application could live in
both data centers simultaneously, we built a data-replication pipeline to make sure both the cloud
infrastructure and the SoftLayer deployment (databases, search indexes, etc) were always in sync.
Step 1: Connecting EC2 And SoftLayer
One of the first things we had to do to prepare for our migration was figure out how to connect our EC2 and
our SoftLayer networks together. Unfortunately the “proper” way of connecting a set of EC2 servers to
another private network – using the Virtual Private Cloud (VPC) feature of EC2 – was not an option for us
since we could not convert our existing set of instances into a VPC without downtime. After some
consideration and careful planning, we realized that the only servers that really needed to be able to connect
to each other across the data center boundary were our MongoDB nodes. Everything else we could make
data center-local (Redis clusters, search servers, application clusters, etc).
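The post doesn't show how those MongoDB nodes were tied together across the data center boundary, but the standard approach is to extend the existing replica set with members in the new data center (reachable over the VPN described below). Here is a minimal pymongo sketch of that idea; the hostnames are hypothetical, and replSetGetConfig assumes MongoDB 3.0+ (older versions would read local.system.replset instead):

```python
from pymongo import MongoClient

# Connect to the current primary in EC2 (hostname is hypothetical).
client = MongoClient("mongodb://ec2-mongo1.example.com:27017")

# Fetch the current replica set config (replSetGetConfig needs MongoDB 3.0+).
config = client.admin.command("replSetGetConfig")["config"]

# Add a SoftLayer node as a non-voting, zero-priority member so it can
# sync data without becoming eligible for election during the migration.
config["members"].append({
    "_id": max(m["_id"] for m in config["members"]) + 1,
    "host": "sl-mongo1.example.com:27017",  # hypothetical SoftLayer node
    "priority": 0,
    "votes": 0,
})
config["version"] += 1

# Apply the new configuration on the primary.
client.admin.command("replSetReconfig", config)
```

Once the new members catch up, their priority and votes can be raised so the replica set can eventually fail over entirely to the new data center.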
Since the number of instances we needed to interconnect was relatively small, we implemented a very
simple solution that proved to be stable and effective for our needs:

Each data center had a dedicated OpenVPN server deployed in it that NAT’ed all client traffic to its private
network address.

Each node that needed to be able to connect to the other data center would set up a VPN connection to that data center’s VPN server and configure local routing to forward all connections directed at the other DC into that tunnel.
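As an illustration of that per-node routing, here is a minimal sketch of what the automation on each client could look like in Python; the subnet and interface name are hypothetical, and a production version would be generated by configuration management:

```python
import subprocess

# Private subnets that live in the other data center (hypothetical values).
REMOTE_SUBNETS = ["10.20.0.0/16"]
TUNNEL_INTERFACE = "tun0"

def install_routes():
    """Point every remote-DC subnet at the local VPN tunnel."""
    for subnet in REMOTE_SUBNETS:
        # 'ip route replace' is idempotent: it adds the route if missing
        # or updates it if it already exists.
        subprocess.run(
            ["ip", "route", "replace", subnet, "dev", TUNNEL_INTERFACE],
            check=True,
        )

if __name__ == "__main__":
    install_routes()
```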
Here are some features that made this configuration very convenient for us:

Since we did not control network infrastructure on either side, we could not really force all servers on either
end to funnel their traffic through a central router connected to the other DC. In our solution, each VPN
server decided (with the help of some automation) which traffic to route through the tunnel to ensure
complete inter-DC connectivity for all of its clients.

Even if a VPN tunnel collapsed (surprisingly, this only happened a few times during the weeks of the project), it would only mean that one server lost its outgoing connectivity to the other DC (one node dropped out of the MongoDB cluster, a worker server lost connectivity to the central Resque box, etc). None of those one-off connectivity losses would affect our infrastructure, since all important infrastructure components had redundant servers on both sides.
Step 2: Architectural Changes To Our Applications
There were many small changes we had to make in our infrastructure in the weeks of preparation for the migration, but having a deep understanding of each and every component of it helped us make appropriate decisions and reduce the chance of a disaster during the transitional period. I would argue that an infrastructure of almost any complexity can be migrated given enough time and engineering resources to carefully consider each and every network connection established between applications and backend services.
Here are the main steps we had to take to ensure smooth and transparent migration:

All stateless services (caches, application clusters, web layer) were independently deployed on each side.

For each stateful backend service (database, search cluster, async queues, etc) we had to consider whether we wanted (or could afford) to replicate the data to the other side, or whether we had to incur inter-data center latency for all connections. Relying on the VPN was always considered the last-resort option, and eventually we were able to reduce the traffic between data centers to a few small replication streams (mostly MongoDB) and connections to the primary/main copies of services that could not be replicated.

If a service could be replicated, we would do that and then make application servers always use, or at least prefer, the local copy of the service instead of going to the other side (see the first sketch after this list).

For services that we could not replicate using their internal replication capabilities (like our search backends), we changed our applications to implement replication between data centers themselves: we would always write all asynchronous jobs into queues for both data centers, and asynchronous workers on each side would pull the data from their respective local queues (see the second sketch after this list).
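To make the "prefer the local copy" rule concrete, here is a hedged sketch (the environment variable, endpoints, and data center names are hypothetical, not Swiftype's actual code): pick a service endpoint in the local data center first, and cross the VPN only if nothing local is available.

```python
import os

# Which data center this server lives in, e.g. set at provisioning time
# (the environment variable name is hypothetical).
LOCAL_DC = os.environ.get("DATACENTER", "softlayer")

# Known replicas of a backend service, tagged by data center (hypothetical).
SERVICE_ENDPOINTS = {
    "softlayer": ["10.20.1.10:6379", "10.20.1.11:6379"],
    "ec2": ["10.10.1.10:6379"],
}

def pick_endpoint(endpoints_by_dc=SERVICE_ENDPOINTS, local_dc=LOCAL_DC):
    """Prefer a replica in the local DC; cross the VPN only as a last resort."""
    local = endpoints_by_dc.get(local_dc, [])
    if local:
        return local[0]
    remote = [ep for dc, eps in endpoints_by_dc.items()
              if dc != local_dc for ep in eps]
    if remote:
        return remote[0]
    raise RuntimeError("no endpoints available in any data center")
```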
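And here is the queue-based replication pattern for the services that could not replicate themselves. Swiftype's workers ran on Resque (which is Ruby), but the dual-enqueue idea is easy to sketch in Python against two Redis-backed queues using a Resque-style list layout; the hostnames and queue names are hypothetical:

```python
import json
import redis

# One Redis connection per data center (hostnames are hypothetical).
QUEUE_REDIS = {
    "ec2": redis.Redis(host="queue.ec2.example.com"),
    "softlayer": redis.Redis(host="queue.sl.example.com"),
}

def enqueue_everywhere(job_class, *args, queue="index"):
    """Write the same job to the queue in every data center, so local
    workers in each DC apply the change to their local search index."""
    payload = json.dumps({"class": job_class, "args": list(args)})
    for dc, conn in QUEUE_REDIS.items():
        # Resque-style layout: jobs are JSON blobs in a Redis list.
        conn.rpush(f"resque:queue:{queue}", payload)

# Example: replicate one document update into both data centers' indexes.
enqueue_everywhere("IndexDocument", "engine_123", "doc_456")
```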
Step 3: Flipping The Switch
When both sides were ready to serve 100% of our traffic, we prepared for the final switchover by reducing our DNS TTL to a few seconds, ensuring traffic would move to the new data center quickly.
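The post doesn't show how the TTL was lowered, but with Route53 (which Swiftype was already using) it is a single UPSERT call per record. A sketch with boto3; the zone ID, record name, and address are hypothetical:

```python
import boto3

route53 = boto3.client("route53")

# Lower the TTL ahead of the switchover so clients re-resolve quickly
# (zone ID, record name, and address are hypothetical).
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",
    ChangeBatch={
        "Comment": "Lower TTL before data center switchover",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com.",
                "Type": "A",
                "TTL": 10,  # a few seconds, per the migration plan
                "ResourceRecords": [{"Value": "192.0.2.10"}],
            },
        }],
    },
)
```

The same call with the new data center's address (and later a restored, longer TTL) performs the actual cutover.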
Finally, we switched traffic to the new data center. Requests switched to the new infrastructure with zero
impact on our customers. Once traffic to EC2 had drained, we disabled the old data center and forwarded all
remaining connections from the old infrastructure to the new one. DNS updates take time, so some residual
traffic was visible on our old servers for at least a week after the cut-off time.
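The post doesn't say how those remaining connections were forwarded; one simple way is a small TCP proxy on the old servers that relays anything that still arrives over to the new data center. A minimal asyncio sketch, with hypothetical host and port:

```python
import asyncio

# Where the new infrastructure lives (hypothetical host/port).
NEW_DC_HOST, NEW_DC_PORT = "api.newdc.example.com", 80
LISTEN_PORT = 80

async def pipe(reader, writer):
    """Copy bytes one way until the sender closes its side."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    # For each straggler connection, open a matching connection to the
    # new data center and relay traffic in both directions.
    remote_reader, remote_writer = await asyncio.open_connection(
        NEW_DC_HOST, NEW_DC_PORT)
    await asyncio.gather(
        pipe(client_reader, remote_writer),
        pipe(remote_reader, client_writer),
        return_exceptions=True,
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```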
A Clear Improvement: Results After Moving From EC2 To Real Hardware
Stability improved. We went from 2-3 serious outages a week (most of these were not customer-visible, since we did our best to make the system resilient to failures, but many outages would wake someone up or force someone to abandon family time) down to 1-2 outages a month. We were able to handle these more thoroughly, spending engineering resources on increasing the system's resilience to failures and reducing the chance of them having any impact on our customer-visible availability.
Performance improved. Thanks to the modern hardware available from SoftLayer, we have seen a consistent performance increase for all of our backend services (especially IO-bound ones like databases and search clusters, but for CPU-bound app servers as well) and, more importantly, performance became much more predictable: no sudden dips or spikes unrelated to our own software’s activity. This allowed us to start working on real capacity planning instead of throwing more slow instances at every performance problem.
Costs decreased. Last, but certainly not least for a young startup, the monthly cost of our infrastructure
dropped by at least 50%, which allowed us to over-provision some of the services to improve performance
and stability even further, greatly benefiting our customers.
Provisioning flexibility improved, but provisioning time increased. We are now able to specify servers exactly to match their workload (lots of disk doesn’t have to come with a powerful CPU). However, we can no longer start new servers in minutes with an API call: SoftLayer generally can add a new server to our fleet within 1-2 hours. This is a big trade-off for some companies, but one that works well for Swiftype.
Conclusion
Since switching to real hardware, we’ve grown considerably – our data and query volume is up 20x – but our API performance is better than ever. Knowing exactly how our servers will perform lets us plan for growth in a way we couldn’t before. In our experience, the cloud may be a good idea when you need to rapidly spin up new hardware, but it only works well when you make a huge (Netflix-level) effort to survive in it. If your goal is to build a business from day one and you do not have spare engineering resources to spend on paying the “cloud tax”, using real hardware may be a much better idea.