RAFC – Redundant Array of Flaky Connections
We are big believers in cloud computing: 100% of our own infrastructure is in the cloud. In our first office, we learned that reliable and fast internet connectivity is absolutely crucial. When all your infrastructure is in the cloud, all work grinds to a screeching halt whenever connectivity is lost. In that office, we had a single “business class” symmetric 10MBit link. In short, it sucked.
When we moved to our new 605 Castro Street office last year, we decided to try a different approach. We took design cues from web-scale applications: Pool commodity resources. Distribute load over the pool of resources. Anticipate failures. Scale horizontally. In concrete terms:
- Set up multiple consumer grade internet connections.
- Buy a router that supports multiple-WAN load balancing and failover.
- Add more consumer grade internet connections when more bandwidth is needed.
So instead of a single symmetric 10MBit link, we ordered:
- 100MBit/10MBit cable modem connection (Comcast).
- 25MBit/5MBit bonded DSL connection (Sonic.net).
We call it “RAFC” – or “Redundant Array of Flaky Connections”. Combined, these two connections cost around $530/mo (with free Cable TV!), or about 65% less than our previous connection, which ran $1,500/mo. Instead of 10/10MBit, we now have 125/15MBit.
The trickiest part was finding a multi-WAN router we liked. After trying a Fortinet FortiGate box, a Cisco ASA, and a Netgear “business class” box, we settled on a unit made by a company called Peplink.
Peplink’s entire business is built around doing multi-WAN routers right, and it shows: the box is impressive. It’s very easy to set up, supports a rich set of features, doesn’t crash, and has great monitoring capabilities (including syslog, which we feed into Sumo Logic). The Peplink also has a third WAN port for future growth, which gives us horizontal scalability. When we discovered that one of our connections was more reliable while the other was faster, we adjusted the outbound rules on the Peplink accordingly: SSH connections use the reliable connection, and S3 transfers use the fast one. Most other traffic is load balanced in proportion to the uplink/downlink bandwidth available on each connection. When one connection fails, all traffic fails over to the other one. This took about 5 minutes to set up.
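For readers who think in code, here is a rough Python sketch of the outbound policy described above. It is not Peplink's implementation or configuration format, just an illustration of the logic. The names `Wan`, `pick_wan`, and `preferred` are made up for this example, and so is the assumption that DSL is the "reliable" link and cable the "fast" one. Weighting by downlink bandwidth alone is also a simplification.

```python
import random
from dataclasses import dataclass


@dataclass
class Wan:
    """One WAN connection (hypothetical model, not a Peplink object)."""
    name: str
    down_mbit: int
    up_mbit: int
    healthy: bool = True


def pick_wan(links, traffic, preferred, rng=random):
    """Pick a WAN link for a new outbound connection."""
    up = [link for link in links if link.healthy]
    if not up:
        raise RuntimeError("all WAN links are down")

    # Pinned traffic uses its preferred link while that link is healthy.
    wanted = preferred.get(traffic)
    for link in up:
        if link.name == wanted:
            return link

    # Everything else, including pinned traffic whose link is down,
    # is spread over the healthy links in proportion to downlink bandwidth.
    weights = [link.down_mbit for link in up]
    return rng.choices(up, weights=weights, k=1)[0]


# Bandwidth figures are the ones from this post; the pinning is an assumption.
links = [Wan("cable", 100, 10), Wan("dsl", 25, 5)]
preferred = {"ssh": "dsl", "s3": "cable"}

print(pick_wan(links, "ssh", preferred).name)  # pinned to the reliable link
links[1].healthy = False                       # simulate a DSL outage
print(pick_wan(links, "ssh", preferred).name)  # fails over to the other link
```

With both links healthy, unpinned traffic lands on cable roughly four times as often as on DSL (100 versus 25), which matches the "in proportion" behavior in spirit. When a link is marked unhealthy, it simply drops out of the pool and everything flows over whatever is left.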
At this point, this setup supports more than 30 users in our office, and while we have connection outages almost daily, nobody notices. Connectivity has not been an issue in months.