Paxos has been a leader in digital assets from the beginning. We went out and got a New York Department of Financial Services Trust charter before it was the cool thing to do. We’ve always been focused on doing things the right way, by asking permission first instead of begging for forgiveness later. This has differentiated our company and products, and that is partly why PayPal chose to partner with us to launch its crypto services.
Paxos operates the itBit by Paxos crypto exchange and we use this venue to fulfill buy and sell orders from PayPal customers. Knowing PayPal has a huge customer base, we had to prepare our platform for a surge in volume ahead of a mainstream launch so that everything would go smoothly. Our approach to scaling itBit was similar to everything we do at Paxos – we were rigorous, asked questions and made sure we did all the work in advance. While we won’t tell you all the juicy details, we do want to share some insights about our process that could apply to any tech company. Take these lessons as a general framework for scaling any platform.
Visibility, Observability & Reliability Are Key
Knowing exactly what is happening on your platform at any given moment is crucial to protecting the end customer experience. Crypto never sleeps, so Paxos makes this a top priority, and instrumentation and monitoring were a critical part of our architecture and implementation.
We start by estimating the capacity we expect to need and creating a plan to support it. Our goal is to stay comfortably within our uptime and latency SLAs even during periods of high volatility and bursts of market activity. Once we’ve adjusted our systems and processes to add capacity, we begin load testing and push our infrastructure to its limits to identify any bottlenecks that would hurt uptime or latency. We eliminate those bottlenecks and rely on our monitoring and alerting to provide a clear line of sight to the root cause should any hiccups occur along the way. Importantly, if there is ever an issue, we’ve built repeatable runbooks so that any engineer, not just one person, can tackle outages and deploy fixes as fast as possible. We also test and iterate on these runbooks in other environments, and we rarely need to use them in production. The dissemination of knowledge and the constant sharing of processes is key.
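One way to picture the "find the bottleneck" step is a check that walks load-test results from the lowest load level upward and reports the first level at which tail latency breaks the SLA. The sketch below is illustrative only: the function names, the 100 ms p99 limit, and the sample latencies are all assumptions made for the example, not Paxos figures or tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def find_bottleneck(results, p99_limit_ms):
    """Return the lowest load level whose p99 latency exceeds the limit.

    `results` maps a load level (requests per second) to the latencies
    in milliseconds observed at that level. Returns None if every level
    stays within the limit.
    """
    for rps in sorted(results):
        if percentile(results[rps], 99) > p99_limit_ms:
            return rps
    return None


# Hypothetical load-test results, invented for illustration.
results = {
    100: [12, 14, 15, 13, 18],
    500: [15, 19, 22, 25, 40],
    1000: [30, 45, 80, 120, 260],
}

first_failing_level = find_bottleneck(results, p99_limit_ms=100)
print(first_failing_level)
```

A real setup would read these numbers from a monitoring system rather than a hard-coded dictionary, and would use far more samples per level than shown here.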
Build In Redundancies
“Prepare for the worst and hope for the best” is a well-worn saying that’s especially true when scaling a tech platform. Expect that anything that can go wrong will go wrong, then build a backup solution so your systems keep running without the end customer noticing. Redundancy is your best friend, and we leverage the global scale of Amazon Web Services.
Different components of our system have different characteristics: some are software modules, some are data stores, and so on. Each component and subsystem must be considered, architected, and engineered for scale, failover, and redundancy, and we also have to make sure that service isn’t disrupted at the system level.
- Does your service fail over reliably? Test failover regularly and confirm that it works every time.
- Does your replicated service support more volume? Carefully plan resource consumption and scale replicas horizontally while keeping overall resource utilization at acceptable levels.
As we prepared to launch, we performed many failover tests on exchange components and scaled those components up. Some infrastructure had to be upgraded to support failover, which we discovered only through rigorous testing. That showed how crucial those tests were.
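The two questions above can be turned into back-of-the-envelope arithmetic. The sketch below sizes a horizontally scaled service so that each replica stays under a target utilization, adds a spare replica for failover, and then checks utilization with one replica lost. All names and numbers (5,000 requests per second at peak, 400 per replica, a 60% target) are assumptions chosen for the example, not figures from our platform.

```python
import math


def replicas_needed(peak_rps, rps_per_replica, target_utilization=0.6, spare=1):
    """Replicas required to serve peak load at the target utilization,
    plus `spare` extra replicas held for failover."""
    if peak_rps < 0 or rps_per_replica <= 0:
        raise ValueError("load must be >= 0 and per-replica capacity > 0")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = rps_per_replica * target_utilization
    serving = max(1, math.ceil(peak_rps / usable_rps))
    return serving + spare


def utilization_after_failure(peak_rps, rps_per_replica, replicas, failed=1):
    """Per-replica utilization at peak once `failed` replicas are gone."""
    remaining = replicas - failed
    if remaining <= 0:
        return float("inf")
    return peak_rps / (remaining * rps_per_replica)


# Hypothetical inputs, invented for illustration.
count = replicas_needed(peak_rps=5000, rps_per_replica=400)
degraded = utilization_after_failure(5000, 400, count, failed=1)
print(count, round(degraded, 3))
```

With these inputs the plan keeps utilization under the 60% target even after losing one replica, which is the property a failover test should then confirm on real infrastructure.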
Test, Test, Test and Test Again
If your team initially estimated X months of work to complete the initiative, assume at least 50% of that time will be spent testing the platform. This is not an overstatement. We tested multiple services along the way, but once we completed the work, we tested all components together. And we did this countless times until we had near certainty in our improvements. When the Bitcoin market can easily see $10B a day in trade volume, it’s crucial to test for every potential volume and spike in demand.
Specific teams were responsible for specific endpoints and we rotated the responsibility of testing so that different team members could check and double-check work. Multiple teams were testing new functionality in a sandbox environment on a daily basis. Over the course of the project, a squad worked on service capacity, optimization and monitoring so we could continuously optimize the system to meet anticipated load. Load testing was perhaps the most important because any findings would result in immediate optimizations to the system. Driven by capacity assessments, we’d throw multiples of traffic at the system and monitor outcomes. Our testing included a blend of end prediction and production API calls to ensure we calculated the load accurately. We were relentless in our testing and that resulted in positive outcomes for our partners.
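To make "throwing multiples of traffic at the system" concrete, here is a minimal ramp sketch. It sends batches sized at increasing multiples of a baseline to a stand-in function and records latency at each step. Everything in it is hypothetical: `fake_place_order` simulates a call with a short sleep and is not a real endpoint, and the batch sizes and worker count are arbitrary. It also fires batches through a thread pool rather than pacing requests at a fixed rate, which a production-grade load tool would do.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_place_order(_request_id):
    """Stand-in for a real API call (hypothetical); returns its latency."""
    start = time.perf_counter()
    time.sleep(0.001)  # simulated service time
    return time.perf_counter() - start


def run_step(call, requests, workers):
    """Send `requests` calls through a thread pool and summarize latency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(call, range(requests)))
    return {
        "requests": requests,
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
    }


def ramp(call, baseline_requests, multiples=(1, 2, 5), workers=8):
    """Run one step per multiple of the baseline and collect the results."""
    report = []
    for multiple in multiples:
        step = run_step(call, baseline_requests * multiple, workers)
        step["multiple"] = multiple
        report.append(step)
    return report


for step in ramp(fake_place_order, baseline_requests=20):
    print(step["multiple"], step["requests"], round(step["max_latency_s"], 4))
```

Comparing the latency figures across multiples shows where the curve starts to bend, which is the signal that points at the next optimization.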
Scaling the itBit by Paxos exchange is just one example of how we’ve had to grow our platform. Based on our current product roadmap, we’ll be using this framework a lot in 2021! If this sounds exciting to you, check out the roles we’re hiring to fill ASAP. We’re looking for the best and most versatile engineers to build the market infrastructure that will power the future of finance.
Shai Borochov is an Engineering Manager at Paxos. Learn more about Shai’s work and his background.