Transformation: Replacing the engine in a running car
A few years ago, engineering at ConnectWise started on a journey to reduce time to market for product delivery so that we could reduce partner churn and increase our footprint in the managed service provider (MSP) market.
A handful of engineers began exploring the existing system against this requirement, keeping in mind that the solution should be horizontally scalable and able to handle at least 10 times the current load (one million managed endpoints). Within a week, the team did a fantastic job of surfacing the roadblocks that kept them from meeting these requirements:
- Manual tests and deployments
- Monolith architecture
- Self-managed servers
Manual tests and deployments
All the manual work our teams did to validate and deploy an artifact to production was a major blocker to getting from a single deployment a month to 10 deployments a day. So we decided to automate all possible manual work, accelerating the feedback cycle during development and eventually leading to faster deployments to production.
Monolith architecture
Scaling services within the monolith architecture (built on the Microsoft tech stack), along with its session management, made it difficult to scale at the required speed. It became even harder because multiple teams were working on a single codebase: every production deployment needed to go through mandatory approval cycles and manual validation by each team.
Self-managed servers
Self-managed servers can work like a charm in many organizations, but for us they became a blocker because we needed to scale at a much faster pace. On top of that, procuring new servers became a challenge, and scaling costs grew exponentially due to the legacy monolithic, PL/SQL-centric design.
The same group of engineers then worked on a solution to these roadblocks and soon devised a plan to address them. The plan included a couple of major changes and kicked off a transformational journey for the organization. The team came up with multiple ways to solve the scaling problem: one was to rewrite the existing product in parallel and swap it in one fine day (a big-bang release); the other was to replace critical components one by one in the existing system. We chose the second approach because it let us validate the solution quickly in production and prove it could reach the required scale. On the other hand, it brought a challenge akin to replacing the engine in a running car whose speed must not drop below 100 miles per hour.
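Replacing critical components one by one while the legacy system keeps serving traffic is commonly called the strangler-fig pattern: a routing layer sends a growing share of requests to the new services while everything else still hits the monolith. A minimal Go sketch of the routing decision (the path prefixes and backend names are hypothetical, not ConnectWise's actual routes):

```go
package main

import (
	"fmt"
	"strings"
)

// migrated lists URL path prefixes already served by new microservices;
// everything else still goes to the legacy monolith. (Illustrative paths.)
var migrated = []string{"/api/monitoring", "/api/alerts"}

// routeTarget decides which backend should handle a request path.
func routeTarget(path string) string {
	for _, prefix := range migrated {
		if strings.HasPrefix(path, prefix) {
			return "microservice"
		}
	}
	return "monolith"
}

func main() {
	fmt.Println(routeTarget("/api/monitoring/agents")) // → microservice
	fmt.Println(routeTarget("/api/billing/invoices"))  // → monolith
}
```

In a real deployment this decision would sit in a reverse proxy or API gateway; growing the `migrated` list is how the "engine swap" proceeds without stopping the car.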
Monolith vs. microservices
We finally concluded that it was time to move away from the monolith architecture (built on the Microsoft tech stack) to a stateless microservice architecture built on horizontally scalable technologies like Golang, Kafka, and Cassandra.
Both monolith and microservice architectures have their own advantages and trade-offs.
The first step toward breaking up the monolith was to separate code and data along feature boundaries and convert each piece into a microservice, so we could start replacing the engine of this running car without letting its speed drop below 100 miles per hour.
As scaling was our basic requirement, we moved away from the monolith architecture and introduced a stateless microservice architecture, along with horizontally scalable technologies like Kafka and Cassandra, so that we could scale applications horizontally based on demand. It also reduced dependencies between teams (backward compatibility by default) and enabled faster delivery to production, as teams could deploy independently.
Our engineers started carving out independent sets of functions that could interact with other components but could be replaced at any time in the future. These modules drastically reduced the time to market for any new product: all the foundational code was already available, and we just needed to build business logic on top of it.
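The key to making a component replaceable is to have its neighbors depend on a narrow contract rather than a concrete implementation. In Go that contract is an interface; a small sketch of the idea (the `Notifier` interface and its implementations are illustrative, not ConnectWise's actual code):

```go
package main

import "fmt"

// Notifier is a narrow contract between components; any implementation
// (e-mail, webhook, message queue) can be swapped in without touching callers.
type Notifier interface {
	Notify(msg string) error
}

// notifierFunc adapts a plain function to the Notifier interface,
// in the same spirit as http.HandlerFunc.
type notifierFunc func(string) error

func (f notifierFunc) Notify(m string) error { return f(m) }

// consoleNotifier is one concrete, replaceable implementation.
type consoleNotifier struct{}

func (consoleNotifier) Notify(msg string) error {
	fmt.Println("notify:", msg)
	return nil
}

// alertOnThreshold holds business logic; it depends only on the interface,
// so the delivery mechanism can change without this code changing.
func alertOnThreshold(n Notifier, value, limit float64) error {
	if value > limit {
		return n.Notify(fmt.Sprintf("value %.1f exceeded limit %.1f", value, limit))
	}
	return nil
}

func main() {
	_ = alertOnThreshold(consoleNotifier{}, 98.5, 90)
}
```

Swapping `consoleNotifier` for, say, a Kafka-backed notifier is then a local change behind the interface.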
We chose a stateless architecture for our services because it makes horizontal scaling easier. Adding more servers to a fleet is now far more straightforward, and since we don't have to track state inside the service itself, we can quickly provision more servers to absorb demand peaks, which has also resulted in cost savings.
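What "stateless" means in practice is that any per-request state lives in an external store (in our stack that could be Cassandra), so any instance can serve any request. A minimal sketch of that boundary, using an in-memory stand-in for the external store (names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// SessionStore abstracts external state. Because no state lives in the
// service process, any server instance can handle any request, and the
// fleet can grow or shrink freely.
type SessionStore interface {
	Get(id string) (string, bool)
	Put(id, data string)
}

// memStore is an in-memory stand-in for local experimentation only;
// a production implementation would be backed by Cassandra, Redis, etc.
type memStore struct {
	mu sync.RWMutex
	m  map[string]string
}

func newMemStore() *memStore { return &memStore{m: map[string]string{}} }

func (s *memStore) Get(id string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[id]
	return v, ok
}

func (s *memStore) Put(id, data string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[id] = data
}

func main() {
	store := newMemStore()
	store.Put("sess-1", "partner=acme")
	if v, ok := store.Get("sess-1"); ok {
		fmt.Println(v) // → partner=acme
	}
}
```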
Event-driven and asynchronous services
Going from monolith to microservices is a major paradigm shift; both the software development process and the actual code base look significantly different after the transition. To wrap up, we will quickly cover service-to-service communication and designing for failure, both important concepts in microservices development.
There are two ways services can communicate with one another: synchronously over REST and asynchronously via messaging. At ConnectWise we use both, depending on the needs of the business.
As of today, we are running over 70 microservices in the production environment, each communicating via REST APIs.
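The asynchronous side of this split is what a broker like Kafka provides: the producer returns as soon as the message is handed off, decoupling the two services in time. A toy stand-in for that fire-and-forget behavior, using a buffered Go channel instead of a real broker (the topic and payload names are made up):

```go
package main

import "fmt"

// Event is a message exchanged between services (fields illustrative).
type Event struct {
	Topic   string
	Payload string
}

// Bus is a minimal stand-in for a broker like Kafka: producers publish
// without waiting for consumers to process anything.
type Bus struct{ ch chan Event }

func NewBus(buffer int) *Bus { return &Bus{ch: make(chan Event, buffer)} }

// Publish returns as soon as the event is buffered (fire-and-forget).
func (b *Bus) Publish(e Event) { b.ch <- e }

// Consume takes one event; a real consumer would loop in its own process,
// possibly long after the producer has moved on.
func (b *Bus) Consume() Event { return <-b.ch }

func main() {
	bus := NewBus(10)
	bus.Publish(Event{Topic: "endpoint.heartbeat", Payload: "agent-42"})
	fmt.Println(bus.Consume().Payload) // → agent-42
}
```

A synchronous REST call, by contrast, blocks the caller until the response arrives, which is why it fits request/response interactions where the answer is needed immediately.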
We decided to move away from manual work and made automation part of the organization's culture, automating everything from testing to deployment with a shift-left approach, guided by two principles:
- Automate anything we will do more than once
- Introduce tools and frameworks that ease day-to-day work and improve visibility and productivity
We opted for a continuous deployment pipeline that delivers any code change from the developer's machine through lower environments to production using a single artifact promoted through the lifecycle, so that customers can use a feature as soon as it is available. Our pipeline includes:
- Automating all the functional test cases
- Automating all the performance test cases
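Automated functional tests in this kind of pipeline are typically table-driven: each case pairs an input with an expected result, so a regression is caught on every commit rather than during a manual sign-off. A minimal Go sketch (the `validEndpointName` function under test is hypothetical):

```go
package main

import "fmt"

// validEndpointName is a stand-in for real business logic under test:
// here, an endpoint name must be non-empty and at most 64 characters.
func validEndpointName(name string) bool {
	return len(name) > 0 && len(name) <= 64
}

func main() {
	// Table-driven cases: each input paired with its expected outcome.
	cases := []struct {
		in   string
		want bool
	}{
		{"agent-01", true},
		{"", false},
	}
	failures := 0
	for _, c := range cases {
		if got := validEndpointName(c.in); got != c.want {
			failures++
			fmt.Printf("FAIL: validEndpointName(%q) = %v, want %v\n", c.in, got, c.want)
		}
	}
	fmt.Println("failures:", failures) // → failures: 0
}
```

In practice the same table shape moves into Go's `testing` package so the pipeline can run it with `go test` on every commit.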
Containerization is not a new concept, but it helped us expedite the transition to microservices with diverse tech stacks. At ConnectWise, we created a self-service runtime platform to deliver microservices, with the goal of drastically reducing each team's operational overhead for creating them.
In the modern era of software development, we use existing open-source software instead of reinventing the wheel to build applications. This can introduce known vulnerabilities into an application if they are present in the open-source components or frameworks, so at ConnectWise we perform both static and dynamic analysis in the pipeline itself to discover common security issues.
This automation helped us shorten feedback cycles and maintain a stable master branch for timed production releases without waiting for multiple teams to provide sign-off.
As of today, we run this automation on every commit to identify any breaking change as soon as it is introduced into the codebase.
We decided to host our services in the cloud instead of on self-managed servers so that we could:
- Ensure that all the services are cloud agnostic
- Avoid additional procurement-related issues
- Perform on-demand scaling through infrastructure automation and auto-scaling deployment jobs
Self-healing and auto-scaling
Nowadays, self-healing and auto-scaling are the bare minimum for a service that needs to scale with load. We have built a framework that performs on-demand scaling via auto-scaling deployment jobs, automating everything from monitoring application metrics to infrastructure creation and application deployment. It adds capacity to absorb the extra load during peak hours and removes it again as soon as the load drops, so that we can optimize cost.
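The core of such a framework is the scaling decision itself: given the load each instance is seeing and the load it should be seeing, compute how many instances are needed. A sketch of that proportional calculation, similar in spirit to the formula horizontal autoscalers commonly use (the numbers and limits are illustrative, not our production values):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas computes how many instances a service needs so that the
// per-instance load returns to the target, clamped to [min, max] so a
// metrics spike cannot scale the fleet without bound.
func desiredReplicas(current int, loadPerInstance, targetLoad float64, min, max int) int {
	d := int(math.Ceil(float64(current) * loadPerInstance / targetLoad))
	if d < min {
		d = min
	}
	if d > max {
		d = max
	}
	return d
}

func main() {
	// Peak hours: 4 instances each seeing 1500 req/s against a 1000 req/s target.
	fmt.Println(desiredReplicas(4, 1500, 1000, 2, 20)) // → 6
	// Off-peak: load drops, so the fleet shrinks back to the minimum.
	fmt.Println(desiredReplicas(6, 200, 1000, 2, 20)) // → 2
}
```

Tying this function to real metrics and deployment jobs is what turns it from arithmetic into auto-scaling: the framework evaluates it periodically and reconciles the fleet toward the result.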
As of today, we are running over 800 servers in the cloud to host our services and technology components.
As of today, we at ConnectWise are doing 10 deployments daily (continuous deployment), up from one deployment a month, and receiving around 10 million requests per minute (around one and a half million monitored endpoints) on our critical component. We expect this to grow 10 times within a couple of months, to around 100 million requests per minute (about 10 million monitored endpoints).