Seven hard-won lessons learned from migrating a monolith to microservices

Jon Edvald

November 25, 2020

This post originally appeared on InfoQ.

Even in 2023, there’s a good chance that you’re still working with at least one legacy system. If so, you’re probably thinking about whether to migrate to a microservices architecture.
In some cases, it’s better to skip the migration—for example, if most of your development is happening in new systems anyway and few devs ever have to interact with the legacy code.
Because of the overall complexity and difficulty with making sound estimates, it’s usually not a good idea to make a refactor an official “project” with a defined start and end date. But you’ll still need top-down buy-in for the effort, and you’ll need to schedule work so that you can make consistent progress.
If they don’t already exist, writing automated tests before refactoring is one of the best things you can do to ease the process and be sure your new system behaves the way it’s supposed to.
Stick to known patterns as you build your new system, and resist the urge to go straight to the bleeding edge. Pioneering a brand new approach to microservices is expensive, and the refactor will keep you busy enough.

Monolith to microservices: The good, the bad, and the ugly

Over the years, I’ve been both part of and have closely observed several efforts to migrate monolithic legacy systems to a microservice architecture. Through sometimes bitter experiences, I’ve learned of many pitfalls and challenges involved. One of these efforts ended up personally costing me considerable amounts of money, and arguably was a key factor in killing the company involved. I won’t get into specifics there, but I sure don’t want others to have such experiences.

So I’d like to attempt to distill some of the lessons I’ve learned, in the hope that your migration across the harsh lands of refactoring goes as well as possible. Even though a lot has been said about this topic already, there’s still some advice I wish someone had given me, some of which I haven’t seen covered before.

I’m a co-founder of Garden. We make it easier for developers to test, review, and troubleshoot cloud native microservices applications (our core product is open source, so by all means check it out if it sounds interesting), and so I spend a lot of time working with users who are going through this very migration process. Through these interactions, I’ve picked up more important lessons that I think are worth sharing.

Let’s be honest. Even in 2023, many, if not most, mature businesses still run a legacy system that is mission-critical but getting increasingly difficult to scale, operate, and maintain. These are often complex multi-functional systems that hold too much responsibility. They’re known as monoliths, a word now synonymous with applications that don’t follow modern development architectures, don’t meet scalability requirements, and/or don’t match organizational structures. Too many people working on one tightly coupled codebase can create serious bottlenecks and slow down development, and the coupling can cause reliability and scaling problems.

So it’s perfectly reasonable to be thinking about migrating to a microservices architecture. Here are some things I’d suggest that you consider as you contemplate and plan your next steps. Start with piece of advice number one.

#1 Maybe, don’t

Seriously. Really consider whether you actually need to split up your monolith, and if you do, make sure you do it for the right reasons. There are many good reasons for a refactor, but here are some signs that you could, or should, possibly avoid it:

Most of your feature development is already occurring in your newer systems. Perhaps you don’t have a development velocity problem, and the legacy system is just responsible for things that don’t need a lot of iteration.
The system still meets your scalability needs. Even if it isn’t modern, maybe it still handles the load you’re dealing with (and expect to soon). Monoliths can be quite efficient since they usually don’t rely on networks and high-level APIs for their internal communication.
You don’t have that many developers that need to work on it. This could be because a single team is responsible for it, or you don’t have a large organization to begin with, which might otherwise overload the development pipeline and create process bottlenecks.
It’s actually architected quite well. Monolithic systems tend to become balls of spaghetti that are intricate, have tight coupling between different parts, and are difficult to understand. But maybe you were able to avoid those common trappings and maintain a good structure this whole time.

Of course, if you nod your head to all four of these, the conclusion is obvious. Just don’t. Consider microservices for future projects but keep your focus on your actual business problems, and let your solid system do its thing.

However, if that were the case, you probably wouldn’t be reading this and considering a refactor in the first place. Reality is complex, and it’s more likely that between zero and three of these points match your situation. If it’s zero, you don’t have much of a choice. For anything in between, there’s a judgment call to be made, and you may have other good reasons to press on.

You need to weigh your challenges against the cost of the migration. Your estimated cost should include at least the following:

The cost of the developer hours. This is the most obvious factor but is often difficult to estimate.
The opportunity cost of not spending those hours on new development. I find this is often overlooked or underestimated. Your business is competing with others, and any time spent not moving forward should be considered carefully.
Cost of new tooling and processes. Particularly if you don’t have an existing, mature microservices infrastructure, you need to consider the cost of adopting the new methodologies, and the tooling and infrastructure investment required.

Consider all these factors, make an estimate, and then multiply that by about three to account for all the uncertainty involved.
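As a rough illustration of that arithmetic, here is a minimal sketch. The class name, field names, and every number in it are made up for the example; only the three cost categories and the "multiply by about three" rule of thumb come from the text above.

```python
from dataclasses import dataclass


@dataclass
class MigrationEstimate:
    """Hypothetical container for the three cost categories listed above."""
    developer_hours: float
    hourly_cost: float
    opportunity_cost: float   # value of the new development you will not ship
    tooling_cost: float       # new tooling, processes, and infrastructure
    uncertainty_multiplier: float = 3.0  # the "about three" rule of thumb

    def raw_total(self) -> float:
        """Sum of the three categories before any padding."""
        return (self.developer_hours * self.hourly_cost
                + self.opportunity_cost
                + self.tooling_cost)

    def padded_total(self) -> float:
        """Raw estimate scaled up to account for uncertainty."""
        return self.raw_total() * self.uncertainty_multiplier


# Illustrative numbers only.
estimate = MigrationEstimate(
    developer_hours=2000,
    hourly_cost=100,
    opportunity_cost=150_000,
    tooling_cost=50_000,
)
print(f"raw estimate:    {estimate.raw_total():,.0f}")
print(f"padded estimate: {estimate.padded_total():,.0f}")
```

The padded figure, not the raw one, is the number to weigh against your current headaches.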

If all that weighs less than your current headaches, forge ahead. If you’re still unsure, perhaps the following bits of advice can help you decide if it’s worth the effort.

#2 Don’t make it a project

This may need some explanation. What I mean by a project is something you schedule and plan out, allocate many weeks of dev time to, and so on. That is, something you expect to work on continuously and uninterrupted, with a start date and a projected end date. Making the whole migration a project, with the goal of completing that project, is a common response to technical debt that I generally find to be a mistake.

There are a few reasons for that:

You’d most likely block and allocate the time of a subset of your organization to work on this technical debt project. They may be excited to do it because these can be interesting engineering problems that your team may have been thinking about solving for a while now. But this can disconnect those people from other ongoing development, and can sometimes result in friction because their goals are different than the rest of the organization.
If you put the whole organization on the project, you halt or slow any development as far as your users/customers are concerned. Plus, you probably have too many cooks in the kitchen, and you exacerbate any bottleneck problems you already had with the monolith. Ultimately your problem is going to be the slowdown in new development because…
It’s going to take longer than you think. Or longer than your team thinks—even your most senior and capable engineers. It is so easy to fall into the trap of underestimating the sheer complexity of a refactoring project. And it’s easy to forget all the subtle logic and little nuances gradually embedded in your code while getting caught up in the high-level architecture.

Instead of making it a project, I suggest making it an ongoing effort. It’s an important distinction because an effort doesn’t necessarily have a timeline and ideally doesn’t block other projects. Rather, it is one of your business goals.

#3 Commit to the effort

This, however, only works if you commit to the effort, and don’t let it slide again and again when you prioritize your work. Committing to the effort means engaging your organization and making sure the effort has cultural support.

It’s all too easy to get some nodding heads in a meeting, start strong, and then promptly have the effort overruled by competing priorities in the next sprint, week, quarter, etc. Urgent tends to defeat important in our day-to-day business, so you’ll need to keep beating the drum to avoid getting drowned out by the never-ending parade of urgency.

Here are some specific things you can do to make sure your organization remains committed to the effort:

Take the time to scope out the work required and break it into manageable, self-contained pieces, with clearly defined objectives. This will make it easier to do alongside other projects and reduces the risk at each step.
Make sure the work has the support of management, and—if your monolith is central to your organization—is seen as a strategic priority. Management needs to understand the advantages, as well as the looming impact of avoiding the effort. In many cases, you’ll be preempting disastrous scaling issues or productivity grinding to a halt, if it hasn’t already. In that context, a sustained refactoring effort may not sound too bad.
Demonstrate progress. If you’ve segmented the work into manageable chunks, you should have a sense of progress as you complete each part. This is both rewarding for the team (checking boxes and moving across columns feels nice!), and also avoids the sense that the refactor is a never-ending rabbit hole, which would likely cause disillusionment.

#4 Write tests

It is common and perfectly normal for automated tests to be lacking for any system, including legacy systems. If you’ve had the pleasure of refactoring code that had great tests already in place, you’ll appreciate this. Perhaps you have and didn’t even notice, because it went so smoothly.

Without automated tests, it’s very difficult to evaluate whether the new code does the same job as the old code. This applies whether you’re actually rewriting it (maybe in a new programming language) or just moving your old code around. Even the most trivial refactors can create subtle issues or inconsistencies, no matter how skilled the developers involved are.
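One way to make "does the same job" concrete is to run the old and new code side by side on the same inputs and report every disagreement. The sketch below assumes both versions can be called as plain functions; `find_mismatches` and the two shipping-fee functions are hypothetical, invented only to show how a small difference at a boundary can slip into a rewrite.

```python
from typing import Any, Callable, Iterable, List, Tuple


def find_mismatches(
    legacy: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Any, Any]]:
    """Return (input, legacy_result, candidate_result) for every disagreement."""
    mismatches = []
    for value in inputs:
        expected = legacy(value)
        actual = candidate(value)
        if expected != actual:
            mismatches.append((value, expected, actual))
    return mismatches


def legacy_shipping_fee(order_total_cents: int) -> int:
    # Hypothetical legacy rule: shipping is free only ABOVE 5000 cents.
    return 0 if order_total_cents > 5000 else 499


def new_shipping_fee(order_total_cents: int) -> int:
    # Hypothetical rewrite that changed ">" to ">=" by accident.
    return 0 if order_total_cents >= 5000 else 499


# Include the boundary values, since that is where rewrites tend to drift.
print(find_mismatches(legacy_shipping_fee, new_shipping_fee, [0, 4999, 5000, 5001]))
```

Here the only disagreement is at exactly 5000 cents, which is the kind of nuance that is easy to lose without a test pinning it down.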

If you don’t feel good about the current test coverage, start by addressing that issue. Writing good tests will dramatically ease the effort, and also serves as a great way to understand the codebase better, its structure, intricacies, and idiosyncrasies (you’re gonna find those for sure).

If you already have a bunch of tests, make sure to dig in and see what they cover. Do they test the system from the outside, e.g. at the API level, or do they only test the code underneath the surface? The latter (unit tests, basically), while important, have somewhat less direct value when the time comes to split the code up and move things around. Good integration and end-to-end tests that don’t make assumptions about the implementation under the surface can be kept as-is and used as benchmarks during the migration.
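To show what "testing from the outside" can look like, here is a small sketch of a check that only touches the API surface. The endpoint paths, the response fields, and the `Client` abstraction are all assumptions made for the example. In a real setup the client would send HTTP requests to the monolith (or later, to whatever sits in front of the new services); an in-memory stand-in is used here so the sketch runs on its own.

```python
import json
from typing import Callable, Tuple

# A client is anything that maps a request path to (status code, response body).
Client = Callable[[str], Tuple[int, str]]


def check_order_contract(client: Client) -> None:
    """Assert only on what a consumer of the API can observe."""
    status, body = client("/orders/42")
    assert status == 200, f"expected 200, got {status}"
    order = json.loads(body)
    assert order["id"] == 42
    assert order["status"] in {"pending", "shipped", "delivered"}

    status, _ = client("/orders/999")
    assert status == 404, f"expected 404, got {status}"


def stand_in_client(path: str) -> Tuple[int, str]:
    """In-memory stand-in for the system under test (hypothetical data)."""
    orders = {"/orders/42": {"id": 42, "status": "shipped"}}
    if path in orders:
        return 200, json.dumps(orders[path])
    return 404, json.dumps({"error": "not found"})


# The same check can be pointed at the monolith today and at the
# migrated system later, because it knows nothing about the internals.
check_order_contract(stand_in_client)
```

Because the check never imports or inspects the implementation, it survives the code being split up and moved around.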

#5 Slowly reduce it to a proxy

This is two separate bits of advice in one. The first part is to do the split incrementally. Don’t assume you can successfully do a wholesale migration and then flip a switch and everything will be dandy. It’s simply too risky. Find out where your hotspots are, your biggest bottlenecks or other challenges, and peel those out first, one by one.

The second part is perhaps optional and not always applicable, but I highly recommend keeping the monolith as a front to your new services for as long as is viable. This has the benefit of keeping the same surface as far as the user/consumer of the service is concerned, and allows you to keep the same integration and end-to-end tests throughout the process.

A sensible alternative to this is to first make a proxy, with just the monolith behind it, and then start to route to different services as you peel them out and/or add new ones. That’s pretty much equivalent, and your best strategy depends on your current system and where you’re headed. In either case, you can benefit from gradually migrating parts of your monolith without its users knowing or caring.
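The routing decision at the heart of such a proxy can be sketched in a few lines. This is only an illustration of the idea, not a production proxy: the `RoutingTable` class, its method names, the path prefixes, and the backend URLs are all hypothetical, and a real setup would more likely use an existing reverse proxy or API gateway.

```python
from typing import Dict


class RoutingTable:
    """Send every request to the monolith unless its path has been peeled out."""

    def __init__(self, default_backend: str) -> None:
        self.default_backend = default_backend
        self.routes: Dict[str, str] = {}

    def extract(self, path_prefix: str, backend: str) -> None:
        """Record that requests under path_prefix now belong to a new service."""
        self.routes[path_prefix.rstrip("/")] = backend

    def backend_for(self, path: str) -> str:
        """Pick the backend for a path; the longest matching prefix wins."""
        matches = [
            prefix for prefix in self.routes
            if path == prefix or path.startswith(prefix + "/")
        ]
        if not matches:
            return self.default_backend
        return self.routes[max(matches, key=len)]


# Start with everything behind the monolith (hypothetical internal URLs).
table = RoutingTable(default_backend="http://monolith.internal")
print(table.backend_for("/billing/invoices/7"))

# Later, once billing has been peeled out, only the routing changes.
table.extract("/billing", "http://billing.internal")
print(table.backend_for("/billing/invoices/7"))
print(table.backend_for("/orders/1"))
```

Callers keep using the same paths before and after the extraction, which is what lets the migration happen without them knowing or caring.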

So even as you move functionality to newly built services, the monolith can still exist throughout the refactor, and maybe for some time after it—either serving as the façade for new functionality or hidden behind a façade in the form of a proxy.

#6 Stick to known patterns

It’s tempting to go from legacy right to the bleeding edge. And it’s an understandable urge. You’re seeking to future-proof this time around so that you won’t face another refactor again anytime soon.

But I’d urge caution in this regard, and suggest taking an established route. Otherwise, you may find yourself wrangling two problems at once, and getting caught in a fresh new rabbit hole. Most companies can’t afford to pioneer new technology, and the ones that can tend to do it outside of any critical path for the business.

For much the same reason, it’s often advisable to avoid rewriting in a new language—or even something seemingly innocuous like moving from Python 2 to 3—in the same process, since it’ll be doubly difficult to debug any behavior differences you encounter along the way.
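To see why even a Python 2 to 3 move is less innocuous than it looks, consider two well-known behavior changes. The snippet below runs under Python 3; the comments note what the same expressions produced under Python 2. The variable names are just for illustration.

```python
# Division of two integers changed meaning between the versions.
half = 7 / 2          # 3.5 in Python 3; Python 2 truncated this to 3
floored = 7 // 2      # 3 in both versions

# Rounding of ties changed as well.
rounded = round(2.5)  # 2 in Python 3 (ties go to the even number); Python 2 gave 3.0

print(half, floored, rounded)
```

If a number comes out slightly different after the migration, you would have to work out whether the cause is the new service boundary or a language change like these.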

And whatever you do, don’t try to invent a whole new way to do microservices or whatever else. If you think you need to fully invent your own solution to your problem, as opposed to adopting a known solution, I’d offer that you probably don’t. Keep looking for solutions already out there, ask around, and field your problem outside your organization.

#7 Prepare to invest in tooling

For all its limitations, a monolithic architecture does have several intrinsic benefits. One of them is that it’s generally simple. You have a single pipeline and a single set of development tools. Venturing into a distributed architecture involves a lot of additional complexity, and there are lots of moving parts to consider, particularly if this is your first time doing it. You’ll need to compose a set of tools to make the developer experience palatable, possibly write some of your own (although I’d caution against this if you can avoid it), and factor in the discovery and learning process for all that as well.

It’s also the case that operating a microservices architecture is a lot different than operating a monolith. You’ll need to invest in monitoring tooling, and you should plan on a learning and adjustment period as your organization builds up operational expertise.

And these operational differences are organizational as well as technical. If many different teams are developing and deploying their services independently of each other, you might also need to rethink your communication channels and norms.

It may very well still be worth it, but don’t let these “day 2 problems” catch you off guard.

In closing

Legacy systems are a part of our life, and always will be. The world evolves ever more rapidly, and we’ll always be faced with technical debt of some sort. We just need to accept that as part of our existence as developers.

The monolith is one of those technical debts that most of us know, and you may have one in front of you right now. It may or may not be a pressing problem, but if it really is, make an effort to tease it apart gradually, and make sure to get buy-in from everyone involved.

There will be dragons, but you’ll be fine if you go about it sensibly, and consider the hard-learned lessons of those before you. Rest assured, you’re not alone.

Watch this space for a follow-up post which will go further into the technical challenges involved, and some ideas on how to mitigate them.

And since you’re here, you may find Garden a useful tool when it comes to migrating from a monolith, if you’re looking to work with containers and Kubernetes during your refactor.