Thursday, January 17, 2013

Frankensystems, Half-Strangled Zombies and other Monsters

There are lots of ugly things that can happen to a system over time. This is what the arguments over technical debt are all about – how to keep code from getting ugly and fragile and hard to understand and more expensive to maintain over time, because of sloppiness and short-sighted decision making. But some of the ugliest things that happen to code don’t have anything to do with technical debt. They’re the result of conscious and well-intentioned design changes.

Well-Intentioned Changes can create Ugly Code

Bad things can happen when you decide to re-architect or rewrite a system, or start some large-scale refactoring, but you don’t get the job done. Other more important work comes up before you can finish transitioning all of the code over to the new design or the new platform – or maybe that was never going to happen anyway, because you didn't have the budget and the mandate to do the whole job in the first place. Or the person who started the work leaves, and nobody else understands their vision well enough to carry it through – or nobody who’s left cares about it enough to finish it. Or you get just far enough to solve whatever problems you or the customer really cared about, and there’s no good business case to keep going.

Now you’re left with what a colleague of mine calls a “Frankensystem”: different designs and different platforms spliced together in a way that works but that is horribly difficult to understand and maintain.

Why does this happen? How do you stop your system from turning into a monster like this?

Branching by Abstraction

One way that code can get messed up, in the short term at least, is through Branching by Abstraction, an idea that has become popular in shops that Dark Launch changes through Continuous Deployment or Continuous Delivery.

In Branching by Abstraction (also known as “branching in code”), instead of creating a feature branch to isolate code changes, and then merging the changes back when you’re done, everyone works in trunk. If you need to make bigger code changes, you start by writing temporary scaffolding (abstraction layers, conditional logic, configuration code like feature switches) to isolate the changes that you’ll need to make, and then you can make your changes directly in the code mainline in small, incremental steps. The scaffolding serves to protect the rest of the system from the impact of your changes until all of the work is complete.
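As a rough illustration of the scaffolding described above, here is a minimal Python sketch. Everything in it is hypothetical and invented for the example: the TaxCalculator abstraction, the two implementations, and the use_new_tax_calculator switch name. It only shows the general shape (an abstraction layer plus a feature switch that defaults to the old path), not how any particular team or tool does it.

```python
from typing import Mapping, Protocol


class TaxCalculator(Protocol):
    """Abstraction layer (hypothetical): callers depend on this interface,
    not on either implementation."""

    def tax_for(self, amount_cents: int) -> int: ...


class LegacyTaxCalculator:
    """The existing code path, left alone while the new one is built."""

    def tax_for(self, amount_cents: int) -> int:
        return amount_cents * 5 // 100


class NewTaxCalculator:
    """The replacement, developed on trunk in small steps behind the switch."""

    def tax_for(self, amount_cents: int) -> int:
        return amount_cents * 7 // 100


def make_calculator(switches: Mapping[str, bool]) -> TaxCalculator:
    """Temporary scaffolding: pick an implementation from a feature switch.

    The switch name is an assumption for this sketch. A missing switch
    means "off", so the old behaviour is the default.
    """
    if switches.get("use_new_tax_calculator", False):
        return NewTaxCalculator()
    return LegacyTaxCalculator()
```

With the switch off (or absent), make_calculator({}) returns the legacy path; passing {"use_new_tax_calculator": True} returns the new one. When the work is done, make_calculator and LegacyTaxCalculator are the pieces that should be deleted.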

Branching by Abstraction tries to address the problems of managing feature branches, and of misusing them (especially long-lived branches) – if you don’t let developers branch, then you don’t have to figure out how to keep all of the branches in sync and manage merge conflicts. But with Branching by Abstraction, until the work is complete and the temporary scaffolding code removed, the code will be harder to maintain and understand, and more brittle and error-prone, as James McKay points out:

“…visible or not, you are still deploying code into production that you know for a fact to be buggy, untested, incomplete and quite possibly incompatible with your live data. Your if statements and configuration settings are themselves code which is subject to bugs – and furthermore can only be tested in production. They are also a lot of effort to maintain, making it all too easy to fat-finger something. Accidental exposure is a massive risk that could all too easily result in security vulnerabilities, data corruption or loss of trade secrets. Your features may not be as isolated from each other as you thought you were, and you may end up deploying bugs to your production environment”.

If you decide to branch in code like this (we do branching in code in some cases, and feature branching in others – branching in code is good for rolling out behind-the-scenes plumbing changes, not so good for big functional changes), be careful. Review your scaffolding to ensure that your code changes are completely isolated, and test with old and new configurations (switches off and on) to check for regressions. Minimize the number of changes that the team rolls out at one time, so that there’s no chance of changes overlapping or colliding. And to keep Branching by Abstraction from becoming a maintenance nightmare, make sure that you remove temporary scaffolding as soon as you are done with it.
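Testing with switches off and on can be sketched as running the same check against every switch configuration. The helper names below (switch_combinations, failing_configurations) are made up for this example, and the check function is whatever regression test you already have. It is a sketch under those assumptions, not a prescribed harness.

```python
import itertools
from typing import Callable, Dict, Iterable, Iterator, List

Config = Dict[str, bool]


def switch_combinations(names: Iterable[str]) -> Iterator[Config]:
    """Yield every off/on assignment for the given switch names."""
    names = list(names)
    for values in itertools.product([False, True], repeat=len(names)):
        yield dict(zip(names, values))


def failing_configurations(
    names: Iterable[str], check: Callable[[Config], bool]
) -> List[Config]:
    """Run the same check under every configuration and collect the failures."""
    return [config for config in switch_combinations(names) if not check(config)]


if __name__ == "__main__":
    # Hypothetical check: two changes that each work alone but collide together.
    def check(config: Config) -> bool:
        return not (config["new_cache"] and config["new_serializer"])

    print(failing_configurations(["new_cache", "new_serializer"], check))
```

The number of configurations doubles with every switch that is live at the same time (three switches means eight combinations to test), which is one concrete reason to keep the number of in-flight changes small and to remove switches promptly.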

Half-Strangled Zombies

Branching by Abstraction can lead to ugly code, at least for the few weeks or months that it will take to roll out each change. But things can get much worse in the code if you try to do a major rewrite or re-architecture of a system incrementally, for example “strangling” the existing system with new code and a new design (another approach coined by ThoughtWorks), and slowly suffocating the old system.

Strangling a system lets you introduce a new design or change over to a new and modern platform without having to finish a long and expensive rewrite first. The strangling work is done in parallel, usually by a separate team, letting the rest of the team maintain the old code – which of course means that both teams need to keep in sync as changes and fixes are made.
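One common way to picture strangling is a facade that sits in front of the old system and routes each operation to new code once that operation has been migrated. The Python sketch below assumes this shape. The StranglerFacade class, its method names, and the operation names in the usage note are all hypothetical, and a real system would be routing requests, screens or services rather than strings.

```python
from typing import Callable, Dict, Iterable, List

LegacyHandler = Callable[[str, str], str]  # (operation, payload) -> result
NewHandler = Callable[[str], str]          # payload -> result


class StranglerFacade:
    """Routes each operation to new code if it has been migrated,
    and falls through to the legacy system otherwise (hypothetical sketch)."""

    def __init__(self, legacy: LegacyHandler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, NewHandler] = {}

    def migrate(self, operation: str, handler: NewHandler) -> None:
        """Register new code that takes over one operation."""
        self._migrated[operation] = handler

    def handle(self, operation: str, payload: str) -> str:
        handler = self._migrated.get(operation)
        if handler is not None:
            return handler(payload)
        return self._legacy(operation, payload)

    def unmigrated(self, all_operations: Iterable[str]) -> List[str]:
        """Operations the old system still serves, in the order given."""
        return [op for op in all_operations if op not in self._migrated]
```

For example, after facade.migrate("quote", new_quote_handler), calls to facade.handle("quote", ...) reach the new code while every other operation still goes to the legacy handler. The list returned by unmigrated is the work that remains before the old system can actually be switched off.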

But if you don’t finish the job, you’ll be left with a kind of zombie, a scary half-dead and half-alive thing with ugly seams showing, as Nat Pryce warns against in this Stack Overflow post:

"The biggest problem to overcome is lack of will to actually finish the strangling (usually political will from non-technical stakeholders, manifested as lack of budget). If you don't completely kill off the old system, you'll end up in a worse mess because your system now has two ways of doing everything with an awkward interface between the two. Later, another wave of developers will probably decide to strangle what's there, writing yet another strangler application, and again a lack of will might leave the system in an even worse state, with three ways of doing things….

I've seen critical systems that have suffered both of these fates, and ended up with about four or five "strategic architectural directions" and "future state architectures". One large multi-site project ended up with eight different new persistence mechanisms in its new architecture. Another ended up with two different database schemas, one for the old way of doing things and another for the new way, neither schema was ever removed from the system and there were also multiple class hierarchies that mapped to one or even both of these schemas."

Strangling, and other incremental strategies for re-architecting a system, will let you start showing benefits to the customer early, before all of the work of writing the new system is done. This is both an advantage and a problem. Because once the customer starts to get what they really care about (some nice new screens or mobile access channels or better performance or faster turnaround on rules changes or…) you may not be able to make the business case to finish up the work that’s left. Everyone understands (or should) that this means you’re stuck with some inconsistencies – on the inside certainly, and maybe on the outside too. But whatever is there does the job, and keeping this mess running may cost a lot less than finishing the rewrite, at least in the short term.

Frankensystems and Zombies are Everywhere

Monster-making happens more often than it should to big systems, especially big, mission-critical systems that a lot of different people have worked on over a long time. As Pryce warns, it can even happen multiple times over the life of a big system, so that you end up with several half-realized architectures grafted together, creating all kinds of nasty maintenance and understanding problems.

When making changes or adding features, developers will have to decide whether to do it the old way or the new way (or the other new way) – or sometimes they will need to do both, which means working across different architectures, using different tools and different languages, and often having to worry about keeping different data models in sync. This complexity means it’s easy to make mistakes or miss or misunderstand something, and testing can be even uglier than the coding.

You need to recognize these risks when you start down the path of incrementally changing a system’s direction and design – even if you believe you have the commitment and time to finish the job properly. Because there’s a good chance that you’ll end up creating a monster that you will have to live with for years.
