Platform Responsibilities: Embrace Tension

Ben Brantley, March 22, 2013

Many of the best engineering problems form at the crossroads of competing interests: material strength versus weight for rocket parts, displacement versus efficiency for engines, or energy density versus longevity (or perhaps flammability) for batteries.

At Guidewire, we build complex core systems for insurance, so these types of challenges are constantly tumbling out, large and small alike. Consequently, we have learned to embrace them when they do: we identify the things that are in tension with each other, acknowledge the pros and cons of the choices in each dimension that have to be made, and then chart the most rational course to moderate the differences and properly balance risk and reward.

Sounds straightforward, right?

My friends, I am here to tell you that it is not always easy. Let me give you an example. One of the most exquisite of our engineering challenges is something that is woven right through the heart of all of the code we write. It is a classic set of design tradeoffs in large-scale computational systems. It is the tension between representational precision, transactional integrity, and performance.

Let's start with integrity. Since insurance is a business of promises, there is simply nothing more important than being able to correctly represent these promises (and the servicing of them) in the system of record. We use databases, persistence mapping, distributed locking schemes, and lots of other tools to ensure that we have a consistent, perfect record of the business work stored on the disk at all times. And that part would be pretty easy, all by itself, if we could just use a really simple model.

But we can't just use a simple model, because the business of making those contractual promises well depends on being able to capture lots of nuance and detail with respect to them (and the servicing of them). That means that we must take data structure and design seriously, and in fact today we manage thousands upon thousands of individual elements in our base relational model.[1] Of course, we also make sure that each of our customers can take that starting point and elaborate on top of it, as needed, to further tailor it to their needs.

Maintaining integrity across a large representational model already introduces lots of engineering decision points, but the challenge compounds when we layer in scale.

It's really pretty simple: we build core systems for an industry that itself has incredible size diversity. There are meaningful, small, sustainable insurance companies that write handfuls of policies or have whole years with no claims. And there are carriers 1,000 times larger anchoring the other end of the spectrum, supporting premium volumes into the many tens of billions USD. We write one InsuranceSuite that must serve these two kinds of carriers, plus all of the ones in between, at the same time.

We tackle this problem in many ways, the first of which is to be mindful of scale considerations in the first place. That means that new features, before they make their way from an idea on a whiteboard to the first few lines of code, get analyzed in the context of scale. It's easy to come up with new ideas, but at the end of the day, we exercise great selectivity and care in deciding what capabilities to introduce into the codebase, because we, through our customers' many production environments, will have to clear the scale hurdle with confidence.

The next thing we do, of course, is test. I talked about unit testing, wherein we check for correctness in our software, in a previous post. However, we also have a whole other set of test tools, infrastructure, and humans who do performance testing by crafting complex, long-running production simulations for our suite. We put our software through these torture tests during every release, and that means that we can identify hot spots and tune them out in the factory, rather than foisting our engineering hopes and dreams on our customers.

The last thing we do is perhaps the hardest of all: we compromise. I know, it sounds crazy! Who wouldn't want a no-compromise, do-everything, infinitely-scalable core system? (Answer: the person who does not have infinite time and energy to build the impossible.) Compromise comes in many forms, including rejecting a design outright, or backing away from it in part. In some cases, we work directly with our customers to make hard decisions together. Sometimes we don't have to compromise, but we instead introduce a new component that has different scale characteristics. Sometimes we denormalize our data model to cool down a hot query. We precompute. We cache. We shed tears, sweat, and blood.
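To make the denormalize-and-precompute idea concrete, here is a minimal sketch in Python. It is not Guidewire code: the `ClaimLedger` class, its method names, and the idea of tracking payments per policy in cents are all assumptions invented for illustration. It shows the same question ("how much has been paid on this policy?") answered two ways: by scanning every payment on each read, and by reading a running total that is maintained on every write.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimLedger:
    """Hypothetical in-memory store of claim payments, keyed by policy id."""

    # Normalized record: every individual payment, per policy.
    payments: dict = field(default_factory=dict)
    # Denormalized copy: a precomputed running total, per policy.
    totals: dict = field(default_factory=dict)

    def record_payment(self, policy_id: str, amount_cents: int) -> None:
        # The write does extra work so that the hot read can stay cheap.
        self.payments.setdefault(policy_id, []).append(amount_cents)
        self.totals[policy_id] = self.totals.get(policy_id, 0) + amount_cents

    def total_paid_scan(self, policy_id: str) -> int:
        # Cost grows with the number of payments on the policy.
        return sum(self.payments.get(policy_id, []))

    def total_paid_precomputed(self, policy_id: str) -> int:
        # Constant-time lookup of the denormalized value.
        return self.totals.get(policy_id, 0)


ledger = ClaimLedger()
for amount in (12_500, 40_000, 7_250):
    ledger.record_payment("POLICY-1", amount)

# Both paths must agree; keeping them in agreement is the integrity cost.
assert ledger.total_paid_scan("POLICY-1") == ledger.total_paid_precomputed("POLICY-1")
```

The compromise is visible even in a toy like this: the read gets faster, but every write now has two things to keep consistent, which is exactly the tension between integrity and performance described above. In a real system of record, both updates would need to happen inside the same transaction.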

I didn't write about this challenge to engender sympathy. It's our job, here at Guidewire, to find the right balance between many tough choices. We do this with full awareness that we are making living software: code that has to run well, every day, for thousands of users around the world. That's just part of the job, and it's a big part of what makes it worth doing.

[1] We also use non-relational representations for some specialized computations. I'll share more about those in another entry.