There is a phrase that is used very frequently at Google and other big tech companies. It’s a nice and short phrase, and the problem it describes is broadly applicable to modern software engineering. Yet it doesn’t seem to be used much outside of the Google diaspora, and there doesn’t seem to be an alternative name for the concept either. The phrase is version skew, or for short: skew.
Version skew happens when two or more distributed systems with a dependency between them get deployed, and because their mutual deployment isn’t atomic, they temporarily operate at different versions.
A more concrete example is: System A calls System B. A change is made to both systems. Both systems get deployed. System B takes a long time to deploy while System A deploys in seconds. Most instances of System A are now newer than System B. If System A relies on System B having the new change, the distributed system is now in a broken state.
This article describes three aspects of version skew: examples of skew, skew boundaries, and skew management strategies.
Examples of skew #
This sounds abstract and complicated but in practice it happens all the time. Some examples of version skew:
Frontend client skew #
You deploy a new version of your website. But some clients still have the old website loaded. They might keep the browser tab open for days or even weeks. When these old clients make API calls to your backend, the backend needs to deal with being compatible with these old clients.
The same problem can happen with native mobile apps, except that some users might have auto updates turned off and may never update from the initial version of their app.
Microservice skew #
You might develop a microservice architecture from a monorepo. The repo makes it appear as if you could make consistent changes across services in a single PR. E.g. you may add a new required field to an existing API. But in practice the services that depend on this API may not be deployed at the same time. Even if deployments are fast, they take seconds, and within those seconds existing clients may not yet include the new required field while hitting backends that assume it is present.
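This failure mode can be sketched as a toy request handler; the payload fields here are hypothetical:

```python
def handle_request_v2(payload: dict) -> dict:
    """New server version: assumes the newly added field is present."""
    # This deployed server expects every caller to send 'currency' -- but
    # clients built before the change still send the old payload shape.
    return {"total": f"{payload['amount']} {payload['currency']}"}

# A request serialized by a client that predates the change:
old_client_payload = {"amount": 100}

try:
    handle_request_v2(old_client_payload)
    outcome = "ok"
except KeyError as missing:
    outcome = f"broken: missing {missing}"

print(outcome)  # the distributed system is now in a broken state
```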
Configuration skew #
Sophisticated systems often have a way to deploy system configuration without deploying new versions of the software itself. Typically, such configuration updates are much faster than software updates.
This may lead to at least two types of problems:
- Software may receive configuration it cannot process yet (because the ability was introduced in a newer version.)
- Two systems that are running at exactly the same version may still run with different configurations and hence behave in unexpected ways.
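One way to soften the first problem is to make the software tolerate configuration it cannot process yet. A minimal sketch, assuming hypothetical config keys:

```python
KNOWN_KEYS = {"max_connections", "timeout_ms"}  # keys this binary understands

def load_config(raw: dict) -> dict:
    """Keep only the keys this software version knows how to process."""
    unknown = set(raw) - KNOWN_KEYS
    if unknown:
        # Newer configuration reached an older binary: skip what we
        # cannot process yet instead of failing the whole config load.
        print(f"ignoring unknown config keys: {sorted(unknown)}")
    return {k: v for k, v in raw.items() if k in KNOWN_KEYS}

# Configuration written for a newer software version than this one:
cfg = load_config({"max_connections": 10, "retry_policy": "exponential"})
```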
Experiment or feature flag skew #
An important special case of configuration skew is experiment or feature flag skew. In this case two systems may again run at the same version, but calculate a different set of feature flag values (either through a static set of configuration or some experiment system) for the given API call.
Concretely, if system A calls system B, and both systems calculate the value of the feature flag awesome_launch independently, they might produce inconsistent results: one system thinks the feature has launched while the other thinks it is still hidden.
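One common way this disagreement arises: both systems bucket users deterministically, but see different rollout percentages due to configuration skew. A sketch, with hypothetical rollout values:

```python
import hashlib

def flag_enabled(user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a 0-99 slot for a rollout."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# System A already received the config bump to 50%; system B has not.
user = "user-123"
system_a = flag_enabled(user, rollout_percent=50)
system_b = flag_enabled(user, rollout_percent=10)
# If the user's bucket falls between 10 and 50, A serves the launch
# while B still hides it -- for the very same request.
```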
Skew boundaries #
Skew boundaries are the system boundaries between which some form of version skew may appear. Not all skew boundaries are created equal. Some explicitly manage the possibility of skew while others may be extremely difficult to reason about. Some examples are:
Service API skew #
Explicit APIs between services are the most obvious skew boundary. Accordingly, these boundaries must be actively managed for skew. E.g. one may mandate that servers never introduce new required fields into existing APIs, because there will always be clients that don’t yet know they need to send the new field.
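On the server side, this mandate amounts to treating every field added after the initial API launch as optional with a safe default. A sketch with a hypothetical `currency` field:

```python
def handle_request(payload: dict) -> dict:
    # 'currency' was added after this API shipped, so the server must
    # assume some clients don't send it yet and fall back to a default.
    currency = payload.get("currency", "USD")
    return {"total": f"{payload['amount']} {currency}"}

old_client = handle_request({"amount": 100})                    # no new field
new_client = handle_request({"amount": 100, "currency": "EUR"})
```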
HTML skew #
HTML as a skew boundary is so difficult to reason about (an old HTML page may reference assets and API shapes that no longer exist after a deploy) that one may well decide the only way to manage this type of skew is to ensure it never occurs.
Database schema skew #
Data at rest introduces special skew challenges, as updating its shape usually cannot be done with any of the skew management strategies that apply to running software systems.
- In schemaful databases such as relational databases a schema change could roll into the database while there are still clients that cannot handle the new schema.
- Databases without a fixed schema may host data whose logical schemas go back through years of incremental changes. Clients may need to be backward compatible with every version that was ever written, or remain backward compatible while migrations of existing data are running.
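In practice this often means a normalization layer that can read every historical shape of a record. A sketch, assuming a hypothetical user record whose `name` field was later split in two:

```python
def read_user(record: dict) -> dict:
    """Normalize a stored record written under any historical schema."""
    if "name" in record and "first_name" not in record:
        # v1 rows stored a single 'name'; v2 split it into two fields.
        first, _, last = record["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    return {"first_name": record["first_name"],
            "last_name": record.get("last_name", "")}

legacy = read_user({"name": "Ada Lovelace"})                        # v1 row
current = read_user({"first_name": "Alan", "last_name": "Turing"})  # v2 row
```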
Skew management strategies #
There are a broad range of strategies (also called "Push Safety" at Facebook/Meta) to manage skew that come with varying trade-offs and applicability to various situations:
Version locking #
In some scenarios it may be possible for clients to explicitly require a server running the exact same version as themselves. E.g. there might be a routing layer that routes their API calls to servers at the desired version. Naturally, this makes deployments substantially more complex, if it is possible at all, as it requires the infrastructure to run two or more versions of the same software at production scale at the same time.
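The routing layer can be sketched as a lookup from the client's pinned version to the replicas still running that version; all names here are hypothetical:

```python
# Replicas currently serving traffic, keyed by their deployed version.
REPLICAS = {
    "v41": ["10.0.0.1", "10.0.0.2"],
    "v42": ["10.0.1.1"],  # the new version, still rolling out
}

def route(client_version: str) -> str:
    """Pin a request to a replica running the client's own version."""
    replicas = REPLICAS.get(client_version)
    if not replicas:
        raise LookupError(f"no replica still runs {client_version}")
    return replicas[0]  # a real router would also load-balance

pinned = route("v41")  # an old client keeps talking to an old replica
```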
Vercel's Skew Protection for deployments is an example of transparent version-locking that eliminates skew between web clients and servers.
Be backward compatible #
Servers must never introduce new required fields in APIs until they know all clients are updated.
Protocol buffers, which were carefully designed to allow managing version skew, distinguish optional and required fields. Most mature teams mandate that a required field must never be introduced into an existing API (proto3 went further and removed required fields from the language entirely).
In protocol buffers it is typically a safe operation to rename a field (implementations that serialize to human-readable JSON do not have this property), because fields are identified on the wire by their field number rather than their name. In JSON-based APIs, renaming a field is a backward-incompatible operation.
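The JSON failure mode is easy to demonstrate; the field names here are hypothetical:

```python
def new_server(payload: dict) -> str:
    # Both sides were "renamed" in one change -- but requests serialized
    # by old clients still carry the old key, and JSON matches by name.
    return payload["username"]

old_request = {"user_name": "ada"}  # serialized by a not-yet-updated client

try:
    new_server(old_request)
    rename_safe = True
except KeyError:
    rename_safe = False  # unlike protobuf's numeric field numbers
```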
Rollback windows can help put a boundary on the maximum duration of required backward compatibility. However, backward compatibility may be required long-term if
- Clients cannot be easily or ergonomically forced to update (such as for web and native apps).
- The skew is against data stored at rest that cannot easily be migrated to a new schema, for example because the newly expected data was simply never collected.
Be forward compatible #
Clients must never require that servers can process recently introduced API changes.
E.g. if an API introduces a new optional argument, then clients cannot assume that the server will actually process it.
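A sketch of this rule, simulating old and new server versions in one function; the `lang` argument is hypothetical:

```python
def search(server_version: int, query: str, lang=None) -> str:
    """Simulated server: support for 'lang' only shipped in version 2."""
    if server_version >= 2 and lang is not None:
        return f"results for {query} in {lang}"
    # Older servers silently ignore the argument they don't know about.
    return f"results for {query}"

# A forward-compatible client may send the new argument but must not
# depend on it being honored:
v1_answer = search(server_version=1, query="skew", lang="de")
v2_answer = search(server_version=2, query="skew", lang="de")
```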
Rollback windows #
One may decide that servers or clients must never be reverted to an old version after they have been in production for N days. That allows clients to reason about when it is safe to assume that a server has a certain capability they rely on.
The backward and forward compatibility mandates mentioned above would then be lifted once a change has been deployed for the duration of the rollback window.
A common pattern in teams using rollback windows is that an engineer
- submits a PR that puts a certain behavior in place,
- waits for it to be deployed,
- waits for the duration of the rollback window,
- and finally submits a second PR that relies on the change from the original PR.
Amazon calls a variant of this strategy two-phase deployment.
Explicit versioning #
Maybe the most obvious skew management strategy is explicit versioning. E.g. in REST APIs it is common to include an explicit version string in the URL path, which allows clients to select the version of the API they are compatible with. The main downside of explicit versioning is the increased maintenance cost of supporting multiple versions.
Example of versioning #
POST /v0/user: An avatar picture originally was not required.
POST /v1/user: It's an error to submit a new user without an avatar picture.
One benefit of versioning over relying purely on backwards compatibility is that while an old version may tolerate certain missing information, the new version can help engineers with proper error messages for missing data.
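A sketch of such a versioned handler, with hypothetical field names:

```python
def create_user(version: str, payload: dict) -> dict:
    """Versioned endpoint: v1 made the avatar mandatory."""
    if version == "v1" and "avatar" not in payload:
        # The new version can reject missing data with a precise error
        # instead of silently tolerating it for backward compatibility.
        raise ValueError("v1 requires an 'avatar' field")
    return {"name": payload.get("name"), "avatar": payload.get("avatar")}

legacy_user = create_user("v0", {"name": "ada"})  # old contract, accepted
try:
    create_user("v1", {"name": "ada"})
    error = None
except ValueError as e:
    error = str(e)
```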
Arguments > flags #
Feature flag skew can be avoided by mandating that if a feature flag modifies behavior across multiple systems, then only the client must evaluate the flag.
The client then proceeds to change the behavior of the server by passing an argument to the API call that triggers the behavior of the feature flag, rather than reevaluating the feature flag on the server-side.
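A sketch of the pattern; the flag name comes from the example earlier in the article, everything else is hypothetical:

```python
FLAGS = {"awesome_launch": True}  # evaluated in exactly one place

def server_render(awesome: bool) -> str:
    # The server never re-evaluates the flag; it only honors the argument.
    return "launched page" if awesome else "hidden page"

def client_call() -> str:
    awesome = FLAGS["awesome_launch"]  # the single point of evaluation
    return server_render(awesome=awesome)

page = client_call()
```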
Version skew is an everyday experience in software deployment and can lead to subtle or not-so-subtle errors. Mature deployment practices manage skew explicitly: through skew boundaries and skew management strategies that avoid skew altogether when possible, and that make skew easy for humans to reason about when it cannot be avoided.
And if you'd rather not deal with the problem yourself, check out my recent project Skew Protection, which we shipped at Vercel and which can eliminate this class of problems by implementing version locking at the platform layer.