Error handling in GraphQL is often not easy. It’s actually pretty hard. This is why many developers struggle to build a really usable GraphQL API.
If this sounds like you, you are not alone. If it doesn’t sound like you, you might still stumble upon these issues soon.
A good API can be a game-changer, especially if the product you build does not only want to rely on a user interface you provide. Even if it’s only your own product that suffers from difficult error handling, any frontend engineer (and even yourself) will thank you if you make handling errors as simple as possible.
Thankfully, there are a few strategies you can apply to improve the error handling of your GraphQL API, which will make it more of a pleasure to use and work with.
None of these are difficult to implement. They just require some careful thinking and planning before you start.
The Issue With GraphQL And Errors
GraphQL is an amazing API technology, but it has its “flaws”.
It’s great to enable your clients to retrieve only the data they really care about (while leaving out the additional burden of the 321 properties they will never need in their lifetime). But if you use GraphQL over HTTP (yes, that is really a thing, because GraphQL is protocol-agnostic and the whole spec doesn’t mention a required protocol even once), you might have noticed that any well-formed GraphQL response usually has to be sent with an HTTP 200 status code. If you haven’t, I’m happy I can tell you about it!
Other status codes are reserved for very specific use cases, and this usually leaves you only with an HTTP 200, which does not help at all when encountering errors.
This is why GraphQL introduces the optional “errors” array into any GraphQL response, which looks like this:
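As a minimal sketch, assuming a hypothetical product query that failed, such a response could look like the following. It is written as a Python dictionary so it can be serialized and inspected. The field inside data and the message text are made up for illustration, while message, locations, and path are the entries the specification defines.

```python
import json

# A GraphQL response whose (hypothetical) `product` field could not be resolved.
response = {
    "data": {"product": None},
    "errors": [
        {
            "message": "Could not load product",
            "locations": [{"line": 2, "column": 3}],
            "path": ["product"],
        }
    ],
}

# Serialized, this is the JSON body a client receives, typically with HTTP 200.
print(json.dumps(response, indent=2))
```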
Errors can always be appended to a GraphQL response, even if data is returned. All the presence of the errors array states is that something went wrong while processing the request.
The above can be boiled down to the following:
- If errors are absent, everything went fine
- If errors are present, something went wrong, and many clients will throw an error, even if data was returned
Errors in GraphQL are specified, but not strictly. The three properties message, locations, and path conform to the standard. Adding additional properties is not a violation of the specification, but it is highly discouraged. Many client libraries still expect only these three properties to be included within an error object. Adding anything else might give your users or colleagues a hard time using your API.
To battle this inflexibility, the GraphQL specification introduces the extensions object into any error object. Within these extensions, you can put anything you want, without giving clients a hard time interpreting your responses (technically, you only leave it up to them to discover and interpret them).
The error example from before can be extended with extensions like this:
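A sketch of what this could look like, again as a Python dictionary. The keys inside extensions (code, retryable, timestamp) are purely illustrative assumptions. Nothing in the specification prescribes them.

```python
import json

error = {
    "message": "Could not load product",
    "locations": [{"line": 2, "column": 3}],
    "path": ["product"],
    # Everything below is free-form; these keys are illustrative assumptions.
    "extensions": {
        "code": "UPSTREAM_UNAVAILABLE",
        "retryable": True,
        "timestamp": "2024-01-01T00:00:00Z",
    },
}

response = {"data": {"product": None}, "errors": [error]}
print(json.dumps(response, indent=2))
```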
The extensions object can contain anything. There are no boundaries to what you can put there. It allows for a lot of flexibility, but it also has one huge issue: You cannot type it reliably.
Unlike entities in your GraphQL schema that can be strictly typed by deriving a type system from the schema, error extensions are (conceptually) only like a Map<String, Any>. There is nothing you can send within your response (except for some very nasty hacks like putting a whole JSON schema into each error’s extensions) to help clients generate type-safe code for your API’s (hopefully) standardized extensions.
Error extensions also have a second, smaller issue: It’s more difficult to access them. Unlike the usual GraphQL response that can easily be accessed from your favorite GraphQL client library, you need to put extra work into extracting the errors and their extensions to make them usable within a client’s logic.
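To illustrate the extra work, here is a small sketch of a helper that digs a code out of each error’s extensions. The function name and the code key are assumptions made up for this example. Note how every access has to be defensive because nothing guarantees the shape of extensions.

```python
from typing import Any, Dict, List


def extract_error_codes(response: Dict[str, Any]) -> List[str]:
    """Collect the hypothetical `code` extension from every error, if present."""
    codes = []
    for error in response.get("errors") or []:
        # `extensions` is optional and untyped, so every access must be defensive.
        extensions = error.get("extensions") or {}
        code = extensions.get("code")
        if isinstance(code, str):
            codes.append(code)
    return codes


response = {
    "data": None,
    "errors": [
        {"message": "Upstream failed", "extensions": {"code": "UPSTREAM_UNAVAILABLE"}},
        {"message": "No extensions here"},
    ],
}
print(extract_error_codes(response))
```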
When errors are difficult to handle in GraphQL, what then? What is it that you can do to improve the situation and make your API as usable as possible? How can you prevent a night full of coffee or energy drinks with Netflix running on the side while trying to come up with a suitable way to use your API?
You can first try to classify common errors and see whether you can apply different patterns to different kinds of errors because, as you will see, not all errors need to be treated with all of someone’s energy.
Classifying Errors
It makes a lot of sense to first think about which types of errors (on a very high level) your API can actually experience. But which classifications make sense?
There are many ways to classify errors, from the lowest level to the highest one available, but in reality, there is one specific classification that makes the most sense: Classifying errors by their (un-)planned impact.
If you think about this for a moment, you will probably realize that it’s a good idea to think of errors as either wanted or unwanted states of your system.
Some errors are always unexpected. They just happen although you might not be happy about it. An upstream API may be unavailable, your database might be down, or sometimes the math your API performs just doesn’t add up.
Other errors are planned. Users aren’t logged in, there might be legal reasons that prevent you from showing content to users in certain regions (geo-blocking), or a user might simply need a subscription to view a specific page.
Now, you might look for better names for these two classes of errors (planned and unplanned may not quite satisfy you, and I feel you). We can come up with two very fitting names:
Technical errors and Business errors.
But before we go deeper into how to handle each class of errors, we should first take a look at the error classification and specify what technical errors and business errors really are.
Technical Errors
Technical errors originate from technical issues. Upstream APIs that are down, a faulty database, or even your API running out of memory are all errors that can be boiled down to technical issues. The list can be extended infinitely but you probably get the idea by now.
These kinds of errors are always unwanted. You can plan for them but never exactly tell when they will occur.
Errors like these are a necessary evil and you usually cannot eliminate them. They belong to the product you build like every line of code you churn out.
Thankfully, they make up their own category and can all be handled just the same, which makes it easier to process and handle them in a way that also suits any client.
Technically, there is not much a client can do when such an error occurs. It can retry the request and see whether it gets another response that time, or it can simply show a banner with an error message and say sorry to the user, asking them to retry the action later.
The important point about technical errors is that they can usually only be handled implicitly.
Business Errors
Business errors originate from the need of the business to create error states. Although your business analysts or product managers and owners or even yourself might not call them errors, technically they still are.
Errors like these are planned for. You expect and want them to happen in very specific circumstances. In these cases, there is simply no other way than to force the user to do something or deny them a specific action.
When a user encounters such a state, there is usually a plan for what should happen next (there really should be one). If some content is behind a paywall, there is often a process to make it as easy as possible for the user to upgrade to a paid plan. This usually involves a whole flow within your software, which in the end, brings the user back to the actual resource they tried to access after they spent money on your product.
At least theoretically, errors like these feel more like responses that lead to very specific actions, and this distinguishes them from technical errors. Unlike technical errors, business errors trigger specific actions. Or in other words: Business errors always need to be handled in a very explicit way. You want to force the client to do something based on that very specific response.
Handling Technical Errors
Going back to our initial definition of technical errors as errors that simply happen out of your control and the fact that you can’t do too much about them, there is one very convenient way to state their existence: An error object within the response.
As you can’t do very much about them, and given that their occurrence usually prevents users of your API from proceeding with whatever they wanted to do, an error object is the best choice: it takes advantage of how most GraphQL clients handle errors, namely by throwing some kind of exception.
If you encounter a technical error within your API, you can usually just throw an exception yourself and let the framework pick it up and map it into an error object within the response, or simply add the error object yourself.
Going back to the original, extended example, such a response can look as follows:
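The following sketch shows one way to get there without any framework. The exception class, the helper to_graphql_error, and the code extension are assumptions made up for this example. A real GraphQL server library would normally do this mapping for you, and it would also fill in locations, which is left out here because the sketch has no query document to point into.

```python
from typing import Any, Dict, List


class UpstreamUnavailable(Exception):
    """Hypothetical exception raised when a dependency cannot be reached."""


def to_graphql_error(exc: Exception, path: List[str]) -> Dict[str, Any]:
    # Map any exception into the error shape of a GraphQL response.
    return {
        "message": str(exc),
        "path": path,
        "extensions": {"code": type(exc).__name__},
    }


def execute_product_query() -> Dict[str, Any]:
    try:
        # Stand-in for a resolver that hits a failing upstream service.
        raise UpstreamUnavailable("Product service is not reachable")
    except Exception as exc:  # a framework would normally do this for you
        error = to_graphql_error(exc, ["product"])
        return {"data": {"product": None}, "errors": [error]}


print(execute_product_query())
```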
Most clients will pick this response up, realize that there is an error nested within the response, and then act accordingly. Some clients will throw an exception that you have to catch and process, others will give you hints in other ways (but they all usually point you in the right direction).
The only thing you can then do is to at least extract the error message and display it in a popup, toast, banner, or wherever else you have planned to notify users of unsuccessful operations. This is the implicit error handling we talked about earlier (implicit because often, it is even enough to let a general exception handler do something about the error you don’t even care to handle yourself). There is just not much more you can do about it.
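Sketched in code, implicit handling can be as small as this. GraphQLRequestError and unwrap are hypothetical stand-ins for whatever your client library raises and does when it finds an errors array. The point is only that the handler never inspects the cause.

```python
from typing import Any, Dict, List


class GraphQLRequestError(Exception):
    """Hypothetical exception a client library might raise when `errors` is present."""

    def __init__(self, errors: List[Dict[str, Any]]):
        super().__init__("; ".join(e.get("message", "Unknown error") for e in errors))
        self.errors = errors


def unwrap(response: Dict[str, Any]) -> Any:
    # Mimics clients that throw as soon as the errors array is present.
    if response.get("errors"):
        raise GraphQLRequestError(response["errors"])
    return response["data"]


def load_or_banner(response: Dict[str, Any]) -> str:
    try:
        return f"Loaded: {unwrap(response)}"
    except GraphQLRequestError as exc:
        # Generic handler: no inspection of the cause, just tell the user.
        return f"Something went wrong: {exc}. Please try again later."
```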
Handling Business Errors
After you have tackled the easier part of handling errors in GraphQL, it’s time to jump into deeper water: Handling business errors.
Let’s quickly look at a partial definition of business errors again:
They are planned for. They are expected. If they occur, you usually want clients to do something very specific to solve the issue.
In other words: You want to force the client to do something and not just ignore the error.
If you are only a little like me, you might already think about how you can do this: Force clients to do something. And you, like me, will probably realize that GraphQL already has the best way to force clients to do what you want: The schema.
Whatever you put inside the schema becomes the reality of the API: entities, properties, types, and so on. All of that is already being enforced. GraphQL schemas can even be picked up by code generators and transformed into strictly typed code.
Code and types are exactly what you need to solve this mystery. Types in particular can force you to do many things that you would usually (be honest, I’m guilty too!) just ignore.
Unions in many languages, for example, are a great way to state that a type is either this, that, or that, or that, or…wait a second…Unions! That’s the specific answer to the problem of making business errors explicit.
A query with a potential business error means: Your query is either successful or unsuccessful. That’s a union with two possible types: Success and Error. And this is exactly how you can tackle the problem at hand. (Imagine borrowing the concept from Rust’s Result enum).
If any of your queries or mutations can have potential business errors, you can introduce a union to model all potential outcomes of the call. As unions don’t have a hard limit on how many member types they can carry, they are flexible enough to serve the use case.
Let’s look at a basic example of an API schema of an e-commerce system. For now, it only has one query that allows users to fetch a list of products:
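A sketch of what such a schema could look like. The SDL is kept in a Python string so the example stays free of dependencies, and the fields on Product as well as the in-memory resolver are assumptions for illustration.

```python
# Schema sketch in GraphQL SDL, kept as a string so the example is dependency-free.
SCHEMA = """
type Product {
  id: ID!
  name: String!
  category: String!
  price: Float!
}

type Query {
  productsByCategory(category: String!): [Product!]!
}
"""

# A stand-in resolver over in-memory data (all values are made up).
PRODUCTS = [
    {"id": "1", "name": "Espresso beans", "category": "coffee", "price": 12.5},
    {"id": "2", "name": "Green tea", "category": "tea", "price": 8.0},
]


def products_by_category(category: str) -> list:
    # An unknown category yields an empty list, not an error.
    return [p for p in PRODUCTS if p["category"] == category]
```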
The query productsByCategory allows clients to fetch all products by a specific category. This implies that there also might be no products at all for a given category.
If no products are found, the list will simply be empty, which is enough for the client to handle this specific case. But what about some more exotic (and a little constructed) use cases? What if some of these products cannot be sold worldwide? Trade and regional laws often differ, and sometimes, things that you can consume or buy in country A are simply banned in country B.
There are two ways to deal with this constructed case:
- You simply leave the products out of the list
- You tell people that these products are not available for legal reasons
If you go for the first case, you are instantly done and don’t need to spend more thought on this case. If you want to stay with me for a little longer and just assume for a second that the second case also isn’t far from reality, continue to read.
Let’s assume that the second case is exactly what your business wants. You can still give customers a hint that these products exist, and perhaps there are some relatives somewhere else that could potentially order these items and send them over (come on, we’ve all done this already. There is so much US food banned in Europe that is actually quite tasty…). Why should your business throw away this opportunity?
If you want to solve exactly this case, you can make use of a union with two member types:
- A Product
- A geo-restricted Product
The first one is the usual entity, while the second one can have other properties (but doesn’t necessarily need to). The second entity type, however, forces any client to handle this case right within its code, either through a simple if-statement, an instanceof check, or something similar.
Applying this to the schema, you end up with a schema that looks something like this:
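One possible shape, again as SDL inside a Python string. The fields reason and availableIn on GeoRestrictedProduct are assumptions for this sketch. The text only requires that the two union members exist.

```python
# Schema sketch in GraphQL SDL; field names beyond the type names are assumptions.
SCHEMA = """
type Product {
  id: ID!
  name: String!
  price: Float!
}

type GeoRestrictedProduct {
  id: ID!
  name: String!
  reason: String!
  availableIn: [String!]!
}

union ProductResult = Product | GeoRestrictedProduct

type Query {
  productsByCategory(category: String!): [ProductResult!]!
}
"""

print(SCHEMA)
```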
Instead of directly nesting a Product within the list, productsByCategory now returns a list of ProductResults. The latter is a union that has two members: Product and GeoRestrictedProduct.
If a product is geo-restricted (and ignoring that some laws might prevent us from doing this…damn you legislature!), a client can explicitly handle this case and additionally render why a customer can’t order this product and in which other countries it is still available.
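On the client side, this explicit handling could look roughly like the sketch below. It assumes the client asks for __typename (a standard GraphQL meta field) to tell the union members apart, and that GeoRestrictedProduct carries hypothetical reason and availableIn fields.

```python
from typing import Any, Dict

# The query a client would send; `__typename` tells the union members apart.
QUERY = """
query Products($category: String!) {
  productsByCategory(category: $category) {
    __typename
    ... on Product { id name price }
    ... on GeoRestrictedProduct { id name reason availableIn }
  }
}
"""


def render(item: Dict[str, Any]) -> str:
    kind = item["__typename"]
    if kind == "Product":
        return f"{item['name']}: {item['price']:.2f}"
    if kind == "GeoRestrictedProduct":
        countries = ", ".join(item["availableIn"])
        return f"{item['name']}: not available here ({item['reason']}); available in {countries}"
    # Failing loudly makes a newly added union member impossible to overlook.
    raise ValueError(f"Unhandled ProductResult member: {kind}")
```

In a statically typed client generated from the schema, the final branch would typically become a compile-time exhaustiveness check instead of a runtime error.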
You can generally apply this pattern to all other use cases that potentially come to your mind. But we can do one more example to give you a better idea of how broadly applicable this pattern is.
Let’s look at Medium, the popular blogging platform. It allows you to post articles and additionally monetize them (if you sign up for it). This presents two potential cases to users:
- A user is a subscriber and can pass the paywall
- A user is not a subscriber and can thus only read an excerpt, with a call to action to subscribe (let’s ignore the two/three articles per month free thing here but we all use incognito tabs nevertheless, okay?)
If you want to model this in a GraphQL schema, you can come up with something close to the following:
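A sketch of such a schema, as SDL inside a Python string. The individual fields and the union name ArticleResult are assumptions for illustration. Only Article and RestrictedArticle (with its excerpt) come from the description.

```python
# Schema sketch in GraphQL SDL; field names and the union name are assumptions.
SCHEMA = """
type Article {
  id: ID!
  title: String!
  excerpt: String!
  content: String!
}

type RestrictedArticle {
  id: ID!
  title: String!
  excerpt: String!
}

union ArticleResult = Article | RestrictedArticle

type Query {
  article(id: ID!): ArticleResult
}
"""

print(SCHEMA)
```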
When a user requests an article, the API can check whether they are logged in and subscribed to Medium. If so, the API returns an Article and the user can go on reading. If not, a RestrictedArticle is returned that only has an excerpt in it.
Any client is forced to handle the case of a RestrictedArticle without having to check the status of the user itself. Everything is left to the backend to decide. In the case of Medium’s main page, the user is presented with the usual view of an excerpt of an article with a call to action to upgrade if they want to continue reading.
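The division of labor described above can be sketched with plain Python types. The class fields, resolve_article, and render are hypothetical names for this example. The backend picks the union member, and the client only reacts to the type it receives.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Article:
    id: str
    title: str
    excerpt: str
    content: str


@dataclass
class RestrictedArticle:
    id: str
    title: str
    excerpt: str


ArticleResult = Union[Article, RestrictedArticle]


def resolve_article(article: Article, viewer_is_subscriber: bool) -> ArticleResult:
    # The backend alone decides; clients never check the subscription themselves.
    if viewer_is_subscriber:
        return article
    return RestrictedArticle(article.id, article.title, article.excerpt)


def render(result: ArticleResult) -> str:
    # The client branches on the type it got, not on the user's status.
    if isinstance(result, Article):
        return result.content
    return f"{result.excerpt} [Subscribe to keep reading]"
```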
This is the power of this pattern. It allows you to force clients to do something explicitly. Even if someone else decides to implement their own client for your API, you can still dictate to them what to do in certain circumstances without having to directly communicate with them. It can all be done through the schema of your API.
With this issue solved, there is only one border left to conquer. Something for companies that take GraphQL very seriously and spend a lot of money on flexibility, technical toys (and sometimes nightmares): Federated GraphQL.
Handling Business Errors In Federated GraphQL APIs
Apollo Federation is an addition to the GraphQL specification. Its idea is to allow multiple GraphQL backend services to be composed into a single API surface, without the need to implement an API monolith.
To make this happen, Apollo Federation extends the GraphQL specification and adds new directives that are used by a central service, called the Router (formerly the Gateway), to derive a query plan for incoming requests.
In federated graphs, entities can be shared and referenced among individual GraphQL services (called subgraphs). It’s easy for one subgraph to add a property pointing to an entity that is served by another subgraph because the Router takes care of fetching the data before a unified response is sent back to a client.
This system, however, has limits. One of these limits is that a union cannot act as an entity that other subgraphs reference. The reason is simply that a union cannot be uniquely identified by a type and an id.
You can still make this pattern work in such a federated graph by extending it and adding a layer of indirection, which does not look very beautiful (to be honest), but does its job well.
Let’s first look at a very simple API for a blog. It serves articles and authors through one, unified API, and its schema looks as follows:
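A sketch of the combined schema, as SDL inside a Python string. The fields are assumptions for illustration, and federation directives are deliberately left out to keep the picture small.

```python
# Combined view of two subgraphs; federation directives are left out for brevity.
SCHEMA = """
type Article {
  id: ID!
  title: String!
  content: String!
  author: Author!
}

type Author {
  id: ID!
  name: String!
  articles: [Article!]!
}

type Query {
  article(id: ID!): Article
  author(id: ID!): Author
}
"""

print(SCHEMA)
```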
Each entity (Article and Author) is served by a different subgraph and defined in its schema. When both schemas come together, the result is what you see above (ignoring the fact that federated API schemas usually have a lot of boilerplate code, like directives and such, inside them).
Let’s now go back to the paywall example from before (because we all want to earn money somehow, don’t we? I’m not judging you! Content creation is a serious business and takes more time than many of us would like to admit.).
Let’s also assume we want the following to happen:
- Authors should always be viewable publicly (Who puts authors behind paywalls anyway?!)
- Articles can either be behind the paywall or free
An issue now arises from the fact that we (as the good engineers we are) have modeled a real graph where both sides of an entity relationship reference the other side (Article references Author, and Author references multiple Articles, so you can navigate both ways).
Applying our previous solution would result in Article becoming a union called something like ArticleResult. But unions can’t be referenced by other subgraphs. The Author subgraph would lose its reference, which is not something you want. So, what now?
You can add a (relatively ugly but necessary) indirection into the schema to make the whole system work again. Even if a union cannot be referenced by other subgraphs, a wrapper entity that contains a property pointing to a union can (this sounds more complex than it actually is). If the Author subgraph references that entity, a client can then unpack the nested result union (with more code), but is still forced to handle all possible cases.
Let’s look at the resulting schema, so you better understand where this is going:
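A simplified, combined sketch of that schema, as SDL inside a Python string. The field name result on the wrapper and the individual fields are assumptions for illustration. The @key directive is the one Apollo Federation uses to mark entities, and the remaining subgraph boilerplate is omitted.

```python
# Simplified combined view; only the @key directives relevant to the pattern are shown.
SCHEMA = """
type ArticleWrapper @key(fields: "id") {
  id: ID!
  result: ArticleResult!
}

union ArticleResult = Article | RestrictedArticle

type Article {
  id: ID!
  title: String!
  content: String!
  author: Author!
}

type RestrictedArticle {
  id: ID!
  title: String!
  excerpt: String!
}

type Author @key(fields: "id") {
  id: ID!
  name: String!
  articles: [ArticleWrapper!]!
}
"""

print(SCHEMA)
```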
Article has lost its @key directive. This is because it no longer needs it. Only entities that can be referenced by other subgraphs need such a directive. Instead, ArticleWrapper is now the entity that can be referenced by other subgraphs. Nested within it is the union you already know, and within that union, both possible types are present. (I told you it wouldn’t be beautiful.) An Author now references an ArticleWrapper instead of an Article.
This small workaround allows you to keep all the benefits of this pattern while enabling the use right within a federated schema. Clients are still forced to explicitly handle scenarios you want them to handle, and you only pay a small (depending on how you view it) price with one further indirection (which bloats code a little, but nothing a dedicated API lib cannot solve).
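What the extra indirection costs a client can be sketched as follows. The response shape assumes a wrapper field named result and a selection of __typename, both of which are assumptions for this example.

```python
from typing import Any, Dict, List


def article_teasers(author: Dict[str, Any]) -> List[str]:
    teasers = []
    for wrapper in author["articles"]:
        # One extra hop: wrapper -> result -> concrete union member.
        result = wrapper["result"]
        kind = result["__typename"]
        if kind == "Article":
            teasers.append(result["title"])
        elif kind == "RestrictedArticle":
            teasers.append(f"{result['title']} (subscribers only)")
        else:
            raise ValueError(f"Unhandled ArticleResult member: {kind}")
    return teasers


author = {
    "name": "Jane Doe",
    "articles": [
        {"id": "1", "result": {"__typename": "Article", "title": "Free post"}},
        {"id": "2", "result": {"__typename": "RestrictedArticle", "title": "Paid post"}},
    ],
}
print(article_teasers(author))
```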
Summary
Writing good GraphQL APIs is no easy task, and handling errors puts an additional burden on you. But there are ways to handle errors that benefit both you as the API’s creator and your users (or colleagues) who implement clients for it.
By first classifying errors into two categories, technical and business errors, you open up a way to handle each category as best as possible.
Technical errors are errors that you usually don’t want to happen, but they happen nevertheless. You can’t circumvent them but still somehow need to deal with them. Their impact, however, is often larger because they hinder people from progressing further in your software. Still, you can often only handle them implicitly by catching and presenting some form of error message to your users.
Business errors are errors you expect to happen because they belong to the design of the flow of your API or software. Geo-blocking and paywalls are typical examples of these kinds of errors. You want to handle them explicitly because their occurrence usually leads to a different flow in your application. You also often want to force users of your API to react to them without too obvious ways to circumvent handling these cases.
In the case of technical errors, you can use the errors array within the GraphQL response to give general directions on what happened and leave a message for users to read. General catch blocks can then pick up the errors forwarded by GraphQL clients and do something with them.
In case of business errors, you can use techniques to design your schema in a way that forces anyone who uses your API to take care of these kinds of errors when they occur. You can achieve this by using unions that model results, which are types that can be either this or that or even more.
In federated environments, you have to put in a little more work to make the pattern of presenting business errors feasible. This is achieved by adding an indirection and a further abstraction in the form of a wrapper that then contains the already-known union type that models the actual response. This ensures that entities can still be referenced by other subgraphs without further technical limitations.