Handling errors in a Go based microservice
Table of contents
In one of my application there was a total mess with error handling. Any wrong situation was treated as an internal error, so I could not set the alerting right. That's why every time there was an alert, I went to investigate, and it usually turned out to be some request for a non-existing ID or an inconsistent request body.
A more clever solution was needed.
The error handling in Go is nothing like in classic C-like languages we all used to. Here are the key differences:
- There is no support for throwing or catching errors. Functions return errors as simple values.
- There is a concept of error wrapping and error chaining: an error coming from an underlying function can be wrapped to add more context to it.
- Every function typically returns the generic error interface, and the calling code at the very top should decide what to do if the error isn't nil.
The standard error is merely an interface, so it opens up endless possibilities for custom error creation, as the only constraint would be to implement the Error() method.
Helpful Do-s and Dont-s regarding errors can be found in the Uber Go Style Guide.
Historically speaking, there are two patterns of handling errors in Go: each error is a different structure versus having one structure for the error, and distinguish the nature by code.
// Different structures: each structure can have it's own methods and// can be distinguished using errors.Is() and errors.As()type InternalError struct {}type NotFoundError struct {}...// Code-based approach: one structure, can be detected using errors.As() and// then the code tells about the naturetype Code stringconst (CodeInternal Code = "internal"CodeNotFound Code = "not_found"...)type Error struct {Code Code}
I've seen both of them in action, and it's safe to say the first approach is more canonical, but also implies broader boilerplate and can be useful for large projects. The second approach is more concise, but can lack some flexibility.
✅ I've decided to go ahead with the code-based approach and flavour it with an ability to attach fields, to compensate the limitations.
As I've mentioned above, there is a concept of error wrapping. What it means is every time an error comes from a called function, we immediately wrap it to add more context to that error.
func doSomething() error {return errors.New("not enough minerals")}func doWhatnot() error {err := doSomething()if err != nil {return wrapError(err, "could not call doSomething()")}return nil}func main() {err := doWhatnot()if err != nil {log(err.Error())}}
If we try to call the Error() method of the error at the very top level, the result would be could not call doSomething(): not enough minerals. As you can see, the error messages were combined together to provide a detailed information about the nature of the error.
The pro-s of the wrapping:
- Different types of error can be "chained" together, and there are ways to search for a particular error type in the chain using functions such as:
- errors.Is() - returns true if there is an error in the chain of errors that is an instance of a certain structure
- errors.As() - almost the same as Is(), but also casts the generic error to a target structure and populates it
- The message is quite informative, and serves as a good replacement for a traditional stack trace: just follow the chain to find the reason.
However, there are also downsides:
- Heftier and much more complex boilerplate: you don't just return errors, you have to wrap them. Error handling in Golang isn't easy by itself, and wrapping adds even more overhead.
- In big projects backtracking the error with having only the message at your disposal may be not that straightforward.
- Deeply nested function calls may produce long, sometimes repetitive and tiresome error messages you hardly want to show to your end user.
There is an alternative way: avoid wrapping and make use of the call stack trace.
func doSomething() error {return errors.New("not enough minerals")}func doWhatnot() error {err := doSomething()if err != nil {return err}return nil}func main() {err := doWhatnot()if err != nil {log(err.Error(), GetStackTrace(err))}}
Here we return an error as it was, but the high level code, such as logger, tries to extract the call stack trace from the error (errors.New() will produce the call stack trace) and logs the trace next to the error.
Advantages:
- Leaner boilerplate for sure.
- The precise location of the error the call stack trace will show.
Disadvantages:
- The error messages are very low level, technical and cryptic, and may even expose sensitive information. You would probably prefer to avoid showing those to your user as well, afraid to cause a heart attack to a non-technical person who won't understand such mumbo-jumbo.
- Not all errors produced by 3rd party libraries are traceable, and there is a high chance to end up without both the stack and the nested error message.
In our project we heavily rely on two main pillars of error handling:
- Built-in errors
- Has both errors.Is() and errors.As()
- Allows wrapping through fmt.Errorf("%w", err)
- Allows wrapping of multiple errors, effectively turning the error chain into a tree
- Famous and respected pkg errors
- Has both errors.Is() and errors.As()
- Allows wrapping through both errors.Wrap() and errors.Errorf()
- Records the call stack trace
Essentially, these two are sort of competing features offering more or less the same kind of functionality. However, since the built-in errors are built-in, it's recommended to stick to them instead, unless the call stack is needed.
✅ To take the best from the two worlds, I've decided to go ahead with wrapping-based approach, but also allow capturing the call stack trace.
So, here I've outlined the requirements for the solution:
- The custom error type must implement the error interface.
- To support error message chaining and both errors.Is() and errors.As(), errors should have the Unwrap() method implemented.
- An error should have a type (internal, bad input, not found), so it can become distinguishable at the top level in order to send distinct HTTP codes.
- An error should provide extensive information about the nature of the issue, including full stack trace.
- There must be a way to attach additional information to an error. In order to achieve that, error fields were added.
- Wrapping of multiple errors is not needed.
Here is what the logger must have:
- The logger must support structured logging, so it generates JSON-formatted output that Grafana understands.
- It must accept additional information in a form of fields.
- It must also support the operation id. The id, once chosen, can combine several log entries into a group, so the course of the entire operation could be easily grasped by filtering by that id.
- The error type should be also used to define the log level (error, warning, info) in order to fix the metrics.
- There must be a middleware that recovers from panics and coverts them to internal errors, to be processed down the line.
...the whole approach taken seems quite promising, the only concern is, as always, performance. I wouldn't use it for a highly loaded system, sticking to standard plain error instead. However, as I don't have a highly loaded system here, let's go full steam ahead with our custom errors!
Here is the custom error implementation I've introduced to cover the mentioned requirements.
package syserrimport ("errors""fmt""strings"pkgError "github.com/pkg/errors")type Code stringconst (InternalCode Code = "internal"BadInputCode Code = "bad_input"NotFoundCode Code = "not_found"NotImplementedCode Code = "not_implemented")type Error struct {Message stringcode CodeStack []*ErrorStackItemfields []*FieldWrappedError error}type ErrorStackItem struct {File stringLine stringFunction string}type Field struct {Key stringValue any}func F(key string, value any) *Field {return &Field{Key: key,Value: value,}}func NewInternal(message string, fields ...*Field) *Error {return New(InternalCode, message, fields...)}func NewBadInput(message string, fields ...*Field) *Error {return New(BadInputCode, message, fields...)}func NewNotFound(message string, fields ...*Field) *Error {return New(NotFoundCode, message, fields...)}func New(code Code, message string, fields ...*Field) *Error {stack := extractStackFromGenericError(pkgError.New(""))return &Error{Message: message,code: code,fields: fields,Stack: stack,}}func Wrap(err error, code Code, message string, fields ...*Field) *Error {newError := New(code, message, fields...)newError.WrappedError = errreturn newError}func WrapAsIs(err error, message string, fields ...*Field) *Error {newError := New(GetCode(err), message, fields...)newError.WrappedError = errreturn newError}func (e *Error) Unwrap() error {return e.WrappedError}func (e *Error) Error() string {if e.WrappedError != nil {return fmt.Sprintf("%s: %s", e.Message, e.WrappedError.Error())}return e.Message}func (e *Error) GetCode() Code {return e.code}func (e *Error) GetStack() []*ErrorStackItem {return e.Stack}func (e *Error) GetFields() []*Field {return e.fields}type stackTracer interface {StackTrace() pkgError.StackTrace}func (e *Error) GetStackFormatted() []string {return formatStack(e.Stack)}func GetStackFormatted(err error) []string {var sysErr *Errorok := errors.As(err, &sysErr)if ok {return sysErr.GetStackFormatted()}return formatStack(extractStackFromGenericError(err))}func GetCode(err error) Code {if err == nil {return InternalCode}for {if sErr, ok := err.(*Error); ok {return sErr.GetCode()}switch x := err.(type) {case interface{ Unwrap() error }:err = x.Unwrap()if err == nil {return InternalCode}default:return InternalCode}}}func GetFields(err error) []*Field {var result []*Fieldfor {if err == nil {return result}if sErr, ok := err.(*Error); ok {result = append(result, sErr.GetFields()...)}switch x := err.(type) {case interface{ Unwrap() error }:err = x.Unwrap()default:return result}}}func formatStack(stack []*ErrorStackItem) []string {result := make([]string, len(stack))for index, stackItem := range stack {result[index] = fmt.Sprintf("%s:%s %s", stackItem.File, stackItem.Line, stackItem.Function)}return result}func extractStackFromGenericError(err error) []*ErrorStackItem {stackTrace := extractStackTraceFromGenericError(err)result := make([]*ErrorStackItem, len(stackTrace))for index, frame := range stackTrace {result[index] = &ErrorStackItem{File: getFrameFilePath(frame),Line: fmt.Sprintf("%d", frame),Function: fmt.Sprintf("%s", frame),}}return result}func extractStackTraceFromGenericError(err error) pkgError.StackTrace {var result pkgError.StackTracevar traceableError stackTracerok := errors.As(err, &traceableError)if ok {result = traceableError.StackTrace()}return result}func getFrameFilePath(frame pkgError.Frame) string {frameString := strings.Split(fmt.Sprintf("%+s", frame), "\n\t")return frameString[1]}
Let me break it down. So,
- There is a custom error that holds a message, a code, a stack and a bunch of fields.
- There are three factory functions to create Internal, BadInput or NotFound types of errors. It also accepts optional fields.
- There is a special factory function Wrap() that wraps any error with a system error.
- The call stack is captured using the errors package.
- Thee are three functions that take an error and try to extract code, fields and call stack by walking down the chain of errors:
- GetCode()
- GetFields()
- GetStackFormatted()
- The Error() method walks down the error chain recursively to build a combined message, that contains all messages from all the errors in the chain, delimited with ": ".
- I don't implement functions such as Errorf() or Wrapf(), since I believe the message should always be a simple constant string without any inclusions of data. The additional data, if any, should be provided through fields. This approach also makes the localization of error messages way easier.
In many similar articles I've seen it was advised for the error structure to keep the HTTP code right from the start. I don't think this is a right approach, because the application can be exposed not only via REST, but also through gRPC or Kafka. Even the piece of code itself, the error is returned from, can be part of some microservice running on the background as a cron job. That's why the error code must be abstracted from the concrete transport or environment, and then essentially mapped into specific codes on the way out.
As I've mentioned above, the operation id must be introduced. Since it's merely a simple string, it could be simply put to the context and scoped around a certain operation.
package ctximport "context"type (operationIDKey string)const (OperationIDKey operationIDKey = "OperationID")func WithOperationID(ctx context.Context, operationID string) context.Context {if operationID == "" {return ctx}return context.WithValue(ctx, OperationIDKey, operationID)}func GetOperationID(ctx context.Context) string {value := ctx.Value(OperationIDKey)if value != nil {return value.(string)}return ""}
Another key part is the logger. I've decided to put the log/slog logger into an additional wrapper in order to de-couple the rest of the code from the specific logger implementation.
package loggerimport ("context""log/slog""os"pkgCtx "internal/ctx"syserr "internal/syserr")var (logger = slog.New(slog.NewJSONHandler(os.Stdout, nil)))type Field struct {key stringvalue any}func F(key string, value any) *Field {return &Field{key: key,value: value,}}func Warning(ctx context.Context, message string, fields ...*Field) {logger.Warn(message, convertFields(addOperationID(ctx, fields))...)}func Error(ctx context.Context, message string, fields ...*Field) {logger.Error(message, convertFields(addOperationID(ctx, fields))...)}func Info(ctx context.Context, message string, fields ...*Field) {logger.Info(message, convertFields(addOperationID(ctx, fields))...)}var (errorToLogLevel = map[syserr.Code]func(ctx context.Context, message string, fields ...*Field){syserr.InternalCode: Error,syserr.BadInputCode: Warning,syserr.NotFoundCode: Warning,syserr.NotImplementedCode: Error,})func LogError(ctx context.Context, err error, fields ...*logger.Field) {code := syserr.GetCode(err)fn := errorToLogLevel[code]if fn == nil {fn = logger.Error}fields = append(fields, convertErrorFieldsToLoggerFields(syserr.GetFields(err))...)fields = append(fields, logger.F("stack", syserr.GetStackFormatted(err)))fn(ctx, err.Error(), fields...)}func convertErrorFieldsToLoggerFields(fields []*syserr.Field) []*Field {result := make([]*Field, len(fields))for index, field := range fields {result[index] = F(field.Key, field.Value)}return result}func addOperationID(ctx context.Context, fields []*Field) []*Field {if ctx == nil {return fields}operationID := pkgCtx.GetOperationID(ctx)if operationID != "" {return append(fields, F("operation_id", pkgCtx.GetOperationID(ctx)))}return fields}func convertFields(fields []*Field) []any {result := make([]any, 2*len(fields))index := 0for _, field := range fields {result[index] = field.keyresult[index+1] = field.valueindex += 2}return result}
Essentially,
- there are three functions for each of the log levels: info, warn, error, and they basically just wrap corresponding functions of the log/slog logger,
- the logger also checks the context for the presence of the operation id. If found, it is appended to the list of fields.
- there is a function that logs errors by mapping codes to certain log level.
If we assume, that there is a certain function that executes a certain operation, then the error handling may look like this:
...func executeOperation(ctx context.Context) (err error) {opCtx := ctxUtil.WithOperationID(ctx, uuid.New().String())defer func(){if err != nil {logger.LogError(opCtx, syserr.WrapAsIs(err, "could not execute operation"))}}()localErr := doStepOne()if localErr != nil {return syserr.WrapAsIs(localErr, "could not execute step one")}logger.Info(opCtx, "step one was done")localErr = doStepTwo()if localErr != nil {return syserr.WrapAsIs(localErr, "could not execute step two")}logger.Info(opCtx, "step two was done")return doSomethingAtTheEnd()}
All log entries will share the same operation_id, and thus can be easily extracted from the rest of the logs.
Each HTTP request to a server may be treated as an operation. In order to enable the same functionality for a route, a set of middlewares is needed.
package utilimport ("context""fmt""net/http""github.com/google/uuid"ctxUtil "internal/ctx""internal/logger""internal/syserr")type Handler func(http.ResponseWriter, *http.Request) errorfunc WithLogger(next types.Handler) types.Handler {return func(w http.ResponseWriter, r *http.Request) error {operationID := uuid.New().String()ctx := ctxUtil.WithOperationID(r.Context(), operationID)r = r.WithContext(ctx)w.Header().Set("X-Operation-ID", operationID)err := next(w, r)fields := make([]*logger.Field, 1)fields[0] = logger.F("query", fmt.Sprintf("%s %s", r.Method, r.RequestURI))ctx = r.Context()if err != nil {logger.LogError(ctx, errors.WrapAsIs(err, "request handles with errors"), fields...)} else {logger.Info(ctx, "request handled", fields...)}return err}}var (errorToHTTPError = map[syserr.Code]int{syserr.InternalCode: http.StatusInternalServerError,syserr.BadInputCode: http.StatusBadRequest,syserr.NotFoundCode: http.StatusBadRequest,syserr.NotImplementedCode: http.StatusNotImplemented,})type ErrorResponse struct {Error string `json:"error"`}func WithErrorHandler(next types.Handler) types.Handler {return func(w http.ResponseWriter, r *http.Request) error {err := next(w, r)if err != nil {code := syserr.GetCode(err)httpCode := errorToHTTPError[code]if httpCode == 0 {httpCode = http.StatusInternalServerError}response := ErrorResponse{Error: err.Error(),}if code == syserr.InternalCode {response.Error = "internal error occurred"}writeErr := httpUtil.EncodeJSONResponse(response, httpCode, w)if writeErr != nil {logger.Error(r.Context(), "could not write the error response")}}return err}}
Here, the logger middleware sets the operation id, and also sends it back to the client via a header. The error handler middleware hides the message for an internal error, to prevent the user from having a heart attack. However, since the operation id is known in the response, it can be used to search the logs for a matching entry that gives full disclosure about the nature of the error.
The middleware should wrap every endpoint as shown below:
...type Route struct {Method stringPattern stringHandlerFunc middlewares.Handler}type Routable interface {GetRoutes() map[string]Route}func PopulateRouter(router *mux.Router, routables ...Routable) {for _, routable := range routables {routes := routable.GetRoutes()for _, route := range routes {router.HandleFunc(route.Pattern, func(w http.ResponseWriter, r *http.Request) {handler := WithLogger(WithErrorHandler(route.HandlerFunc))_ = handler(w, r)}).Methods(route.Method, http.MethodOptions)}}}
{"time":"2024-02-27T00:13:16.485015+01:00","level":"INFO","msg":"trying to execute the GetBooks() method","operation_id":"2bdc993d-5686-47e2-9c6a-16ff12bbeb03"}{"time":"2024-02-27T00:13:16.485465+01:00","level":"WARN","msg":"request handled with error: could not execute GetBooks(): could not retrieve book count","method":"POST","url":"/books?","status":400,"term":"","foo":"bar","stack":["/Users/mustermann/proj/pocs-demos-katas/golang/logging-error-handling/internal/syserr/syserr.go:44 syserr.go","/Users/mustermann/proj/pocs-demos-katas/golang/logging-error-handling/internal/syserr/syserr.go:36 syserr.go","/Users/mustermann/proj/pocs-demos-katas/golang/logging-error-handling/internal/repository/book/book.go:31 book.go","/Users/mustermann/proj/pocs-demos-katas/golang/logging-error-handling/internal/service/book/book.go:29 book.go","/Users/mustermann/proj/pocs-demos-katas/golang/logging-error-handling/internal/controller/book/book.go:33 book.go","/Users/mustermann/proj/pocs-demos-katas/golang/logging-error-handling/internal/http/middleware.go:23 middleware.go","/Users/mustermann/.gvm/gos/go1.21/src/net/http/server.go:2136 server.go","/Users/mustermann/.gvm/gos/go1.21/src/net/http/server.go:2514 server.go","/Users/mustermann/.gvm/gos/go1.21/src/net/http/server.go:2938 server.go","/Users/mustermann/.gvm/gos/go1.21/src/net/http/server.go:2009 server.go","/Users/mustermann/.gvm/gos/go1.21/src/runtime/asm_amd64.s:1650 asm_amd64.s"],"operation_id":"2bdc993d-5686-47e2-9c6a-16ff12bbeb03"}
As you can see, there are two calls of the logger function, the first one is just unobtrusive debug, and the second records an error happened during the endpoint execution, caused by an illegal input (hence the 400 HTTP status). Both of the calls provide the operation id, and they also expose the fields defined when the error was initially created.
That's all for today, folks!
Sergei Gannochenko
Golang, React, TypeScript, Docker, AWS, Jamstack.
20+ years in dev.