
Distributed tracing with OpenTelemetry in your Go/Python microservices

Learn how to troubleshoot problems with microservices systems and walk through a real-life example

What is distributed tracing?

Imagine you have a system made of 100+ microservices, communicating with different data sources and external systems. Everything works great for a year after the service launch. But suddenly, one day, the system becomes unreliable: it stops sending responses for some requests, or starts getting slower. What would you do to address it and understand what’s going wrong in the system? Distributed tracing deals with this kind of problem very well. We use it to troubleshoot systems at Adevinta and keep things running smoothly.

A brief introduction

Distributed tracing is a method of tracking how requests flow inside your system. This can help you to highlight every suboptimal or broken part of the system by visualising the full path of requests.

Main benefits

  • Helps to spot system reliability issues by highlighting degraded parts of the system
  • Helps to define and maintain SLAs

Distributed tracing lets us know exactly how long each step in the system takes to execute. This enables us to evaluate whether we are complying with our SLAs, and it can also help to define or adjust SLAs for new features.

Main issues

  • Human factors during manual instrumentation of services in a system

As services have to be instrumented manually, there is a chance that a developer misses some parts of the system or doesn’t instrument them properly. This can lead to missing traces and a mess of spans in the visualisation tool.

OpenTelemetry allows us to instrument code automatically, but that doesn’t always bring the benefits you expect. In most cases, the flexibility provided by automatic instrumentation libraries is not enough for real applications.

Example of a tracing diagram

tracing diagram

Trace diagrams show how requests flow through the system, which parts of the system are involved and how long they take to complete their work.

In the picture above we can see a trace.

Trace — a series of causally related events that represents the flow of a request through a system

It shows that the whole request took 500 ms to complete and that the request visited two services, A and B, inside the A₀ span.

Span — the basic unit of work; it represents a single event in our system

Three other events were executed: A₁, A₂ and B₀ (service B was called from unit A₂). Then unit A₃ was executed, taking 250 ms to complete.
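To make the relationship between a trace and its spans more concrete, here is a minimal sketch using the OpenTelemetry Python API (introduced in the next section). The unit names are hypothetical and only mirror the diagram, and a configured tracer provider (shown later in the article) is assumed:

from opentelemetry import trace

# assumes a tracer provider has already been configured (see the provider sections below)
tracer = trace.get_tracer("example")

# the outer span plays the role of A0 in the diagram
with tracer.start_as_current_span("A0: handle request"):
    with tracer.start_as_current_span("A1: validate input"):
        ...  # work of unit A1
    with tracer.start_as_current_span("A2: call service B"):
        # B0 would normally be started inside service B and linked via context propagation;
        # it is nested directly here only for illustration
        with tracer.start_as_current_span("B0: fetch data"):
            ...
    with tracer.start_as_current_span("A3: render response"):
        ...  # the remaining work of the request

Each `with` block opens a span when it is entered and ends it when it exits, so the nesting of the blocks is exactly what produces the parent/child structure you see in the diagram.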

Now let’s look at something that helps integrate distributed tracing into our systems.

OpenTelemetry

OpenTelemetry is a collection of tools, APIs, and SDKs that you can use to instrument, generate, collect and export telemetry data (metrics, logs, and traces, known as the three pillars of observability) in a vendor-agnostic way.

OpenTelemetry provides you with:

  • A single, vendor-agnostic instrumentation library per language with support for both automatic and manual instrumentation
  • A single, vendor-neutral collector binary that can be deployed in a variety of ways
  • An end-to-end implementation to generate, emit, collect, process, and export telemetry data
  • Full control of your data with the ability to send data to multiple destinations in parallel through configuration
  • Open-standard semantic conventions to ensure vendor-agnostic data collection
  • The ability to support multiple context propagation formats in parallel to assist with migrating as standards evolve

How to start using distributed tracing with OpenTelemetry in your system

There are three steps you should complete to adopt distributed tracing in your system:

  1. Instrument your code
  2. Collect and export tracing data
  3. Analyse and visualise it

The first two points are covered by OpenTelemetry, and the third one is solved by observability backends.

In this article, we will use two different observability backends — Grafana Tempo and Jaeger. There is a big difference between them, specifically, in the guarantees they offer. We won’t get into discussing observability backends in depth in this article, but if you are curious, I can recommend these resources about head-based (Jaeger) sampling and tail-based (Tempo) sampling.

Real-life example

To give you a better understanding of distributed tracing and OpenTelemetry concepts, I created a system of two microservices and instrumented them. Now I will show you, step by step, how to instrument your Python/Go applications.

The system walkthrough

Imagine you have a Medium-like application, which consists of two microservices: the first service is responsible for showing articles to users (articles-microservice), and the second service recommends articles similar to the one the user is currently reading (suggestions-microservice).

The system diagram

The source code and startup guide can be found in the repo README.

How data flows in the system

When you make a request to articles-microservice, it retrieves the requested data from memory (the data is loaded at startup from mocks/articles.json) and returns a list of articles.

Data flow

It’s also possible to add the with_suggested parameter to the request URL query. For every article in the list, articles-microservice will query suggestions-microservice, which will retrieve a list of suggested article IDs from its memory (the data is loaded at startup from mocks/suggestions.json). articles-microservice will then enrich these IDs with article data and return the combined result to the user.

Data flow

The project structure

├── README.md
├── apps // contains microservices applications
│   ├── articles // contains the main microservice, written in Go
│       ...
│   └── suggestions // folder contains auxiliary microservice, written in Python
│       ...
├── grafana // preconfigured dockerized Grafana stack for running the Tempo tracing backend
│   ├── docker-compose.yaml
│   ├── grafana-datasources.yaml
│   ├── prometheus.yaml
│   └── tempo.yaml
├── jaeger //preconfigured dockerized Jaeger tracing backend
│   └── docker-compose.yaml
├── requirements // project-level python requirements, mainly used by generate_mocks.py script
│   └── requirements.txt
└── scripts // contains scripts that help to automate things in the project; for now, there is only one script that generates mock data, plus its wrapper to simplify usage
    ├── generate_mocks
    └── generate_mocks.py

How to run the project

The first thing you should do is generate mock data using the following command:


export number_of_articles=10 
export number_of_suggestions_per_article=5

scripts/generate_mocks ${number_of_articles} ${number_of_suggestions_per_article}

Generated files are our data sources for both microservices. I decided not to use a database in order to keep the project simple.

The next step is to run the microservices and tracing backend. As everything is dockerized for your convenience, you can run the whole stack with this command:

docker compose up -d && docker compose -f jaeger/docker-compose.yaml up -d

That’s it — the project is ready to work with.

Article microservice teardown (Golang application)

For our main microservice, I chose Go and the Go kit framework. The project follows the standard structure enforced by Go kit, so it will be familiar to most readers who have experience building microservices in Go. This framework is widely used in the community.

The microservice folders structure

apps/articles
├── Dockerfile
├── articles // all internal app code resides here
│   ├── endpoint.go
│   ├── errors.go
│   ├── models.go
│   ├── service.go
│   └── transport.go
├── go.mod
├── go.sum
├── main.go // entry file
├── mocks
│   └── articles.json // generated mock file, our datasource
└── tracing // main tracing code is in this folder
    ├── middleware.go
    └── provider.go

The service can do several things:

  1. Show us a list of articles
  2. Show detailed info for an article by id
  3. If we ask, it will enrich article objects with suggested articles from suggestions-microservice

How the service is instrumented

A huge part of the logic resides inside the /tracing folder:

provider.go – responsible for the initialisation of the tracing provider

middleware.go – middleware which registers the first span in the microservice on every request

Things worth mentioning from provider.go

// we create an OTLP exporter with a tracing backend endpoint
exporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient(
	otlptracegrpc.WithEndpoint(exporterURL),
	otlptracegrpc.WithInsecure(), // the connection doesn't need to be secured in this demo project
))
if err != nil {
	return nil, err
}

// set up the resource, which describes the entity producing the telemetry: its identifying information and metadata
resource := resource.NewWithAttributes(
	semconv.SchemaURL,
	semconv.ServiceNameKey.String(serviceName),
	semconv.ServiceVersionKey.String("0.0.1"),
)

// AlwaysSample returns a Sampler that samples every trace; for testing purposes we want to sample everything, but in a production system with a high number of requests this can be expensive
sampler := trace.AlwaysSample()
	
// create new tracing provider from the components above
tracerProvider := trace.NewTracerProvider(
	trace.WithSampler(sampler),
	trace.WithBatcher(exporter),
	trace.WithResource(resource),
)


// set the provider as a global, so all components of the service can access it
otel.SetTracerProvider(tracerProvider)

// set the global propagator to tracecontext; it helps us wire together services that call each other
otel.SetTextMapPropagator(propagation.TraceContext{})

middleware.go teardown

func TracingMiddleware(next http.Handler, tracer trace.Tracer) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		spanName := fmt.Sprintf("%s %s", r.Method, r.URL.Path)

		// if there is a span context inside the carrier (the incoming HTTP headers), we have to extract it
		// and update our local context, so the first span of this service gets a parent span
		// if the service was called from another instrumented service.
		// This is not strictly needed for this example, but it's good practice to always extract the carrier
		// before starting the first span, because it makes wiring services together extremely easy in the future
		ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

		// start a new span with the context extracted from the carrier
		ctx, span := tracer.Start(ctx, spanName)
		defer span.End()

		// add the new context into the request,
		// so the spans we create next will be shown as child spans of the first, initial span
		r = r.WithContext(ctx)

		next.ServeHTTP(w, r)
	})
}

Then we have to manually create spans inside every function that we want to record in a final trace — a code sample follows.

func makeGetArticlesEndpoint(as Service, tracer trace.Tracer) endpoint.Endpoint {
   return func(ctx context.Context, request interface{}) (interface{}, error) {
      // start new span using global tracer
      ctx, span := tracer.Start(ctx, "get articles endpoint")
      // close the span before exit from the function
      defer span.End()
 
      ...

How it works in practice

To see how it works, you should follow these steps.

  1. Make a request
curl http://localhost:8001/articles/v1/ | jq

You will get a response similar to the one below.

Content might be different on your computer, so keep an eye only on the response structure.

{
  "articles": [
    {
      "id": "d18602d5a5554e5f98734b39dc56b176",
      "author": "Adrian Perez",
      "title": "Oil activity pay.",
      "text": "Happen education ever. Majority over teacher boy on arm. Somebody both opportunity us per.",
    },
    {
      "id": "80c056b351334da5b3c95f4b38bbb90f",
      "author": "Melissa Nichols",
      "title": "Fear room federal.",
      "text": "Building result difference hold. Imagine professional media both.\nReal building gas into. Example world skill treat. Return time raise decade choose.",
    },
    ....
  ]
}

2. Open the Jaeger UI and find the recently created trace. It should look similar to what you see in the screenshot below.

Trace

On this page, we can spot several things:

  1. Trace was created for /articles/v1/ endpoint
  2. The whole request took 159 ms to complete
  3. Spans were created for each functional unit that the request visited
  4. The trace graph follows the project structure: the request decoder, the endpoint and the request encoder are on the same level in the Go kit model, and the getArticles service is called by the endpoint, which shows that they are causally dependent
  5. Trace shows how much time each unit took to complete

Suggestions microservice teardown (Python application)

To show you how tracing helps in tracking request flows across a system made of several microservices, I wrote a service that acts as an external component for articles-microservice. This service shows suggestions for a given article. This suggestions-microservice was written in Python. As a framework, I chose Flask: it is well known in the community, and there is a good chance most readers are familiar with it.

There is a library for auto-instrumenting Flask-based projects, but the goal of this article is to show readers how to instrument services manually, as it gives you more flexibility. If you’re interested in auto-instrumentation, please read this page of the OpenTelemetry documentation.
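For reference, a minimal sketch of what such auto-instrumentation could look like with the opentelemetry-instrumentation-flask package (which is not used in this project) is shown below:

from flask import Flask
# provided by the opentelemetry-instrumentation-flask package
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# instruments every Flask route automatically:
# a server span is created for each incoming request,
# and the incoming trace context is extracted for you
FlaskInstrumentor().instrument_app(app)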

The microservice folders structure

apps/suggestions
├── Dockerfile
├── app.py // entry file
├── mocks
│   └── suggestions.json // generated mock file, our datasource
├── requirements
│   └── requirements.txt // requirements specific to the microservice
├── suggestions
│   ├── __init__.py
│   ├── containers.py
│   ├── service.py
│   └── views.py
└── tracing.py // all tracing code resides here

How the service is instrumented

The whole tracing logic resides in the tracing.py file. There are two main parts:

  1. Provider initialiser: the code is pretty similar to what we have in the articles microservice, but written in Python (a sketch of what it might look like follows this list)
  2. Tracing decorator
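
The real initialiser lives in tracing.py in the repo; the snippet below is only a simplified approximation using the standard OpenTelemetry Python SDK packages (the function name and its arguments are placeholders of my own):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ALWAYS_ON


def init_tracer_provider(exporter_url: str, service_name: str) -> None:
    # OTLP exporter pointing at the tracing backend (Jaeger or Tempo)
    exporter = OTLPSpanExporter(endpoint=exporter_url, insecure=True)

    # the resource describes the entity producing the telemetry
    resource = Resource.create({SERVICE_NAME: service_name, "service.version": "0.0.1"})

    # sample every trace, as in the Go service; in a production system you would
    # typically use a ratio-based or parent-based sampler instead
    provider = TracerProvider(sampler=ALWAYS_ON, resource=resource)

    # batch spans before exporting, like WithBatcher in the Go provider
    provider.add_span_processor(BatchSpanProcessor(exporter))

    # register globally so the rest of the service can call trace.get_tracer()
    trace.set_tracer_provider(provider)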

The most interesting piece of code here is the implementation of the tracing decorator. It’s well documented right in the code, but I want to say more about what it can do.

How to use the decorator

We can decorate classes or functions.

When we decorate a class, all of its methods are automatically decorated with @Trace() and the provided arguments; this also works for classmethods and staticmethods.


@Trace() 
class SuggestionsService:     
    def get_suggestions(self) -> Dict[str, List[str]]:         
        ...      
    def get_suggestions_for_article(self, article_id: str) -> List[str]:         
        ... 

The code above means that all methods of SuggestionsService will be instrumented. This is equivalent to:

class SuggestionsService:     
    @Trace()     
    def get_suggestions(self) -> Dict[str, List[str]]:         
        ...
        
    @Trace()
    def get_suggestions_for_article(self, article_id: str) -> List[str]:
        ...

If we want to disable tracing for some methods of a class, we can decorate these methods with @Trace(disable_tracing=True):

@Trace() 
class SuggestionsService:
    def get_suggestions(self) -> Dict[str, List[str]]:
        ...
        
    @Trace(disable_tracing=True)
    def get_suggestions_for_article(self, article_id: str) -> List[str]:
        ...

If we want to disable tracing on selected endpoints, we have to decorate their views with @Trace(disable_tracing=True):

@Trace(disable_tracing=True) 
def get_by_article_id(article_id):
    ...

This will automatically disable tracing for the whole request, even on methods used inside get_by_article_id view.

Lastly, it’s worth mentioning that this decorator doesn’t add any noticeable runtime overhead, as all the heavy introspection happens during the module’s initialisation step, at application startup.
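The actual decorator lives in tracing.py; the snippet below is a heavily simplified, hypothetical sketch (it ignores the classmethod/staticmethod handling and the disable_tracing interplay described above) that illustrates why the cost is paid only once, when the class is decorated at import time:

import functools
import inspect

from opentelemetry import trace


class Trace:
    # simplified sketch: the real decorator also handles classmethods,
    # staticmethods and the disable_tracing flag described above
    def __call__(self, obj):
        # class decoration: every public method is wrapped exactly once,
        # at the moment the module defining the class is imported
        if inspect.isclass(obj):
            for name, member in inspect.getmembers(obj, inspect.isfunction):
                if not name.startswith("_"):
                    setattr(obj, name, self._wrap(member))
            return obj
        # plain function / view decoration
        return self._wrap(obj)

    @staticmethod
    def _wrap(func):
        tracer = trace.get_tracer(func.__module__)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # the only per-call cost is starting and ending a span
            with tracer.start_as_current_span(func.__qualname__):
                return func(*args, **kwargs)

        return wrapper

All the introspection (walking the class members and replacing them with wrapped versions) runs while the module is being imported; at request time, only the thin wrapper function executes.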

How it works in practice

To see how it works you should follow these steps:

1. Make a request

curl http://localhost:8002/suggestions/v1/<article_id> | jq

You will get a list of suggested article IDs for the given article_id.

As before, content might be different on your computer, so keep an eye only on the response structure.

[
  "fd0236aca4474cfa8d4e2b3718dd6fe0",
  "9fd6a8b6557147c396e0f42046de7f80",
  "c2317f1138324945a836cceb902e25d1"
]

2. Open Jaeger UI and find a recently created trace. It should look similar to what you see in the screenshot below.

Trace

The content of the screenshot is similar to what we’ve already seen in articles-microservice, but now we see there are spans from suggestions-microservice.

Wiring services together

To wire tracing data of all services together, we should pass span context from a caller to a target service. For this purpose, we should use propagators.

Propagators — objects used to read and write span context data to and from messages exchanged by the applications

Types of Propagator

The Propagators API currently defines one Propagator type:

  • TextMapPropagator is a type that injects values into and extracts values from carriers as string key/value pairs.

Carrier — a medium used by Propagators to read values from and write values to. Each specific Propagator type defines its expected carrier type, such as a string map or a byte array.

For more details, I recommend reading the official OpenTelemetry doc.
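As a quick, standalone illustration of this contract (not code from the project; it assumes a tracer provider is already configured), the W3C Trace Context propagator can inject the current span context into a plain dict carrier and extract it again on the other side:

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()
tracer = trace.get_tracer("propagation-demo")

carrier: dict = {}  # in real services the carrier is usually the HTTP headers

with tracer.start_as_current_span("caller"):
    # caller side: write the current span context into the carrier
    # (it ends up as a single "traceparent" key)
    propagator.inject(carrier)

# target side: rebuild the context from the carrier and use it as the parent
ctx = propagator.extract(carrier=carrier)
with tracer.start_as_current_span("callee", context=ctx):
    ...  # this span becomes a child of "caller", even across process boundaries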

How it’s done in the code

I added logic in several places:

  1. articles-microservice — on provider initialisation, because we want to carry the span context over HTTP headers when we make requests to suggestions-microservice (full source code)
otel.SetTextMapPropagator(propagation.TraceContext{})

2. articles-microservice — tracing middleware, where I also extract context from HTTP headers using HeaderCarrier (full source code)

// if there is a span context inside the carrier, we have to extract it and update our local context
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

It’s good practice to extract the context at the top input layer, even if a service is not meant to be called from other services, because there is a chance it will be needed in the future, and this extra instrumentation can also help to spot service misuse.

3. articles-microservice — we should also instrument the HTTP client that calls suggestions-microservice

&http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),   
    Timeout: 10 * time.Second, 
}

Now, on every call, this client will inject the parent span context into the headers partially automatically: partially, because we still have to attach the parent context to the request manually, like this (full source code):


// build request object with tracing context
u, err := url.Parse(fmt.Sprintf("%s/suggestions/v1/%s", s.suggestionsServiceHost, id))
if err != nil {
    return nil, err
}

req := &http.Request{
    Method: http.MethodGet,
    URL:    u,
}
req = req.WithContext(ctx)

// fetch suggestions from the suggestions microservice
resp, err := s.client.Do(req)
if err != nil {
    return nil, err
}

4. The last thing we have to do is to extract context from HTTP headers in suggestions-microservice (full source code)


# we extract the header values because, if the suggestions microservice
# was called from another instrumented service,
# we want to have a fancy dependency graph in the traces explorer
return TraceContextTextMapPropagator().extract(
 getter=DefaultGetter(), carrier=request.headers
)

and use this context as a parent context for the first span (full source code)


with self.tracer.start_as_current_span(span_name, context=ctx()) as _:
    ...

To test it, we should query the articles microservice API with the additional parameter with_suggested=1:

curl http://localhost:8001/articles/v1/?with_suggested=1 | jq

The response should look similar to the one below. The new suggested key is now present in the response.

{
  "articles": [
    {
      "id": "fd0236aca4474cfa8d4e2b3718dd6fe0",
      "author": "Timothy Foster",
      "title": "Detail assume wonder.",
      "text": "Clear alone wonder movement difference consumer. Bring lose short.\nTeach charge cup service. Fine yeah per wrong public. Until bit page real later third old focus.",
      "suggested": [
        {
          "id": "2119cd10997741cb81c1dcc058e2abd4",
          "author": "Christopher Huff",
          "title": "Gas company discussion whatever.",
          "text": "Either within behavior may. Color purpose reality wonder. Better during party.\nLittle ago measure themselves meet. Do their perform among."
        },
        {
          "id": "2119cd10997741cb81c1dcc058e2abd4",
          "author": "Christopher Huff",
          "title": "Gas company discussion whatever.",
          "text": "Either within behavior may. Color purpose reality wonder. Better during party.\nLittle ago measure themselves meet. Do their perform among."
        },
        {
          "id": "d4dc9ba64ee14f2591472a2d0fe17148",
          "author": "Nathan Wright",
          "title": "While at on.",
          "text": "Senior collection conference process become. Follow product process phone special.\nType hotel role smile although unit. Condition reveal generation like."
        }
      ]
    },
    ...
]}

Now open Jaeger UI and let’s see how it shows the request flow between these two microservices.

Articles

On the search page, we see a new trace entity. Notice that now it shows that the request visited two services.

Articles

The graph shows us exactly which function of articles-microservice called which methods of suggestions-microservice, and how requests visited both microservices.

From articles-microservice, we call suggestions-microservice for every article in the list. This is suboptimal for a production system, but it works nicely for the demo project, as it clearly shows how requests flow between services.

Benefits of using vendor-agnostic instrumentation

As mentioned above, OpenTelemetry offers a vendor-agnostic way to instrument applications. So far we have been using Jaeger as the tracing backend, but what if we want to switch to something else?

The only thing you need to do is replace the tracing backend URL (and the auth method, in some cases) with one that points to the new tracing backend.

To make it work on your machine, follow these simple steps:

  1. Stop Jaeger service
docker compose -f jaeger/docker-compose.yaml stop

2. Run Grafana stack docker-compose

docker compose -f grafana/docker-compose.yaml up

3. Replace TRACING_BACKEND_URL with the relevant value in both microservices’ .env.docker files

# TRACING_BACKEND_URL=jaeger:4317
TRACING_BACKEND_URL=tempo:4317

4. Restart the main docker-compose

5. Query the articles microservice with the with_suggested=1 flag

curl http://localhost:8001/articles/v1/?with_suggested=1 | jq

6. Open Grafana UI

7. In the left menu, open the Explore tab

8. Open the search tab and press the Run query button; you should find a new trace there

Run query

9. Open the trace

On the right side of the screen, a graph opens. It is the same graph you saw in Jaeger.

graph jaeger

The services’ colours are pretty similar because Grafana calculates colours based on the service name, and as both of our service names share the same postfix, their hashes look almost the same, which leads to this slightly annoying behaviour.

Conclusions

We explored distributed tracing and OpenTelemetry. We reviewed how to add distributed tracing with OpenTelemetry to Go and Python applications, and how to track requests across a microservices environment. We also briefly explored how to visualise tracing data with Jaeger and Grafana Tempo.

I hope this article has helped you build your expertise in distributed tracing and OpenTelemetry, and that it helps improve the observability of your services.
