What is distributed tracing?
Imagine you have a system made of 100+ microservices, communicating with different data sources and external systems. Everything works great for a year after the service launch. But suddenly, one day, the system becomes unreliable — it stops sending responses for some requests, or starts getting slower. What would you do to address it and understand what’s going wrong in the system? Distributed tracing deals with this kind of problem very well. We use it to troubleshoot systems at Adevinta and keep things running smoothly.
A brief introduction
Distributed tracing is a method of tracking how requests flow inside your system. This can help you to highlight every suboptimal or broken part of the system by visualising the full path of requests.
Main benefits
- Helps to spot system reliability issues by highlighting degraded parts of the system
- Helps to define and maintain SLAs
Distributed tracing lets us know exactly how long each step in the system takes to execute. This lets us check whether we are complying with our SLAs, and it can also help to define or adjust SLAs for new features.
Main issues
- Human error during manual instrumentation of services in a system
As services have to be instrumented manually, there is a chance that a developer misses some parts of the system or doesn’t instrument them properly. This can lead to missing traces and a mess of spans in the visualisation tool.
OpenTelemetry allows us to instrument code automatically, but it doesn’t always bring the benefits you expect. In most cases, the flexibility provided by automatic instrumentation libraries is not enough for real applications.
Example of tracing diagram
Trace diagrams show how requests flow through the system, which parts of the system are involved and how long they take to complete their work.
In the picture above we can see a trace.
Trace — a series of causally related events that represents the request flow through a system
It shows that the whole request took 500 ms to complete, and that the request visited two services — A and B — inside the A₀ span.
Span — the basic unit of work; it represents an event in our system
Three other events were executed — A₁, A₂, and B₀ (service B was called from unit A₂) — and then unit A₃ was executed, taking 250 ms to complete.
Now let’s look at something that helps integrate distributed tracing into our systems.
OpenTelemetry
OpenTelemetry is a collection of tools, APIs, and SDKs that you can use to instrument, generate, collect and export telemetry data — metrics, logs, and traces (known as the three pillars of observability) — in a vendor-agnostic way.
OpenTelemetry provides you with:
- A single, vendor-agnostic instrumentation library per language with support for both automatic and manual instrumentation
- A single, vendor-neutral collector binary that can be deployed in a variety of ways
- An end-to-end implementation to generate, emit, collect, process, and export telemetry data
- Full control of your data with the ability to send data to multiple destinations in parallel through configuration
- Open-standard semantic conventions to ensure vendor-agnostic data collection
- The ability to support multiple context propagation formats in parallel to assist with migrating as standards evolve
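To make the trace and span terminology from the diagram above concrete, here is a minimal sketch of what manual instrumentation looks like with the OpenTelemetry Python SDK. It is only an illustration — the provider and exporter setup it relies on is covered later in this article, and the span names are made up for the example.
# Minimal manual instrumentation: a parent span with one child span,
# producing a structure like A₀ → A₁ from the diagram above.
# Assumes a tracer provider has already been configured (shown later in the article).
from opentelemetry import trace

tracer = trace.get_tracer("articles-demo")

with tracer.start_as_current_span("handle request"):  # parent span (A₀)
    with tracer.start_as_current_span("load articles"):  # child span (A₁)
        ...  # the work done here is recorded as part of the same trace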
How to start using distributed tracing with OpenTelemetry in your system
There are three steps you should complete to adopt distributed tracing in your system:
- Instrument your code
- Collect and export tracing data
- Analyse and visualise it
The first two points are covered by OpenTelemetry, and the third one is solved by observability backends.
In this article, we will use two different observability backends — Grafana Tempo and Jaeger. There is a big difference between them, specifically in the guarantees they offer. We won’t discuss observability backends in depth in this article, but if you are curious, I can recommend these resources on head-based (Jaeger) sampling and tail-based (Tempo) sampling.
Real-life example
To give you a better understanding of distributed tracing and OpenTelemetry concepts, I created a system of two microservices and instrumented them. Now I will show you, step by step, how to instrument your Python/Go applications.
The system walkthrough
Imagine you have a Medium-like application, which consists of two microservices — the first service is responsible for showing articles to users (articles-microservice), and the second service recommends articles similar to the one a user is currently reading (suggestions-microservice).
The source code and startup guide can be found in the repo README.
How data flows in the system
When you make a request to articles-microservice, it retrieves the requested data from memory (the data was loaded at startup from mocks/articles.json) and returns a list of articles.
It’s also possible to add the with_suggested parameter to the request URL query. For every article in the list, articles-microservice will query suggestions-microservice, which will retrieve a list of suggested article IDs from its memory (loaded at startup from mocks/suggestions.json). Then articles-microservice will enrich these IDs with article data and return this combined data to the user.
The project structure
├── README.md
├── apps // contains microservices applications
│ ├── articles // contains the main microservice, written in Go
│ ...
│ └── suggestions // folder contains auxiliary microservice, written in Python
│ ...
├── grafana // preconfigured dockerized Grafana stack to run the Tempo tracing backend
│ ├── docker-compose.yaml
│ ├── grafana-datasources.yaml
│ ├── prometheus.yaml
│ └── tempo.yaml
├── jaeger //preconfigured dockerized Jaeger tracing backend
│ └── docker-compose.yaml
├── requirements // project-level python requirements, mainly used by generate_mocks.py script
│ └── requirements.txt
└── scripts // contains scripts that help automate things in the project; for now there is only one script that generates mock data, plus its wrapper to simplify usage
├── generate_mocks
└── generate_mocks.py
How to run the project
The first thing you should do is generate mock data using the following command:
export number_of_articles=10
export number_of_suggestions_per_article=5
scripts/generate_mocks ${number_of_articles} ${number_of_suggestions_per_article}
Generated files are our data sources for both microservices. I decided not to use a database in order to keep the project simple.
The next step is to run the microservices and tracing backend. As everything is dockerized for your convenience, you can run the whole stack with this command:
docker compose up -d && docker compose -f jaeger/docker-compose.yaml up -d
That’s it — the project is ready to work with.
Article microservice teardown (Golang application)
For our main microservice, I chose Go and the Go kit framework. The project follows the standard structure enforced by Go kit, so it will be familiar to most readers who have experience building microservices with Go. This framework is widely used in the community.
The microservice folder structure
apps/articles
├── Dockerfile
├── articles // all internal app code resides here
│ ├── endpoint.go
│ ├── errors.go
│ ├── models.go
│ ├── service.go
│ └── transport.go
├── go.mod
├── go.sum
├── main.go // entry file
├── mocks
│ └── articles.json // generated mock file, our datasource
└── tracing // main tracing code is in this folder
├── middleware.go
└── provider.go
The service can do several things:
- Show us a list of articles
- Show detailed info for an article by id
- If we ask, it will enrich article objects with suggested articles from suggestions-microservice
How the service is instrumented
A huge part of the logic resides inside the /tracing folder:
provider.go – responsible for initialisation of the tracing provider
middleware.go – middleware which registers the first span in the microservice on every request
Things worth mentioning from provider.go
// we create an OTLP exporter with a tracing backend endpoint
exporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient(
otlptracegrpc.WithEndpoint(exporterURL),
otlptracegrpc.WithInsecure(), // the connection doesn't need to be secured in this demo project
))
if err != nil {
return nil, err
}
// set up the resource, which describes the entity producing the telemetry (its identifying information and metadata)
resource := resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String(serviceName),
semconv.ServiceVersionKey.String("0.0.1"),
)
// AlwaysSample returns a Sampler that samples every trace; for testing purposes we want to sample every trace, but in a production system with a high number of requests it can be expensive
sampler := trace.AlwaysSample()
// create new tracing provider from the components above
tracerProvider := trace.NewTracerProvider(
trace.WithSampler(sampler),
trace.WithBatcher(exporter),
trace.WithResource(resource),
)
// set the provider as a global, so all components of the service can access it
otel.SetTracerProvider(tracerProvider)
// set the global propagator to tracecontext; it helps us wire together services that call each other
otel.SetTextMapPropagator(propagation.TraceContext{})
middleware.go teardown
func TracingMiddleware(next http.Handler, tracer trace.Tracer) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
spanName := fmt.Sprintf("%s %s", r.Method, r.URL.Path)
// if there is a context inside the carrier, we have to extract it and update our local context,
// so the first span of this service gets a parent span if the service was called from another service;
// this is not relevant to the example, but it's good practice to always extract the carrier in the first span,
// because it makes service wiring extremely easy in the future
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
// start a new span with the context extracted from the carrier
ctx, span := tracer.Start(ctx, spanName)
defer span.End()
// add the new context to the request,
// so the spans we create next will be shown as children of this first span
r = r.WithContext(ctx)
next.ServeHTTP(w, r)
})
}
Then we have to manually create spans inside every function that we want to record in a final trace — a code sample follows.
func makeGetArticlesEndpoint(as Service, tracer trace.Tracer) endpoint.Endpoint {
return func(ctx context.Context, request interface{}) (interface{}, error) {
// start new span using global tracer
ctx, span := tracer.Start(ctx, "get articles endpoint")
// end the span before exiting the function
defer span.End()
...
How it works in practice
To see how it works, you should follow these steps.
1. Make a request
curl http://localhost:8001/articles/v1/ | jq
You will get a response similar to the one below.
The content might be different on your computer, so keep an eye only on the response structure.
{
"articles": [
{
"id": "d18602d5a5554e5f98734b39dc56b176",
"author": "Adrian Perez",
"title": "Oil activity pay.",
"text": "Happen education ever. Majority over teacher boy on arm. Somebody both opportunity us per.",
},
{
"id": "80c056b351334da5b3c95f4b38bbb90f",
"author": "Melissa Nichols",
"title": "Fear room federal.",
"text": "Building result difference hold. Imagine professional media both.\nReal building gas into. Example world skill treat. Return time raise decade choose.",
},
....
]
}
2. Open Jaeger UI and find the recently created trace. It should look similar to what you see in the screenshot below.
On this page, we can spot several things:
- The trace was created for the /articles/v1/ endpoint
- The whole request took 159 ms to complete
- Spans were created for each functional unit that the request visited
- The trace graph follows the project structure: request decoder, endpoint and request encoder are on the same level in the Go kit model, and the getArticles service function is called by the endpoint — which shows that they are causally dependent
- The trace shows how much time each unit took to complete
Suggestions microservice teardown (Python application)
To show you how tracing helps in tracking request flows across a system made of several microservices, I wrote a service that acts as an external component for articles-microservice. This service shows suggestions for a given article. This suggestions-microservice was written in Python. As a framework I chose Flask — it is well-known in the community, and there is a good chance most readers are familiar with it.
There is a library for auto-instrumentation of Flask-based projects, but the goal of this article is to show readers how to instrument services manually, as it gives you more flexibility. But if you’re interested in auto-instrumentation, please read this page of the OpenTelemetry documentation.
The folder structure
apps/suggestions
├── Dockerfile
├── app.py // entry file
├── mocks
│ └── suggestions.json // generated mock file, our datasource
├── requirements
│ └── requirements.txt // requirements specific to the microservice
├── suggestions
│ ├── __init__.py
│ ├── containers.py
│ ├── service.py
│ └── views.py
└── tracing.py // all tracing code resides here
How the service is instrumented
The whole tracing logic resides in the tracing.py file. There are two main parts:
- Provider initialiser — the code is pretty similar to what we have in articles-microservice, but written in Python (a minimal sketch follows below)
- Tracing decorator
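The repo’s own initialiser isn’t reproduced here, but a minimal Python equivalent of the Go provider.go shown earlier could look roughly like this. It is a sketch under assumptions: the function name init_tracer_provider and its arguments are illustrative, not the actual names used in the repo.
# Minimal sketch of a Python tracer provider initialiser, mirroring the Go provider.go:
# OTLP gRPC exporter -> batch span processor -> tracer provider with a service resource.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_VERSION, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_tracer_provider(service_name: str, exporter_url: str) -> None:
    resource = Resource.create({SERVICE_NAME: service_name, SERVICE_VERSION: "0.0.1"})
    exporter = OTLPSpanExporter(endpoint=exporter_url, insecure=True)  # demo project, no TLS
    provider = TracerProvider(resource=resource)  # the default sampler keeps every trace (parent-based always-on)
    provider.add_span_processor(BatchSpanProcessor(exporter))  # export spans in batches
    trace.set_tracer_provider(provider)  # set it as the global provider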
The most interesting piece of code here is the implementation of the tracing decorator. It’s well documented right in the code, but I want to say more about what it can do — a simplified sketch of the idea follows.
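Here is a heavily simplified sketch of such a decorator for plain functions. The real tracing.py in the repo is richer — it also handles classes, classmethods, staticmethods and disabling tracing for a whole request — so treat this only as an illustration of the core mechanism.
# Simplified sketch: wrap a function so that every call is recorded as a span.
import functools

from opentelemetry import trace


class Trace:
    def __init__(self, disable_tracing: bool = False):
        self.disable_tracing = disable_tracing

    def __call__(self, func):
        if self.disable_tracing:
            return func  # leave the function untouched

        tracer = trace.get_tracer(func.__module__)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # the span is named after the function and nests under the current span
            with tracer.start_as_current_span(func.__qualname__):
                return func(*args, **kwargs)

        return wrapper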
How to use the decorator
We can decorate classes or functions.
When we decorate a class, all its methods are automatically decorated with @Trace() and the provided arguments; this also works for classmethods and staticmethods.
@Trace()
class SuggestionsService:
def get_suggestions(self) -> Dict[str, List[str]]:
...
def get_suggestions_for_article(self, article_id: str) -> List[str]:
...
The code above means that all methods of SuggestionsService will be instrumented. This is equivalent to:
class SuggestionsService:
@Trace()
def get_suggestions(self) -> Dict[str, List[str]]:
...
@Trace()
def get_suggestions_for_article(self, article_id: str) -> List[str]:
...
If we want to disable tracing for some methods of a class, we can decorate these methods with @Trace(disable_tracing=True)
@Trace()
class SuggestionsService:
def get_suggestions(self) -> Dict[str, List[str]]:
...
@Trace(disable_tracing=True)
def get_suggestions_for_article(self, article_id: str) -> List[str]:
...
If we want to disable tracing on selected endpoints, we have to decorate their views with @Trace(disable_tracing=True)
@Trace(disable_tracing=True)
def get_by_article_id(article_id):
...
This will automatically disable tracing for the whole request, even on methods used inside the get_by_article_id view.
Lastly, it’s worth mentioning that this decorator doesn’t add any overhead at runtime, as all the heavy introspection happens during the module’s initialisation step, at application startup.
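To illustrate that claim, here is a rough sketch of how class decoration can be done once, at decoration time (i.e. while the module is being imported), so nothing extra happens per request. The helper name _wrap_class is hypothetical; the repo’s real implementation differs in the details.
# Rough sketch: walk the class attributes once and replace each plain method with
# its traced wrapper; at request time only the already-wrapped functions run.
import inspect


def _wrap_class(cls, wrap_function):
    for name, attr in list(vars(cls).items()):
        if name.startswith("__") or not inspect.isfunction(attr):
            continue  # the real implementation also handles classmethods/staticmethods
        setattr(cls, name, wrap_function(attr))
    return cls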
How it works in practice
To see how it works you should follow these steps:
1. Make a request
curl http://localhost:8002/suggestions/v1/<article_id> | jq
You will get a list of suggested article IDs for the given article_id.
As before, the content might be different on your computer, so keep an eye only on the response structure.
[
"fd0236aca4474cfa8d4e2b3718dd6fe0",
"9fd6a8b6557147c396e0f42046de7f80",
"c2317f1138324945a836cceb902e25d1"
]
2. Open Jaeger UI and find a recently created trace. It should look similar to what you see in the screenshot below.
The content of the screenshot is similar to what we’ve already seen in articles-microservice, but now we see there are spans from suggestions-microservice.
Wiring services together
To wire tracing data of all services together, we should pass span context from a caller to a target service. For this purpose, we should use propagators.
Propagators — objects used to read and write span context data to and from messages exchanged by the applications
Types of Propagator
Propagators API currently defines one Propagator type: TextMapPropagator is a type that injects values into and extracts values from carriers as string key/value pairs.
Carrier — A carrier is a medium used by Propagators to read values from and write values to. Each specific Propagator type defines its expected carrier type, such as a string map or a byte array.
For more details, I recommend reading the official OpenTelemetry doc.
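To make the Propagator and Carrier terminology concrete, here is a small Python sketch in which a plain dict plays the role of the carrier (in real services the carrier is usually the HTTP headers mapping). It assumes a tracer provider has already been configured as shown earlier; the span names are made up for the example.
# Caller side: inject the current span context into a carrier (e.g. outgoing headers).
# Callee side: extract it back into a Context and use it as the parent of the first span.
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()
tracer = trace.get_tracer("demo")

carrier: dict = {}
with tracer.start_as_current_span("caller span"):
    propagator.inject(carrier)  # writes the W3C "traceparent" entry into the dict

parent_ctx = propagator.extract(carrier)
with tracer.start_as_current_span("callee span", context=parent_ctx):
    ...  # this span is now causally linked to "caller span"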
How it’s done in the code
I added logic in several places:
1. articles-microservice — on provider initialisation, because we want to carry span context through HTTP headers when we make requests to suggestions-microservice (full source code)
otel.SetTextMapPropagator(propagation.TraceContext{})
2. articles-microservice — tracing middleware, where I also extract context from HTTP headers using HeaderCarrier (full source code)
// if there is a context inside the carrier, we have to extract it and update our local context
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
It’s a good practice to extract the context at the top input layer, even if a service is not meant to be called from other services, because there is a chance it will be needed in the future, and this extra instrumentation can also spot service misuse.
3. articles-microservice — we should also instrument the HTTP client which calls suggestions-microservice
&http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
Timeout: 10 * time.Second,
}
Now on every call, this client will partially automatically inject the parent span context into the headers — partially automatically because we have to provide the parent context to the request manually, like this (full source code)
// build request object with tracing context
u, err := url.Parse(fmt.Sprintf("%s/suggestions/v1/%s", s.suggestionsServiceHost, id))
if err != nil {
return nil, err
}
req := &http.Request{
Method: http.MethodGet,
URL: u,
}
req = req.WithContext(ctx)
// fetch suggestions from the suggestions microservice
resp, err := s.client.Do(req)
if err != nil {
return nil, err
}
4. The last thing we have to do is to extract the context from HTTP headers in suggestions-microservice (full source code)
# we extract header values because, if the suggestions microservice
# was called from another instrumented service,
# we want to have a fancy dependency graph in the traces explorer
return TraceContextTextMapPropagator().extract(
getter=DefaultGetter(), carrier=request.headers
)
and use this context as the parent context for the first span (full source code)
with self.tracer.start_as_current_span(span_name, context=ctx()) as _:
...
To test it, we should query the articles microservice API with the additional parameter with_suggested=1
curl http://localhost:8001/articles/v1/?with_suggested=1 | jq
The response should look similar to this one. The new suggested key is now present in the response.
{
"articles": [
{
"id": "fd0236aca4474cfa8d4e2b3718dd6fe0",
"author": "Timothy Foster",
"title": "Detail assume wonder.",
"text": "Clear alone wonder movement difference consumer. Bring lose short.\nTeach charge cup service. Fine yeah per wrong public. Until bit page real later third old focus.",
"suggested": [
{
"id": "2119cd10997741cb81c1dcc058e2abd4",
"author": "Christopher Huff",
"title": "Gas company discussion whatever.",
"text": "Either within behavior may. Color purpose reality wonder. Better during party.\nLittle ago measure themselves meet. Do their perform among."
},
{
"id": "2119cd10997741cb81c1dcc058e2abd4",
"author": "Christopher Huff",
"title": "Gas company discussion whatever.",
"text": "Either within behavior may. Color purpose reality wonder. Better during party.\nLittle ago measure themselves meet. Do their perform among."
},
{
"id": "d4dc9ba64ee14f2591472a2d0fe17148",
"author": "Nathan Wright",
"title": "While at on.",
"text": "Senior collection conference process become. Follow product process phone special.\nType hotel role smile although unit. Condition reveal generation like."
}
]
},
...
]}
Now open Jaeger UI and let’s see how it shows the request flow between these two microservices.
On the search page, we see a new trace entity. Notice that now it shows that the request visited two services.
The graph shows us exactly which function of articles-microservice called which methods of suggestions-microservice, and how requests visited both microservices.
From articles-microservice we call suggestions-microservice for every article in the list. This is suboptimal for a production system, but nice for the demo project, as it clearly shows how a request flows between services.
Benefits of using vendor-agnostic instrumentation
As was mentioned above, OpenTelemetry offers a vendor-agnostic way to instrument applications. Previously, we were using Jaeger as a tracing backend, but what if we want to switch to something else?
The only thing you need to do is to replace the tracing backend URL (and auth method in some cases) with a new one that points to a new tracing backend.
To make it work on your machine, follow these simple steps:
1. Stop the Jaeger service
docker compose -f jaeger/docker-compose.yaml stop
2. Run the Grafana stack docker-compose
docker compose -f grafana/docker-compose.yaml up
3. Replace TRACING_BACKEND_URL with the relevant value in both microservices’ .env.docker files
# TRACING_BACKEND_URL=jaeger:4317
TRACING_BACKEND_URL=tempo:4317
4. Restart the main docker-compose
5. Query the articles microservice with the with_suggested=1 flag
curl http://localhost:8001/articles/v1/?with_suggested=1 | jq
6. Open Grafana UI
7. In the left menu, open the Explore tab
8. Open the search tab and press the Run query button; you should find a new trace there
9. Open the trace
On the right side of the screen, a graph opens — it is the same graph you saw in Jaeger.
Service colours are pretty similar because Grafana calculates colours based on the service name, and as both our services have *-service postfixes, their hashes look almost the same — which leads to this annoying behaviour.
Conclusions
We explored distributed tracing and OpenTelemetry. We reviewed how to add distributed tracing with OpenTelemetry to Go and Python applications, and how to track requests across a microservice environment. We also briefly explored how to visualise tracing data with Jaeger and Grafana Tempo.
I hope this article has helped you build your expertise in distributed tracing and OpenTelemetry, and that it helps you improve the observability of your services.