William Schroeder
May 12, 2023

OpenTelemetry Tail Sampling Replacement

Our team was interested in sending our traces and spans from our Elixir services to an opentelemetry-collector service, which would then sample and send them to Datadog. A typical web request, for example, should result in a single trace. A trace is a record of all the notable calls across services involved in that web request, represented by spans. A span is a call or event nested within a trace or another span of that trace.
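
To make that concrete, here is a hypothetical shape for the trace of a single web request (every name here is illustrative):

trace: one trace id covers the whole request
spans:
  - name: GET /checkout              # root span: the inbound HTTP request
    children:
      - name: Repo.query             # database call made while handling it
      - name: POST payments-service  # downstream HTTP call to another service
        children:
          - name: charge_card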

We wanted a few policies:

  • All errors and high-latency traces are sent
  • Sample a tiny fraction of traces related to our most called endpoints
  • Sample about 2/3 of everything else

We also wanted Datadog to be able to show the true number of traces, regardless of the ingestion rate we applied locally.

Tail Sampling (did not work)

The appropriate pattern for sampling on features of a completed trace, such as its HTTP status or how long it took to execute, is called Tail Sampling. For a while, we attempted to use OpenTelemetry’s tailsamplingprocessor. For reasons we could not determine, we were able to send traces but not link them to operations suitable for Datadog views, even when using the always_sample configuration like this:

config:
  processors:
    batch:
      send_batch_max_size: 500
      send_batch_size: 50
      timeout: 10s
    datadog:
    tail_sampling:
      policies:
        - name: always
          type: always_sample
  service:
    pipelines:
      traces:
        receivers:
          - otlp
        processors:
          - batch
          - datadog
          - tail_sampling
        exporters:
          - datadog

This configuration instructs the opentelemetry-collector to receive OTLP messages. We batch incoming spans so they are processed in groups rather than one at a time. Then the datadog processor analyzes these traces and sends metrics about them to Datadog; this is how the graphs that show hits per second are populated. Next, we apply tail sampling to group spans by trace and push every trace forward. Finally, we export the traces to Datadog with the datadogexporter.
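
For reference, the otlp receiver and datadog exporter named in the pipeline are configured separately. A minimal sketch, assuming the Datadog API key is supplied through an environment variable:

config:
  receivers:
    otlp:
      protocols:
        grpc:
        http:
  exporters:
    datadog:
      api:
        key: ${env:DD_API_KEY}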

If we removed tail_sampling from the processors section, everything would come through as expected. This led to several days of frustration: we tried various tail_sampling policies without success while reading numerous blog posts reporting that this processor works for other people, even recently.
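
The variations we tried looked roughly like the policy set below; the policy names and thresholds here are illustrative stand-ins, not our exact configuration:

  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 5000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 66.6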

Debugging OpenTelemetry Collector

It is possible to see debug-level logging by adding a telemetry map to the service section of the configuration that sets the log level:

  service:
    telemetry:
      logs:
        level: debug
    pipelines:
      traces:
        receivers:
          - otlp
        processors:
          - batch
          - datadog
          - tail_sampling
        exporters:
          - datadog

This showed us that the tail_sampling processor reported it was working. With always_sample, it marked every trace as Sampled. It told us how many traces it received and how many it forwarded onward. The datadog exporter exported the same number of traces. They simply were not connected to the operation in Datadog.

Deprecating Tail Sampling

Then we ran into this GitHub issue, which indicates that the community wants to deprecate the tailsamplingprocessor.

  • It is inefficient, consuming too much CPU (confirmed in our system)
  • It could be broken up into multiple composable processors
  • Under pressure, it has a memory leak

Unfortunately, the only work toward this deprecation so far has been to separate out the span aggregation step, which buffers all of a trace's spans together; users would need to rely on the groupbytrace processor for that aspect.
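
If that split ever lands, the aggregation half would look something like the groupbytrace processor below, placed ahead of whatever sampling processors follow it (the values shown are illustrative):

  processors:
    groupbytrace:
      # Buffer spans until the whole trace has likely arrived
      wait_duration: 10s
      # Cap how many in-flight traces are held in memory at once
      num_traces: 100000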

How to Tail Sample without Tail Sampling

It turns out that OpenTelemetry Collector has a composable solution for what we wanted to do, in the form of multiple pipelines. The tradeoff is that each pipeline must be careful to exclude the traces that belong to the other pipelines. This approach consumes a little more memory but is vastly more efficient than using the tailsamplingprocessor. Here is a version of our solution:

config:
  processors:
    datadog:
    batch:
      send_batch_max_size: 500
      send_batch_size: 50
      timeout: 10s

    # Exclude everything captured in other pipelines
    filter/default:
      error_mode: ignore
      traces:
        span: >
          (
            ( attributes["http.status_code"] >= 400 ) or
            ( end_time_unix_nano - start_time_unix_nano > 5000000000 ) or
            ( attributes["http.route"] == "/super/common/endpoint" )
          )          

    # We always pass through interesting things, like errors, so exclude
    # anything that isn't interesting.
    filter/all:
      error_mode: ignore
      traces:
        span: >
          not (
            ( attributes["http.status_code"] >= 400 ) or
            ( end_time_unix_nano - start_time_unix_nano > 5000000000 )
          )          

    # Spammy spans
    filter/spammy:
      error_mode: ignore
      traces:
        span: >
          not (
            ( attributes["http.route"] == "/super/common/endpoint" )
          )          

    probabilistic_sampler/default:
      hash_seed: 23
      sampling_percentage: 66.6
    probabilistic_sampler/spammy:
      hash_seed: 23
      sampling_percentage: 5.0

    attributes/default:
      actions:
        - action: upsert
          key: otel.pipeline
          value: default
    attributes/all:
      actions:
        - action: upsert
          key: otel.pipeline
          value: all
    attributes/spammy:
      actions:
        - action: upsert
          key: otel.pipeline
          value: spammy

  service:
    pipelines:
      traces/all:
        receivers:
          - otlp
        processors:
          - batch
          - filter/all
          - datadog
          - attributes/all
        exporters:
          - datadog
      traces/spammy:
        receivers:
          - otlp
        processors:
          - batch
          - filter/spammy
          - datadog
          - probabilistic_sampler/spammy
          - attributes/spammy
        exporters:
          - datadog
      traces/default:
        receivers:
          - otlp
        processors:
          - batch
          - filter/default
          - datadog
          - probabilistic_sampler/default
          - attributes/default
        exporters:
          - datadog

In this configuration, we have three pipelines:

  • traces/all - Pass through all errors and high-latency traces
  • traces/spammy - Sample a tiny fraction of spammy traces
  • traces/default - Sample about 2/3 of everything else

Each pipeline processes traces in this fashion:

  1. Batch the incoming traces
  2. Filter the traces to what we want to handle in this pipeline
  3. Aggregate traces into metrics
  4. Use probabilistic sampling to sample the traces according to our policy
  5. Add a debugging attribute to each trace so we can validate the policies

Because we needed a bit of math to calculate latency and compare HTTP status codes, we had to use the OTTL capabilities of the filter processor. OTTL conditions tell the filter what to drop, not what to keep, so the logic is the inverse of how a person would state each policy. We handled this by wrapping each keep-condition in a block negated with not.
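
As an isolated illustration of that inversion (the route here is hypothetical): to keep only the spans for one endpoint, the filter must drop everything that is not that endpoint:

    filter/keep_checkout:
      error_mode: ignore
      traces:
        span: >
          not (
            attributes["http.route"] == "/checkout"
          )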

Success

The opentelemetry-collector service allows us to offload the concern of sampling traces within our cluster. The tailsamplingprocessor did not work for us, but the more composable multi-pipeline approach did.