How we cut detection and resolution times in half

My role as Director of Platform Engineering at Salt Security allows me to pursue my passion for cloud-native technology and solving difficult system design challenges. One of the challenges we recently tackled was visibility into our services.

Or rather, the lack thereof.

Initially, we decided to go with OpenTelemetry, but that didn’t give us everything we needed as we still had some blind spots in our system.

In the end, we found a solution that helped us focus on service errors and cut the time it takes to detect and fix problems in half.

But let’s take a step back.

70 services and 50 billion monthly spans

At Salt Security we have about 70 services, based on Scala, Go and NodeJS, generating 50 billion spans per month.

Neither 70 nor 50 billion is a small number, so we needed help gaining visibility into requests as they crossed service boundaries.

The need for visibility

Why did we need visibility into our services?

1. At the macro level, we needed to monitor and identify problems after making changes to our system. For example, we needed to detect failures, latency issues and any other signs of problematic flows.

2. At the micro level, we needed to be able to pinpoint the root cause of any problem we identified. For example, errors, slow operations or incomplete flows, whether in gRPC or Kafka operations or in communication with databases.

To be clear, when we say “visibility” we mean a deep level of granularity, down to the payload level, because a single slow database query can slow down an entire flow, impacting our operations and our customers.

Getting this visibility proved to be a tough nut to crack, not only due to the large number of services and spans, but also due to the complexity of some of our flows.

For example, a single flow might involve up to five services, three databases, and thousands of internal requests.

Attempt no. 1: OpenTelemetry and Jaeger

Naturally, our first attempt was OpenTelemetry with our own Jaeger instance.

This impressive open source toolkit simplifies capturing distributed traces and metrics from applications and infrastructure. The SDKs, the Collector and the OpenTelemetry Protocol (OTLP) let you collect traces and metrics from all your sources and propagate context using the W3C TraceContext and Zipkin B3 formats.
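
To make this concrete, here’s a minimal sketch of what wiring a NodeJS service into such a setup might look like: the SDK exports spans over OTLP to a collector and propagates context in the W3C TraceContext format. (The collector endpoint and package choices here are illustrative, not our exact configuration.)

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { W3CTraceContextPropagator } from "@opentelemetry/core";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

// Illustrative setup: export spans over OTLP to a local collector and
// propagate trace context between services using W3C TraceContext headers.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4317" }),
  textMapPropagator: new W3CTraceContextPropagator(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```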

Here’s a high-level diagram of what the resulting OTel configuration looked like:

As you can see, we used the OTel Collector to collect, process and export data from our services. The data was then sent to another open source tool, Jaeger, which we used to visualize it.

Jaeger is great, but it failed to meet our needs. We weren’t able to cover critical parts of our system, resulting in blind spots when we ran into errors.

Hello, Helios

That’s when we discovered Helios. Helios visualizes distributed traces for quick troubleshooting. We chose Helios over other solutions because it addresses our needs at both the macro and micro levels, and it’s especially strong at the micro level.

Helios treats backend services, such as databases and queues, and protocols, such as gRPC, HTTP, Mongo queries, and others, as first-class citizens. The data is formatted according to what it represents.

For example, a Mongo query is displayed as an actual query with JSON formatting when you look at a MongoDB call. An HTTP call is split into headers and body. A Kafka message, whether published or consumed, shows its headers and payload separately. This view makes it extremely easy to understand why a call or query is slow.
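
Under the hood, this kind of rendering works because the instrumentation records the query itself on the span. As a rough illustration (the service and collection names below are hypothetical, and the attributes follow the OpenTelemetry database semantic conventions rather than anything Helios-specific):

```typescript
import { trace, SpanKind } from "@opentelemetry/api";

// Hypothetical example: record a MongoDB "find" as a client span and attach
// the query via semantic-convention attributes, so a tracing UI can render
// the statement as formatted JSON.
const tracer = trace.getTracer("orders-service");

const span = tracer.startSpan("orders.find", {
  kind: SpanKind.CLIENT,
  attributes: {
    "db.system": "mongodb",
    "db.operation": "find",
    "db.mongodb.collection": "orders",
    "db.statement": JSON.stringify({ status: "failed", retries: { $gt: 3 } }),
  },
});

// ... run the query against the database here ...
span.end();
```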

Helios also provides advanced support for cloud and third-party API calls. For Kafka, Helios displays the list of topics it has collected. For AWS, Helios shows the list of AWS services in use and highlights which of our services are calling them.

In addition, the Helios team has built a whole testing strategy based on traces! We can generate tests with one click from a specific span. There are many other great features as well, such as advanced search, flow previews in search results, error highlighting, flagging of traces that did not complete, and so on.

Our Helios set-up consists of:

  • An OTel collector running on our Kubernetes cluster.
  • The Helios SDK, which each service uses in its own language and which wraps the OTel SDK.
  • Two pipelines:
    • Between the OTel collector and Helios.
    • Between the OTel Collector and Jaeger, with one-day retention. (We use 3% sampling when sending spans to Helios and a much higher sampling rate when sending to Jaeger, but with much lower retention, for development purposes.)
  • Probabilistic sampling of roughly 3% for spans sent to Helios (see the sketch after this list).
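
For reference, here’s a minimal sketch of how a roughly 3% probabilistic sampling decision can be configured on the SDK side of a NodeJS service. (This is an illustration of the technique, not our exact setup; a parent-based sampler makes child spans follow the root decision so sampled traces stay complete.)

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// Sample roughly 3% of traces at the root; child spans inherit the
// parent's decision so a sampled trace is captured end to end.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.03),
  }),
});

sdk.start();
```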

The proof is in the pudding

Adopting Helios as an additional layer on top of OpenTelemetry has proved to be a success. We use Helios on a daily basis when making changes to our system or when trying to identify the source of a problem.

In one case, we used Helios to identify a bad span that occurred when a NodeJS service using the AWS SDK timed out on requests to S3. Thanks to Helios, we were able to identify the problem and fix it quickly.

In another case, one of our complicated flows was failing. The flow involves three services, three databases, and calls over Kafka and gRPC. However, the errors were not propagated properly and logs were missing. With Helios, we could look at the trace and immediately understand the problem end to end.

Another thing we like about Helios is its user interface, which presents the services involved in each flow.

Here’s what that tricky flow looks like in Helios:

Simple and easy to understand, right?

Final comments

We all know the challenges of microservices and how blind we are when something goes wrong. While we are inundated with tools that tell us there is a problem, we were missing a tool that could help us pinpoint exactly where the problem is.

With Helios, we can see the actual queries and payloads without having to dig into span metadata. Displaying them greatly simplifies root cause analysis.

I highly recommend Helios for troubleshooting.
