Monitoring Amazon SageMaker Endpoint With Datadog

SageMaker Endpoint Traces in Datadog

Amazon SageMaker is a powerful tool for Data Scientists and Software Engineers to prepare, build, train, and deploy high-quality machine learning models. SageMaker Endpoint is an AWS fully managed service that provides an API to do real-time inference.

As with any service, it is crucial for data scientists and engineers to understand the endpoint's performance so that they can take action to improve its latency.

The Datadog agent collects traces that give deep insight into the performance of a running application. In this post, I will show how to set up Datadog to monitor the performance of SageMaker Endpoints.

We manage the Docker image of our SageMaker Endpoints ourselves. The service is written in Python with the Flask framework. We deployed the endpoints into a private AWS subnet with a NAT Gateway. This network setup ensures the ML endpoints only serve internal requests while still being able to reach the outside world when necessary (for example, to let the Datadog agent send traces to the Datadog server).

1. Install the ddtrace package.

Note that due to a version constraint from another package, 0.38.4 is the latest version we can use. You should use the latest version if possible.
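Assuming a standard pip-based image build, pinning the version might look like:

```shell
# Pin ddtrace to the newest version our other dependencies allow.
pip install "ddtrace==0.38.4"
```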

2. Wrap functions with traces.

Since ddtrace integrates with Flask automatically, you get many traces out of the box, including the total time of a route like /invocations. You can also wrap the code of interest in a trace to get more granular insight.

3. Specify the host and port of the Datadog agent

You need to set two environment variables:

I will discuss the details of the Datadog hostname in the last section. For now, let's assume the Datadog agent is running and has the hostname datadog.demo.shift.com.
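With the standard ddtrace environment variable names (taken from the ddtrace documentation), that would be:

```shell
# Point the tracer at the Datadog agent; 8126 is the default trace port.
export DD_AGENT_HOST=datadog.demo.shift.com
export DD_TRACE_AGENT_PORT=8126
```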

4. Prefix the command to run your server with ddtrace-run.

In prod, we also use Gunicorn on top of Flask. Our original command to run the server is:

Running it with ddtrace is just a matter of adding the ddtrace-run prefix:
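The Gunicorn invocation below is an illustrative guess at the usual form (the bind address and `app:app` module path are hypothetical); only the ddtrace-run prefix changes:

```shell
# Original (hypothetical) server command:
gunicorn --bind 0.0.0.0:8080 app:app

# Same command, traced:
ddtrace-run gunicorn --bind 0.0.0.0:8080 app:app
```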

5. Redeploy the SageMaker endpoint and see traces appear in Datadog!

The SageMaker endpoint takes 10.2 ms to serve this inference request, and 90% of that time is spent getting a value from Redis. The endpoint in this demo is very fast because of the model it uses. For endpoints with a more complex model, the traces can help identify the performance bottleneck.

6. Set up a fixed Datadog Hostname

In prod, things can be a little different. You will likely have many SageMaker endpoints, and you want to monitor all of them. A clean way to do this is to set up a standalone Datadog Agent Service. The Datadog Agent Service has a load balancer with a friendly hostname that listens on port 8126 for traces. Your SageMaker endpoints send their traces to this hostname on port 8126, and the load balancer forwards the traffic to an auto-scaling group of Datadog agents. You can therefore have as many SageMaker endpoints as your business needs, and the number of Datadog agents behind the scenes scales up and down automatically.

Network Diagram

In a nutshell, to set up the Datadog Agent Service, you need to:

- Run Datadog agents (with APM enabled and listening on port 8126) in an auto-scaling group.
- Put a load balancer in front of the group with a TCP listener on port 8126.
- Give the load balancer a friendly DNS hostname that your endpoints can resolve.
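A minimal sketch of running one such agent, using the documented Datadog agent container image and APM environment variables (the API key is a placeholder):

```shell
# Run the Datadog agent with APM enabled, accepting traces
# from other hosts on port 8126.
docker run -d --name datadog-agent \
  -e DD_API_KEY=<your-api-key> \
  -e DD_APM_ENABLED=true \
  -e DD_APM_NON_LOCAL_TRAFFIC=true \
  -p 8126:8126/tcp \
  gcr.io/datadoghq/agent:7.29.1
```

DD_APM_NON_LOCAL_TRAFFIC is the important flag here: without it, the agent only accepts traces from localhost, which defeats the load-balanced setup.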

I initially used Datadog agent 7.23.1, but that agent could not pass the target group's health check (TCP, port 8126); upgrading the agent to 7.29.1 solved the problem.
