The SageMaker integration lets you deploy models from the Anaconda Platform model catalog to Amazon SageMaker real-time endpoints using a custom inference container. The container serves an OpenAI-compatible API backed by llama.cpp and handles SageMaker’s /ping and /invocations protocol.
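For orientation, SageMaker's hosting contract expects a container to answer GET /ping health checks and POST /invocations inference requests on port 8080. The sketch below uses only Python's standard library to show the shape of that contract. It is not the Anaconda container's implementation: the names handle, ContractHandler, and make_server are hypothetical, and the /invocations branch only counts messages instead of calling a model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle(method, path, body=b""):
    """Route a request the way SageMaker's hosting contract expects.

    Returns a (status_code, json_payload) tuple. Illustrative only.
    """
    if method == "GET" and path == "/ping":
        # Health check: an HTTP 200 tells SageMaker the container is healthy.
        return 200, {"status": "ok"}
    if method == "POST" and path == "/invocations":
        # A real container would forward the body to the model server here.
        request = json.loads(body or b"{}")
        return 200, {"received_messages": len(request.get("messages", []))}
    return 404, {"error": "not found"}


class ContractHandler(BaseHTTPRequestHandler):
    """Hypothetical HTTP wrapper around handle()."""

    def _respond(self, status, payload):
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._respond(*handle("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(*handle("POST", self.path, self.rfile.read(length)))


def make_server(port=8080):
    """Build a server on the port SageMaker routes traffic to."""
    return HTTPServer(("0.0.0.0", port), ContractHandler)
```

Calling make_server().serve_forever() would start the sketch locally, which can help when reasoning about what the real container must answer before an endpoint is marked healthy.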

Prerequisites

  • Docker with NVIDIA GPU support (nvidia-container-toolkit)
  • Access to Anaconda Platform (Self-hosted) and an API key
  • An AWS account with SageMaker access
  • A SageMaker execution IAM role
  • An Amazon ECR repository to push the container image to

Install

Install the sagemaker extra for anaconda-ai:
pip install 'anaconda-ai[sagemaker]'

Build the container

Clone the anaconda-sagemaker-runtime repository and build the image:
git clone https://github.com/anaconda/anaconda-sagemaker-runtime
cd anaconda-sagemaker-runtime

docker build \
  -t anaconda-sagemaker-runtime:latest \
  -f Dockerfile \
  .

Push to ECR

Authenticate with ECR, then tag and push the image:
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=us-east-1
ECR_REPO=anaconda-sagemaker-runtime

# Create the repository if it does not exist
aws ecr create-repository \
  --repository-name ${ECR_REPO} \
  --region ${AWS_REGION} 2>/dev/null || true

# Authenticate
aws ecr get-login-password --region ${AWS_REGION} \
  | docker login --username AWS --password-stdin \
    ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Tag and push
docker tag anaconda-sagemaker-runtime:latest \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}:latest

docker push \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}:latest
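
The image URI pushed above is the value the Python deployment code needs later. As a small sketch, the helper below rebuilds that URI from the same three values used in the shell commands. The function name ecr_image_uri is hypothetical, the account ID in the usage line is a placeholder, and the hostname pattern assumes a standard commercial AWS region.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return the ECR image URI in the form used by the docker push step.

    Assumes the standard <account>.dkr.ecr.<region>.amazonaws.com hostname.
    """
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# Placeholder account ID; substitute the output of `aws sts get-caller-identity`.
IMAGE_URI = ecr_image_uri("123456789012", "us-east-1", "anaconda-sagemaker-runtime")
```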

Deploy and invoke

Use AnacondaModel to deploy a model from the catalog and invoke the endpoint:
from anaconda_ai.integrations.sagemaker import AnacondaModel
import json

IMAGE_URI = "<account_id>.dkr.ecr.<region>.amazonaws.com/anaconda-sagemaker-runtime:latest"

model = AnacondaModel(
    model_id="Qwen2.5-7B-Instruct/Q4_K_M",
    image_uri=IMAGE_URI,
)

endpoint = model.deploy(instance_type="ml.g5.2xlarge")

response = endpoint.invoke(
    body=json.dumps({
        "messages": [{"role": "user", "content": "What is conda?"}],
        "max_tokens": 256,
    }),
    content_type="application/json",
)
print(json.loads(response.body))

endpoint.delete()
Deleting an endpoint does not delete the underlying deployable model. The model registration persists in SageMaker and can be redeployed at any time from the AWS console or using native AWS tools. You only need to run model.stage() and model.build() once per model.
To pre-stage the model to S3 for a faster cold start (approximately 4 minutes, versus approximately 6 minutes when the container downloads the model from the catalog at startup), pass stage=True:
endpoint = model.deploy(instance_type="ml.g5.2xlarge", stage=True)
You can also call the staging, registration, and deployment steps explicitly:
model.stage()   # upload GGUF to S3 via CodeBuild
model.build()   # register SageMaker Model resource
endpoint = model.deploy(instance_type="ml.g5.2xlarge")
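
The request body in the invocation example follows the OpenAI chat-completions shape. Assuming the response follows the same schema (a choices list whose first entry carries message.content), the sketch below builds a request body and pulls the reply text out of a parsed response. Both helper names are hypothetical and are not part of anaconda-ai.

```python
import json


def build_chat_body(prompt, max_tokens=256, system=None):
    """Serialize a chat request in the OpenAI-compatible format."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return json.dumps({"messages": messages, "max_tokens": max_tokens})


def extract_reply(response_json):
    """Return the assistant text from a parsed chat-completions response.

    Assumes the OpenAI schema; raises ValueError if no choices are present.
    """
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]
```

With these helpers, the body argument in the example could be build_chat_body("What is conda?"), and the printed value could be extract_reply(json.loads(response.body)).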

llama-server tuning

Pass these options to AnacondaModel to configure the inference server’s context size, parallelism, and attention behavior:
model = AnacondaModel(
    model_id="Qwen2.5-7B-Instruct/Q4_K_M",
    image_uri=IMAGE_URI,
    ctx_size=16384,
    parallel=8,
    flash_attn=True,
    cache_type_k="q8_0",
)
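
To see why ctx_size and cache_type_k matter for GPU memory, the sketch below gives a rough estimate of the KV cache size. It uses the usual formula (layers × KV heads × head dimension × context length, once for keys and once for values) and assumes that q8_0 stores each block of 32 values in 34 bytes. The Qwen2.5-7B dimensions in the usage line (28 layers, 4 KV heads, head dimension 128) are assumptions taken from the published model configuration, and the estimate ignores weights and compute buffers.

```python
# Approximate bytes per cached element for two llama.cpp cache types.
# q8_0 packs 32 int8 values plus one fp16 scale into 34 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size,
                   cache_type_k="f16", cache_type_v="f16"):
    """Estimate total KV cache size in bytes for a given context size."""
    elements = n_layers * n_kv_heads * head_dim * ctx_size
    k_bytes = int(elements * BYTES_PER_ELEMENT[cache_type_k])
    v_bytes = int(elements * BYTES_PER_ELEMENT[cache_type_v])
    return k_bytes + v_bytes


# Assumed Qwen2.5-7B dimensions with the settings from the example above.
estimate = kv_cache_bytes(28, 4, 128, ctx_size=16384, cache_type_k="q8_0")
print(f"{estimate / 2**20:.0f} MiB")
```

Under these assumptions, quantizing only the key cache trims the estimate relative to the f16 default; actual memory use depends on the llama.cpp build and the model.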

Environment variables

Required for catalog download

The following environment variables are required when the container downloads the model from the catalog at startup.
When using stage=True or model.stage(), the model is pre-staged to S3 and these variables are not required.

Optional

Enabling LOG_REQUEST_BODY writes all prompts to CloudWatch Logs. Ensure log retention is configured appropriately for your data policy.
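As one way to act on the retention advice, the sketch below sets a retention period on an endpoint's log group using the CloudWatch Logs put_retention_policy call. It assumes the endpoint logs to SageMaker's usual /aws/sagemaker/Endpoints/<endpoint-name> log group and that logs_client is a Boto3 CloudWatch Logs client; the helper names and the endpoint name in the comment are hypothetical.

```python
def endpoint_log_group(endpoint_name):
    """Return the log group SageMaker endpoints conventionally write to."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def set_log_retention(logs_client, endpoint_name, days=30):
    """Apply a retention period (in days) to the endpoint's log group.

    logs_client is expected to behave like boto3.client("logs").
    CloudWatch accepts only specific retention values, such as 1, 7, 30, or 90.
    """
    logs_client.put_retention_policy(
        logGroupName=endpoint_log_group(endpoint_name),
        retentionInDays=days,
    )


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   set_log_retention(boto3.client("logs"), "my-endpoint", days=30)
```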
Any llama-server argument that supports an environment variable can be passed as LLAMA_ARG_*. For example: LLAMA_ARG_BATCH=4096, LLAMA_ARG_FLASH_ATTN=on.
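The LLAMA_ARG_* naming can be sketched as a small mapping helper. The function below only builds a string-to-string dictionary with the prefix applied. The helper name llama_env is hypothetical, how the variables reach the container is not covered here, and whether a given argument has an environment variable should be checked against the llama-server documentation.

```python
def llama_env(**options):
    """Turn keyword options into LLAMA_ARG_* environment variables.

    Values are converted to strings because environment variables are text.
    """
    return {f"LLAMA_ARG_{name.upper()}": str(value) for name, value in options.items()}


# Reproduces the two examples given above.
env = llama_env(batch=4096, flash_attn="on")
```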

Supported instance types

GPU is required. Recommended instance families:

Model identifier format

Models in the Anaconda Platform model catalog are identified using the format:
<ModelName>/<Quantization>
Anaconda Platform supports the following quantization methods: Q4_K_M, Q5_K_M, Q6_K, Q8_0. Examples:
Qwen2.5-7B-Instruct/Q4_K_M
Llama-3.2-3B-Instruct/Q5_K_M
See the Model Catalog for available models and their quantizations.
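
As a minimal sketch of the identifier format, the helper below splits an identifier into its model name and quantization and rejects quantizations outside the four listed above. The function name parse_model_id is hypothetical and is not part of anaconda-ai; it does not check whether the model actually exists in the catalog.

```python
SUPPORTED_QUANTIZATIONS = {"Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"}


def parse_model_id(model_id):
    """Split '<ModelName>/<Quantization>' into a (name, quantization) tuple."""
    name, separator, quantization = model_id.rpartition("/")
    if not separator or not name or not quantization:
        raise ValueError(f"expected <ModelName>/<Quantization>, got {model_id!r}")
    if quantization not in SUPPORTED_QUANTIZATIONS:
        raise ValueError(f"unsupported quantization: {quantization!r}")
    return name, quantization
```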

Next steps

Once your endpoint is in service, you manage it with standard AWS tooling. You can monitor and manage the endpoint, view startup and inference logs in CloudWatch, and configure auto-scaling or other post-creation settings using the AWS CLI or Boto3. See the Amazon SageMaker real-time inference documentation for details.
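
As a small illustration of the Boto3 route, the sketch below reads an endpoint's status through the SageMaker describe_endpoint call. It assumes sm_client behaves like a Boto3 SageMaker client; the helper names and the endpoint name in the comment are hypothetical.

```python
def endpoint_status(sm_client, endpoint_name):
    """Return the endpoint's status string, for example 'Creating' or 'InService'."""
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


def is_in_service(sm_client, endpoint_name):
    """Report whether the endpoint is ready to accept invocations."""
    return endpoint_status(sm_client, endpoint_name) == "InService"


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   is_in_service(boto3.client("sagemaker"), "my-endpoint")
```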