Amazon SageMaker - Anaconda

The SageMaker integration lets you deploy models from the Anaconda Platform model catalog to Amazon SageMaker real-time endpoints using a custom inference container. The container serves an OpenAI-compatible API backed by llama.cpp and handles SageMaker’s /ping and /invocations protocol.
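
For orientation, SageMaker's hosting contract expects a container to answer `GET /ping` health checks and `POST /invocations` inference requests on port 8080. The following is a minimal sketch of that contract using only the Python standard library. It is not the runtime container's implementation: `handle_invocation` is a hypothetical stand-in for the llama.cpp-backed server, and the response shape is only loosely modeled on an OpenAI-style chat completion.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_invocation(payload: dict) -> dict:
    """Hypothetical stand-in for the model server: echoes the last message."""
    last = payload["messages"][-1]["content"]
    return {"choices": [{"message": {"role": "assistant", "content": f"echo: {last}"}}]}


class ContractHandler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Health check: SageMaker polls GET /ping and expects HTTP 200.
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        # Inference: InvokeEndpoint request bodies arrive at POST /invocations.
        if self.path != "/invocations":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        self._send_json(200, handle_invocation(payload))

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo() -> tuple:
    """Start the server on a free local port, call both routes, return the results."""
    # Port 0 picks a free port for the demo; SageMaker itself uses port 8080.
    server = HTTPServer(("127.0.0.1", 0), ContractHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    # Bypass any configured HTTP proxy so the request stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base}/ping") as resp:
            ping_status = resp.status
        body = {"messages": [{"role": "user", "content": "What is conda?"}]}
        request = urllib.request.Request(
            f"{base}/invocations",
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with opener.open(request) as resp:
            reply = json.loads(resp.read())
    finally:
        server.shutdown()
        server.server_close()
    return ping_status, reply
```

Calling `demo()` exercises both routes locally, which can help when reasoning about what the endpoint's health checks and invocations look like from the container's side.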

Prerequisites

Docker with NVIDIA GPU support (nvidia-container-toolkit)
Access to Anaconda Platform (Self-hosted) and an API key
An AWS account with SageMaker access
A SageMaker execution IAM role
An Amazon ECR repository to push the container image to

Install

Install the sagemaker extra for anaconda-ai:

pip install 'anaconda-ai[sagemaker]'

Build the container

Clone the anaconda-sagemaker-runtime repository and build the image:

git clone https://github.com/anaconda/anaconda-sagemaker-runtime
cd anaconda-sagemaker-runtime

docker build \
  -t anaconda-sagemaker-runtime:latest \
  -f Dockerfile \
  .

Push to ECR

Authenticate with ECR, then tag and push the image:

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=us-east-1
ECR_REPO=anaconda-sagemaker-runtime

# Create the repository if it does not exist
aws ecr create-repository \
  --repository-name ${ECR_REPO} \
  --region ${AWS_REGION} 2>/dev/null || true

# Authenticate
aws ecr get-login-password --region ${AWS_REGION} \
  | docker login --username AWS --password-stdin \
    ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

# Tag and push
docker tag anaconda-sagemaker-runtime:latest \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}:latest

docker push \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}:latest
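
The image URI pushed above is the value later passed as `IMAGE_URI`. As a small sketch, the same string can be assembled in Python. The helper name `ecr_image_uri` is hypothetical, and the format assumes a standard commercial AWS region (other partitions use a different domain suffix).

```python
import re


def ecr_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    """Build an ECR image URI in the same shape as the shell commands above."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account_id!r}")
    if not region or not repo or not tag:
        raise ValueError("region, repo, and tag must be non-empty")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"
```

For example, `ecr_image_uri("123456789012", "us-east-1", "anaconda-sagemaker-runtime")` yields a URI matching the tag used in the push step, with a placeholder account ID.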

Deploy and invoke

Python SDK

Use AnacondaModel to deploy a model from the catalog and invoke the endpoint:

from anaconda_ai.integrations.sagemaker import AnacondaModel
import json

IMAGE_URI = "<account_id>.dkr.ecr.<region>.amazonaws.com/anaconda-sagemaker-runtime:latest"

model = AnacondaModel(
    model_id="Qwen2.5-7B-Instruct/Q4_K_M",
    image_uri=IMAGE_URI,
)

endpoint = model.deploy(instance_type="ml.g5.2xlarge")

response = endpoint.invoke(
    body=json.dumps({
        "messages": [{"role": "user", "content": "What is conda?"}],
        "max_tokens": 256,
    }),
    content_type="application/json",
)
print(json.loads(response.body))

endpoint.delete()

Deleting an endpoint does not delete the underlying deployable model. The model registration persists in SageMaker and can be redeployed at any time from the AWS console or using native AWS tools. You only need to run model.stage() and model.build() once per model.

To pre-stage the model to S3 for a faster cold start (approximately 4 minutes versus 6 minutes), pass stage=True:

endpoint = model.deploy(instance_type="ml.g5.2xlarge", stage=True)

You can also call the staging, registration, and deployment steps explicitly:

model.stage()   # upload GGUF to S3 via CodeBuild
model.build()   # register SageMaker Model resource
endpoint = model.deploy(instance_type="ml.g5.2xlarge")

CLI

IMAGE_URI="<account_id>.dkr.ecr.<region>.amazonaws.com/anaconda-sagemaker-runtime:latest"

# Deploy — container downloads model from the catalog at startup
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI

# Deploy with S3 staging for faster cold start
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI --stage

# Register model only, without creating an endpoint
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI --build-only

# Stage a model to S3 without deploying
anaconda ai stage Qwen2.5-7B-Instruct/Q4_K_M

# List staged models
anaconda ai stage --list

Run anaconda ai deploy --help to see all available options.

llama-server tuning

Use these options to configure the inference server’s context size, parallelism, and attention behavior.

Python SDK

model = AnacondaModel(
    model_id="Qwen2.5-7B-Instruct/Q4_K_M",
    image_uri=IMAGE_URI,
    ctx_size=16384,
    parallel=8,
    flash_attn=True,
    cache_type_k="q8_0",
)

CLI

anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI \
    --ctx-size 16384 --parallel 8 --flash-attn --cache-type-k q8_0
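
When choosing these values together, it can help to estimate how much context each concurrent request gets. llama.cpp's server has traditionally divided the total context evenly across its parallel slots. Whether the llama.cpp build inside the runtime image behaves this way is an assumption, so treat the following as a rough sizing sketch rather than a guarantee. The helper name `tokens_per_slot` is hypothetical.

```python
def tokens_per_slot(ctx_size: int, parallel: int) -> int:
    """Estimate the context window available to each parallel request.

    Assumes the total context (ctx_size) is split evenly across the
    parallel slots, which may not hold for every llama.cpp version.
    """
    if ctx_size <= 0 or parallel <= 0:
        raise ValueError("ctx_size and parallel must be positive integers")
    return ctx_size // parallel
```

Under that assumption, the settings shown above (`ctx_size=16384`, `parallel=8`) would leave 2048 tokens of context per concurrent request, covering prompt and completion combined.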

Environment variables

Required for catalog download

The following environment variables are required when the container downloads the model from the catalog at startup.

When using stage=True or model.stage(), the model is pre-staged to S3 and these variables are not required.

Optional

Enabling LOG_REQUEST_BODY writes all prompts to CloudWatch Logs. Ensure log retention is configured appropriately for your data policy.

Any llama-server argument that supports an environment variable can be passed as LLAMA_ARG_*. For example: LLAMA_ARG_BATCH=4096, LLAMA_ARG_FLASH_ATTN=on.
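
As a sketch of the `LLAMA_ARG_*` convention, the helper below turns a mapping of option names into an environment dictionary with string values. The helper name `llama_arg_env` is hypothetical, and it does not check that a name is a variable llama-server actually recognizes. Variable names do not always match the CLI flag names, so confirm them against the llama-server documentation. How the resulting variables are supplied to the container is not covered on this page.

```python
def llama_arg_env(options: dict) -> dict:
    """Map {"BATCH": 4096} to {"LLAMA_ARG_BATCH": "4096"}.

    Names are upper-cased, hyphens become underscores, and the LLAMA_ARG_
    prefix is added when missing. Values are converted to strings because
    environment variables are always strings.
    """
    env = {}
    for name, value in options.items():
        key = name.upper().replace("-", "_")
        if not key.startswith("LLAMA_ARG_"):
            key = f"LLAMA_ARG_{key}"
        env[key] = str(value)
    return env
```

With the examples from the text, `llama_arg_env({"BATCH": 4096, "FLASH_ATTN": "on"})` produces `LLAMA_ARG_BATCH=4096` and `LLAMA_ARG_FLASH_ATTN=on`.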

Supported instance types

GPU is required. Recommended instance families:

Model identifier format

Models in the Anaconda Platform model catalog are identified using the format:

<ModelName>/<Quantization>

Anaconda Platform supports the following quantization methods: Q4_K_M, Q5_K_M, Q6_K, Q8_0. Examples:

Qwen2.5-7B-Instruct/Q4_K_M
Llama-3.2-3B-Instruct/Q5_K_M

See the Model Catalog for available models and their quantizations.
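
To catch malformed identifiers before starting a deployment, the format can be checked locally. The following is a minimal sketch and not part of anaconda-ai: `parse_model_id` and `ModelId` are hypothetical names, and the check only covers the quantization methods listed above, not whether a given model exists in the catalog.

```python
from typing import NamedTuple

# Quantization methods listed in this section.
SUPPORTED_QUANTIZATIONS = ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0")


class ModelId(NamedTuple):
    name: str
    quantization: str


def parse_model_id(model_id: str) -> ModelId:
    """Split '<ModelName>/<Quantization>' and validate the quantization."""
    # Split on the last slash so the quantization is always the final segment.
    name, sep, quantization = model_id.rpartition("/")
    if not sep or not name or not quantization:
        raise ValueError(f"expected '<ModelName>/<Quantization>', got {model_id!r}")
    if quantization not in SUPPORTED_QUANTIZATIONS:
        raise ValueError(
            f"unsupported quantization {quantization!r}; "
            f"expected one of {', '.join(SUPPORTED_QUANTIZATIONS)}"
        )
    return ModelId(name, quantization)
```

For instance, `parse_model_id("Llama-3.2-3B-Instruct/Q5_K_M")` returns the model name and quantization as separate fields, while an identifier without a slash raises `ValueError`.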

Next steps

Once your endpoint is in service, you’re back to standard AWS. You can monitor and manage the endpoint, view startup and inference logs in CloudWatch, and configure auto-scaling or other post-creation settings using the AWS CLI or Boto3. See the Amazon SageMaker real-time inference documentation for details.
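
As one possible illustration of working with the endpoint through Boto3, the sketch below polls `describe_endpoint` until the endpoint reports `InService`, then sends a chat request with `invoke_endpoint`. The function names are hypothetical, and the request body mirrors the SDK example earlier on this page. That the endpoint accepts the same body through Boto3 is an assumption based on the container's `/invocations` support. Clients are passed in as arguments so the functions can be exercised without AWS credentials.

```python
import json
import time


def wait_until_in_service(sm_client, endpoint_name: str,
                          poll_seconds: float = 30.0,
                          timeout_seconds: float = 1800.0) -> str:
    """Poll DescribeEndpoint until the endpoint is InService.

    sm_client is expected to be boto3.client("sagemaker") or a compatible object.
    Raises RuntimeError if the endpoint fails and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


def invoke_chat(runtime_client, endpoint_name: str, prompt: str,
                max_tokens: int = 256) -> dict:
    """Send one chat message and return the decoded JSON response.

    runtime_client is expected to be boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())
```

In practice these would be called with `boto3.client("sagemaker")` and `boto3.client("sagemaker-runtime")` and the name of your endpoint. Boto3 also ships a built-in `endpoint_in_service` waiter on the SageMaker client, which can replace the polling loop when custom timeout handling is not needed.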
