The SageMaker integration lets you deploy models from the Anaconda Platform model catalog to Amazon SageMaker real-time endpoints using a custom inference container. The container serves an OpenAI-compatible API backed by llama.cpp and handles SageMaker’s /ping and /invocations protocol.
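To make the /ping and /invocations contract concrete, here is a minimal standard-library sketch of the two routes SageMaker calls on a real-time inference container. It is not the Anaconda container's implementation: the handler class, the `demo` function, and the placeholder reply are hypothetical, and a real container listens on port 8080 rather than the ephemeral port used here.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SageMakerStyleHandler(BaseHTTPRequestHandler):
    """Toy handler for the two routes SageMaker calls on a container."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: SageMaker expects HTTP 200 once the container is ready.
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        # Inference: SageMaker forwards the InvokeEndpoint request body here.
        if self.path != "/invocations":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder reply; a real container would run the model here.
        self._send_json(200, {"received_messages": len(request.get("messages", []))})

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Start the toy server on a free local port, call both routes, shut down."""
    server = HTTPServer(("127.0.0.1", 0), SageMakerStyleHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{port}"
    try:
        with urllib.request.urlopen(f"{base}/ping") as resp:
            ping_status = resp.status
        payload = {"messages": [{"role": "user", "content": "What is conda?"}]}
        req = urllib.request.Request(
            f"{base}/invocations",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            reply = json.loads(resp.read())
    finally:
        server.shutdown()
        server.server_close()
    return ping_status, reply
```

Calling `demo()` exercises both routes once against the local toy server and returns the health-check status together with the decoded placeholder reply.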
Prerequisites
- Docker with NVIDIA GPU support (nvidia-container-toolkit)
- Access to Anaconda Platform (Self-hosted) and an API key
- An AWS account with SageMaker access
- A SageMaker execution IAM role
- An Amazon ECR repository to push the container image to
Install
Install the sagemaker extra for anaconda-ai:
pip install 'anaconda-ai[sagemaker]'
Build the container
Clone the anaconda-sagemaker-runtime repository and build the image:
git clone https://github.com/anaconda/anaconda-sagemaker-runtime
cd anaconda-sagemaker-runtime
docker build \
  -t anaconda-sagemaker-runtime:latest \
  -f Dockerfile \
  .
Push to ECR
Authenticate with ECR, then tag and push the image:
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=us-east-1
ECR_REPO=anaconda-sagemaker-runtime
# Create the repository if it does not exist
aws ecr create-repository \
  --repository-name ${ECR_REPO} \
  --region ${AWS_REGION} 2>/dev/null || true
# Authenticate
aws ecr get-login-password --region ${AWS_REGION} \
  | docker login --username AWS --password-stdin \
    ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com
# Tag and push
docker tag anaconda-sagemaker-runtime:latest \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}:latest
docker push \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPO}:latest
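The image URI pushed above is the same string that the deployment step expects. As a small sketch, it can be assembled in Python from the account ID, region, and repository name. The function name `ecr_image_uri` is hypothetical, and the sketch assumes the standard `amazonaws.com` domain shown in the commands above.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build an ECR image URI of the form used in the push commands above."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account_id!r}")
    if not region or not repository or not tag:
        raise ValueError("region, repository, and tag must be non-empty")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Example with a placeholder account ID.
IMAGE_URI = ecr_image_uri("123456789012", "us-east-1", "anaconda-sagemaker-runtime")
```

Building the URI in one place helps keep the value passed as image_uri consistent with the tag that was actually pushed.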
Deploy and invoke
Use AnacondaModel to deploy a model from the catalog and invoke the endpoint:
import json

from anaconda_ai.integrations.sagemaker import AnacondaModel
IMAGE_URI = "<account_id>.dkr.ecr.<region>.amazonaws.com/anaconda-sagemaker-runtime:latest"
model = AnacondaModel(
    model_id="Qwen2.5-7B-Instruct/Q4_K_M",
    image_uri=IMAGE_URI,
)
endpoint = model.deploy(instance_type="ml.g5.2xlarge")
response = endpoint.invoke(
    body=json.dumps({
        "messages": [{"role": "user", "content": "What is conda?"}],
        "max_tokens": 256,
    }),
    content_type="application/json",
)
print(json.loads(response.body))
endpoint.delete()
Deleting an endpoint does not delete the underlying deployable model. The model registration persists in SageMaker and can be redeployed at any time from the AWS console or using native AWS tools. The explicit staging and registration steps, model.stage() and model.build(), only need to run once per model.
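The deploy example above prints the whole decoded response. Because the container serves an OpenAI-compatible API, the reply text is likely found under `choices[0].message.content`. The sketch below assumes that shape; the helper name `extract_reply` and the sample payload are hypothetical and were not captured from a real endpoint.

```python
import json


def extract_reply(raw_body):
    """Return the assistant text from an OpenAI-style chat completion body.

    Accepts either the raw JSON (str or bytes) or an already decoded dict.
    """
    if isinstance(raw_body, (str, bytes, bytearray)):
        payload = json.loads(raw_body)
    else:
        payload = raw_body
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Assumed layout: the first choice carries the assistant message.
    return choices[0]["message"]["content"]


# Hypothetical response body, shaped like an OpenAI chat completion.
sample_body = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Conda is a package manager."}}
    ]
})
reply_text = extract_reply(sample_body)
```

With the example above, this would be called as `extract_reply(response.body)` in place of printing the full payload, provided the endpoint returns the assumed structure.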
To pre-stage the model to S3 for a faster cold start (approximately 4 minutes versus 6 minutes), pass stage=True:
endpoint = model.deploy(instance_type="ml.g5.2xlarge", stage=True)
You can also call the staging, registration, and deployment steps explicitly:
model.stage() # upload GGUF to S3 via CodeBuild
model.build() # register SageMaker Model resource
endpoint = model.deploy(instance_type="ml.g5.2xlarge")
The same workflow is available from the command line:
IMAGE_URI="<account_id>.dkr.ecr.<region>.amazonaws.com/anaconda-sagemaker-runtime:latest"
# Deploy; the container downloads the model from the catalog at startup
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI
# Deploy with S3 staging for faster cold start
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI --stage
# Register model only, without creating an endpoint
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI --build-only
# Stage a model to S3 without deploying
anaconda ai stage Qwen2.5-7B-Instruct/Q4_K_M
# List staged models
anaconda ai stage --list
Run anaconda ai deploy --help to see all available options.
llama-server tuning
Use these options to configure the inference server’s context size, parallelism, and attention behavior.
model = AnacondaModel(
    model_id="Qwen2.5-7B-Instruct/Q4_K_M",
    image_uri=IMAGE_URI,
    ctx_size=16384,
    parallel=8,
    flash_attn=True,
    cache_type_k="q8_0",
)
The equivalent CLI invocation:
anaconda ai deploy Qwen2.5-7B-Instruct/Q4_K_M --image-uri $IMAGE_URI \
  --ctx-size 16384 --parallel 8 --flash-attn --cache-type-k q8_0
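To get a feel for how the context size and the K-cache type affect GPU memory, the following back-of-the-envelope sketch estimates the size of the key/value cache. It is a rough estimate under stated assumptions, not a measurement of the container. The function name is hypothetical, and the layer count, KV-head count, and head dimension are illustrative values that should be read from the model's own metadata. The estimate ignores model weights and compute buffers.

```python
# Bytes used to store one block of 32 cached values for each cache type.
# f16 uses 2 bytes per value; q8_0 stores 32 int8 values plus one f16 scale.
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size,
                   cache_type_k="f16", cache_type_v="f16"):
    """Estimate total KV-cache size in bytes for a given total context size."""
    # Number of cached values for keys (and, separately, for values).
    values_per_side = n_layers * n_kv_heads * head_dim * ctx_size
    k_bytes = values_per_side * BYTES_PER_32_VALUES[cache_type_k] // 32
    v_bytes = values_per_side * BYTES_PER_32_VALUES[cache_type_v] // 32
    return k_bytes + v_bytes


# Illustrative architecture numbers (assumed, verify against the model metadata).
ARCH = dict(n_layers=28, n_kv_heads=4, head_dim=128)

baseline = kv_cache_bytes(ctx_size=16384, **ARCH)
with_q8_keys = kv_cache_bytes(ctx_size=16384, cache_type_k="q8_0", **ARCH)
print(f"f16 K and V: {baseline / 2**20:.0f} MiB")
print(f"q8_0 K, f16 V: {with_q8_keys / 2**20:.0f} MiB")
```

Under these assumptions, quantizing only the K cache to q8_0 trims a little under a quarter of the cache, which is why it pairs well with a larger context size.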
Environment variables
Required for catalog download
The following environment variables are required when the container downloads the model from the catalog at startup.
When using stage=True or model.stage(), the model is pre-staged to S3 and these variables are not required.
Optional
Enabling LOG_REQUEST_BODY writes all prompts to CloudWatch Logs. Ensure log retention is configured appropriately for your data policy.
Any llama-server argument that supports an environment variable can be passed as LLAMA_ARG_*. For example: LLAMA_ARG_BATCH=4096, LLAMA_ARG_FLASH_ATTN=on.
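As a small illustration of the LLAMA_ARG_* convention, the sketch below turns a dictionary of settings into environment variable names and string values. The helper name `llama_arg_env` is hypothetical, and the suffixes must already match the names llama-server documents (for example BATCH, not a guess derived from a command-line flag). The "on"/"off" spelling for booleans follows the FLASH_ATTN example above and may not suit every argument. How the resulting variables are handed to the container is not covered here.

```python
def llama_arg_env(options):
    """Map {"BATCH": 4096, ...} to {"LLAMA_ARG_BATCH": "4096", ...}."""
    env = {}
    for name, value in options.items():
        key = name.upper()
        if not key.startswith("LLAMA_ARG_"):
            key = "LLAMA_ARG_" + key
        # Environment variables are always strings.
        if isinstance(value, bool):
            value = "on" if value else "off"
        env[key] = str(value)
    return env


env = llama_arg_env({"BATCH": 4096, "FLASH_ATTN": True})
```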
Supported instance types
GPU is required. Recommended instance families:
Models in the Anaconda Platform model catalog are identified using the format:
<ModelName>/<Quantization>
Anaconda Platform supports the following quantization methods: Q4_K_M, Q5_K_M, Q6_K, Q8_0.
Examples:
Qwen2.5-7B-Instruct/Q4_K_M
Llama-3.2-3B-Instruct/Q5_K_M
See the Model Catalog for available models and their quantizations.
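A model identifier can be checked before a deployment is started. The sketch below splits an identifier into its name and quantization and validates the quantization against the four methods listed above. The function name `parse_model_id` is hypothetical and is not part of anaconda-ai.

```python
SUPPORTED_QUANTIZATIONS = ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0")


def parse_model_id(model_id):
    """Split '<ModelName>/<Quantization>' and validate the quantization."""
    # Split on the last slash so the quantization is always the final part.
    name, sep, quantization = model_id.rpartition("/")
    if not sep or not name or not quantization:
        raise ValueError(f"expected '<ModelName>/<Quantization>', got {model_id!r}")
    if quantization not in SUPPORTED_QUANTIZATIONS:
        raise ValueError(
            f"unsupported quantization {quantization!r}; "
            f"expected one of {', '.join(SUPPORTED_QUANTIZATIONS)}"
        )
    return name, quantization


name, quantization = parse_model_id("Qwen2.5-7B-Instruct/Q4_K_M")
```

A check like this only validates the format. Whether a given model offers a given quantization still has to be confirmed in the Model Catalog.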
Next steps
Once your endpoint is in service, you’re back to standard AWS. You can monitor and manage the endpoint, view startup and inference logs in CloudWatch, and configure auto-scaling or other post-creation settings using the AWS CLI or Boto3. See the Amazon SageMaker real-time inference documentation for details.
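As one example of managing the endpoint with Boto3, the sketch below polls the endpoint status until it reaches InService. It relies on the SageMaker client's `describe_endpoint` call and its `EndpointStatus` field. The function names, polling interval, timeout, and endpoint name are hypothetical. The client is passed in as a parameter, so Boto3 itself is only needed by the caller.

```python
import time


def endpoint_status(client, endpoint_name):
    """Return the current status string, for example 'Creating' or 'InService'."""
    return client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


def wait_until_in_service(client, endpoint_name, poll_seconds=30.0,
                          timeout_seconds=1800.0, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = endpoint_status(client, endpoint_name)
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name!r} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name!r} still {status} after timeout")
        sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials; the endpoint name is a placeholder):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   wait_until_in_service(client, "my-endpoint-name")
```

Boto3 also ships a built-in `endpoint_in_service` waiter for the SageMaker client, which may be preferable when custom polling behavior is not needed.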