What metrics does eval.ninja support?

eval.ninja supports the standard RAG evaluation metrics: faithfulness, answer relevance, context precision, context recall, and context relevance. It also supports custom LLM-as-a-judge rubrics for open-ended scoring.

Does my data ever leave my server?

If you self-host with Docker, nothing leaves your infrastructure except the calls you make to your own LLM provider via your own API keys. On the managed cloud, your eval inputs are sent to our judge models, processed, and not used for training.

Can I bring my own API keys?

Yes. On self-host, you bring your own keys for your judge model provider. eval.ninja never sees them. On managed cloud, the judge model is included in your credit usage so you do not need a separate key.

Where can I run the self-hosted Docker image?

Anywhere you can run a container, including AWS ECS or Fargate, Google Cloud Run, Azure Container Apps, Fly.io, Railway, Kubernetes, or a plain VM. It also runs as a serverless function for spiky workloads.

Do I need to use Docker?

No. The managed cloud has no install step. Sign up and start calling the API. Docker is for teams that need full data isolation or want to run evals inside their own network.

How is pricing structured?

Self-hosting is commercially licensed, and you provide the compute and your own LLM keys. Managed cloud is credit-based, starting with 100 free credits and paid tiers from $2.99 to $30 per month.

API · Licensed Self-Host · Cloud

LLM evaluation at the speed of a single API call

Faithfulness, relevance, and judge-based scoring in one API call, for CI checks and live production monitoring.

Get 100 Free Credits · Self-Host with Docker

$ docker pull evalninja/eval:latest

eval.ninja - terminal

eval.ninja - running evaluation suite...
✓ Faithfulness 0.94
✓ Answer Relevance 0.87
✓ Context Precision 0.91
✗ Context Recall 0.62
3/4 metrics passing · completed in 4.2s

Runs anywhere you run containers

AWS Lambda · ECS / Fargate · Google Cloud Run · Azure Container Apps · Fly.io · Kubernetes · Bare metal

No middleware

Eval latency is model latency

No queues, no workers, no async pipelines. Your HTTP call hits the judge model and returns. Most eval platforms add an orchestration layer between your code and the model. We don't.

Others: your app → queue → worker pool → judge model → response queue → your app

eval.ninja: your app → judge model → your app

OPENAI_API_KEY=sk-••••••••••••     # stays in your environment, not ours
ANTHROPIC_API_KEY=sk-ant-••••••    # or any provider you choose
# eval.ninja never sees your keys
# you pay your provider directly

Your keys, your cost

No markup. No middleman.

Self-host under a commercial license with your own LLM provider keys. eval.ninja never sees them. You pay your provider at their posted rate. No token markup, no bundled API fees.

REST, any language

One endpoint, no SDK required. Call from bash, Python, Go, Node, Rust. Any HTTP client works.

RAG metrics out of the box

Faithfulness, answer relevance, context precision, context recall. Plus custom LLM-as-a-judge rubrics for any open-ended task.

Design-time and runtime

Gate prompt changes before deploys, then sample live outputs to catch quality drift, hallucinations, and unsafe automations.
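Sampling live outputs can be sketched as a deterministic, hash-based check, so the same conversation is always either in or out of the sample. This is a minimal illustration, not part of eval.ninja: the function name `should_sample`, the trace IDs, and the 25% rate are all assumptions.

```python
import hashlib


def should_sample(trace_id, rate):
    """Deterministically select roughly `rate` (0.0 to 1.0) of trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


# Hypothetical trace IDs; only the sampled ones would be sent for scoring.
sampled = [t for t in ("trace-1", "trace-2", "trace-3") if should_sample(t, 0.25)]
```

Hashing the ID instead of drawing a random number keeps the decision stable across retries and across processes.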

Runtime evals

Score the message your app actually generated

Design-time evals tell you whether a prompt, model, or retrieval change is ready to ship. Runtime evals check the answer your app generated for a real user.

Runtime eval guide

Sampling, blocking checks, async judging, and alert thresholds.

Read guide →

Lead generation evals

Verify generated outreach before it is sent or handed to sales.

Read guide →

Production checks

Chatbot answer · pass

faithfulness 0.91 · policy 0.96 · helpfulness 0.88

Send to user, store score on conversation trace

Lead email · review

personalization 0.78 · claim support 0.64 · tone 0.93

Hold message because one generated claim lacks CRM evidence

Support escalation · block

policy 0.42 · groundedness 0.57 · confidence 0.49

Route to fallback or human review before action is taken
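The three checks above can be read as a simple decision rule on the lowest score. The sketch below is one way to express that. The cut-offs 0.70 and 0.60 are assumptions chosen so that the three examples land on the outcomes shown; they are not documented eval.ninja thresholds, and `decide` is a hypothetical helper.

```python
PASS_AT = 0.70      # assumed: every score at or above this is sent as-is
BLOCK_BELOW = 0.60  # assumed: any score below this blocks the action


def decide(scores):
    """Map a dict of metric scores to 'pass', 'review', or 'block'."""
    lowest = min(scores.values())
    if lowest < BLOCK_BELOW:
        return "block"
    if lowest < PASS_AT:
        return "review"
    return "pass"


chatbot = {"faithfulness": 0.91, "policy": 0.96, "helpfulness": 0.88}
lead_email = {"personalization": 0.78, "claim_support": 0.64, "tone": 0.93}
escalation = {"policy": 0.42, "groundedness": 0.57, "confidence": 0.49}

for name, scores in [("chatbot", chatbot), ("lead_email", lead_email), ("escalation", escalation)]:
    print(name, decide(scores))
```

In practice each metric would likely get its own threshold; a single minimum keeps the sketch short.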

How It Works

Three steps from zero to your first eval

Deploy

Use a licensed Docker self-host in your own infrastructure, or skip the install and use the managed cloud.

# Self-host
$ docker run -p 8080:8080 \
    evalninja/eval:latest

# Or use cloud
# app.eval.ninja - no install

Call the API

One HTTP call per evaluation. Send the question, the answer, and the retrieved context. Get back metric scores with reasoning.
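From Python, the call described above can be sketched with only the standard library. This assumes the endpoint and JSON field names used by the curl example in this step. The helper name `build_eval_request` and the sample question and answer are made up for illustration, and the sketch only builds the request, so nothing about the response format is assumed.

```python
import json
import urllib.request

API_URL = "https://api.eval.ninja/v1/evaluate"  # endpoint from the curl example


def build_eval_request(token, user_input, response, retrieved_contexts, metrics):
    """Build (but do not send) the POST request for one evaluation."""
    payload = {
        "user_input": user_input,
        "response": response,
        "retrieved_contexts": retrieved_contexts,
        "metrics": metrics,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Illustrative values only.
req = build_eval_request(
    token="YOUR_TOKEN",
    user_input="What is the refund window?",
    response="Refunds are accepted within 30 days.",
    retrieved_contexts=["Our policy allows refunds within 30 days of purchase."],
    metrics=["faithfulness"],
)
# To send it: urllib.request.urlopen(req), then parse the JSON body.
```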

$ curl -X POST \
    https://api.eval.ninja/v1/evaluate \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"user_input":"...",
         "response":"...",
         "retrieved_contexts":["..."],
         "metrics":["faithfulness"]}'

Gate, Sample & Iterate

Wire scores into CI to catch regressions on every PR, or sample production outputs to verify messages after generation.

# In CI
{"faithfulness": 0.94,
 "reasoning": "answer is grounded in context..."}
# ✓ Above 0.85 threshold
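A CI gate over a response like the one above might look like the following sketch. It assumes the response is a flat JSON object with numeric metric scores next to a `reasoning` string, as in the sample, and reuses the 0.85 threshold. `failing_metrics` is a hypothetical helper, not part of eval.ninja.

```python
import json

THRESHOLD = 0.85  # threshold from the example above


def failing_metrics(result, threshold=THRESHOLD):
    """Return the names of numeric metrics that score below the threshold."""
    return sorted(
        name
        for name, value in result.items()
        if isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value < threshold
    )


result = json.loads(
    '{"faithfulness": 0.94, "reasoning": "answer is grounded in context..."}'
)
failed = failing_metrics(result)
exit_code = 1 if failed else 0  # pass this to sys.exit() in a real CI script
print("failed metrics:", failed)
```

Non-numeric fields such as `reasoning` are skipped, so the same check works when several metrics are requested at once.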

Two deployment options · same API · same metrics

self-host

Your network, your keys, your cost

Licensed for teams that need eval data and judge calls to stay inside their own infrastructure, or who want to pay their LLM provider directly without markup.

Commercial license - predictable fee, no per-token markup

Data isolation - for HIPAA, SOC 2, and air-gapped environments

No token markup - pay your provider at their posted rate

Runs anywhere - Lambda, ECS, Cloud Run, Kubernetes, bare metal

$ docker run -p 8080:8080 evalninja/eval:latest

Setup guide →

managed cloud

Start calling the API in under 5 minutes

No infrastructure, no provider keys, no setup. Sign up, copy the API key, and start evaluating. The judge model is included in your credits.

Zero setup - no infra to provision or maintain

Judge model included - no provider keys needed

Never used for training - your eval data stays yours

100 free credits included · no credit card required

Create free account →

A CLI tool is coming soon. It will work against either deployment.

Pricing

Start free. Pay for what you use.

Managed cloud uses credits. Self-hosting is commercially licensed, with your own compute and provider keys.

Free

Try it out

$0 /mo

✓ 100 credits included
✓ Basic evaluation metrics
✓ 1 concurrent evaluation
✓ 30 days data retention
✓ Community support

Start Free

Starter

Solo developers

$2.99 /mo

✓ 200 credits per month
✓ All evaluation metrics
✓ 2 concurrent evaluations
✓ 60 days data retention
✓ Email support

Choose Starter

Growth

Best value for teams

$9.99 /mo

✓ 1,500 credits per month
✓ All evaluation metrics
✓ 5 concurrent evaluations
✓ 90 days data retention
✓ Priority email support
✓ Advanced analytics dashboard
✓ Full API access

Choose Growth

Scale

High-volume teams

$30 /mo

✓ 3,800 credits per month
✓ All evaluation metrics
✓ 20 concurrent evaluations
✓ 365 days data retention
✓ Dedicated support channel
✓ Custom integrations
✓ Team collaboration tools

Choose Scale

Self-hosting. Pull the Docker image and bring your own LLM provider keys. Commercial license required; no markup on model usage. Contact us for pricing.
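To compare the paid cloud tiers above, the listed price can be divided by the monthly credits. The sketch below does only that arithmetic on the published numbers. It makes no assumption about how many credits a single evaluation consumes, and the names `TIERS` and `price_per_credit` are illustrative.

```python
# (monthly price in USD, credits per month), taken from the tiers above
TIERS = {
    "Starter": (2.99, 200),
    "Growth": (9.99, 1500),
    "Scale": (30.00, 3800),
}


def price_per_credit(tiers):
    """Return USD per credit for each tier."""
    return {name: price / credits for name, (price, credits) in tiers.items()}


per_credit = price_per_credit(TIERS)
cheapest = min(per_credit, key=per_credit.get)

for name, usd in per_credit.items():
    print(f"{name}: ${usd:.5f} per credit")
print("lowest price per credit:", cheapest)
```

On these numbers Growth has the lowest price per credit, which matches its "Best value for teams" label.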

How eval.ninja compares

Versus typical SaaS-only eval platforms

	eval.ninja	Typical SaaS Eval Tools
Run inside your network	✓ Docker, anywhere	✗ SaaS only
Bring your own keys	✓ No markup on tokens	✗ Bundled, marked up
Application-layer overhead	✓ None	⚠ Queues, workers
Serverless deployment	✓ Lambda, Cloud Run	✗ Vendor-hosted only
Language-agnostic API	✓ REST, any language	⚠ Often Python-first
Pricing model	✓ Credits from $2.99 + licensed self-host	⚠ Seat-based, $50+/mo

Ready to run your first eval?

Get 100 Free Credits · Self-Host Guide


Stop guessing if your LLM app works.

Licensed self-host with Docker. Or use the managed cloud. Same API either way.

Get 100 Free Credits

No credit card required.