API · Licensed Self-Host · Cloud

LLM evaluation at the speed of a single API call

Faithfulness, relevance, and LLM-as-a-judge scoring in one API call, for CI checks and live production monitoring.

$ docker pull evalninja/eval:latest
eval.ninja - running evaluation suite...
Faithfulness 0.94
Answer Relevance 0.87
Context Precision 0.91
Context Recall 0.62
3/4 metrics passing · completed in 4.2s

Runs anywhere you run containers

AWS Lambda · ECS / Fargate · Google Cloud Run · Azure Container Apps · Fly.io · Kubernetes · Bare metal
No middleware

Eval latency is model latency

No queues, no workers, no async pipelines. Your HTTP request calls the judge model directly and returns the scores. Typical eval platforms add an orchestration layer between your code and the model. We don't.

Others:      your app → queue → worker pool → judge model → response queue → your app
eval.ninja:  your app → judge model → your app
OPENAI_API_KEY=sk-••••••••••••
# stays in your environment, not ours
ANTHROPIC_API_KEY=sk-ant-••••••
# or any provider you choose
# eval.ninja never sees your keys
# you pay your provider directly
Your keys, your cost

No markup.
No middleman.

Self-host under a commercial license with your own LLM provider keys. eval.ninja never sees them. You pay your provider at their posted rate. No token markup, no bundled API fees.

REST, any language

One endpoint, no SDK required. Call from bash, Python, Go, Node, Rust. Any HTTP client works.

RAG metrics out of the box

Faithfulness, answer relevance, context precision, context recall. Plus custom LLM-as-a-judge rubrics for any open-ended task.

Design-time and runtime

Gate prompt changes before deploys, then sample live outputs to catch quality drift, hallucinations, and unsafe automations.
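Sampling live outputs does not require scoring every message. A minimal sketch in POSIX shell, assuming your app has a stable conversation ID and you pick a percentage of conversations to score; the should_sample helper and its parameters are hypothetical and not part of the eval.ninja API:

```shell
# Hypothetical helper: decide whether to score a production message.
# Usage: should_sample CONVERSATION_ID RATE_PERCENT   (RATE_PERCENT: 0-100)
should_sample() {
  # cksum prints "<CRC> <byte count>"; keep only the CRC.
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  # Hashing the ID (instead of drawing a random number) gives the same answer
  # for every message in a conversation, so sampled conversations are scored in full.
  [ $((crc % 100)) -lt "$2" ]
}

if should_sample "conv-1234" 10; then
  echo "score this message"
else
  echo "skip this message"
fi
```

Keying on the ID rather than a random draw also makes the choice reproducible when you re-run a job over stored traffic.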

Runtime evals

Score the message your app actually generated

Design-time evals tell you whether a prompt, model, or retrieval change is ready to ship. Runtime evals check the answer your app generated for a real user.

Production checks
  • Chatbot answer pass: faithfulness 0.91 · policy 0.96 · helpfulness 0.88 → send to user, store score on conversation trace
  • Lead email review: personalization 0.78 · claim support 0.64 · tone 0.93 → hold message because one generated claim lacks CRM evidence
  • Support escalation block: policy 0.42 · groundedness 0.57 · confidence 0.49 → route to fallback or human review before action is taken
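One way to turn those scores into actions is to act on the lowest metric for each message. A minimal sketch, assuming a hypothetical policy with illustrative cut-offs of 0.85 (send) and 0.60 (hold); eval.ninja returns the scores, and the routing rule is yours to define. The min_score and route_action helpers are hypothetical names:

```shell
# Hypothetical routing policy: act on the lowest metric score for a message.
# The 0.85 and 0.60 cut-offs are illustrative, not eval.ninja defaults.

# Print the smallest of the numeric arguments.
min_score() {
  printf '%s\n' "$@" | awk 'NR == 1 || $1 + 0 < min { min = $1 + 0 } END { print min }'
}

# Map one score to an action; awk handles the decimal comparison
# that POSIX sh arithmetic cannot.
route_action() {
  awk -v s="$1" 'BEGIN {
    s += 0
    if (s >= 0.85)      print "send"
    else if (s >= 0.60) print "hold"
    else                print "review"
  }'
}

# The three production checks above:
route_action "$(min_score 0.91 0.96 0.88)"   # chatbot answer     -> send
route_action "$(min_score 0.78 0.64 0.93)"   # lead email         -> hold
route_action "$(min_score 0.42 0.57 0.49)"   # support escalation -> review
```

Gating on the minimum is deliberately strict: one weak metric, such as unsupported claims in the lead email, is enough to stop a message.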

How It Works

Three steps from zero to your first eval

01

Deploy

Run the licensed Docker image in your own infrastructure, or skip the install and use the managed cloud.

# Self-host
$ docker run -p 8080:8080 \
    evalninja/eval:latest
# Or use cloud
# app.eval.ninja - no install
02

Call the API

One HTTP call per evaluation. Send the question (user_input), the answer (response), and the retrieved context (retrieved_contexts), and get back metric scores with reasoning.

$ curl -X POST \
    https://api.eval.ninja/v1/evaluate \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"user_input": "...",
         "response": "...",
         "retrieved_contexts": ["..."],
         "metrics": ["faithfulness"]}'
03

Gate, Sample & Iterate

Wire scores into CI to catch regressions on every PR, or sample production outputs to verify messages after generation.

# In CI
{"faithfulness": 0.94,
 "reasoning": "answer is grounded in context..."}
# ✓ Above 0.85 threshold
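A CI step only needs to read the score and set an exit code. A minimal sketch in POSIX shell, assuming the response is single-line JSON with a top-level faithfulness field as in the example above, and reusing its 0.85 threshold; check_faithfulness is a hypothetical helper name:

```shell
# Hypothetical CI gate: fail the job when faithfulness drops below 0.85.
# Assumes single-line JSON with a top-level "faithfulness" number,
# shaped like the example response above.
THRESHOLD=0.85

check_faithfulness() {
  # Pull the number after "faithfulness": with sed (no jq dependency).
  score=$(printf '%s' "$1" | sed -n 's/.*"faithfulness": *\([0-9.]*\).*/\1/p')
  if [ -z "$score" ]; then
    echo "no faithfulness score in response" >&2
    return 2
  fi
  # awk exits 0 only when score >= threshold (sh cannot compare decimals).
  if awk -v s="$score" -v t="$THRESHOLD" 'BEGIN { exit !((s + 0) >= (t + 0)) }'; then
    echo "faithfulness $score >= $THRESHOLD: pass"
  else
    echo "faithfulness $score < $THRESHOLD: fail" >&2
    return 1
  fi
}

# In a pipeline, pass the body returned by the curl call from step 02 instead.
check_faithfulness '{"faithfulness": 0.94, "reasoning": "answer is grounded in context..."}'
```

The sed extraction keeps the script dependency-free; if your CI image ships jq, parsing the JSON with it is more robust.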
Two deployment options  ·  same API  ·  same metrics
self-host

Your network, your keys, your cost

Licensed for teams that need eval data and judge calls to stay inside their own infrastructure, or who want to pay their LLM provider directly without markup.

Commercial license  -  predictable fee, no per-token markup
Data isolation  -  for HIPAA, SOC 2, and air-gapped environments
No token markup  -  pay your provider at their posted rate
Runs anywhere  -  Lambda, ECS, Cloud Run, Kubernetes, bare metal
$ docker run -p 8080:8080 evalninja/eval:latest
Setup guide →
managed cloud

Start calling the API in under 5 minutes

No infrastructure, no provider keys, no setup. Sign up, copy your API key, and start evaluating. Judge model usage is included in your credits.

Zero setup  -  no infra to provision or maintain
Judge model included  -  no provider keys needed
Never used for training  -  your eval data stays yours
100 free credits included  ·  no credit card required
Create free account →
CLI tool coming soon. It will work against either deployment.
Pricing

Start free. Pay for what you use.

Managed cloud uses credits. Self-hosting is commercially licensed, with your own compute and provider keys.

Free

Try it out

$0 /mo
  • 100 credits included
  • Basic evaluation metrics
  • 1 concurrent evaluation
  • 30 days data retention
  • Community support
Start Free

Starter

Solo developers

$2.99 /mo
  • 200 credits per month
  • All evaluation metrics
  • 2 concurrent evaluations
  • 60 days data retention
  • Email support
Choose Starter
Most Popular

Growth

Best value for teams

$9.99 /mo
  • 1,500 credits per month
  • All evaluation metrics
  • 5 concurrent evaluations
  • 90 days data retention
  • Priority email support
  • Advanced analytics dashboard
  • Full API access
Choose Growth

Scale

High-volume teams

$30 /mo
  • 3,800 credits per month
  • All evaluation metrics
  • 20 concurrent evaluations
  • 365 days data retention
  • Dedicated support channel
  • Custom integrations
  • Team collaboration tools
Choose Scale
Self-hosting. Pull the Docker image and bring your own LLM provider keys. Commercial license required; no markup on model usage. Contact us for pricing.

How eval.ninja compares

Versus typical SaaS-only eval platforms

Feature                     eval.ninja                               Typical SaaS eval tools
Run inside your network     Docker, anywhere                         SaaS only
Bring your own keys         No markup on tokens                      Bundled, marked up
Application-layer overhead  None                                     Queues, workers
Serverless deployment       Lambda, Cloud Run                        Vendor-hosted only
Language-agnostic API       REST, any language                       Often Python-first
Pricing model               Credits from $2.99 + licensed self-host  Seat-based, $50+/mo

Ready to run your first eval?

Stop guessing whether your LLM app works.

Self-host under a commercial license with Docker, or use the managed cloud. Same API either way.

Get 100 Free Credits
No credit card required.