Should LLM evals block pull requests?

Yes, for stable quality gates and critical workflows. Use a fast smoke suite on every pull request and a larger benchmark suite on scheduled runs or high-risk changes.

Which metrics should run in CI?

For RAG applications, start with faithfulness, answer relevancy, context precision, and context recall. Gate on individual metrics instead of only using an average score.

How do I keep CI evals fast?

Run a small representative subset on each pull request, cache generated application responses where possible, and run full suites on schedules or merges to main.

How to Run LLM Evals in CI/CD

Short answer

Run a small eval suite on every pull request, gate on metric thresholds, and run a full benchmark on merges or schedules. Treat prompts, retrieval settings, and model changes as production code changes.

What belongs in CI

CI should catch obvious regressions quickly. It does not need to prove your entire AI system is perfect. The right pull request suite is small, representative, and strict enough to block dangerous changes.

20 to 50 examples for fast pull request feedback.
Stable examples that cover high-value workflows.
Per-metric thresholds, not only average scores.
Segment-level checks for critical query categories.
A larger scheduled suite for deeper coverage.
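
One way to build such a suite is to take a fixed number of examples per segment from the full dataset, chosen deterministically so the same examples run on every pull request. The sketch below assumes each example carries hypothetical `id` and `segment` fields; neither the field names nor the `selectSmokeSuite` helper come from eval.ninja.

```javascript
// Hypothetical helper: pick a stable, segment-balanced smoke subset.
// Assumes each example looks like { id: string, segment: string, ... }.
function selectSmokeSuite(examples, perSegment) {
  // Group examples by segment, preserving first-seen segment order.
  const bySegment = new Map();
  for (const example of examples) {
    if (!bySegment.has(example.segment)) bySegment.set(example.segment, []);
    bySegment.get(example.segment).push(example);
  }

  const selected = [];
  for (const group of bySegment.values()) {
    // Sort by id so the choice within a segment does not depend on input order.
    const sorted = [...group].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
    selected.push(...sorted.slice(0, perSegment));
  }
  return selected;
}

// Illustrative dataset; real examples would also carry inputs and references.
const dataset = [
  { id: "b-2", segment: "billing" },
  { id: "s-1", segment: "security" },
  { id: "b-1", segment: "billing" },
  { id: "b-3", segment: "billing" },
];
const smoke = selectSmokeSuite(dataset, 2);
console.log(smoke.map((e) => e.id));
```

Capping each segment separately keeps a large category from crowding out a small but critical one.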

Recommended workflow

Generate application responses for your eval dataset.
Send user input, response, retrieved context, and reference answers to eval.ninja.
Store the scored result as a CI artifact.
Fail the build if required metrics fall below thresholds.
Compare against the main-branch baseline to catch regressions.

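name: RAG Evaluation
on:
  pull_request:
    paths:
      - "src/rag/**"
      - "prompts/**"
      - "evals/**"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "lts/*"
      # Install dependencies so the npm script below can run (assumes a lockfile).
      - run: npm ci
      - name: Generate eval responses
        run: npm run eval:generate -- --dataset evals/smoke.json
      - name: Score with eval.ninja
        # --fail makes curl exit non-zero on HTTP errors, so a failed API call
        # fails this step instead of writing an error body to eval-results.json.
        run: |
          curl --fail --silent --show-error -X POST https://api.eval.ninja/v1/evaluate \
            -H "Authorization: Bearer ${{ secrets.EVAL_NINJA_KEY }}" \
            -H "Content-Type: application/json" \
            -d @evals/generated-responses.json \
            -o eval-results.json
      # Store the scored result as a CI artifact before gating on it.
      - name: Upload eval results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.json
      - name: Check thresholds
        run: |
          node scripts/check-eval-thresholds.js \
            --results eval-results.json \
            --min-faithfulness 0.85 \
            --min-answer-relevancy 0.80 \
            --max-regression 0.05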
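
The workflow calls `scripts/check-eval-thresholds.js` without showing it. A minimal sketch of its core check follows, assuming the results reduce to one aggregate score per metric (for example `{ "faithfulness": 0.91 }`) and that main-branch scores are available in the same shape. The actual eval.ninja response format may differ, and `checkThresholds` is an illustrative name.

```javascript
// Hypothetical core of scripts/check-eval-thresholds.js.
// Assumes `scores` and `baseline` are flat objects of metric name -> score in [0, 1].
function checkThresholds(scores, { minimums, baseline = {}, maxRegression = 0 }) {
  const failures = [];

  // Absolute gate: each required metric must exist and meet its minimum.
  for (const [metric, minimum] of Object.entries(minimums)) {
    const score = scores[metric];
    if (typeof score !== "number") {
      failures.push(`${metric}: missing from results`);
    } else if (score < minimum) {
      failures.push(`${metric}: ${score} is below minimum ${minimum}`);
    }
  }

  // Relative gate: a metric may not drop more than maxRegression below the baseline.
  for (const [metric, previous] of Object.entries(baseline)) {
    const score = scores[metric];
    // Only compare metrics present in both runs.
    if (typeof score === "number" && previous - score > maxRegression) {
      failures.push(`${metric}: dropped from ${previous} to ${score}`);
    }
  }
  return failures;
}

// Illustrative scores, not real results.
const failures = checkThresholds(
  { faithfulness: 0.88, answer_relevancy: 0.78 },
  {
    minimums: { faithfulness: 0.85, answer_relevancy: 0.8 },
    baseline: { faithfulness: 0.95, answer_relevancy: 0.8 },
    maxRegression: 0.05,
  }
);
console.log(failures);
// In CI, a non-empty list should fail the step, for example:
// if (failures.length > 0) process.exitCode = 1;
```

In this example faithfulness clears its minimum but still fails the regression gate, which is the case the baseline comparison exists to catch.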

Set thresholds by metric

Use metric-specific thresholds because each score means something different. Faithfulness protects against unsupported claims. Context precision protects against irrelevant retrieved context crowding out useful evidence. Context recall protects against missing retrieval evidence. Answer relevancy protects against evasive answers.

{
  "minimums": {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.80
  },
  "max_regression_from_main": 0.05,
  "required_segments": ["billing", "security", "setup"]
}
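
To show how the `required_segments` part of such a config might be enforced, the sketch below applies the `minimums` to every required segment. It assumes per-segment scores are available as a hypothetical `segments` object keyed by segment name; that shape is an assumption, not a documented eval.ninja response.

```javascript
// Hypothetical check: every required segment must be present and meet each minimum.
// Assumes `segments` maps segment name -> { metric name -> score }.
function checkRequiredSegments(config, segments) {
  const failures = [];
  for (const name of config.required_segments) {
    const scores = segments[name];
    if (!scores) {
      failures.push(`${name}: no results for required segment`);
      continue;
    }
    for (const [metric, minimum] of Object.entries(config.minimums)) {
      // A missing metric is undefined, which fails this comparison as well.
      if (!(scores[metric] >= minimum)) {
        failures.push(`${name}.${metric}: ${scores[metric]} < ${minimum}`);
      }
    }
  }
  return failures;
}

// Illustrative config and scores, not real results.
const config = {
  minimums: { faithfulness: 0.85, context_recall: 0.8 },
  required_segments: ["billing", "security", "setup"],
};
const segmentScores = {
  billing: { faithfulness: 0.9, context_recall: 0.82 },
  security: { faithfulness: 0.7, context_recall: 0.9 },
};
const segmentFailures = checkRequiredSegments(config, segmentScores);
console.log(segmentFailures);
```

Treating an absent segment as a failure keeps a critical category from silently dropping out of the smoke suite.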

Cloud and self-hosted paths

For most teams, the cloud API is the fastest way to add CI gates. For regulated environments, point the same request shape at a self-hosted eval.ninja deployment inside your network. The CI pattern does not change; only the base URL and credentials do.

The CLI is coming soon. Today, the reliable integration path is the REST API.
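
Because only the base URL and credentials change between the two paths, the request can be built from environment variables. This is a minimal sketch: `EVAL_NINJA_BASE_URL`, the self-hosted hostname, and the placeholder payload are assumptions for illustration, while the `/v1/evaluate` path and bearer token follow the workflow above.

```javascript
// Hypothetical helper: build the same request for cloud or self-hosted deployments.
const DEFAULT_BASE_URL = "https://api.eval.ninja";

function buildEvaluateRequest(env, payload) {
  // EVAL_NINJA_BASE_URL is an assumed variable name; strip trailing slashes if present.
  const baseUrl = (env.EVAL_NINJA_BASE_URL || DEFAULT_BASE_URL).replace(/\/+$/, "");
  if (!env.EVAL_NINJA_KEY) throw new Error("EVAL_NINJA_KEY is not set");
  return {
    url: `${baseUrl}/v1/evaluate`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.EVAL_NINJA_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    },
  };
}

// Placeholder payload; the real request body is whatever the eval step generates.
const cloud = buildEvaluateRequest({ EVAL_NINJA_KEY: "test-key" }, { examples: [] });
const selfHosted = buildEvaluateRequest(
  { EVAL_NINJA_KEY: "test-key", EVAL_NINJA_BASE_URL: "https://evals.internal.example/" },
  { examples: [] }
);
console.log(cloud.url);
console.log(selfHosted.url);
// Send with: const res = await fetch(cloud.url, cloud.options);
```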

Common mistakes

Running the full benchmark on every small change

Large suites are useful, but they slow feedback. Use a smoke suite for pull requests and the full dataset for scheduled runs or merges to main.

Only checking the average score

A high average can hide a severe faithfulness regression. Gate on individual metrics and required segments.
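
A small worked example makes the point concrete. The scores and the 0.80 average gate below are made up for illustration; the per-metric minimums match the threshold config shown earlier.

```javascript
// Illustrative scores only; not real benchmark results.
const scores = {
  faithfulness: 0.6,
  answer_relevancy: 0.95,
  context_precision: 0.92,
  context_recall: 0.95,
};
const minimums = {
  faithfulness: 0.85,
  answer_relevancy: 0.8,
  context_precision: 0.7,
  context_recall: 0.8,
};

// An average-only gate (hypothetical 0.80 threshold) passes this run.
const values = Object.values(scores);
const average = values.reduce((sum, v) => sum + v, 0) / values.length;
const averageGatePasses = average >= 0.8;

// Per-metric gates list every metric that falls below its own minimum.
const failingMetrics = Object.keys(minimums).filter((m) => scores[m] < minimums[m]);

console.log(average.toFixed(3), averageGatePasses, failingMetrics);
```

The average lands around 0.855 and would pass, while the per-metric check flags faithfulness at 0.6.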

Letting nondeterminism fail builds randomly

Pin models where possible, use stable judge configuration, and compare against tolerances instead of expecting exact score equality.