For AI-generated outbound, lead qualification, and sales follow-up, check the message before it ships. Make sure CRM facts support it, the compliance rules are followed, and the message is useful for the target account.
The failure mode is not bad grammar
Generated lead-generation copy usually sounds polished. The risk is that it makes up a company milestone, overstates product fit, uses stale CRM data, or sends an outreach message that should have gone to human review.
That makes lead generation a good runtime eval use case. You are not only testing a prompt in a sandbox. You are checking the real message before it reaches the prospect or the sales team.
What to evaluate
- Outbound emails: check whether account-specific claims are supported by CRM, enrichment, website, or call transcript context.
- LinkedIn messages: score tone, brevity, relevance, and whether personalization avoids invented details.
- Lead qualification summaries: verify that the generated summary includes the right qualification fields and does not overstate readiness to buy.
- Routing notes: judge whether handoff recommendations are justified by lead score, territory, segment, and available evidence.
- Sales follow-ups: compare generated next steps against the transcript or meeting notes that triggered the follow-up.
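One of these checks, confirming that a qualification summary carries the right fields, can run as plain code before any model-based scoring. A minimal sketch, assuming the summary arrives as a structured object; the field names (`budget`, `authority`, `need`, `timeline`) and the function name are hypothetical placeholders, not taken from any particular CRM schema.

```javascript
// Hypothetical required fields; replace with your own qualification schema.
const REQUIRED_FIELDS = ["budget", "authority", "need", "timeline"];

// Return the required fields that are absent or blank in a generated summary.
function missingQualificationFields(summary, required = REQUIRED_FIELDS) {
  return required.filter((field) => {
    const value = summary[field];
    return value === undefined || value === null || String(value).trim() === "";
  });
}

// Example draft: "authority" is missing and "timeline" is blank.
const draft = { budget: "Not discussed", need: "Forecast accuracy", timeline: " " };
console.log(missingQualificationFields(draft));
```

A non-empty result is a cheap reason to regenerate or flag the summary before spending an eval call on it.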
Ground every generated claim
The most important lead-generation eval is claim support. If the message says a company opened offices, raised funding, uses a vendor, or has a pain point, that claim should come from trusted context.
curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "Write a first-touch email for a VP of Sales at Acme.",
    "response": "Hi Maya, noticed Acme just opened three new EMEA offices...",
    "retrieved_contexts": [
      "CRM account: Acme has 420 employees. Region: North America.",
      "Website: Acme sells revenue forecasting software.",
      "Lead title: VP of Sales. Industry: B2B SaaS."
    ],
    "metrics": ["faithfulness"],
    "rubric": "Score whether every company-specific claim in the outreach message is supported by the provided CRM, enrichment, and website context."
  }'
In this example, the email claims Acme opened three new EMEA offices. But the context only says the account is in North America. A runtime eval should stop that message from being sent automatically.
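The same request can be assembled in application code just before the send step. A minimal sketch in JavaScript, assuming Node 18+ for the built-in `fetch`. The request fields mirror the curl example; `buildGroundingRequest` and `evaluateDraft` are hypothetical helper names, and the response is returned unparsed beyond JSON because its shape is not shown here.

```javascript
const EVAL_URL = "https://api.eval.ninja/v1/evaluate"; // endpoint from the curl example

// Build the request body from the prompt, the generated draft, and trusted context.
function buildGroundingRequest({ prompt, draft, contexts }) {
  if (!Array.isArray(contexts) || contexts.length === 0) {
    // With no trusted context there is nothing to ground claims against.
    throw new Error("At least one context string is required");
  }
  return {
    user_input: prompt,
    response: draft,
    retrieved_contexts: contexts,
    metrics: ["faithfulness"],
    rubric:
      "Score whether every company-specific claim in the outreach message is " +
      "supported by the provided CRM, enrichment, and website context.",
  };
}

// Send the request. The response shape is an assumption, so it is returned as-is.
async function evaluateDraft(input, apiKey = process.env.EVAL_NINJA_KEY) {
  const res = await fetch(EVAL_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildGroundingRequest(input)),
  });
  if (!res.ok) throw new Error(`Eval request failed: ${res.status}`);
  return res.json();
}
```

Refusing to build a request without context is deliberate: a draft with nothing to check against cannot be shown to be grounded.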
Use rubrics that match sales risk
A useful rubric keeps each risk separate instead of hiding everything in one score. Compliance and factual grounding usually need stricter thresholds than tone or style.
{
  "rubric": {
    "grounded_personalization": "All account-specific and lead-specific claims are supported by CRM, enrichment, transcript, or website context.",
    "message_fit": "The message connects the product value proposition to the lead's role and likely business problem.",
    "compliance": "The message avoids restricted claims, protected attributes, misleading urgency, and unsubscribe violations.",
    "tone": "The message is concise, professional, and not overly familiar."
  },
  "thresholds": {
    "grounded_personalization": 0.90,
    "compliance": 0.95,
    "message_fit": 0.75,
    "tone": 0.80
  }
}
This lets you make different decisions for different failures. A weak message can be regenerated. An unsupported claim should go to review. A compliance issue should block the message.
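// Map eval scores to an action, checking the most severe failure first.
function decideAction(evalResult) {
  if (evalResult.compliance < 0.95) {
    return "block"; // compliance issue: do not send
  }
  if (evalResult.grounded_personalization < 0.90) {
    return "human_review"; // unsupported claim: a person decides
  }
  if (evalResult.message_fit < 0.75) {
    return "regenerate"; // weak message: try again
  }
  return "send";
}
Runtime patterns for lead generation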
Block before external send
When the AI workflow can send an email, LinkedIn message, or SMS, run the eval before the message is queued. Use hard thresholds for compliance and grounded personalization.
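A pre-send gate along these lines might look like the sketch below. It assumes the eval returns per-dimension scores between 0 and 1 under the rubric names used earlier; `gateBeforeSend`, `evaluate`, and `enqueue` are hypothetical names for illustration, with the last two injected by the caller.

```javascript
// Run the eval before queueing. `evaluate` and `enqueue` are injected, hypothetical
// functions; `evaluate` is assumed to resolve to per-dimension scores in [0, 1].
async function gateBeforeSend(message, { evaluate, enqueue }) {
  let scores;
  try {
    scores = await evaluate(message);
  } catch (err) {
    // Fail closed: if the eval cannot run, do not send automatically.
    return { action: "human_review", reason: `eval unavailable: ${err.message}` };
  }
  // `!(x >= t)` also catches missing or NaN scores, which `x < t` would let through.
  if (!(scores.compliance >= 0.95)) {
    return { action: "block", reason: "compliance below threshold" };
  }
  if (!(scores.grounded_personalization >= 0.9)) {
    return { action: "human_review", reason: "grounding below threshold" };
  }
  await enqueue(message); // only reached when both hard thresholds pass
  return { action: "send", reason: "passed hard thresholds" };
}
```

The gate fails closed by design: an eval outage or a missing score routes the message away from the queue rather than letting it through.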
Review before CRM writeback
When the model writes a lead summary or qualification note back into CRM, judge it against the call transcript, form submission, enrichment data, and routing rules. An inaccurate note stays in the record and can mislead routing, scoring, and follow-up long after it was written.
Sample lower-risk drafts
If the output is only a draft shown to a sales rep, sample a percentage of messages and use failures to improve prompts, enrichment, and templates over time.
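Sampling can be made deterministic so that a given draft is either always or never evaluated, even across retries. A minimal sketch for Node.js, assuming each draft has a stable ID; `shouldSample` and the 10% rate in the usage line are illustrative choices, not recommendations from this article.

```javascript
const { createHash } = require("node:crypto");

// Deterministically decide whether a draft is sampled for evaluation.
// Hashing the ID (instead of Math.random) gives the same answer on retries.
function shouldSample(messageId, rate) {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  const digest = createHash("sha256").update(String(messageId)).digest();
  // Map the first 4 bytes of the hash to a number in [0, 1).
  const bucket = digest.readUInt32BE(0) / 2 ** 32;
  return bucket < rate;
}

// Usage: evaluate roughly 10% of rep-facing drafts (hypothetical ID and rate).
if (shouldSample("draft-12345", 0.1)) {
  // run the eval and log the scores for later analysis
}
```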
Close the loop
Every failed message is a useful example. Add repeated failures to a golden dataset, run them in CI before prompt changes ship, and keep runtime evals in place for real account data.
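The CI step can be as small as replaying stored failures and comparing the resulting action to the expected one. A minimal sketch, assuming each golden case records a message and the action it should receive; `runGoldenSet`, the case shape (`id`, `message`, `expectedAction`), and the injected `decide` function are hypothetical.

```javascript
// Replay golden cases and collect every case whose action changed.
// `decide` is whatever function maps a message to
// "send" | "regenerate" | "human_review" | "block" in your pipeline.
async function runGoldenSet(cases, decide) {
  const regressions = [];
  for (const testCase of cases) {
    const actual = await decide(testCase.message);
    if (actual !== testCase.expectedAction) {
      regressions.push({
        id: testCase.id,
        expected: testCase.expectedAction,
        actual,
      });
    }
  }
  return regressions;
}

// In CI, fail the build when any golden case regresses, for example:
// runGoldenSet(cases, decide).then((r) => process.exit(r.length ? 1 : 0));
```

Returning the list of regressions, rather than a pass/fail flag, keeps the failing case IDs available for the build log.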