What benchmark would you build for “reply quality” in SDR generation? [D]

Reddit r/MachineLearning / 5/1/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author is trying to benchmark the “reply quality” of AI-generated outbound SDR emails and follow-ups, but finds that common metrics (reply rate, reply sentiment, accuracy, editing needed, and spam-likeness) each fail to capture what “good” truly means.
  • They note that optimizing for reply rate can produce clickbaity but low-quality messages, while optimizing for factual accuracy can yield technically correct emails that still fail to engage.
  • The most practical internal metric at the moment is time-to-approve/send after human review, but the author argues this is a proxy rather than a direct measure of message quality.
  • The post asks what benchmark should be built, including whether it should be a single metric or a composite score and whether evaluation should be offline (benchmarks) or based on live campaign results.
  • Overall, the discussion frames reply-quality evaluation as a core, metric-driven problem where the “right” objective function is crucial for effective optimization.

Working on evaluating some AI-generated outbound (SDR-style emails along with follow-ups), and I’m running into a weird problem. Everyone talks about better personalisation or higher reply rates, but when you actually try to benchmark quality it gets messy fast.

A few things we’ve looked at:

a) reply rate (obvious, but noisy with a delayed signal)

b) positive vs negative replies (hard to label cleanly at scale)

c) factual accuracy about the prospect/company

d) how much editing a human has to do before sending (see the similarity sketch after this list)

e) whether the message sounds human enough to not trigger spam radar
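For (d), one cheap offline proxy is how similar the AI draft is to what the rep actually sent. A minimal sketch using Python's standard-library difflib, assuming both versions get logged (the function and variable names here are mine, not from the post):

```python
from difflib import SequenceMatcher

def edit_similarity(ai_draft: str, sent_email: str) -> float:
    """Return a score in [0, 1]; 1.0 means the rep sent the draft untouched,
    lower values mean more human editing was needed."""
    return SequenceMatcher(None, ai_draft, sent_email).ratio()

# Hypothetical example: the draft needed a partial rewrite before sending.
draft = "Hi Dana, noticed Acme just raised a Series B and wanted to reach out..."
sent = "Hi Dana, congrats on the Series B. Quick question about your outbound stack..."
print(f"edit similarity: {edit_similarity(draft, sent):.2f}")
```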

The issue, for me at least, is that none of these fully captures "this is a good outbound message". You can optimise for reply rate and end up with clickbaity nonsense. You can optimise for accuracy and get something technically correct but completely dead. Right now the most practical metric internally is probably time-to-approve/send after human review, but that feels like a proxy, not the thing itself. If you had to build a proper benchmark here, what would you optimise for? This seems like one of those problems where everyone says the metric isn't important, but it actually looks like the core element.

  • single metric or composite? (rough composite sketch below)
  • offline eval vs live campaign data?
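
If it ends up being a composite, the simplest starting point is a weighted sum of normalised sub-metrics, where the weights are the thing the team argues about explicitly. A rough sketch; the metric names and weights below are purely illustrative assumptions, not something settled in the thread:

```python
# Each sub-metric is assumed to be pre-normalised to [0, 1], higher = better.
WEIGHTS = {
    "reply_rate": 0.30,           # noisy and delayed, but closest to ground truth
    "positive_reply_rate": 0.25,  # (b) above, however replies end up labelled
    "factual_accuracy": 0.20,     # (c) e.g. fraction of claims verified against the CRM
    "edit_similarity": 0.15,      # (d) from the sketch above
    "spam_pass_rate": 0.10,       # (e) fraction of messages passing a spam-likeness check
}

def composite_score(metrics: dict[str, float]) -> float:
    return sum(w * metrics.get(name, 0.0) for name, w in WEIGHTS.items())

print(composite_score({
    "reply_rate": 0.8,            # e.g. observed rate over a target baseline, capped at 1
    "positive_reply_rate": 0.5,
    "factual_accuracy": 0.9,
    "edit_similarity": 0.7,
    "spam_pass_rate": 0.95,
}))
```

A weighted sum still hides the same problem the post describes (individual terms can be gamed), but it at least makes the trade-offs explicit and tunable against live campaign data.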
submitted by /u/Critical_Builder_902