So, I trained two variants of the model for this task:
- using just length penalty
- using a quality reward and length penalty
I ran an LLM-as-a-judge eval to check summarization quality using DeepEval's tools, scoring four factors:
- Conciseness
- Coverage
- Clarity
- Faithfulness
The results are as follows:
- with quality + length penalty rewards: 2.5/4
- with just length penalty: 2.4/5
Results:
The model trained with both the quality reward (ROUGE-L) and the length penalty beats the baseline on the final composite score with a p-value of 0.0042 (one-sided t-test, 5 rounds of evals per model).
Evaluated on a test sample of 200 examples from the smoltldr dataset.
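For reference, the shape of that significance test can be sketched in plain Python. Welch's t statistic is computed below; the score arrays are hypothetical placeholders, not my actual per-round results, and the p-value would still come from a t-distribution table or `scipy.stats`:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for H1: mean(sample_a) > mean(sample_b)."""
    va, vb = variance(sample_a), variance(sample_b)
    na, nb = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(va / na + vb / nb)

# Hypothetical composite scores over 5 eval rounds per model
quality_plus_length = [2.52, 2.48, 2.55, 2.49, 2.51]
length_only = [2.41, 2.38, 2.43, 2.39, 2.40]

t = welch_t(quality_plus_length, length_only)  # positive => first model scores higher
```

A large positive t over so few rounds is what drives the small p-value reported above.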
Baseline: length penalty only
- What is LLM-as-a-Judge?
It lets an LLM of your choice score outputs that can't easily be mapped to a definitive reward because of their variance or subjective nature, like summarization!
Such judgments vary from person to person, so we employ an LLM to act as a judge, score each output multiple times, and aggregate the results. This is cheap compared to human labelers!
So, I used DeepEval's excellent tools to build an eval system that scores the summaries produced by my models on the aforementioned four factors:
Faithfulness: does the summary stay fully grounded in the source, with no hallucinations or contradictions?
Coverage: does the summary capture the source’s key points without missing meaning-critical information?
Conciseness: is the summary substantially shorter than the source without redundancy or unnecessary detail?
Clarity: is the summary easy to read, grammatically clean, and understandable on its own?
The composite score is the mean of the above scores.
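The composite is just the mean of the four judge scores, which in code is a one-liner (the per-metric scores below are illustrative, not my actual numbers):

```python
from statistics import mean

# Hypothetical per-metric judge scores for a single summary
scores = {
    "faithfulness": 3.0,
    "coverage": 2.0,
    "conciseness": 2.5,
    "clarity": 2.5,
}

composite = mean(scores.values())  # unweighted mean of the four factors -> 2.5
```

An unweighted mean treats all four factors as equally important; a weighted mean would let you emphasize, say, faithfulness over clarity.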
- Reward system
length_penalty: basically -abs(response_length - MAX_LENGTH)
quality_reward: ROUGE-L, which is essentially the longest common subsequence (LCS) against the golden summaries included in the dataset, to keep some structure in the generated responses and minimize degradation.
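A minimal sketch of both rewards, assuming whitespace tokenization and an LCS-based ROUGE-L recall; the `MAX_LENGTH` value and the exact ROUGE-L normalization are my assumptions, not necessarily what the actual training run used:

```python
MAX_LENGTH = 100  # hypothetical target length in tokens

def length_penalty(response_tokens: list[str]) -> float:
    # Peak reward (0) exactly at MAX_LENGTH, linear penalty on either side
    return -abs(len(response_tokens) - MAX_LENGTH)

def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic program for longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def quality_reward(response: str, golden: str) -> float:
    # ROUGE-L recall: LCS length normalized by the golden summary's length
    r, g = response.split(), golden.split()
    return lcs_len(r, g) / len(g) if g else 0.0
```

For example, `quality_reward("the cat sat on the mat", "the cat sat")` is 1.0, since the golden summary appears in order inside the response; the length penalty then pushes the model to stop padding beyond `MAX_LENGTH`.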