Hi everyone. I'm reposting this since my previous post was deleted (I don't know why; maybe the writing quality was low?).
I’ve been working on a lightweight way to reduce hallucinations in LLMs without relying on external judges, extra human labels, or heavy preference-learning pipelines.
The basic idea is simple: let a frozen base model generate a "bad" counterfactual answer, then train the adapted model to contrast the correct answer against that bad branch, starting only from the first token where the two diverge.
Instead of updating on every sample, the method self-selects cases where the bad continuation is still getting too much support from the model.
In practice, this means only about 10% of the training examples actually trigger updates, but the model still improves factuality over standard CE training and DPO-style baselines.
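To make the selection step concrete, here is a minimal sketch of the two pieces described above: finding the first divergence token and gating updates on how much support the bad branch still gets. The function names, the `margin` threshold, and the sum-of-log-probs form of the gate are my assumptions for illustration; they are not taken from the repo.

```python
import numpy as np

def first_divergence(good_ids, bad_ids):
    """Index of the first token where the good and bad continuations differ."""
    for i, (g, b) in enumerate(zip(good_ids, bad_ids)):
        if g != b:
            return i
    # No divergence within the shared prefix.
    return min(len(good_ids), len(bad_ids))

def should_update(good_logps, bad_logps, div, margin=1.0):
    """Selective gate (assumed form): trigger an update only when the bad
    branch still gets too much support past the divergence point, i.e. the
    log-prob margin of good over bad is below a threshold."""
    good = float(np.sum(good_logps[div:]))
    bad = float(np.sum(bad_logps[div:]))
    return (good - bad) < margin
```

Under this gate, samples where the model already strongly prefers the correct branch are skipped, which is one way the ~10% trigger rate could arise.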
I also tested it under out-of-distribution settings, where the gains held up rather than only fitting the training benchmark.
Compared to DPO, it showed about a 6 percentage-point decrease; compared to SFT, about a 1 percentage-point decrease. Both of these results used only about 10% of the dataset, while DPO and SFT used the full dataset.
I think this means two things:
1) Sample-wise selective fitting helps the model generalize across the dataset.
2) A larger dataset does not always yield better performance.
GitHub link: genji970/hallucination-mitigation-via-contrastive-sampling-method (selective contrastive post-training for hallucination mitigation in LLMs — improves factuality with ~10% data).