After my last post analyzing ICLR scores, I am now looking into the reviews themselves.
They evaluated LLM-generated SQL code using a natural language metric instead of an execution metric, and when they tested it they found a false positive rate of around 20%. This is a major flaw. How is it even getting an oral?
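To show why this matters, here is a minimal sketch of how a text-similarity score and an execution check can disagree on the same pair of queries. Everything in it is my own illustration, not the paper's setup: the `users` table, the two queries, and the `token_overlap` and `execution_match` helpers are hypothetical, and the Jaccard token overlap is only a stand-in for whatever natural language metric the paper actually used.

```python
import sqlite3


def token_overlap(pred: str, gold: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (stand-in text metric)."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(p | g)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """True when both queries return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        # A predicted query that does not run counts as a miss.
        return False
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)


# Toy in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ann", 25), ("bob", 40), ("cy", 31)],
)

gold = "SELECT name FROM users WHERE age > 30"
pred = "SELECT name FROM users WHERE age < 30"  # one flipped operator

text_score = token_overlap(pred, gold)          # high: 7 of 9 distinct tokens shared
exec_ok = execution_match(conn, pred, gold)     # False: the queries return different rows

print(f"text overlap: {text_score:.2f}, execution match: {exec_ok}")
```

A single flipped comparison operator leaves the text score high while the returned rows are completely different, which is exactly the kind of case a text-only metric would count as correct.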