Everyone posts benchmarks and arena scores. I wanted to see if a local model could do something that makes actual money. So I took my Gemma 4 26B (IQ4_XS quant, running on a single 4090) and gave it a job: read 2,400 earnings call transcripts from the last 3 years and find language patterns that predict how the stock moves in the 5 days after the call.
Fine-tuned it on about 800 labeled transcripts. The labels were simple: did the stock beat or miss its sector over the following five trading days. The model's job wasn't price prediction. It was tagging sentences with forward-looking confidence scores and flagging specific language shifts, like when management switches between precise numbers and vague qualitative stuff.
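For anyone curious, the label construction is nothing fancy. A minimal sketch of the idea, not my exact code; the file name, tickers, and date here are made up, and it assumes you have daily closes for the stock plus a sector ETF as the benchmark:

```python
import pandas as pd

def five_day_label(stock: pd.Series, sector: pd.Series, call_date: str) -> int:
    """1 if the stock beats its sector benchmark over the 5 trading days after the call."""
    s = stock.loc[call_date:].iloc[:6]    # call day plus 5 trading days
    b = sector.loc[call_date:].iloc[:6]
    stock_ret = s.iloc[-1] / s.iloc[0] - 1
    sector_ret = b.iloc[-1] / b.iloc[0] - 1
    return int(stock_ret > sector_ret)

# Hypothetical usage: closes.csv holds one column of daily closes per ticker.
closes = pd.read_csv("closes.csv", index_col="date", parse_dates=True)
print(five_day_label(closes["AAPL"], closes["XLK"], "2023-08-03"))
```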
Inference on all 2,400 took about 14 hours. Not fast, but I only need to run this once a quarter, so whatever.
Found two things.
Signal A: the real one. When CFOs shift from giving specific guidance numbers to vaguer language in the outlook section ("we feel good about our trajectory" instead of "we expect revenue between X and Y"), the stock underperforms its sector by about 1.8% over the next 5 days. Tested on 600 out-of-sample transcripts. Information coefficient (IC) of 0.04. Tiny, but statistically significant, and with basically zero correlation to momentum, value, or any standard factor. That's the part that matters: it's not repackaging something that already exists.
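For those who haven't seen IC before: it's just the rank correlation between the signal and the forward sector-relative return, one pair per transcript. The whole out-of-sample check is a few lines (the arrays here are random placeholders, swap in real values):

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholders: one signal score and one 5-day sector-relative
# return per out-of-sample transcript (600 in my test).
signal = np.random.randn(600)        # swap in the model's vagueness-shift score
fwd_rel_ret = np.random.randn(600)   # swap in the realized forward relative return

ic, p_value = spearmanr(signal, fwd_rel_ret)
print(f"IC = {ic:.3f} (p = {p_value:.4f})")
```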
Signal B: the ghost. The model also found what looked like a much stronger pattern: "management confidence" in the prepared remarks section correlated with outperformance at IC 0.09. I got really excited for about two days. Then I regressed it against sector returns and the correlation was 0.85. Tech CEOs sound confident when tech is ripping. The model wasn't reading language patterns. It was picking up sector momentum through the backdoor of CEO tone.
Killed Signal B immediately. If I hadn't checked it against known factors, I'd probably be trading it right now, thinking I'd found some edge.
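The check that killed it is embarrassingly simple, which is kind of the point. A sketch with placeholder data; swap in the real signal and real factor return series aligned to the same observations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600  # out-of-sample observations

# Placeholders: replace with the actual signal and known factor returns.
factors = pd.DataFrame(rng.standard_normal((n, 3)),
                       columns=["sector_momentum", "value", "size"])
signal = pd.Series(rng.standard_normal(n))

def factor_correlations(sig: pd.Series, facs: pd.DataFrame) -> pd.Series:
    """Absolute correlation of a candidate signal with each known factor."""
    return facs.corrwith(sig).abs().sort_values(ascending=False)

# Signal B came out around 0.85 against sector returns. Anything that high
# means the "new" signal is an existing factor wearing a costume.
print(factor_correlations(signal, factors))
```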
Takeaway: local models are actually great for this. Running everything locally meant I could throw proprietary transcripts at it without worrying about sending them through someone else's API. That matters a lot in finance. But you absolutely have to sanity-check what the model finds against existing factors. It will find ghosts that look extremely convincing.
Next up, I'm trying to focus the model specifically on the Q&A section of earnings calls, where management is off script and the language is less rehearsed. I think that's where the real signal lives, but I haven't proven it yet.
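If anyone wants to try the same split, the section boundary is usually just a header line in the transcript, so a regex gets you most of the way. Rough sketch; the header patterns vary a lot by transcript provider, so treat these as guesses:

```python
import re

# Header patterns are assumptions; check what your transcript source actually uses.
QA_HEADER = re.compile(
    r"^\s*(questions?\s*(and|&)\s*answers?|q\s*&\s*a)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_qa(transcript: str) -> str:
    """Return everything after the first Q&A header, or '' if no header is found."""
    m = QA_HEADER.search(transcript)
    return transcript[m.end():].strip() if m else ""
```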
Anyone else using local models for financial text analysis? Curious what setups people are running and whether you've hit similar ghost signal problems.