Is there a way to automate optimization of llama.cpp arguments for the fastest inference (prompt processing and token generation speed)?
Maybe I just haven’t figured it out, but llama-bench seems cumbersome to use. I usually rely on llama-fit-params to identify the best split of a model across my GPUs and RAM, but llama-bench has no equivalent of llama-fit-params built in. And while I can paste the results of llama-fit-params into llama-bench, it’s a pain to re-adjust them every time I change the context window size.
Wondering if anyone has found a more flexible way to go about all this
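For what it's worth, one semi-automated approach I've been considering: llama-bench accepts comma-separated values for most parameters (e.g. `-ngl 20,33 -p 512 -n 128 -o json`) and runs every combination, so the sweep itself is easy; the remaining work is picking the winner out of the JSON output. A minimal sketch of that last step, assuming the output is a JSON array with per-run fields like `n_gpu_layers` and `avg_ts` (the sample data and exact field names are assumptions, check what your llama-bench version actually emits):

```python
import json

# Hypothetical sample of llama-bench `-o json` output; the field names
# ("n_gpu_layers", "n_prompt", "n_gen", "avg_ts") are assumptions about
# the format, not verified against a specific llama.cpp version.
SAMPLE = json.loads("""
[
  {"n_gpu_layers": 20, "n_prompt": 512, "n_gen": 0,   "avg_ts": 310.5},
  {"n_gpu_layers": 20, "n_prompt": 0,   "n_gen": 128, "avg_ts": 18.2},
  {"n_gpu_layers": 33, "n_prompt": 512, "n_gen": 0,   "avg_ts": 402.1},
  {"n_gpu_layers": 33, "n_prompt": 0,   "n_gen": 128, "avg_ts": 24.7}
]
""")

def best_config(results, key="n_gpu_layers"):
    """Sum prompt-processing and generation tokens/sec for each value of
    `key`, then return the value with the highest combined throughput."""
    totals = {}
    for run in results:
        totals[run[key]] = totals.get(run[key], 0.0) + run["avg_ts"]
    return max(totals, key=totals.get)

print(best_config(SAMPLE))  # with the sample above: 33
```

Summing the two throughput numbers is a crude scoring choice; weighting prompt processing vs. generation to match your actual workload would make more sense, but the parsing/selection structure is the same.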