koboldcpp-1.48
Harder Better Faster Stronger Edition
NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - "This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing."
This means a major speed increase for people like me who rely on (slow) CPU inference or big models. Consider a chatbot scenario: a long chat where old lines of dialogue need to be evicted from the context to stay within the (4096-token) context size. Previously, the context had to be recomputed starting from the first changed or now-missing token. This feature detects the change, deletes the affected tokens from the KV cache, and shifts the subsequent tokens in the cache so they can be reused, avoiding a computationally expensive recalculation.
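To make the mechanism concrete, here is a minimal Python sketch of the idea. It is not koboldcpp's actual code; `context_shift` and the token lists are hypothetical stand-ins for the real KV cache structures. It models the cache as the list of token ids whose attention state is already stored, finds the evicted span, and reports what survives versus what still needs a forward pass:

```python
# Hypothetical sketch of context shifting -- not koboldcpp's actual code.
# The "KV cache" is modeled as the list of token ids whose attention
# state is already computed and stored.

def context_shift(cached: list[int], prompt: list[int]) -> tuple[list[int], list[int]]:
    """Return (tokens whose cache entries survive, tokens still to evaluate)."""
    # 1. The shared prefix (e.g. the system prompt / memory) is always reusable.
    p = 0
    while p < len(cached) and p < len(prompt) and cached[p] == prompt[p]:
        p += 1
    # 2. Try evicting d tokens right after the prefix. If the remainder of
    #    the cache then lines up with the prompt, those entries can be kept:
    #    the real implementation shifts their positions left by d instead
    #    of recomputing them.
    for d in range(len(cached) - p + 1):
        kept = cached[p + d:]
        if kept == prompt[p:p + len(kept)]:
            return cached[:p] + kept, prompt[p + len(kept):]
    return cached[:p], prompt[p:]  # not reached: kept == [] always matches

if __name__ == "__main__":
    system, oldest, rest, new = [1, 2], [3, 4, 5], [6, 7, 8, 9], [10, 11]
    cached = system + oldest + rest    # what was processed last turn
    prompt = system + rest + new       # frontend dropped the oldest reply
    cache, to_eval = context_shift(cached, prompt)
    print(cache)    # [1, 2, 6, 7, 8, 9] -> reused via a KV shift
    print(to_eval)  # [10, 11]           -> only the new text is processed
```

In the real implementation this all happens inside the KV cache itself; the point is just that only the `to_eval` portion goes through the expensive forward pass.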
This is probably also related to recent advancements like StreamingLLM.
This won't help once text gets inserted "in the middle" or the prompt changes in some other way (see the contrast sketch below). But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga, and now I can have unlimited-length conversations without excessive waiting once the chat history hits max tokens and the frontend starts dropping text.
It's just a clever way to re-use the KV cache in one specific case. But I've wished for this for quite some time.
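For contrast, running the same hypothetical helper from above on a mid-context edit shows why the shift can't help there: nothing after the first changed token lines up anymore, so everything from that point gets reprocessed.

```python
cached = [1, 2, 3, 4, 5, 6]
prompt = [1, 2, 99, 4, 5, 6]   # one token edited in the middle
cache, to_eval = context_shift(cached, prompt)
print(cache)    # [1, 2]          -> only the untouched prefix survives
print(to_eval)  # [99, 4, 5, 6]   -> the rest must be recomputed
```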
I wasn't able to get good use out of the old 'Smartcontext' anyway, and it seems other people had the same problem. To me, this is a huge improvement. And it doesn't even need extra memory or anything.
I really like how the KoboldCPP dev(s) and the llama.cpp community constantly implement all the crazy stuff.