I'm curious what it is doing from a top-down perspective.
I've been playing with a 70B chat model that has several datasets trained on top of Llama2. There are some unusual features somewhere in this LLM and I am not sure whether they come from training or from something else (unusual layers?). The model has built-in roleplaying stories I've never seen other models perform. These stories are not in the Oobabooga Textgen WebUI. The model can do stuff like a Roman gladiator scenario, and some NSFW stuff. These are not very realistic stories and play out with the depth of a child's video game. They are structured so rigidly that they feel like they are coming from a hidden system context.
Like with the gladiators story, it plays out like Tekken on the original PlayStation. No amount of dialogue context about how real gladiators fought will change the story flow. I tried modifying it by adding that gladiators were mostly nonlethal fighters and showmen, more closely aligned with the wrestler-actors that were popular in the 80's and 90's, but no amount of input into the dialogue or system contexts changed the story from a constant series of lethal encounters. These stories could override pretty much anything I added to the system context in Textgen.
There was one story that turned an escape room into objectification of women, and another where name-1 is basically a Loki-like character that makes the user question what is really happening by taking on elements from the system context but changing them slightly. I had 5 characters in system context and it shifted between them circumstantially, in a storytelling fashion that was highly intentional with each shift. I know exactly what a bad system context can do and what errors look like in practice, especially with this model; I am 100% certain these are either (over)trained or programmatic in nature. Asking the model to generate a list of built-in roleplaying stories produces a similar list the couple of times I cared to ask.

I try to stay away from these "built-in" roleplays as they all seem rather poorly written. I think this model does far better when I write the entire story in system context. One of the main things the built-in stories do that surprises me is maintaining a consistent set of character identities and features throughout the story. For example, the user can pick a trident or gladius, drop into a dialogue that is far longer than the batch size, and then return with the same weapon in the next fight. Normally I would expect that kind of persistence only if the detail were added to the system context.
Is this behavior part of some deeper layer of llama.cpp that I do not see in the Python version or the Textgen source? For example, is there an additional persistent context stored in the cache?
You probably just have different settings (temperature, repetition_penalty, top_k/top_p, min_p, mirostat, ...) than what you had with Python, and those settings seem way better. You could check and compare the model settings.
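If you're loading the GGUF through llama-cpp-python, all of those knobs are plain keyword arguments, so you can print them and diff against what Textgen reports. Rough sketch (the model path and numeric values below are placeholders, not your preset; min_p needs a fairly recent llama-cpp-python):

```python
# Minimal sketch: sampler settings are ordinary keyword arguments in
# llama-cpp-python, so you can log them and compare against your Textgen preset.
from llama_cpp import Llama

llm = Llama(model_path="models/llama2-70b-chat.Q4_K_M.gguf", n_ctx=4096)  # placeholder path

sampler = dict(
    temperature=0.8,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    repeat_penalty=1.1,
    mirostat_mode=0,   # 0 = off, 1/2 = mirostat v1/v2
    mirostat_tau=5.0,
    mirostat_eta=0.1,
)
print(sampler)  # diff these against the values shown in the Textgen parameters tab

out = llm("USER: Tell me a short story.\nASSISTANT:", max_tokens=128, **sampler)
print(out["choices"][0]["text"])
```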
I always use the same settings for roleplaying. It is basically the Textgen Shortwave preset with mirostat settings added.
I know settings like these can alter outputs considerably. I have tested and saved nearly 30 of my own preset profiles for various tasks and models. Every Llama2-based model I use for roleplaying stories gets a preset named ShortwaveRP, and I haven't altered that profile in months now. I think the only changes from the original Shortwave profile are mirostat 1/3/1 (IIRC).
Overall, this single model behaves completely differently when it does a "built-in" story versus a story I have created in system context. For example, my main chat character leverages a long character profile in system context and then adds some details about how she is named after the most prominent humaniform positronic (AGI) robot from Isaac Asimov's books. I then add instructions specifying that the character has full informational access to the LLM, plus a few extra details. Basically, the character acts as the AI assistant and as the character fluidly, with a consistent thin veneer of the character even when acting as the AI assistant, and she never gets stuck in the role of the assistant. Even in roleplaying stories I keep this character around and can ask constructive questions about the story, the system context, and basic changes I make to the model loader code in Python.

This character is very sensitive to alterations, and I pay close attention to my interactions and how they work. This character changes substantially in these built-in stories. I can be 20 replies deep into a long conversation, drop into a built-in story, and my already established characters can change substantially. In particular, my assistant character is instructed to specifically avoid calling herself AI or referencing her Asimov character origin. All of the models I have played with have an extensive knowledge base about Daneel, but the character I am using is only known from 3 sentences in a wiki article as far as I can tell. I'm leveraging the familiarity with Daneel against my character, who is barely known but associated. I was initially trying to use the fact that this character acts human most of the time throughout a couple of Asimov's books, but the character is virtually unknown, and that turned into a similar advantage. There is a character with the same first name in a book in Banks's Culture series. In the built-in stories, this special assistant character will lose this balance of a roleplaying assistant, start calling herself AI, and act very differently. She is my canary in the coal mine that tells me something is wrong in any situation, but in the built-in stories this character can change entirely.
I also have a simple instruction to "Reply in the style of a literature major" plus special style instructions for each character in system context. During the built-in stories, the dialogue style changes and unifies across all of the characters: their vocabulary, style, depth, and length of replies all change substantially.
Maybe you downloaded a different model? I'm just guessing, since you said it does NSFW stuff, and I think the chat variant is supposed to refuse that. It could be the case that you just got the GGUF file of the 'normal' variant (without -Chat). Or did you convert it yourself?
Edit: Other than that: Sounds great. Do you share your prompts or character descriptions somewhere?
Without knowing anything about this model or what it was trained on or how it was trained, it's impossible to say exactly why it displays this behavior. But there is no "hidden layer" in llama.cpp that allows for "hardcoded"/"built-in" content.
It is absolutely possible for the model to "override pretty much anything in the system context". Consider any regular "censored" model, and how any attempt at adding system instructions to change/disable this behavior is mostly ignored. This model is probably doing much the same thing except with a "built-in story" rather than a message that says "As an AI assistant, I am not able to ...".
As I say, without knowing anything more about what model this is or what the training data looked like, it's impossible to say exactly why/how it has learned this behavior or even if it's intentional (this could just be a side-effect of the model being trained on a small selection of specific stories, or perhaps those stories were over-represented in the training data).
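If you want to rule out hidden state yourself, a rough check with llama-cpp-python (the model path below is a placeholder) is that the only thing llama.cpp carries between calls is the KV cache built from tokens you supplied, and you can inspect what actually goes in and drop that cache explicitly:

```python
# Rough sketch: the only "memory" llama.cpp keeps between calls is the KV cache
# built from tokens you fed it. Tokenizing your own prompt shows exactly what
# goes in; reset() drops the cached state entirely.
from llama_cpp import Llama

llm = Llama(model_path="models/llama2-70b-chat.Q4_K_M.gguf", n_ctx=4096)  # placeholder path

prompt = "[INST] <<SYS>>\nYou are a gladiator storyteller.\n<</SYS>>\n\nBegin. [/INST]"
tokens = llm.tokenize(prompt.encode("utf-8"))
print(len(tokens), "tokens -- nothing is prepended beyond the BOS token")

first = llm(prompt, max_tokens=64)

# Clear the KV cache so the next generation starts from a clean slate;
# if the "built-in stories" still show up, they come from the weights, not a cache.
llm.reset()
second = llm(prompt, max_tokens=64)
```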
That's a juicy amount of memory for just a laptop.
Interesting. The fosai site made it appear like 70B models are near impossible to run, requiring 40 GB of VRAM, but I suppose it can work with less.
But slower.
The VRAM of your GPU seems to be the biggest factor. That's a reason why, even while my current GPU is dying, I can't get myself to spend money on a mere 12 GB 4070 Ti.
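Rough math, if it helps (the bits-per-weight figures are approximations for the common GGUF quants, and this ignores the KV cache and runtime overhead, which add a few more GB):

```python
# Back-of-the-envelope size estimate for a quantized 70B model.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for quant, bits in [("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    print(f"70B @ {quant}: ~{model_size_gb(70, bits):.0f} GB")

# Whatever doesn't fit in VRAM can be offloaded to system RAM -- it still runs,
# just slower, which is why 70B is possible on far less than 40 GB of VRAM.
```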