I'm just dropping a small write-up of the setup I'm using with llama.cpp to run LLMs on a discrete GPU using CLBlast.
You could use Kobold instead, but it's meant more for role-playing stuff and I wasn't really interested in that. Funny thing is, Kobold can also be set up to use the discrete GPU if needed.
From the llama.cpp releases, pick the CLBlast version, which offloads some of the computation to the GPU. Unzip the download to a directory; I unzipped mine to "D:\Apps\llama".
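If you'd rather do the unzip from the shell as well, PowerShell's built-in Expand-Archive works; the zip filename below is just a placeholder, so substitute whatever the CLBlast release archive you downloaded is actually called:

```powershell
# Hypothetical archive name -- replace with the actual CLBlast release zip you downloaded
Expand-Archive -Path .\llama-bin-win-clblast-x64.zip -DestinationPath D:\Apps\llama
```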
You'll need an LLM now, and that can be obtained from Hugging Face or wherever else you'd like. Just note that it should be in GGML format; if in doubt, the models from Hugging Face will have "ggml" written somewhere in the filename. The ones I downloaded were "nous-hermes-llama2-13b.ggmlv3.q4_1.bin" and "Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin".
Move the models to the llama directory you made above. That makes life much easier.
You don't really need to navigate to the directory using Explorer. Just open PowerShell anywhere and cd to it:
cd D:\Apps\llama\
Here comes the fiddly part: you need to get the device IDs for the GPU. An easy way to check is to use GPU Caps Viewer: go to the OpenCL tab and check the dropdown next to "No. of CL devices".
The discrete GPU is normally listed second, after the integrated GPU. In my case the integrated GPU was gfx90c and the discrete one was gfx1031c.
In the PowerShell window, set the variables that tell llama.cpp which OpenCL platform and device to use. If you're using the AMD driver package, OpenCL is already installed, so you needn't uninstall or reinstall drivers and such.
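As a sketch, this is what that looks like in PowerShell. llama.cpp's CLBlast build reads the GGML_OPENCL_PLATFORM and GGML_OPENCL_DEVICE environment variables; the platform name and device index below are just what matched my AMD setup, so substitute whatever GPU Caps Viewer reported for yours:

```powershell
# Select the OpenCL platform (by name or index) and the device (by index).
# "1" assumes the discrete GPU is listed second, after the integrated one.
$env:GGML_OPENCL_PLATFORM = "AMD Accelerated Parallel Processing"
$env:GGML_OPENCL_DEVICE = "1"
```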
When running, replace the Wizard model with nous-hermes-llama2-13b.ggmlv3.q4_1.bin or whatever LLM you'd like. I like to play with 7B and 13B models with q4_0 or q5_0 quantization. You might need to trawl through the fora here to find parameters for temperature, etc. that work for you.
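For reference, a typical invocation looks something like this (a sketch, not the exact command I used; -ngl sets how many layers get offloaded to the GPU, and 32 is just an assumed starting point, as are the prompt and sampling values):

```powershell
# Run with some layers offloaded to the OpenCL device selected above
.\main.exe -m .\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin `
    -p "Hello, how are you?" -n 128 --temp 0.7 -ngl 32
```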
To check whether these work, I've posted the outputs at Pastebin, since formatting them here was a pain: https://pastebin.com/peSFyF6H
salient features @ gfx1031c (6800M discrete graphics):
llama_print_timings: load time = 60188.90 ms
llama_print_timings: sample time = 3.58 ms / 103 runs ( 0.03 ms per token, 28770.95 tokens per second)
llama_print_timings: prompt eval time = 7133.18 ms / 43 tokens ( 165.89 ms per token, 6.03 tokens per second)
llama_print_timings: eval time = 13003.63 ms / 102 runs ( 127.49 ms per token, 7.84 tokens per second)
llama_print_timings: total time = 622870.10 ms
salient features @ gfx90c (cezanne architecture integrated graphics):
llama_print_timings: load time = 26205.90 ms
llama_print_timings: sample time = 6.34 ms / 103 runs ( 0.06 ms per token, 16235.81 tokens per second)
llama_print_timings: prompt eval time = 29234.08 ms / 43 tokens ( 679.86 ms per token, 1.47 tokens per second)
llama_print_timings: eval time = 118847.32 ms / 102 runs ( 1165.17 ms per token, 0.86 tokens per second)
llama_print_timings: total time = 159929.10 ms
I was going to cheat here and have ChatGPT do this, but the arithmetic is simple enough: the sample, prompt eval, and eval times printed by llama_print_timings are already totals (the per-token figures are in the parentheses), so they can just be added up. Hope it makes sense. I'm just excited to share this!
########## Integrated GPU #########
Total inference time = Load time + Sample time + Prompt eval time + Eval time
Total inference time = 26205.90 ms + 6.34 ms + 29234.08 ms + 118847.32 ms
Total inference time = 174293.64 ms
So, the total inference time is approximately 174293.64 ms.
########## Discrete GPU 6800M #########
Total inference time = Load time + Sample time + Prompt eval time + Eval time
Total inference time = 60188.90 ms + 3.58 ms + 7133.18 ms + 13003.63 ms
Total inference time = 80329.29 ms
So, the total inference time is approximately 80329.29 ms.
#####################################
Taking the difference, Discrete - Integrated: 93964.35 ms.
That makes the discrete GPU about 54% faster, or about 1.5 minutes faster: the integrated GPU takes close to 174 seconds and the discrete one finishes in about 80 seconds.
I do think that adding more RAM at some point could help improve the loading times, since the laptop currently has only about 16 GB of RAM.