RAG local models

by Lars Gabriel Annell Rydenvald -

I don't remember the name of the TA I discussed this with this morning, but I finally managed to make RAG work with ollama: https://github.com/LGabAnnell/inlp-tp/blob/master/main.py

The difference between models is quite funny:

- mistral:latest doesn't call the tool unless told to multiple times, and instead invents document source ids (and with them the answer) to satisfy the system prompt

- qwen3:latest does OK but ignores the system prompt's formatting instructions (it gives me Markdown and doesn't always cite sources), though the tool is called and the information is mostly pulled correctly

- gpt-oss:latest sometimes works, but often just calls the tool repeatedly until the context window fills up, at which point it gives up and says "Hello, how may I help you today"...
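For anyone who wants to try this without reading the whole repo, the general shape of a tool-calling RAG loop with the ollama Python client looks roughly like the sketch below. The retrieve() function, the document contents, the model tag and the system prompt are placeholders of mine, not the actual code from main.py.

```python
# Minimal sketch of a tool-calling RAG loop with the ollama Python client
# (pip install ollama). retrieve(), the document store and the prompts are
# placeholders, not the code from the linked main.py.
import ollama

MODEL = "qwen3:latest"  # placeholder tag, swap for whichever model you pulled

def retrieve(query: str) -> str:
    """Hypothetical retriever: return passages tagged with their source ids."""
    # A real setup would query a vector store or search index here.
    docs = {"doc-1": "Placeholder passage relevant to the query."}
    return "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs.items())

TOOLS = [{
    "type": "function",
    "function": {
        "name": "retrieve",
        "description": "Search the indexed documents and return passages with source ids.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "Search query"}},
            "required": ["query"],
        },
    },
}]

messages = [
    {"role": "system",
     "content": "Answer only from retrieved passages and cite their source ids in plain text."},
    {"role": "user", "content": "What does the corpus say about tokenization?"},
]

# First pass: the model may decide to call the retrieve tool.
response = ollama.chat(model=MODEL, messages=messages, tools=TOOLS)

if response.message.tool_calls:
    messages.append(response.message)
    for call in response.message.tool_calls:
        if call.function.name == "retrieve":
            result = retrieve(**call.function.arguments)
            messages.append({"role": "tool", "content": result, "name": "retrieve"})
    # Second pass: the model answers from the tool output.
    response = ollama.chat(model=MODEL, messages=messages)

print(response.message.content)
```

The interesting part is the second chat() call, which is where the three models above diverge in how well they actually use the tool output.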

In reply to Lars Gabriel Annell Rydenvald

Re: RAG local models

by Martin Rajman -
Hi Lars, Yannick here, the TA from Wednesday.

Nice to hear that you tested different models. I wonder if the 4-bit quantization you seem to be using is partly responsible for the lack of instruction following.

In my experience, below the 24B-parameter mark I couldn't get a model to be useful for RAG purposes. Below that, the models are more useful as classifiers or as assistants for PR on an X account, for instance. Maybe one could use them on smartphones to automatically add tasks to a calendar or something of that kind, but I am not even convinced of that.

I guess that small models are just toys we can play with to experiment, but I don't see much value in them as products.
Indeed, when you can access the top-end models for the price they cost, I wouldn't bother risking 100 times more errors for a small reduction in cost.

If you want to see the difference with a bigger model, just go to the DeepSeek chat or any free experimental version from an LLM provider, give it the search engine results plus the question, and you will easily see the staggering difference. Furthermore, trying the same model in 4-bit quantization and in raw 16-bit would give you a taste of the performance loss.
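If you want to run that quantization comparison locally rather than against a hosted model, something like the sketch below works with the ollama Python client. The tag names are an assumption on my part; check the Ollama model library (or `ollama list`) for the quantization tags that actually exist for your model.

```python
# Rough sketch: same prompt, same model family, two quantization levels.
# The tags (q4_0 / fp16) are assumptions; check the Ollama model library
# for the tags available for the model you actually use.
import ollama

PROMPT = (
    "Using only the passages below, answer the question and cite source ids.\n\n"
    "[doc-1] <retrieved passage here>\n\n"
    "Question: <your question here>"
)

for tag in ["mistral:7b-instruct-q4_0", "mistral:7b-instruct-fp16"]:
    response = ollama.chat(
        model=tag,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"=== {tag} ===")
    print(response.message.content)
```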

Anyway, congrats on going the difficult route of local LLMs, that is impressive!