In brief
The author builds a home chatbot with RAG over household knowledge — pool maintenance, HVAC, appointments, and more. Before vector search, each question passes through a classifier on Qwen 3 0.6B (600M parameters). Prompting alone scored ~10% accuracy; after fine-tuning with Unsloth and switching outputs to two-letter codes, accuracy reached ~92%.
What happened
The vector index carries metadata categories (pool, hvac, cooking, etc.). A query like “When did we replace the pool pump?” should map to pool first, then search only matching entries — narrowing the space improves RAG quality.
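The classify-then-filter idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory index of (text, category, embedding) triples with made-up three-dimensional vectors; the author's actual vector store and embedding model are not described here.

```python
from math import sqrt

# Hypothetical toy index: (text, category, embedding). All values are invented.
INDEX = [
    ("Pool pump replaced in June", "pool", [0.9, 0.1, 0.0]),
    ("Pool filter cleaned", "pool", [0.7, 0.3, 0.0]),
    ("HVAC filter changed", "hvac", [0.8, 0.2, 0.1]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, category, top_k=2):
    """Keep only entries whose metadata matches the category, then rank them."""
    candidates = [entry for entry in INDEX if entry[1] == category]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e[2]), reverse=True)
    return [text for text, _, _ in ranked[:top_k]]


results = search([1.0, 0.0, 0.0], category="pool")
```

Without the category filter, the HVAC entry would compete with the pool entries on similarity alone; with it, that entry never reaches the ranking step.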
Qwen 3 4B handles general answers; Qwen 3 0.6B runs locally via Ollama for classification. The experiment asks whether ~850 labeled question–category pairs plus QLoRA fine-tuning in Unsloth are enough to turn a 600M-parameter model into a reliable classifier.
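A local classification call through Ollama might look like the sketch below. It uses Ollama's `/api/generate` HTTP endpoint on the default local port; the model tag `qwen3:0.6b` and the prompt wording are assumptions, not the author's actual setup, and a fine-tuned model would be registered under its own name.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3:0.6b"  # assumed model tag; a fine-tuned model would use a different name


def build_payload(question, categories):
    """Build the JSON body for a single non-streaming generation request."""
    prompt = (
        "Classify the question into exactly one category.\n"
        f"Allowed categories: {', '.join(categories)}\n"
        f"Question: {question}\n"
        "Category:"
    )  # hypothetical prompt, for illustration only
    return {"model": MODEL, "prompt": prompt, "stream": False}


def classify(question, categories):
    """Send the request to a running Ollama server and return the raw text reply."""
    body = json.dumps(build_payload(question, categories)).encode("utf-8")
    req = request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```

`classify` needs a running Ollama server with the model pulled; `build_payload` can be inspected on its own. The raw reply still has to be validated, since nothing forces the model to stay within the allowed list.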
The baseline uses a strict prompt with 19 allowed category names. On 131 integration tests, only 13 were correct (~10%). Failures included over-broad labels (electric), invented categories (apartments), and truncated outputs (ac instead of hvac).
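A baseline run of this kind can be scored with a small harness that separates wrong answers from answers outside the allowed set. The sketch below is an assumption about how such a check could look; `ALLOWED` lists only the category names mentioned in this summary, not the full set of 19.

```python
# Partial list: only the category names that appear in this summary.
ALLOWED = {"pool", "hvac", "cooking", "water heater", "fountain", "gutters", "mosquito"}


def evaluate(predictions, expected, allowed=ALLOWED):
    """Return accuracy plus the number of outputs that are not valid categories."""
    correct = 0
    invalid = 0
    for pred, gold in zip(predictions, expected):
        label = pred.strip().lower()
        if label not in allowed:
            invalid += 1  # invented or malformed label such as "apartments" or "ac"
        elif label == gold:
            correct += 1
    total = len(expected)
    return {"accuracy": correct / total if total else 0.0, "invalid": invalid}


report = evaluate(
    ["pool", "ac", "apartments", "HVAC "],
    ["pool", "hvac", "pool", "hvac"],
)
```

Tracking invalid outputs separately shows whether the model misunderstands the question or simply fails to follow the output format. For reference, the reported baseline of 13 correct out of 131 works out to about 9.9%.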
The first fine-tuning pass raised accuracy to ~79% but left semantic confusion among water-related topics (pool, water heater, fountain). The second pass trains the model to emit fixed two-letter codes (KK = hvac, OO = pool) instead of category names, so the output text carries no overlapping meaning. Accuracy reached ~92% (120/131). The remaining errors mostly confuse water heater with pool, or gutters with mosquito.
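The code scheme amounts to a lookup table applied in both directions: category to code when building training targets, code to category when reading model output. In the sketch below, only `KK` and `OO` come from the article; the `WW` entry is a hypothetical placeholder, and the function names are illustrative.

```python
CODE_TO_CATEGORY = {
    "KK": "hvac",          # from the article
    "OO": "pool",          # from the article
    "WW": "water heater",  # hypothetical code, for illustration only
}
CATEGORY_TO_CODE = {category: code for code, category in CODE_TO_CATEGORY.items()}


def to_training_target(category):
    """Convert a human-readable label into the code the model is trained to emit."""
    return CATEGORY_TO_CODE[category]


def decode(raw_output):
    """Map raw model output to a category, or None if the code is unknown."""
    code = raw_output.strip().upper()[:2]
    return CODE_TO_CATEGORY.get(code)
```

Returning `None` for an unknown code lets the application fall back to an unfiltered search instead of trusting a bad label.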
Why it matters
Pre-RAG classification is a cheap way to improve retrieval quality without bloating the embedding index. A tiny model costs less than routing every query through a 4B+ model, and you can retrain it when categories change.
The post shows that output format often beats “one more paragraph in the prompt”: non-overlapping fixed codes stabilize tiny LLM generation better than 19 free-form label names. The pattern applies to request routing, support triage, and log filtering anywhere you need a lightweight classifier.
The author also describes a feedback loop: user corrections can feed the next training round without rewriting the pipeline.
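One way such a feedback loop could work is to append each correction to a JSON Lines file and read it back as extra training pairs. The file layout, field names, and function names below are assumptions for illustration; the author's actual storage format is not described here.

```python
import json


def record_correction(path, question, predicted, corrected):
    """Append one user correction as a JSON line (hypothetical file layout)."""
    row = {"question": question, "predicted": predicted, "corrected": corrected}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def load_training_pairs(path):
    """Read the corrections back as (question, category) pairs for retraining."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                pairs.append((row["question"], row["corrected"]))
    return pairs
```

Because corrections arrive in the same question and category shape as the original labels, they can be merged into the training set before the next fine-tuning run.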
In practice
- Establish a baseline: run the raw model on a held-out test set before any fine-tuning.
- Dataset: ~850 examples with a 70/15/15 train/eval/test split; balance rare categories.
- Unsloth + QLoRA: default hyperparameters are often enough; label quality matters more than early tuning.
- Output format: try fixed-length opaque codes instead of class names if the model truncates or confuses similar words.
- Post-processing: map codes to categories in application code; normalize synonyms (ac→hvac) if needed.
- Alternative: a follow-up article uses logistic regression on embeddings; classical ML may be simpler for some tasks.
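The 70/15/15 split from the list above can be done per category so that rare categories are spread across the splits rather than landing in one by chance. This is a sketch under that assumption; the summary does not say whether the author's split was stratified, and `stratified_split` is an illustrative name.

```python
import random
from collections import defaultdict


def stratified_split(examples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split (question, category) pairs into train/eval/test, category by category."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_category = defaultdict(list)
    for question, category in examples:
        by_category[category].append((question, category))

    train, eval_set, test = [], [], []
    for items in by_category.values():
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_eval = round(len(items) * ratios[1])
        train += items[:n_train]
        eval_set += items[n_train:n_train + n_eval]
        test += items[n_train + n_eval:]  # the remainder goes to the test split
    return train, eval_set, test


# Synthetic data: 20 questions for each of two categories.
examples = [(f"question {i}", "pool") for i in range(20)]
examples += [(f"question {i}", "hvac") for i in range(20, 40)]
train, eval_set, test = stratified_split(examples)
```

With very small categories, rounding can leave the eval or test share empty, which is one practical reason to collect more examples for rare categories.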
| Stage | Accuracy (131 tests) |
|---|---|
| Prompt only | ~10% |
| Fine-tune, category names | ~79% |
| Fine-tune, two-letter codes | ~92% |
Takeaway
This home experiment is a clear case for “small LLM as a narrow tool”: it does not replace a larger model for answers, but reliably routes queries by metadata. If you build local RAG, budget a lightweight classifier and experiment with label format before inflating prompts.