
Fine-tuning Qwen 3 0.6B: from 10% to 92% question classification accuracy

A home chatbot experiment: fine-tune a tiny local LLM with Unsloth and two-letter category codes instead of label names for metadata-aware RAG.

In brief

The author builds a home chatbot with RAG over household knowledge — pool maintenance, HVAC, appointments, and more. Before vector search, each question passes through a classifier on Qwen 3 0.6B (600M parameters). Prompting alone scored ~10% accuracy; after fine-tuning with Unsloth and switching outputs to two-letter codes, accuracy reached ~92%.

What happened

The vector index carries metadata categories (pool, hvac, cooking, etc.). A query like “When did we replace the pool pump?” should map to pool first, then search only matching entries — narrowing the space improves RAG quality.
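The filter-then-search idea can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: the `filtered_search` helper, the entry layout, and the three-dimensional toy vectors are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(query_vec, entries, category, top_k=3):
    """Rank only the entries whose metadata category matches (hypothetical helper)."""
    candidates = [e for e in entries if e["category"] == category]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return ranked[:top_k]

# Toy index: texts and vectors are made up for illustration.
entries = [
    {"text": "pool pump note", "category": "pool", "vector": [0.9, 0.1, 0.0]},
    {"text": "pool filter note", "category": "pool", "vector": [0.7, 0.3, 0.0]},
    {"text": "hvac filter note", "category": "hvac", "vector": [0.8, 0.2, 0.1]},
]

# The classifier's answer ("pool") restricts the search before any ranking happens.
hits = filtered_search([1.0, 0.0, 0.0], entries, category="pool", top_k=2)
for hit in hits:
    print(hit["category"], "-", hit["text"])
```

The hvac entry is never scored here, even though its vector is close to the query; that is the narrowing effect the classifier is meant to provide.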

Qwen 3 4B handles general answers; Qwen 3 0.6B runs locally via Ollama for classification. The experiment asks whether ~850 labeled question–category pairs plus QLoRA in Unsloth is enough for a 600M model.

The baseline uses a strict prompt with 19 allowed category names. On 131 integration tests, only 13 were correct (~10%). Failures included over-broad labels (electric), invented categories (apartments), and truncated outputs (ac instead of hvac).
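A baseline run like this is easier to read when wrong answers are split into "valid label, wrong category" and "label not in the allowed list". The sketch below assumes a hypothetical `score` helper and a partial allowed-label set (the full list of 19 names is not given here); the sample predictions are illustrative only.

```python
from collections import Counter

# Hypothetical subset of the 19 allowed names; the full list is not given here.
ALLOWED = {"pool", "hvac", "cooking", "water heater", "gutters"}

def score(predictions, expected, allowed=ALLOWED):
    """Return accuracy plus a count of each outcome type."""
    outcomes = Counter()
    for pred, gold in zip(predictions, expected):
        label = pred.strip().lower()
        if label == gold:
            outcomes["correct"] += 1
        elif label in allowed:
            outcomes["wrong_label"] += 1   # valid name, wrong category
        else:
            outcomes["not_allowed"] += 1   # invented or truncated output
    accuracy = outcomes["correct"] / len(expected) if expected else 0.0
    return accuracy, outcomes

# Illustrative outputs, including a truncated label and an invented one.
preds = ["pool", "ac", "apartments", "Pool "]
gold = ["pool", "hvac", "gutters", "water heater"]
accuracy, outcomes = score(preds, gold)
print(f"accuracy={accuracy:.0%}", dict(outcomes))

# The reported baseline: 13 correct out of 131 tests.
print(f"baseline={13 / 131:.1%}")
```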

The first fine-tuning pass raised accuracy to ~79% but left semantic confusion among water-related topics (pool, water heater, fountain). The second pass trains the model to emit fixed two-letter codes (KK = hvac, OO = pool) that carry no meaning of their own and so cannot overlap semantically the way category names do. Accuracy reached ~92% (120/131). The remaining errors mostly confuse water heater with pool or gutters with mosquito.
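Decoding such codes in application code might look like the sketch below. Only `KK` and `OO` come from the post; the other two codes, the `decode` helper, and the choice to reject anything that is not an exact known code are assumptions for illustration.

```python
# KK and OO come from the post; the other codes are hypothetical placeholders.
CODE_TO_CATEGORY = {
    "KK": "hvac",
    "OO": "pool",
    "ZZ": "water heater",  # hypothetical
    "QQ": "gutters",       # hypothetical
}

# Inverse map, useful when building training targets from labeled categories.
CATEGORY_TO_CODE = {category: code for code, category in CODE_TO_CATEGORY.items()}

def decode(raw_output):
    """Map raw model text to a category, or None if it is not a known code."""
    code = raw_output.strip().upper()
    return CODE_TO_CATEGORY.get(code)

print(decode("OO"))      # known code
print(decode(" kk\n"))   # whitespace and case are normalized first
print(decode("pool"))    # a label name is not a code, so it is rejected
```

Returning `None` for anything outside the table lets the caller fall back to an unfiltered search instead of trusting a malformed output.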

Why it matters

Pre-RAG classification is a cheap way to boost recall without bloating the embedding index. A tiny model costs less than routing every query through 4B+, and you can retrain when categories change.

The post shows that output format often beats “one more paragraph in the prompt”: non-overlapping fixed codes stabilize tiny LLM generation better than 19 free-form label names. The pattern applies to request routing, support triage, and log filtering anywhere you need a lightweight classifier.

The author also describes a feedback loop: user corrections can feed the next training round without rewriting the pipeline.
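One way such a feedback loop could work is to merge corrections into the labeled set before the next training round. The sketch below is an assumption about the mechanics, not the author's code: `merge_corrections` and the row layout are hypothetical, and a correction simply overrides any earlier label for the same question.

```python
def merge_corrections(dataset, corrections):
    """Return a new dataset where user corrections override earlier labels."""
    # Key on a normalized question so the same question is not stored twice.
    by_question = {row["question"].strip().lower(): dict(row) for row in dataset}
    for fix in corrections:
        key = fix["question"].strip().lower()
        by_question[key] = {"question": fix["question"], "category": fix["category"]}
    return list(by_question.values())

# Illustrative rows only.
dataset = [
    {"question": "When did we replace the pool pump?", "category": "pool"},
    {"question": "Why is the shower water cold?", "category": "pool"},
]
corrections = [
    {"question": "Why is the shower water cold?", "category": "water heater"},
    {"question": "When were the gutters cleaned?", "category": "gutters"},
]

merged = merge_corrections(dataset, corrections)
for row in merged:
    print(row["category"], "-", row["question"])
```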

In practice

  1. Establish a baseline — run the raw model on a held-out test set before any fine-tuning.
  2. Dataset — ~850 examples with a 70/15/15 train/eval/test split; balance rare categories.
  3. Unsloth + QLoRA — default hyperparameters are often enough; label quality matters more than early tuning.
  4. Output format — try fixed-length opaque codes instead of class names if the model truncates or confuses similar words.
  5. Post-processing: map codes to categories in application code; normalize synonyms (ac → hvac) if needed.
  6. Alternative — a follow-up article uses logistic regression on embeddings; classical ML may be simpler for some tasks.

Stage                          Accuracy (131 tests)
Prompt only                    ~10%
Fine-tune, category names      ~79%
Fine-tune, two-letter codes    ~92%
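The 70/15/15 split from step 2 can be done per category so that rare categories appear in every split. The sketch below is one possible approach under stated assumptions: `stratified_split` is a hypothetical helper, the rows are synthetic, and rounding per category means the overall proportions are only approximate for small groups.

```python
import random
from collections import defaultdict

def stratified_split(rows, train=0.70, eval_=0.15, seed=0):
    """Split rows into train/eval/test, keeping each category's share roughly equal."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)

    splits = {"train": [], "eval": [], "test": []}
    for cat_rows in by_category.values():
        rng.shuffle(cat_rows)
        n_train = round(len(cat_rows) * train)
        n_eval = round(len(cat_rows) * eval_)
        splits["train"] += cat_rows[:n_train]
        splits["eval"] += cat_rows[n_train:n_train + n_eval]
        splits["test"] += cat_rows[n_train + n_eval:]  # remainder goes to test
    return splits

# Synthetic data: 20 questions in each of two categories.
rows = [{"question": f"q{i}", "category": cat} for cat in ("pool", "hvac") for i in range(20)]
splits = stratified_split(rows)
print({name: len(part) for name, part in splits.items()})
```

A fixed seed keeps the test set stable between training rounds, which matters when comparing accuracy across label formats.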

Takeaway

This home experiment is a clear case for “small LLM as a narrow tool”: it does not replace a larger model for answers, but reliably routes queries by metadata. If you build local RAG, budget a lightweight classifier and experiment with label format before inflating prompts.