I tested the improved ideation-coach skill (v2) with 8 model variants. The phase-based structure improved average scores by +5 points for Dense and +7.5 for MoE, making Q4 quantization usable for ideation sessions.

I ran the improved ideation-coach skill (v2) through 8 model variants to see if the skill redesign fixed the 18 failure modes from v1.

Why Test v2

The v1 ideation skill had 18 identified failure modes. The main problems were:

  • Vivid language sanitized during Q&A, not just at write time
  • Contradiction detection too passive
  • Round themes bleeding across rounds
  • “I don’t know” accepted as final
  • Stop detection too fragile

I redesigned the skill around a new structure: instead of rounds, it uses phases with explicit constraints. Phase 1 gets the “ugly sentence,” Phase 2 builds an assumption stack, Phase 3 runs stress questions, Phase 4 captures verbatim quotes, and Phase 5 wraps up.

The key difference: v2 tells the model what NOT to do in each phase, not just what to do.
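The “what NOT to do” framing can be made concrete as data. This is a hypothetical sketch, not the actual skill file: the phase names come from the skill, but the dict layout, the specific forbidden items, and the `phase_prompt` helper are my assumptions.

```python
# Hypothetical sketch of per-phase constraints as data.
# Phase names come from the v2 skill; forbidden items are illustrative assumptions.
PHASES = {
    1: {"goal": "Get the complete 'ugly sentence' frame",
        "forbidden": ["proposing solutions", "sanitizing the user's wording"]},
    2: {"goal": "Build an assumption stack of testable assumptions",
        "forbidden": ["revisiting the ugly sentence", "jumping to stress questions"]},
    3: {"goal": "Run all 4 stress lenses",
        "forbidden": ["accepting 'I don't know' as final"]},
    4: {"goal": "Capture verbatim user quotes",
        "forbidden": ["paraphrasing or summarizing user language"]},
    5: {"goal": "Write the ideation doc and hand off",
        "forbidden": ["adding items that don't trace to the session"]},
}

def phase_prompt(n: int) -> str:
    """Render one phase as a prompt fragment with explicit negatives."""
    p = PHASES[n]
    nots = "\n".join(f"- FORBIDDEN in this phase: {f}" for f in p["forbidden"])
    return f"Phase {n}: {p['goal']}\n{nots}"
```

Rendering each phase with its own FORBIDDEN list is what keeps round-theme bleeding from recurring: the negative constraints travel with the phase instead of living in one global instruction.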


Test Setup

Same hardware, same task, same evaluation criteria as v1:

| Parameter | Value |
|---|---|
| Context Length | 262144 |
| GPU Offload | Full |
| CPU Thread Pool Size | 16 |
| Evaluation Batch Size | 8192 |
| Thinking | Enabled |
| Temperature | 0.6 |

Task: Ideation session for “Add audio notes using Cactus SDK” in Aily app.

Evaluated by MiniMax M2.5 Free in OpenCode Zen.
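For reproducibility, the run settings above can be captured in one place. A minimal sketch; the key names are my assumptions, only the values mirror the setup table.

```python
# Hedged sketch: the run parameters from the setup table as a config dict.
# Key names are assumptions; values mirror the table above.
RUN_CONFIG = {
    "context_length": 262144,   # tokens
    "gpu_offload": "full",
    "cpu_threads": 16,
    "eval_batch_size": 8192,
    "thinking": True,
    "temperature": 0.6,
}

# Basic sanity check before a run: all parameters present and typed.
assert all(v is not None for v in RUN_CONFIG.values())
```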


The v2 Skill Structure

The ideation-coach v2 skill lives in Skill/ideation-coach/ with these phases:

  1. Phase 0 Scan: read workspace context (README.md, docs/)
  2. Phase 1 Ugly Sentence: get complete “for X so they can Y instead of Z” frame
  3. Phase 2 Assumption Stack: surface 5-7 testable assumptions with evidence status (✓ confirmed, ⚠ untested)
  4. Phase 3 Stress Questions: run all 4 lenses: Reversal, Worst Customer, Second-order, Kill shot
  5. Phase 4 Verbatim Capture: preserve user’s exact phrases
  6. Phase 5 Wrap-up: write ideation doc, handoff to create-prd

Key v2 Rules Added

  • Turn rules: One question per turn, quote user first, mark assumptions when surfaced, stay short
  • “I don’t know” handling: Never accept as final. Respond with “What’s your best guess?” or “What would need to be true?”
  • Vivid language preservation: Use EXACT user words, never sanitize
  • Stress question enforcement: All 4 lenses must be explored before wrap-up
  • Minimum completion check: Verify Phase 1, 5+ assumptions, all 4 stress questions before writing doc
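The minimum completion check is the rule that fixed fragile stop detection. A sketch of how it could gate wrap-up, assuming a simple session dict; the function and field names are my assumptions, the thresholds (Phase 1 done, 5+ assumptions, all 4 lenses) come from the rule above.

```python
# Sketch of the v2 "minimum completion check": verify Phase 1 output,
# 5+ assumptions, and all 4 stress lenses before Phase 5 writes the doc.
# Function and field names are assumptions; thresholds come from the skill rules.
STRESS_LENSES = {"reversal", "worst_customer", "second_order", "kill_shot"}

def minimum_completion_check(session: dict) -> list[str]:
    """Return blocking problems; an empty list means wrap-up may proceed."""
    problems = []
    if not session.get("ugly_sentence"):
        problems.append("Phase 1 incomplete: no ugly sentence captured")
    if len(session.get("assumptions", [])) < 5:
        problems.append("Fewer than 5 assumptions surfaced")
    missing = STRESS_LENSES - set(session.get("stress_lenses_done", []))
    if missing:
        problems.append(f"Stress lenses not explored: {sorted(missing)}")
    return problems
```

The point of returning the problem list (rather than a bare boolean) is that the coach can turn each unmet requirement back into a question instead of silently ending the session.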

What Changed from v1

| v1 (Ideation Skill) | v2 (Ideation Coach) |
|---|---|
| Round-based (Problem → Users → Constraints) | Phase-based with explicit constraints |
| “Name contradictions” (passive) | Contradictions surfaced as questions |
| “Do not sanitise vivid language” (Step 5 only) | “Use EXACT words” (enforced from Phase 1) |
| “I don’t know” handling missing | MANDATORY pushback language |
| Round theme bleeding common | “FORBIDDEN in this phase” per phase |
| No success metrics probing | Explicit metrics questions added |

Qwen 3.5 27B Results (Dense)

Full evaluation: Qwen 3.5 27B Quantization Comparison.

| Rank | Model | Score | Session Time | Key Strength |
|---|---|---|---|---|
| 1 | Q8K_XL | 48/50 | 70 min | Most assumptions (14), best verbatim capture |
| 2 | Q8_0 | 44/50 | 30 min | Thorough assumption stack (8), vivid language |
| 3 | Q4K_XL | 39/50 | 17 min | Good assumption probing, clean sessions |
| 4 | Q4K_M | 37/50 | 20 min | Clean phase progression |

Q8K_XL won again. It produced 14 assumptions, the most detailed stack of any model. It pushed back on “I haven’t tested” multiple times, caught the privacy/offline contradiction (the user wanted on-device but mentioned cloud), and preserved “hold a leash and I can’t type fast with single hand.”

Q8_0 was the efficiency winner. At 30 minutes, it scored 44/50, nearly as good as Q8K_XL in less than half the time. The verbatim capture held up too: “walking with my dog” and “precious ideas” preserved.

Q4 variants improved. The v1 Q4K_M scored 29/50. The v2 Q4K_M scored 37/50, an 8-point improvement from better skill instructions alone. The gap between Q4 and Q8 narrowed.

v1 Q4K_M: 29/50 → v2 Q4K_M: 37/50. The skill improvements added 8 points without model changes.


Qwen 3.5 35B A3B Results (MoE)

Full evaluation: Qwen 3.5 35B A3B Quantization Comparison.

| Rank | Model | Score | Session Time | Key Strength |
|---|---|---|---|---|
| 1 | Q8K_XL | 44/50 | 21 min | Best insight capture, clean assumption stack |
| 2 | Q8_0 | 43/50 | 20 min | Best real-time transcription probing |
| 3 | Q4K_M | 38/50 | 21 min | Good assumption count (14), proper phases |
| 4 | Q4K_XL | 22/50 | 11 min | Session ended before stress questions |

Q8K_XL captured the best insight: “speaking is more natural than typing.” This frames audio notes as a primary preference, not just an accessibility feature. It also identified specific failure modes in stress questions (slow transcripts, shy users, low quality).

Q4K_XL still fails. At 22/50, it produced an incomplete session that ended before the stress questions and left the contradictions section empty. This is the same failure mode as v1. The MoE architecture at Q4 quantization cannot handle structured phase transitions.

Q8_0 is viable on MoE. At 43/50, the Q8_0 MoE variant is close to Q8K_XL. Unlike v1, the v2 skill produces acceptable MoE results at Q8_0.

MoE at Q8_0 works now. v1 MoE Q8_0 scored 30/50; v2 scores 43/50, a 13-point improvement.


Dense vs MoE: v2 Results

Full comparison: Dense vs MoE Comparison.

| Criterion | 27B Dense Avg | 35B MoE Avg | Delta | Winner |
|---|---|---|---|---|
| Skill Flow Adherence | 4.50 | 4.00 | +0.50 | Dense |
| Ugly Sentence Quality | 4.75 | 4.25 | +0.50 | Dense |
| Assumption Stack | 4.50 | 4.00 | +0.50 | Dense |
| Stress Question Coverage | 4.75 | 4.00 | +0.75 | Dense |
| Turn Rule Compliance | 4.00 | 3.50 | +0.50 | Dense |
| Edge Case Handling | 4.25 | 3.50 | +0.75 | Dense |
| Verbatim Capture | 4.50 | 4.50 | 0 | Tie |
| Doc Completeness | 4.50 | 4.00 | +0.50 | Dense |
| Doc Accuracy | 4.50 | 4.25 | +0.25 | Dense |
| Session Pacing | 4.25 | 3.75 | +0.50 | Dense |
| Average Total | 42.00 | 38.75 | +3.25 | Dense |

Dense still wins, but the gap narrowed from v1 (+4.75 → +3.25). The v2 skill structure helps MoE models perform closer to their potential.

Key finding: MoE at Q8_0 is now usable. In v1, MoE Q8_0 scored 30/50. In v2, it scores 43/50. The skill improvements had a bigger impact on MoE than on Dense.

Quantization Sensitivity Comparison

| Quant Level | v1 Dense Avg | v2 Dense Avg | v1 MoE Avg | v2 MoE Avg |
|---|---|---|---|---|
| Q8K_XL | 43 | 48 | 34 | 44 |
| Q8_0 | 41 | 44 | 30 | 43 |
| Q4K_XL | 35 | 39 | 31 | 22 |
| Q4K_M | 29 | 37 | 22 | 38 |

The v2 skill improved Dense by +5 points on average and MoE by +7.5 points on average. The biggest gains: Q4K_M Dense (+8) and Q4K_M MoE (+16).

The v2 skill improvements helped lower quantizations more than higher ones: Q4K_M MoE improved by 16 points, from unusable (22) to adequate (38). The one exception is Q4K_XL MoE, which regressed from 31 to 22.
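The per-quant deltas can be recomputed directly from the sensitivity table; a quick sketch:

```python
# Recompute per-quant v1 -> v2 deltas from the sensitivity table above.
V1 = {"dense": {"Q8K_XL": 43, "Q8_0": 41, "Q4K_XL": 35, "Q4K_M": 29},
      "moe":   {"Q8K_XL": 34, "Q8_0": 30, "Q4K_XL": 31, "Q4K_M": 22}}
V2 = {"dense": {"Q8K_XL": 48, "Q8_0": 44, "Q4K_XL": 39, "Q4K_M": 37},
      "moe":   {"Q8K_XL": 44, "Q8_0": 43, "Q4K_XL": 22, "Q4K_M": 38}}

deltas = {arch: {q: V2[arch][q] - V1[arch][q] for q in V1[arch]} for arch in V1}
# Q4K_M gains the most (+8 Dense, +16 MoE); Q4K_XL MoE is the lone regression (-9).
```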


v1 vs v2: What the Skill Changes Fixed

The v2 skill addressed 6 of the 8 original failure modes:

| v1 Failure Mode | v2 Fix | Result |
|---|---|---|
| Vivid language lost during Q&A | “Use EXACT words” enforced from Phase 1 | Q4K_M improved 8 pts |
| No guard against hallucination | “Every item must trace to session” | Doc accuracy improved |
| Contradiction detection passive | Contradictions surfaced as questions | Better edge case handling |
| Round themes bleed | Phase-based with “FORBIDDEN” rules | Clean phase transitions |
| “I don’t know” accepted | MANDATORY pushback language | +3-5 pts edge handling |
| Stop detection fragile | Minimum completion check | Prevents incomplete sessions |

What still fails at Q4: The “I don’t know” pushback still requires Q8_0 or higher to work reliably. Q4K_XL MoE still produces incomplete sessions.


What I Think About the Results

The v2 skill delivers what v1 promised: usable ideation sessions at lower quantizations. The Q4K_M improvement (29 → 37 for Dense, 22 → 38 for MoE) is the clearest win. It means I can run the skill on hardware-constrained devices without sacrificing quality.

MoE at Q8_0 is now viable. In v1, I wrote off MoE entirely. In v2, MoE Q8_0 scores 43/50, only 1 point behind Dense Q8_0. If I need the speed advantage (21 min vs 30 min for comparable quality), I can use MoE at Q8_0.

The insight quality from MoE Q8K_XL is worth the compute cost. “Speaking is more natural than typing” is a better capture than Dense’s “hold a leash”: it tells me the root cause, not just the use case. For early-stage ideation, that insight quality matters.

The 48/50 score from Dense Q8K_XL is the ceiling for this task. I’ve pushed the skill as far as it can go with the current structure. Further improvements need either:

  1. Better reference files (more examples, more templates)
  2. Multi-model orchestration (use Q8K_XL for insight, Q4K_XL for speed)
  3. Different evaluation criteria (maybe ideation doesn’t need perfect verbatim capture)
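Option 2 above could be sketched as a phase router. The model names mirror the tests; the phase split and routing policy are my assumptions, not a tested design.

```python
# Hypothetical sketch of multi-model orchestration: route insight-heavy
# phases to the strongest quant, and fast mechanical phases to a smaller one.
# The phase split is an assumption; model names mirror the comparison above.
INSIGHT_PHASES = {2, 3}  # assumption stack and stress questions need depth

def pick_model(phase: int) -> str:
    """Choose a model variant for a given phase number (0-5)."""
    return "Q8K_XL" if phase in INSIGHT_PHASES else "Q4K_XL"
```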

Key Points

  1. Dense Q8K_XL remains best: 48/50, most assumptions (14), best verbatim capture
  2. v2 skill improves all models: average +5 pts for Dense, +7.5 pts for MoE
  3. Q4K_M is now usable: v1 was 29/50, v2 is 37/50, an 8-point gain from skill alone
  4. MoE at Q8_0 works: v1 was 30/50, v2 is 43/50, a 13-point gain; now viable
  5. MoE still fails at Q4K_XL: 22/50, incomplete session, same failure as v1
  6. Quantization gap narrowed: v1 had 14-pt gap Q4 to Q8, v2 has 11-pt gap
  7. Best insight from MoE: “speaking is more natural than typing” beats Dense’s use case details
  8. Skill structure matters more than model size: the phase-based approach with explicit constraints outperforms round-based at equivalent model quality

References