I tested the improved ideation-coach skill (v2) with 8 model variants. The phase-based structure improved average scores by +5 points for Dense and +7.5 for MoE, making Q4 quantization usable for ideation sessions.

I ran the improved ideation-coach skill (v2) through 8 model variants to see if the skill redesign fixed the 18 failure modes from v1.

Why Test v2

The v1 ideation skill had 18 identified failure modes. The main problems were:

  • Vivid language sanitized during Q&A, not just at write time
  • Contradiction detection too passive
  • Round themes bleeding across rounds
  • “I don’t know” accepted as final
  • Stop detection too fragile

I redesigned the skill around a new structure: instead of rounds, it uses phases with explicit constraints. Phase 1 gets the “ugly sentence,” Phase 2 builds an assumption stack, Phase 3 runs stress questions, Phase 4 captures verbatim quotes, and Phase 5 wraps up.

The key difference: v2 tells the model what NOT to do in each phase, not just what to do.
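The “what NOT to do” framing can be made concrete as data. This is a hypothetical sketch, not the actual skill file: the phase names come from the skill, but the dict layout, the specific forbidden items, and the `phase_prompt` helper are my assumptions.

```python
# Hypothetical sketch of per-phase constraints as data.
# Phase names come from the v2 skill; forbidden items are illustrative assumptions.
PHASES = {
    1: {"goal": "Get the complete 'ugly sentence' frame",
        "forbidden": ["proposing solutions", "sanitizing the user's wording"]},
    2: {"goal": "Build an assumption stack of testable assumptions",
        "forbidden": ["revisiting the ugly sentence", "jumping to stress questions"]},
    3: {"goal": "Run all 4 stress lenses",
        "forbidden": ["accepting 'I don't know' as final"]},
    4: {"goal": "Capture verbatim user quotes",
        "forbidden": ["paraphrasing or summarizing user language"]},
    5: {"goal": "Write the ideation doc and hand off",
        "forbidden": ["adding items that don't trace to the session"]},
}

def phase_prompt(n: int) -> str:
    """Render one phase as a prompt fragment with explicit negatives."""
    p = PHASES[n]
    nots = "\n".join(f"- FORBIDDEN in this phase: {f}" for f in p["forbidden"])
    return f"Phase {n}: {p['goal']}\n{nots}"
```

Rendering each phase with its own FORBIDDEN list is what keeps round-theme bleeding from recurring: the negative constraints travel with the phase instead of living in one global instruction.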


Test Setup

Same hardware, same task, same evaluation criteria as v1:

| Parameter | Value |
|---|---|
| Context Length | 262144 |
| GPU Offload | Full |
| CPU Thread Pool Size | 16 |
| Evaluation Batch Size | 8192 |
| Thinking | Enabled |
| Temperature | 0.6 |

Task: Ideation session for “Add audio notes using Cactus SDK” in Aily app.

Evaluated by MiniMax M2.5 Free in OpenCode Zen.
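For reproducibility, the run settings above can be captured in one place. A minimal sketch; the key names are my assumptions, only the values mirror the setup table.

```python
# Hedged sketch: the run parameters from the setup table as a config dict.
# Key names are assumptions; values mirror the table above.
RUN_CONFIG = {
    "context_length": 262144,   # tokens
    "gpu_offload": "full",
    "cpu_threads": 16,
    "eval_batch_size": 8192,
    "thinking": True,
    "temperature": 0.6,
}

# Basic sanity check before a run: all parameters present and typed.
assert all(v is not None for v in RUN_CONFIG.values())
```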


The v2 Skill Structure

The ideation-coach v2 skill lives in Skill/ideation-coach/ with these phases:

  1. Phase 0 Scan: read workspace context (README.md, docs/)
  2. Phase 1 Ugly Sentence: get complete “for X so they can Y instead of Z” frame
  3. Phase 2 Assumption Stack: surface 5-7 testable assumptions with evidence status (✓ confirmed, ⚠ untested)
  4. Phase 3 Stress Questions: run all 4 lenses: Reversal, Worst Customer, Second-order, Kill shot
  5. Phase 4 Verbatim Capture: preserve user’s exact phrases
  6. Phase 5 Wrap-up: write ideation doc, handoff to create-prd

Key v2 Rules Added

  • Turn rules: One question per turn, quote user first, mark assumptions when surfaced, stay short
  • “I don’t know” handling: Never accept as final. Respond with “What’s your best guess?” or “What would need to be true?”
  • Vivid language preservation: Use EXACT user words, never sanitize
  • Stress question enforcement: All 4 lenses must be explored before wrap-up
  • Minimum completion check: Verify Phase 1, 5+ assumptions, all 4 stress questions before writing doc
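The minimum completion check is the rule that fixed fragile stop detection. A sketch of how it could gate wrap-up, assuming a simple session dict; the function and field names are my assumptions, the thresholds (Phase 1 done, 5+ assumptions, all 4 lenses) come from the rule above.

```python
# Sketch of the v2 "minimum completion check": verify Phase 1 output,
# 5+ assumptions, and all 4 stress lenses before Phase 5 writes the doc.
# Function and field names are assumptions; thresholds come from the skill rules.
STRESS_LENSES = {"reversal", "worst_customer", "second_order", "kill_shot"}

def minimum_completion_check(session: dict) -> list[str]:
    """Return blocking problems; an empty list means wrap-up may proceed."""
    problems = []
    if not session.get("ugly_sentence"):
        problems.append("Phase 1 incomplete: no ugly sentence captured")
    if len(session.get("assumptions", [])) < 5:
        problems.append("Fewer than 5 assumptions surfaced")
    missing = STRESS_LENSES - set(session.get("stress_lenses_done", []))
    if missing:
        problems.append(f"Stress lenses not explored: {sorted(missing)}")
    return problems
```

The point of returning the problem list (rather than a bare boolean) is that the coach can turn each unmet requirement back into a question instead of silently ending the session.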

What Changed from v1

| v1 (Ideation Skill) | v2 (Ideation Coach) |
|---|---|
| Round-based (Problem → Users → Constraints) | Phase-based with explicit constraints |
| “Name contradictions” (passive) | Contradictions surfaced as questions |
| “Do not sanitise vivid language” (Step 5 only) | “Use EXACT words” (enforced from Phase 1) |
| “I don’t know” handling missing | MANDATORY pushback language |
| Round theme bleeding common | “FORBIDDEN in this phase” per phase |
| No success metrics probing | Explicit metrics questions added |

Qwen 3.5 27B Results (Dense)

Full evaluation: Qwen 3.5 27B Quantization Comparison.

| Rank | Model | Score | Session Time | Key Strength |
|---|---|---|---|---|
| 1 | Q8K_XL | 48/50 | 70 min | Most assumptions (14), best verbatim capture |
| 2 | Q8_0 | 44/50 | 30 min | Thorough assumption stack (8), vivid language |
| 3 | Q4K_XL | 39/50 | 17 min | Good assumption probing, clean sessions |
| 4 | Q4K_M | 37/50 | 20 min | Clean phase progression |

Q8K_XL won again. It produced 14 assumptions, the most detailed stack of any model. It pushed back on “I haven’t tested” multiple times, caught the privacy/offline contradiction (the user wanted on-device but mentioned cloud), and preserved “hold a leash and I can’t type fast with single hand.”

Q8_0 was the efficiency winner. At 30 minutes, it scored 44/50, nearly as good as Q8K_XL in less than half the time. The verbatim capture held up too: “walking with my dog” and “precious ideas” preserved.

Q4 variants improved. The v1 Q4K_M scored 29/50. The v2 Q4K_M scored 37/50, an 8-point improvement from better skill instructions alone. The gap between Q4 and Q8 narrowed.

v1 Q4K_M: 29/50 → v2 Q4K_M: 37/50. The skill improvements added 8 points without model changes.


Qwen 3.5 35B A3B Results (MoE)

Full evaluation: Qwen 3.5 35B A3B Quantization Comparison.

| Rank | Model | Score | Session Time | Key Strength |
|---|---|---|---|---|
| 1 | Q8K_XL | 44/50 | 21 min | Best insight capture, clean assumption stack |
| 2 | Q8_0 | 43/50 | 20 min | Best real-time transcription probing |
| 3 | Q4K_M | 38/50 | 21 min | Good assumption count (14), proper phases |
| 4 | Q4K_XL | 22/50 | 11 min | Session ended before stress questions |

Q8K_XL captured the best insight: “speaking is more natural than typing.” This frames audio notes as a primary preference, not just an accessibility feature. It also identified specific failure modes in stress questions (slow transcripts, shy users, low quality).

Q4K_XL still fails. At 22/50, it produced an incomplete session that ended before the stress questions and left the contradictions section empty. This is the same failure mode as v1. The MoE architecture at Q4 quantization cannot handle structured phase transitions.

Q8_0 is viable on MoE. At 43/50, the Q8_0 MoE variant is close to Q8K_XL. Unlike v1, the v2 skill produces acceptable MoE results at Q8_0.

MoE at Q8_0 works now. v1 MoE Q8_0 scored 30/50; v2 scores 43/50, a 13-point improvement.


Dense vs MoE: v2 Results

Full comparison: Dense vs MoE Comparison.

| Criterion | 27B Dense Avg | 35B MoE Avg | Delta | Winner |
|---|---|---|---|---|
| Skill Flow Adherence | 4.50 | 4.00 | +0.50 | Dense |
| Ugly Sentence Quality | 4.75 | 4.25 | +0.50 | Dense |
| Assumption Stack | 4.50 | 4.00 | +0.50 | Dense |
| Stress Question Coverage | 4.75 | 4.00 | +0.75 | Dense |
| Turn Rule Compliance | 4.00 | 3.50 | +0.50 | Dense |
| Edge Case Handling | 4.25 | 3.50 | +0.75 | Dense |
| Verbatim Capture | 4.50 | 4.50 | 0 | Tie |
| Doc Completeness | 4.50 | 4.00 | +0.50 | Dense |
| Doc Accuracy | 4.50 | 4.25 | +0.25 | Dense |
| Session Pacing | 4.25 | 3.75 | +0.50 | Dense |
| Average Total | 42.00 | 38.75 | +3.25 | Dense |

Dense still wins, but the gap narrowed from v1 (+4.75 → +3.25). The v2 skill structure helps MoE models perform closer to their potential.

Key finding: MoE at Q8_0 is now usable. In v1, MoE Q8_0 scored 30/50. In v2, it scores 43/50. The skill improvements had a bigger impact on MoE than on Dense.

Quantization Sensitivity Comparison

| Quant Level | v1 Dense Avg | v2 Dense Avg | v1 MoE Avg | v2 MoE Avg |
|---|---|---|---|---|
| Q8K_XL | 43 | 48 | 34 | 44 |
| Q8_0 | 41 | 44 | 30 | 43 |
| Q4K_XL | 35 | 39 | 31 | 22 |
| Q4K_M | 29 | 37 | 22 | 38 |

The v2 skill improved Dense by +5 points on average and MoE by +7.5 points on average. The biggest gains: Q4K_M Dense (+8) and Q4K_M MoE (+16).

The v2 skill improvements helped lower quantizations more than higher ones: Q4K_M MoE improved by 16 points, from unusable (22) to adequate (38). The one exception is Q4K_XL MoE, which regressed from 31 to 22.
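The per-quant deltas can be recomputed directly from the sensitivity table; a quick sketch:

```python
# Recompute per-quant v1 -> v2 deltas from the sensitivity table above.
V1 = {"dense": {"Q8K_XL": 43, "Q8_0": 41, "Q4K_XL": 35, "Q4K_M": 29},
      "moe":   {"Q8K_XL": 34, "Q8_0": 30, "Q4K_XL": 31, "Q4K_M": 22}}
V2 = {"dense": {"Q8K_XL": 48, "Q8_0": 44, "Q4K_XL": 39, "Q4K_M": 37},
      "moe":   {"Q8K_XL": 44, "Q8_0": 43, "Q4K_XL": 22, "Q4K_M": 38}}

deltas = {arch: {q: V2[arch][q] - V1[arch][q] for q in V1[arch]} for arch in V1}
# Q4K_M gains the most (+8 Dense, +16 MoE); Q4K_XL MoE is the lone regression (-9).
```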


v1 vs v2: What the Skill Changes Fixed

The v2 skill addressed 6 of the 8 original failure modes:

| v1 Failure Mode | v2 Fix | Result |
|---|---|---|
| Vivid language lost during Q&A | “Use EXACT words” enforced from Phase 1 | Q4K_M improved 8 pts |
| No guard against hallucination | “Every item must trace to session” | Doc accuracy improved |
| Contradiction detection passive | Contradictions surfaced as questions | Better edge case handling |
| Round themes bleed | Phase-based with “FORBIDDEN” rules | Clean phase transitions |
| “I don’t know” accepted | MANDATORY pushback language | +3-5 pts edge handling |
| Stop detection fragile | Minimum completion check | Prevents incomplete sessions |

What still fails at Q4: The “I don’t know” pushback still requires Q8_0 or higher to work reliably. Q4K_XL MoE still produces incomplete sessions.


What I Think About the Results

The v2 skill delivers what v1 promised: usable ideation sessions at lower quantizations. The Q4K_M improvement (29 → 37 for Dense, 22 → 38 for MoE) is the clearest win. It means I can run the skill on hardware-constrained devices without sacrificing quality.

MoE at Q8_0 is now viable. In v1, I wrote off MoE entirely. In v2, MoE Q8_0 scores 43/50, only 1 point behind Dense Q8_0. If I need the speed advantage (21 min vs 30 min for comparable quality), I can use MoE at Q8_0.

The insight quality from MoE Q8K_XL is worth the compute cost. “Speaking is more natural than typing” is a better capture than Dense’s “hold a leash”: it tells me the root cause, not just the use case. For early-stage ideation, that insight quality matters.

The 48/50 score from Dense Q8K_XL is the ceiling for this task. I’ve pushed the skill as far as it can go with the current structure. Further improvements need either:

  1. Better reference files (more examples, more templates)
  2. Multi-model orchestration (use Q8K_XL for insight, Q4K_XL for speed)
  3. Different evaluation criteria (maybe ideation doesn’t need perfect verbatim capture)
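Option 2 above could be sketched as a phase router. The model names mirror the tests; the phase split and routing policy are my assumptions, not a tested design.

```python
# Hypothetical sketch of multi-model orchestration: route insight-heavy
# phases to the strongest quant, and fast mechanical phases to a smaller one.
# The phase split is an assumption; model names mirror the comparison above.
INSIGHT_PHASES = {2, 3}  # assumption stack and stress questions need depth

def pick_model(phase: int) -> str:
    """Choose a model variant for a given phase number (0-5)."""
    return "Q8K_XL" if phase in INSIGHT_PHASES else "Q4K_XL"
```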

Key Points

  1. Dense Q8K_XL remains best: 48/50, most assumptions (14), best verbatim capture
  2. v2 skill improves all models: average +5 pts for Dense, +7.5 pts for MoE
  3. Q4K_M is now usable: v1 was 29/50, v2 is 37/50, an 8-point gain from skill alone
  4. MoE at Q8_0 works: v1 was 30/50, v2 is 43/50, a 13-point gain; now viable
  5. MoE still fails at Q4K_XL: 22/50, incomplete session, same failure as v1
  6. Quantization gap narrowed: v1 had 14-pt gap Q4 to Q8, v2 has 11-pt gap
  7. Best insight from MoE: “speaking is more natural than typing” beats Dense’s use case details
  8. Skill structure matters more than model size: the phase-based approach with explicit constraints outperforms round-based at equivalent model quality

References