- Expert solo writing skips obvious steps
- Dual dialogue pulls out hidden reasoning
- Blind judges favored reasoning naturalness
- One hour can yield 10–20 usable samples
If you need better training data for AI, this paper argues that the best source may be a conversation, not a lone expert. The problem is that experts often skip steps they think are obvious, leaving holes in chain-of-thought (step-by-step reasoning) data. The BC Protocol pairs a domain expert with a knowledge engineer, using structured dialogue to pull out the hidden judgments behind the answer. In a controlled narrative-fiction test with 20 samples from the dual-dialogue group and 20 from the same expert writing alone, three judge models (GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro) gave the dialogue outputs a much higher score for naturalness of reasoning: 4.80 versus 1.30, with p = 2.4 × 10⁻⁸ and Cliff’s δ = 1.0. The dialogue method also showed promising but not yet significant gains in counterfactual density and reasoning-chain completeness, while solo writing had higher information density. The authors say one hour of voice dialogue can produce 10–20 usable CoT samples, and they frame this as a practical pipeline for vertical-domain LLM alignment.
A one-hour voice session can yield 10 to 20 usable reasoning samples. That is not a tiny side note; it changes where the work happens. Instead of asking one expert to sit alone and write every step, BC Protocol puts that expert in a guided dialogue with a knowledge engineer who keeps asking why, what next, and what if. The surprise is that the second person does not add noise. The engineer's questions pull out the steps the expert would normally skip because they feel too obvious to say aloud. For anyone training a language model, that is the difference between a neat answer and a usable trail of thought.
Why a lone expert leaves gaps
The sharpest result comes from a direct face-off: 20 samples from dual dialogue and 20 from the same domain expert writing alone, all inside narrative fiction. Three judge models (GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro) scored the outputs blind across five dimensions, for 600 ratings in all. BC Protocol won most clearly on naturalness of reasoning process, where it scored 4.80 against 1.30 for solo writing, with p = 2.4 × 10⁻⁸ and Cliff's δ = 1.0. The dialogue samples also trended higher on counterfactual density and reasoning-chain completeness, though those gains did not reach the paper's significance cutoff at this sample size. Solo writing did win on information density, which fits the contrast: it compresses conclusions, while dialogue preserves the live path that got there.
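To make the reported numbers easier to read, here is a minimal sketch of how the rating count and Cliff's δ work. The scores below are made-up illustrative values, not the paper's data, and the function name `cliffs_delta` is our own. The formula is the standard definition: the share of cross-group pairs where the first group scores higher, minus the share where it scores lower.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Rating count from the study design described above:
# 20 samples per group, 2 groups, 3 judge models, 5 dimensions.
samples_per_group = 20
groups = 2
judges = 3
dimensions = 5
total_ratings = samples_per_group * groups * judges * dimensions

# Hypothetical scores (NOT the paper's data) with no overlap between groups.
dialogue_scores = [5, 5, 4, 5, 5]
solo_scores = [1, 2, 1, 1, 2]
delta_no_overlap = cliffs_delta(dialogue_scores, solo_scores)

# A second hypothetical pair with one tie, to show a value below 1.
delta_with_tie = cliffs_delta([3, 4], [3, 2])

print(total_ratings, delta_no_overlap, delta_with_tie)
```

A δ of 1.0 is the maximum possible value: it means every dialogue score in the comparison was higher than every solo score, with no overlap at all.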
How BC Protocol pulls thought into words
BC Protocol works by pairing two kinds of strength. The domain expert brings crystallized intelligence, the kind built from hard-earned domain know-how. The knowledge engineer brings fluid intelligence, the skill of asking fresh questions and following weak spots in the answer. Together, they turn hidden judgments into plain-language reasoning chains. The Participant Aptitude Model adds another layer, with six participant characteristics that shape whether elicitation works well. Calibrated ignorance keeps the questioner curious but not clueless, so the dialogue can probe gaps without derailing the expert. Selection over prescription follows from that: in this kind of task, who sits at the table may matter more than how many rules the table has.
Naturalness of reasoning process: 4.80 for dual dialogue, 1.30 for solo expert writing
- Crowdsourced annotation lacks deep reasoning paths.
- Expert solo writing skips steps it treats as obvious.
- RLHF gives preference signals, not reasoning chains.
“One hour of voice dialogue can produce 10–20 CoT samples directly usable for post-training.”
“BC dialogue outputs are live reasoning that preserves trial-and-error nodes.”
The six dimensions in the Participant Aptitude Model point to a practical truth: not every expert and not every interviewer will unlock the same depth of reasoning. That is why BC Protocol favors selection over prescription. If implicit knowledge sits at the center of the task, spending more on process design may return less than choosing the right people first. The dialogue format also supports counterfactual probing, where the knowledge engineer keeps testing what would happen if a detail changed. In this setup, that is the highest-intensity elicitation move, and it helps surface branches the expert might never volunteer in a solo draft.
Why this matters for training data
High-quality expert chain-of-thought is one of the bottlenecks in post-training, and BC Protocol turns that bottleneck into a live conversation. Instead of hoping a lone expert will remember every small step, the method externalizes those steps while the thinking is still happening. That matters because the output is not just more text; it is text shaped like the path a model should learn. The paper says one hour of voice dialogue can produce 10 to 20 usable samples, which makes the process feel less like a rare craft session and more like a repeatable pipeline. For vertical-domain alignment, that is the real prize: reasoning data captured as the expert thinks, not reconstructed afterward.
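As a rough planning aid, the 10 to 20 samples-per-hour figure can be turned into a session-hours estimate. This is a back-of-the-envelope sketch, not something the paper provides: the target of 1,000 samples and the helper name `hours_needed` are assumptions for illustration, and the estimate ignores preparation, transcription, and review time.

```python
import math


def hours_needed(target_samples, samples_per_hour):
    """Dialogue hours required to reach a target, rounded up to whole hours."""
    return math.ceil(target_samples / samples_per_hour)


# Yield range reported in the article: 10 to 20 usable samples per hour.
LOW_YIELD = 10
HIGH_YIELD = 20

# Hypothetical dataset target (an assumption, not a figure from the paper).
target = 1000

worst_case_hours = hours_needed(target, LOW_YIELD)
best_case_hours = hours_needed(target, HIGH_YIELD)

print(f"{best_case_hours} to {worst_case_hours} hours of dialogue")
```

Under these assumptions, a 1,000-sample set would take roughly 50 to 100 hours of recorded dialogue.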
The next test is scale, not style
A one-hour dialogue that yields 10 to 20 usable chain-of-thought samples is the paper's most practical promise. It means the scarce resource is no longer a blank page, but the right pairing and the right questions. The current controlled test sits in narrative fiction, so the clean next step is to see whether the same advantage survives in other vertical-domain settings. If it does, BC Protocol would make expert reasoning easier to collect in real time, which is more useful than asking experts to reconstruct their own thought after the fact.
