New Scorer Detects Unfaithful Chain-of-Thought by Comparing Internal and External Traces

Key takeaways

Hidden reasoning can drift from the words on screen
CIE-Scorer checks internal and external traces together
It traces compact sentence-level circuits, not full long chains
Fused Gromov–Wasserstein measures the mismatch
Tests on four FaithCoT-Bench datasets reached state-of-the-art performance

If you rely on an AI’s step-by-step explanation, you may be reading a story the model invented after the fact. That is the problem this paper targets: chain-of-thought, or CoT, can look persuasive without matching the model’s real decision process. The authors introduce CIE-Scorer, a framework for instance-level CoT unfaithfulness detection. It traces compact sentence-level circuits from informative reasoning tokens, builds internal and external reasoning graphs, and compares them with Fused Gromov–Wasserstein distance. On four datasets from FaithCoT-Bench, CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction. The point is not just to catch bad explanations, but to combine mechanistic interpretability signals with the text the model produces, so faithfulness can be checked more directly and more efficiently.

A model can write a neat set of steps and still reach its answer by a different route. That split matters because the steps on the screen often feel like proof, especially when the answer sounds careful and orderly. CIE-Scorer tackles that problem by asking a harder question: do the words and the hidden computation move together, or do they drift apart? The framework treats that drift as a clue that a chain of thought may be unfaithful, which means the explanation does not really match how the model decided.

When the answer story and the hidden path split

Most detectors for chain-of-thought unfaithfulness watch the outside of the answer. They look at how plausible the rationale sounds, or whether the final answer stays consistent. CIE-Scorer adds a second lens: the model’s own internal computation. It compares the reasoning text with internal evidence gathered from circuits, then scores the gap between them. On four datasets from FaithCoT-Bench, that gap-based view reaches state-of-the-art performance while cutting the cost of circuit construction. The point is not that every mismatch proves deception. The point is that a faithful trace should fit the path inside, and an unfaithful one may not.

How CIE-Scorer checks the mismatch

CIE-Scorer keeps the search small by starting with informative reasoning tokens instead of tracing every word in a long chain. From those tokens, it builds compact sentence-level circuits, which lets it follow the parts of the trace that matter most. It then builds two graphs: one from the internal route the model seems to take, and one from the external reasoning the model prints. The final score comes from Fused Gromov–Wasserstein distance, which compares the two graphs on both what their nodes contain and how those nodes are connected, so it captures differences in structure and not just in labels. The larger the distance, the more visible the mismatch.

It starts with informative reasoning tokens and traces compact sentence-level circuits.
It builds both an internal reasoning graph and an external reasoning graph.
It measures their gap with Fused Gromov–Wasserstein distance.
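
To make the graph comparison concrete, here is a minimal toy sketch of a Fused Gromov–Wasserstein style gap between two small graphs. It is not the authors' implementation: the graphs, the mismatch matrix, and the function names (`fgw_cost`, `fgw_upper_bound`) are all hypothetical. It also assumes equal-size graphs with uniform node weights, and it searches only one-to-one node matchings, so the result is an upper bound on the true distance and not the distance itself. A practical version would likely hand the minimization to an optimal transport solver such as the one in the POT library.

```python
from itertools import permutations


def fgw_cost(feature_cost, struct_a, struct_b, coupling, alpha=0.5):
    """Fused Gromov-Wasserstein objective for one fixed coupling.

    feature_cost[i][j]: dissimilarity between node i of graph A and node j of B.
    struct_a, struct_b: pairwise structure matrices (here, adjacency) per graph.
    coupling[i][j]: mass moved from node i of A to node j of B.
    alpha: weight on structure; (1 - alpha) is the weight on node features.
    """
    n, m = len(struct_a), len(struct_b)
    pairs = [(i, j) for i in range(n) for j in range(m) if coupling[i][j] > 0]
    # Feature term: how different the matched nodes are.
    feature_term = sum(coupling[i][j] * feature_cost[i][j] for i, j in pairs)
    # Structure term: squared difference in how matched node pairs are connected.
    structure_term = sum(
        (struct_a[i][k] - struct_b[j][l]) ** 2 * coupling[i][j] * coupling[k][l]
        for i, j in pairs
        for k, l in pairs
    )
    return (1 - alpha) * feature_term + alpha * structure_term


def fgw_upper_bound(feature_cost, struct_a, struct_b, alpha=0.5):
    """Best objective over one-to-one node matchings (equal-size graphs only)."""
    n = len(struct_a)
    if len(struct_b) != n:
        raise ValueError("this toy search assumes graphs of equal size")
    best = float("inf")
    for perm in permutations(range(n)):
        # Each matching is a valid coupling with uniform mass 1/n per node.
        coupling = [
            [1.0 / n if perm[i] == j else 0.0 for j in range(n)] for i in range(n)
        ]
        best = min(best, fgw_cost(feature_cost, struct_a, struct_b, coupling, alpha))
    return best


# Hypothetical toy data: three reasoning steps per graph.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]          # step 1 - step 2 - step 3
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]       # every step linked to the others
step_mismatch = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # 0 when step i matches step j

aligned_gap = fgw_upper_bound(step_mismatch, chain, chain)
divergent_gap = fgw_upper_bound(step_mismatch, chain, triangle)
print(aligned_gap, divergent_gap)
```

In this toy setup, two graphs with the same steps and the same wiring give a gap of zero, while the same steps wired differently give a positive gap. That is the sense in which the score reacts to structure and not only to labels.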

“The key idea is that faithful reasoning traces should align with the model’s computational process, whereas unfaithful traces may diverge from it.”

From the abstract

“Apparent transparency can be misleading.”

Why the gap matters for detection

That design solves a practical bottleneck. Full circuit tracing for long chain-of-thought answers is expensive and does not scale well, so earlier internal checks run into a wall as traces get longer. CIE-Scorer lowers that burden by tracing only compact sentence-level circuits from informative reasoning tokens, which keeps instance-level detection, scoring one answer at a time, affordable. That matters because a fluent rationale can still lead you astray. When the internal route and the external story diverge, the model may be giving you a performance rather than an explanation.

What this leaves for the next test

The clearest next check is whether the same gap score keeps its grip as chain-of-thought answers get longer and messier, because the method’s promise rests on making internal tracing cheaper. If it does, model review could shift from trusting the surface story to flagging one suspicious answer at a time. That is the real consequence of the paper’s surprise: the safest clue is not the polished reasoning itself, but the distance between that text and the path the model seems to walk inside.
