
New Scorer Detects Unfaithful Chain-of-Thought by Comparing Internal and External Traces
If you rely on an AI’s step-by-step explanation, you may be reading a story the model invented after the fact. That is the problem this paper targets: chain-of-thought, or CoT, can look persuasive without matching the model’s real decision process. The authors introduce CIE-Scorer, a framework for instance-level detection of CoT unfaithfulness, meaning it judges each individual response rather than a model's behavior on average. It traces compact sentence-level circuits from informative reasoning tokens, builds an internal reasoning graph from those circuits and an external one from the written CoT, and compares the two with the Fused Gromov–Wasserstein distance. On four datasets from FaithCoT-Bench, CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction. The aim is not only to catch bad explanations but to pair mechanistic interpretability signals with the text the model produces, so that faithfulness can be checked more directly and more efficiently.
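
To make the graph comparison concrete, here is a minimal sketch of the Fused Gromov–Wasserstein objective in plain Python. It is not the authors' implementation. The toy adjacency matrices, the 0/1 node-mismatch cost, the `alpha` trade-off weight, and the function names `fgw_cost` and `best_permutation_cost` are all assumptions made for illustration. The brute-force search only tries one-to-one node matchings, so it gives an upper bound on the true distance, which is minimized over all soft couplings.

```python
from itertools import permutations


def fgw_cost(C1, C2, M, T, alpha=0.5):
    """Fused Gromov-Wasserstein objective for a fixed coupling T.

    C1, C2: structure matrices of the two graphs (here: adjacency).
    M:      M[i][j] is the feature cost of matching node i to node j.
    T:      coupling; T[i][j] is the mass sent from node i to node j.
    alpha:  0 uses node features only, 1 uses graph structure only.
    """
    n, m = len(C1), len(C2)
    # Feature term: how different are the matched nodes themselves?
    feature = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    # Structure term: do matched node pairs have matching edges?
    structure = sum(
        (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )
    return (1 - alpha) * feature + alpha * structure


def best_permutation_cost(C1, C2, M, alpha=0.5):
    """Upper bound on FGW for equal-size graphs with uniform node weights.

    Tries every one-to-one node matching (feasible only for tiny graphs).
    """
    n = len(C1)
    best = float("inf")
    for perm in permutations(range(n)):
        T = [[1.0 / n if perm[i] == j else 0.0 for j in range(n)]
             for i in range(n)]
        best = min(best, fgw_cost(C1, C2, M, T, alpha))
    return best


# Hypothetical toy graphs: a 3-step chain versus the same steps with an
# extra link between the first and last step.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# Assumed feature cost: 0 if the two nodes are "the same step", else 1.
mismatch = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

same = best_permutation_cost(chain, chain, mismatch)
different = best_permutation_cost(chain, triangle, mismatch)
print(same, different)
```

In this toy setup identical graphs score zero, and the score grows once the two graphs disagree on structure, which is the kind of gap a faithfulness scorer would look for between the internal and external traces. For realistic graph sizes an optimal transport solver would replace the brute-force loop.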


















