- Résumé keywords failed to predict who produced
- Behavioral scores strengthened the full data picture
- Faster ramp-up translated into daily economic value
- Insurance experience alone would have blocked strong hires
A résumé keyword can look impressive and still fail in the real world. At one Fortune 500 insurance carrier, records for 10,765 agents hired from 2022 to 2025 were linked across three separate systems: an applicant tracking system (ATS, which stores candidate profiles), a human resource information system (HRIS, which stores performance outcomes), and a behavioral assessment. The result was stark. Of the 8,181 unique skills pulled from ATS profiles, 3,597 were testable, and not a single keyword predicted production after correction for multiple testing. Thirty keywords were actually anti-predictive, and the median keyword was tied to 25% lower odds of production. Requiring insurance experience alone would have rejected 2,863 agents who later produced $17.7 million in annual premium credit. By contrast, the behavioral assessment reached AUC 0.647 on its own and 0.735 when combined with the other data. The paper also found that speed-to-production followed a measurable economic constant of $54 per day per agent, or $35 per day after controlling for source channel and tenure.
A hiring manager can leave with a gut feeling that never makes it into any system. That is the quiet problem behind this Fortune 500 insurance carrier’s hiring records: one platform held candidate profiles, another held later job outcomes, and a third held behavioral scores, but none of them told the whole story on its own. Once those threads were linked across 10,765 agents hired from 2022 to 2025, the old shortcuts started to look shaky. The sharpest surprise is simple: the résumé keywords people screen for most confidently did not predict production at all, even though the joined-up data could still recover useful signals about who would ramp quickly and who would not.
Why the résumé keywords fell flat
Out of 8,181 unique skills parsed from ATS profiles, only 3,597 were testable, and after the usual correction for checking many keywords at once, not a single keyword predicted production. Thirty keywords were actually anti-predictive, which means they pointed in the wrong direction rather than the right one. The median keyword was tied to 25% lower odds of production. Even a familiar filter such as insurance experience could mislead badly: requiring it alone would have rejected 2,863 agents who later produced $17.7 million in annual premium credit. That is what makes the finding so stark. The clues looked useful inside the résumé system, but once the hires met real performance data, the pattern broke.
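The article does not name the correction that was used, so the following is only a minimal sketch of one common choice, the Benjamini-Hochberg false discovery rate procedure. The function name, the keyword names, and the p-values are all hypothetical. It shows how keywords that each look significant at 0.05 in isolation can all fail once the number of tests is taken into account.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it survives FDR control at alpha.

    Assumption: this is a generic Benjamini-Hochberg step-up procedure,
    not necessarily the correction used in the study.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Every hypothesis ranked at or below max_k is rejected.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            survives[idx] = True
    return survives


# Hypothetical p-values for five keywords (illustrative only).
keyword_p = {"licensed": 0.02, "crm": 0.03, "sales": 0.04,
             "excel": 0.20, "teamwork": 0.60}
names = list(keyword_p)
flags = benjamini_hochberg([keyword_p[n] for n in names])

naive_hits = [n for n in names if keyword_p[n] < 0.05]
survivors = [n for n, keep in zip(names, flags) if keep]
print("significant before correction:", naive_hits)
print("significant after correction:", survivors)
```

In this toy input three keywords pass an uncorrected 0.05 threshold, but none survives the correction, which mirrors the shape of the finding described above at a much smaller scale.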
How the missing pattern was recovered
The trick was not a new hiring score in isolation. It was the decision trace: a chain that linked screening inputs from the applicant tracking system, assessment signals from the behavioral test, and production outcomes from HR records. That fusion matters because each system answers a different question. The résumé file says what a candidate claimed or listed, the assessment captures stable behavioral tendencies, and the outcome system shows what happened after the hire. Only by connecting them could the hiring record stop behaving like three separate notebooks and start acting like one memory. In that combined view, the behavioral assessment reached an AUC (area under the ROC curve, where 0.5 is chance and 1.0 is perfect ranking) of 0.647 on its own, and the fused model combining ATS and behavioral scoring data reached 0.735.
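As a rough illustration of what linking the three systems involves, the sketch below joins toy ATS, assessment, and HRIS records on a shared candidate ID and then computes AUC from the joined rows. The `DecisionTrace` class, the field names, and all of the data are hypothetical; the study's actual schema and model are not described in the article.

```python
from dataclasses import dataclass


@dataclass
class DecisionTrace:
    candidate_id: str
    skills: list              # screening input, from the ATS
    behavioral_score: float   # assessment signal
    produced: bool            # production outcome, from the HRIS


def build_traces(ats, assessment, hris):
    """Keep only candidates that appear in all three systems."""
    shared = ats.keys() & assessment.keys() & hris.keys()
    return [DecisionTrace(cid, ats[cid], assessment[cid], hris[cid])
            for cid in sorted(shared)]


def auc(scores, labels):
    """Chance that a random producer outscores a random non-producer.

    Ties count as half a win. This is the pairwise definition of AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy records (illustrative only). Note a5 and a6 are each missing somewhere.
ats = {"a1": ["sales"], "a2": ["crm"], "a3": ["excel"],
       "a4": ["insurance"], "a5": ["retail"]}
assessment = {"a1": 0.9, "a2": 0.8, "a3": 0.7, "a4": 0.2}
hris = {"a1": True, "a2": False, "a3": True, "a4": False, "a6": True}

traces = build_traces(ats, assessment, hris)
score_auc = auc([t.behavioral_score for t in traces],
                [t.produced for t in traces])
print(len(traces), "linked traces, AUC =", score_auc)
```

The join step is the point of the sketch: a candidate who exists in only one or two systems cannot contribute to any measurement of whether a screening signal predicted the outcome.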
- The ATS supplied 8,181 unique skills, but only 3,597 were testable.
- The behavioral assessment alone reached AUC 0.647, then improved to 0.735 in fusion.
- Speed-to-production carried an economic value of $54 per day per agent, or $35 after controls.
- High-scored agents captured $114/day from speed acceleration, versus $41/day for low-scored agents.
“Decision traces are structured evidence chains connecting screening inputs, assessment signals, and production outcomes.”
“These findings were invisible within any single system.”
What speed-to-production was really worth
Speed mattered in money, not just in feel. The joined records showed a measurable economic constant of $54 per day per agent before controls, or $35 per day after accounting for source channel and tenure. But the size of that gain changed with behavioral score, which is where the data became more interesting than a simple average. High-scored agents captured $114 per day from faster production, while low-scored agents captured $41 per day. So the same jump in ramp-up did not pay out evenly. The behavior signal helped explain who could turn speed into value, which is exactly the kind of pattern that stays hidden when hiring, performance, and assessment live in separate silos.
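To make the per-day figures concrete, here is a small worked calculation. It assumes, as a simplification not stated in the article, that value scales linearly with the number of days saved. The 10-day acceleration is a hypothetical input; only the dollar-per-day rates come from the figures above.

```python
# Dollar-per-day rates reported in the article.
RATE_RAW = 54          # per agent, before controls
RATE_CONTROLLED = 35   # after controlling for source channel and tenure
RATE_HIGH_SCORE = 114  # high-scored agents
RATE_LOW_SCORE = 41    # low-scored agents


def ramp_value(days_saved, rate_per_day):
    """Dollar value of reaching production sooner, assuming linear scaling."""
    return days_saved * rate_per_day


DAYS_SAVED = 10  # hypothetical acceleration, for illustration only

raw = ramp_value(DAYS_SAVED, RATE_RAW)
controlled = ramp_value(DAYS_SAVED, RATE_CONTROLLED)
high = ramp_value(DAYS_SAVED, RATE_HIGH_SCORE)
low = ramp_value(DAYS_SAVED, RATE_LOW_SCORE)

print(f"average agent: ${raw} raw, ${controlled} after controls")
print(f"high-scored: ${high}, low-scored: ${low}, gap: ${high - low}")
```

Under these assumptions the same 10 days of acceleration is worth $1,140 for a high-scored agent and $410 for a low-scored one, which is the uneven payout the paragraph above describes.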
Why this changes hiring design
The practical lesson is not that one test should replace another. It is that a hiring system built around keywords alone can miss both good people and the logic behind past good decisions. Decision traces give enterprise hiring a way to keep institutional knowledge from evaporating when managers leave, because the evidence chain lives in data instead of in memory. That matters most where the cost of a bad filter is high and the outcome arrives later, after the person has already been hired. Here, the hidden cost was not abstract: it showed up as rejected agents, lost premium credit, and slower ramp-up that could finally be priced day by day.
What still needs to hold up next
The next test is whether the same trace-based pattern survives in another enterprise with different data systems, a different assessment tool, and a different kind of sales ramp. This deployment covered one Fortune 500 insurance carrier, so the real stress test is whether the chain still works when the process, not just the numbers, changes. If it does, decision traces could turn hiring from a memory problem into a record problem: less guesswork, fewer dead-end keywords, and a better shot at explaining why some people thrive after the hire while others do not. That is a narrow claim, but it is the one this study earns.
