Why Amazon Scrapped Its AI Hiring Tool — And What It Reveals About Volume Recruiting
In October 2018, Reuters reported that Amazon had scrapped an internal AI recruiting system after finding that it systematically disadvantaged women. The story is not just about bias. It is about what happens when hiring volume becomes unmanageable and organisations reach for automation before they have solved the underlying problem.
The Background
Between 2014 and 2017, Amazon's machine learning team built what they hoped would be a revolutionary hiring tool — a system that could review resumes and score candidates on a scale of one to five stars, much like the company's product review system. The idea was straightforward: train the AI on ten years of historical hiring data and let it identify patterns that predicted successful Amazon employees.
The scale of Amazon's hiring at the time made this ambition understandable. The company was processing tens of thousands of applications across hundreds of roles simultaneously. Their recruiting team was overwhelmed. The volume problem was real, urgent, and growing faster than their headcount could accommodate.
By 2015, the team realised something was wrong. The model was not neutral. It had learned to penalise resumes that included the word "women's", as in "women's chess club captain". According to Reuters, it also downgraded graduates of two all-women's colleges. It had, in effect, learned that Amazon historically hired mostly men for technical roles and concluded that being male was a predictor of success.
Why This Happened
The bias was not accidental in the sense of being a coding error. It was a direct product of what the system was trained on. Amazon's historical hiring data reflected a decade of decisions made by humans who had their own biases — conscious and unconscious. The AI did not introduce bias; it industrialised it and applied it at scale.
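A toy sketch can make this mechanism concrete. It is not Amazon's model: the history below is synthetic and invented for illustration, and the scoring rule (the average historical hire rate of a resume's tokens) is a deliberately crude assumption. It shows only how a system fitted to skewed past decisions reproduces the skew.

```python
from collections import defaultdict

# Synthetic, invented history: (resume tokens, was_hired). Not real data.
HISTORY = [
    ({"python", "chess", "club"}, True),
    ({"java", "robotics"}, True),
    ({"python", "robotics"}, True),
    ({"java", "chess", "club"}, True),
    ({"python", "women's", "chess", "club"}, False),
    ({"java", "women's", "robotics"}, False),
    ({"python", "women's", "robotics"}, True),
    ({"java", "debate"}, False),
]

def token_hire_rates(history):
    """Return the historical hire rate for each token."""
    seen = defaultdict(int)
    hired = defaultdict(int)
    for tokens, was_hired in history:
        for token in tokens:
            seen[token] += 1
            hired[token] += int(was_hired)
    return {token: hired[token] / seen[token] for token in seen}

def score(tokens, rates, default=0.5):
    """Score a resume as the mean hire rate of its tokens."""
    return sum(rates.get(t, default) for t in tokens) / len(tokens)

rates = token_hire_rates(HISTORY)

# Two resumes that differ only by the token "women's".
base = score({"python", "chess", "club"}, rates)
flagged = score({"python", "women's", "chess", "club"}, rates)
print(f"without token: {base:.2f}, with token: {flagged:.2f}")
```

Nothing in the code mentions gender as a rule. The lower score for the second resume comes entirely from the outcomes recorded in the history.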
But the deeper issue is why the system existed in the first place. Amazon needed to automate because volume had made human review unsustainable. They were not trying to cut corners on quality — they were drowning in applications and needed a way to surface candidates faster. The automation was a symptom of a volume problem that had no good solution at the time.
When you receive 50,000 applications for 500 roles, you have two choices with traditional methods: hire enough reviewers to properly assess each one (expensive, slow, and still subject to human fatigue and bias), or automate the first filter (fast, scalable, but dangerously dependent on the quality of your training data and the assumptions baked into your model).
Amazon chose automation. The result was a system that made confident, fast, systematic errors — at scale.
The Attempts to Fix It
Amazon's engineers were not oblivious. Once the gender bias was identified, they attempted to correct the model — removing explicitly gendered terms from its consideration, retraining on adjusted data sets, adding manual override mechanisms. But as each fix was applied, new problems emerged. The system found proxy variables: certain university names, certain extracurricular activities, certain phrasing patterns that correlated with gender without explicitly mentioning it.
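The proxy effect can be sketched with the same kind of toy model. The data is synthetic, and "softball" is a hypothetical proxy chosen purely for illustration: in this invented history it appears only on resumes that also mention "women's". Blocking the explicit term removes it from the model, but the correlated token still carries the signal.

```python
from collections import defaultdict

# Synthetic, invented history: (resume tokens, was_hired). Not real data.
HISTORY = [
    ({"python", "rugby"}, True),
    ({"java", "rugby"}, True),
    ({"python", "chess"}, True),
    ({"java", "chess"}, False),
    ({"python", "women's", "softball"}, False),
    ({"java", "women's", "softball"}, False),
    ({"python", "women's", "softball"}, True),
    ({"java", "women's", "chess"}, False),
]

# The attempted fix: hide explicitly gendered terms from the model.
BLOCKED = frozenset({"women's"})

def hire_rates(history, blocked=frozenset()):
    """Return the hire rate per token, ignoring blocked tokens."""
    seen = defaultdict(int)
    hired = defaultdict(int)
    for tokens, was_hired in history:
        for token in tokens - blocked:
            seen[token] += 1
            hired[token] += int(was_hired)
    return {token: hired[token] / seen[token] for token in seen}

rates = hire_rates(HISTORY, blocked=BLOCKED)

# The blocked term is gone, but its correlate still scores poorly.
print("women's" in rates)
print(f"softball: {rates['softball']:.2f}, rugby: {rates['rugby']:.2f}")
```

In this sketch the blocked word never reaches the model, yet "softball" inherits its low hire rate because the two tokens share the same historical outcomes.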
Internal confidence in the system deteriorated. According to Reuters, recruiters looked at the tool's recommendations but never relied on them alone, and Amazon disbanded the team by the start of 2017. When Reuters reported the story in October 2018, Amazon said the tool was never used by its recruiters to evaluate candidates, a claim that was met with some scepticism given the multi-year investment and the timeline of events.
What This Means for Every Hiring Organisation
The Amazon story is often discussed as a cautionary tale about AI bias. That framing is correct but incomplete. The more important lesson is about the conditions that made the flawed automation seem necessary in the first place.
When application volume becomes unmanageable, organisations make compromises. They automate keyword filtering, which eliminates candidates based on the presence or absence of specific terms regardless of equivalent experience. They rely on resume screening software that scores candidates on criteria that were never designed to predict job performance. They hire junior recruiters to triage volume at the cost of domain expertise. Or — as Amazon did — they build sophisticated AI systems that inherit all of the flaws of the historical decisions they were trained on.
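The keyword-filtering compromise is easy to illustrate. The following is a minimal sketch, assuming a filter that requires an exact keyword match. The function name, the keyword set, and both resumes are hypothetical and written for this example only.

```python
# Hypothetical filter configuration: every listed keyword must appear.
REQUIRED_KEYWORDS = {"salesforce"}

def passes_keyword_filter(resume_text, required=REQUIRED_KEYWORDS):
    """Return True only if every required keyword appears as a word."""
    words = set(resume_text.lower().split())
    return required <= words

# Two invented resumes, written for illustration only.
shallow = "Used Salesforce for two months as an intern"
deep = "Led a five-year CRM rollout on HubSpot and Dynamics for 300 sales staff"

shallow_passes = passes_keyword_filter(shallow)
deep_passes = passes_keyword_filter(deep)
print(shallow_passes, deep_passes)
```

This prints `True False`: the candidate with two months of exposure clears the filter, while the candidate with years of equivalent CRM experience is rejected for lacking the exact term.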
None of these compromises are made by bad people trying to do bad things. They are made by good people trying to solve an impossible problem with inadequate tools.
The Problem With Resume Screening as a First Filter
At its core, Amazon's failed tool was trying to do something that resume screening — human or automated — is fundamentally ill-suited to do: assess whether a person can do a job based on a document that describes what they have done before.
Resumes are marketing documents. They are curated, structured, and optimised by people who know that a human or machine will make a judgement within seconds. The information they contain is self-reported, unverified, and presented in whatever format the candidate believes will be most persuasive to their audience. They tell you almost nothing about how a person thinks under pressure, how they communicate in complex situations, or whether their stated experience is deep or superficial.
The right first filter is not a document review — it is a conversation. Specifically, a structured conversation with someone who understands the role deeply enough to ask the questions that reveal genuine capability. The problem has always been that such conversations take time, and time is exactly what high-volume hiring does not have.
A Different Approach
The solution to Amazon's problem — and to the broader volume problem in hiring — is not better resume screening. It is replacing resume screening with something more revealing: a structured interview conducted by an intelligence that actually understands the role.
When a hiring manager's expertise and interview approach are embedded into an AI system, that system can do something Amazon's tool could never do: ask follow-up questions, probe inconsistencies, test whether a candidate's stated Salesforce experience is surface-level or substantive, and assess the quality of their thinking rather than the keywords on their resume.
This is the approach that SureScreen Recruit is built on. Not automated resume scoring — which is where Amazon went wrong — but automated interviewing, conducted by a clone of the hiring manager, available to every candidate at any time, without the bias that comes from historical hiring patterns or the fatigue that comes from reviewing hundreds of documents in sequence.
Amazon's story is a reminder that solving the volume problem badly is worse than not solving it at all. The goal is not to automate faster — it is to interview better.
Key Takeaways
- Volume overload drove Amazon to automate in a way that industrialised existing bias rather than eliminating it
- Resume screening — human or AI — is the wrong first filter because resumes are marketing documents, not capability assessments
- The fix is not better screening software — it is replacing screening with structured, expert-led interviews that can scale
- AI interviewing trained on role expertise, not historical hiring outcomes, avoids the bias trap entirely
SureScreen Recruit takes a different approach
Our AI hiring manager clone interviews every candidate based on role expertise — not historical hiring patterns. No resume scoring. No keyword filtering. Just structured conversations that reveal genuine capability.