DSAR and Unstructured Data: Why Keyword Search Is No Longer Enough
In a DSAR workflow, keyword search can be a useful starting point. But as soon as the data is scattered across emails, attachments, and free-form documents, it quickly becomes insufficient to produce a reliable, complete, and defensible response.
DSAR stands for Data Subject Access Request, meaning a request for access to personal data under the GDPR, generally linked to the right of access set out in Article 15.
Introduction
When an organization receives an access request under Article 15 of the GDPR, one of the first reflexes is often to run keyword searches across the systems in scope. Teams search for the person’s name, email address, employee ID, and sometimes a few spelling variants. The approach feels simple, fast, and reasonable.
In some cases, it does help retrieve part of the relevant data. But as soon as a DSAR covers emails, office documents, attachments, internal notes, or other forms of unstructured content, this method quickly reaches its limits.
The issue is not only that it lacks precision. More importantly, it often creates an illusion of coverage. It feels like the organization has “searched,” while in reality a significant share of the relevant data may remain invisible, poorly qualified, or buried in documentary noise.
For legal, HR, and IT teams, the goal is therefore not to eliminate keyword search altogether, but to understand why it is no longer sufficient on its own in a modern DSAR process.
Why keyword search remains attractive
This approach is still common for good reasons. It offers several immediate advantages:
- it is easy to explain,
- it already exists in most tools,
- it allows teams to launch an initial collection quickly,
- it creates a sense of control.
In highly structured environments, it may even work reasonably well for some simple cases. When a piece of data is stable, well-formed, and stored in an identifiable field, a targeted search can retrieve useful elements with limited effort.
The problem begins when the requester’s personal data is no longer contained in clean fields, but scattered across natural language, conversational exchanges, comments, reply histories, or heterogeneous attachments.
The first problem: unstructured data does not always speak the language of the query
In emails and free-form documents, personal data is not always mentioned in a uniform way.
A person may be referred to:
- by first name only,
- by an initial,
- by an internal nickname,
- by a job title,
- by a partial email address,
- or even through implicit context without being named directly.
In that kind of content, a search based on a few explicit keywords captures only part of the real scope. It retrieves what matches the query, but not necessarily what actually relates to the individual.
That is a fundamental difference: finding occurrences is not the same as identifying relevant data in context.
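The gap between an exact query and the ways a person is actually referred to can be illustrated with a minimal Python sketch. The requester name, the sample messages, and the variant patterns are all invented for illustration. In practice, a variant list would have to be built from directory or HR data.

```python
import re

# Hypothetical requester and sample messages, invented for illustration.
FULL_NAME = "Marie Dupont"

MESSAGES = [
    "Marie Dupont asked for a copy of her contract.",
    "Can you check with M. Dupont before Friday?",
    "mdupont@ is still waiting for the payroll fix.",
    "Marie was not happy with the review outcome.",
    "The head of logistics disagreed with the rating.",
]

def exact_hits(messages, term):
    """Indexes of messages containing the exact term (case-insensitive)."""
    return [i for i, text in enumerate(messages) if term.lower() in text.lower()]

# Hand-written variant patterns (assumed): full name, initial plus
# surname, partial email address, first name only.
VARIANT_PATTERNS = [
    r"\bmarie\s+dupont\b",
    r"\bm\.\s*dupont\b",
    r"\bmdupont@",
    r"\bmarie\b",
]

def variant_hits(messages, patterns):
    """Indexes of messages matching at least one variant pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [i for i, text in enumerate(messages)
            if any(c.search(text) for c in compiled)]

exact = exact_hits(MESSAGES, FULL_NAME)
variants = variant_hits(MESSAGES, VARIANT_PATTERNS)
print("exact:", exact)
print("variants:", variants)
```

In this toy corpus the exact query finds one message out of five, while the variant patterns find four. The last message, which refers to the person only by job title, is missed by both, and a first-name pattern such as `marie` would also match any other Marie in a real corpus. Variant matching widens coverage, but it does not resolve context.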
The second problem: too much noise, not enough signal
Keyword search does not only create omissions. It also creates a large number of false positives.
The broader or more ambiguous the search terms are, the more results the team has to review. The requester’s name may appear in:
- automatic signatures,
- distribution lists,
- reply chains with no real relevance,
- copied emails,
- documents where the person is only mentioned in passing.
The result is that teams spend time reviewing large volumes of low-value material, while some genuinely important content may remain hidden simply because it does not contain the expected textual markers.
In other words, keyword search often suffers from a double weakness:
- it misses important material,
- it overloads the review with secondary results.
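One way to reduce this kind of noise is to ignore signature blocks and quoted reply history before matching. The sketch below is a simplified illustration, assuming two common plain-text email conventions: quoted lines start with `>` and the signature follows a line containing only `--`. The sample emails and the `strip_noise` helper are hypothetical.

```python
def strip_noise(body):
    """Drop quoted reply lines and everything after a signature delimiter.

    Assumes plain-text conventions: quoted lines start with '>' and the
    signature follows a line that contains only '--'.
    """
    kept = []
    for line in body.splitlines():
        if line.strip() == "--":
            break  # everything below is treated as signature
        if line.lstrip().startswith(">"):
            continue  # quoted history from an earlier message
        kept.append(line)
    return "\n".join(kept)

def mentions(body, term):
    """Case-insensitive substring test."""
    return term.lower() in body.lower()

# Hypothetical emails, invented for illustration.
EMAILS = {
    "signature_only": "Budget approved.\n--\nMarie Dupont\nFinance",
    "quoted_only": "Agreed.\n> Marie Dupont wrote:\n> see attached",
    "substantive": "Marie Dupont's probation should be extended.",
}

TERM = "Marie Dupont"
raw_hits = sorted(k for k, v in EMAILS.items() if mentions(v, TERM))
clean_hits = sorted(k for k, v in EMAILS.items()
                    if mentions(strip_noise(v), TERM))
print("raw:", raw_hits)
print("clean:", clean_hits)
```

A filter like this is better suited to ordering the review queue than to excluding documents outright. Whether an email that names the requester only in a signature falls within the response remains a judgment for the reviewing team.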
The third problem: context disappears
In a DSAR, the point is not only to find a name. It is to understand what the document actually says about the data subject.
But keyword search does not understand context. It does not easily distinguish between:
- a simple administrative mention,
- substantive information about the person,
- an internal assessment,
- third-party data,
- an exchange whose meaning depends on several messages in a thread.
That limitation is critical in unstructured data. A single email may look harmless in isolation, while its real importance only becomes clear when the full thread is read or when several related documents are connected.
This is exactly where purely lexical approaches show their weakness: they find text, but they cannot tell which results actually matter for the data subject.
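The thread problem can be made concrete with a short sketch. It assumes each message carries a `thread_id` field, which is a hypothetical name. Real mail systems expose conversation or reference headers in different ways. When one message in a thread matches, the whole thread is pulled into review.

```python
from collections import defaultdict

# Hypothetical messages; the 'thread_id' field is an assumption.
MESSAGES = [
    {"id": 1, "thread_id": "T1", "body": "Marie Dupont requested a transfer."},
    {"id": 2, "thread_id": "T1", "body": "I would refuse, her last review was poor."},
    {"id": 3, "thread_id": "T1", "body": "Agreed, let's tell her on Monday."},
    {"id": 4, "thread_id": "T2", "body": "Lunch menu for next week."},
]

def expand_hits_to_threads(messages, term):
    """Map each thread with at least one hit to all of its message ids."""
    by_thread = defaultdict(list)
    for m in messages:
        by_thread[m["thread_id"]].append(m["id"])
    hit_threads = {m["thread_id"] for m in messages
                   if term.lower() in m["body"].lower()}
    return {t: by_thread[t] for t in sorted(hit_threads)}

direct = [m["id"] for m in MESSAGES if "marie dupont" in m["body"].lower()]
expanded = expand_hits_to_threads(MESSAGES, "Marie Dupont")
print("direct hits:", direct)
print("with thread context:", expanded)
```

Only message 1 contains the name, yet messages 2 and 3 carry the assessment and the decision about the person. Expanding to the thread surfaces them for a human reviewer, who still has to judge what they mean.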
Why this becomes a real compliance risk
In DSAR handling, the limits of keyword search are not just an efficiency issue. They can become a quality issue and, in some cases, a compliance issue.
An organization generally needs to be able to show that it implemented a reasonable and consistent search method in light of the scope of the request made under Article 15. If the method is too rudimentary for a complex corpus, several risks appear:
- omission of relevant documents,
- incomplete review of certain sources,
- inconsistent results from one case to another,
- excessive burden on validation teams,
- difficulty explaining the methodological choices.
The danger is not just “missing a file.” The danger is building a process that looks systematic and scalable while still relying on a search mechanism that is too simplistic for the data actually being handled.
What a more robust approach looks like
The answer is not to abandon keyword search entirely. It remains useful as an entry point. But it needs to be embedded in a broader method.
A more robust approach generally combines:
- structured collection of relevant sources,
- a richer search logic that goes beyond a few fixed terms,
- the ability to group documents by context,
- review calibrated to the level of risk,
- human oversight for ambiguous cases.
The goal is not to make search more “intelligent” in the abstract. The goal is to better align the search method with the documentary reality of the DSAR.
The role of document analysis technologies
As volumes grow, organizations naturally look beyond the standard internal search engine.
Document analysis technologies can help to:
- detect entities beyond exact expressions,
- connect identification variants,
- classify documents by type or sensitivity,
- surface contextual relationships,
- accelerate review of the heaviest corpora.
But here too, it is worth being precise about what these tools do. They can improve discovery, triage, and prioritization, but they do not remove the need for clear governance or for human review of sensitive cases.
Their practical value lies above all in reducing dependence on a binary logic of “the term exists / the term does not exist,” which quickly becomes too limited for unstructured data.
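Moving away from a binary test can be as simple as combining several weak signals into a score that orders the review queue. The sketch below is illustrative only. The signal names, weights, and threshold are assumptions, not values taken from any particular tool, and would need calibration on real cases.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """Hypothetical per-document signals produced by earlier steps."""
    exact_name: bool      # the full name appears verbatim
    variant: bool         # a nickname, initial, or partial address appears
    in_hit_thread: bool   # another message in the same thread matched
    hr_document: bool     # the document type suggests HR content

# Illustrative weights, not calibrated on real data.
WEIGHTS = {"exact_name": 3, "variant": 2, "in_hit_thread": 2, "hr_document": 1}

def score(signals):
    """Sum the weights of all signals that are set."""
    return sum(w for name, w in WEIGHTS.items() if getattr(signals, name))

def triage(signals, review_threshold=2):
    """Route a document to review or to a lower-priority queue."""
    return "review" if score(signals) >= review_threshold else "low_priority"

docs = {
    "a": Signals(exact_name=True, variant=False, in_hit_thread=False, hr_document=True),
    "b": Signals(exact_name=False, variant=False, in_hit_thread=True, hr_document=False),
    "c": Signals(exact_name=False, variant=False, in_hit_thread=False, hr_document=True),
}

ranked = sorted(docs, key=lambda k: score(docs[k]), reverse=True)
for key in ranked:
    print(key, score(docs[key]), triage(docs[key]))
```

Document `b` contains no search term at all, but it still reaches the review queue because it belongs to a thread with a hit. A pure "term exists / term does not exist" test would have dropped it.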
What legal, HR, and IT teams should really aim for
For internal teams, the real objective is not to achieve a “perfect” search. It is to build a DSAR process that is solid enough to:
- cover the important sources,
- reduce avoidable omissions,
- limit review overload,
- explain the method used,
- preserve a defensible response.
That often means moving beyond a purely technical mindset and returning to a workflow mindset:
- where are we searching?
- how are we prioritizing?
- how are we qualifying the results?
- which cases require deeper review?
- how are we documenting the key judgments?
It is this combination of method, technology, and control that makes the process more credible.
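The documentation question in particular lends itself to a simple structure. The sketch below records one search step as a JSON line that can be appended to a case log. The field names, the request identifier, and the source label are all hypothetical. The point is only that each search leaves a trace of where, how, and with what result it was run.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SearchRecord:
    """One search step in a DSAR case (hypothetical schema)."""
    request_id: str
    source: str        # where the search was run
    method: str        # how results were obtained and prioritized
    terms: list        # the terms or variants used
    hits: int          # documents returned
    reviewed: int      # documents actually reviewed by a person
    notes: str = ""    # key judgments worth keeping

def to_log_line(record):
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)

record = SearchRecord(
    request_id="DSAR-001",
    source="mailbox:hr-shared",
    method="variants+thread-expansion",
    terms=["Marie Dupont", "M. Dupont"],
    hits=42,
    reviewed=42,
    notes="Thread T1 escalated to legal for third-party data.",
)

line = to_log_line(record)
print(line)
```

A log built from records like this gives the team something concrete to point to when asked how a given response was produced.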
Conclusion
Keyword search has not disappeared from DSAR handling. It remains useful for launching a collection, filtering a corpus, or retrieving certain explicit occurrences.
But on unstructured data, it is no longer enough. It misses relevant information, generates a great deal of noise, and often fails to capture the context needed for serious review.
For organizations that want to scale their GDPR access responses without sacrificing quality, the answer is not to search for more words. It is to adopt a more contextual, more structured, and more defensible approach to document review.