While there is a level of hype around artificial intelligence,
its strategic use can provide us with smarter ways to identify key
information, and analyse and interpret vast amounts of data. In
this article, we use a recent case to outline methodology and
potential pitfalls of leveraging AI in investigations.
Whether it's the out-of-the-blue announcement of a
groundbreaking AI development that sends knock-on effects through
the sector, or a cautionary tale of lawyers referencing citations
made up by an AI chatbot, it feels like AI is rarely out of the
news these days.
Technological breakthroughs sometimes come with inflated
expectations that they'll be the solution to all problems (a
process encapsulated by the Gartner hype cycle). Sometimes, however, they
wind up as a solution in search of a problem instead.
That said, it would be foolish to write off any new, much-hyped
technologies as passing fads, so it's important that we take an
open-minded view and think about how we can best use them.
Consequently, on a recent case, we explored how we could use these
new tools to do our job better. In doing so, we gained some useful
experience regarding the perks and pitfalls of leveraging AI in
investigations.
What is the problem we are trying to solve with artificial intelligence?
The use of intelligent tools and technologies is, of course,
nothing new in the investigations world. We have been using tried
and tested tools effectively for a number of years, in particular
through review platforms, supplemented with tools that look at
documents more intelligently – context searching,
relationship mapping, and so on. The benefits they have brought
mean that we can conduct investigations more quickly, efficiently
and consistently.
However, we are operating in an increasingly data-heavy,
data-driven world, with the regulatory environment creating more
complexity and an exponentially increasing volume of data. Even
with our existing intelligent methods of review, investigations can
take too long and be too expensive for our clients. Given the
"buzz" around AI, our clients also want to know how we are using it
to help them reduce costs.
We therefore need to find smarter ways to analyse and interpret the
data that we have, and to embed the use of AI – either to get to
key data more quickly or to make use of the more visual, interactive
nature of these tools – helping our findings translate better to
their users and audience.
Case study: how can GenAI help investigators?
Last year, we worked on an investigation into a recently
acquired company and a potentially fraudulent EBITDA calculation.
As in many investigations, documentary evidence was key to
establishing the facts, but in this instance we had over two
million "relevant" documents to digest, in multiple
languages and formats. In addition, we had more than 2,000 pages of
expert reports to consider.
As described above, we have a suite of robust, reliable tools that
we use for situations of this kind, and we deployed those tools on
this matter too. However, we noted fairly quickly that the sheer
volume of documents – and the complexity of the matter
– made even relatively simple questions (e.g.,
"What role did person X have?") difficult to answer.
Given the rise of large language models (LLMs) such as
ChatGPT, many of our team wished we had an equivalent tool that
could be used to ask questions of our pile of documents in a
similar fashion.
With this in mind, our technology consultants developed an in-house
implementation of a process called "Retrieval Augmented
Generation" (or RAG). RAG is a method that can be used to
enhance LLMs with more specific information, combining the
strengths of tools such as Relativity or Brainspace with the
capabilities of LLMs, and this seemed like the perfect use case for
it.
Fundamentally, the mechanics of RAG involved the following
steps:
- Step 1 – Set-up: A specific kind of database, known as a vector database, was built to store chunks of text from the documents. This database uses advanced mathematics to convert those chunks of text into strings of numbers, which are later used to identify how "similar" those chunks of text are in terms of content and meaning.
- Step 2 – Retrieval: A user enters a query, usually in the form of a question (e.g., "What role did person X have?"). The vector database finds the chunks of text that are closest in meaning to the question (and likely to contain the answer) and returns them. In this particular scenario, we used the 20 most relevant chunks.
- Step 3 – Generation: The same query is then passed to the LLM, along with the retrieved chunks of text. The LLM can then use this information (known as the "context") to answer the question, translating if required.
It's worth thinking of the above as how one might research
something in a library before the advent of computers. First, you
select the most relevant books (maybe using an indexing system such
as the Dewey Decimal system) and read them. Then, you answer the
question based on what you have learned. With RAG, computers are
doing both steps in a fraction of a second. However, the
"Retrieval" step is the most important part, as this is
what gives the LLM the information it needs to answer the
question.
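To make the three steps above a little more concrete, below is a minimal, illustrative sketch of the RAG process in Python. It is not our production implementation: the document chunks, model names and the use of the OpenAI Python client are assumptions made purely for illustration, and any equivalent embedding and chat service could be substituted.

```python
# Minimal, illustrative RAG sketch (not a production system).
# Uses the OpenAI Python client purely as an example provider;
# document chunks and model names are invented for illustration.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Convert chunks of text into vectors of numbers (Step 1)."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Step 1 - Set-up: store each chunk of text alongside its vector.
chunks = [
    "Person X was appointed finance director in March 2019.",
    "The EBITDA adjustment was approved by the board in Q4 2020.",
    # ...in a real matter, millions of chunks held in a vector database
]
chunk_vectors = embed(chunks)

# Step 2 - Retrieval: find the chunks closest in meaning to the question.
question = "What role did person X have?"
query_vector = embed([question])[0]
similarity = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_k = similarity.argsort()[::-1][:20]  # the 20 most relevant chunks
context = "\n\n".join(chunks[i] for i in top_k)

# Step 3 - Generation: pass the question and the retrieved context to the LLM.
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer using only the supplied context and cite the chunks you rely on."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```

In practice, a dedicated vector database performs the similarity search far more efficiently than the brute-force comparison above, but the principle is the same.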
To refine that step, our technology consultants, working with our
investigators, built filters that allowed the user to limit
searches to emails, instant messenger chats, or documents, or to a
certain period of time. We also included the ability to jump
straight to the document in our search database, so that our
investigators could check back to the source document. This
significantly increased how well the system performed and how
useful it was for the team.
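As a rough illustration of how such filters might sit in front of the retrieval step, the sketch below restricts the candidate chunks by document type and date before any similarity ranking takes place. The metadata fields, values and document IDs are invented for illustration only.

```python
# Illustrative pre-filtering of chunks before similarity ranking.
# Metadata fields, values and document IDs are invented.
from datetime import date

chunk_metadata = [
    {"doc_id": "DOC-001", "doc_type": "email", "doc_date": date(2019, 3, 14)},
    {"doc_id": "DOC-002", "doc_type": "chat", "doc_date": date(2020, 6, 2)},
    {"doc_id": "DOC-003", "doc_type": "email", "doc_date": date(2019, 11, 30)},
]

def candidate_indices(doc_type=None, date_from=None, date_to=None):
    """Return positions of chunks that satisfy the user's filters."""
    keep = []
    for i, meta in enumerate(chunk_metadata):
        if doc_type and meta["doc_type"] != doc_type:
            continue
        if date_from and meta["doc_date"] < date_from:
            continue
        if date_to and meta["doc_date"] > date_to:
            continue
        keep.append(i)
    return keep

# Limit the search to emails from 2019; the doc_id of each retrieved
# chunk lets the investigator jump back to the source document.
emails_2019 = candidate_indices("email", date(2019, 1, 1), date(2019, 12, 31))
print(emails_2019)  # [0, 2]
```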
In this scenario, the use of RAG was a significant technological
enabler for the team. It proved invaluable for obtaining simple,
but important, facts as well as for exploring high-level hypotheses
to see if the documents had any pertinent information ("I
wonder if..."-type queries), without having to worry about
language barriers. More importantly, as the system always cited the
source documents when providing answers, the investigators were
able to fact-check the information they were receiving. When
preparing an expert report, it is critical to be able to cite
original documents, as ChatGPT and the like are notoriously
unreliable witnesses.
Potential challenges and limitations
As we described at the outset of this article, there is a danger in treating a new technology as a universal panacea. It's therefore important to keep in mind what these technologies don't do well, as much as what they do well, so that we can deploy them appropriately and mitigate their shortcomings.
Hallucinations
Hallucinations are a known issue with LLMs, whereby they present incorrect or nonsensical information as factual (with some headline-makers including Google's AI Overview recommending that users put glue on pizza, or eat rocks). Consequently, our investigators took a "trust, but verify" approach. One of the advantages of RAG systems is that they cite the sources used, so our investigators were able to check back to the original document to verify the facts.
Proving negatives
One rule we made sure the users of the tool understood was
"just because you didn't find it, it doesn't mean
it's definitely not there". As we described above, the
"Retrieval" step is incredibly important. Our technology
consultants worked with our investigators to ensure it was as
accurate as possible, but it isn't infallible.
The LLM answers the question based purely on what it has been
told, which is whatever the "Retrieval" step found. If that
step didn't find a relevant document (even if one existed), the
LLM will respond with a metaphorical shrug.
To mitigate this, we ensured our investigators kept this in mind
when asking questions. For example, if they wanted to know
"Did person X ever hold role Y?", and the LLM responded
"No", they shouldn't just take the result at
face value but should corroborate it (e.g., using more established
tools).
An analogy our technology consultants used to help this message
land was that querying the AI tool is like fishing: if you
don't catch the fish (or document), it doesn't mean
it's not there, and you might have more luck with different
bait (or a different query).
"Garbage in, garbage out"
Closely connected to the above is one of the basic rules of
computing – garbage in, garbage out. RAG tools are only as
good as the data available to them. If the documents added to the
tool don't contain useful information – or worse, contain
information that is false – the answers the LLM gives will be
wrong. This is especially worth keeping in mind with emails, which
are created by humans, who are both fallible and have been known to
make things up.
Once again, the mitigation here is to check which source is being
cited with each answer, confirm that it says what the LLM claims it
does, and then decide whether the source can be relied upon. In
addition, recent LLMs include "reasoning engines" that
break questions down into steps as a way of "showing their
working", so it's possible to see what steps were followed
to arrive at an answer and identify where it might have gone
slightly off track.
Repeatability
One known issue with commercial LLMs is that they tend not to
give the same answer twice when asked exactly the same question.
This is partially by design, to allow the models to be more
creative when it comes to the answers they give (and potentially
appear more "human"). However, this isn't ideal for
investigative purposes, where a degree of predictability is
desirable.
There are ways to mitigate this. Most LLMs have a configuration
setting called "temperature", which is used to adjust the
amount of 'randomness' in how they compose their
responses. Lowering this to zero should, in theory, make the LLM
much more deterministic, although in practice there can still be
small variances. In reality, though, we prefer not to rely on
this at all, and always to refer to the source document rather than
the LLM's interpretation. This way, you remove any element of
uncertainty.
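By way of illustration, most LLM APIs expose this setting directly. The sketch below uses the OpenAI Python client as an example; other providers offer an equivalent parameter, and the query shown is invented.

```python
# Illustrative only: pinning "temperature" to zero to reduce randomness.
# Uses the OpenAI Python client purely as an example provider.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # ask the model to be as deterministic as possible
    messages=[{"role": "user", "content": "What role did person X have?"}],
)
print(response.choices[0].message.content)
```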
What might the future hold?
There are few more dangerous things to do in an article discussing technology trends than to try to predict the future. There also isn't enough room to cover all upcoming developments in the AI space. So, rather than attempt to make any exact or holistic predictions, we will discuss a few ways that the use of LLMs may evolve as the technology itself develops.
Context windows of opportunity?
We discussed above how the typical use of RAG, as used in our
matter, relies on an initial search of documents (held in a
"vector database") to narrow the millions of documents down to
those most likely to contain the answer (the top 20, in our case),
and passes those to the LLM. This is necessary because the LLM has a
limit known as the "context window", which represents the
amount of short-term information
it can hold in its "brain" at once when answering a
question.
Context windows are getting larger. OpenAI's GPT-4, released in
March 2023, could hold around 6,000-8,000 English words in its
context window (around 12-16 pages). GPT-4o, released just over a
year later in May 2024, can hold around 100,000-120,000 words
(around 200-240 pages). The larger the window, the greater the
number of documents that can be held at once, and the greater the
margin for error in the "retrieval" step. With a large
enough context window, and a small enough matter, the retrieval
step might not be needed at all.
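As a rough illustration of that arithmetic, counting tokens shows whether a set of documents would simply fit inside the window without any retrieval step. The sketch below assumes a recent version of OpenAI's tiktoken tokeniser that recognises GPT-4o, and uses its published window of roughly 128,000 tokens; the document text is invented.

```python
# Rough illustration: would these documents fit in one context window?
# Assumes a recent tiktoken release that recognises GPT-4o.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")
documents = ["Full text of document one...", "Full text of document two..."]

total_tokens = sum(len(encoding.encode(doc)) for doc in documents)
if total_tokens < 128_000:  # GPT-4o's published context window
    print("Small enough: the whole document set fits in one context window.")
else:
    print("Too large: a retrieval step is still needed to select the best chunks.")
```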
Specialist contextual knowledge
Relatedly, increasing the size of the LLM's short-term
"brain" opens up the possibility of making sure it has the
important specialist knowledge it might require to answer
questions. LLMs typically have a limit to what they "know",
determined by which documents they had access to during training
and when that training took place. For example, ChatGPT's latest GPT-4o model has a
knowledge cut-off of June 2024, so it wouldn't be able to tell
you who won the Best Picture Oscar in 2025, or (more relevantly)
know about an update to financial reporting standards released this
year.
With a bigger "brain" to play with, the LLMs can also be
passed some of this relevant reference data when asked questions
about documents. This would potentially allow them to incorporate
contemporary or expert knowledge into their responses beyond that
which they were trained on. Although this process is technically
feasible now, increasing capabilities will make this more
practical, allowing more of this "specialist knowledge"
to be provided.
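To illustrate the idea in its simplest form, relevant reference material can simply be placed in the prompt alongside the retrieved document context. The reference text, context and question below are invented placeholders.

```python
# Illustrative only: supplying specialist reference material in the prompt
# alongside retrieved document context. All text here is a placeholder.
reference_material = (
    "Extract from the 2025 amendment to the relevant financial reporting standard..."
)
context = "Chunks of text retrieved from the matter's documents..."
question = "How should the revised standard affect the EBITDA calculation?"

prompt = (
    f"Reference material:\n{reference_material}\n\n"
    f"Context from the documents:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)
```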
A model for every occasion
Put simply, LLMs at present are typically created by taking a
huge amount of data (e.g., the entire public internet) and using a
massive amount of computing resources to produce a trained model.
As they're not trained on the specific data that is part of a
matter (usually because it is not public), the LLM doesn't have
that information available.
There are currently ways to augment existing models to embed
additional information fully (beyond the use of RAG), using a
process called "fine-tuning". This involves taking an
additional set of data (e.g., data that contains the specialist
knowledge required) and using it to add some "supplemental
training" to the model. It is currently a time-consuming and
costly exercise, but as LLM technology evolves, it will
become more viable.
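As an illustration of what that "supplemental training" data can look like, the sketch below writes a couple of example records in the chat-style JSONL format used by OpenAI's fine-tuning service (other providers use different formats). The questions and answers are invented; a real training set would contain many such examples drawn from the specialist knowledge in question.

```python
# Illustrative fine-tuning records in chat-style JSONL format.
# The question-and-answer pairs are invented placeholders.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "How is deferred revenue treated in the group's EBITDA bridge?"},
        {"role": "assistant", "content": "Under the group's accounting policy, deferred revenue is..."},
    ]},
    {"messages": [
        {"role": "user", "content": "Which approvals are required for post-acquisition adjustments?"},
        {"role": "assistant", "content": "Adjustments above the agreed threshold require board approval..."},
    ]},
]

with open("fine_tune_training.jsonl", "w") as handle:
    for example in examples:
        handle.write(json.dumps(example) + "\n")
```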
In summary
History is filled with examples of new technologies that
generated much excitement, only for reality to fall short of the
inflated expectations.
It is therefore important to try to strike the right balance
between open-mindedness and objectivity. AI has, so far, shown
itself to be a powerful tool in investigations, but only in the
hands of people who are aware of its strengths and limitations, and
can form a series of questions most likely to retrieve useful
answers. After all, with great power comes great
responsibility.
By combining the application of this new technology with other
well-established techniques to reinforce our findings, we were able
to leverage it to quickly grasp the key points of our matter, while
avoiding the pitfalls that could have created problems for us
further down the line. We are looking forward to seeing where this
road continues to take us, even if the road might present a few
bumps along the way.
Originally published by Fraud Intelligence
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.