12 June 2025

AI As Your Co-detective

AlixPartners

AlixPartners is a results-driven global consulting firm that specializes in helping businesses successfully address their most complex and critical challenges.

While there is a level of hype around artificial intelligence, its strategic use can provide us with smarter ways to identify key information, and analyse and interpret vast amounts of data. In this article, we use a recent case to outline methodology and potential pitfalls of leveraging AI in investigations.

Whether it's the out-of-the-blue announcement of a groundbreaking AI development that sends knock-on effects through the sector, or a cautionary tale of lawyers referencing citations made up by an AI chatbot, it feels like AI is rarely out of the news these days.

Technological breakthroughs sometimes come with inflated expectations that they'll be the solution to all problems (a process encapsulated by the Gartner hype cycle). However, sometimes they wind up as a solution in search of a problem instead.

That said, it would be foolish to write off any new, much-hyped technology as a passing fad, so it's important that we take an open-minded view and think about how we can best use it. Consequently, on a recent case, we explored how we could use these new tools to do our job better. In doing so, we gained some useful experience regarding the perks and pitfalls of leveraging AI in investigations.

What is the problem we are trying to solve with artificial intelligence?

The use of intelligent tools and technologies is, of course, nothing new in the investigations world. We have been using tried and tested tools effectively for a number of years, in particular through review platforms, supplemented with tools that look at documents more intelligently – context searching, relationship mapping, and so on. The benefits they have brought mean that we can conduct investigations more quickly, efficiently and consistently.

However, we are operating in an increasingly data-heavy, data-driven world, with the regulatory environment creating more complexity and an exponentially increasing volume of data. Even with our existing intelligent methods of review, investigations can take too long and be too expensive for our clients. And, in line with the "buzz" around AI, our clients also want to know how we are using it to help them reduce costs.

We therefore need to find smarter ways to analyse and interpret the data that we have, and to embed the use of AI – either to get to key data more quickly or to make use of the more visual, interactive nature of these tools, which helps findings translate better to users and their audiences.

Case study: how can GenAI help investigators?

Last year, we worked on an investigation into a recently acquired company and a potentially fraudulent EBITDA calculation. Like many investigations, documentary evidence was key to establishing the facts, but in this instance we had over two million "relevant" documents to digest, in multiple languages and formats. In addition, we had more than 2,000 pages of expert reports to consider.

As described above, we have a suite of robust, reliable tools that we use for situations of this kind, and we deployed those tools on this matter too. However, we noted fairly quickly that the sheer volume of documents – and the complexity of the matter – made even relatively simple questions (e.g., "What role did person X have?") difficult to answer. Given the rise of large language models (LLMs) such as ChatGPT, many of our team wished we had an equivalent tool that could be used to ask questions of our pile of documents in a similar fashion.

With this in mind, our technology consultants developed an in-house implementation of a process called "Retrieval Augmented Generation" (or RAG). RAG is a method that can be used to enhance LLMs with more specific information, combining the strengths of tools such as Relativity or Brainspace with the capabilities of LLMs, and this seemed like the perfect use case for it.

Fundamentally, the mechanics of RAG involved the following steps (a minimal illustrative sketch in code follows the list):

  • Step 1 – Set-up:
    A specific kind of database, known as a vector database, was built to store chunks of text from the documents. This database uses advanced mathematics to convert those chunks of text into strings of numbers, which are later used to identify how "similar" those chunks of text are in terms of content and meaning.
  • Step 2 – Retrieval:
    A user enters a query, usually in the form of a question (e.g., "What role did person X have?"). The vector database finds the chunks of text that are closest in meaning to the question (and likely to contain the answer) and returns them. In this particular scenario, we used the 20 most relevant chunks.
  • Step 3 – Generation:
    The same query is then passed to the LLM, along with the chunks of text that are likely to contain the answer. The LLM can then use this information (known as the "context") to answer the question, translating if required.
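
To make those three steps concrete, the sketch below shows them in Python. It is illustrative only: the embedding model, the OpenAI client, and the simple in-memory list standing in for a vector database are assumptions made for the example, not the in-house tooling described here.

```python
# Minimal RAG sketch (illustrative only). Assumes the sentence-transformers and
# openai packages are installed and OPENAI_API_KEY is set; model names and
# sample chunks are placeholders, not the system described in this article.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
llm = OpenAI()

# Step 1 - Set-up: convert chunks of text into vectors (an in-memory stand-in
# for a vector database).
chunks = [
    "Person X was appointed finance director in March 2019.",
    "The EBITDA adjustments were discussed in the June board pack.",
    "Person X resigned from the board in late 2022.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

# Step 2 - Retrieval: find the k chunks closest in meaning to the query.
def retrieve(query: str, k: int = 20) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q  # cosine similarity, as vectors are normalised
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Step 3 - Generation: pass the query and the retrieved context to the LLM.
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    response = llm.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # see "Repeatability" below
        messages=[
            {"role": "system",
             "content": "Answer using only the supplied context, and say so "
                        "if the context does not contain the answer."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What role did person X have?"))
```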

It's worth thinking of the above as how one might research something in a library before the advent of computers. First, you select the most relevant books (maybe using an indexing system such as the Dewey Decimal system) and read them. Then, you answer the question based on what you have learned. With RAG, computers are doing both steps within a fraction of a second. However, the "Retrieval" step is the most important part, as it is what gives the LLM the information it needs to answer the question.

To refine that step, our technology consultants, working with our investigators, built filters that allowed the user to limit searches to emails, instant messenger chats, or documents, or to a certain period in time. We also included the ability to jump straight to the document in our search database, so that our investigators could check back to the source document. This significantly increased how well the system performed and how useful it was for the team.
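
The filters themselves can be conceptually simple. The sketch below (field names, sample records and the review-platform URL are invented for illustration, not the actual tooling) shows the idea: narrow the candidate documents by type and date range before the similarity search, and keep a link back to the source so the investigator can check the original.

```python
# Illustrative filter layer (field names and URLs are placeholders): narrow
# candidates by type and date range before the similarity search, and keep a
# link back to the source document in the review platform.
from datetime import date

documents = [
    {"id": "DOC-001", "type": "email", "date": date(2021, 3, 4),
     "text": "Person X approved the revised EBITDA bridge.",
     "url": "https://review-platform.example/doc/DOC-001"},
    {"id": "DOC-002", "type": "chat", "date": date(2022, 7, 19),
     "text": "Can you resend the acquisition model?",
     "url": "https://review-platform.example/doc/DOC-002"},
]

def filter_candidates(doc_type=None, start=None, end=None):
    """Return only the documents that match the user's filters."""
    results = documents
    if doc_type is not None:
        results = [d for d in results if d["type"] == doc_type]
    if start is not None:
        results = [d for d in results if d["date"] >= start]
    if end is not None:
        results = [d for d in results if d["date"] <= end]
    return results

# e.g. emails from 2021 only; each hit carries a link back to the source
for doc in filter_candidates("email", date(2021, 1, 1), date(2021, 12, 31)):
    print(doc["id"], doc["url"])
```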

In this scenario, the use of RAG was a significant technological enabler for the team. It proved invaluable for obtaining simple, but important, facts as well as for exploring high-level hypotheses to see if the documents had any pertinent information ("I wonder if..."-type queries), without having to worry about language barriers. More importantly, as the system always cited the source documents when providing answers, the investigators were able to fact-check the information they were receiving. When preparing an expert report, it is critical to be able to cite original documents, as ChatGPT and its like are notoriously unreliable witnesses.

Potential challenges and limitations

As we described at the outset of this article, there is a danger in treating a new technology as a universal panacea. It's important to keep in mind what these technologies don't do well, as much as what they do, so that we're able to deploy them appropriately and mitigate their shortcomings.

Hallucinations

Hallucinations are a known issue with LLMs, whereby they present incorrect or nonsensical information as factual (with some headline-makers including Google's AI Overview recommending that users put glue on pizza, or eat rocks). Consequently, our investigators took a "trust, but verify" approach. One of the advantages of RAG systems is that they cite the sources used, so our investigators were able to check back to the original document to verify the facts.
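
In practice, this is easiest when each answer travels with its sources. The following is a minimal sketch of that pattern; the structure and names are ours, purely for illustration.

```python
# "Trust, but verify": keep the source references alongside the answer so an
# investigator can open each original document before relying on it.
# (Structure and names are illustrative, not the actual tooling.)
from dataclasses import dataclass

@dataclass
class CitedAnswer:
    text: str
    source_ids: list[str]  # IDs of the chunks/documents given to the LLM

def fully_verified(answer: CitedAnswer, checked: set[str]) -> bool:
    """True only once every cited source has been read back in the original."""
    return set(answer.source_ids) <= checked

a = CitedAnswer("Person X was finance director from 2019 to 2022.",
                ["DOC-001", "DOC-014"])
print(fully_verified(a, checked={"DOC-001"}))  # False - DOC-014 not yet checked
```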

Proving negatives

One rule we made sure users of the tool understood was "just because you didn't find it, it doesn't mean it's definitely not there". As we described above, the "Retrieval" step is incredibly important. Our technology consultants worked with our investigators to make it as accurate as possible, but it isn't infallible.

The LLM will answer the question based purely on what it has been told – that is, whatever the "Retrieval" step found. If that step didn't find a relevant document (even if one existed), the LLM simply won't know, and will respond with a metaphorical shrug.

To mitigate this, we ensured our investigators kept this in mind when asking questions. For example, if they wanted to know "Did person X ever hold role Y?", and the LLM responded "No", they shouldn't just take the result at face value but should corroborate it (e.g., using more established tools).
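
One simple safeguard is to surface how strong the retrieval actually was, so that a "No" backed only by weak matches is read as "not found" rather than "not there". A sketch under our own assumptions (the threshold and example scores are invented for illustration):

```python
# Flag weak retrieval: if even the best-matching chunk is only loosely related
# to the question, warn that a negative answer may simply mean "not found".
# (The threshold and example scores are illustrative assumptions.)
SIMILARITY_FLOOR = 0.35  # assumed cut-off; tune against known-relevant documents

def strong_matches(scored_chunks: list[tuple[str, float]]) -> list[str]:
    """scored_chunks: (chunk_text, cosine_similarity) pairs from the retriever."""
    strong = [text for text, score in scored_chunks if score >= SIMILARITY_FLOOR]
    if not strong:
        print("Warning: no strong matches - corroborate with established tools.")
    return strong

strong_matches([("Board minutes, June 2021...", 0.21),
                ("Travel and expenses policy...", 0.12)])
```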

An analogy our technology consultants used to help this message land was that querying the AI tool is like fishing. If you don't catch the fish (or document), it doesn't mean it's not there, and you might have more luck with a different bait (or query).

"Garbage in, garbage out"

Closely connected to the above is one of the basic rules of computing – garbage in, garbage out. RAG tools are only as good as the data available to them. If the documents added to the tool don't contain the useful information – or worse, contain information that is false – the answers that the LLM gives will be wrong. This is especially something to keep in mind with emails, which are created by humans, who are fallible and have been known to make things up.

Once again, the mitigation here is to check which source is being cited in the answer, confirm that it says what the LLM claims it does, and then decide whether that source can be relied upon. In addition, recent LLMs include "reasoning engines" that break questions down into steps, as a way of "showing their working", so it's possible to see what steps were followed to arrive at an answer and identify where it might have gone off track.

Repeatability

One known issue with commercial LLMs is that they tend not to give the same answer twice when asked exactly the same question. This is partially by design, to allow the models to be more creative when it comes to the answers they give (and potentially appear more "human"). However, this isn't ideal for investigative purposes, when a degree of predictability is more desirable.

There are ways to mitigate this. Most LLMs have a configuration setting called "temperature", which adjusts the amount of 'randomness' in how they construct their responses. By lowering this to zero, the LLMs should (in theory) be much more deterministic, although in practice there can be small variances. In reality, though, we prefer not to rely on this at all, and always refer back to the source document rather than the LLM's interpretation. This removes any element of uncertainty.
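
For example, with the OpenAI API (used here purely for illustration; most commercial LLM APIs expose an equivalent setting), temperature is a single parameter on the request:

```python
# Lowering "temperature" to reduce randomness (illustrative; assumes the openai
# package is installed and an API key is available).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # 0 = as deterministic as the model allows; small variances remain
    messages=[{"role": "user", "content": "What role did person X have?"}],
)
print(response.choices[0].message.content)
```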

What might the future hold?

There are few more dangerous things to do in an article discussing technology trends than to try to predict the future. There also isn't enough room to cover all upcoming developments in the AI space. So, rather than attempt to make any exact or holistic predictions, we will discuss a few ways that the use of LLMs may evolve as the technology itself develops.

Context windows of opportunity?

We discussed above how the typical use of RAG, as used in our matter, relies on an initial search of documents (held in a "vector database") to narrow the millions of documents down to those most likely to be relevant (20, in our case), and pass those to the LLM. This is because the LLM has a limit known as the "context window", which represents the amount of short-term information it can hold in its "brain" at once when answering a question.

These windows are getting larger. OpenAI's GPT-4, released in March 2023, can hold around 6-8,000 English words in its context window (around 12-16 pages). GPT-4o, released just over a year later in May 2024, can hold around 100-120,000 words (around 200-240 pages). The larger the window, the greater the number of documents that can be held at once, and the greater the margin for error in the "retrieval" step. With a large enough context window, and a small enough matter, the retrieval step might not be needed at all.
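
Context windows are measured in tokens rather than words, so it is worth checking how much of the window a set of documents would actually consume. A rough sketch using the tiktoken library (the encoding chosen and the 128,000-token window used here reflect published figures for the GPT-4 family, but treat them as assumptions for the example):

```python
# Rough check of how much of a context window a set of documents would use.
# Illustrative only: assumes the tiktoken package; cl100k_base is the GPT-4-era
# encoding, and 128,000 tokens is the published GPT-4o context window.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 128_000  # tokens

documents = ["...full text of document 1...", "...full text of document 2..."]
total_tokens = sum(len(encoding.encode(doc)) for doc in documents)
print(f"{total_tokens} tokens used of {CONTEXT_WINDOW} "
      f"({100 * total_tokens / CONTEXT_WINDOW:.2f}%)")
```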

Specialist contextual knowledge

Relatedly, increasing the size of the LLM's short-term "brain" opens up the possibility of ensuring that it has the important specialist knowledge it might require to answer questions. LLMs typically have a limit to what they "know" (based on what they were trained on), determined by which documents they had access to and when they were trained. For example, ChatGPT's latest GPT-4o model has a knowledge cut-off of June 2024, so it wouldn't be able to tell you who won the Best Picture Oscar in 2025, or (more relevantly) know about an update to financial reporting standards released this year.

With a bigger "brain" to play with, the LLMs can also be passed some of this relevant reference data when asked questions about documents. This would potentially allow them to incorporate contemporary or expert knowledge into their responses beyond that which they were trained on. Although this process is technically feasible now, increasing capabilities will make this more practical, allowing more of this "specialist knowledge" to be provided.
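
A sketch of what that might look like in practice (the reference file, model name and question below are placeholders we have invented for the example, not material from our matter):

```python
# Supplying specialist reference material in the prompt alongside the retrieved
# documents (illustrative; the file name, model and question are placeholders).
from openai import OpenAI

client = OpenAI()
reference_material = open("reporting_standard_update_summary.txt").read()  # assumed file
retrieved_context = "...chunks returned by the retrieval step..."

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "Use the reference material below when interpreting the "
                    "documents.\n\nReference material:\n" + reference_material},
        {"role": "user",
         "content": f"Context:\n{retrieved_context}\n\n"
                    "Question: How does the updated standard affect the EBITDA calculation?"},
    ],
)
print(response.choices[0].message.content)
```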

A model for every occasion

Put simply, LLMs at present are typically created by taking a huge amount of data (e.g., the entire public internet) and using a massive amount of computing resources to produce a trained model. As they're not trained on the specific data that is part of a matter (usually because it is not public), the LLM doesn't have that information available.

There are already ways to embed additional information into an existing model (beyond the use of RAG), using a process called "fine-tuning". This involves taking an additional set of data (e.g., data that contains the specialist knowledge required) and using it to give the model some "supplemental training". It is currently a time-consuming and costly process, but as LLM technology evolves, it will become more viable.
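
As an illustration of the mechanics only (not the process used on our matter), OpenAI's fine-tuning API takes a file of example conversations and produces a supplementally trained model; the training examples and base-model name below are invented placeholders.

```python
# Sketch of "supplemental training" via fine-tuning (illustrative; the training
# examples are invented and the base model name is an assumption).
import json
from openai import OpenAI

client = OpenAI()

# Example question/answer pairs embedding the specialist knowledge.
examples = [
    {"messages": [
        {"role": "user", "content": "How is adjusted EBITDA defined in this matter?"},
        {"role": "assistant", "content": "Adjusted EBITDA is defined as ..."},
    ]},
]
with open("training.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the training file and start the fine-tuning job.
training_file = client.files.create(file=open("training.jsonl", "rb"),
                                    purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-4o-mini-2024-07-18")
print(job.id)
```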

In summary

History is filled with examples of when new technologies generated much excitement, only for reality to fall short of the inflated expectations.

It is therefore important to try to strike the right balance between open-mindedness and objectivity. AI has, so far, shown itself to be a powerful tool in investigations, but only in the hands of people who are aware of its strengths and limitations, and can form a series of questions most likely to retrieve useful answers. After all, with great power comes great responsibility.

By combining the application of this new technology with other well-established techniques to reinforce our findings, we were able to leverage it to quickly grasp the key points of our matter, while avoiding the pitfalls that could have created problems for us further down the line. We are looking forward to seeing where this road continues to take us, even if the road might present a few bumps along the way.

Originally published by Fraud Intelligence

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
