Australia: Deduplication demystified

Last Updated: 6 August 2019

Litigation can attract significant costs for law firms, and 70–75 per cent of this cost is usually attributed to the review of discovery documents. Efficiently reducing the volume of documents that need to be reviewed from the outset can save significant time and cost.

Within any set of data, there will usually be duplicate documents. One of the easiest ways to quickly reduce volume is to remove the duplicated documents.

How Does Deduplication Work?

Deduplication (or deduping) is a common process often used in eDiscovery to reduce the amount of data to be searched or reviewed.

Each electronic file or email is assigned an identifier known as an "MD5" value, which is calculated from the file's raw data (bit by bit) and used for deduplication. Two files whose raw data is exactly the same receive the same MD5 value.

Theoretically, when multiple identical documents in the data set share the same MD5 value, only the first loaded document will be available for search or review. The identical documents processed after it will be "deduped" out. But hold on, that's not quite true.
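The basic idea can be sketched in a few lines of Python. This is an illustration only, not how any particular eDiscovery platform is implemented: the function names `md5_of_bytes` and `dedupe` and the sample documents are assumptions made for this example.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 hex digest of a document's raw bytes."""
    return hashlib.md5(data).hexdigest()


def dedupe(documents):
    """Keep the first document loaded for each MD5 value.

    `documents` is an iterable of (name, raw_bytes) pairs in load order.
    Returns two lists of names: (kept, removed).
    """
    seen = set()
    kept, removed = [], []
    for name, data in documents:
        digest = md5_of_bytes(data)
        if digest in seen:
            removed.append(name)  # same raw data as an earlier document
        else:
            seen.add(digest)
            kept.append(name)
    return kept, removed


# Hypothetical sample data: two byte-for-byte identical files and one other.
docs = [
    ("contract_v1.docx", b"Terms of engagement"),
    ("copy_of_contract.docx", b"Terms of engagement"),
    ("invoice.pdf", b"Invoice 0042"),
]
kept, removed = dedupe(docs)
print("kept:", kept)
print("removed:", removed)
```

In this sketch the second file is removed because its raw bytes, and therefore its MD5 value, match the first file loaded.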

Why am I Seeing Duplicated Documents After Applying Deduplication?

As every discovery has different requirements, there are options and flexibility built into the deduplication process.

Global Deduplication vs. Custodian Deduplication

In global deduplication, each document is deduplicated against every other document in the data set. In custodian deduplication, each document is deduplicated only against the documents in the data set that belong to the same 'custodian'. Where several custodians hold the same document, each custodian therefore keeps one copy of it.
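The difference between the two options comes down to what counts as "already seen". A minimal sketch, assuming each record carries a custodian name (the `dedupe` function, its `scope` parameter and the sample records are hypothetical, chosen for illustration):

```python
import hashlib


def dedupe(records, scope="global"):
    """Return the (custodian, name) pairs that survive deduplication.

    `records` is a list of (custodian, name, raw_bytes) in load order.
    scope="global":    compare against the whole data set.
    scope="custodian": compare only within the same custodian.
    """
    seen = set()
    kept = []
    for custodian, name, data in records:
        digest = hashlib.md5(data).hexdigest()
        # The lookup key decides how far the comparison reaches.
        key = digest if scope == "global" else (custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append((custodian, name))
    return kept


records = [
    ("Alice", "memo.pdf", b"Board memo"),
    ("Alice", "memo_copy.pdf", b"Board memo"),
    ("Bob", "memo.pdf", b"Board memo"),
]
global_kept = dedupe(records, scope="global")
custodian_kept = dedupe(records, scope="custodian")
print("global:", global_kept)
print("custodian:", custodian_kept)
```

With these sample records, global deduplication keeps a single copy of the memo, while custodian deduplication keeps one copy for Alice and one for Bob.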

Parent Document vs. Child Document

For emails, each email is a parent document and its attachments are known as child documents. To maintain the full family relationship, the deduplication process does not compare the MD5 identifiers of the attachments or 'child' documents. When an email attachment has the same MD5 identifier as a document that has already been loaded, the attachment is not "deduped" out, because it must stay with its parent email.
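A rough sketch of family-aware deduplication follows. It assumes that only top-level (parent) items are compared and that a family is kept or removed as a whole; how a given platform actually hashes an email is not covered by this article, so the `Item` class and `dedupe_families` function are illustrative only.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    data: bytes
    children: list = field(default_factory=list)  # attachments, if any


def dedupe_families(items):
    """Compare parent MD5 values only; children follow their parent."""
    seen_parents = set()
    kept = []
    for item in items:
        digest = hashlib.md5(item.data).hexdigest()
        if digest in seen_parents:
            continue  # duplicate parent: the whole family is removed
        seen_parents.add(digest)
        kept.append(item.name)
        # Attachments are never compared on their own, so they stay.
        kept.extend(child.name for child in item.children)
    return kept


report = b"Quarterly report"
items = [
    Item("report.pdf", report),  # loose file
    Item("email_1.msg", b"FYI, see attached", [Item("email_1/report.pdf", report)]),
    Item("email_2.msg", b"FYI, see attached", [Item("email_2/report.pdf", report)]),
]
kept = dedupe_families(items)
print(kept)
```

Here the second email is removed together with its attachment, but the attachment to the first email survives even though an identical loose file was loaded earlier. This is one reason duplicates can still appear after deduplication.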

Why Do Computers See Differences in Files that Look Alike?

In theory, all the parent documents with the same MD5 identifiers should be "de-duped" out in the process described above. However, the document will not be identified as a duplicate if it has identical content but the MD5 identifiers are different.

The most common reason for this is a difference in document format. Each file format (e.g. PDF, DOC, DOCX) may contain metadata unique to its application, so the same content saved in different formats may produce files with different MD5 identifiers.

Another example of look-alike files involves images. The computer treats a text-searchable document (i.e. you can select the text, then copy and paste it) and a non-text-searchable document (i.e. the content is an image within the document) as different, even though the two documents are in the same file format (e.g. PDF) and their contents are identical.
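The effect is easy to reproduce. The sketch below is a simplified stand-in for real format differences: it stores the same sentence in two text encodings rather than in, say, PDF and DOCX, and the variable names are chosen for illustration. The visible text is identical, but the underlying bytes are not, so the MD5 values differ.

```python
import hashlib

text = "Minutes of the meeting held on 1 July"

# Same visible content, two different byte representations.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")

same_text = as_utf8.decode("utf-8") == as_utf16.decode("utf-16")
same_md5 = hashlib.md5(as_utf8).hexdigest() == hashlib.md5(as_utf16).hexdigest()

print("same text to a reader:", same_text)
print("same MD5 to the computer:", same_md5)
```

A reader sees one document; an MD5-based comparison sees two.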


Near-deduplication, also called textual deduplication, is a method of grouping together "nearly identical" documents based on their content (i.e. the text extracted from each document).

Before near-deduplication is applied, all non-text-searchable documents (images and documents without text) need to be sent for "Imaging".

This process will provide the percentage of similarity of the document against the other nearly identical documents by comparing the extracted text (document content or email body). It is typically used to reduce review costs, and to ensure consistent coding during review.
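As a rough illustration of a similarity percentage, the sketch below compares extracted text word by word using Python's standard `difflib.SequenceMatcher`. This is a stand-in technique, not the algorithm used by any specific near-deduplication tool. The 90 per cent threshold, the function names and the sample texts are all assumptions.

```python
from difflib import SequenceMatcher


def similarity_percent(text_a: str, text_b: str) -> float:
    """Word-level similarity of two extracted texts, from 0 to 100."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return round(100 * SequenceMatcher(None, words_a, words_b).ratio(), 1)


def group_near_duplicates(texts, threshold=90.0):
    """Group documents whose text is at least `threshold` per cent similar
    to the first document (the "pivot") of an existing group.

    `texts` maps a document name to its extracted text.
    """
    groups = []  # list of (pivot_name, member_names)
    for name, text in texts.items():
        for pivot, members in groups:
            if similarity_percent(texts[pivot], text) >= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))  # start a new group
    return [members for _, members in groups]


texts = {
    "draft_a": "The parties agree to the terms set out in schedule one of this agreement",
    "draft_b": "The parties agree to the terms set out in schedule two of this agreement",
    "invoice": "Invoice for catering services",
}
score = similarity_percent(texts["draft_a"], texts["draft_b"])
groups = group_near_duplicates(texts)
print(score)
print(groups)
```

The two drafts differ by a single word and land in the same group, so a reviewer can code them together; the invoice forms its own group.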


As mentioned earlier, the MD5 is a unique identifier produced from the document in its raw data format and two documents are identical if their MD5 identifier values match. What if we know the two files are different but they are identical in nature?

Original emails vs. Archived emails

Most email archiving systems will archive a complete version of an email (with attachments) and keep a trimmed version of the email (with the attachments removed) in the mailbox. This is typically done to reduce the storage used by the email system.

Custom-deduplication uses selected metadata from the email. In the above case, most of the metadata from the two emails should be identical. A custom MD5 identifier can be created from the metadata "Subject", "From", "To" and "Date Sent" to address this situation.
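A custom identifier of this kind could be sketched as follows. The four field names come from the text above; the normalisation (trimming and lower-casing), the separator character, the `custom_md5` function and the sample emails are assumptions made for illustration.

```python
import hashlib

FIELDS = ("Subject", "From", "To", "Date Sent")


def custom_md5(email: dict) -> str:
    """Build an MD5 value from selected metadata instead of the raw file."""
    # Assumed normalisation: trim whitespace and ignore letter case.
    parts = [str(email.get(name, "")).strip().lower() for name in FIELDS]
    joined = "\x1f".join(parts)  # unit separator keeps fields distinct
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


original = {
    "Subject": "Signed agreement",
    "From": "alice@example.com",
    "To": "bob@example.com",
    "Date Sent": "2019-08-01T09:30:00",
    "Attachments": ["agreement.pdf"],
}
# The trimmed copy left in the mailbox has had its attachment removed.
archived_stub = {**original, "Attachments": []}
reply = {**original, "Subject": "RE: Signed agreement",
         "Date Sent": "2019-08-01T10:05:00"}

print(custom_md5(original) == custom_md5(archived_stub))
print(custom_md5(original) == custom_md5(reply))
```

The original email and its trimmed copy share a custom identifier even though their raw data differs, while the reply does not. The trade-off is that any two emails that agree on all four fields are treated as duplicates, so the choice of fields needs care.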


Nowadays, almost every eDiscovery matter requires a different level of deduplication to reduce the volume of documents and, in turn, the overall review cost. Deduplication also ensures coding consistency during review. However, deduplication can be a double-edged sword: it may have unintended consequences if it is not planned thoroughly. Consult your eDiscovery expert at Law In Order to understand which approach is suitable for your matter.

