Web Scraping In Quebec: Lessons Learned From The OpenAI Investigation & Practical Guidance

In May 2026, the Office of the Privacy Commissioner of Canada (the “OPC”) and its provincial counterparts, together with the Quebec privacy regulator (the Commission d’accès à l’information du Québec or the “CAI”), published their findings following a joint investigation into certain data collection and processing practices of OpenAI OpCo, LLC (“OpenAI”). The investigation addressed a broad range of privacy issues arising from the development and operation of large-scale artificial intelligence (“AI”) systems.

Whilst the OpenAI findings are wide-ranging, this bulletin will focus specifically on the CAI’s assessment of the online scraping of personal information and the privacy considerations under Quebec privacy laws. In addition, we will draw on its assessment to offer organizations practical guidance to better align their data-scraping activities with the CAI’s expectations.

1. Web Scraping in Today’s Digital Economy

The Internet has, for decades, functioned as the world’s largest publicly accessible repository of information. Every day, billions of pages of text, images, user profiles, forum posts, news articles, and social media updates are generated and made accessible online. For organizations seeking to extract value from this data, web scraping has become a widely used technique. This technique broadly refers to the automated collection of data from publicly accessible websites and online platforms, typically using specialized crawlers or bots that systematically retrieve and store content at scale.

The prevalence of web scraping cuts across industries. For example, price comparison services rely on it to aggregate retail data, financial institutions use it to monitor market sentiment, and research organizations deploy it to analyze large bodies of text or content. In each of these scenarios, the appeal is the same: vast quantities of data are readily available, and the automated collection of data available online can be far more efficient than other alternatives.

However, with the rapid expansion of AI and, in particular, the development of large language models (“LLM”), web scraping has taken on an entirely new dimension of importance. Training a sophisticated AI model requires enormous volumes of text data so that it can learn grammatical structures, contextual relationships, factual associations, linguistic patterns, and other associations. Web-scraped datasets have therefore become a primary source of training data used by AI developers.

On the other hand, scraping data carries potential compliance risks that are not necessarily confined to any single country or jurisdiction. Privacy regulators around the world are seeing an increasing number of incidents involving data scraping, particularly from social media and other websites that host and disseminate data online. For instance, in August 2023, the OPC joined eleven other international data protection authorities in issuing a joint statement¹ to major social media organizations, setting out key privacy principles and signaling coordinated global regulatory attention to web scraping.

Against this backdrop, a growing body of guidance, enforcement decisions, and commentary from privacy authorities around the world addresses privacy concerns related to web scraping. More recently, the CAI had the opportunity to assess OpenAI’s web scraping practices for compliance with Quebec’s Act respecting the protection of personal information in the private sector (the “Quebec Privacy Act”), governing the collection, use, and communication of personal information.

2. Background on the Joint Investigation Into OpenAI

In May 2023, following a complaint filed in Canada, the OPC, the CAI, the Office of the Information and Privacy Commissioner of Alberta, and the Office of the Information and Privacy Commissioner for British Columbia launched a joint investigation into OpenAI, the creator of ChatGPT.

The investigation examined a wide range of OpenAI’s data practices in connection with the development and deployment of OpenAI’s GPT-3.5 and GPT-4 AI models, the underlying models powering ChatGPT. Among the issues considered were whether OpenAI had a legitimate basis for collecting the personal information of Canadian individuals, including information scraped from the web, to train its models, whether such individuals were properly notified about OpenAI’s data collection and processing activities, and whether they provided meaningful consent to such activities.

OpenAI cooperated with the investigation and provided extensive representations. Ultimately, the joint investigation findings were published in May 2026,² representing one of the most substantive Canadian regulatory assessments to date.

3. CAI Findings: Legal Considerations Regarding Web Scraping

3.1. Web Scraping Will Trigger Privacy Obligations

In the context of the investigation, OpenAI argued, in part, that any collection of personal information in the course of its training activities was incidental to the primary goal of building an LLM. The CAI, together with the other offices, rejected this characterization and found that the collection of personal information through web scraping, even if incidental to other core activities, was subject to the full application of Quebec’s privacy framework.

In fact, OpenAI’s web scraping activities led it to collect vast amounts of personal information for the purpose of training its AI models. In the CAI’s view, the fact that personal information was captured indiscriminately as part of a broader dataset did not reduce the quantity or sensitivity of what was already collected, nor did it diminish OpenAI’s obligation to comply with the Quebec Privacy Act. As the CAI put it, whether or not the collection of personal information was incidental has no bearing on OpenAI’s obligation to comply with applicable privacy requirements, as personal information was effectively collected.

As a result, the Canadian privacy regulators and the CAI clearly established that web scraping activities will trigger the application of privacy laws when personal information is collected, no matter the purpose (or lack thereof) for which it was collected.

3.2. Data Online Is Not “Public” Information

Another aspect considered by the CAI was whether information accessible online was “public” information within the meaning of the Quebec Privacy Act and therefore not subject to its requirements. More specifically, the Quebec Privacy Act provides that it “does not apply to personal information which by law is public.”

The CAI held that this exception in the statute did not apply to information that was simply accessible on the web without restriction. The mere fact that a person posted information online at some point in time, whether on a personal blog, a forum, or a review platform, did not render that information public for the purposes of the Quebec Privacy Act, nor did it authorize third parties to collect and use it indiscriminately for any purpose.

In addition, organizations should not infer from the individual’s act of posting information online or making publications on the web without restriction that they were adequately informed of the use that could be made of such information and how it could be further communicated to third parties, or that they have provided valid consent for such use or communication. The context in which the information was originally posted or made available online must be considered to determine whether the individual was given proper notice that their personal information could be harvested by third parties and for what purposes.

The CAI’s position on this point is not a particularly novel interpretation. The CAI had already articulated this position in the past, particularly in an earlier investigation into Clearview AI³, a company that had built a facial recognition database by scraping billions of publicly accessible facial images from the web. In that investigation, the CAI held that there were no Quebec statutes under which personal information could be deemed public solely because it was posted online or made available on social media. The CAI took the position that exceptions to consent requirements under privacy legislation must be interpreted narrowly and cannot be stretched to legitimize large-scale commercial exploitation of personal information simply because that data was technically accessible at some point.

3.3. Collection of Information Directly From an Individual and From Third Parties

The CAI drew an important distinction between a collection that occurs directly from the person concerned and a collection that occurs from a third party. The CAI stated that, in the context of data scraping on the Internet, collection can be considered to have been carried out directly from the individual concerned only in rare situations. For example, this may be the case when an organization scrapes content from a web page that belongs to the individual concerned, or when the platform’s terms of service specifically provide for this. In all other situations, the rules governing the collection of personal information from third parties will apply, which is particularly relevant in the context of social media and networking platforms.

For social media networks, information published is generally subject to licences granted by users to the platform authorizing the publication of their information on the platform. As a result, when an organization scrapes data from a social media platform, it is effectively collecting licensed information relating to an individual from a third party (and not directly from the individual concerned). This will therefore trigger the consent rules that apply to the collection of personal information from third parties rather than the rules that apply to the collection of personal information directly from the individual concerned.

This distinction matters because, under the third-party collection framework, the organization collecting the information must be able to demonstrate that consent was provided at the time the information was originally published on the website or platform. In other words, to comply with the Quebec Privacy Act, an organization collecting scraped information must be able to demonstrate that at the moment of the original online publication by the individuals concerned, they were duly notified and consented to the communication of their personal information to third parties (i.e., the scraping organization) for their intended purposes (i.e., AI model training, or other purposes).

3.4. Lawful Web Scraping May Be Possible

Notwithstanding the obligations discussed above, the CAI did not take the position that web scraping of personal information was inherently unlawful. It acknowledged that lawful scraping was possible, but that lawfulness could only be determined through a careful, context-specific analysis. Accordingly, it is entirely possible that users, based on the information provided at the time of collection, the terms of service applicable to the website or platform, and the privacy policies in effect at the time of initial publication, may have been duly informed and consented to their personal information being made accessible on the web, communicated to third parties, and harvested for the purpose of training AI models.

3.5. No Presumption of Consent

OpenAI argued that where a third party published personal information online, it was reasonable to assume that the publication had been authorized by the individual concerned, thereby allowing consent to be inferred. The CAI rejected this approach. Instead, it held that OpenAI should have taken into account the general context of the publication to ensure the individual concerned had consented to the communication to a third party, and that, in case of doubt, OpenAI should have refrained from collecting the information (instead of presuming it had the right to do so).

Consider, for example, a situation in which a family member posts personal information about another individual, including their name, photograph, and location, on a publicly accessible social media page. The individual whose information appears online may never have consented to that posting, let alone to the subsequent scraping of their information. Under the CAI’s approach, an organization should not assume that, simply because information is publicly accessible, the collection is appropriate. It must consider the context of the initial collection and assess whether proper notice was given to the individual and valid consent was provided. If in doubt, the position should be to refrain from collecting.

Furthermore, the CAI indicated that OpenAI should have assessed two further points: whether the publication of such information constituted a communication without consent in the context of a secondary use of personal information initially collected by a third party for another purpose, and whether the information concerned a person under the age of fourteen (14), in which case the consent of the person exercising parental authority or of the tutor may be required.

Concerns about the mass scraping of personal information available online have been highlighted by privacy authorities around the world. Once personal information has been scraped, the individual whose information was collected effectively loses control over it. Data scrapers may aggregate and combine scraped information from one site with personal information obtained from other sources and use it for various purposes. Moreover, even if individuals withdraw their consent from the platform where they initially published their information or request the deletion of their information, the exercise of their rights may not reach data scrapers, who will likely continue to use and share the information they had already scraped, limiting individuals’ control over their online presence and reputation.⁴

3.6. Personal Information Can Have Different Lifecycles

The Quebec Privacy Act provides that an organization may not communicate personal information to third parties without the consent of the individual concerned. OpenAI argued that such a prohibition did not apply to its collection of personal information from the web or third-party data providers, on the basis that the collection and communication of personal information were distinct concepts, and that receiving a disclosure of personal information was not technically the same as collecting it.

The CAI did not agree with the above position and held that web scraping attracted the rules governing the communication of personal information to a third party in one party’s hands and the collection of personal information in another party’s hands. For example, when a website operator makes personal information accessible on the web, that act of publication may in itself constitute the communication of that information to third parties (such as to scraping organizations). At the same time, when the scraping organization scrapes the information, it is effectively collecting personal information, and so a new personal information life cycle begins within that organization. Viewed this way, the same piece of personal information can have several life cycles across different organizations, attracting specific rules under the Quebec Privacy Act.

In addition, when the scraping organization develops an AI model that may reveal the same piece of personal information to users in its outputs, this can constitute a new communication of personal information to third parties (this time to the users of the AI system), triggering once more the application of the rules relating to communication to third parties. In other words, the chain of communication to third parties does not end at the point of scraping and can extend to every subsequent communication of that information.

3.7. Documenting Data Collection and Processing Activities

In its findings, the CAI found that OpenAI did not establish the online sources from which it scraped or obtained scraped information. This lack of source-level transparency made it impossible for OpenAI to demonstrate for each source that the individuals whose information was collected had been properly informed or had provided valid consent. As a result, in the absence of evidence that appropriate compliance checks were carried out, OpenAI did not comply with the Quebec Privacy Act.

A practical lesson from the OpenAI findings is that an organization must not merely assert that it complies with privacy laws. It must sufficiently document the context in which the personal information it holds was collected, is used, and will be communicated and, depending on the situation, show that the individuals concerned were duly informed and provided valid consent.

3.8. Contractual Measures With Third Parties

Many organizations do not scrape the web themselves but instead obtain pre-assembled datasets from third-party data brokers, content aggregators, or commercial data partners that have collected the information through scraping. To complement its own direct collection activities, OpenAI collected personal information by entering into data-sharing agreements with data providers. In this context, OpenAI represented that it had contractually ensured that its data providers had provided appropriate notices to the individuals concerned and, where applicable, obtained valid consent.

Regarding the contractual measures taken by OpenAI, the CAI took a measured position. It found that, taking into account the refinements made to OpenAI’s agreements over time, the contractual measures in place at the time of the investigation and the consent verification mechanisms were reasonable in the circumstances. As a result, subject to evidence to the contrary, it found that OpenAI’s contractual practices with its data providers complied with Quebec’s statutory requirements. Nonetheless, the CAI encouraged OpenAI to implement additional measures to strengthen its privacy governance, including conducting privacy audits, to ensure ongoing compliance.

Outside the context of this OpenAI investigation, privacy authorities have gone further than the CAI in specifying what adequate contractual governance should look like in practice. In October 2024, sixteen privacy authorities, including the OPC, issued a concluding joint statement on data scraping⁵ following direct engagement with major social media companies. That statement made clear that contractual terms cannot, in and of themselves, render data scraping lawful. Organizations must ensure that they have a lawful basis for granting access or permitting the collection of personal information, that they are transparent about the nature of scraping allowed, and that they obtain consent where required by law.

4. Application of Lessons Learned

Drawing on the CAI’s formal recommendations in the OpenAI findings, and in light of the considerations outlined above, here are a few practical guidelines organizations may wish to consider. It must be emphasized that the lawfulness of web scraping activities under the Quebec Privacy Act must be assessed on a case-by-case basis. As a result, the guidelines below should be considered as general information. Each web scraping activity must be evaluated in its specific context, taking into account all relevant privacy factors.

4.1. Implement Source-Level Checks Before Collection

Before collecting personal information from a publicly accessible source, whether directly or through a third-party dataset, organizations should assess the source to determine the context in which the information was originally published, whether individuals were clearly informed that their personal information could be made accessible online and subsequently harvested by third parties, whether the terms of service of the relevant platform allowed third-party data collection, and whether the information could relate to individuals under fourteen (14) years of age, which would require parental or tutor consent, or involve sensitive information. These checks should be documented for compliance purposes.
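By way of illustration only, the source-level checks described above could be kept as a structured record, one per source. The sketch below is a minimal example under stated assumptions: the `SourceAssessment` class, its field names, and the example forum URL are all hypothetical and are not drawn from the CAI's findings or from any statutory form. An unanswered check is treated as an open issue rather than a pass.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import List, Optional


@dataclass
class SourceAssessment:
    """Hypothetical record of a pre-collection check for one online source."""
    source: str
    publication_context: str
    # None means "not yet established"; it is never treated as a pass.
    individuals_notified: Optional[bool] = None
    terms_allow_third_party_collection: Optional[bool] = None
    may_include_under_14: Optional[bool] = None
    may_include_sensitive_info: Optional[bool] = None
    assessed_on: str = ""

    def open_issues(self) -> List[str]:
        """List every check that is unanswered or raises a concern."""
        issues = []
        if self.individuals_notified is not True:
            issues.append("notice to individuals not established")
        if self.terms_allow_third_party_collection is not True:
            issues.append("terms of service do not clearly allow third-party collection")
        if self.may_include_under_14 is not False:
            issues.append("may include individuals under 14")
        if self.may_include_sensitive_info is not False:
            issues.append("may include sensitive information")
        return issues

    def to_record(self) -> dict:
        """Return a plain dict that can be kept on file for compliance purposes."""
        record = asdict(self)
        record["open_issues"] = self.open_issues()
        return record


# Example: every check is answered except whether individuals were notified.
assessment = SourceAssessment(
    source="https://forum.example/threads",  # hypothetical source
    publication_context="public hobby forum",
    individuals_notified=None,
    terms_allow_third_party_collection=True,
    may_include_under_14=False,
    may_include_sensitive_info=False,
    assessed_on=date(2026, 6, 1).isoformat(),
)
print(assessment.open_issues())
```

A record of this kind does not decide lawfulness on its own. It simply makes the outcome of each check, and the date it was performed, available for later review.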

4.2. Do Not Presume Consent

Organizations should implement internal guidelines that prohibit the assumption that information available online can be freely collected and used by various business units. Appropriate internal governance processes should be implemented to ensure that each data source is evaluated individually for compliance purposes, particularly to ensure that the individuals concerned consented to the communication and use of their personal information by third parties. Where doubt exists as to whether the individuals concerned provided valid consent, the default position should be to refrain from scraping or collecting the information in question.
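The "refrain when in doubt" default can be expressed as a simple gate in an internal tool. The following is a minimal sketch, assuming a hypothetical three-state `ConsentStatus` and hypothetical source names. It is one possible way to encode the rule, not a prescribed method.

```python
from enum import Enum
from typing import Dict, List


class ConsentStatus(Enum):
    CONFIRMED = "confirmed"  # evidence of notice and valid consent is on file
    ABSENT = "absent"        # evidence shows consent was not given
    UNCERTAIN = "uncertain"  # context is unclear or was never checked


def may_collect(status: ConsentStatus) -> bool:
    """Allow collection only when consent is positively confirmed."""
    return status is ConsentStatus.CONFIRMED


def cleared_sources(sources: Dict[str, ConsentStatus]) -> List[str]:
    """Return the names of sources cleared for collection, in input order."""
    return [name for name, status in sources.items() if may_collect(status)]


# Hypothetical sources, each evaluated individually.
sources = {
    "company-blog": ConsentStatus.CONFIRMED,
    "review-site": ConsentStatus.UNCERTAIN,
    "family-photo-page": ConsentStatus.ABSENT,
}
print(cleared_sources(sources))
```

The design choice here is that uncertainty is handled exactly like a refusal: only a positively confirmed source passes the gate.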

4.3. Verify the Chain of Notification and Consent

Where personal information is obtained from a third-party data provider that obtained it through scraping, organizations should ensure that their data sharing agreements require their data provider to warrant that appropriate notice was given and, where required, consent was obtained at the point of initial collection. Agreements should also require partners to specify the sources of the data they provide. Whenever necessary, organizations should also consider exercising audit rights to periodically verify the data provider’s warranties and representations in this regard.

4.4. Assess Downstream Use of Personal Information

Organizations collecting scraped personal information and then using or communicating it may trigger multiple provisions of the Quebec Privacy Act, including the consent rules applicable to secondary uses and the rules governing the communication of personal information to third parties. An analysis that addresses only the initial collection of personal information, without addressing downstream use, communication, and processing, may leave compliance blind spots. Organizations should ensure that appropriate notice has been provided and valid consent obtained for all downstream purposes for which personal information may be used or communicated.

4.5. Exercise Particular Caution With Minors

The CAI specifically highlighted the risks associated with the collection of personal information concerning individuals under fourteen (14), where parental or tutor consent is required. Organizations should implement procedures to identify and exclude personal information concerning minors from web-scraped datasets or ensure they take specific measures to comply with the requirements of the Quebec Privacy Act. Organizations should consider the personal information of minors to be sensitive and ensure they implement more rigorous privacy governance practices.

5. Takeaways

Web scraping is a prevalent data collection technique in the modern digital economy, and its use is only expected to grow as the demand for training data for AI systems continues to accelerate. Against this backdrop, it is essential that organizations resist the temptation to treat data available online as an “open buffet” of limitless and freely available sources of data that can be collected and repurposed without restriction.

In Quebec, the Quebec Privacy Act applies wherever personal information is collected from online sources, regardless of how that information became available online or whether the collection was direct or through a third-party intermediary. No matter how personal information from online sources is collected, organizations must (i) demonstrate, at the source level, that individuals were properly informed and that consent was validly obtained, (ii) assess the full chain of custody of the information they collect, (iii) exercise particular vigilance when their scraping activities may capture information about children or sensitive information, and (iv) adequately document their privacy assessment and compliance efforts, including with third parties.

Ultimately, organizations should approach web scraping as a data processing activity requiring the same level of rigour, governance, and accountability as any other form of data processing activity they may engage in. Privacy impact assessments, internal legal review, and, where appropriate, consultation with privacy counsel may be advisable before embarking on any web scraping operation involving personal information of Quebec individuals.

We hope this bulletin has provided you with information to better manage and steer your compliance efforts.

Footnotes

1 Joint Statement on data scraping and the protection of privacy, by the Office of the Privacy Commissioner of Canada et al., August 24, 2023.

2 PIPEDA Findings #2026-002: Joint Investigation of OpenAI OpCo, LLC, by the Office of the Privacy Commissioner of Canada et al., May 6, 2026.

3 PIPEDA Findings #2021-001: Joint Investigation of Clearview AI, Inc., by the Office of the Privacy Commissioner of Canada et al., February 2, 2021.

4 Joint Statement on data scraping and the protection of privacy, by the Office of the Privacy Commissioner of Canada et al., August 24, 2023.

5 Concluding Joint Statement on Data Scraping and the Protection of Privacy, by the Office of the Privacy Commissioner of Canada et al., October 2024.

The foregoing provides only an overview and does not constitute legal advice. Readers are cautioned against making any decisions based on this material alone. Rather, specific legal advice should be obtained.
