To determine whether a website may be scraped, a web scraper must understand when their actions lead to increased legal risk. Website owners may use several legal theories to attempt to block unauthorized scraping of their website's content. Pre-emptively addressing these theories will reduce the legal risk associated with web scraping. These theories include:
- Copyright Infringement;
- Breach of Contract;
- Tort Theories (Hot News Misappropriation and Trespass to Chattels); and
- The Computer Fraud and Abuse Act ("CFAA")
Following is an analysis of the four theories most commonly raised in U.S.-based web scraping disputes, and the key legal compliance considerations for each.
(1) Copyright: A great deal of website content is protected by copyright, which means it generally cannot be scraped (copied) without the owner's permission unless an exception such as Fair Use applies. Determining whether particular content is protected can require a complicated legal analysis, but protection is likely to apply to most textual materials (e.g., news reports), software code, graphs/charts, photos, illustrations, and audio/visual recordings. Copyright protection generally does not apply to raw factual data or information1, although it can protect compilations of data or information that are selected, coordinated, or arranged in a sufficiently creative manner (e.g., market indexes, business directories).2
Accordingly, if you are considering scraping website content that may be copyright protected, you should first determine whether you have a license or other form of permission to access, copy, and use such content, and ensure that your use complies with the applicable grant of rights (see "Contract" section below).
If you do not have express, written permission from the copyright holder, you may nonetheless be able to scrape and use the content if your use is covered by the Fair Use doctrine, which requires weighing the following four factors together:
- The purpose and character of the use: Use in academic analysis, parody, criticism, or works that benefit the public are more likely to be afforded Fair Use protection. Works that transform and add new meaning to the content are more likely to be deemed Fair Use than non-transformative works. Non-commercial uses are favored as Fair Use, although commercial uses are not excluded.
- The nature of the copyrighted work: Fair Use is more likely to be found for factual works than for fictional works. Further, Fair Use is found more often when the work has already been published; use of an unpublished work weighs against Fair Use.
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole: Consider how important the scraped content is to the overall work it comes from. Also, consider how much of the work's content is being used. Know that a use can be small quantitatively but important qualitatively. Using a brief snippet of unimportant text is more likely to be protected under Fair Use than using a report's key chart.
- The effect of the use on the potential market for the copyrighted work or the work's value: Courts seek to protect the money a copyright holder would earn from their copyright. Any present or future reduction in earnings attributable to the use decreases the likelihood that a court will find Fair Use.
As demonstrated by the foregoing, the Fair Use analysis is complicated and nuanced; it should always be conducted with the assistance of counsel.
(2) Contract: In determining whether you are bound by the restrictions in a given website's terms of use ("TOU"), a key factor is whether you ever indicated your assent in a manner sufficient to form a contract. As with all contracts, the parties to a website's TOU must indicate their mutual assent to the applicable terms to form a legally binding agreement. Courts generally find adequate indicia of assent where a website requires the visitor to click a check-box agreeing to the website's TOU before accessing the site's content. Using a bot to click the check-box is unlikely to alter this determination. Conversely, some websites imply assent to the TOU through the visitor's mere use of the website (no check-box is clicked). In these instances, courts have considered: (1) Did the user know the terms of the contract? (2) Absent actual knowledge, would a "reasonably prudent" visitor be on inquiry notice of the terms given the overall website design?3 Factors courts have considered in conducting this analysis include how conspicuous the link to the TOU was, whether the website buried the hyperlink or prominently displayed it, and whether the website explicitly directed the user to the TOU.4
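The assent questions above can be recorded as a simple triage step before a scraping project starts. The following Python sketch is illustrative only: the class name `TouObservation`, its fields, and the returned labels are hypothetical, and the output is a prompt for review by counsel, not a legal conclusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TouObservation:
    """Hypothetical record of how a site presents its terms of use (TOU)."""
    clicked_checkbox: bool     # a visitor (or bot) clicked an "I agree" check-box
    actual_knowledge: bool     # the scraping party actually knew the terms
    link_conspicuous: bool     # TOU link is prominently displayed, not buried
    explicitly_directed: bool  # the site explicitly directs visitors to the TOU


def assent_triage(obs: TouObservation) -> str:
    """Return a rough label that mirrors the questions courts have considered."""
    if obs.clicked_checkbox:
        # Check-box agreements are generally treated as adequate assent.
        return "assent likely: check-box clicked"
    if obs.actual_knowledge:
        return "assent likely: actual knowledge of terms"
    if obs.link_conspicuous or obs.explicitly_directed:
        # Possible inquiry notice; the outcome depends on overall site design.
        return "uncertain: possible inquiry notice"
    return "assent less likely: no click, knowledge, or conspicuous notice"
```

A label of "uncertain" or "assent less likely" does not mean the terms can be ignored; it only marks the cases where the website's design deserves closer review.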
(3) Tort Theories: Website owners may also seek to protect their content by asserting common law tort claims such as unfair competition or trespass.
For example, in some states (New York and Illinois included) a branch of the unfair competition tort known as "hot news misappropriation" has been recognized to protect the competitive commercial value of time-sensitive information, such as fresh news reports.5 Courts applying this theory seek to prevent windfalls, in which one party benefits from work done by another. The elements of hot news misappropriation under New York law have been described as: (1) the plaintiff generates/gathers information at a cost; (2) the information is time-sensitive; (3) the defendant's use of the information constitutes free-riding on the plaintiff's efforts; (4) the defendant is in direct competition with a product or service offered by the plaintiff; and (5) the ability of other parties to free-ride on the efforts of the plaintiff would so reduce the incentive to produce the product or service that its existence or quality would be substantially threatened.6 Thus, scraping hot news for competitive use requires consideration of these elements.
In addition, the tort known as "trespass to chattels" has been applied where a website operator can show tangible damage connected to a user's unauthorized access to content. The elements of trespass to chattels are: (1) the user intentionally and without authorization interfered with the website operator's possessory interest in the site; and (2) the unauthorized use proximately resulted in damage to the operator. In eBay v. Bidder's Edge, the court found the first element satisfied where the operator had asked the scraper to cease its web scraping activities, the scraper had circumvented the operator's technological attempts to block access to the site (e.g., blacklisting IP addresses), and the operator had granted only conditional access to its site (implying that access could be revoked).7 The second element was found to be satisfied because the scraper's activities consumed "at least a portion of [the operator's] bandwidth and server capacity."8
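Because the damage element in eBay turned on the consumption of bandwidth and server capacity, one practical precaution is to cap how often a scraper contacts a site. The Python sketch below shows a minimal throttle. The class name `Throttle`, the `min_interval` parameter, and the injectable `clock` and `sleep` functions are assumptions made for illustration, and throttling alone neither establishes authorization nor removes legal risk.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive requests to one site."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

A caller would invoke `wait()` immediately before each request. An appropriate interval depends on the site and is not something this sketch can determine.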
(4) The Computer Fraud and Abuse Act: The primary focus of the CFAA is the prevention of unauthorized access to computers via circumvention of technological security measures. Accordingly, careful consideration of the CFAA's complexities is strongly recommended if you are considering scraping techniques involving technical security circumventions, such as password sharing or anti-bot avoidance tools.
The CFAA prohibits an individual from intentionally obtaining information from a protected computer by accessing the computer: (1) without authorization; or (2) by exceeding authorization.9 While there are conflicting legal opinions regarding the application of the CFAA, a few points have emerged:
- The CFAA regulates who may access content, not how access occurs. With authorization to access a computer's content, a user may access any or all of that content without running afoul of the CFAA. Use of a bot to download validly accessed content does not violate the CFAA: the bot is merely speeding up authorized access.
- Authorization may be conditional: limits may exist both in terms of which content is accessed and how the content is accessed. Such limits should be observed.
- Without further evidence of lack of authorization, courts do not treat a mere violation of a website's TOU as evidence of a lack of authorization under the CFAA.10
- Possession of sign-in credentials is an indicator of "authorization," but not the determining factor.11
- Web content that is not protected by passwords or technical barriers does not require "authorization" and therefore is not likely to run afoul of the CFAA.12
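The points above suggest a short pre-scrape checklist. The Python sketch below collects a few facts about a target and returns the CFAA-related concerns they raise. The field names and warning strings are hypothetical, the list is not exhaustive, and the result is meant to prompt a conversation with counsel rather than to decide whether access is authorized.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessFacts:
    """Hypothetical facts gathered about a scraping target."""
    behind_login_or_barrier: bool   # content sits behind a password or technical barrier
    has_own_credentials: bool       # scraper holds validly issued sign-in credentials
    uses_shared_credentials: bool   # e.g., password sharing
    evades_anti_bot_measures: bool  # e.g., anti-bot avoidance tools
    conditions_on_access: bool      # operator limits what or how content is accessed


def cfaa_flags(facts: AccessFacts) -> List[str]:
    """List the concerns raised by the recorded facts, in a fixed order."""
    flags: List[str] = []
    if facts.behind_login_or_barrier and not facts.has_own_credentials:
        flags.append("protected content accessed without credentials")
    if facts.uses_shared_credentials:
        flags.append("password sharing")
    if facts.evades_anti_bot_measures:
        flags.append("circumvention of anti-bot measures")
    if facts.conditions_on_access:
        flags.append("conditional authorization: confirm limits are observed")
    return flags
```

An empty list does not establish that access is authorized. As noted above, even possession of sign-in credentials is only an indicator of authorization, not the determining factor.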
In conclusion, there are several potential legal risks associated with scraping website content. Assessing those risks requires analysis of various legal theories, as well as consideration of numerous factual variables such as the nature of the scraped content, the technical methods used to scrape, and the applicability of contractual terms. Because the analysis of the applicable legal theories and factual variables can be complicated and nuanced, the assistance of counsel is recommended.
1. Feist Publ'ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 345-46 (1991) (alphabetical arrangement of telephone directory not sufficiently creative to merit copyright compilation protection)
2. Key Publ'ns, Inc. v. Chinatown Today Publ'g Enters., Inc., 945 F.2d 509, 513-14 (2d Cir. 1991) (business directory with categorical arrangement is copyright protected)
3. Long v. Provide Commerce, Inc., 245 Cal. App. 4th 855, 858, 200 Cal. Rptr. 3d 117, 120 (2016)
4. In re Zappos.com, Inc., Customer Data Sec. Breach Litig., 893 F. Supp. 2d 1058 (D. Nev. 2012)
5. National Basketball Association v. Motorola, Inc., 105 F.3d 841 (2d Cir. 1997)
6. National Basketball Association v. Motorola, Inc., 105 F.3d 841 (2d Cir. 1997)
7. eBay, Inc. v. Bidder's Edge, Inc., 100 F. Supp. 2d 1058, 1069-70 (N.D. Cal. 2000); see also Ticketmaster Corp. v. Tickets.Com, Inc., No. CV997654HLHVBKX, 2003 WL 21406289, at *3 (C.D. Cal. Mar. 7, 2003)
8. eBay, Inc. v. Bidder's Edge, Inc., 100 F. Supp. 2d 1058, 1069-70 (N.D. Cal. 2000)
9. 18 U.S.C. § 1030(a)
10. United States v. Nosal (Nosal II), 844 F.3d 1024 (9th Cir. 2016)
11. hiQ Labs, Inc. v. LinkedIn Corp., 273 F. Supp. 3d 1099 (N.D. Cal. 2017)
12. hiQ Labs, Inc. v. LinkedIn Corp., 273 F. Supp. 3d 1099 (N.D. Cal. 2017)
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.