Introduction to Web Scraping
In an ideal world, web scraping wouldn't be necessary and each website would provide an
API to share data in a structured format. Indeed, some websites do provide APIs, but they
typically restrict the data that is available and how frequently it can be accessed.
Additionally, a website developer might change, remove, or restrict the backend API. In
short, we cannot rely on APIs to access the online data we may want. Therefore, we need to
learn about web scraping techniques.
Is web scraping legal?
The legality of web scraping, and what is permissible when scraping, is still being
established despite numerous rulings over the past two decades. If the scraped data is
for personal, private use and falls within fair use under copyright law, there is usually no
problem. However, if the data is going to be republished, if the scraping is aggressive
enough to take down the site, or if the content is copyrighted and the scraper violates the
terms of service, then there are several legal precedents to note.
In Feist Publications, Inc. v. Rural Telephone Service Co., the United States Supreme Court
decided that scraping and republishing facts, such as telephone listings, is allowed. A similar
case in Australia, Telstra Corporation Limited v. Phone Directories Company Pty Ltd,
demonstrated that only data with an identifiable author can be copyrighted. Another
scraped content case in the United States, evaluating the reuse of Associated Press stories
for an aggregated news product, was ruled a violation of copyright in Associated Press v.
Meltwater. A European Union case in Denmark, ofir.dk vs home.dk, concluded that regular
crawling and deep linking are permissible.
There have also been several cases in which companies have charged the defendant with
aggressive scraping and attempted to stop the scraping via a legal order. The most recent
case, QVC v. Resultly, ruled that, unless the scraping resulted in private property damage, it
could not be considered intentional harm, despite the crawler activity leading to some site
stability issues.
These cases suggest that, when the scraped data constitutes public facts (such as business
locations and telephone listings), it can be republished following fair use rules. However, if
the data is original (such as opinions and reviews or private user data), it most likely cannot
be republished for copyright reasons. In any case, when you are scraping data from a
website, remember you are their guest and need to behave politely; otherwise, they may
ban your IP address or proceed with legal action. This means you should make download
requests at a reasonable rate and define a user agent that identifies your crawler. You should
also review the Terms of Service of the site and ensure the data you are taking is not
considered private or copyrighted.
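The two habits described above, a reasonable request rate and an identifying user agent, can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation: the names Throttle, build_request, and polite_download, the default user agent string, and the contact URL are all hypothetical, and a suitable delay depends on the site being crawled.

```python
import time
import urllib.request
from urllib.parse import urlparse

# Hypothetical identifier; replace with your own crawler name and contact page.
DEFAULT_USER_AGENT = "example-crawler/0.1 (+http://example.com/contact)"


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay      # minimum seconds between hits on one domain
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.last_seen = {}     # domain -> time of the last request

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                # This domain was accessed too recently, so pause first.
                self.sleep(remaining)
        self.last_seen[domain] = self.clock()


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a request that identifies the crawler via User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_download(url, throttle, user_agent=DEFAULT_USER_AGENT):
    """Wait if needed, then download the page body as bytes."""
    throttle.wait(url)
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

A crawler would create one Throttle (for example, Throttle(delay=5)) and pass it to every polite_download call, so that repeated requests to one domain are spaced out while requests to different domains are not delayed by each other.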