Statement on Web Scraping, October 2024
Informed by engagement with industry on the initial Joint Statement on Data Scraping and the
Protection of Privacy (August 2023)
October 2024
Key takeaways
Initial Statement
This Concluding Statement builds on the Joint statement on data scraping and the protection
of privacy (the Initial Statement), published August 24, 2023, which highlighted the following
key messages:
Based on engagement with social media companies (SMCs) and other industry stakeholders
that followed the issuance of the Initial Statement, the co-signatories wish to highlight the
following additional key takeaways:
• To effectively protect against unlawful scraping, organizations should deploy a
combination of safeguarding measures, and those measures should be regularly
reviewed and updated to keep pace with advances in scraping techniques and
technologies.
• While artificial intelligence (AI) is used by some sophisticated data scrapers to evade
detection, it can also represent part of the solution, serving to enhance protections
against unlawful scraping.
• The obligation to protect against unlawful scraping applies to both large corporations
and Small and Medium Enterprises (SMEs). There are lower-cost measures that SMEs
can implement, with assistance from service providers, to meet this obligation.
• Where SMCs and other organizations contractually authorize scraping of personal
data from their platforms, those contractual terms cannot, in and of themselves,
render such scraping lawful; however, they can be an important safeguard.
  o Organizations that permit scraping of personal data for any purpose, including
  commercial and socially beneficial purposes, must ensure, without limitation,
  that they have a lawful basis for doing so, are transparent about the scraping
  they allow, and obtain consent where required by law.
  o Organizations should also implement adequate measures, including
  contractual terms and associated monitoring and enforcement, to ensure that
  the contractually authorized use of scraped personal data is compliant with
  applicable data protection and privacy laws.
• When an organization grants lawful permission for third parties to collect publicly
accessible personal data from its platform, providing such access via an Application
Programming Interface (API) [1] can allow the organization greater control over the
data, and facilitate the detection and mitigation of unauthorized scraping.
• SMCs and other organizations that use scraped data sets and/or use data from their
own platforms to train AI, such as Large Language Models, must comply with data
protection and privacy laws as well as any AI-specific laws where those exist. Where
regulators have made available guidelines and principles on the development and
implementation of AI models, we expect organizations to follow that guidance.
Introduction
1. The initial Joint Statement on data scraping and the protection of privacy (the Initial
Statement), published in August 2023, set out expectations regarding what
organizations should do to ensure that individuals are protected from the risks resulting
from unlawful scraping. The present Concluding Statement was developed to reinforce
the requirements set out in the Initial Statement, share best practices and lessons
learned through engagements with SMCs and industry stakeholders following the
publication of that statement, and set out further expectations for SMCs and other
organizations that host publicly accessible personal information.
2. Both statements address data scraping in the form of automated extraction of personal
data from the web. These statements do not address indexing by search engines, nor do
they address the scraping of non-personal information.
3. While the Initial Statement was published by 12 members of the International
Enforcement Working Group (IEWG) and endorsed by two additional members
following its publication, the Initial Statement and this Concluding Statement are now
endorsed by a total of 16 co-signatories [2].
[1] Application Programming Interface (API): a way of communicating with a particular computer program or
internet service.
[2] Office of the Australian Information Commissioner (OAIC); Office of the Privacy Commissioner of Canada
(OPC-Canada); United Kingdom Information Commissioner’s Office (ICO); Hong Kong Office of the Privacy Commissioner
for Personal Data (PCPD); Norway Data Protection Authority (Datatilsynet); Swiss Federal Data Protection and
Information Commissioner (FDPIC); Colombian Superintendencia de Industria y Comercio (SIC); Office of the Privacy
Commissioner of New Zealand (OPC-New Zealand); Jersey Office of the Information Commissioner (JOIC);
Moroccan Commission Nationale de Contrôle de la Protection des Données à Caractère Personnel (CNDP);
Argentine Agencia de Acceso a la Información Pública (AAIP); Mexican Instituto Nacional de Transparencia, Acceso
a la Información y Protección de Datos Personales (INAI); Guernsey Office of the Data Protection Authority
(ODPA); Spain Agencia Española de Protección de Datos (AEPD); Monaco Commission de Contrôle des Informations
Nominatives (CCIN); Israel Privacy Protection Authority (PPA).
[3] The Mitigating Unauthorized Scraping Alliance describes itself as an organization that unites industry and
regulators to combat unauthorized data scraping, aiming to promote best practices, raise public awareness, and
provide valuable insights to policymakers.
Lessons learned and co-signatories' expectations
9. As with the Initial Statement, many of the recommendations below represent statutory
requirements in some or all jurisdictions.
10. A fundamental takeaway from the Initial Statement is that publicly accessible personal
data is still subject to data protection and privacy laws in most jurisdictions. SMCs and
operators of websites that host publicly accessible personal data have obligations, under
data protection and privacy laws, to protect personal information on their platforms
from unlawful scraping.
[4] A bot is an automated software application that performs repetitive tasks over a network. It can follow specific
instructions to imitate human behavior.
[5] A CAPTCHA is a program or system intended to distinguish human from machine input.
14. Through our engagements, we also learned of further measures, beyond those detailed
in the Initial Statement, that organizations employ to protect against data scraping, such
as the implementation of platform design elements that make it harder to scrape data
using automation (e.g., random account URLs, random interface design elements, and
tools to detect and block malicious internet traffic).
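As an illustration of the random account URLs mentioned in paragraph 14, the following Python
sketch generates an unguessable identifier for each profile page, so that a scraper cannot
enumerate accounts by counting through sequential IDs. It is a minimal sketch under stated
assumptions: the function names and the example.com base URL are hypothetical and are not
drawn from the statement or from any SMC's actual practice.

```python
import secrets


def new_profile_slug(num_bytes: int = 16) -> str:
    """Return a random, URL-safe identifier for a profile page (hypothetical helper)."""
    # secrets draws from the operating system's cryptographic random source,
    # so identifiers are not predictable from one another.
    return secrets.token_urlsafe(num_bytes)


def profile_url(slug: str, base: str = "https://example.com/u/") -> str:
    """Build the public profile URL; the base URL is a placeholder."""
    return base + slug
```

Random identifiers only remove one enumeration path; consistent with paragraph 16, they would
be one layer among several safeguards rather than a complete defence.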
15. We learned that the rapid emergence of AI can represent a threat to privacy. SMCs told
us that scrapers are now using AI to scrape data more effectively (e.g., via “intelligent”
bots that can simulate real user activity). At the same time, SMCs explained that they
too are employing AI to better detect and protect against unauthorized scraping,
highlighting that innovative AI tools can also be part of the solution.
16. Ultimately, the co-signatories learned that while no measure is guaranteed to protect
against all unlawful scraping - since sophisticated low-volume scraping can often
resemble user activity - a multi-layered and dynamic combination of safeguards can be
particularly effective in protecting against mass scraping and the amplified harms that
can result when a large volume of data subjects are affected.
20. Several SMCs indicated that in certain circumstances, they allow scraping or other forms
of mass collection of data from their platforms (e.g., through API access, discussed
further below), in furtherance of their own or third parties’ commercial interests, such
as those associated with platform management.
21. The companies explained that they generally “authorize” such collection via contractual
terms, such as those in their Terms and Conditions. SMCs further explained that to
ensure that the scraping that they permit is lawful, their contractual terms generally
require third parties on their platform to comply with applicable laws. They also
explained that it can be difficult for them to determine whether scraped data is used by
those parties solely for purposes allowed by their contract.
22. The co-signatories note that contractual terms cannot in and of themselves render data
scraping lawful. For example, organizations must also ensure that they have a lawful
basis for granting access or permitting collection of personal data, that they are
transparent about the scraping they allow, and that they obtain consent where required
by law.
23. Furthermore, while contractual terms are an important safeguard against unlawful
scraping, a contractual term indicating that third parties must comply with applicable
laws is not sufficient. Organizations should implement adequate measures to ensure
that contractually allowed use of scraped personal data is compliant with applicable
data protection and privacy laws. The contract could, for example, specify limitations on
the information that may be scraped and the purposes for which it may be used, as well
as the consequences for non-compliance with those terms. However, organizations
cannot simply rely on contractual measures. They should also implement measures to
monitor third parties’ compliance with contractual limitations, and to enforce
compliance when those terms are not respected.
Access to data for research and other potentially socially beneficial purposes
24. In certain circumstances, SMCs may be required by law to provide third parties, such as
researchers, with large-scale access to publicly accessible data on their platforms (e.g.,
pursuant to Article 40 of the EU Digital Services Act [6]). In other circumstances, we
learned that SMCs may choose to provide data access to third parties, even where there
is no legal requirement to do so (e.g., in support of socially beneficial research). Several
of the companies indicated that they often provide such access via an API, in particular
where they are required or permitted by law to grant large-scale access.
25. While the co-signatories acknowledge the importance of socially beneficial research,
they wish to remind SMCs and other organizations that host publicly accessible personal
data that, when allowing large-scale access or collection, organizations must ensure that
they are complying with applicable data protection and privacy laws, including by
ensuring that there is a lawful basis for granting access or permitting collection.
Specifically, the co-signatories note that not all data protection and privacy laws provide
for “public interest”, research or statistical purposes as an exception to the requirement
for consent or as a lawful basis for the processing of personal data. Further, where such
exceptions do exist, there may be limitations on the scope of their application.
[6] Article 40, Single Market For Digital Services and amending Directive 2000/31/EC (Digital Services Act):
Upon a reasoned request from the Digital Services Coordinator of establishment, providers of very large
online platforms or of very large online search engines shall, within a reasonable period, as specified in
the request, provide access to data to vetted researchers who meet the requirements in paragraph 8 of
this Article, for the sole purpose of conducting research that contributes to the detection, identification
and understanding of systemic risks in the Union, as set out pursuant to Article 34(1), and to the
assessment of the adequacy, efficiency and impacts of the risk mitigation measures pursuant to Article 35.
26. The co-signatories also recognize that, where it is lawful to allow large-scale access or
collection, APIs can represent a further safeguard against unlawful scraping. While APIs
are not impenetrable, they can afford the host greater control over the data on its
platform and facilitate detection and mitigation of unauthorized access, via the use of
credentials as well as logging and monitoring of associated activity.
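To make the point in paragraph 26 concrete, the following Python sketch shows how an API
gateway might combine credentials with logging and monitoring. It is an illustrative sketch
only: the key registry, the per-client record quota, and all names are hypothetical
assumptions rather than anything described by the co-signatories or the SMCs.

```python
import logging
from collections import Counter

logger = logging.getLogger("api_access")

# Hypothetical registry mapping each issued API key to (client name, record quota).
API_CLIENTS = {"key-research-001": ("university-lab", 1000)}

# Running total of records served per API key, used for monitoring.
records_served = Counter()


def authorize_request(api_key: str, records_requested: int) -> bool:
    """Allow a request only for a known credential that stays within its quota."""
    client = API_CLIENTS.get(api_key)
    if client is None:
        logger.warning("Rejected request: unknown credential")
        return False
    name, quota = client
    if records_served[api_key] + records_requested > quota:
        logger.warning("Rejected request: %s would exceed quota of %d records", name, quota)
        return False
    records_served[api_key] += records_requested
    logger.info("Served %d records to %s (running total %d)",
                records_requested, name, records_served[api_key])
    return True
```

Because every request is tied to a credential and logged, the host can review how much data
each third party has collected and revoke a key when contractual limits are not respected.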
SMC usage of scraped data and data from their own platforms for AI development
27. The co-signatories took the opportunity presented by this initiative to engage with SMCs
about their own scraping of data and use of scraped data sets to train their Large
Language Models, which present not only opportunities for innovation but also
significant privacy risks.
28. Based on what was learned through these engagements, the co-signatories wish to
remind SMCs and other organizations who may use scraped personal data or data
collected from their own platforms for the development, operation and deployment of
generative AI systems, that they must comply with data protection and privacy laws, as
well as any other AI-specific laws where they exist. The co-signatories also call on these
organizations to comply with privacy and data protection principles like those detailed
in the 2023 Global Privacy Assembly Resolution on Generative Artificial Intelligence
Systems and other international guidance [7]. Specifically, the co-signatories note that
data protection and privacy laws regulate whether and to what extent the collection
and use of personal data for AI development is lawful.
Conclusion
29. Since the release of the Initial Statement, unlawful data scraping has gained increasing
attention, in part due to the rapid emergence and deployment of generative AI systems.
Data scraping has also been, and continues to be, widely discussed globally both by data
protection authorities and industry.
30. The co-signatories wish to recognize the work of the individual data protection
authorities that have produced guidance [8] to address practices related to data scraping.
In this guidance, we note the common theme that publicly accessible personal data is
generally subject to data protection and privacy laws and should be adequately
protected against unlawful scraping.
[7] See the Roundtable of G7 Data Protection and Privacy Authorities 2023 Statement on Generative AI, the
Hiroshima Process International Code of Conduct for Advanced AI Systems and others.
[8] The Dutch DPA (Autoriteit Persoonsgegevens) issued guidelines and the Italian DPA (Garante Per La Protezione
Dei Dati Personali) issued instructions to defend personal data from web scraping. The UK Information
Commissioner’s Office held a consultation on generative AI and data protection, including web scraping to train
generative AI.
31. The co-signatories also want to emphasise their expectation that all companies, not just
SMCs, protect the publicly accessible personal information that they host against
unlawful scraping. Failure to implement adequate safeguards in compliance with
applicable laws could result in regulatory intervention, including enforcement action.
32. The co-signatories also wish to remind those engaged in data scraping, as well as SMCs
and other organizations who use data from their own platforms to train AI, that they
should implement measures to ensure that their data practices comply with data
protection and privacy laws.
33. Data scraping is a complex, broad and evolving issue that is, and will remain, on the radar of
data protection authorities. It should also be a focus for other stakeholders that have a
role in protecting privacy, including those with whom we engaged in the course of this
initiative. The co-signatories will continue to work to promote compliance in this area,
including via future engagement with concerned stakeholders, complementary policy
development, public education campaigns, and enforcement [9], including collaborative
enforcement.
34. Meanwhile, the co-signatories encourage SMCs to continue to collaborate with each
other and with other stakeholders to share knowledge and strategies and develop
solutions to address and respond to this common threat.
35. The co-signatories wish to thank the SMCs and industry stakeholders who demonstrated
openness in discussions with regulators. This enabled the co-signatories to develop and
share their expectations without the need for formal, resource-intensive enforcement
action, to the benefit of all.
[9] Joint investigations of Clearview AI, Inc. by: the Office of the Privacy Commissioner of Canada, the Commission
d’accès à l’information du Québec, the Information and Privacy Commissioner for British Columbia, and the
Information and Privacy Commissioner of Alberta; and by the UK Information Commissioner’s Office and the Office of
the Australian Information Commissioner.
Supervision
Information Commissioner’s Office
United Kingdom

Office of the Privacy Commissioner for Personal Data
Hong Kong
China

Adrian Lobsiger
Commissioner
Federal Data Protection and Information Commissioner
Switzerland

Tobias Judin
Head of International Section
Datatilsynet
Norway

Michael Webster
Privacy Commissioner
Office of the Privacy Commissioner
New Zealand

Cielo Angela Peña Rodriguez
Deputy Superintendent for the Protection of Personal Data
Superintendencia de Industria y Comercio
Colombia

Paul Vane
Information Commissioner
Jersey Office of the Information Commissioner
Jersey

Omar Seghrouchni
President
CNDP (Commission Nationale de contrôle de la protection des Données à caractère Personnel)
Morocco

Beatriz de Anchorena
Director
AAIP (Agency for Access to Public Information)
Argentina

Josefina Román Vergara
Commissioner
INAI (National Institute for Transparency, Access to Information and Personal Data Protection)
Mexico

Brent R Homan
Commissioner
ODPA (Office of the Data Protection Authority)
Guernsey

Mar España Martí
Director
AEPD (Agencia Española de Protección de Datos)
Spain

Robert Chanas
President
CCIN (Commission de Contrôle des Informations Nominatives)
Monaco

Gilad Semama
Commissioner
Privacy Protection Authority
Israel