Protecting University Repositories from Aggressive Web Scraping – Using Database Rights to Retain Control Over Academic Content

Posted on June 12, 2025 | in Library & University Collections | by estoica

Universities have long served as custodians of knowledge, investing heavily in infrastructure that ensures research outputs are preserved, discoverable, and freely available to all. Institutional repositories, designed to host open access publications, embody this ethos. They support transparency, reproducibility, and public accountability in science and scholarship. But today, that infrastructure is under silent siege.

The rapid growth of generative AI and large language models (LLMs) has created insatiable demand for high-quality text data. Academic repositories – open by design and rich in peer-reviewed research – have become prime targets for automated scraping by commercial actors. These entities often extract content at scale without permission, transparency, or attribution, repurposing it for proprietary model training. This phenomenon undermines the ethical foundations of open access and leaves repository managers in an impossible position: maintain openness, or protect against exploitation?

At the University of Edinburgh, this dilemma has become tangible. Automated scraping activity has been detected targeting DSpace-hosted repositories, while similar content on the Elsevier-managed Pure platform has seen no such traffic. The likely explanation lies in technical and legal asymmetry: commercial platforms benefit from robust infrastructure, gated access, and formal licence agreements, whereas open-source systems tend to lack such safeguards. As a result, repositories maintained by public institutions – those most committed to the ideals of open science – are paradoxically the most vulnerable to abuse.

This blog post proposes a legally grounded, technically feasible solution to this challenge. By asserting database rights – an often-overlooked intellectual property tool under UK and EU law – universities can regain agency over the reuse of their repository content. Importantly, this strategy does not undermine openness for legitimate scholarly users. Rather, it introduces enforceable conditions that deter commercial exploitation, demand accountability, and promote ethical reuse.

The Limits of Copyright

Many institutions assume that copyright will provide sufficient protection against unauthorised scraping. In practice, this is a misconception. UK law permits non-commercial text and data mining under Section 29A of the Copyright, Designs and Patents Act 1988. This exception allows users to mine copyright-protected works without needing a licence, provided the purpose is non-commercial research and the user has lawful access. Contractual terms attempting to override this exception are unenforceable.

Yet most web scraping for AI training is not non-commercial research. It is often undertaken by commercial entities with minimal transparency and no academic oversight. Still, enforcement is difficult. Many repositories host works under open licences – typically Creative Commons – that explicitly allow reuse. Furthermore, universities frequently do not own the copyright in deposited articles, especially if authors have signed publishing agreements assigning rights to third parties. In such cases, institutions cannot lawfully impose additional restrictions.

Compounding the problem is jurisdiction. Scraping bots often originate outside the UK or EU, complicating enforcement of domestic copyright law. Even where copyright is potentially infringed, pursuing action requires a claimant with legal standing – typically the copyright holder – and a clear, attributable violation.

All of this makes copyright, while important, an unreliable instrument for protecting repository infrastructure from automated harvesting. Universities need a different legal lever – one rooted not in the authorship of individual works, but in the stewardship of the repository as a curated dataset.

Database Right: A Strategic Opportunity

Under UK and EU law, the maker of a database enjoys a sui generis “database right” if there has been a substantial investment in obtaining, verifying, or presenting its contents. This right protects the contents of the database against the extraction or re-utilisation of substantial parts, regardless of whether the individual items are themselves protected by copyright.

Institutional repositories almost certainly qualify. Universities dedicate significant staff time and technical resources to the ingest, verification, metadata creation, and ongoing maintenance of their repositories. These systems are curated with care, often over decades, and represent a considerable investment in public infrastructure. Accordingly, universities are well-positioned to assert database rights over their institutional repositories.

This right is powerful for several reasons. First, it is separate from copyright and not subject to the non-commercial TDM exception. Second, it allows rights holders to impose access conditions, including click-through licences, as a prerequisite for reuse. Third, infringement occurs not at the level of individual documents, but through the extraction or reuse of a substantial part of the database. This makes enforcement more practical against actors engaging in large-scale scraping.
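Because the right bites on large-scale extraction rather than on individual downloads, it helps to know which clients are pulling a large share of the repository. The following is a minimal sketch in Python, assuming access logs have already been parsed into (client, item) pairs. The function name and the 10% threshold are hypothetical, and the threshold is only an operational trigger for human review, not the legal test for a "substantial part".

```python
from collections import defaultdict

# Hypothetical threshold: the share of the repository one client may fetch
# before the activity is flagged for human review. This is an operational
# heuristic, not the legal test for a "substantial part".
REVIEW_THRESHOLD = 0.10


def flag_bulk_extractors(requests, total_items, threshold=REVIEW_THRESHOLD):
    """Return {client_id: share} for clients whose share exceeds the threshold.

    `requests` is an iterable of (client_id, item_id) pairs, for example
    parsed from web server logs. Repeat downloads of the same item by the
    same client are counted once.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")

    seen = defaultdict(set)
    for client_id, item_id in requests:
        seen[client_id].add(item_id)

    flagged = {}
    for client_id, items in seen.items():
        share = len(items) / total_items
        if share > threshold:
            flagged[client_id] = share
    return flagged
```

A flagged client is a prompt for investigation and record-keeping; whether the extraction is substantial in the legal sense remains a question for legal advice.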

Database right thus offers universities a new legal basis for governing access, especially by commercial AI developers. When paired with licensing, metadata, and technical safeguards, it can form the backbone of an ethical and enforceable access framework.

A Practical and Ethical Framework

To make this legal strategy operational, institutions must adopt a layered approach. The first step is to formally assert their database rights. This can be done through policy statements on repository homepages, in metadata schemas, and within publicly available documentation. Such declarations do not create rights but affirm existing ones and signal intent to enforce them.

Next, universities should introduce access licences. These agreements, ideally implemented as click-through terms or API conditions, should specify acceptable uses. For instance, they might permit free reading and non-commercial research, while requiring separate permission for AI training or other commercial uses. Critically, these licences should reference the university’s database right and clarify that unauthorised mass reuse constitutes infringement.
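As an illustration of how such terms might be expressed as API conditions, here is a small Python sketch. It assumes clients declare a purpose when accepting the click-through terms or requesting an API key; the purpose labels and the function name are hypothetical, not part of any existing repository platform.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    PERMISSION_REQUIRED = "permission_required"


# Hypothetical purpose labels a client might declare when accepting the
# click-through terms or requesting an API key.
FREE_PURPOSES = {"reading", "non_commercial_research"}


def decide_access(declared_purpose: str) -> Decision:
    """Map a declared purpose to the licence outcome described above."""
    purpose = declared_purpose.strip().lower()
    if purpose in FREE_PURPOSES:
        return Decision.ALLOW
    # AI training, other commercial uses, and anything unrecognised fall
    # through to "ask first" rather than being silently allowed.
    return Decision.PERMISSION_REQUIRED
```

Defaulting unrecognised purposes to "permission required" is a design choice in this sketch: it keeps reading and non-commercial research frictionless while making every other use an explicit request.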

To ensure that licences are understood and respected by automated agents, metadata must be made machine-readable. Standards such as schema.org, Dublin Core, or SPDX can be adapted for this purpose, enabling bots to detect and interpret the terms of access. This counters claims of ignorance and strengthens institutions’ legal position in the event of disputes.
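One way this might look in practice is a schema.org description embedded in the repository homepage as JSON-LD. The Python sketch below builds such a record using the schema.org properties `publisher`, `usageInfo`, and `conditionsOfAccess`. The repository name, URLs, helper names, and the wording of the conditions are all placeholders, and schema.org has no dedicated property for database right, so the free-text field carries that statement here.

```python
import json


def build_repository_jsonld(name, url, rights_holder, terms_url):
    """Build a schema.org description carrying the repository's access terms."""
    return {
        "@context": "https://schema.org",
        "@type": "DataCatalog",
        "name": name,
        "url": url,
        "publisher": {"@type": "Organization", "name": rights_holder},
        # Link to the human-readable access licence.
        "usageInfo": terms_url,
        # Free-text summary; the wording here is illustrative only.
        "conditionsOfAccess": (
            "Free to read and to mine for non-commercial research. "
            "Bulk extraction for commercial use, including AI training, "
            "requires prior permission from " + rights_holder + ", "
            "which asserts database right in this repository."
        ),
    }


def to_script_tag(record):
    """Serialise the record as a JSON-LD script element for an HTML page."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Calling `to_script_tag(build_repository_jsonld("Example Repository", "https://repository.example.ac.uk/", "Example University", "https://repository.example.ac.uk/terms"))` yields a block that can be placed in the page head, where crawlers that parse structured data will encounter it.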

Technical deterrents are also essential. Rate limiting, IP monitoring, CAPTCHA, and token-based authentication can reduce the feasibility of mass harvesting without affecting genuine users. Providing structured API access – with terms of use, authentication keys, and logging – offers a controlled alternative for those with legitimate research needs.
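Rate limiting is the simplest of these to illustrate. The sketch below is a generic token-bucket limiter in Python, keyed per client (for example by API token). It is not the mechanism of DSpace or any particular platform, and the class names and default rates are assumptions chosen for the example.

```python
import time


class TokenBucket:
    """Token bucket: allows short bursts, caps the sustained request rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client key (for example an API token)."""

    def __init__(self, rate_per_sec=1.0, capacity=5, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, client_key):
        bucket = self._buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_sec, self.capacity, self.clock)
            self._buckets[client_key] = bucket
        return bucket.allow()
```

A human reader browsing a handful of pages never exhausts the bucket, while a harvester requesting thousands of items per minute is throttled almost immediately. The injectable `clock` parameter exists only to make the limiter easy to test.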

Finally, universities should collaborate. A shared registry of rights-asserted repositories, standardised licence taxonomies, and coordinated enforcement mechanisms would amplify impact. Just as Crossref and ORCID have transformed scholarly metadata, a collective infrastructure for licensing and access governance could fortify the open research ecosystem.

AI Training and Open Content: What Can Be Licensed and for What Purposes?

Enabling access to repository content for commercial AI training is complex – but not beyond reach. The key is ensuring that any such use is lawful, ethical, and based on valid permission. Institutions should assess their legal position carefully and decide if this route is appropriate in their particular context.

Under UK law, the text and data mining (TDM) exception applies only to non-commercial research. Commercial AI training by for-profit developers falls outside this scope and therefore requires a separate licence. However, most universities do not hold the copyright in the scholarly works deposited in their repositories – rights typically remain with authors or publishers. This means institutions cannot authorise commercial reuse, including for AI training, unless they control the rights or have obtained explicit permission. In practice, securing such permissions is often unfeasible, especially where publishers may have existing deals with generative AI companies.

That said, two categories of content remain viable: (1) works for which the university holds the copyright – such as internal reports, policy documents, certain funded project outputs, or staff-authored works created in the course of employment – and (2) works made available under open licences, particularly Creative Commons Attribution (CC BY).

On 15 May 2025, Creative Commons published updated guidance confirming that CC BY–licensed works may be used for AI training, provided copyright law permits it and proper attribution is given. Attribution can be fulfilled by linking to datasets or using retrieval-augmented generation (RAG) technologies that surface original sources – now widely implemented in commercial AI tools.

More restrictive licences present limitations. CC ND prohibits derivative uses and is generally incompatible with training. CC NC and CC SA may permit training, but only under specific conditions that could prove difficult to operationalise at scale.
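These distinctions can be turned into a first-pass triage of repository records. The Python sketch below assumes each record carries a licence string and a flag for university-owned copyright; the function name, the status labels, and the choice to send unrecognised licences to review are assumptions made for illustration, not legal advice.

```python
from enum import Enum


class TrainingStatus(Enum):
    ELIGIBLE = "eligible"          # may be offered, with attribution where required
    NEEDS_REVIEW = "needs_review"  # conditions must be checked case by case
    EXCLUDED = "excluded"


def classify_for_training(licence: str, university_owns_copyright: bool = False) -> TrainingStatus:
    """Triage one record using the categories described in this section."""
    if university_owns_copyright:
        return TrainingStatus.ELIGIBLE

    # Normalise, e.g. "cc-by-nc-4.0" -> {"CC", "BY", "NC", "4.0"}
    parts = set(licence.upper().replace("_", "-").replace(" ", "-").split("-"))

    if "CC" not in parts or "BY" not in parts:
        # Not a recognisable CC BY-family licence: assumed here to need a
        # human decision rather than being guessed either way.
        return TrainingStatus.NEEDS_REVIEW
    if "ND" in parts:
        return TrainingStatus.EXCLUDED
    if "NC" in parts or "SA" in parts:
        return TrainingStatus.NEEDS_REVIEW
    return TrainingStatus.ELIGIBLE
```

Under these assumptions, `classify_for_training("CC BY 4.0")` is treated as eligible, `classify_for_training("CC BY-ND 4.0")` is excluded, and NC or SA variants are queued for review.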

These two categories could provide a foundation for developing responsibly licensed, clearly advertised AI-ready datasets – particularly in collaboration with other institutions. As open licensing continues to grow under rights retention and policy incentives, so too does the opportunity to shape ethical, transparent models for AI engagement that preserve academic values while enabling innovation.

Looking ahead, it is not inconceivable that UK universities – facing mounting financial pressure and increasing demands for accountability – may reconsider their long-standing policy of waiving copyright claims over scholarly works authored by employees. While this deference has supported academic freedom and ease of publication, it has also enabled publishers to assert broad rights over publicly funded research, often through exclusive contracts. With the rise of commercial AI training as yet another form of value extraction, institutions may begin to question the logic of handing over copyright – only to see content monetised again and again, through publishing fees, access charges, and now data licensing for machine learning. A policy shift asserting institutional copyright ownership, at least in specific contexts such as AI training, could allow universities to retain a stake in the downstream use of the knowledge they fund, support, and steward. Such a move would not be without controversy, but it would reflect a growing awareness that the current model may no longer serve the interests of researchers, institutions, or the public they aim to benefit.

Strategic Alignment and Future Readiness

This legal and technical approach aligns with broader national and international goals. The UK Government’s National Data Strategy, the proposed National Research Data Cloud, and the Department for Science, Innovation and Technology’s recent policy statements all emphasise the need for trusted data custodianship. Similarly, the UNESCO Recommendation on Open Science and EU initiatives like the Data Governance Act highlight the importance of legal clarity and ethical stewardship in digital research environments.

By asserting database rights, universities can position themselves as active participants in these agendas. They can help shape how AI interacts with scholarship, build public trust, and defend the principles of openness without becoming passive data sources for opaque commercial systems.

Perhaps most importantly, this strategy requires no legislative change. The legal tools already exist. What is needed is awareness, coordination, and the political will to use them.

Conclusion

The open access movement was built on generosity, cooperation, and a belief in the public value of knowledge. But generosity without boundaries invites exploitation. As AI systems grow more powerful and data-hungry, universities must act to protect their repositories – not by locking them down, but by making access conditional, transparent, and fair.

Database right provides a powerful foundation for this effort. It allows universities to assert ownership of their repository infrastructure, impose access conditions, and take meaningful action against unauthorised mass reuse. When paired with technical safeguards and collaborative licensing models, it offers a path to resilient, ethical, and sustainable open access.

This is not a rejection of AI or innovation. It is a call to define the terms under which academic knowledge is used in machine learning systems. With care, clarity, and coordination, universities can uphold their values while embracing the future – on their own terms.

TAGS: AI training, copyright, database right, generative AI, institutional repository, open access, web scraping
