Data from 500,000 UK Biobank volunteers is being offered for sale on Alibaba after Chinese research institutions violated access agreements.
Summary: Genetic, medical, and lifestyle information from all 500,000 UK Biobank participants was offered for sale on Alibaba following a breach of data-sharing agreements by three Chinese research institutions that had authorized access. Although the data was de-identified, it encompasses genome sequences, medical diagnoses, and biological metrics that experts believe could potentially be re-identified. Alibaba removed the listings before any transactions occurred, UK Biobank has halted all external data access, and the Information Commissioner's Office (ICO) is conducting an investigation. A prior investigation in March had already revealed multiple leaks of the data via GitHub.
This week, the UK government confirmed that genetic, medical, and lifestyle data from 500,000 UK volunteers had been listed for sale on Alibaba. The breach was not the result of hacking but of trusted researchers breaking their contracts: three research institutions in China, which had legitimate access to UK Biobank, downloaded the data and offered it for sale. Minister of State Ian Murray told the House of Commons that UK Biobank had alerted the government on 20 April to three listings on Alibaba, at least one of which appeared to contain data from all 500,000 participants. The data was de-identified, omitting names, addresses, contact information, and NHS numbers, but it did include gender, age, birth month and year, socio-economic factors, lifestyle habits, and measures from biological samples. With the cooperation of both the UK and Chinese governments, Alibaba removed the listings before any sales took place, and the three institutions lost their access privileges. UK Biobank has paused all external data access while it seeks a technical solution to prevent bulk downloads, and it has reported the incident to the ICO.
Overview of UK Biobank
UK Biobank is one of the most significant biomedical research resources in the world. Between 2006 and 2010, it enlisted 500,000 volunteers aged 40 to 69 from across Great Britain, who agreed to share their health data and to be followed up for at least 30 years. The database now holds over 10,000 variables for each participant, including whole genome sequences (fully released in 2023), blood and urine biomarkers, brain and body imaging scans, hospital diagnosis records, GP data, and comprehensive lifestyle questionnaires. Around 22,000 researchers worldwide are permitted to access this data for approved studies of cancer, heart disease, diabetes, Alzheimer’s, and other conditions. The resource has contributed to thousands of peer-reviewed publications and is considered foundational for contemporary genomic medicine.
Data is shared on the premise of de-identification, and researchers must sign material transfer agreements that prohibit redistribution. In this incident, three institutions breached those agreements, and the breach came to light only because they were brazen enough to list the data publicly for sale.
The issue of re-identification
The government stated that the data did not contain identifying names or addresses, but that assurance covers only part of the picture. An investigation by the Guardian in March found that de-identified UK Biobank data had been leaked online on numerous occasions, mainly because researchers accidentally uploaded partial or complete datasets to GitHub, the code-sharing platform. From July to December 2025, UK Biobank sent 80 legal requests to GitHub for the removal of such data. In one instance, a dataset containing millions of medical diagnoses and their associated dates for over 400,000 participants was made public.
The Guardian illustrated that the data could be less anonymous than it appears; a reporter could identify a volunteer's extensive medical records using just their birth month and year and details of a significant surgery, which are commonly shared in casual conversation. Dr. Luc Rocher, an associate professor at the Oxford Internet Institute, explained to the publication that removing identifiers "often does not guarantee anonymity" and that knowing an individual's birthday and a specific medical event might be enough to reliably identify their record. If a record is identified, it could disclose sensitive information such as psychiatric diagnoses, HIV test results, or histories of substance abuse.
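The kind of linkage described above can be illustrated with a minimal sketch. The records and field names below are entirely invented for illustration and do not reflect UK Biobank's actual schema or any real participant. The sketch simply counts how many records share a given combination of attributes and reports those that are unique on that combination.

```python
from collections import Counter

# Entirely synthetic toy records. Field names are illustrative assumptions,
# not UK Biobank's real data layout.
records = [
    {"id": "A", "birth": "1958-03", "procedure": "hip replacement", "year": 2015},
    {"id": "B", "birth": "1958-03", "procedure": "appendectomy", "year": 2012},
    {"id": "C", "birth": "1961-07", "procedure": "hip replacement", "year": 2015},
    {"id": "D", "birth": "1958-03", "procedure": "hip replacement", "year": 2019},
    {"id": "E", "birth": "1961-07", "procedure": "hip replacement", "year": 2015},
]


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_ids(rows, keys):
    """Return ids of rows whose combination of fields matches no other row."""
    sizes = group_sizes(rows, keys)
    return [row["id"] for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


# Birth month and year alone single out nobody in this toy set.
print(unique_ids(records, ["birth"]))

# Adding one medical event and its year isolates three of the five records.
print(unique_ids(records, ["birth", "procedure", "year"]))
```

In this toy set, no record is unique on birth month and year alone, but three of the five become unique once a single procedure and its year are added. Real datasets with thousands of variables per person offer far more such combinations.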
Under UK GDPR, data is only considered truly anonymized if individuals cannot be identified "by any reasonably likely means." Given the size and richness of such datasets, particularly those with complete genome sequences, the question is not whether re-identification could theoretically happen but whether it is difficult enough in practice to offer real protection. As datasets grow and AI tools make cross-referencing easier, the gap between governance and actual data security is widening. Privacy experts argue that UK Biobank’s reliance on de-identification as a safeguard sits uneasily with the fact that many people share parts of their health information online, and that in an age of advanced language models those fragments can be pieced together.
A recurring issue, not an isolated event
The Alibaba listings are the most visible sign of a deep-rooted problem that UK Biobank has been trying to address, with limited success, for months. The investigation from March revealed the occurrence of data leaks
