UK Biobank data from 500,000 volunteers has been offered for sale on Alibaba following breaches of access agreements by Chinese research institutions.
Summary: Genetic, medical, and lifestyle data from 500,000 UK Biobank volunteers was put up for sale on Alibaba after three Chinese research institutions, which had legitimate access, breached their data-sharing agreements. The data was de-identified, but it comprised genome sequences, hospital diagnoses, and biological measures that experts say can be used to re-identify individuals. Alibaba removed the listings before any sales took place, UK Biobank has paused all external data access, and the ICO is investigating. A Guardian investigation published in March had already found that the data had been exposed multiple times on GitHub.
The genetic, medical, and lifestyle data of 500,000 UK volunteers was listed for sale this week on Alibaba’s Chinese e-commerce platform, the UK government confirmed on Wednesday. No malicious code was involved. Three research institutions in China, which had been legitimately granted access to the UK Biobank database, downloaded the data and listed it for sale. This was not a hack but a breach of contract by trusted researchers, which makes it more concerning: it exposes a weakness in the open research data-sharing model, which depends on recipients following the rules.
Ian Murray, the Minister of State, told the House of Commons that UK Biobank notified the government on April 20 that three listings had been found on Alibaba, at least one of which contained data from all 500,000 participants. The data was de-identified, with names, addresses, contact information, and NHS numbers removed, but it included gender, age, month and year of birth, socio-economic status, lifestyle behaviors, and biological sample measurements. With assistance from both the UK and Chinese governments, Alibaba removed the listings before any sales could occur, and the three institutions had their access revoked. UK Biobank has paused all external data access while it develops a way to prevent large-scale downloads, and it has reported itself to the Information Commissioner’s Office.
What UK Biobank holds
UK Biobank is one of the most significant biomedical research resources in the world. Between 2006 and 2010, it enrolled 500,000 volunteers aged 40 to 69 across Great Britain, who agreed to share their health data and be followed for a minimum of 30 years. The database now contains over 10,000 variables per participant, including complete genome sequences for all 500,000 volunteers (fully released in 2023), biomarkers from blood and urine, imaging scans of the brain and body, hospital diagnosis records, GP data, and detailed lifestyle questionnaires. Approximately 22,000 researchers worldwide access the data for approved research into cancers, heart disease, diabetes, Alzheimer’s, and other conditions. That work has produced thousands of peer-reviewed papers and made the resource foundational for contemporary genomic medicine.
Data is shared on the assumption that it is de-identified. Researchers must sign material transfer agreements that restrict redistribution, and the model relies on compliance with those agreements. This week's incident arose because three institutions violated them, and the violation came to light only after the data was listed on a public marketplace.
The re-identification problem
The government's assurance that the data lacked names or addresses is accurate but incomplete. A Guardian investigation published in March found that de-identified UK Biobank data had been exposed online multiple times, as researchers unintentionally uploaded partial or complete datasets to GitHub, a code-sharing platform. From July to December 2025, UK Biobank issued 80 legal notices to GitHub requesting the removal of such data. In one instance, a dataset containing millions of hospital diagnoses and associated dates for more than 400,000 participants was publicly exposed.
The Guardian demonstrated that the data is not as anonymous as it might seem. A reporter was able to locate a volunteer’s extensive hospital diagnosis records using only their birth month and year together with details of a major surgery they had undergone, the kind of information people commonly share in everyday conversation. Dr. Luc Rocher, an associate professor at the Oxford Internet Institute, said that the removal of identifiers “often does not guarantee anonymity,” and that knowing a person's birthday and the date of a specific medical event could be sufficient to reliably identify their record. Once identified, that record might expose psychiatric diagnoses, HIV test results, or histories of substance abuse.
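The kind of linkage described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the records are invented, the field layout is hypothetical and does not reflect UK Biobank's actual schema, and the function names are ours. It shows only the general idea that a few seemingly harmless attributes, combined, can narrow a de-identified dataset down to a single record.

```python
from collections import Counter

# Synthetic, invented records (NOT real data, NOT the UK Biobank schema):
# (record_id, birth_year, birth_month, procedure, procedure_year)
records = [
    ("r1", 1955, 3, "hip replacement", 2015),
    ("r2", 1955, 3, "cataract surgery", 2015),
    ("r3", 1955, 3, "hip replacement", 2018),
    ("r4", 1960, 7, "hip replacement", 2015),
    ("r5", 1955, 3, "cataract surgery", 2015),
]

def matching_records(records, birth_year, birth_month, procedure, procedure_year):
    """Return the ids of records consistent with what an outsider already knows."""
    known = (birth_year, birth_month, procedure, procedure_year)
    return [rid for rid, *attrs in records if tuple(attrs) == known]

def unique_fraction(records):
    """Share of records whose attribute combination appears exactly once."""
    counts = Counter(r[1:] for r in records)
    return sum(1 for r in records if counts[r[1:]] == 1) / len(records)

# An outsider who knows a birth month and year plus one operation
# is left with a single candidate record in this toy dataset.
candidates = matching_records(records, 1955, 3, "hip replacement", 2015)
print(candidates)
print(unique_fraction(records))
```

In this toy dataset, three of the five records are unique on those four attributes alone. A real dataset with thousands of variables per participant offers far more combinations to match on, which is why removing names and addresses does not by itself make a record anonymous.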
Under UK GDPR, data is only genuinely anonymized if individuals cannot be identified “by any reasonably likely means.” Given the size and richness of such datasets, especially those containing complete genome sequences, the question is not whether re-identification is theoretically feasible but whether it is difficult enough in practice to provide meaningful protection. The gap in data security governance is widening as datasets expand and AI tools make cross-referencing easier. Privacy experts argue that UK Biobank's treatment of de-identification as an adequate safeguard ignores the fact that many people publicly share fragments of their health information, and that those fragments can be reassembled in the era of large language models.
A pattern, not an incident
The Alibaba listings exemplify a significant issue that UK
