No Peeking at My Genome!—How Scientists Are Tackling Issues Concerning Genomic Data

With the increasing use of genome sequencing and genomic data in medicine, scientists must also address the growing problem of privacy protection.


What is shaped like a twisted ladder, would stretch to about twice the diameter of the solar system if the strands from every cell in your body were laid end to end, and has led to the development of countless innovative biomedical technologies? The answer: the human genome.

The sequencing of the human genome has made a revolutionary impact, most notably in the medical field. Thanks to genome sequencing, we can assess someone’s risk for certain cancers, solve crimes, and even search for long-lost relatives. Genome sequencing has even played a role in the COVID-19 pandemic, with many COVID-19 survivors having their genomes sequenced to identify genetic variants associated with more severe disease.

While it’s difficult to detract from the advancements made possible by the use of the human genome, there have been growing privacy concerns regarding the collection of genomic data. A leak of genomic data may reveal private health information, resulting in dangerous consequences like workplace discrimination and conflicts with family members. And because an individual’s genes cannot be easily changed, the leakage of genomic data has permanent effects. Thus, scientists have been searching for solutions that provide reliable privacy protection for our genomes.

The set of legal standards laid out by the Health Insurance Portability and Accountability Act represented the first attempt at addressing the issue. It relied on a technique known as de-identification: censoring and transforming data until it can no longer be linked to the individual who provided it. However, de-identification later proved to be a poorly built shield. A “linkage” attack, for example, can bypass de-identification by using external databases that share attributes with the de-identified data. Both datasets contain quasi-identifiers about the same individuals, such as zip code, gender, and salary, so an attacker can join the two, narrow a de-identified record down to a single person, and ultimately re-identify them. Beyond this vulnerability, full de-identification of genomic data is especially problematic for research purposes: stripping away enough information to guarantee privacy often renders the data useless to scientists.
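The linkage attack described above can be sketched in a few lines of code. The records and names below are entirely hypothetical; the point is only to show how joining a “de-identified” table with a public, identified one on shared quasi-identifiers defeats the de-identification.

```python
# Toy linkage attack: join a "de-identified" medical table with a public,
# identified table (e.g., a voter roll) on shared quasi-identifiers.
# All records here are hypothetical.

deidentified_records = [
    {"zip": "10001", "gender": "F", "diagnosis": "BRCA1 variant"},
    {"zip": "10002", "gender": "M", "diagnosis": "healthy"},
]

public_records = [
    {"zip": "10001", "gender": "F", "name": "Alice"},
    {"zip": "10002", "gender": "M", "name": "Bob"},
]

def link(deid, public):
    """Re-identify any record whose quasi-identifiers match exactly one person."""
    reidentified = []
    for row in deid:
        matches = [p for p in public
                   if p["zip"] == row["zip"] and p["gender"] == row["gender"]]
        if len(matches) == 1:  # a unique match defeats de-identification
            reidentified.append({**row, "name": matches[0]["name"]})
    return reidentified

print(link(deidentified_records, public_records))
```

In real attacks the joined attributes are the same kind of fields named above (zip code, gender, birth date), and even a handful of them is often enough to make most records unique.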

Another traditional but ineffective strategy for privacy protection is access control, which grants data access only to a vetted group of researchers. To implement this system, many genomic databases require researchers to submit a summary of their proposed project for review by a data access committee. The committee determines whether the project has sufficient informed consent: that is, whether participants granted permission with full knowledge of the experiment’s purpose and its consequences. While this gives participants and biobanks more control over who can access their data, access control restricts data sharing and does nothing to prevent leaks once researchers receive the data.

Given the flaws of traditional privacy protection methods, scientists have turned to alternative approaches to securing genomic data, finding potential solutions in the field of cryptography, the study of secure communication techniques based on mathematical concepts and algorithms. One particular type of cryptography, called fully homomorphic encryption (FHE), is believed to be strong enough that even quantum computers could not crack it. The idea of FHE was first proposed in the 1970s, but it remained an open problem until 2009, when computer scientist Craig Gentry constructed the first working FHE scheme. Since then, he has been refining the accuracy and efficiency of FHE alongside associates from IBM Research.
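The defining property of homomorphic encryption is that you can compute on ciphertexts without ever decrypting them. A minimal way to see this, using only long-established math rather than Gentry’s scheme, is textbook RSA, which happens to be *multiplicatively* homomorphic: multiplying two ciphertexts yields an encryption of the product of the plaintexts. This is not FHE (FHE supports arbitrary computation, and these toy key sizes are wildly insecure), but it illustrates the core idea.

```python
# Textbook RSA with toy parameters, to illustrate the homomorphic property.
# NOT secure and NOT full FHE -- just a demonstration of computing on
# encrypted values.

p, q = 61, 53
n = p * q                 # modulus, 3233
phi = (p - 1) * (q - 1)
e = 17                    # public exponent
d = pow(e, -1, phi)       # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 6
c_product = (encrypt(a) * encrypt(b)) % n  # multiply ciphertexts only
print(decrypt(c_product))                  # recovers 42 = 7 * 6
```

The multiplication happens entirely on encrypted values; only the holder of the private key ever sees the result. FHE extends this idea so that both addition and multiplication, and therefore any computation, can run on ciphertexts.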

Unlike data protected by everyday encryption, genomic data encrypted with FHE remains encrypted at all times, even while computations run on it or while it is transmitted to other places. A leak of such data would not breach patients’ privacy, because without the secret key the stolen ciphertexts reveal nothing about their source. Compared to de-identification methods, then, FHE provides an iron shield of privacy protection. Its high-level security stems from the mathematics of lattices: infinite grids of regularly repeating points. Lattice-based encryption schemes such as FHE hide data by adding a small offset, or “noise,” to a lattice point; recovering the data requires finding the nearest lattice point, a problem believed to be extremely difficult in high dimensions for both regular and quantum computers.

FHE has not only proven itself to be effective privacy protection for genomic data but also opens a pathway, outside the field of medicine, to safely share data on an international scale. Last year, the Brazilian bank Banco Bradesco collaborated with IBM to test the potential of FHE on financial data. The bank first verified that its existing machine-learning prediction model produced the same results on encrypted data as on unencrypted data. It then trained the model on new encrypted data to show how FHE could continue to protect the privacy of client data. FHE could thus be used in finance, letting banks transfer and analyze data without data analysts ever having access to raw client information.

With the use of genomic data becoming more frequent and significant in everyday life, genomic data banks need more effective methods of privacy protection. FHE may be the key to securing genomic data, but it still suffers from a few inadequacies, including heavy computational requirements: computations on FHE-encrypted data run far slower than under most other encryption schemes. And if FHE were to be further developed, tested, and incorporated with genomic data, there may be other concerns that the scientific community will have to confront. For example, Arthur Liang, a junior well-versed in biology and the vice president of Stuyvesant Biology Olympiad, believes “telemedicine and how genomic data is moved from database to database on a daily basis… [has] many holes that need to be addressed in tandem to ensure secure transmission.” But in the end, he stated, “I don’t think there will be any major pushbacks regarding ramping up genomic data collection and analysis. Obviously, it’d be best if privacy protection methods are perfected first, but in my experience with data privacy, it seems like not much will happen if companies stick to a standard, albeit insufficient, data protection protocol.” Genomic data does have its drawbacks, but its issues by no means detract from the significant contributions it has made in changing the methods we use to diagnose and treat patients, benefiting thousands of lives each day.