The 15th Installment
“My Research”
by Shogo Shimizu,
Assistant Professor, Master Program of Information Systems Architecture
We are about to enter an era in which the DNA sequence of each individual will be available for 1,000 dollars. This is expected to permit individuals and medical institutions to exchange DNA sequences electronically in a variety of situations. For example, an individual will be able to learn about his or her risk of developing a specific disease in the future by submitting his or her DNA sequence as a query to a database of known genetic diseases, enabling early treatment of the disease. Also, medical service providers will be able to predict the effects of specific treatments on patients based on their DNA sequences.
However, there are risks in revealing one’s DNA sequence to others. For example, a DNA sequence suggesting the possibility of cancer in the future may adversely affect the terms of an insurance policy or the conditions of employment. Therefore, individuals will want to prevent their DNA sequences from being known to others. A mechanism for protecting the privacy of DNA sequences is necessary for facilitating services that make use of medical information. Such sensitive information has hitherto been protected by law or contractual arrangements. What made such protections workable, however, was the fact that DNA information was used only in limited places for limited purposes. Those protections will be insufficient for preventing information leaks when DNA information is stored or transmitted on the Internet.
The problem of privacy protection in DNA searches is defined as follows. A DNA sequence is expressed as an arrangement of four bases: A, C, G, and T. A set of DNA sequences is stored in a database managed by the service provider. The user sends his or her own DNA sequence to the database as a query string. For the search, the two sequences do not need to match each other perfectly; all that needs to be obtained are the sequences (or information about the sequences) that are similar to the query. A number of similarity measures have been proposed, and the one frequently used for DNA searches is the edit distance. The edit distance takes the replacement, insertion, and deletion of a base as its basic operations, and defines the degree of similarity between two sequences as the minimum number of such operations required to transform one sequence into the other. The similarity threshold is specified by the user. The privacy requirement is to prevent the users’ DNA sequences from being known to the service provider that processes the queries.
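As a point of reference, the edit distance described above can be sketched in a few lines of Python. This is only the plain, non-private calculation using the standard dynamic-programming method; the function names `edit_distance` and `similar_sequences` and the sample sequences are hypothetical and chosen for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of replacements, insertions, and deletions
    needed to turn sequence s into sequence t."""
    # prev[j] holds the distance between the first i-1 bases of s
    # and the first j bases of t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning i bases into an empty sequence takes i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from s
                            curr[j - 1] + 1,      # insert b into s
                            prev[j - 1] + cost))  # replace a with b (or match)
        prev = curr
    return prev[-1]


def similar_sequences(query: str, database: list[str], threshold: int) -> list[str]:
    """Return the database sequences within the user-specified threshold."""
    return [seq for seq in database if edit_distance(query, seq) <= threshold]


if __name__ == "__main__":
    db = ["ACGT", "ACGA", "TTTT", "ACG"]
    print(similar_sequences("ACGT", db, threshold=1))
```

Note that each call to `edit_distance` costs time proportional to the product of the two sequence lengths, and `similar_sequences` makes one such call per database entry.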
To fulfill this requirement, two approaches have been suggested: a code-based approach and a perturbation-based approach. The former generally takes secure function evaluation as its starting point. Secure function evaluation is a method by which two parties compute the value of a specific function without either party’s input becoming known to the other. When applied as is, however, this method is inefficient to process, so a protocol customized for edit distance calculations must be developed. The latter is a method in which the original sequence is hidden by stochastically replacing bases in the query with other bases, or by inserting or deleting bases at random. Similarity is therefore calculated with a modified query, and there is a trade-off between safety and processing efficiency.
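The perturbation-based idea can be illustrated with a minimal sketch. It assumes that each base is independently deleted, replaced, or followed by an inserted base with fixed probabilities; the function name `perturb` and the parameters `p_sub`, `p_ins`, and `p_del` are hypothetical and are not taken from any specific published scheme.

```python
import random

BASES = "ACGT"


def perturb(query: str, p_sub: float, p_ins: float, p_del: float,
            rng: random.Random) -> str:
    """Return a randomly modified copy of the query sequence.

    Assumed model (for illustration only): each base is deleted with
    probability p_del, otherwise replaced by a different base with
    probability p_sub; after each kept base, a random base is inserted
    with probability p_ins.
    """
    out = []
    for base in query:
        r = rng.random()
        if r < p_del:
            continue  # drop this base
        if r < p_del + p_sub:
            # replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))  # insert a random base
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the run is repeatable
    print(perturb("ACGTACGTACGT", p_sub=0.1, p_ins=0.05, p_del=0.05, rng=rng))
```

Higher probabilities hide the original query more thoroughly, but the search is then run against a sequence that is further from the user's true one.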
In either case, the existing methods require a one-to-one comparison between the query sequence and each sequence in the database. A DNA sequence ranges from hundreds to thousands of bases in length, and the number of sequences accumulated in the database reaches 100,000 for human beings alone. Gene databases are expected to grow even larger, so it is not practical to apply costly processing such as a similarity calculation on a one-to-one basis. It would be desirable to filter candidate sequences with a more efficient method as a preprocessing step. This is what has prompted the writer to work on the issue of increasing the processing efficiency of private DNA searches over large databases.
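To make the idea of preprocessing concrete, the sketch below uses one well-known filter, the so-called bag distance, which compares only how many times each base occurs. It is offered as an illustration of filtering in general, not as the writer's own method, and it does not by itself address privacy. The names `bag_distance` and `filter_candidates` are hypothetical. Because a single replacement, insertion, or deletion can reduce the surplus of bases on each side by at most one, the bag distance never exceeds the edit distance, so any sequence it rejects could not have satisfied the threshold.

```python
from collections import Counter


def bag_distance(s: str, t: str) -> int:
    """Lower bound on the edit distance, computed from base counts only."""
    cs, ct = Counter(s), Counter(t)
    # Counter subtraction keeps only positive differences.
    surplus_s = sum((cs - ct).values())  # bases s has in excess of t
    surplus_t = sum((ct - cs).values())  # bases t has in excess of s
    return max(surplus_s, surplus_t)


def filter_candidates(query: str, database: list[str], threshold: int) -> list[str]:
    """Keep only sequences that might be within the threshold.

    Survivors still need the full similarity calculation; sequences
    dropped here are guaranteed to exceed the threshold.
    """
    return [seq for seq in database if bag_distance(query, seq) <= threshold]


if __name__ == "__main__":
    db = ["TGCA", "AAAA", "ACG"]
    print(filter_candidates("ACGT", db, threshold=1))
```

The filter runs in time proportional to the sequence lengths rather than their product. It can let through false candidates (for example, a sequence with the same bases in a different order), which is why the exact calculation is still applied to the survivors.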