Privacy Law and Policy Reporter
Vladimir Estivill-Castro, Ljiljana Brankovic and David L Dowe
Data is one of the most important corporate assets of companies, governments and research institutions. It is now possible to access and correlate information stored in independent and distant databases quickly, to analyse and visualise data online, and to use data mining tools for automatic and semi-automatic exploration and pattern discovery. Knowledge discovery and data mining (KDDM) is an umbrella term describing techniques for extracting information from data and suggesting patterns in very large databases. With the expansion of computer technology, huge volumes of detailed personal data are now regularly collected and analysed by marketing applications using KDDM techniques. KDDM is also being used in other domains where privacy issues are very delicate. The FBI applied KDDM techniques to analyse crime data and narrow down possibilities as part of investigations into the Oklahoma City bombing, the Unabomber case, and many other crimes. Another example is the application of KDDM to analysing medical data. While there are many beneficial applications of KDDM to these domains, individuals can easily imagine the potential damage caused by unauthorised disclosure of financial or medical records.
The balance between privacy and the need to explore large volumes of data for pattern discovery is a matter of concern.
Knowledge discovery in databases that contain personal information has recently become a focus of public attention. Here are just two examples:
At least 400 million credit records, 700 million annual drug records, 100 million medical records and 600 million personal records are sold yearly in the US by 200 superbureaux. Among the records sold are bank balances, rental histories, retail purchases, criminal records, unlisted phone numbers and recent phone calls. When combined, this information provides data images of individuals that are sold to direct marketers, private individuals, investigators and government agencies.
Surveys in the US reveal growing concern about privacy. The most recent Equifax-Harris Consumer Privacy Survey shows that over 70 per cent of respondents are against unrestricted use of their medical data for research purposes. At least 78 per cent believe that computer technology represents a threat to privacy and that the use of computers must be severely restricted in the future if privacy is to be preserved. At least 76 per cent believe they have lost control over their personal information. Time-CNN and other recent studies reveal that at least 93 per cent of respondents believe companies selling personal data should obtain permission from individuals. By contrast, in 1970 Equifax-Harris found that only 33 per cent considered computer technology a threat to their privacy.
Marketers often see privacy concerns as unnecessary and unreasonable: to them, privacy is an obstacle to understanding customers and supplying better-fitted products.
The existing market in personal data postulates that the gathering institution owns the data. Nevertheless, the attitude of data collectors and marketers towards privacy is significantly more moderate than 20 years ago, when marketers believed that there was ‘too much privacy already’. The reason for this change, apart from the fact that privacy is under a much greater threat now, is probably the fear of losing the trust of customers and of massive public opposition. Many data owners acknowledge that there is a ‘Big Brother’ aspect to the exploitation of personal data sets, and take some measures to preserve their customers’ trust. Others imply that the ‘sinister purpose’ of data mining is the ‘product of junk science and journalistic excess’, but nevertheless believe that ‘marketers should take a pro-active stance and work to diffuse the issue before it becomes a major problem’.
Researchers feel that privacy regulations impose inconsistent restrictions on data exploration and, in some cases, ruin the data.
Personal data is placed on large online networked databases, such as the Physician Computer Network in the US, with the intention of building and expanding knowledge. Data is necessary for informed decision-making in the public and private sector. How could planning decisions be taken if census data was not collected? How could epidemics be understood if medical records were not analysed? Individuals benefit from data collection efforts via the process of building knowledge that guides society. The protection of privacy cannot be achieved simply by restricting data collection or restricting the use of computer and networking technology. However, scholars from diverse backgrounds in history, sociology, business and political science have concluded that the existing privacy laws are far behind developments in information technology and do not protect privacy well. Only 24 countries have adopted, in varying degrees, the recent OECD Principles on Data Collection. Twelve nations have adopted all OECD’s principles in statutory law. Australia, Canada, New Zealand and the US do not protect personal data handled by private corporations. In Australia, the Privacy Act 1988 (Cth) predates increases in online purchasing and other massive networked data collection mechanisms. Australia’s Privacy Commissioner, Moira Scollay, has taken steps to simplify privacy regulations and provide a single national framework for data matching systems such as Fly-Buys cards. However, she has so far only released for discussion a set of principles for the fair handling of personal information.
KDDM experts offer opposing opinions. Some believe that KDDM is not a threat to privacy, since the derived knowledge is only about and from groups. Others clearly oppose this view, arguing that KDDM deals mainly with huge amounts of microdata. Some fear different academic standards: ‘Statutory limitations that vary from country to country ... suggest that the practice ... varies from country to country’. Europe has adopted the OECD directives and investigators across all fields of scholarly research now require ‘the subject’s written consent or data may not be processed’. The new privacy laws in Germany have dramatically reduced the number of variables in the census and the micro census. Some think that subject approval may not be sufficient for data miners to refer or disclose incidentally discovered patterns.
In the context of KDDM, two privacy issues arise.
KDDM poses a threat to privacy in the sense that discovered patterns classify individuals into categories, revealing in that way confidential personal information with a certain probability. Moreover, such patterns may lead to the generation of stereotypes, raising very sensitive and controversial issues, especially if they involve attributes such as race, gender or religion. An example is the debate about studies of intelligence across different races.
The exploratory KDDM tools may correlate and disclose confidential, sensitive facts about individuals. For instance, a central task in KDDM is inductive learning; this takes as input a training data set and produces as output a model (called a classifier) which is then applied to new, unseen cases to predict some important and perhaps confidential attribute (for example, customer buying power or medical diagnosis). The classifiers are typically very accurate when applied to cases from the training set, and can potentially be used to compromise the confidential properties of these cases.
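The disclosure risk described above can be sketched in a few lines. The example below is illustrative only, with invented data and a deliberately simple nearest-neighbour classifier: because such a classifier effectively memorises its training set, anyone who can query it with the publicly known, non-confidential attributes of a training case recovers that case's confidential label.

```python
# Illustrative sketch (hypothetical data): a 1-nearest-neighbour classifier
# trained on personal records reproduces the confidential attribute of any
# training case when queried with its known, non-confidential attributes.

def hamming(a, b):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))

def predict(training, query):
    """Return the confidential label of the training record closest to query."""
    _, label = min(training, key=lambda rl: hamming(rl[0], query))
    return label

# Training data: (age band, postcode, occupation) -> confidential diagnosis.
training = [
    (("30-39", "2300", "teacher"), "disease A"),
    (("40-49", "2300", "plumber"), "healthy"),
    (("30-39", "2301", "teacher"), "disease B"),
]

# An adversary who knows a training subject is a teacher aged 30-39 in
# postcode 2300 (all non-confidential facts) recovers the diagnosis exactly.
leak = predict(training, ("30-39", "2300", "teacher"))
```

Real KDDM classifiers are more sophisticated, but the same risk arises whenever a model fits its training cases very closely.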
Also, knowledge of totals and other similar facts about the training data may be correlated to facilitate compromising individual values, either with certainty or with a high probability. For example, consider a data set in which the released totals show that every patient in some identifiable group (say, all patients of a certain age in a certain suburb) suffers from disease A. If it is known that Mr Brown’s information is part of the data and that he belongs to that group, it is possible to infer that Mr Brown has disease A.
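The inference from totals can be made concrete with a small sketch. The group definition and counts below are hypothetical; the point is that when the group total equals the count of members with the disease, membership alone discloses the diagnosis with certainty.

```python
# Hypothetical released aggregates for a medical data set: the number of
# patients in an identifiable group, and how many of them have disease A.
released = {
    "patients aged 60-69 in postcode 2300": {"count": 4, "with_disease_A": 4},
}

def infer(group, is_member):
    """If every member of the group has disease A, membership implies disease A."""
    stats = released[group]
    if is_member and stats["with_disease_A"] == stats["count"]:
        return "disease A"
    return "unknown"

# Mr Brown is publicly known to belong to the group, so his diagnosis leaks
# from aggregate statistics alone, without any individual record being released.
inference = infer("patients aged 60-69 in postcode 2300", is_member=True)
```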
While the first issue falls in the sociological, anthropological and legal domain, the second is a technical issue. The technical problems were anticipated in the early 80s, well before the widespread acceptance of KDDM. Despite this fact, and the apparent interest in this issue in the business and marketing community, little has been done in terms of finding a technical solution to the problem. Approaches for privacy in KDDM have only recently been considered, and none have yet been applied seriously for KDDM. All the privacy protection methods proposed for KDDM are well known and applied in the context of statistical databases. There, methods have been developed to guard against the disclosure of individual data while satisfying requests for aggregate statistical information. Removing identifiers such as names, addresses, telephone numbers and social security numbers is a minimum requirement but is insufficient to ensure privacy. Re-identification based on remaining fields may still be possible, and so removing identifiers should never be used on its own. As a simple example, early geographical analysis of medical records replaced personal addresses by latitude and longitude co-ordinates. Today, electronic city maps allow individual homes to be identified.
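Re-identification based on the remaining fields typically works by linkage. The sketch below uses entirely invented tables: a "de-identified" medical table still carries quasi-identifying fields (birth year, sex, postcode) that can be joined against a public register containing names.

```python
# Sketch of re-identification by linkage (all data hypothetical). The medical
# table has had names removed, but its remaining fields match a public register.

deidentified = [
    {"birth_year": 1950, "sex": "F", "postcode": "2300", "diagnosis": "disease A"},
    {"birth_year": 1962, "sex": "M", "postcode": "2301", "diagnosis": "healthy"},
]

public_register = [
    {"name": "J. Smith", "birth_year": 1950, "sex": "F", "postcode": "2300"},
    {"name": "K. Jones", "birth_year": 1962, "sex": "M", "postcode": "2301"},
]

QUASI = ("birth_year", "sex", "postcode")

def link(medical, register):
    """Join the two tables on the quasi-identifying fields, re-attaching names."""
    matches = []
    for m in medical:
        key = tuple(m[q] for q in QUASI)
        for r in register:
            if tuple(r[q] for q in QUASI) == key:
                matches.append((r["name"], m["diagnosis"]))
    return matches

reidentified = link(deidentified, public_register)
```

Whenever the combination of remaining fields is unique to one person, the join re-attaches the confidential attribute to a name, which is why removing direct identifiers alone is insufficient.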
Traditional methods in database security do not solve these problems; for example, with KDDM, the possibility arises of identifying specific patterns that significantly narrow the possibilities. Finding associations about buyers of milk near a needle exchange program is the kind of inference that might point to an infringement of privacy. Or, in the Schaeffer murder example, analysis of make-up purchases might allow inferences about young female actresses in LA, making it feasible to visit all highly likely addresses.
The technical challenge is to provide security mechanisms for protecting the confidentiality of individual information used for knowledge discovery and data mining.
Such techniques and mechanisms can lead to new privacy control systems to convert a given data set into a new one in such a way as to preserve the general patterns from the original data set. This will allow for the choice of balance between privacy and the precision of general patterns.
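One simple perturbation idea along these lines can be sketched as follows. This is an illustration under our own assumptions, not the authors' specific proposal: zero-sum random noise is added to a numeric column, so each individual value is masked while the column total and mean, the aggregates that many general patterns depend on, are preserved exactly.

```python
import random

# Illustrative privacy control sketch (hypothetical data): mask individual
# values with random noise that is re-centred to sum to zero, preserving
# the column's total and mean for aggregate analysis.

def perturb(values, scale=5000.0, seed=42):
    rng = random.Random(seed)
    noise = [rng.uniform(-scale, scale) for _ in values]
    shift = sum(noise) / len(noise)      # centre the noise so it sums to zero
    noise = [n - shift for n in noise]
    return [v + n for v, n in zip(values, noise)]

incomes = [42000.0, 58000.0, 61000.0, 39000.0]
masked = perturb(incomes)

# Individual incomes change, but the mean survives (up to float rounding),
# so aggregate patterns computed from the masked data remain usable.
```

The practical difficulty, as the text notes, is choosing the noise so that more elaborate patterns (correlations, classification rules) survive as well, which is exactly the balance between privacy and precision.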
This paper was first published in the Official Journal of the Australian Computer Society (NSW Branch), Volume 35 no 7, August 1999, and is reprinted here by kind permission of the authors and editors. The authors are at Newcastle University (Estivill-Castro & Brankovic) and Monash University (Dowe).