Estivill-Castro, Vladimir; Brankovic, Ljiljana; Dowe, David L --- "Privacy in data mining" [1999] PrivLawPRpr 44; (1999) 6(3) Privacy Law & Policy Reporter 33

Privacy in data mining

Vladimir Estivill-Castro, Ljiljana Brankovic and David L Dowe

Data is one of the most important corporate assets of companies, governments and research institutions. It is now possible to have fast access to correlate information stored in independent and distant databases, to analyse and visualise data online and use data mining tools for automatic and semi-automatic exploration and pattern discovery. Knowledge discovery and data mining (KDDM) is an umbrella term describing techniques for extracting information from data and suggesting patterns in very large databases. With the expansion of computer technology, huge volumes of detailed personal data are now regularly collected and analysed by marketing applications using KDDM techniques. KDDM is also being used in other domains where privacy issues are very delicate. The FBI applied KDDM techniques to analyse crime data and reduce possibilities as part of investigations into the Oklahoma City bombing, the Unabomber case, and many other crimes. Another example is the application of KDDM to analysing medical data. While there are many beneficial applications of KDDM to these domains, individuals can easily imagine the potential damage caused by unauthorised disclosure of financial or medical records.

The balance between privacy and the need to explore large volumes of data for pattern discovery is a matter of concern.

Three views on privacy

Individuals: ‘Where did you get my name ... and why?’[1]

Knowledge discovery in databases that contain personal information has recently become a focus of public attention. Here are a few examples:

In 1989, the Californian Department of Motor Vehicles earned over US$16 million by selling the driver licence data of 19.5 million Californian residents. A certain Mr Bardo used this facility to obtain the home address of actress Rebecca Schaeffer, and killed her in her apartment. The sale of driver licence data ended after this tragedy.
In 1990, Lotus Development Corporation announced the release of a CD-ROM with data on 100 million households in the US. The data was so detailed that it generated strong public opposition and Lotus abandoned the project. However, this mostly affected small businesses, as large companies already had access to Lotus data sets and continued to use them.
At least 400 million credit records, 700 million annual drug records, 100 million medical records and 600 million personal records are sold yearly in the US by 200 superbureaux. Among the records sold are bank balances, rental histories, retail purchases, criminal records, unlisted phone numbers and recent phone calls. When combined, this information provides data images of individuals that are sold to direct marketers, private individuals, investigators and government agencies.

Surveys in the US reveal growing concern about privacy. The newest Equifax-Harris Consumer Privacy Survey shows that over 70 per cent of respondents are against unrestricted use of their medical data for research purposes. At least 78 per cent believe that computer technology represents a threat to privacy and that the use of computers must be severely restricted in the future if privacy is to be preserved. At least 76 per cent believe they have lost control over their personal information. Time-CNN and other recent studies reveal that at least 93 per cent of respondents believe companies selling personal data should obtain permission from individuals. By contrast, in 1970 Equifax-Harris found only 33 per cent considered computer technology a threat to their privacy.

Marketers: ‘What’s the big deal?’

Marketers often see privacy concerns as unnecessary and unreasonable. In their view, privacy is an obstacle to understanding customers and supplying better fitted products.

The existing market of personal data postulates that the gathering institution owns the data. Nevertheless, the attitude of data collectors and marketers towards privacy is significantly more moderate than 20 years ago, when marketers believed that there was ‘too much privacy already’. The reason for this change, apart from the fact that privacy is under much bigger threat now, is probably the fear of losing the trust of customers and massive public opposition. Many data owners acknowledge that there is a ‘Big Brother’ aspect to the exploitation of personal data sets, and take some measures to preserve the customers’ trust. Others imply that the ‘sinister purpose’ of data mining is the ‘product of junk science and journalistic excess’, but nevertheless believe that ‘marketers should take a pro-active stance and work to diffuse the issue before it becomes a major problem’.[2]

University researchers: ‘How can we carry out research based on facts?’

Researchers feel that privacy regulations enforce inconsistent restrictions on data exploration, and, in some cases, ruin the data.

Personal data is placed on large online networked databases, such as the Physician Computer Network in the US, with the intention of building and expanding knowledge. Data is necessary for informed decision-making in the public and private sector. How could planning decisions be taken if census data was not collected? How could epidemics be understood if medical records were not analysed? Individuals benefit from data collection efforts via the process of building knowledge that guides society. The protection of privacy cannot be achieved simply by restricting data collection or restricting the use of computer and networking technology.

However, scholars from diverse backgrounds in history, sociology, business and political science have concluded that the existing privacy laws are far behind developments in information technology and do not protect privacy well. Only 24 countries have adopted, in varying degrees, the recent OECD Principles on Data Collection. Twelve nations have adopted all of the OECD’s principles in statutory law. Australia, Canada, New Zealand and the US do not protect personal data handled by private corporations. In Australia, the Privacy Act 1988 (Cth) predates increases in online purchasing and other massive networked data collection mechanisms. Australia’s Privacy Commissioner, Moira Scollay, has taken steps to simplify privacy regulations and provide a single national framework for data matching systems such as Fly-Buys cards. However, she has so far only released for discussion a set of principles for the fair handling of personal information.

KDDM experts offer opposing opinions. Some believe that KDDM is not a threat to privacy, since the derived knowledge is only about and from groups. Others clearly oppose this view, arguing that KDDM deals mainly with huge amounts of microdata. Some fear different academic standards: ‘Statutory limitations that vary from country to country ... suggest that the practice ... varies from country to country’.[3] Europe has adopted the OECD directives and investigators across all fields of scholarly research now require ‘the subject’s written consent or data may not be processed’. The new privacy laws in Germany have dramatically reduced the number of variables in the census and the micro census. Some think that subject approval may not be sufficient for data miners to refer to or disclose incidentally discovered patterns.

Where do these views coincide?

Today, individuals, marketers and researchers concur that the protection of privacy is urgent. Individuals want recognition that they should have control over records containing information about themselves. Marketers want to avoid legal consequences, higher costs and negative public reaction. Researchers want clarity and consistency in regulations. Eventually, a mutually agreeable privacy policy will emerge, but it is unclear whether there are any mechanisms which could enforce it.

Privacy issues in KDDM

In the context of KDDM, two privacy issues arise.

Issue 1

KDDM poses a threat to privacy, in the sense that discovered patterns classify individuals into categories, thereby revealing confidential personal information with some probability. Moreover, such patterns may lead to the generation of stereotypes, raising very sensitive and controversial issues, especially if they involve attributes such as race, gender or religion. An example is the debate about studies of intelligence across different races.

Issue 2

The exploratory KDDM tools may correlate and disclose confidential, sensitive facts about individuals. For instance, a central task in KDDM is inductive learning; this takes as input a training data set and produces as output a model (called a classifier) which is then applied to new, unseen cases to predict some important and perhaps confidential attribute (for example, customer buying power or medical diagnosis). The classifiers are typically very accurate when applied to cases from the training set, and can potentially be used to compromise the confidential properties of these cases.
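The risk described above can be sketched with a deliberately simple classifier. The example below is an illustration only: the 1-nearest-neighbour method, the function name `nearest_neighbour_classifier` and the three training records are all assumptions made for this sketch and do not come from any real data set or from a particular KDDM tool. Because this kind of classifier stores its training cases, anyone who can query it with the known attributes of a training case gets back that case's confidential label.

```python
# Invented training records: (age, postcode, purchases per month) -> confidential label.
training = [
    ((34, 2300, 12), "high buying power"),
    ((51, 2308, 3), "low buying power"),
    ((27, 3800, 7), "medium buying power"),
]

def nearest_neighbour_classifier(training_set):
    """Build a 1-nearest-neighbour classifier that keeps the training set verbatim."""
    def predict(features):
        def squared_distance(record):
            stored_features, _ = record
            return sum((a - b) ** 2 for a, b in zip(stored_features, features))
        # The label of the closest stored case is returned as the prediction.
        _, label = min(training_set, key=squared_distance)
        return label
    return predict

classify = nearest_neighbour_classifier(training)

# Querying with the attributes of a person known to be in the training set
# returns that person's stored (confidential) label.
leaked = classify((51, 2308, 3))
print(leaked)
```

Other learning methods memorise their training data less directly, so the degree of exposure varies from one method to another.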

Also, knowledge of totals and other similar facts about the training data may be correlated to facilitate compromising individual values, either with certainty or with a high probability. For example, consider a data set where:

there are 10 people, two females and eight males,
there are eight cases of disease A, and
none of the females has disease A.

If it is known that Mr Brown’s information is part of the data, it is possible to infer that Mr Brown has disease A.
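The arithmetic behind this inference can be written out as a short sketch. The function name `male_disease_probability` and its parameters are hypothetical; the figures are those of the example above, plus one variant (seven cases instead of eight) added here to show the ‘high probability’ situation.

```python
from fractions import Fraction

def male_disease_probability(n_people, n_females, n_cases, n_female_cases):
    """Probability that a given male in the data set has the disease,
    derived only from published totals."""
    n_males = n_people - n_females
    male_cases = n_cases - n_female_cases
    return Fraction(male_cases, n_males)

# Totals from the example: 10 people, 2 females, 8 cases, no female cases.
p_brown = male_disease_probability(10, 2, 8, 0)
print(p_brown)  # 1, so every male, including Mr Brown, has disease A

# Assumed variant: with 7 cases the disclosure is probabilistic rather than certain.
p_variant = male_disease_probability(10, 2, 7, 0)
print(p_variant)  # 7/8
```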

While the first issue falls in the sociological, anthropological and legal domain, the second is a technical issue. The technical problems were anticipated in the early 1980s,[4] well before widespread acceptance of KDDM. Despite this fact, and the apparent interest in this issue in the business and marketing community, little has been done in terms of finding a technical solution to the problem. Approaches for privacy in KDDM have only recently been considered, and none has yet been seriously applied to KDDM. All the privacy protection methods proposed for KDDM are well known and applied in the context of statistical databases. There, methods have been developed to guard against the disclosure of individual data while satisfying requests for aggregate statistical information. Removing identifiers such as names, addresses, telephone numbers and social security numbers is a minimum requirement but is insufficient to ensure privacy. Re-identification based on remaining fields may still be possible, and so removing identifiers should never be used on its own. As a simple example, early geographical analysis of medical records replaced personal addresses by latitude and longitude co-ordinates. Today, electronic city maps allow individual homes to be identified.
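Re-identification from remaining fields can be sketched as a simple record linkage. Everything in the example below is invented for illustration: the two small tables, the choice of postcode, birth year and sex as linking fields, and the function name `reidentify`. It is a minimal sketch of the general idea, not a description of any specific attack or tool.

```python
# A "de-identified" medical table: names removed, other fields kept.
deidentified_medical = [
    {"postcode": "2300", "birth_year": 1961, "sex": "M", "diagnosis": "disease A"},
    {"postcode": "2300", "birth_year": 1975, "sex": "F", "diagnosis": "none"},
    {"postcode": "3800", "birth_year": 1961, "sex": "M", "diagnosis": "disease B"},
]

# A separate, publicly available list that still carries names.
public_register = [
    {"name": "Mr Brown", "postcode": "2300", "birth_year": 1961, "sex": "M"},
    {"name": "Ms Green", "postcode": "2300", "birth_year": 1975, "sex": "F"},
]

LINKING_FIELDS = ("postcode", "birth_year", "sex")

def reidentify(medical, register, keys=LINKING_FIELDS):
    """Attach a diagnosis to each named person whose linking fields
    match exactly one record in the de-identified table."""
    matches = {}
    for person in register:
        candidates = [m for m in medical if all(m[k] == person[k] for k in keys)]
        if len(candidates) == 1:  # a unique match is a re-identification
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches

linked = reidentify(deidentified_medical, public_register)
print(linked)
```

The fewer people who share a combination of remaining fields, the more likely a unique match becomes, which is why removing names alone offers little protection.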

Traditional methods in database security do not solve these problems; for example, with KDDM, the possibility of identifying specific patterns that significantly narrow possibilities arises. Finding associations about buyers of milk near a needle exchange program is the kind of inference that might point to an infringement of privacy. Or, in the Schaeffer murder example, perhaps analysis of make-up purchases may allow inferences about young actresses in LA so that it is feasible to visit all highly likely addresses.

The technical challenge is to provide security mechanisms for protecting the confidentiality of individual information used for knowledge discovery and data mining. More specifically:

we need to develop techniques for replacing original data with data that approximately exhibits the same general patterns, but conceals sensitive information; and
we need to develop mechanisms that will enable data owners to choose an appropriate balance between privacy and precision in discovered patterns; that is, new methods that balance the level of privacy and the plausibility of generated hypotheses.

Such techniques and mechanisms can lead to new privacy control systems to convert a given data set into a new one in such a way as to preserve the general patterns from the original data set. This will allow for the choice of balance between privacy and the precision of general patterns.
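One way to picture such a trade-off is additive random noise, a technique chosen here purely for illustration; the article does not prescribe it. In the sketch below the function name `perturb`, the `noise_scale` parameter and the income figures are all assumptions. A larger `noise_scale` changes individual values more (more concealment) but also moves aggregate patterns such as the mean further from the original (less precision).

```python
import random
import statistics

def perturb(values, noise_scale, seed=0):
    """Return a copy of values with zero-mean Gaussian noise added to each entry.
    A fixed seed keeps the illustration repeatable."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_scale) for v in values]

# Invented incomes standing in for a confidential attribute.
incomes = [42_000, 55_000, 61_000, 38_000, 70_000, 49_000, 58_000, 66_000]

# The data owner picks a point on the privacy/precision scale via noise_scale.
for scale in (0, 1_000, 10_000):
    released = perturb(incomes, scale)
    mean_shift = abs(statistics.mean(released) - statistics.mean(incomes))
    largest_change = max(abs(r - o) for r, o in zip(released, incomes))
    print(f"noise_scale={scale}: mean shifted by {mean_shift:.0f}, "
          f"largest single change {largest_change:.0f}")
```

Noise of this kind does not by itself guarantee privacy; it only shows how a single parameter could let a data owner trade concealment against the precision of general patterns.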

Vladimir Estivill-Castro, Ljiljana Brankovic and David L Dowe.

This paper was first published in the Official Journal of the Australian Computer Society (NSW Branch), Volume 35 no 7, August 1999, and is reprinted here by kind permission of the authors and editors. The authors are at Newcastle University (Estivill-Castro & Brankovic) and Monash University (Dowe).

[1] Harris J, ‘An open letter to my friends in direct marketing’ Target Marketing, p 44, 1990, as cited in Culnan M J, ‘How did they get my name? An exploratory investigation of consumer attitudes toward secondary information use’ (1993) MIS Quarterly September pp 341-61.

[2] Peacock P R, ‘Data mining in marketing: part 2’ (1998) 7 (1) Marketing Management pp 15-25.

[3] O’Leary D E, ‘Some privacy issues in knowledge discovery: the OECD Personal Privacy Guidelines’, 10 (2) IEEE Expert pp 48-52.

[4] Trueblood R P, ‘Security issues in knowledge systems’ Proceedings of the 1st International Workshop on Expert Systems pp 834-940.

URL: http://www.austlii.edu.au/au/journals/PrivLawPRpr/1999/44.html