| Data mining, also known as knowledge discovery in databases (KDD), is the practice of automatically searching
large stores of data for patterns. To do this, data mining uses computational techniques
from statistics and pattern recognition.
Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information
from data" [1] and "The science of extracting useful information from large data sets or databases" [2]. Although it is usually
used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of
contexts.
A simple example of data mining is its use in a retail sales department. If a store tracks the purchases of a customer and
notices that the customer buys a lot of silk shirts, the data mining system will associate that customer with silk
shirts. The sales department can then look at that information and begin direct mail marketing of silk shirts to that customer. In
this case, the data mining system used by the retail store discovered information about the customer that was previously
unknown to the company.
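The retail example above can be illustrated with a minimal sketch. The purchase log, the customer names, and the threshold of three purchases are all hypothetical and chosen only for illustration; a real system would read from a sales database and would likely use a more careful measure than a raw count.

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (customer_id, product) pairs.
purchases = [
    ("alice", "silk shirt"), ("alice", "silk shirt"), ("alice", "socks"),
    ("alice", "silk shirt"), ("bob", "jeans"), ("bob", "silk shirt"),
    ("bob", "jeans"), ("carol", "socks"),
]


def frequent_purchases(log, min_count=3):
    """Return {customer: [products bought at least min_count times]}."""
    counts = defaultdict(Counter)
    for customer, product in log:
        counts[customer][product] += 1

    result = {}
    for customer, product_counts in counts.items():
        hits = sorted(p for p, n in product_counts.items() if n >= min_count)
        if hits:
            result[customer] = hits
    return result


print(frequent_purchases(purchases))  # {'alice': ['silk shirt']}
```

The output is the kind of customer-to-product association that a sales department could use for direct mail; lowering min_count produces more, but weaker, associations.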
Used in the technical context of data warehousing and analysis,
data mining is neutral. However, it sometimes has a more pejorative usage that implies imposing patterns (and particularly causal
relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlation is more
properly criticized as "data dredging" in the statistical literature.
Used in this latter sense, data dredging implies scanning the data for any relationships and then, when one is found, coming up
with an interesting explanation. (This is closely related to "overfitting the model".) The problem is that large data sets
invariably contain some striking relationships that are peculiar to that data and arise by chance. Therefore any conclusions
reached this way are likely to be highly suspect. In spite of this, some exploratory data work is always required in any applied
statistical analysis to get a feel for the data, so the line between good statistical practice and data dredging is sometimes
less than clear.
A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly
vulnerable to this. In his book Where Are the Customers' Yachts? (1940; ISBN 0471119792), Fred Schwed, Jr. wrote: "There
have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have
appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."
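The roulette-wheel effect can be reproduced with a short simulation. This is a sketch under stated assumptions: the variables are pure random noise, and the sizes (40 variables, 20 observations each) and the seed are arbitrary choices, not taken from any study. Searching all pairs of variables for the strongest correlation will usually turn up one that looks impressive, even though no real relationship exists.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def strongest_chance_correlation(n_vars=40, n_obs=20, seed=0):
    """Generate unrelated random columns and return the largest |r| over all pairs."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(a, b)) for a, b in itertools.combinations(columns, 2))


best = strongest_chance_correlation()
print(f"strongest correlation among unrelated variables: {best:.2f}")
```

With 40 variables there are 780 pairs to examine, so a large value here says more about the number of comparisons made than about the data.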
Most data mining efforts are focused on developing a finely grained, highly detailed model of some large data set. In Data Mining
For Very Busy People [3], researchers at West Virginia University and the University of British Columbia discuss
an alternative method that involves finding the minimal differences between elements in a data set, with the goal of developing
simpler models that represent relevant data.
There are also privacy concerns associated with data mining. For example, if an employer has access to medical records, it
may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut costs for insurance, but it
creates ethical and legal problems.
The mining of government or commercial data sets for national security or law enforcement purposes has also raised privacy
concerns [4].
There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could
be used to find combinations of drugs with adverse reactions. Since a reaction to a given combination may occur in only 1 out of
1000 people, a single case may not be apparent. A project involving pharmacies could reduce the number of drug reactions and
potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.
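As a rough illustration of the prescription example, the sketch below pools patient records and computes, for each pair of drugs taken together, how many patients took the pair and what fraction reported a reaction. The records, the drug names "A", "B" and "C", and the min_patients cut-off are hypothetical. A high rate for a pair is only a lead for further study, and it is subject to the data dredging caveats described earlier.

```python
from collections import Counter
from itertools import combinations


def pair_reaction_rates(records, min_patients=2):
    """records: iterable of (set_of_drugs, had_reaction).

    Returns {(drug1, drug2): (n_patients, reaction_rate)} for every pair
    taken together by at least min_patients patients.
    """
    totals, reactions = Counter(), Counter()
    for drugs, had_reaction in records:
        for pair in combinations(sorted(drugs), 2):
            totals[pair] += 1
            if had_reaction:
                reactions[pair] += 1
    return {
        pair: (n, reactions[pair] / n)
        for pair, n in totals.items()
        if n >= min_patients
    }


# Hypothetical pooled records: (drugs taken, adverse reaction reported?)
records = [
    ({"A", "B"}, True),
    ({"A", "B"}, True),
    ({"A", "C"}, False),
    ({"B", "C"}, False),
    ({"A", "B", "C"}, True),
    ({"A", "C"}, False),
]

for pair, (n, rate) in sorted(pair_reaction_rates(records).items()):
    print(pair, n, round(rate, 2))
```

Pooling matters because a single pharmacy may see each rare combination only once; the counts become informative only when many records are combined.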
In short, data mining yields information that would not otherwise be available, but that information must be properly
interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy,
legality, and ethics.
The Apriori algorithm, which finds sets of items that frequently occur together, is one of the most fundamental algorithms
used in data mining.
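A minimal sketch of the a priori (Apriori) algorithm for frequent itemsets follows. It assumes transactions are given as collections of item names and that support is measured as an absolute count; the basket data and the threshold are hypothetical. The key idea is that every subset of a frequent itemset must itself be frequent, so candidates of size k are built only from frequent itemsets of size k-1.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions for each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

for itemset, support in sorted(apriori(baskets, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With a minimum support of 3, this data yields four frequent single items and four frequent pairs; the triple of bread, milk and diapers appears in only two baskets and is rejected. Production implementations add association-rule generation and more efficient counting, which are omitted here.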
Since the early 1990s, oracles (also called tablebases) have become available for certain combinatorial games: for example,
3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess,
dots-and-boxes, and hex. This has opened up a new area for data mining: the extraction of human-usable strategies from these
oracles. The pattern recognition involved is at too high a level of abstraction for known statistical pattern recognition
algorithms, or any other algorithmic approach, to be applied; at least, no one knows how to do it yet (as of January 2005). The
method used instead is the full force of the scientific method: extensive experimentation with the tablebases and intensive
study of tablebase answers to well-designed problems, combined with knowledge of prior art (that is, pre-tablebase knowledge),
leading to flashes of insight. Berlekamp in dots-and-boxes and other games, and John Nunn in chess endgames, are notable
examples of people doing this work, though they were not and are not involved in tablebase generation.
History
Data mining grew as a direct consequence of the availability of large reservoirs of data. Data collection in digital form
had already begun in the 1960s, allowing retrospective data analysis by computer. Relational databases came into widespread
use in the 1980s, along with the Structured Query Language (SQL), allowing dynamic, on-demand analysis of
data. The 1990s saw explosive growth in the volume of data, and data warehouses
began to be used for storage. Data mining thus arose as a response to the challenges faced by the database community in
dealing with massive amounts of data, in applying statistical analysis to that data, and in applying search techniques from
artificial intelligence to these problems.
References
[1] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, Fall 1992,
pp. 213-228.
[2] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X.
[3] T. Menzies and Y. Hu, Data Mining For Very Busy People. IEEE Computer, October 2003, pp. 18-25.
[4] K. A. Taipale, Data Mining and Domestic Security: Connecting the Dots to Make Sense of
Data (http://www.stlr.org/cite.cgi?volume=5&article=2), Center for Advanced Studies in Science
and Technology Policy (http://www.advancedstudies.org/). 5 Colum. Sci. & Tech. L. Rev. 2 (December 2003).
[5] O. Maimon and M. Last, Knowledge Discovery and Data Mining – The Info-Fuzzy Network (IFN) Methodology. Kluwer
Academic Publishers, Massive Computing Series, 2000.
|