Data Mining

What Is Data Mining?

    Data mining is the practice of examining a very large amount of data to find patterns in it. Examples include a software company studying how users move through its program, a retail store looking at which items are purchased together, or a financial firm examining how much money people invest at particular ages. There are three main types of data mining: cluster analysis, which groups similar records to reveal new correlations; anomaly detection, which finds unusual records or correlations that you would never expect; and association rule mining, which shows that some things depend on each other. Without data mining, companies would have very large databases of information with no meaning attached. Analyzing this information lets companies find new marketing techniques based on what their customers like. It makes sense to send a young teen a coupon for the new clothing brand coming to Target, but not a coupon for denture cream. While this correlation may seem obvious, good data mining can also reveal less obvious ones, such as which age groups are most apt to buy the items you sell. If only one company discovers a correlation, it can use its advertising to draw people into its store rather than its competitors’ stores.

The History of Data Mining

    Throughout the years, computers have been used to categorize and store data and to compute statistics. Beginning around the 1970s, data was stored in programs and data structures. As time progressed, storage capacity grew, and data mining came to be applied to very large amounts of data, such as terabytes and sometimes petabytes (a petabyte is a unit of information equal to one quadrillion bytes). This evolution in how data is stored and arranged has shifted the people who work with data from a procedural mindset to an analytical way of thinking.

    In the 1970s data mining began to touch on a developing field called “artificial intelligence” (AI), which is “the science and engineering of making intelligent machines” (McCarthy). Working in artificial intelligence, scientists developed algorithms, which are sets of rules to be followed in a calculation, and used them to enable machines to learn. As time progressed, machines were challenged to retain even larger amounts of data. In developing these algorithms, scientists would not form a hypothesis until the test was concluded, because the model was constructed to determine the relationship between the data’s attributes and their class. This made it a “test and hypothesize” process.

    Data mining did not become significant until the 1990s, when companies and businesses began using algorithms to handle their data and increase productivity. They used data mining to find systematic behaviors in all aspects of the economy. It is even said that the Financial Crimes Enforcement Network used complicated algorithms to design a program that helped locate more than 1 trillion dollars’ worth of laundered money. Data mining had come into its own as a way to examine data and its patterns.

    Data mining is a union of several kinds of technology, including data management, statistics, machine learning, and visualization. These work in unison to find attributes that correlate across sets of data. They also work together to section the data into clusters, identify outliers, find patterns, clean up the data, and perform many other tasks.

    Though data mining originated around the 1970s, it has really flourished within the last 10 to 15 years. Companies now use it to attract more customers, and they even use it to advertise their products by analyzing the merchandise that people buy.

Knowledge Discovery in Databases

    Data mining is one of several steps in a process in which data is narrowed, transformed, interpreted, and otherwise manipulated to discover interesting relationships. This process is known as Knowledge Discovery in Databases (KDD). The steps generally include selection, pre-processing, transformation, data mining, and interpretation. Each has its own distinct purpose and challenges.

    Selection is the process by which the initial data set is gathered and filtered. For instance, if only a certain group of people is being studied, only the information related to people who fall into that group is used. As the adage goes, “what does that have to do with the price of tea in China?” When looking at a set of data, some of the other data available simply isn’t relevant.
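
    As a rough illustration of selection, the sketch below filters a small, made-up set of records down to one group and to the fields of interest. The record layout, the field names, and the `select` helper are assumptions invented for this example.

```python
# Hypothetical customer records; the field names are assumptions for illustration.
records = [
    {"name": "Ana", "age": 16, "city": "Tulsa"},
    {"name": "Ben", "age": 45, "city": "Omaha"},
    {"name": "Cal", "age": 17, "city": "Tulsa"},
]

def select(rows, predicate, fields):
    """Keep rows matching the predicate, and only the fields of interest."""
    return [{f: row[f] for f in fields} for row in rows if predicate(row)]

# Select only teenagers, and drop the city field as irrelevant to the question.
teens = select(records, lambda r: 13 <= r["age"] <= 19, ["name", "age"])
print(teens)
```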

    Pre-processing is a further, more complex filtering of the selected data set. Frequently its purpose is to “clean” the data by removing “dirty” or bad information. Dirty data is undesirable information that is generally incomplete or false. Sometimes human error is involved; other times someone lied. Someone who does not want to be found might supply erroneous information, or a data entry clerk might have accidentally pressed a key twice. Such information would unnecessarily skew the results of data mining and interpretation, so it is desirable to remove it. Sometimes, however, the goal is to find erroneous or false information, so the information filtered out need not be false or erroneous. It could just as easily be the opposite, so that information likely to be true is removed.
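
    One possible cleaning pass might look like the sketch below, which drops incomplete records, implausible values, and repeated entries. The sample data, the age limits, and the `clean` helper are assumptions chosen for illustration; real cleaning rules depend on the data set and the goal.

```python
# Hypothetical raw records containing typical "dirty" entries.
raw = [
    {"name": "Ana", "age": 16},
    {"name": "Ben", "age": 455},   # likely a key pressed twice
    {"name": "", "age": 30},       # incomplete: missing name
    {"name": "Ana", "age": 16},    # duplicate entry
    {"name": "Dee", "age": None},  # incomplete: missing age
]

def clean(rows, min_age=0, max_age=120):
    """Drop incomplete, implausible, and duplicate records."""
    seen = set()
    kept = []
    for row in rows:
        name, age = row.get("name"), row.get("age")
        if not name or age is None:
            continue  # incomplete record
        if not (min_age <= age <= max_age):
            continue  # value outside the plausible range
        key = (name, age)
        if key in seen:
            continue  # already kept an identical record
        seen.add(key)
        kept.append(row)
    return kept

cleaned = clean(raw)
print(cleaned)
```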

    Transformation is a restructuring of the data set to make further processing in data mining easier or more efficient. Most information in databases is kept in separate tables consisting of rows of un-nested, non-recursive information, meaning that the information has no sub-parts. Rows may contain references to other tables, and in this way nested or recursive information can be represented. Such references are often inconvenient and slow for some data mining algorithms, so the data is transformed into a more convenient, efficient, or manageable form.
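
    To make this concrete, the sketch below takes two flat tables, where each order row refers to a customer by id, and rebuilds them as one nested structure that a mining algorithm could walk directly. The table contents and the `baskets_by_customer` helper are hypothetical.

```python
# Two flat "tables": each order row references a customer by id.
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
orders = [
    {"customer_id": 1, "item": "diapers"},
    {"customer_id": 1, "item": "beer"},
    {"customer_id": 2, "item": "bread"},
]

def baskets_by_customer(customers, orders):
    """Resolve the cross-table references into one nested mapping."""
    names = {c["id"]: c["name"] for c in customers}
    baskets = {name: [] for name in names.values()}
    for order in orders:
        baskets[names[order["customer_id"]]].append(order["item"])
    return baskets

baskets = baskets_by_customer(customers, orders)
print(baskets)
```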

    Data mining is the step where connections in the data are actually found. Various algorithms are applied to show different kinds of relationships. Algorithms used in data mining include, but are not limited to, the following: anomaly detection, association rule learning, clustering, classification, regression, and summarization. Each of these algorithms attempts to find some kind of relationship in the data.

    Interpretation is the last step, where a human attempts to interpret the result in some way, either to determine why the particular result exists or to decide what action to take in response. If a stockbroker is found to be frequently making unusually large gains on trades placed near a company’s major decisions, you might decide to investigate for insider trading. If beer and diapers are found to be frequently bought together, you might place them next to each other along with some targeted advertisement.

Data Mining Algorithms

    Anomaly detection is the attempt to find outliers in data, or groups of data not clustered with the rest. It might be used to find insider trading, as mentioned above, or to find that a particular sales choice led to greater profits.
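
    One simple way to flag outliers, among many, is to measure how many standard deviations each value lies from the mean (a z-score). The sketch below assumes made-up trade profits and an arbitrary threshold of 2; neither comes from a real data set.

```python
import statistics

# Hypothetical profits from a series of trades; one is suspiciously large.
profits = [100, 110, 95, 105, 98, 102, 900]

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

outliers = find_outliers(profits)
print(outliers)
```

    With very small samples a single extreme value inflates the standard deviation, so the threshold usually needs tuning for the data at hand.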

    Association rule learning is the attempt to find rules by which multiple variables or groups of data are related. The classic beer and diapers example is a rule of this kind: “when diapers are bought, beer is bought”.
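
    Rules like this are commonly scored by support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for the diapers and beer rule over a handful of invented shopping baskets.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction also containing the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

    Here diapers and beer appear together in 2 of 5 baskets, and beer appears in 2 of the 3 baskets that contain diapers.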

    Clustering is the grouping of like items of data. If you plotted a set of points on a graph and noticed 3 or 4 groups of higher density, clustering would ideally group the points in each of those regions together. Clustering can, however, operate in an arbitrary number of dimensions, with arbitrary distance functions to measure the distance between points.
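
    The text does not name a particular clustering algorithm. As one illustration, the sketch below uses k-means with Euclidean distance on made-up two-dimensional points; the starting centroids are supplied by hand so that the result is repeatable.

```python
import math

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid to the mean of its points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Two visibly dense groups of hypothetical points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

    Swapping `math.dist` for another distance function changes what counts as "like", which is the flexibility described above.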

    Classification is much like clustering in that it attempts to group like information. The difference is that classification looks in the data set for known patterns, or “classes” (possibly ones found by clustering), in order to say “this item of data belongs to class X”.
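
    A minimal sketch of classification, assuming a nearest-neighbor rule: a new point receives the class of the closest example whose class is already known. The labeled examples and class names below are invented, and nearest-neighbor is only one of many classification methods.

```python
import math

# Hypothetical examples whose classes are already known.
labeled = [
    ((1, 1), "low spender"),
    ((2, 1), "low spender"),
    ((8, 9), "high spender"),
    ((9, 8), "high spender"),
]

def classify(point, labeled):
    """Return the class of the labeled example closest to the point."""
    _, label = min(labeled, key=lambda pair: math.dist(point, pair[0]))
    return label

print(classify((7, 7), labeled))
print(classify((1.5, 2), labeled))
```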

    Regression attempts to find a function that models the data and might be used as a predictor. This might be a line of best fit, a more complex polynomial, or even a model that isn’t easily written as a mathematical function.
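
    For the simplest case, a line of best fit, the slope and intercept can be computed with ordinary least squares. The sketch below uses made-up data that lies exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations lying on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Use the fitted line as a predictor for an unseen x value.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```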

    Summarization attempts to present the data in some useful way such that a human might have an easier time interpreting it. Graphical representations of data are a pervasive means by which this is accomplished.
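
    As a small illustration of summarization, the sketch below reduces a list of made-up ages to a few summary statistics and a text histogram. The bin width and the output layout are arbitrary choices for this example.

```python
from collections import Counter
import statistics

# Hypothetical customer ages.
ages = [16, 17, 17, 23, 25, 31, 34, 35, 62]

def summarize(values, bin_width=10):
    """Build a short report: count, mean, median, and one histogram row per bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = [f"n={len(values)} mean={statistics.mean(values):.1f} median={statistics.median(values)}"]
    for start in sorted(bins):
        lines.append(f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}")
    return "\n".join(lines)

report = summarize(ages)
print(report)
```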

Data Mining in the Real World

    Data mining has been around in some form for almost as long as there has been data, but in recent years the widespread use of the Internet and computers has created a new medium for companies, businesses, and researchers to achieve their goals.


    Companies use data mining to build a relationship with their customers. An example of this is Google advertising. Instead of sending the same generic advertisement to everyone who visits a webpage, algorithms track each user’s browsing history and show only advertisements that are relevant to that user.

Scientific Research

    The same information gathering techniques used by businesses to create more revenue can also be used for scientific research. Geneticists use Multifactor Dimensionality Reduction, an approach that detects and characterizes combinations of independent variables, to help understand human DNA sequences. These new understandings will help doctors diagnose, prevent, and treat cancer and other diseases in the future.

Medical Records

    Large hospital networks sometimes have over 3 million patients to keep track of and treat. With so many patients involved, some slip through the cracks. Whether it’s a misdiagnosis or accidentally prescribing someone the wrong drug, the effects can sometimes be deadly. SofTek, a medical database support company located in Kansas City, develops tools for hospitals around the country that mine their medical databases and look for unique patterns. These patterns typically help identify patients who were misdiagnosed and can help save their lives.

Works Cited

Alexander, Doug. "Data Mining." Data Mining. N.p., n.d. Web. 11 Dec. 2012.

"Anomaly Detection." Wikipedia. Wikimedia Foundation, 12 Sept. 2012. Web. 11 Dec. 2012.

"Association Rule Learning." Wikipedia. Wikimedia Foundation, 12 Sept. 2012. Web. 11 Dec. 2012.

"Cluster Analysis." Wikipedia. Wikimedia Foundation, 12 Sept. 2012. Web. 11 Dec. 2012.

Cunningham, Bryan. "Google's Data Mining Raises Questions of National Security." The Guardian. Guardian News and Media, 15 Oct. 2012. Web. 11 Dec. 2012.

"Data Mining (computer Science)." Encyclopedia Britannica Online. Encyclopedia Britannica, n.d. Web. 11 Dec. 2012.

"Data Mining." Wikipedia. Wikimedia Foundation, 12 Sept. 2012. Web. 11 Dec. 2012.

Desikan, Prasanna, Kuo-Wei Hsu, and Jaideep Srivastava. "Data Mining for Healthcare Management." Http:// 2011 Siam International Conference on Data Mining, 28 Apr. 2011. Web. 11 Dec. 2012.

Field, Abigail. "Is Data-Mining Free Speech? The Supreme Court Agrees to Decide a Crucial Case." Daily Finance, 11 Jan. 2011. Web. 11 Dec. 2012.

"FTC: Collecting and Selling Data Mined from Social Media Sites Covered by FCRA." McDermott Will & Emery. FTC, 25 June 2012. Web. 11 Dec. 2012.

Hall, Shane. "Examples Of Data Mining Vs. Traditional Marketing Research." Small Business. Chron, n.d. Web. 11 Dec. 2012.

Ho, Wing Kee, and Xiaohua Luan. "Data Mining." Data Mining. University of North Carolina, n.d. Web. 11 Dec. 2012.

Kessler, Michelle, and Byron Acohido. "Data Miners Dig a Little Deeper." USA Today, 11 July 2006. Web. 11 Dec. 2012.

Morris, Jason, and Ed Lavandera. "Why Big Companies Buy, Sell Your Data." CNN. CNN Tech, 23 Aug. 2012. Web. 11 Dec. 2012.

Patel, Neil. "10 Ways Data Mining Can Help You Get a Competitive Edge." 10 Ways Data Mining Can Help You Get a Competitive Edge. KISSmetrics, n.d. Web. 11 Dec. 2012.

"Scientists Question Terrorist-hunting Techniques." CNN. CNN, 07 Oct. 2008. Web. 11 Dec. 2012.

Singer, Natasha. "The Data-Mining Industry Kicks Off a Public Relations Campaign." Bits. The New York Times, 15 Oct. 2012. Web. 11 Dec. 2012.

Stein, Joel. "Breaking News, Analysis, Politics, Blogs, News Photos, Video, Tech Reviews." Time. Time, 10 Mar. 2011. Web. 11 Dec. 2012.