Data Mining Techniques
There are many different applications of data mining, although most applications fit into known, well-defined and scientific techniques.
Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different algorithmic tasks. In most of these analytic problems, the business need is to find “correlations” or “patterns” between a particular variable describing an individual and other variables pertaining to that same individual.
For example, in historical data we may want to know which customers defected from the company after their contracts expired. Therefore, we have …
. The target variable in this case is “defection” or “churn”.
. We want to find out which variable or set of variables is correlated with this pattern of defection. For example, is it “age”, “sex”, “income bracket”, “tenure”, etc., or is it more likely a combination of those variables?
One of the fundamental ideas of data mining is finding or selecting important, informative attributes or “variables” (age, sex, income, tenure, etc.), also called “independent variables” or “predictors”, of entities within the data that have an impact on or correlation with a given “target attribute” (defection), also called the “dependent variable”. Informative means containing information.
Information is a quantity (or quality) that reduces the uncertainty about something. The better the information, the more uncertainty it reduces. Finding this correlation could help the company alter its future action such as customer selling strategies in order to eliminate or reduce customer defection, and thereby improve overall company profitability.
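One common way to put a number on "reduces uncertainty" is Shannon entropy. The text does not name a specific measure, so the use of entropy here is an assumption, and the small customer table is made up for illustration. The sketch measures how much knowing one attribute reduces the uncertainty about the target "churn".

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy after splitting the rows on one attribute."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Hypothetical historical records (not from the source).
customers = [
    {"tenure": "short", "sex": "F", "churn": "yes"},
    {"tenure": "short", "sex": "M", "churn": "yes"},
    {"tenure": "long", "sex": "F", "churn": "no"},
    {"tenure": "long", "sex": "M", "churn": "no"},
]

gain_tenure = information_gain(customers, "tenure", "churn")  # tenure separates churners fully
gain_sex = information_gain(customers, "sex", "churn")        # sex tells us nothing here
```

In this toy table "tenure" is the informative attribute and "sex" is not, which is the distinction the paragraph above describes.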
Some of the well-known data mining techniques are:
. Naïve Bayes Classifier
. Regression
. Classification and Segmentation
. Clustering
. Association
. Similarity Matching
. etc.
Terminology
Data mining has many terms that mean the same thing. This has come about as a result of having a portion of many similar disciplines such as statistics, operational research, machine learning, artificial intelligence, database technology, pattern recognition, and other disciplines converge into what is now called data mining.
The following terms all mean the same:
. A Dataset, a File, a Table in a database (or a database/DW query), a Worksheet in Excel.
This is the set of data to be examined or data mined. It could be coming from a flat dataset, from a table in a database, from a query of a database (perhaps from multiple tables), or from a query of a data warehouse or data mart dimension and fact tables, or from a worksheet, etc.
. An Instance, a Record, a Row, a Data Point, a Feature Vector, a Tuple, a Case.
This is a single record of a dataset, or a single row from a database table or query. It is all the data pertaining to one transactional event. Be aware that a data point is not a single piece of data. Rather, it is all the data relevant to an individual transaction.
. A Variable, an Attribute, a Property, a Field, a Column in a database table.
This is a single variable or a single piece of information. An attribute could be a predictor attribute, a target attribute, or neither.
. A Predictor variable or attribute, an Independent variable, an Explanatory variable.
This is a variable that can predict (to some degree) the outcome or target variable.
. A Target variable or attribute, a Dependent variable, an Outcome variable.
This outcome variable is to be predicted. For past instances, this attribute exists. For future instances, this is the variable to be predicted.
1. Prediction (Naïve Bayes classifier):
The Naïve Bayes classifier is based on Bayesian theory, where the probability of occurrence of an event is computed based on an associated (or related) event.
Unlike simple probability, which relies on computing the frequency of occurrence of an event by taking the ratio of “positive” instances to all instances, Bayesian probability (based on Thomas Bayes' theorem) is a probability formula for determining the frequency of occurrence of an event, given the frequency (or some knowledge) of another related event.
The Bayes theorem formula is:
P(A|B) = P(A) x P(B|A) / P(B)
where
A – the event of interest (the target variable)
B – the related event (the independent variable)
P(A|B) – probability of A given B is true
P(B|A) – probability of B given A is true
P(A), P(B) – probability of A or B independently
Example 1
What is the probability of fire if we see smoke?
. Assume that fires are rare: they happen in only 1% of cases.
. Assume that smoke is more common: it occurs in about 15% of cases (cooking, construction, etc.).
. Assume that the probability of smoke given fire (i.e. fire generating smoke) is 90%.
Using the formula
P(A|B) = P(A) x P(B|A) / P(B)
where
A – fire (the target variable)
B – smoke (the independent variable)
P(A|B) – probability of fire given smoke
P(B|A) – probability of smoke given fire
P(A), P(B) – probability of fire or smoke independently
P(fire|smoke) = P(fire) x P(smoke|fire) / P(smoke) = (.01 x .90) / .15 = 6%
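The calculation in Example 1 can be sketched as a small function. The function and parameter names are illustrative choices and do not come from the source.

```python
def bayes(p_a, p_b_given_a, p_b):
    """Return P(A|B) = P(A) x P(B|A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a * p_b_given_a / p_b

# Example 1: A = fire (target), B = smoke (independent variable).
p_fire_given_smoke = bayes(p_a=0.01, p_b_given_a=0.90, p_b=0.15)
print(f"P(fire|smoke) = {p_fire_given_smoke:.0%}")  # prints "P(fire|smoke) = 6%"
```

Even though fire almost always produces smoke, seeing smoke implies only a 6% chance of fire, because smoke on its own is far more common than fire.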
Example 2
75% of the children in schools have a dog, and 30% have a cat. Of the children that have a cat, 60% also have a dog.
P(dog) = 75%, P(cat) = 30%, P(dog|cat) = 60%
What is the probability that if I have a dog, I also have a cat?
A – having a cat (the target variable)
B – having a dog (the independent variable)
P(cat|dog) = P(cat) x P(dog|cat) / P(dog) = (.30 x .60) / .75 = 24%
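One way to sanity-check Example 2 is to count heads in a population that matches the stated percentages. The school of 100 children below is a hypothetical assumption chosen to make the percentages whole numbers.

```python
# Hypothetical school of 100 children matching the percentages in Example 2.
children = 100
with_dog = 75                              # P(dog) = 75%
with_cat = 30                              # P(cat) = 30%
with_cat_and_dog = with_cat * 60 // 100    # 60% of cat owners also have a dog

# Direct count: among the dog owners, the fraction that also has a cat.
p_cat_given_dog = with_cat_and_dog / with_dog

# The same quantity from Bayes' theorem: P(cat) x P(dog|cat) / P(dog).
p_bayes = (with_cat / children) * 0.60 / (with_dog / children)
```

Both routes give 18 of 75 dog owners, i.e. 24%, which shows that the theorem is a shortcut for this kind of counting.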
2. Regression:
Regression is a statistical process for estimating a “target variable” given one or more independent variables. Regression analysis requires a numerical target variable, and the independent variables must be numerical or encoded as numbers. Regression is widely used for prediction and forecasting. It is the act of predicting continuous numerical values.
The best-known example of regression analysis is linear regression. This is where known/historic data points are plotted on an x and y axis. The idea is to derive a function (in linear regression it is a line) that will help predict (or at least provide a good estimate of) a target value, given a new independent data point.
The goal is to find the “best fit” line, where the delta (difference) between the actual value and the predicted value is minimized. The best regression line produces the smallest number when summing the squares of the deltas: Σ(actual – predicted)².
[Figure: Linear regression. X-axis: Annual Income (predictor). The prediction model is the line drawn across the many data points: the line that minimizes the total of the squares of the distances from the actual data points to the prediction line.]
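The "sum of the squares of the deltas" criterion can be illustrated by scoring two candidate lines against the same data. The data points and both candidate lines below are made up for illustration.

```python
def sum_squared_errors(points, intercept, slope):
    """Sum of (actual - predicted)^2 for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

# Hypothetical (x, y) data points.
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

sse_candidate_1 = sum_squared_errors(points, intercept=2.2, slope=0.6)
sse_candidate_2 = sum_squared_errors(points, intercept=0.0, slope=1.0)

# The candidate with the smaller total is the better fit to these points.
better = "candidate 1" if sse_candidate_1 < sse_candidate_2 else "candidate 2"
```

Squaring the deltas keeps positive and negative errors from cancelling each other and penalizes large misses more heavily than small ones.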
The formula for linear regression for X and Y data points is:
Y' = a + bX
( Y' = intercept + slope * X )
where: Y' is the predicted value for Y
a is the intercept of the regression line with the Y axis
b is the slope of the regression line
X is the actual value for X
The formula for a (the regression line intercept with Y) is:
a = [ (Σy)(Σx²) – (Σx)(Σxy) ] / [ n(Σx²) – (Σx)² ]
where: a is the intercept of the regression line with the Y axis
Σ is the sum of …
n is the number of data points
The formula for b (the slope of the regression line) is:
b = [ n(Σxy) – (Σx)(Σy) ] / [ n(Σx²) – (Σx)² ]
where: b is the slope of the regression line
Σ is the sum of …
n is the number of data points
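The two summation formulas for a and b translate almost line by line into code. This is a minimal sketch: the function name and the sample data points are assumptions for illustration, not from the source.

```python
def fit_line(points):
    """Return (a, b) for Y' = a + bX using the summation formulas."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator   # intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator        # slope
    return a, b

# Hypothetical (x, y) data points.
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
a, b = fit_line(points)

# Predict Y' for a new independent value X = 6.
predicted = a + b * 6
```

For these points the sums are Σx = 15, Σy = 20, Σxy = 66 and Σx² = 55, which gives a = 2.2 and b = 0.6, so the prediction for X = 6 is about 5.8.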
3. Classification and Segmentation:
Classification, or “class probability estimation” (also called segmentation), attempts to predict which of a set of known classes or segments an individual in a population belongs to.
By knowing which class (or segment) an individual belongs to, we can predict an outcome or behavior of a future instance based on previously known outcomes of similarly classified individuals.
For classification algorithms, a model is created that predicts which of a number of known classes a new individual will belong to. This could be done via decision trees.
Classification is considered a “supervised” data mining exercise.
One of the best ways of performing data mining is to segment the population into different groups with respect to one or many attributes. An attribute is a property having some quantity or quality. Examples: “income”, “age”, “race”, “sex”, “education level”, “home ownership”, “geography”, etc. Our job is to keep segmenting the population set until each segment is as pure as possible.
With “scoring” or “class probability” estimation, the model assigns each individual a score representing the probability (or some other measure of the likelihood) of that individual belonging to a particular segment or class.
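A single segmentation step and the resulting class probability scores can be sketched as follows. The training records, the "tenure" threshold of 10, and the function names are all hypothetical. A real decision tree would choose the attribute and threshold itself and keep splitting.

```python
def split(rows, attribute, threshold):
    """Segment rows into (below threshold, at or above threshold)."""
    low = [r for r in rows if r[attribute] < threshold]
    high = [r for r in rows if r[attribute] >= threshold]
    return low, high

def class_probability(segment, target="churn"):
    """Fraction of a segment whose target is True (the segment's score)."""
    if not segment:
        return None
    return sum(1 for row in segment if row[target]) / len(segment)

# Hypothetical pre-classified training set (tenure in months).
customers = [
    {"tenure": 3, "churn": True},
    {"tenure": 5, "churn": True},
    {"tenure": 8, "churn": True},
    {"tenure": 10, "churn": False},
    {"tenure": 14, "churn": False},
    {"tenure": 30, "churn": False},
    {"tenure": 36, "churn": False},
    {"tenure": 40, "churn": True},
]

short_tenure, long_tenure = split(customers, "tenure", threshold=10)
score_short = class_probability(short_tenure)   # pure segment: everyone churned
score_long = class_probability(long_tenure)     # mostly non-churners, not pure

def predict(new_customer):
    """Score a new individual with the churn probability of its segment."""
    return score_short if new_customer["tenure"] < 10 else score_long
```

The short-tenure segment is pure, while the long-tenure segment still mixes both outcomes and would be a candidate for further segmentation on another attribute.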
4. Clustering:
Clustering is similar to classification or segmentation. The difference is that the clustering process does not have pre-determined target groups. The algorithm tries to find relationships or common attributes (patterns) within the data by which to group, or cluster, the set of instances or individuals, without being given a training set (historical data) with pre-classified outcomes or targets.
Clustering is considered “unsupervised” data mining, which generally makes it more difficult, since there are no known outcomes against which to check the resulting groups.
Classification or segmentation, on the other hand, is given a set of known groups/classifications (a.k.a. target groups) as part of a training set, and the classification model tries to predict which of the groups/classes a new individual belongs to based on that pre-classified training set.
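The text does not name a particular clustering algorithm. As an assumption, the sketch below uses k-means, one widely used method, on a single numeric attribute. The ages and the starting centers are made up. No target labels are supplied, and the groups emerge from the data alone.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers.

    Returns (final centers, list of clusters).
    """
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, clusters

# Hypothetical customer ages with no pre-classified outcome.
ages = [22, 25, 27, 61, 64, 68]
centers, clusters = k_means_1d(ages, centers=[20, 30])
```

With these inputs the procedure settles on a younger group and an older group. Unlike classification, nothing in the data says in advance that those are the "right" groups.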
5. Association:
Association (or co-occurrence grouping) attempts to find associations between entities (e.g. products) based on historical transactions. “Market-Basket Analysis” is a classic case of association.
Market-basket analysis attempts to answer the question “What items are commonly purchased together?” in the same basket (e.g. shopping cart).
Segmentation and clustering look at the similarity between objects based on the various object variables or attributes, and attempt to group those objects based on those similarities. Association, by contrast, looks at similarities of objects based on their appearing together in the same transaction.
Association is a data mining technique widely exploited by selling organizations to perform cross-selling. Cross-selling is the act of product recommendation: recommending a second product that is often purchased along with the product you are interested in. This recommendation is often made after you decide to purchase the first product. Examples: game console and game software, electronic/electrical machines and multi-year product protection warranty, printer and printer ink, laptop and laptop bag, etc.
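A minimal sketch of market-basket analysis counts how often each pair of items appears in the same transaction and recommends the most frequent partner. The baskets and function names are hypothetical. Production systems typically use dedicated algorithms and additional measures, which are not covered here.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def recommend(product, counts):
    """Item most often bought together with `product`, or None."""
    best_item, best_count = None, 0
    for (a, b), n in counts.items():
        if product in (a, b) and n > best_count:
            best_item = b if a == product else a
            best_count = n
    return best_item

# Hypothetical historical transactions.
baskets = [
    ["printer", "printer ink", "paper"],
    ["printer", "printer ink"],
    ["laptop", "laptop bag"],
    ["laptop", "laptop bag", "printer ink"],
    ["game console", "game software"],
]

counts = pair_counts(baskets)
suggestion_for_printer = recommend("printer", counts)
suggestion_for_laptop = recommend("laptop", counts)
```

Note that the items' own attributes never enter the calculation. Only their appearing together in a basket matters.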
6. Similarity Matching:
Similarity matching attempts to identify similar individuals (or organizations) based on data known about them.
Data about the individuals from internally assembled sources can be combined with lifestyle segmentation data obtained from external sources, such as PRIZM® data from Claritas, Personicx® from Acxiom, Mosaic® from Experian, or census data from the federal government, to create customer profiles.
These profiles are often based on
- demographic (age, race, sex, marital status, education, occupation, income, family size, religion, etc.),
- geographic (country, state, county, city, zip code, community, urban/rural, etc.),
- behavioral (brand awareness, brand loyalty, price sensitivity, shopping experience, usage rate, etc.),
- media usage (TV, radio, theater, social media, internet usage, internet searches, books, magazines, etc.),
- interests (hobbies, social events, vacations, entertainment, club membership, recreation, food, etc.),
- personality (achiever, emulator, belonger, savior, doomsdayer, survivalist, philanthropist, etc.).
Similarity matching is the basis for one of the most popular methods for product recommendation.
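Once profiles are expressed as numbers, "similar" can be made concrete with a distance measure. The text does not specify one, so Euclidean distance is an assumption here, as are the three-attribute profiles and their scaling to a 0 to 1 range.

```python
import math

def distance(p, q):
    """Euclidean distance between two equal-length numeric profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def most_similar(target, profiles):
    """Name of the known profile closest to `target`."""
    return min(profiles, key=lambda name: distance(target, profiles[name]))

# Hypothetical profiles: (age / 100, income bracket / 10, usage rate 0-1).
profiles = {
    "customer_a": (0.25, 0.3, 0.9),
    "customer_b": (0.60, 0.8, 0.2),
    "customer_c": (0.30, 0.4, 0.8),
}

new_customer = (0.28, 0.4, 0.85)
match = most_similar(new_customer, profiles)
```

A recommender built on this idea would suggest to the new customer the products that the closest existing customer bought. Scaling the attributes to comparable ranges matters, because otherwise the attribute with the largest raw numbers dominates the distance.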
7. Outlier Detection:
This data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. Most real-world datasets contain outliers. Outlier detection is valuable in numerous fields, such as network intrusion detection, credit or debit card fraud detection, and detecting outlying values in wireless sensor network data.
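The text does not commit to a specific definition of "diverges too much". One simple convention, used here as an assumption, is to flag values more than a chosen number of standard deviations from the mean (a z-score test). The transaction amounts and the threshold of 2.0 are made up for illustration.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card transaction amounts with one suspicious charge.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500]
outliers = z_score_outliers(amounts)
```

The single large charge is flagged while the ordinary amounts are not. A large outlier inflates the mean and standard deviation themselves, so on small datasets this simple test can miss milder outliers.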
8. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It consists of finding interesting subsequences in a set of sequences, where the interest of a subsequence can be measured in terms of different criteria such as length, occurrence frequency, etc. In other words, this technique helps to discover or recognize similar patterns in transaction data over a period of time.
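As a deliberately simplified sketch, the code below looks only at subsequences of length two (one item bought directly after another) and keeps those that occur in enough sequences. Restricting patterns to adjacent pairs, the purchase histories, and the minimum support of 2 are assumptions for illustration. Full sequential pattern mining handles longer and non-adjacent subsequences.

```python
from collections import Counter

def frequent_pairs(sequences, min_support):
    """Ordered pairs (a, b), with b directly after a, found in at least
    `min_support` sequences. Each sequence counts a pair at most once."""
    support = Counter()
    for seq in sequences:
        pairs_in_seq = {(seq[i], seq[i + 1]) for i in range(len(seq) - 1)}
        support.update(pairs_in_seq)
    return {pair: n for pair, n in support.items() if n >= min_support}

# Hypothetical purchase histories, one time-ordered list per customer.
purchases = [
    ["game console", "game software", "controller"],
    ["game console", "game software"],
    ["printer", "printer ink", "printer ink"],
    ["game software", "game console"],
]

patterns = frequent_pairs(purchases, min_support=2)
```

Order matters here, unlike in association: "game console then game software" is counted separately from "game software then game console", and only the former reaches the minimum support.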
2023-08-14