What is the definition of Data Science?
Data science is a set of fundamental principles that guide the extraction of knowledge from data.
What is the definition of Data Mining?
Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.
What is the definition of Data-Driven Decision-Making?
Data-Driven Decision-Making refers to the practice of basing decisions on the analysis of data, rather than purely on intuition.
Tasks in data mining:
Describe classification and class probability estimation task
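Classification attempts to predict, for each individual in a population, which of a set of classes that individual belongs to. Class probability estimation instead produces, for each individual, a score representing the probability that the individual belongs to each class.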
Describe regression task
Regression attempts to predict, for each individual, the numerical value of some variable for that individual. Example: “How much will a given customer use a service?”
Regression vs. Classification?
Classification predicts WHETHER something will happen, whereas regression predicts HOW MUCH something will happen.
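The contrast can be sketched with made-up numbers. The helpers `fit_line` and `fit_class_means` and both toy datasets are illustrative assumptions, not methods named in these notes; a one-feature least-squares line stands in for regression and a nearest-mean rule stands in for classification.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_amount(line, x):
    """Regression output: a number (HOW MUCH)."""
    intercept, slope = line
    return intercept + slope * x


def fit_class_means(xs, labels):
    """Nearest-mean classifier: store the average feature value per class."""
    grouped = {}
    for x, label in zip(xs, labels):
        grouped.setdefault(label, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


def predict_class(class_means, x):
    """Classification output: a class label (WHETHER)."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))


# Hypothetical data: months as a customer vs. monthly service usage.
usage_model = fit_line([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
# Hypothetical data: months as a customer vs. whether the customer left.
churn_model = fit_class_means(
    [1, 2, 3, 10, 11, 12], ["churn", "churn", "churn", "stay", "stay", "stay"]
)

print(predict_amount(usage_model, 6))  # a quantity
print(predict_class(churn_model, 4))   # a label
```

The two models can share the same input feature; what differs is the type of the target, numeric for regression and categorical for classification.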
Describe similarity matching task
Similarity matching attempts to IDENTIFY similar individuals based on data known about them. Example: finding companies that are similar to the ones you are serving.
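One common way to make "similar" concrete is a distance between feature vectors. The sketch below assumes Euclidean distance and two hypothetical, already-rescaled features per company; the notes do not prescribe a particular distance measure.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=2):
    """Return the names of the k candidates closest to the target."""
    ranked = sorted(candidates, key=lambda name: euclidean(target, candidates[name]))
    return ranked[:k]


# Hypothetical features: (company size, annual spend), both on a 0-10 scale.
prospects = {
    "A": (1.0, 1.0),
    "B": (5.0, 5.0),
    "C": (1.5, 0.5),
    "D": (9.0, 1.0),
}
current_customer = (1.0, 1.0)

print(most_similar(current_customer, prospects))
```

Features on very different scales would need rescaling first, otherwise the largest-valued feature dominates the distance.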
Describe clustering task
Clustering attempts to GROUP individuals in a population together by their similarity, but not driven by any specific purpose. Example: “Do our customers form natural groups or segments?”
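As a rough illustration, a simplified one-dimensional k-means groups values without any target variable. The algorithm choice, the starting centers, and the spending figures are all assumptions made for this sketch.

```python
def kmeans_1d(values, centers, iterations=10):
    """Simplified k-means on single numbers, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer.
spend = [1, 2, 3, 20, 21, 22]
centers, clusters = kmeans_1d(spend, centers=[0.0, 10.0])

print(centers)
print(clusters)
```

No label says which group is "right"; the segments emerge from the similarity of the values alone.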
Describe co-occurrence grouping task
Co-occurrence grouping attempts to find ASSOCIATIONS between entities based on transactions involving them. Example: “What items are commonly purchased together?”
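A minimal sketch of the idea counts how often each pair of items shows up in the same transaction. The baskets are invented, and `pair_counts` is a hypothetical helper rather than a full association-rule miner.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical purchase transactions.
baskets = [
    ["beer", "chips"],
    ["beer", "chips", "salsa"],
    ["chips", "salsa"],
    ["beer", "diapers"],
]
counts = pair_counts(baskets)

for pair, n in counts.most_common():
    print(pair, n)
```

Note that the items' own attributes never enter the calculation; only their joint appearance in transactions does.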
Clustering vs. co-occurrence?
While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
Describe profiling task
Profiling attempts to characterize the typical behavior of an individual, group, or population. Example: “What is the typical cell phone usage of this customer segment?”
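A very simple profile is a handful of summary statistics. The sketch below assumes that "typical" can be captured by the median, mean, and standard deviation of a hypothetical list of monthly minutes; real profiles are often much richer.

```python
import statistics


def usage_profile(minutes):
    """Summarize typical usage for one segment."""
    return {
        "median": statistics.median(minutes),
        "mean": statistics.mean(minutes),
        "stdev": statistics.stdev(minutes),  # sample standard deviation
    }


# Hypothetical monthly call minutes for customers in one segment.
segment_minutes = [100, 120, 110, 130, 140]
profile = usage_profile(segment_minutes)

print(profile)
```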
Describe link prediction task
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Example: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”
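The shared-friends example can be sketched by scoring each non-friend by the number of friends in common. The common-neighbors score and the tiny friendship graph are assumptions for illustration; other link-strength measures exist.

```python
def suggest_links(graph, person):
    """Rank non-friends of `person` by how many friends they share."""
    scores = {}
    for other, friends in graph.items():
        if other == person or other in graph[person]:
            continue  # skip the person themselves and existing friends
        shared = len(graph[person] & friends)
        if shared:
            scores[other] = shared  # link strength = number of shared friends
    # Strongest suggestions first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical friendship graph: name -> set of friends.
friends_of = {
    "you": {"ann", "bob", "cat"},
    "karen": {"ann", "bob", "dan"},
    "ann": {"you", "karen"},
    "bob": {"you", "karen"},
    "cat": {"you"},
    "dan": {"karen"},
}

print(suggest_links(friends_of, "you"))
```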
Describe data reduction task
Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).
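As a loose illustration of the movie example, a long viewing history can be collapsed into a few genre shares. This sketch assumes the movie-to-genre mapping is already known, which is a simplification: methods that find latent preferences infer such dimensions from the viewing data itself. All names and data here are hypothetical.

```python
from collections import Counter


def genre_profile(viewed_movies, movie_genre):
    """Reduce a list of viewed movies to the share of views per genre."""
    counts = Counter(movie_genre[movie] for movie in viewed_movies)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


# Hypothetical lookup table and viewing history.
movie_genre = {"m1": "comedy", "m2": "comedy", "m3": "drama", "m4": "comedy"}
history = ["m1", "m2", "m3", "m4"]
reduced = genre_profile(history, movie_genre)

print(reduced)
```

The reduced form loses detail (which comedies were watched) in exchange for a representation that is smaller and easier to work with.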
Describe causal modeling task
Causal modeling attempts to help us understand what events actually influence others. Example: “Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?” A business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
Conditions for supervised learning:
Define label
The value for the target variable for an individual.
Supervised vs. unsupervised tasks
Supervised:
Unsupervised:
Both:
Second stage of CRISP process - Data Understanding
Third stage of CRISP process - Data Preparation
The data preparation phase often proceeds along with data understanding. During preparation, the data are manipulated and converted into forms that yield better results.
Define data leak
A data leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.
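One simple guard is to record when each variable becomes known and drop anything that only exists after the decision point. The feature names and day offsets below are hypothetical, and `usable_features` is an illustrative helper, not a standard API.

```python
def usable_features(available_on_day, decision_day):
    """Keep only features whose values exist on or before the decision day."""
    return sorted(
        name for name, day in available_on_day.items() if day <= decision_day
    )


# Hypothetical churn data: the day (relative to the decision at day 0)
# on which each variable first becomes known.
available_on_day = {
    "monthly_usage": -30,
    "support_calls": -7,
    "cancellation_reason": 14,  # recorded only after the customer has left
}

print(usable_features(available_on_day, decision_day=0))
```

A variable like `cancellation_reason` would look highly predictive in historical data precisely because it is a consequence of the target, which is what makes it a leak.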
Fifth stage of CRISP process - Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on.
Sixth stage of CRISP process - Deployment
In deployment the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
Data Mining vs. Software Development
Data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.