Which application areas can benefit from detecting
co-occurrence relationships?
Marketing: Identify items that are bought together for marketing purposes
Inventory Management: Identify parts that are often needed together for repairs to equip the repair vehicle
Usage Mining: Identify words that frequently appear together in search queries to offer auto-completion
Describe correlation analysis and name a technique that can be used for continuous variables and one for binary variables.
Continuous variables: Pearson's correlation coefficient
Binary variables: Phi coefficient
Value range:
+1: positive correlation
0: no correlation (uncorrelated)
-1: negative correlation
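A minimal sketch of both coefficients using only the standard library (the data points are made-up illustrations):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for two continuous variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def phi(xs, ys):
    """Phi coefficient for two binary (0/1) variables via the 2x2 table."""
    n11 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear -> ≈ 1.0
print(phi([1, 1, 0, 0], [0, 0, 1, 1]))      # opposite binaries -> -1.0
```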
What is the shortcoming with correlations between products in shopping baskets?
What is the benefit of association analysis compared to correlations?
What can association analysis not find?
What is an itemset?
- A collection of one or more items
- k-itemset: An itemset that contains k items
Define Support count
- σ(X): the number of transactions that contain the itemset X
Define Support
- s(X) = σ(X) / N: the fraction of all N transactions that contain X
Define frequent itemset
- An itemset whose support is greater than or equal to the minsup threshold
What is the difference between the rule evaluation metrics Support and Confidence?
X (Condition) -> Y (Consequent)
Support:
- fraction of transactions that contain both X and Y
Confidence:
- how often items in Y appear in transactions that contain X
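The two metrics can be sketched over a toy basket dataset (transactions and item names are made up for illustration):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """How often Y appears in transactions that already contain X."""
    return support(x | y) / support(x)

# Rule {milk, diapers} -> {beer}
print(support({"milk", "diapers", "beer"}))       # 2/5 = 0.4
print(confidence({"milk", "diapers"}, {"beer"}))  # (2/5)/(3/5) ≈ 0.67
```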
What are the main challenges of association analysis?
1) Mining associations from large amounts of data can be computationally expensive (need to apply smart pruning strategies)
2) Algorithms often discover a large number of associations (many are irrelevant or redundant, user needs to select the relevant subset)
What is the goal of association rule mining?
- Given a set of transactions, find all rules with support >= minsup and confidence >= minconf.
Explain the Brute Force Approach for Association Rule mining
1) List all possible association rules
2) compute support and confidence for each rule
3) remove rules that fail the threshold of minsup and minconf
Attention: Computationally prohibitive due to large number of candidates!
What happens with rules that originate from the same itemset?
- They have identical support but can have different confidence.
Explain the two-step approach of rule generation; is this approach computationally better than the brute force approach?
1) Frequent itemset generation (all itemsets whose support >= minsup)
2) Rule generation (high confidence rules from each frequent itemset, each rule is a binary partitioning of a frequent itemset)
-> Frequent itemset generation is still computationally expensive
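Step 2) can be sketched as enumerating every binary partition of a frequent itemset and keeping the high-confidence rules (toy transactions and the minconf value are illustrative):

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(itemset, minconf):
    """Yield (X, Y, conf) for each partition X -> Y with conf >= minconf."""
    items = sorted(itemset)
    for r in range(1, len(items)):           # non-empty X and non-empty Y
        for x in combinations(items, r):
            x, y = frozenset(x), frozenset(items) - frozenset(x)
            conf = support(x | y) / support(x)
            if conf >= minconf:
                yield x, y, conf

for x, y, c in rules_from_itemset({"diapers", "beer"}, minconf=0.8):
    print(set(x), "->", set(y), round(c, 2))
```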
What is the difference between a frequent itemset and a candidate itemset?
- A candidate itemset is potentially a frequent itemset (it is frequent only if its support is >= minsup)
Explain the Brute Force Approach
Complexity is O(NMw) (N = number of transactions, M = number of candidate itemsets, w = maximum transaction width)
-> expensive because M = 2^d (d = number of items)
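A quick back-of-the-envelope (illustrative numbers) shows how fast M = 2^d explodes:

```python
# With d distinct items there are M = 2^d possible itemsets,
# so the O(N*M*w) brute force blows up even for modest catalogs.
for d in (5, 20, 50):
    print(f"d={d:3d} -> M = 2^d = {2 ** d:,} candidate itemsets")
```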
Explain why the brute force approach is not feasible (with the example of amazon)
Explain the Apriori principle in regard to frequent itemsets
If an itemset is frequent, then all of its subsets must also be frequent
What is the anti-monotone property of support?
- The support of an itemset never exceeds the support of any of its subsets: X ⊆ Y ⇒ s(X) ≥ s(Y)
How can you use the apriori principle for pruning?
- If an itemset is infrequent, then all of its supersets must be infrequent as well, so they can be pruned without counting their support.
Explain the Apriori Algorithm for frequent itemset generation
1) Count the support of all 1-itemsets and keep the frequent ones
2) Generate candidate (k+1)-itemsets by joining the frequent k-itemsets
3) Prune candidates that contain an infrequent k-subset (Apriori principle)
4) Count the support of the remaining candidates and keep the frequent ones
5) Repeat 2)-4) until no new frequent itemsets are found
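A minimal Apriori sketch (toy transactions and the minsup value are made up): candidates are grown level by level and pruned whenever a k-subset is not frequent.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
minsup = 0.5
N = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / N

# Level 1: frequent single items
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
    # then prune any candidate with an infrequent k-subset (Apriori principle).
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    # Support counting: keep only candidates that meet minsup.
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset))
```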
Briefly state what the FP-Growth frequent itemset generation method is
An itemset generation algorithm that compresses the data into a tree structure (FP-tree) held in memory
How can you apply frequent itemsets in e-commerce?
- e.g., product recommendations ("customers who bought X also bought Y")