Big Data definition
"The 3 Vs": - high volume - high velocity - and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation.
Purpose of Business Analytics
Get Value from Big Data:
Three primary activities in Business Analytics
(Data Sources) -> ACQUIRE DATA -> PERFORM ANALYSIS -> PUBLISH RESULTS (push & pull w/ knowledge workers)
Methods of Business Analytics
Linear Regression Model
Yi = β0 + β1·Xi + ei
Yi = outcome variable (dependent variable); β0 = y-intercept (constant); β1 = slope of the line; Xi = independent variable; ei = error term (vertical deviation of observation i from the regression line)
THE BEST LINE SHOULD BE UNBIASED & EFFICIENT
Four assumptions for valid statistical inference based on regression model
What does the OLS model do?
Under the four assumptions, the best fitting line is the one that minimises the sum of the squared residuals!
=> ∑ (êi)^2 = ∑ (yi - ŷi)^2
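A minimal sketch of the OLS idea for one independent variable, using the standard closed-form solution. The data points and the function names (ols_fit, sum_squared_residuals) are made up for illustration and are not from the course material.

```python
def ols_fit(x, y):
    """Return (b0, b1) that minimise the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum over all observations of (yi - predicted yi)^2."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical observations, roughly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(x, y)
ssr = sum_squared_residuals(x, y, b0, b1)
print(f"intercept={b0:.3f} slope={b1:.3f} SSR={ssr:.4f}")
```

Any other intercept or slope gives a larger sum of squared residuals on the same data, which is what "best fitting" means here.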
Why should one use a log-transformation for regression?
What to check in regression output
R-squared: how much of the total variance in the outcome variable is explained by the model? The value should be as close to 1 as possible.
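ANOVA: significance of the F-value: tests whether the model as a whole explains the outcome, so it should be significant (p-value below the chosen alpha, e.g. 0.1). The Levene test is a separate check: it should be not significant, i.e. its p-value should not be lower than 0.1, for the data to have a similar distribution at all Xi.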
Coefficients: intercept (β0) and β1 and their p-values. The p-values should be significant, therefore below 0.1 at least!
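The R-squared check from the list above can be reproduced by hand. This is a sketch only: the observed and predicted values are hypothetical, and the function name r_squared is an assumption, not the name used by any particular statistics package.

```python
def r_squared(y, y_hat):
    """Share of the total variance in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    # Unexplained variation: squared residuals around the fitted values.
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    # Total variation: squared deviations around the mean of y.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical observed outcomes and the predictions of some fitted line.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.0]

r2 = r_squared(y, y_hat)
print(f"R-squared = {r2:.4f}")
```

A model that always predicts the mean of y gets an R-squared of 0, and a model that reproduces every observation exactly gets 1.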
Hypothesis tests
p-value ≤ alpha
OR
critical value ≤ |test statistic|
=> reject H0!!!
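Both rejection rules lead to the same decision. The sketch below assumes a two-sided z-test with a standard normal test statistic, which is a choice made for illustration and not stated in these notes. The z values and alpha are hypothetical.

```python
from statistics import NormalDist


def reject_by_p_value(p_value, alpha):
    """Rule 1: reject H0 if p-value <= alpha."""
    return p_value <= alpha


def reject_by_critical_value(test_statistic, critical_value):
    """Rule 2: reject H0 if critical value <= |test statistic|."""
    return critical_value <= abs(test_statistic)


def two_sided_z_test(z, alpha):
    """Return (p_value, critical_value) for a two-sided z-test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    return p_value, critical_value


alpha = 0.05
decisions = {}
for z in (2.3, -1.5):  # hypothetical test statistics
    p_value, critical_value = two_sided_z_test(z, alpha)
    by_p = reject_by_p_value(p_value, alpha)
    by_crit = reject_by_critical_value(z, critical_value)
    decisions[z] = (by_p, by_crit)
    print(f"z={z}: p={p_value:.4f}, critical={critical_value:.3f}, reject H0: {by_p}")
```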
Predictive Analysis
Using past events to anticipate the future
Data Mining & Machine Learning
GOAL: Learn a classification model from the data! Learning means that a system performs a task better with a model than without a model (≈ guessing or just assigning one class to all test data)
Supervised vs. unsupervised learning
Supervised:
- the data are labeled with pre-defined classes
Unsupervised (clustering)
Supervised learning process
Two steps:
Accuracy = number of correct classifications / total number of test cases
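The accuracy formula, sketched on a made-up test set. The labels and predictions are hypothetical. The baseline that assigns one class to all test cases shows what "better than without a model" means in practice.

```python
def accuracy(predicted, actual):
    """Number of correct classifications / total number of test cases."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical test set: true labels and the predictions of some model.
actual = ["spam", "ham", "ham", "spam", "ham", "ham", "ham", "spam"]
model_predictions = ["spam", "ham", "ham", "ham", "ham", "ham", "spam", "spam"]

# Baseline without a model: assign the same class to every test case.
baseline_predictions = ["ham"] * len(actual)

model_accuracy = accuracy(model_predictions, actual)
baseline_accuracy = accuracy(baseline_predictions, actual)
print(f"model: {model_accuracy:.3f}, baseline: {baseline_accuracy:.3f}")
```

Here the model classifies 6 of 8 test cases correctly, against 5 of 8 for the single-class baseline.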
Methods of Data Mining & Machine Learning
Supervised learning: - decision tree (one of the most widely used techniques, very efficient!) - perceptron - logistic regression ...
Unsupervised learning:
Causal Inference: How to interpret a correlation between X and Y?
Correlation does not necessarily suggest causality!
3 possibilities:
X -> Y
Y -> X
W -> X & Y
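The third case (W -> X & Y) can be made concrete with a small simulation. This is an illustrative sketch with assumed normal noise and a fixed random seed: X and Y are each generated from W only, and neither influences the other, yet they end up clearly correlated.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000

# W is the common cause; X and Y each depend on W plus their own noise.
w = [rng.gauss(0, 1) for _ in range(n)]
x = [wi + rng.gauss(0, 1) for wi in w]
y = [wi + rng.gauss(0, 1) for wi in w]

corr_xy = pearson(x, y)
print(f"correlation between X and Y: {corr_xy:.2f}")
```

With this setup the correlation comes out around 0.5, even though by construction there is no arrow from X to Y or from Y to X.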
Possible solutions: