Big Data Analytics Management Flashcards

Question

What are informative attributes?

Answer 1

Information is a quantity that reduces uncertainty about something.

Answer 2

A predictive model is a formula for estimating the unknown value of interest: the target.

Answer 3

Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.

Answer 4

Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.

Answer 5

Model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths).

Answer 6

The input data for the induction algorithm, used for inducing the model. They are also called labeled data because the value for the target variable (the label) is known.

Answer 7

Data that does not fit on a single computer and needs multiple computers to process it.

Answer 8

Entity that can mimic human behavior

Answer 9

Methods to handle and prepare data

Answer 10

* VOLUME - Data at Rest - Terabytes to exabytes of existing data to process
* VELOCITY - Data in Motion - Streaming data, milliseconds to seconds to respond
* VARIETY - Data in Many Forms - Structured, unstructured, text, multimedia
* VERACITY - Data in Doubt - Uncertainty due to data inconsistency, incompleteness, ambiguities, latency, deception, model approximations.

Answer 11

Unstructured data - text, video, audio

Answer 12

Traditional systems have been designed for transactions, not unstructured data.

Answer 13

* Google goes through every page, scans its contents, and extracts keywords to build an index (which is updated every day).
* Traditional architecture is not enough to process this data.
* **Solution:** cluster architecture. However, every day 900 machines die and need to be replaced.

Answer 14

Solution:

* Google File System - redundant storage of massive amounts of data on cheap and unreliable computers
* MapReduce - distributed computing paradigm

Answer 15

Open source software framework for distributed storage and distributed processing that replicated Google's MapReduce model.

Answer 16

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The MapReduce model consists of two main stages:

1. **Map:** input data is split into discrete chunks to be processed
2. **Reduce:** output of the map phase is aggregated to produce the desired result

The simple nature of the programming model lends itself to **efficient** and **large-scale implementations** across thousands of cheap nodes (computers).
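
The two stages can be illustrated with a word count, the usual introductory MapReduce example. This is a minimal single-machine sketch, not Hadoop's or Google's implementation: the function names are hypothetical, and the `shuffle` step (grouping values by key between the two stages) is an assumption added here because a real framework performs it automatically.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key (handled by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(chunks):
    intermediate = []
    for chunk in chunks:  # on a cluster, each chunk could be mapped on a different node
        intermediate.extend(map_phase(chunk))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["big data big", "data cluster"]
counts = word_count(chunks)
```

Because each chunk is mapped independently and each key is reduced independently, both stages can be spread over many machines.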

Answer 17

1. Data Engineering and Processing (big data technologies - Google File System, Hadoop)
2. Data Science -\> automated DDD

Answer 18

Increasing in value and difficulty:

1. **Descriptive** - what happened? [Reports]
2. **Diagnostic** - why did it happen? [Queries, statistical analysis]
3. **Predictive** - what will happen? [Forecasts, machine learning]
4. Diagnostic + Predictive = **Prescriptive** - how can we make it happen? [Optimization, planning]

Answer 19

Predictive analytics does not answer WHY something happens; it only predicts. Diagnostic analytics tries to answer WHY something happened.

Answer 20

Determining correlation is easier than establishing causation, which typically requires a randomized controlled experiment.

Answer 21

* _Descriptive/diagnostic analytics_ provide insight into the data, so that one can better understand what data to collect and store and provide insight into ways to improve future models. * _Predictive analytics_ is building a model to predict when something will happen. * _Prescriptive analytics_ automates action to be taken based on prediction.

Answer 22

Data visualization; Clustering; Co-occurrence grouping

Answer 23

* Uplift modeling - predict how individuals behave contingent on the action performed upon them * Automation - determine optimal action based on predicted reaction of individuals

Answer 24

Homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure.

Answer 25

* Attributes rarely split a group perfectly. * Not all attributes are binary. * Some attributes take on numeric values (continuous or integer)

Answer 26

**Entropy** is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest. So, for example, a mixed up segment with lots of write-offs and lots of non-write-offs would have high entropy.

Answer 27

They had data on what people liked and they personalized/customized/adapted their movies to customers' preferences

Answer 28

* They stored customers' data as a 'shoe profile'; * They sold their service to e-commerce shoe stores as a widget.

Answer 29

* Manually
* Manually download the file (which someone created)
* Pretending you are a human browsing a web site (web scraping)
* API

Answer 30

* A modern **programming** language which offers complete flexibility but requires more effort to implement; * **Specialized tools** which allow faster implementation but provide less flexibility and make it harder to replicate data collection.

Answer 31

1. Request a web page
2. Parse the HTML
3. Filter and transform data to desired format
4. Save data.
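
The four steps can be sketched with Python's standard library. To keep the example runnable offline, step 1 is replaced by an inline HTML string; the page content, the `product` class name, and the `ProductLinkParser` helper are all hypothetical, and a real scraper would fetch the page over HTTP and respect the site's rules.

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (request) is replaced by an inline page so the sketch runs offline.
PAGE = """<html><body>
<a class="product" href="/shoes/1"> Runner </a>
<a class="nav" href="/about">About</a>
<a class="product" href="/shoes/2">Trail</a>
</body></html>"""


class ProductLinkParser(HTMLParser):
    """Step 2: parse the HTML and collect links marked class="product"."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._pending_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "product":
            self._pending_href = attributes.get("href")

    def handle_data(self, data):
        if self._pending_href is not None:
            # Step 3: filter and transform (strip whitespace, keep two fields).
            self.rows.append({"name": data.strip(), "url": self._pending_href})
            self._pending_href = None


page_parser = ProductLinkParser()
page_parser.feed(PAGE)

# Step 4: save the data (an in-memory CSV here; a file object works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "url"])
writer.writeheader()
writer.writerows(page_parser.rows)
```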

Answer 32

import.io is a simple tool that tries to infer what is interesting on a website. webscraper.io gives you more flexibility.

Answer 33

1. Define a starting page 2. Define category links 3. For each individual category or product page determine which information to collect, determine which links to follow.

Answer 34

* Many sites do not allow gathering information automatically.
* Sites try to detect whether you are human based on request frequency, cookies, and other trackers, and they state their crawling rules through the Robots Exclusion Protocol (stored in robots.txt).
* Not all information is public (you may need authentication or an API to access protected information).

Answer 35

It tells crawlers who is allowed to crawl the site and which pages they may visit.
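
A crawler can check these rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `my-crawler` user agent, and the example.com URLs are made up for illustration, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is barred from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

allowed_public = rules.can_fetch("my-crawler", "https://example.com/products")
allowed_private = rules.can_fetch("my-crawler", "https://example.com/private/data")
```

Here `allowed_public` is true and `allowed_private` is false, so a polite crawler would skip the second URL.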

Answer 36

Data can be costly to acquire, so companies don't want to be found.

Answer 37

It is an official way of accessing information automatically.

1. Get an API key
2. Query an API endpoint using the API key - an API usually provides multiple endpoints or functions (most recent movies, most popular...)
3. Process the response

Answer 38

* **Representational State Transfer (REST) APIs:** Used for singular queries for one term * **Streaming API:** Continuously get the tweets

Answer 39

* **Cross-sectional** - data that (almost) never changes;
  * e.g. city names, birth date
* **Transactional** - one observation represents one transaction;
  * e.g. a website visit
* **Panel** - one observation represents one individual during a time period
  * e.g. monthly bill.

Answer 40

Tidy data is stored in a single table according to the following rules:

1. Each _variable_ must have its _own column_.
2. Each _observation_ must have its _own row_.
3. Each _value_ must have its _own cell_.

Answer 41

* Structured * Unstructured

Answer 42

QUALITATIVE/CATEGORICAL DATA
- Nominal;
- Ordinal (satisfaction level);

QUANTITATIVE DATA
- Discrete (countable number);
- Continuous (interval value).

Answer 43

* Text-based documents (tweets, webpages) * Images, videos, audio

Answer 44

* Topic Modeling (text); * Sentiment Analysis (text); * Feature extraction (image/video/sound).

Answer 45

* Missing data * Measurement error

Answer 46

* Missing observations * Missing values in some observations

Answer 47

You need to know the reason why it is missing because it will inform whether it is a problem or not. Is data missing at random or not?

Answer 48

It is fine. If data are missing at random, the remaining observations are still a representative sample of the population. Solution: listwise deletion i.e. delete all observations that do not have values for all variables in the analysis.
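
Listwise deletion can be sketched in a few lines. The example assumes that missing values are represented by `None` and that observations are dictionaries; both choices, and the sample rows, are illustrative only.

```python
def listwise_deletion(rows, variables):
    """Keep only observations that have a value for every analysis variable."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]


rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 41000},  # missing age: dropped
    {"age": 51, "income": None},     # missing income: dropped
    {"age": 28, "income": 39000},
]
complete = listwise_deletion(rows, ["age", "income"])
```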

Answer 49

It is a problem! The remaining observations are not a representative sample of population.

Answer 50

* **Selection bias** occurs when the sampling procedure is not random and thus the sample is not representative of the population. * **Self-selection** - some members of the population are more likely to be included in the sample because of their characteristics.

Answer 51

Selection bias occurs when the sampling procedure is not random and thus the sample is not representative of the population.

Answer 52

* **Self-selection** - some members of the population are more likely to be included in the sample because of their characteristics. * **Attrition** - some observations may be less likely to be present in the sample due to time constraints

Answer 53

Measurement error occurs when the data are collected with errors that are non-random.

Answer 54

* **Recall bias** - respondents recall some events more vividly than others (child deaths by gun vs swimming pools); * **Sensitive questions** - respondents may not report data accurately (wages, health conditions); * **Faulty equipment** - equipment that exhibits systematic measurement error.

Answer 55

Disorder corresponds to how mixed (impure) the segment is with respect to properties of interest.

Answer 56

entropy = - p1 log(p1) - p2 log(p2) - ... Each pi is the probability of property i within the set, ranging from pi = 1 when all members of the set have property i to pi = 0 when no members of the set have property i.
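
The formula translates directly into code. This sketch assumes base-2 logarithms (so entropy is measured in bits) and treats a term with pi = 0 as contributing zero; the function name is illustrative.

```python
import math


def entropy(proportions):
    """Entropy in bits: the sum of -p_i * log2(p_i) over all properties with p_i > 0."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


pure_segment = entropy([1.0, 0.0])    # all members share one property
mixed_segment = entropy([0.5, 0.5])   # two properties, evenly mixed
four_way_mix = entropy([0.25] * 4)    # four properties, evenly mixed
```

A pure segment gives 0, an even two-way mix gives 1, and an even four-way mix gives 2, which matches log2 of the number of equally likely outcomes.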

Answer 57

It measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.

Answer 58

We use Laplace correction. Its purpose is to moderate the influence of leaves with only a few instances.
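
One common form of the Laplace correction for class probability estimates adds one to the count of each class. The sketch below assumes that form; the function name and the example leaves are illustrative.

```python
def laplace_probability(n_class, n_total, n_classes=2):
    """Smoothed estimate (n_class + 1) / (n_total + n_classes).

    Assumes the add-one form of the correction; with two classes this is
    (n + 1) / (n + m + 2) for n instances of the class and m of the other.
    """
    return (n_class + 1) / (n_total + n_classes)


small_leaf = laplace_probability(n_class=2, n_total=2)    # raw frequency would be 1.0
large_leaf = laplace_probability(n_class=20, n_total=20)  # raw frequency would be 1.0
```

Both leaves are pure, but the small leaf's estimate is pulled down to 0.75 while the large leaf stays near 0.95, which is the moderating effect described above.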

Answer 59

We measure the attribute on the basis of information gain, which is based on a purity measure called entropy. Another measure is variance reduction (for numeric targets).

Answer 60

We use the tree induction technique.

Answer 61

Tree induction recursively finds informative attributes for subsets of the data. In so doing it segments the space of instances into similar regions. The partitioning is “supervised” in that it tries to find segments that give increasingly precise information about the quantity to be predicted, the target. The resulting tree-structured model partitions the space of all possible instances into a set of segments with different predicted values for the target.

Answer 62

The data miner specifies the form of the model and the attributes; the goal of the data mining is to tune the parameters so that the model fits the data as well as possible.

Answer 63

Hinge loss. The penalty for a misclassified point is proportional to the distance from the decision boundary, so if possible the SVM will make only “small” errors.
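
In its standard formulation (assumed here, since the card gives no formula), hinge loss is max(0, 1 - y * f(x)), with labels y in {-1, +1} and f(x) the signed output of the classifier. The function name is illustrative.

```python
def hinge_loss(y, score):
    """Standard hinge loss for a label y in {-1, +1} and a signed classifier output."""
    return max(0.0, 1.0 - y * score)


confident_correct = hinge_loss(+1, 2.0)   # correct side, outside the margin
slightly_wrong = hinge_loss(+1, -0.5)     # wrong side, close to the boundary
badly_wrong = hinge_loss(-1, 3.0)         # wrong side, far from the boundary
```

The loss grows linearly with how far a point lies on the wrong side, so large errors cost proportionally more than small ones.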

Answer 64

1. Define target 2. Collect data 3. Build a model (set of rules or a mathematical formula) 4. Predict outcomes

Answer 65

* **Target variable** (label) - the value you're trying to predict;
* **Supervised segmentation** is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
* **Entropy** is a measure of disorder (surprise). It tells how impure the segment is with regard to the properties of interest.
* **Information gain** measures how much an attribute decreases entropy over the whole segmentation it creates.

Answer 66

* **Accuracy** is the proportion of correct decisions made by the classifier.
* **Error rate** is the proportion of wrong decisions made by the classifier.
* **Confusion matrix** is a table that is often used to describe the performance of a classification model (TP, TN, FP, FN) on a set of test data for which the true values are known.

Answer 67

* **Affinity grouping** - associations, market-basket analysis (Which items are commonly purchased together?) * **Similarity matching** (Which other companies are similar to ours?) * **Clustering** (Do my customers form natural groups?) * **Sentiment analysis** (What is the sentiment of my users?)

Answer 68

* **Predictive modeling**
  * Will a specific customer default? Which accounts will be defrauded?
* **Causal modeling**
  * How much would client X spend if I gave her a discount?

Answer 69

They pursue different goals: **Predictive modeling** is the process of applying a statistical model or data mining algorithm to data for the purpose of _predicting new or future observations._ _Example:_ How much will client X spend? **Explanatory modeling** is the use of statistical models for explaining how the world works (by testing _causal explanations)_. _Example:_ How much would a discount change client X's spending?

Answer 70

* **Explanatory models** are based on _underlying causal relationships between theoretical constructs_ while * **predictive models** rely on _associations between measurable variables._ * **Explanatory modeling** seeks to _minimize model bias_ (i.e. specification error) to obtain the most accurate representation of the underlying theoretical model, * **predictive modeling** seeks to _minimize the combination of model bias and sampling variance_ (how much does the model change with new data).

Answer 71

It is a method for estimating an unknown value of interest, which is called target.

Answer 72

1. Define (quantifiable) target 2. Collect data - data on same or related phenomenon 3. Build a model - a set of rules or a mathematical formula that allow establishing a prediction. 4. Predict outcomes - the model can be applied to any customer. It gives us a prediction of the target variable.

Answer 73

REGRESSION

Attempts to estimate or predict the numerical value of some variable for an individual.

Mathematical formula:
* Linear regression
* Logistic regression

Rule-based formula:
* Regression trees

CLASSIFICATION

Attempts to predict which of a (small) set of classes an individual belongs to.

Mathematical formula:
* Logistic regression
* Support Vector Machines

Rule-based formula:
* Classification trees

Answer 74

**Linear regression** is an approach for modeling the relationship between a _dependent variable_ and one or more _explanatory variables_. The estimators B0, B1, B2 are obtained by _minimizing the sum of squared errors._ It is used when you are trying to predict a numerical variable.
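
For a single explanatory variable, minimizing the sum of squared errors has a closed-form solution. The sketch below uses that textbook solution; the function name and the effort/grade data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1


effort = [1, 2, 3, 4]
grade = [3, 5, 7, 9]             # generated from grade = 1 + 2 * effort
b0, b1 = fit_simple_linear_regression(effort, grade)
```

Because the sample data lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2.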

Answer 75

If the dependent variable takes values between 0 and 1, we can use **logistic regression** to model its relationship with one or more **explanatory variables.** f() is a function with values between 0 and 1. P(Pass) = f(b0 + b1 × effort)
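
In logistic regression, f is the logistic (sigmoid) function. The sketch below evaluates P(Pass) for a given effort; the coefficient values b0 = -4 and b1 = 0.5 are made up for illustration, not estimated from data.

```python
import math


def logistic(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def pass_probability(effort, b0=-4.0, b1=0.5):
    """P(Pass) = f(b0 + b1 * effort), with hypothetical coefficients."""
    return logistic(b0 + b1 * effort)


low_effort = pass_probability(2)
break_even = pass_probability(8)    # b0 + b1 * effort = 0
high_effort = pass_probability(14)
```

With these coefficients the probability of passing is exactly 0.5 at an effort of 8 and rises towards 1 as effort increases.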

Answer 76

It shows how much of the total variation is explained by the model. Everything besides effort. The bigger it is, the better, because it means this percentage of the variation is explained by the model.

Answer 77

1. **Define target** - will prospect X buy life insurance?
2. **Collect data** - gather list of prospects with demographic information
3. **Build a model** - logistic regression, classification trees
4. **Predict outcomes**

Answer 78

Logistic regression can be used for classification when: * Target variable is binary. * The outcome of a model can be interpreted as probability.

Answer 79

Stop segmentation when at least one of the conditions is met:

* All elements of a segment belong to the same class
* The maximum allowed tree depth is reached
* Using more attributes does not "help"

Answer 80

Resulting groups have to be as pure as possible - homogeneous w.r.t. the target variable.

Answer 81

How much information is necessary to represent an event with X equally likely outcomes? log2(X). With p = 1/X, this is log2(1/p). **Entropy** measures the general disorder of a set - how unpredictable the world is.

Answer 82

* **Information gain** (IG) measures the change in entropy due to any amount of new information being added.
* Information gain measures how much an attribute decreases entropy over the whole segmentation it creates.

IG = entropy(parent) - [p(c1) \* entropy(c1) + p(c2) \* entropy(c2) + ...]
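
The IG formula can be checked on a small example. In this sketch, each p(ci) is the share of the parent's instances that falls into child ci, entropy uses base-2 logarithms, and the function names and labels are illustrative.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of the child segments."""
    total = len(parent)
    weighted = sum(len(child) / total * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A perfect split produces two pure children.
perfect_gain = information_gain(parent, [["yes"] * 5, ["no"] * 5])

# A weak split leaves both children almost as mixed as the parent.
weak_gain = information_gain(
    parent,
    [["yes", "yes", "yes", "no", "no"], ["yes", "yes", "no", "no", "no"]],
)
```

The perfect split removes all of the parent's entropy (a gain of 1 bit), while the weak split gains only about 0.03 bits.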

Answer 83

**Accuracy** = number of correct decisions made/total number of decisions. **Error rate** = 1 - accuracy

Answer 84

* True Positives (TP) - actual positives correctly predicted as positive. * True Negatives (TN) - actual negatives correctly predicted as negative. * False Positives (FP) - negatives incorrectly predicted as positive. * False Negatives (FN) - positives incorrectly predicted as negative.
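
The four counts, and the accuracy derived from them, can be computed as follows. This is a minimal sketch in which the label `1` is assumed to mean "positive" and the example predictions are invented.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for two equally long lists of labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
```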

Answer 85

* **Generalization** is the property of a model whereby _model applies to data that were not used to build the model._ * **Overfit** is the tendency to _tailor models to the training data_, at the expense of generalization to previously unseen data points.

Answer 86

* **Holdout data** (or test set) is the data that was not used to teach the model - it was set aside so the created model could be evaluated * **Cross-validation** computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing. It computes the average and standard deviation from k folds.
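
The k-fold procedure can be sketched without any libraries beyond the standard one. To keep the focus on the splitting, the "model" here is a deliberately trivial placeholder that always predicts the majority class of its training fold; the helper names and labels are hypothetical.

```python
import statistics


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def majority_class(labels):
    """Placeholder model: always predict the most common training label."""
    return max(set(labels), key=labels.count)


def cross_validate(labels, k):
    """Return the mean and standard deviation of accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[j] for j in train_idx])
        correct = sum(labels[j] == prediction for j in test_idx)
        scores.append(correct / len(test_idx))
    return statistics.mean(scores), statistics.stdev(scores)


labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
mean_score, std_score = cross_validate(labels, k=5)
```

Reporting the standard deviation alongside the mean shows how much the performance estimate varies from fold to fold.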

Answer 87

The **expected value framework** is an analytical tool that is extremely helpful in organizing thinking about data-analytic problems. It combines:

* Structure of the problem
* Elements of the analysis that can be extracted from the data
* Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)

The **benefit/cost matrix** summarizes the benefits and costs of each potential outcome, always comparing with a base scenario. It does not really matter which base scenario we choose, as long as all comparisons are with the same scenario.
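
An expected value is the probability-weighted sum of the values of all outcomes. In the sketch below the outcome probabilities stand for what would be extracted from the data (for example from a confusion matrix) and the benefit/cost entries for what business knowledge would supply; every number is invented, with "do nothing" as the base scenario valued at 0.

```python
def expected_value(probabilities, values):
    """Sum of p(outcome) * value(outcome) over all outcomes."""
    return sum(probabilities[outcome] * values[outcome] for outcome in probabilities)


# Hypothetical outcome probabilities (they must sum to 1).
probabilities = {"TP": 0.10, "FP": 0.20, "TN": 0.60, "FN": 0.10}

# Hypothetical benefit/cost matrix relative to the base scenario of not targeting:
# a targeted responder yields 99, a targeted non-responder costs 1.
benefit_cost = {"TP": 99.0, "FP": -1.0, "TN": 0.0, "FN": 0.0}

expected_profit = expected_value(probabilities, benefit_cost)
```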

Answer 88

Does not predict the future but just fits the data perfectly, as it memorizes the training data and performs no generalization.

Answer 89

It is the tendency to tailor models to the training data, at the expense of generalization to previously unseen data points.

Answer 90

* If we allow ourselves enough flexibility in searching, we will find patterns
* Unfortunately, these patterns may be just chance occurrences in the data
* We are interested in patterns that generalize, i.e., that predict well for instances that we have not yet observed

Answer 91

**Solution:** cross-validation. * Cross validation is a more sophisticated training and testing procedure. * Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing. * Simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, . . . )

Answer 92

You need to assess what is better - good average or small standard deviation.

Answer 93

* (Weighted) Sum of information gain in each split a variable is used (tree-based models) * Difference in model performance with and without using that variable (all models)

Answer 94

* Trees create a segmentation of the data
* Each node in the tree contains a test of an attribute
* Each path eventually terminates at a leaf
* Each leaf corresponds to a segment, and the attributes and values along the path give the characteristics.
* Each leaf contains a value for the target variable

Answer 95

* P1, P2, ..., Pn are the proportions of classes 1, 2, ..., n in the data
* Disorder corresponds to how mixed (impure) a segment is
* Entropy is **zero at minimum disorder** (all members belong to the same class)
* Entropy is **one at maximal disorder** for two classes (members equally distributed among classes)

Answer 96

1. **Scaling with a queue** - you create a queue for requests so that frequent requests don't crash the system.
2. **Scaling by sharding the database** - you split the write load across multiple machines (horizontal partitioning/sharding).
   1. This introduces fault-tolerance and corruption issues.

Answer 97

* Robustness and fault tolerance
* Low latency reads and updates
* Scalability
* Generalization
* Extensibility
* Ad hoc queries
* Minimal maintenance
* Debuggability

Answer 98

* **Operational complexity**
  * *Compaction* is an intensive operation - a lot of coordination. Many things could go wrong.
* **Extreme complexity of achieving eventual consistency**
  * Consistency and availability don't go together
* **Lack of human-fault tolerance:** an incremental system is constantly modifying the state it keeps in the database, which means a mistake can also modify the state in the database.

Answer 99

The expected value framework is an analytical tool that is extremely helpful in organizing thinking about data-analytic problems. It combines:

* Structure of the problem
* Elements of the analysis that can be extracted from the data
* Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)

Answer 100

**Generalization** is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.

Answer 101

**Overfitting** is the tendency of data mining procedures to _tailor models to the training data_, at the expense of generalization to previously unseen data points.

Answer 102

**Holdout data** is data used for validating a model and not used for training a model. Performance is evaluated based on accuracy in the test data -\> **holdout accuracy.** Holdout accuracy is an estimate of **generalization accuracy.**

Answer 103

1. Stop growing the tree before it gets too complex 2. Grow the tree until it is too large, then 'prune' it back, reducing its size (and complexity).

Answer 104

- Ad retrieval - Customer classification - Customer clustering - Competitor analysis

Answer 105

You measure the distance between the attributes. Distance = Euclidean distance (Pythagoras).
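
The Pythagorean (Euclidean) distance between two objects described by numeric attributes can be computed as below. The function name and the two example customers, each described by two made-up attributes, are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between attribute values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


customer_a = (0.0, 0.0)
customer_b = (3.0, 4.0)
distance = euclidean_distance(customer_a, customer_b)
```

The smaller the distance, the more similar the two objects are considered to be.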

Answer 106

The **lift of the co-occurrence** of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other. In other words: how much more frequently does this association occur than we would expect by chance?

Answer 107

Lift measures how much more likely than chance a discovered association is. An alternative is to look at the difference between the observed and the expected co-occurrence probabilities rather than their ratio. This measure is called **leverage.**
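
Both measures compare the observed probability of seeing A and B together, p(A, B), with the probability expected under independence, p(A) * p(B): lift is their ratio and leverage is their difference. The probabilities in this sketch are invented.

```python
def lift(p_a, p_b, p_ab):
    """Ratio of the observed co-occurrence to the co-occurrence expected by chance."""
    return p_ab / (p_a * p_b)


def leverage(p_a, p_b, p_ab):
    """Difference between the observed and the chance co-occurrence."""
    return p_ab - p_a * p_b


# Hypothetical basket data: A appears in 40% of baskets, B in 50%, both in 30%.
observed_lift = lift(0.4, 0.5, 0.3)
observed_leverage = leverage(0.4, 0.5, 0.3)
```

A lift above 1 (here about 1.5) and a leverage above 0 (here about 0.1) both indicate that the items occur together more often than chance would suggest.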

Answer 108

1. Scaling 2. Complexity 3. Fault-tolerance 4. Data-corruption

Answer 109

1. Robustness and fault tolerance
2. Low latency
3. Minimal maintenance
4. Ad hoc queries

Answer 110

* Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers
* They scale by adding more machines to the cluster
* Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible
* The operations you can do with a distributed filesystem are often more limited than you can do with a regular filesystem
  * For instance, you may not be able to write to the middle of a file or even modify a file at all after creation

How do distributed file systems work?

* All files are **broken into blocks** (usually 64 to 256 MB)
* These blocks are **replicated** (typically 3 copies) among the HDFS servers (datanodes)
* The namenode provides a **lookup service** for clients accessing the data and ensures the blocks are correctly replicated across the cluster

Answer 111

The MapReduce model consists of two main stages:

* **Map:** input data is split into discrete chunks to be processed
* **Reduce:** output of the map phase is aggregated to produce the desired result

The simple nature of the programming model lends itself to efficient and large-scale implementations across thousands of cheap nodes (computers).

Answer 112

* Analytics application is **struggling to keep up with the traffic** - too many requests for the database * You start hashing the database, however, it is messy and takes time and it is **prone to errors** * **Fault-tolerance** decreases as you can only fix it by having one of the databases down * **Data corruption issues.** No place to store unchangeable data, thus you corrupt the original file.

Answer 113

The desired properties of Big Data systems are related both to _complexity_ and _scalability._

* **Complexity:** generally used to characterize something with many parts where those parts interact with each other in multiple ways
* **Scalability:** the ability to maintain performance in the face of increasing data or load by adding resources to the system

A Big Data system must _perform well_, be _resource-efficient_, and it must be **easy to reason about**.

Answer 114

1. **Robustness and fault tolerance**
   1. Duplicated data
   2. Concurrency
2. **Low latency**
3. **Minimal maintenance**
   1. Anticipating when to add machines to scale
   2. Keeping processes up and running
   3. Debugging
4. **Ad hoc queries**
   1. Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications.

Answer 115

* Manages the **master dataset** – an immutable, append-only set of raw data * Pre-computes arbitrary query functions – called **batch views** * Runs in a loop and continuously recomputes the batch views from scratch * Very simple to use and understand * Scales by adding new machines.

Answer 116

* Data is always true (or correct): all records are always correct; no need to go back and re-write existing records; you can simply append new data * You can always go back to the data and **perform queries you did not anticipate** when building the system Data should be stored in **raw format**, should be **immutable** and should be **kept forever**

Answer 117

* Accommodates all requests that are subject to _low latency requirements_ * Its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements * Similar to the batch layer in that it produces views based on data it receives * One big difference is that the speed layer _only looks at recent data,_ whereas the batch layer looks at all the data at once * Does _incremental computation_ instead of the recomputation done in the batch layer

Answer 118

* Indexes **batch views** so that they can be queried with low latency * The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it * When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available * It does not need to support specific record updates * This is a very important point, as random writes cause most of the complexity in databases

Answer 119

Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers * They scale by adding more machines to the cluster * Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible The operations you can do with a distributed filesystem are often more limited than you can do with a regular filesystem * For instance, you may not be able to write to the middle of a file or even modify a file at all after creation

Answer 120

Motivation

* Google needed a good distributed file system
  * Redundant storage of massive amounts of data on cheap and unreliable computers

Why not use an existing file system?

* Google’s problems were different from anyone else’s
  * Different workload and design priorities
* Google File System is designed for Google apps and workload
* Google apps are designed for Google File System

Assumptions

* High component failure rates
  * Inexpensive commodity components fail all the time
* "Modest" number of HUGE files
  * Just a few million
  * Each is 100MB or larger; multi-GB files typical
* Files are write-once, mostly appended to
* Large streaming reads

Answer 121

**BigTable** is a _distributed storage system_ for managing _structured data_ that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Answer 122

The Hadoop Distributed File System (HDFS) is the open source alternative to Google File System

* Commodity hardware
* Tolerant to failure

Answer 123

* Hadoop MapReduce is a distributed computing paradigm originally pioneered by Google * Used to process data in the batch layer

Answer 124

Many data analysis problems involve the application of a split-apply-combine strategy: * **Split:** Break up a big problem into manageable pieces; * **Apply:** Operate on each piece independently; * **Combine:** Put all the pieces back together.
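
The strategy can be sketched as a small group-wise aggregation. The function name, the field names, and the billing rows are hypothetical; on a cluster the "apply" step is the part that would run in parallel.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's values, and combine the results."""
    # Split: break the data into one piece per key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # Apply and combine: operate on each piece independently, then collect the results.
    return {group: func(values) for group, values in groups.items()}


bills = [
    {"city": "Lisbon", "bill": 10},
    {"city": "Porto", "bill": 30},
    {"city": "Lisbon", "bill": 20},
    {"city": "Porto", "bill": 50},
]
average_bill = split_apply_combine(bills, key="city", value="bill", func=mean)
```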

Answer 125

* **Simplicity:** Developers can write applications in their language of choice, such as Java, C++ or Python
* **Scalability:** MapReduce can process very large amounts of data, stored in HDFS on one cluster
* **Speed:** Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes
* **Recovery:** MapReduce takes care of failures. If a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task.
* **Minimal data motion:** MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces network traffic and contributes to Hadoop’s processing speed

Answer 126

MapReduce is a very powerful and flexible tool that allows performing almost any data transformation task. However it has some limitations: * MapReduce is designed specifically for batch processing * Low level framework (hard to use) New tools have been developed to simplify the use of MapReduce * Apache HIVE (similar to SQL) * Apache Pig (script language)

Answer 127

* Elastic clouds allow you to **rent hardware on demand** rather than own your own hardware in your own location. * Elastic clouds let you **increase or decrease the size of your cluster nearly instantaneously**, so if you have a big job you want to run, you can allocate the hardware temporarily. * Elastic clouds dramatically **simplify system administration.** They also provide additional storage and hardware allocation options that can significantly drive down the price of your infrastructure. Examples of suppliers: * Microsoft Azure * Amazon Web Services (AWS) * Digital Ocean

Answer 128

* The area under the ROC curve (AUC) is the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case
* AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions
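
The ranking interpretation gives a direct, if inefficient, way to compute AUC: compare every positive case with every negative case and count how often the positive one receives the higher score. Counting ties as half a win is a common convention assumed here; the scores are invented.

```python
def auc_by_ranking(positive_scores, negative_scores):
    """Share of (positive, negative) pairs in which the positive case is ranked higher."""
    wins = sum(
        (p > n) + 0.5 * (p == n)   # a tie counts as half a win
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))


positive_scores = [0.9, 0.8, 0.4]
negative_scores = [0.7, 0.3, 0.2]
auc = auc_by_ranking(positive_scores, negative_scores)
```

Eight of the nine pairs are ranked correctly here, so the AUC is 8/9; a model that ranks at random would score about 0.5.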

Answer 129

* **Underfitting:** a model that is too simple does not fit the data well (high bias)
  * e.g., fitting a quadratic function with a linear model
* **Overfitting:** a model that is too complex fits the data too well (high variance)
  * e.g., fitting a quadratic function with a 3rd degree function

Answer 130

* Bias a model that underfits is wrong on average (high bias) but is not highly affected by slightly different training data * Variance a model that overfits is right on average, but is highly sensitive to specific training data

Answer 131

* When trying to find the optimal model, we are in fact trying to find the optimal tradeoff between bias and variance. * We can reduce variance by putting many models together and aggregating their outcomes. More complexity generally gives us lower bias but higher variance, while lower variance models tend to have higher bias.

Answer 132

**Ensemble methods** use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself. Using **multiple algorithms** usually increases model performance by: * **reducing variance:** models are less dependent on the specific training data Examples: * **Bagging** (or bootstrap aggregation) creates multiple data sets from _the original training data by bootstrapping_ (re-sampling with replacement), runs several models, and _aggregates the output with a voting system_ * **Random Forest** combines bagging with random selection of features (or predictors) * **Boosting** applies classifiers sequentially, assigning higher weights to observations that have been misclassified by the previous methods
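
A minimal sketch of bagging, assuming a toy one-dimensional data set and a toy nearest-neighbour base model (both are illustrative choices, not part of the course material). All function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Re-sample with replacement: draw len(data) items from data."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate the class predictions of several models by voting."""
    return Counter(predictions).most_common(1)[0][0]

def train_nearest_neighbour(sample):
    """Toy base model (an assumption): predict the label of the closest training point."""
    def predict(x):
        _, label = min(sample, key=lambda row: abs(row[0] - x))
        return label
    return predict

def bagging_predict(data, x, n_models=25, seed=0):
    """Train n_models on bootstrap samples and let them vote on the class of x."""
    rng = random.Random(seed)
    models = [train_nearest_neighbour(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return majority_vote([model(x) for model in models])

training_data = [(0.8, "a"), (1.0, "a"), (1.2, "a"), (4.9, "b"), (5.0, "b"), (5.3, "b")]
print(bagging_predict(training_data, 1.1))
```

Each base model sees a slightly different data set, so individual quirks tend to be outvoted, which is the variance reduction described above.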

Answer 133

* If we allow ourselves enough flexibility in searching, we will find patterns * Unfortunately, these patterns may be just chance occurrences in the data... * We are interested in patterns that generalize, i.e., that predict well for instances that we have not yet observed

Answer 134

A **fitting graph** shows the accuracy (or error rate) of a model as a function of model complexity. Generally, there will be more overfitting as one allows the model to be more complex.

Answer 135

**Complexity** is a measure of the flexibility of a model. * If the model is a _mathematical function_, complexity is measured by the **number of parameters** * If the model is a _tree_, complexity is measured by the **number of nodes**

Answer 136

As a model gets more complex, it is allowed to pick up harmful **spurious correlations.** * These correlations do _not represent characteristics of the population in general_ * They may become harmful when they _produce incorrect generalizations_ in the model

Answer 137

* **Cross-validation** computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing * It yields not only a simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, ...)
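
The splitting scheme can be sketched as follows. This is a simplified k-fold variant that works on sample indices and skips shuffling; `k_fold_splits`, `cross_validate` and the `evaluate` callback are hypothetical names introduced for illustration.

```python
import statistics

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) so that every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, k, evaluate):
    """Run evaluate(train, test) on every fold; return the mean and variance of the scores."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_samples, k)]
    return statistics.mean(scores), statistics.pvariance(scores)

splits = list(k_fold_splits(10, 3))
print([len(test) for _, test in splits])  # fold sizes
```

In practice `evaluate` would train a model on the training indices and return its accuracy on the test indices.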

Answer 138

A **learning curve** is a plot of the generalization performance (testing data) against the amount of training data * Generalization performance improves as more training data are available * Steep initially, but then marginal advantage of more data decreases

Answer 139

* The ROC graph shows the entire space of performance possibilities for a given model, independent of class balance * Plots a classifier's false positive rate on the x axis against its true positive rate on the y axis * It depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives): * (0, 0) is the strategy of never issuing a positive classification * (1, 1) is the strategy of always issuing a positive classification * The line linking (0,0) to (1,1) is the strategy of guessing randomly
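
The points of a ROC graph can be computed by sweeping a threshold over the model scores. The sketch below assumes labels coded as 1 and 0 and classifies a case as positive when its score is at or above the threshold; `roc_points` is a hypothetical helper.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]  # threshold above every score: never issue a positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points  # the last point is (1.0, 1.0): always issue a positive

points = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(points)
```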

Answer 140

* Algorithm is wrong * Data is biased * People are biased

Answer 141

* There is a trade-off: * Get money for ads * Or do self-advertising Thus, you need to know how likely the person is to convert to figure out the expected value of a self-ad. Target variable - Conversion within a week Data to use: - Historical data What are informative attributes for selection? - Income - Age - Device - Status (working/non-working) - Number of friends using Spotify - Number of hours listened - Number of skips - Made a step towards buying - Clicking on premium options

Answer 142

**Uplift modelling** identifies _individuals that are most likely to respond favorably_ to an action.

Answer 143

* Predictive modelling: * Will a targeted customer buy? * Will I buy Spotify premium? * focuses only on distinguishing between customers that buy if they are targeted versus those that do not buy * Uplift modelling: * Will the customer buy ONLY if targeted? * Are the self-ads the reason why I buy Spotify Premium? * further distinguishes different behaviors among those that do not get targeted

Answer 144

* The **core complication with uplift modeling** lies in the fact that we _cannot measure the uplift for an individual_ because we cannot simultaneously target and not target a single person. * We can overcome this by **randomly assigning similar-looking people to different treatments** and assessing the differences in their behavior.

Answer 145

- Differential approach - Two model approach

Answer 146

1. Choose a target variable 2. Run two predictive models: 1. Experimental group 2. Control group 3. Calculate difference in predicted outcomes across models
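
The three steps above can be sketched with a deliberately simple "model": the observed conversion rate per customer segment. The segment names, the data and the function names `fit_rate_model` and `two_model_uplift` are all hypothetical; a real application would fit proper predictive models on the experimental and control groups.

```python
from collections import defaultdict

def fit_rate_model(rows):
    """Toy predictive model: observed conversion rate per segment.

    rows is a list of (segment, converted) pairs with converted in {0, 1}.
    """
    totals = defaultdict(lambda: [0, 0])  # segment -> [conversions, customers]
    for segment, converted in rows:
        totals[segment][0] += converted
        totals[segment][1] += 1
    return {segment: conv / count for segment, (conv, count) in totals.items()}

def two_model_uplift(treated_rows, control_rows):
    """Fit one model per group and subtract the predicted outcomes."""
    treated_model = fit_rate_model(treated_rows)   # experimental group
    control_model = fit_rate_model(control_rows)   # control group
    shared = treated_model.keys() & control_model.keys()
    return {seg: treated_model[seg] - control_model[seg] for seg in shared}

treated = [("student", 1), ("student", 1), ("student", 0), ("student", 0),
           ("worker", 1), ("worker", 1), ("worker", 1), ("worker", 0)]
control = [("student", 1), ("student", 0), ("student", 0), ("student", 0),
           ("worker", 1), ("worker", 1), ("worker", 1), ("worker", 0)]
uplift = two_model_uplift(treated, control)
print(uplift)
```

In this made-up data the workers convert more often, but only the students show any uplift from being targeted.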

Answer 147

* Each model is trained to minimize the difference in expected customer value within a leaf, not to minimize the differences in uplift. * It does not mean that you're going to identify those who will have the highest lift.

Answer 148

1. Define uplift as a target variable 2. Run one predictive model with both treatment and control groups 1. At each split, minimize variations in uplift, not in expected value

Answer 149

**Two model approach** Each predictive model finds splits to optimize expected life-time value * Best split is the split that minimizes variation of life-time value within each group **Differential Approach** Uplift models find splits to optimize difference in treatment effect * Best split is the split that minimizes variation of treatment effects within each group (or that maximizes the variance of treatment effects across groups)

Answer 150

1. **Existence of a valid control group:** if there is no adequate control group it is not possible to create an uplift model 2. **Negative effects:** uplift models usually have a much better performance when some customers react negatively to intervention 3. **Negatively correlated outcomes:** when the outcome is negatively correlated with the incremental impact of a marketing activity, the benefit of uplift modeling may be larger.

Answer 151

* We can use predictive models for predicting outcomes based on individual attributes * However, models based only on observational data do not inform how users would react to a specific intervention * There is no distinction between individual attributes, which are mostly immutable, and the causal part of the model * Did the customers upgrade because they saw the ad, or were they going to upgrade anyways?

Answer 152

• Product recommendations – Different types of similarity can be used! • Customer segments • Personality types • Store and warehouse layout • Text mining • Reducing problem complexity and enhancing interpretation

Answer 153

Distortion is a measure for each cluster, calculated as the sum of the squared distances between each point and its cluster centroid. However, while k-means clustering converges to a stable solution, the actual solution depends on the starting centroid choice.
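
A minimal sketch of the distortion calculation, assuming the common definition as the sum of squared Euclidean distances between each point and its cluster centroid. The function names and the four example points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of the points in one cluster."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def distortion(points):
    """Sum of squared distances between each point and its cluster centroid."""
    c = centroid(points)
    return sum(squared_distance(p, c) for p in points)

def total_distortion(clusters):
    """Distortion summed over all clusters of a clustering."""
    return sum(distortion(cluster) for cluster in clusters)

two_clusters = [[(0, 0), (2, 0)], [(5, 5), (5, 7)]]
one_cluster = [[(0, 0), (2, 0), (5, 5), (5, 7)]]
print(total_distortion(two_clusters), total_distortion(one_cluster))  # 4.0 56.0
```

Splitting the four points into two tight clusters lowers the total distortion from 56.0 to 4.0, which is why distortion is compared across different numbers of clusters.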

Answer 154

The elbow method is used to determine the number of clusters at which the within-group sum of squares has decreased sharply and does not decrease much more.

Answer 155

We can interpret clusters by looking at a _typical cluster member_ or _typical characteristic_(s). Essentially, showing the cluster centroid.

Answer 156

Inverse Document Frequency (IDF) captures the idea that the fewer documents in which a term occurs, the more significant it is likely to be to the documents it does occur in. It is combined with Term Frequency (TF): the term counts within each document form the TF values for each term, and the document counts across the corpus form the IDF values. TF–IDF weighs each document's TF against the corpus-wide IDF. The TF–IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
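
A small sketch of the calculation, assuming raw counts for TF and the variant IDF(t) = 1 + ln(number of documents / number of documents containing t). Other IDF variants exist, and the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(document):
    """TF: how often each term occurs within one document."""
    return Counter(document.lower().split())

def inverse_document_frequencies(documents):
    """IDF: terms that occur in fewer documents get a higher value."""
    n_docs = len(documents)
    doc_counts = Counter()
    for doc in documents:
        doc_counts.update(set(doc.lower().split()))  # count each term once per document
    return {term: 1 + math.log(n_docs / count) for term, count in doc_counts.items()}

def tf_idf(documents):
    """One {term: TF * IDF} dictionary per document."""
    idf = inverse_document_frequencies(documents)
    return [{term: tf * idf[term] for term, tf in term_frequencies(doc).items()}
            for doc in documents]

corpus = ["jazz jazz piano", "rock guitar", "rock piano"]
weights = tf_idf(corpus)
print(weights[0])
```

In the first document "jazz" outweighs "piano" twice over: it appears more often in the document and in fewer documents of the corpus.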

Answer 157

Epistemic concerns 1. Inconclusive evidence 2. Inscrutable evidence 3. Misguided evidence Normative concerns 4. Unfair outcomes 5. Transformative effects 6. *Traceability*

Answer 158

Correlation does not imply causation. Algorithms produce knowledge that is yet uncertain and has not been proven. Leads to -\> Unjustified actions

Answer 159

* The connection between the data and conclusion is not accessible. A lack of knowledge regarding the data being used (e.g. relating to their scope, provenance and quality), but more importantly also the inherent difficulty in the interpretation of how each of the many data-points used by a machine-learning algorithm contribute to the conclusion it generates, cause practical as well as principled limitations. Leads to -\> Opacity

Answer 160

* "Garbage in, garbage out" - input data is biased or incomplete. * The output of an algorithm incorporates the values and assumptions that are present in the input data of the algorithm. In this way, the output can never exceed the input (e.g. cannot become more objective) * Leads to -\> Bias

Answer 161

* Should data-driven discrimination be allowed? * The decisions and actions resulting from the outcome of an algorithm should be examined according to ethical criteria and principles considering the 'fairness' of the decision or action (including its effects) Leads to -\> Discrimination

Answer 162

* Autonomous decision-making can be questionable and yet appear ethically neutral because they do not seem to cause any obvious harm. This is because algorithms can affect how we conceptualise the world, and modify its social and political organisation. Leads to -\> Challenges for autonomy and informational privacy

Answer 163

* How do you assign responsibility of an algorithm?

Answer 164

* Unjustified actions * Actions taken on the basis of inductive correlations have real impact on human interests independent of their validity. * Opacity * Lack of accessibility * Lack of comprehensibility * Information asymmetry * Even if people wanted to, they would not be able to explain how it works - algorithms can be too complex * Bias * Embedded social bias * Technical bias (constraints) * Emergent bias * Discrimination * Autonomy * Personalisation algorithms tread a fine line between supporting and controlling decisions by filtering which information is presented to the user based upon in-depth understanding of preferences, behaviours, and perhaps vulnerabilities to influence * Deciding which information is relevant is subjective * Personalisation algorithms reduce the diversity of information users encounter by excluding content deemed irrelevant or contradictory to the user’s beliefs * Informational privacy * While there are laws (GDPR) which protect the data of identifiable individuals, you can still be clustered into a group with which you do not want to be identified. * Moral responsibility * Black box - so nobody's responsible for the algorithm.

Answer 165

Steps: business understanding, data understanding, data preparation, data modeling, evaluation, and deployment

Answer 166

Frameworks of ethical dilemmas: * Six types of ethical concerns raised by algorithms (Mittelstadt et al. 2016) * Transparency about 5 aspects * Algorithmic opacity in 3 ways * Classifier discrimination in every CRISP-DM cycle * Different types and origins of bias * Assertion-based framework

Answer 167

Sensitivity is calculated as the number of correct positive predictions divided by the total number of positives.
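
As a quick worked example of this ratio, assuming predictions and labels coded as 1 (positive) and 0 (negative); the function name `sensitivity` is hypothetical.

```python
def sensitivity(predictions, labels):
    """Correct positive predictions divided by the total number of actual positives."""
    true_positives = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    total_positives = sum(1 for y in labels if y == 1)
    if total_positives == 0:
        raise ValueError("sensitivity is undefined without positive cases")
    return true_positives / total_positives

# Three actual positives, two of which the model finds.
example_sensitivity = sensitivity([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
print(example_sensitivity)
```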

Answer 168

* **Support:** Add a constraint on how much of the data a rule must apply to. * For example, place a constraint that such rules must apply to some minimum percentage of the data; let’s say that we require rules to apply to at least 0.01% of all transactions. * **Confidence:** Add a constraint on the strength of the association. * For example, we might say we require the strength to be above some threshold, such as 5%

Answer 169

**Support** of association is an indication of how frequently the items appear in the data.

Answer 170

The probability that B occurs when A occurs we’ve seen before; it is p(B|A), which in association mining is called the *confidence* or strength of the rule.

Answer 171

**Lift** answers the question - how much more frequently does this association occur than we would expect by chance? The lift of the co-occurrence of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other. As with other uses of lift we’ve seen, a lift greater than one is the factor by which seeing A “boosts” the likelihood of seeing B as well.

Answer 172

**Leverage** is the difference between the probability of the items being purchased together and the probability that would be expected if the items were purchased independently of each other.
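
The four association measures can be computed from a list of transactions. The sketch below estimates probabilities as simple fractions of baskets; the basket contents and the function names are hypothetical, and `confidence` and `lift` assume that each item occurs at least once.

```python
def support(transactions, items):
    """Fraction of transactions that contain all of the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, a, b):
    """p(B|A): how often B occurs in the transactions that contain A."""
    return support(transactions, {a, b}) / support(transactions, {a})

def lift(transactions, a, b):
    """p(A,B) divided by p(A) * p(B), the value expected under independence."""
    return support(transactions, {a, b}) / (support(transactions, {a}) * support(transactions, {b}))

def leverage(transactions, a, b):
    """p(A,B) minus p(A) * p(B)."""
    return support(transactions, {a, b}) - support(transactions, {a}) * support(transactions, {b})

baskets = [
    {"beer", "chips"}, {"beer", "chips"}, {"beer", "chips"},
    {"beer"}, {"chips"},
    {"milk"}, {"milk"}, {"milk"},
]
print(support(baskets, {"beer", "chips"}))   # 0.375
print(confidence(baskets, "beer", "chips"))  # 0.75
print(lift(baskets, "beer", "chips"))        # 1.5
print(leverage(baskets, "beer", "chips"))    # 0.125
```

A lift of 1.5 and a positive leverage both say the same thing here: beer and chips appear together more often than independence would predict.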

Answer 173

* Inherent randomness * Prediction is not 'deterministic' - there is no promise that people will behave according to our model. * Bias * No matter how much data is given, the model will never achieve maximum accuracy (unless your model takes ALL factors into account) * Variance * Model accuracy varies across different training sets

Answer 174

* **Formidable historical advantage** * **Unique intellectual property** * Novel techniques for mining the data * **Unique intangible collateral assets** * Implementation of the model * Company culture regarding implementing Data Science solutions (e.g. culture of experimentation) * **Superior data scientists** * You need at least one superstar data scientist to be able to evaluate the quality of the prospective hires * **Superior data science management** * Understand the business needs * Be able to communicate to techies and suits * Coordinate models with business constraints and costs * Anticipate outcomes of data science projects * They need to do this within the culture of a particular firm

Answer 175

* Engage academic data scientists (pay for their PhD) * Take top-notch data scientists as scientific advisors * Hire a third-party to conduct the data science

Answer 176

* Messy * Stopwords * Similar words for the same thing * Stemming * Which words are relevant?

Answer 177

● **Robustness and fault tolerance:** the serving layer uses replication under the hood to ensure availability when servers go down, and the batch and serving layers are human-fault tolerant because when a mistake is made you can fix your algorithm or remove the bad data and recompute the views from scratch. ● **Scalability:** both the batch and serving layers are easily scalable. They’re both fully distributed systems, and scaling them is as easy as adding new machines. ● **Generalization:** the architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset. ● **Extensibility:** adding a new view is as easy as adding a new function of the master dataset. Because the master dataset can contain arbitrary data, new types of data can be easily added. ● **Ad hoc queries:** the batch layer supports ad hoc queries innately. All the data is conveniently available in one location. ● **Minimal maintenance:** the main component to maintain in this system is Hadoop. Hadoop requires some administration knowledge, but it’s fairly straightforward to operate. ● **Debuggability:** in traditional databases, an output can replace the original input. In the batch and serving layers, the input is the master dataset and the output is the views. Likewise, you have the inputs and outputs for all the intermediate steps. Having the inputs and outputs gives you all the information you need to debug when something goes wrong.

Answer 178

* No. * This is very important, as random writes cause most of the complexity in databases. By not supporting random writes, these databases are simpler. * That simplicity makes them robust, predictable, easy to configure, and easy to operate.

Answer 179

Speed layer only looks at the most recent data.

Answer 180

● **Batch computation systems:** high throughput, high latency systems. Batch computation systems can do nearly arbitrary computations, but they may take hours or days to do so. The only batch computation system we’ll use is Hadoop. ● **Serialization frameworks:** provide tools and libraries for using objects between languages. They can serialize an object into a byte array from any language, and then deserialize that byte array into an object in any language. Serialization frameworks provide a Schema Definition Language for defining objects and their fields, and they provide mechanisms to safely version objects so that a schema can be evolved without invalidating existing objects. ● **Random-access NoSQL databases:** there are many of these databases. They sacrifice the full expressiveness of SQL and instead specialize in certain kinds of operations. They all have different semantics and are meant to be used for specific purposes. They’re not meant to be used for arbitrary data warehousing. ● **Messaging/queuing systems:** provide a way to send and consume messages between processes in a fault-tolerant and asynchronous manner. ● **Realtime computation systems:** high throughput, low latency, stream-processing systems. They can’t do the range of computations a batch-processing system can, but they process messages extremely quickly.

Answer 181

**Data normalization** refers to storing data in a structured manner to minimize redundancy and promote consistency.

Answer 182

I. **Intelligibility:** the lack of an explicit, interpretable model may pose a problem in some areas. There are two aspects: the justification of a decision and the intelligibility of an entire model. With k-NN, it is easy to describe how a single instance is decided: the set of neighbors participating in the decision can be presented, along with their contributions. What is difficult to explain more deeply is what knowledge has been mined from the data. Also, visualization is possible with two dimensions, but not with many dimensions. II. **Dimensionality and domain knowledge:** numeric attributes may have vastly different ranges, and unless they are scaled appropriately the effect of one attribute with a wide range can swamp the effect of another with a much smaller range. There is also a problem with having too many attributes, or many that are irrelevant to the similarity judgement. Some problems are said to be high dimensional (including irrelevant variables) and they suffer from the curse of dimensionality. The prediction is then confused by the presence of too many irrelevant attributes. To solve this, one can conduct feature selection (determining which features should be included in the data mining model). III. **Computational efficiency:** the main computational cost of a nearest neighbor method is in the prediction/classification step, when the database must be queried to find the nearest neighbor of a new instance.

Big Data Analytics Management Flashcards

(219 cards)