CH2 Flashcards

Question 1

Q

What are the phases of the data analytics lifecycle?

Answer

A

1- Discovery
2- Data Preparation
3- Model Planning
4- Model Building
5- Communicate Results
6- Operationalize

Question 2

Q

What are the key points in the discovery phase?

Answer

A

1- Learn about the business domain
2- Review the history of past attempts at improving the system.
3- Identify the possible and available data resources, such as feedback forms and sales data.
4- Frame the problem: which aspects need to be fixed to improve the system?
5- Identify key stakeholders: who has an impact on the system?
6- Develop an initial hypothesis: if we do this, then this will happen.
7- Identify data sources.

Question 3

Q

What is data preparation?

Answer

A

Working with data and performing analytics for the duration of the project.

Question 4

Q

What does ETLT stand for?

Answer

A

Extract TL/LT
Extract > Load > Transform
or
Extract > Transform > Load
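
The difference between the two orderings can be sketched in Python. This is a minimal illustration, not a real pipeline: the sample rows, the `transform` function, and the list standing in for a target store are all assumptions made for the example.

```python
# Hypothetical raw records, as if just extracted from a source system.
raw_rows = [{"name": " Alice ", "sales": "120"}, {"name": "bob", "sales": "80"}]


def transform(rows):
    """Clean each record: trim and title-case names, cast sales to int."""
    return [{"name": r["name"].strip().title(), "sales": int(r["sales"])} for r in rows]


def etl(rows):
    """Extract > Transform > Load: only cleaned rows reach the target store."""
    target = []
    target.extend(transform(rows))  # transform first, then load
    return target


def elt(rows):
    """Extract > Load > Transform: raw rows are loaded, then transformed there."""
    sandbox = list(rows)          # load the raw data first
    cleaned = transform(sandbox)  # transform inside the sandbox
    return sandbox, cleaned


etl_result = etl(raw_rows)
raw_copy, elt_result = elt(raw_rows)
```

Both routes end with the same cleaned rows; the ELT route also keeps the raw copy available for later inspection.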

Question 5

Q

What does “learning about the data” mean in the data preparation phase?

Answer

A

1- Clarify gaps
2- Identify external sources of the data
3- Identify access and availability

Question 6

Q

What is data conditioning, and in which phase does it occur?

Answer

A

It exists in phase 2 (data preparation), and it refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
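
A minimal Python sketch of the cleaning and normalizing steps, assuming a single numeric column with some unusable entries. The sample values and the choice of min-max scaling are illustrative assumptions; the card does not name a specific normalization method.

```python
# Hypothetical raw column with blanks and a non-numeric placeholder.
raw_values = ["12", " 7", "", "30", "n/a", "21"]


def clean(values):
    """Cleaning: keep only entries that parse as numbers."""
    cleaned = []
    for v in values:
        try:
            cleaned.append(float(v.strip()))
        except ValueError:
            continue  # skip blanks and placeholders such as "n/a"
    return cleaned


def min_max_normalize(values):
    """Normalizing: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


cleaned = clean(raw_values)
normalized = min_max_normalize(cleaned)
```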

Question 7

Q

What are the ethical issues that need to be considered while dealing with data in phase 2?

Answer

A

1- Privacy
2- NDAs: Non-Disclosure Agreements.
3- Anonymization: removing or altering personal identifiers
4- Bias
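
One way to alter personal identifiers is to replace them with salted hashes, sketched below. Strictly speaking this is pseudonymization rather than full anonymization, and the salt, field names, and sample record are hypothetical.

```python
import hashlib

SALT = "example-salt"  # hypothetical; a real project would keep this secret


def pseudonymize(record, id_fields=("name", "email")):
    """Return a copy of the record with identifier fields replaced by short hashes."""
    out = dict(record)
    for field in id_fields:
        if field in out:
            digest = hashlib.sha256((SALT + str(out[field])).encode("utf-8")).hexdigest()
            out[field] = digest[:12]  # truncated for readability
    return out


customer = {"name": "Jane Doe", "email": "jane@example.com", "purchases": 3}
masked = pseudonymize(customer)
```

The non-identifying field (`purchases`) is left untouched so the record stays usable for analysis.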

Question 8

Q

What are the tools for data preparation?

Answer

A

1- Python and Jupyter Lab
2- Data Wrangler
3- Tableau
4- Google DataPrep

Question 9

Q

What is data wrangling?

Answer

A

The process of transforming raw data into a more usable format by organizing and cleaning it.

Question 10

Q

What are the most commonly used examples of data wrangling?

Answer

A

1- Merging several data sources into one dataset for analysis.
2- Identifying gaps or empty cells in data and either filling or removing them.
3- Deleting irrelevant or unnecessary data.
4- Identifying outliers in data and either explaining the inconsistencies or deleting them to facilitate analysis.
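
The four operations above can be sketched with plain Python. The sample records, the field names, and the "10x the median" outlier rule are assumptions chosen for illustration only.

```python
from statistics import median

# Hypothetical sample data; field names are illustrative only.
sales = [
    {"id": 1, "amount": 100.0, "batch": "a"},
    {"id": 2, "amount": None, "batch": "a"},
    {"id": 3, "amount": 110.0, "batch": "b"},
    {"id": 4, "amount": 5000.0, "batch": "b"},
]
feedback = {1: "good", 2: "ok", 3: "good", 4: "poor"}

# 1. Merge two sources into one dataset, keyed on "id".
merged = [{**row, "feedback": feedback.get(row["id"])} for row in sales]

# 2. Fill empty cells with the median of the known amounts.
known = [r["amount"] for r in merged if r["amount"] is not None]
fill_value = median(known)
for r in merged:
    if r["amount"] is None:
        r["amount"] = fill_value

# 3. Delete a column that is irrelevant to the analysis.
for r in merged:
    del r["batch"]

# 4. Set aside outliers: any amount above 10x the median (an arbitrary rule).
cutoff = 10 * median(r["amount"] for r in merged)
outliers = [r for r in merged if r["amount"] > cutoff]
wrangled = [r for r in merged if r["amount"] <= cutoff]
```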

Question 11

Q

What are the most commonly used examples of data wrangling in businesses?

Answer

A

1- Detect corporate corruption/scams/cheating.
2- Support data security.
3- Ensure accurate and recurring data modeling results.
4- Perform customer behavior analysis.
5- Reduce time spent on preparing data for analysis.
6- Promptly recognize the business value of your data.
7- Find out data trends.

Question 12

Q

What is model planning in the big data lifecycle?

Answer

A

The team determines the methods, techniques, and workflow it intends to follow for the subsequent model-building phase.

Question 13

Q

What is the main goal of the team in phase 3 (Model Planning)?

Answer

A

To choose an analytical technique or short list of candidate techniques based on the end goal of the project.

Question 14

Q

In model building (phase 4), how do you divide the dataset?

Answer

A

Divide the dataset into 3 subsets:
Training data: used to train and develop the analytical models (60%).
Test data: set aside for evaluating the model's performance after training (20%).
Cross-validation data: used to tune hyperparameters and check for overfitting (20%).
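
A minimal sketch of the 60/20/20 split in Python. The function name, the fixed seed, and the shuffling step are assumptions for the example; the card only specifies the proportions.

```python
import random


def split_dataset(rows, train=0.6, test=0.2, seed=42):
    """Shuffle the rows and split them into training, test, and cross-validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (about 20%)
    return training, testing, validation


training, testing, validation = split_dataset(range(100))
```

With 100 rows this yields subsets of 60, 20, and 20 rows that do not overlap.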

Question 15

Q

What is the purpose of executing models in phase 4 (model building)?

Answer

A

Assess validity: Ensure the model accounts for most of the data and has predictive power.

Question 16

Q

What is the purpose of refining models in phase 4 (model building)?

Answer

A

1- Modify variable inputs and reduce correlated variables to optimize the results.
2- Confirm or deny the insights from phase 3 (model planning) regarding correlated variables.

Question 17

Q

What is the purpose of documenting the results, logic, and any assumptions made during model construction and execution in phase 4 (model building)?

Answer

A

It is crucial for understanding the context and decisions taken throughout the modeling process.

Question 18

Q

What is the benefit of a cross-validation subset in the training process of the model?

Answer

A

It helps prevent overfitting by testing how well a model generalizes to unseen data, especially when there’s not enough data to create a separate training and testing set.
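
When data is scarce, a common approach is k-fold cross-validation, where every sample is used for validation exactly once. The sketch below only generates the index splits; the function name and the choice of k = 5 are assumptions for the example.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, k=5))
```

A model would be trained and scored once per fold, and the scores averaged to estimate how well it generalizes.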

Question 19

Q

What is overfitting?

Answer

A

It is when the model performs very well on the training data but poorly on new unseen data because it is too closely fitted to the specific details of the training dataset.
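
A toy Python illustration, assuming made-up data that roughly follows y = 2x. The "memorizer" is an extreme, hypothetical case of overfitting: it stores every training point exactly and falls back to the training mean for anything it has not seen.

```python
# Hypothetical data that roughly follows y = 2x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
unseen = [(5, 10.1), (6, 11.9)]

lookup = dict(train)
train_mean = sum(y for _, y in train) / len(train)


def memorizer(x):
    """Overfitted model: perfect recall of training points, no real pattern."""
    return lookup.get(x, train_mean)


def simple_rule(x):
    """Simpler model that captures the general trend."""
    return 2 * x


def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


memorizer_train_error = mse(memorizer, train)
memorizer_unseen_error = mse(memorizer, unseen)
simple_unseen_error = mse(simple_rule, unseen)
```

The memorizer has zero error on the training data yet a much larger error on the unseen points than the simple rule, which is the pattern the card describes.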

Question 20

Q

What is drift in terms of big data?

Answer

A

It is the shift in the data, or in the relationships within it, over time that can make a model less effective if not addressed.

Question 21

Q

How can you keep drift from degrading the model's performance?

Answer

A

If drift is detected, the model may need to be retrained or updated with newer data to maintain performance.
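
One simple way to detect drift is to compare recent values of a feature against a baseline, as sketched below. The function name, the sample values, and the threshold of 3 standard deviations are assumptions for the example, not a prescribed method.

```python
from statistics import mean, stdev


def drift_detected(baseline, recent, threshold=3.0):
    """Flag drift when the recent mean lies more than `threshold` baseline
    standard deviations away from the baseline mean."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        return mean(recent) != base_mean
    return abs(mean(recent) - base_mean) / base_std > threshold


# Hypothetical feature values: at training time, and in two later windows.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5]
stable = [10.2, 9.8, 10.1]
shifted = [15.0, 16.0, 15.5]

needs_retraining_stable = drift_detected(baseline, stable)
needs_retraining_shifted = drift_detected(baseline, shifted)
```

A check like this could run on a schedule after deployment, with a positive result triggering retraining on newer data.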

Question 22

Q

What should the team do in phase 5 (Communicate Results)?

Answer

A

1- Compare outcomes to success criteria.
2- Present findings clearly
3- Evaluate success or failure.
4- Highlight the key findings.
5- Focus on business impact.
6- Document key insights and make recommendations for future work.
7- Reflect on what went well and what can be improved in future projects.

Question 23

Q

What is the main focus of phase 6 (Operationalize)?

Answer

A

Deploying the model in a real-world environment and monitoring its performance.

Big Data Analytics Lifecycle (23 cards)