What are the phases of the data analytics lifecycle?
1- Discovery
2- Data Preparation
3- Model Planning
4- Model Building
5- Communicate Results
6- Operationalize
What are the key points in the discovery phase?
1- Learn about the business domain
2- Review the history of past attempts at improving the system.
3- Know the possible and available resources for data, such as feedback forms and sales data.
4- Frame the problem: which aspects of the system need to be fixed to improve it?
5- Identify key stakeholders: who has an impact on the system?
6- Develop an initial hypothesis: if we do this, then that will happen.
7- Identify data sources.
What is data preparation?
Setting up an analytic sandbox in which the team can work with data and perform analytics for the duration of the project.
What does ETLT stand for?
Extract, followed by Transform and Load in either order (TL or LT):
ELT: Extract > Load > Transform
or
ETL: Extract > Transform > Load
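The two orderings differ only in where the transform step runs. A minimal sketch, assuming hypothetical `extract`, `transform`, and `load` helpers and a plain list standing in for the target store (none of these names come from a real tool):

```python
# Hypothetical helpers: names and data are illustrative only.
def extract():
    """Pull raw records from a source system (hard-coded here)."""
    return [{"name": " Alice ", "sales": "120"}, {"name": "Bob", "sales": "95"}]

def transform(records):
    """Clean each record: trim names and convert sales to integers."""
    return [{"name": r["name"].strip(), "sales": int(r["sales"])} for r in records]

def load(records, store):
    """Write records into the target store (a plain list here)."""
    store.extend(records)
    return store

# ETL: transform before loading, so only cleaned data reaches the store.
etl_store = load(transform(extract()), [])

# ELT: load the raw data first, then transform it inside the store.
# The raw copy stays available for later analysis.
raw_store = load(extract(), [])
elt_store = transform(raw_store)
```

Both paths end with the same cleaned records; ELT additionally keeps the untouched raw data.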
What does “learning about the data” mean in the data preparation phase?
1- Clarify gaps
2- Identify external sources of the data
3- Identify access and availability
What is data conditioning, and in which phase does it occur?
It exists in phase 2 (data preparation), and it refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
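The three activities named above can be sketched on a tiny column of values. This is only an illustration: the data is made up, and min-max scaling is one assumed choice of normalization among several.

```python
# Illustrative raw values; None marks a missing entry.
raw_ages = [25, None, 40, 31, None, 55]

# Cleaning: drop the missing values.
cleaned = [a for a in raw_ages if a is not None]

# Normalizing: min-max scale the values to the range [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(a - lo) / (hi - lo) for a in cleaned]

# Transforming: derive a categorical feature from the numeric one.
age_group = ["under_35" if a < 35 else "35_plus" for a in cleaned]
```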
What are the ethical issues that need to be considered while dealing with data in phase 2?
1- Privacy
2- NDAs: Non-Disclosure Agreements.
3- Anonymization: removing or altering personal identifiers
4- Bias
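As an illustration of altering personal identifiers, the sketch below replaces an email address with a salted hash. Strictly speaking this is pseudonymization rather than full anonymization, since the same input always maps to the same token. The salt value, field names, and records are all hypothetical.

```python
import hashlib

SALT = "replace-with-a-secret-value"  # hypothetical; keep out of source control

def pseudonymize(value, salt=SALT):
    """Replace an identifier with a shortened, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]

records = [
    {"email": "alice@example.com", "purchase": 120},
    {"email": "bob@example.com", "purchase": 95},
    {"email": "alice@example.com", "purchase": 30},
]

# Alter the personal identifier but keep the field needed for analysis.
safe_records = [
    {"customer_id": pseudonymize(r["email"]), "purchase": r["purchase"]}
    for r in records
]
```

Rows from the same customer still share a `customer_id`, so per-customer analysis remains possible without storing the email itself.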
What are the tools for data preparation?
1- Python and Jupyter Lab
2- Data Wrangler
3- Tableau
4- Google DataPrep
What is data wrangling?
The process of transforming raw data into a more usable format by organizing and cleaning it.
What are common examples of data wrangling?
1- Merging several data sources into one dataset for analysis.
2- Identifying gaps or empty cells in data and either filling or removing them.
3- Deleting irrelevant or unnecessary data
4- Identifying outliers in data and either explaining the inconsistencies or deleting them to facilitate analysis.
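The four steps above can be sketched in a few lines. All field names and values are made up, filling gaps with the median is one assumed strategy, and the outlier rule (more than three times the median) is an arbitrary threshold chosen for illustration.

```python
from statistics import median

# Illustrative inputs; every field name and value is hypothetical.
sales_rows = [
    {"id": 1, "sales": 120.0, "fax": "n/a"},
    {"id": 2, "sales": None, "fax": "n/a"},
    {"id": 3, "sales": 95.0, "fax": "n/a"},
    {"id": 4, "sales": 110.0, "fax": "n/a"},
    {"id": 5, "sales": 5000.0, "fax": "n/a"},
]
regions = {1: "north", 2: "south", 3: "north", 4: "east", 5: "south"}

# 1. Merge the two sources into one dataset.
merged = [{**row, "region": regions[row["id"]]} for row in sales_rows]

# 2. Fill empty cells with the median of the known values.
known = [r["sales"] for r in merged if r["sales"] is not None]
fill_value = median(known)
for r in merged:
    if r["sales"] is None:
        r["sales"] = fill_value

# 3. Delete an irrelevant column.
for r in merged:
    del r["fax"]

# 4. Separate outliers (assumed rule: more than 3x the median).
threshold = 3 * median([r["sales"] for r in merged])
outliers = [r for r in merged if r["sales"] > threshold]
cleaned = [r for r in merged if r["sales"] <= threshold]
```

Keeping `outliers` as its own list, rather than discarding the rows immediately, leaves room to explain them before deciding to delete.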
What are common business uses of data wrangling?
1- Detect corporate corruption/scams/cheating.
2- Support data security.
3- Ensure accurate and recurring data modeling results.
4- Perform customer behavior analysis.
5- Reduce time spent on preparing data for analysis.
6- Promptly recognize the business value of your data.
7- Find out data trends.
What is model planning in the big data lifecycle?
The team determines the methods, techniques, and workflow it intends to follow for the subsequent model-building phase.
What is the main goal of the team in phase 3 (Model Planning)?
To choose an analytical technique or short list of candidate techniques based on the end goal of the project.
In model building (phase 4), how do you divide the dataset?
Divide the dataset into three subsets:
Training data: used to fit the analytical models (60%).
Test data: set aside to evaluate model performance after training (20%).
Cross-validation data: used to tune hyperparameters and check for overfitting (20%).
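A 60/20/20 split can be sketched as follows. The function name `split_dataset`, its parameters, and the fixed seed are assumptions made for illustration; the rows are shuffled first so that each subset is a random sample.

```python
import random

def split_dataset(rows, train=0.6, test=0.2, seed=42):
    """Shuffle rows, then split them into training, test, and validation subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (about 20%)
    return training, testing, validation

training, testing, validation = split_dataset(range(100))
```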
What is the purpose of executing models in phase 4 (model building)?
Assess validity: Ensure the model accounts for most of the data and has predictive power.
What is the purpose of refining models in phase 4 (model building)?
1- Modify the input variables and reduce correlated variables to optimize the results.
2- Confirm or refute the insights from phase 3 (model planning) regarding correlated variables.
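One way to reduce correlated variables is to compute the Pearson correlation between inputs and keep only one variable from each highly correlated pair. The sketch below assumes made-up feature values and an arbitrary cutoff of 0.95.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Illustrative inputs: the two height columns carry the same information.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 41, 35, 27],
}

THRESHOLD = 0.95  # assumed cutoff for "highly correlated"

# Keep a variable only if it is not highly correlated with one already kept.
kept = []
for name, values in features.items():
    if all(abs(pearson(values, features[k])) < THRESHOLD for k in kept):
        kept.append(name)
```

Here `height_in` is dropped because it duplicates `height_cm`, while `age` is kept.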
What is the purpose of documenting the results, logic, and any assumptions made during model construction and execution in phase 4 (model building)?
It is crucial for understanding the context and decisions taken throughout the modeling process.
What is the benefit of a cross-validation subset in the training process of the model?
It helps prevent overfitting by testing how well a model generalizes to unseen data, especially when there’s not enough data to create a separate training and testing set.
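When data is scarce, a common form of this idea is k-fold cross-validation: every row is used for validation exactly once and for training in the other folds. The sketch below only produces the index splits; the helper name `k_fold_indices` is hypothetical, and fitting and scoring a model on each split is left out.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra row each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, 5))
```

Averaging the validation score across the folds gives an estimate of how well the model generalizes to unseen data.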
What is overfitting?
It is when the model performs very well on the training data but poorly on new unseen data because it is too closely fitted to the specific details of the training dataset.
What is drift in terms of big data?
It is the shift over time in the data, or in the relationships between variables, that can make a model less effective if not addressed.
How can the team keep drift from degrading model performance?
If drift is detected, the model may need to be retrained or updated with newer data to maintain performance.
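Detecting drift requires comparing recent data with the data the model was trained on. A very simple check, sketched below, flags drift when the mean of a feature moves too far from its reference mean. The function name, the windows, and the cutoff of two standard deviations are all assumptions; real monitoring would typically use more robust statistical tests.

```python
from statistics import mean, stdev

def drift_detected(reference, recent, max_shift=2.0):
    """Flag drift when the recent mean lies more than `max_shift`
    reference standard deviations away from the reference mean."""
    shift = abs(mean(recent) - mean(reference)) / stdev(reference)
    return shift > max_shift

# Illustrative values of one feature at training time and afterwards.
reference_window = [100, 102, 98, 101, 99, 100, 103, 97]
stable_window = [101, 99, 100, 102]
shifted_window = [120, 118, 122, 121]

needs_retraining = drift_detected(reference_window, shifted_window)
```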
What should the team do in phase 5 (Communicate Results)?
1- Compare outcomes to success criteria.
2- Present findings clearly
3- Evaluate success or failure.
4- Highlight the key findings.
5- Focus on business impact.
6- Document key insights and make recommendations for future work.
7- Reflect on what went well and what can be improved in future projects.
What is the main focus of phase 6 (Operationalize)?
Deploying the model in a real-world environment and monitoring its performance.