What are the 5 V’s of Big Data?
Volume - Terabytes, petabytes
Variety - structured, unstructured, multi-media
Velocity - batch, realtime, streams
Veracity - reliability, availability, completeness
Value - Insights, foresights, actions, decisions.
What is survivorship bias? Give an example
Survivorship bias is the error of drawing conclusions only from the cases that made it through a selection process, while ignoring the ones that did not. Example: in WW2 a statistician (Abraham Wald) was tasked with deciding where to put extra armor on planes to prevent them being shot down. He could only observe the planes that returned. Instead of putting the armor where the bullet holes were, he put it where they weren't: planes hit in those spots must not have come back, so those were the places where getting shot was worst for the plane.
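The reasoning can be checked with a small simulation. This is a minimal sketch, not historical data: the three sections, the survival probabilities and the function names are all assumptions made up for illustration.

```python
import random

def simulate_returning_planes(n_planes=10_000, seed=0):
    """Simulate one hit per plane and count hits on all planes vs. survivors."""
    rng = random.Random(seed)
    sections = ["engine", "fuselage", "wings"]
    # Hypothetical survival probabilities after a hit in each section.
    survival = {"engine": 0.2, "fuselage": 0.9, "wings": 0.9}
    all_hits = {s: 0 for s in sections}
    survivor_hits = {s: 0 for s in sections}
    for _ in range(n_planes):
        section = rng.choice(sections)       # hits are spread evenly
        all_hits[section] += 1
        if rng.random() < survival[section]:
            survivor_hits[section] += 1      # only these planes get inspected
    return all_hits, survivor_hits

def share(counts, key):
    """Fraction of the counted hits that fall on `key`."""
    return counts[key] / sum(counts.values())

all_hits, survivor_hits = simulate_returning_planes()
print("engine share, all planes:      ", round(share(all_hits, "engine"), 2))
print("engine share, returning planes:", round(share(survivor_hits, "engine"), 2))
```

Engine hits are under-represented among the returning planes even though every section is hit equally often, which is why the missing holes mark the spots that need armor.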
Advantages of DBMS? (SSRNHC)
DBMS in the big data era
Name the steps of the data science process
Describe the Problem Formulation Step
Define the problem clearly: explain its purpose, who will benefit and in what way, how you will measure success, and whether it is a problem that can be solved in a data-driven way.
Describe the Data Collection Step
What data do you need? Where will you store it, and for how long? Do you own the data? Is it readily available? What do you have to do to gain access to it? Are you allowed to make a copy of it?
Describe the Data Preparation Step
Describe the data analysis step
How will you analyze the data? Why is your approach the right approach? What methods, algorithms and models will you use?
Describe the Story telling step
How you present the results of your analytics. Is your presentation susceptible to misunderstandings? What are the key messages you want the audience to get from the findings?
What are the 4 Human centered principles for data analytics?
What questions should you ask of the data with regard to your analysis?
Who, What, When, Where, Why, How
Why might you want to do data sampling?
Volume of data may cause storage and accessibility problems
You need the convenience of working with a smaller set of data (laptop vs cluster)
A well-chosen smaller dataset has the same data properties as the full dataset.
What are the two broad types of data sampling?
Sampling without replacement - no duplicate items in the sample, sampled items are dependent
Sampling with replacement - an item added to the sample is not excluded from being added again, sampled items are independent
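A minimal sketch of the two types using Python's standard `random` module. The item list and seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
items = list(range(10))

# Without replacement: random.sample never picks the same position twice,
# so a list of distinct items yields a sample with no duplicates.
without_replacement = rng.sample(items, k=5)

# With replacement: random.choices draws each item independently,
# so the same item may appear more than once.
with_replacement = rng.choices(items, k=5)

print(without_replacement)
print(with_replacement)
```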
What are the different types of sampling?
Simple Random sampling - given a dataset of n items, choose a sample of s items where s < n, with every item equally likely to be picked (probability 1/n on each draw)
Weighted Random sampling - designing weights to capture a particular interest in the data
Stratified random sampling.
N items in the dataset, each belonging to one of s strata; we want to select k items from each stratum, giving m = sk items for the sample (with replacement)
For each stratum, choose each of its k items uniformly at random from that stratum
Some studies may want to preserve the proportion of the strata; other studies may want to oversample rare strata.
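The three sampling types above can be sketched as follows. This is one possible reading of the notes, using only the standard library; the function names, the example records and the weights are hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(items, s, rng):
    """Pick s distinct items; every item is equally likely to be chosen."""
    return rng.sample(items, s)

def weighted_random_sample(items, weights, s, rng):
    """Pick s items with replacement, favouring items with larger weights."""
    return rng.choices(items, weights=weights, k=s)

def stratified_sample(items, stratum_of, k, rng):
    """Pick k items uniformly at random (with replacement) from each stratum."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    return {name: rng.choices(members, k=k) for name, members in strata.items()}

rng = random.Random(0)
# Hypothetical records: (id, group), with group "rare" heavily outnumbered.
records = [(i, "rare" if i % 10 == 0 else "common") for i in range(100)]

simple = simple_random_sample(records, 5, rng)
weights = [5 if group == "rare" else 1 for _, group in records]
weighted = weighted_random_sample(records, weights, 5, rng)
stratified = stratified_sample(records, lambda r: r[1], 3, rng)
```

Taking the same k from every stratum oversamples the rare stratum. To preserve the original proportions instead, k would be made proportional to each stratum's size.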
What are the key questions to address during data collection?
What data do you need
Do you need all of it or a sample
Are you authorized to acquire the data
In what form are you going to store (ingest) the data
What are the different methods of ingestion? (TTCM)
Tabular data - csv, JSON
Text data - MongoDB
Complex structured data - SQL Relational DB
Multimedia data - Data lakes and cloud storage
What are the 7 Data types?
What are the dimensions of data quality? (FARCC)
Freshness
Accuracy
Reliability
Completeness
Consistency
What is data quality?
The degree to which data can be used for its intended purpose, and the degree to which data accurately represents the real world
What are the 3 things to check for to see if your data is fit for use?
Data Exploration - Discovering and understanding the quality characteristics of the data through exploratory techniques
Data Transformation - transforming the data through cleaning, curating, repairing
Data Enrichment - Enriching the data with data integration and imputation
Explain the difference between data integration and data imputation
Data integration is joining different kinds of data together that share a common attribute like time
Data imputation is filling in missing values in the data, recreating lost data.
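A minimal sketch of the difference, assuming two made-up sources that share a time attribute. Mean imputation is used only because it is the simplest strategy to show; the notes do not prescribe a method.

```python
# Hypothetical readings from two sources, keyed by a shared time attribute.
temperature = {"09:00": 18.0, "10:00": None, "11:00": 22.0}
humidity = {"09:00": 0.61, "10:00": 0.58, "11:00": 0.55}

# Data integration: join the two sources on the common attribute (time).
integrated = {
    t: {"temperature": temperature[t], "humidity": humidity[t]}
    for t in sorted(temperature.keys() & humidity.keys())
}

# Data imputation: fill missing temperatures with the mean of the known ones.
known = [row["temperature"] for row in integrated.values()
         if row["temperature"] is not None]
mean_temperature = sum(known) / len(known)
for row in integrated.values():
    if row["temperature"] is None:
        row["temperature"] = mean_temperature

print(integrated["10:00"])  # {'temperature': 20.0, 'humidity': 0.58}
```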
What are Hindsight, Insight and Foresight? Briefly explain
Hindsight - What happened (search and query)
Insight - Why is it happening (knowledge discovery)
Foresight - What will happen (prediction)
What are the three components of story telling?
Narrative
Data
Visual