Data Analysis Important Flashcards

(77 cards)

1
Q

What is Volume of Data?

A

The size of data generated and collected

2
Q

What is Variety of Data?

A

The array of data structures and types

3
Q

What is Velocity of Data?

A

The speed at which the data is produced

4
Q

What is Veracity of Data?

A

The quality and credibility of data

5
Q

What is Variability of Data?

A

The fluctuation and inconsistencies of the data

6
Q

What is Value of Data?

A

The business relevance of the data

7
Q

List out 5 Data Ecosystem components

A
  1. Sensing (Identify data sources and evaluate their quality)
  2. Collection (Gather data -> Batch Collection, Real-Time Collection)
  3. Wrangling (Transform data by cleaning, merging and restructuring)
  4. Analysis (Use algorithms, statistical models and visualization to find insights in the data)
  5. Storage (Store data for future use, considering factors like cost, security and performance)
8
Q

List out 7 stages of Data Lifecycle

A
  1. Generation (Data must first exist)
  2. Collection (Understand required data)
  3. Processing
    • Imputation : Handling missing values and outliers
    • Feature Engineering : Creating new variables
    • Transform data for machine learning
  4. Storage (SQL, Cloud)
  5. Management (Retrieving and tracking data when needed)
  6. Analysis and Visualisation (Algorithmic Approach, Machine Learning, Statistical Modeling)
  7. Interpretation (Making sense of analysis by applying domain expertise)
9
Q

What is the Critical Stage for Healthcare?

A

Data Collection

10
Q

What is the Critical Stage for Machine Learning?

A

Data Processing

11
Q

What is the Critical Stage for Business Decisions?

A

Analysis & Interpretation

12
Q

What is the Critical Stage for Cybersecurity?

A

Data Management

12
Q

What is the Critical Stage for IoT Applications?

A

Data Generation & Storage

13
Q

What hypothesis has one IV and one DV?

A

Simple Hypothesis

14
Q

What hypothesis has multiple IV and/or DV?

A

Complex Hypothesis

15
Q

What hypothesis states a proposition without directly measurable evidence?

A

Logical Hypothesis

16
Q

What hypothesis is tested with a t-test or ANOVA?

A

Statistical Hypothesis

17
Q

What hypothesis makes a claim opposing the null hypothesis?

A

Alternative Hypothesis

18
Q

What hypothesis proposes that 2 groups of observations are unassociated?

A

Null Hypothesis

19
Q

List out 6 steps in Hypothesis Development

A
  1. Ask a Question
  2. Preliminary Research
  3. Formulate your Hypothesis
  4. Refine your Hypothesis
  5. Phrase your Hypothesis
  6. Write your Null Hypothesis
20
Q

What is Descriptive Analysis?

A
  1. Describes “What happened?”
  2. Analyzes historical data to summarize and understand past events
21
Q

What is Diagnostic Analysis?

A
  1. “Why something happened?”
  2. Analyzes historical data to extract deeper, underlying patterns
22
Q

What is Predictive Analysis?

A
  1. Predicts outcomes based on historical patterns
23
Q

What is Prescriptive Analysis?

A
  1. “What should we do?”
  2. Prescribes actions and decisions, and provides recommendations
24
What is the most important analysis for Scientific Research?
Descriptive Analysis
25
What is the most important analysis for Business and Finance?
Predictive Analysis
26
What is the most important analysis for Data Science or ML?
Diagnostic Analysis
27
What is the most important analysis for Everyday Reasoning?
Descriptive Analysis
28
What is the most important analysis for Experimental Design?
Prescriptive Analysis
29
What is the purpose of using Column/Bar Charts?
Presents **numerical differences between categories**
30
What is the purpose of using Scatterplot/Bubble Charts?
Displays the distribution of (typically) 2 variables as points to **reveal correlation** ## Footnote Bubble charts add a 3rd or 4th variable by controlling the size and color of each point
31
What is the purpose of using Line/Area Chart?
Tracks the evolution of data over a continuous interval ## Footnote Ideal for representing changes in trends over **time**
32
What is the purpose of using Boxplot?
Visualises **distributions (and outliers) across different groups** (compared to the histogram)
33
What is the purpose of using Histogram?
Visualises **frequency distribution** of continuous variables.
34
What is the purpose of using Count/Dot Plot?
Scatterplot variant that counts number of observations for a combination of two categorical variables. Number is represented by size. Colors can represent categories (species) or continuous values (mean, median).
35
What is the purpose of using Pie Chart?
Represents the proportion of different categories
36
What is the purpose of using Radar/Polar Chart? ## Footnote Add examples of what variables can be used in what chart
Represents one or multiple variables stacked at an axis
37
What does a Variance of 1. = 0, 2. > 0 mean? ## Footnote Variance describes the tendency of values in a single variable to vary in a population
1. No spread or deviation 2. The values are spread out; the size represents how far apart they are
38
What does a Covariance of 1. = 0, 2. > 0, 3. < 0 mean? ## Footnote Covariance describes the tendency of values in 2 variables to change together in a population
1. No linear association 2. Move in same direction 3. Move in opposite direction
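The sign rules on the variance and covariance cards can be checked with base R's `var()` and `cov()`. This is a minimal sketch; the vectors are made-up values chosen only for illustration.

```r
# Toy vectors (hypothetical values, for illustration only)
x      <- c(1, 2, 3, 4, 5)
y_same <- 2 * x + 1     # rises when x rises
y_opp  <- 10 - 3 * x    # falls when x rises
flat   <- rep(7, 5)     # never changes

var_flat <- var(flat)        # = 0: no spread
var_x    <- var(x)           # > 0: values are spread out
cov_same <- cov(x, y_same)   # > 0: move in the same direction
cov_opp  <- cov(x, y_opp)    # < 0: move in opposite directions
cov_flat <- cov(x, flat)     # = 0: no linear association
```

Note that `var()` and `cov()` use the sample (n - 1) denominator.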
39
List out all 6 Mathematical/Statistical Analysis Techniques
1. Descriptive Analysis 2. Inferential Statistics 3. Regression Analysis 4. Factor Analysis 5. Discriminant Analysis 6. Time-Series Analysis
40
What do Descriptive Analysis / Statistics measure?
1. Central tendency (mean, mode, median) 2. Dispersion (variance, standard deviation, range) ## Footnote Considers historical data and describes current performance
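As a rough sketch, each of these measures maps onto a base R function. The example uses the built-in `mtcars` dataset; the `mpg_*` names are only for illustration.

```r
mpg <- mtcars$mpg   # built-in dataset

# Central tendency
mpg_mean   <- mean(mpg)
mpg_median <- median(mpg)
# Base R has no statistical mode function (mode() returns the storage type),
# so take the most frequent value; on ties this picks the smallest one
mpg_mode   <- as.numeric(names(which.max(table(mpg))))

# Dispersion
mpg_var   <- var(mpg)
mpg_sd    <- sd(mpg)      # square root of the variance
mpg_range <- range(mpg)   # c(min, max)
```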
41
What do Inferential Statistics measure?
1. Primarily compares means / medians between 2 groups of data 2. Parametric Tests: t-tests and ANOVA 3. Non-parametric Tests: Chi-square, Mann-Whitney U Test 4. Regression / Correlation Analysis ## Footnote Makes conclusions and inferences about a population of data based on a sample of data
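A minimal sketch of three of the tests named above, using base R's `stats` functions on the built-in `mtcars` data. Comparing `mpg` between transmission types (`am`) is an arbitrary choice made for illustration.

```r
# Parametric: two-sample t-test (Welch by default) on mean mpg by transmission
t_res <- t.test(mpg ~ am, data = mtcars)

# Non-parametric: Mann-Whitney U test (called the Wilcoxon rank-sum test in R)
# exact = FALSE because mpg contains tied values
u_res <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)

# Non-parametric: Chi-square test of independence on two categorical variables
# (small cell counts trigger an approximation warning, silenced here)
chi_res <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))

# A small p-value is evidence against the null hypothesis of no difference
t_res$p.value
```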
42
What does Regression Analysis determine?
Determines if a relationship exists between a set of variables
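One common way to check for such a relationship in R is a linear model fitted with `lm()`. The sketch below regresses `mpg` on `wt` in the built-in `mtcars` data; the choice of variables is only an example.

```r
# Fit mpg as a linear function of weight
fit <- lm(mpg ~ wt, data = mtcars)

slope     <- coef(fit)[["wt"]]        # direction and size of the relationship
r_squared <- summary(fit)$r.squared   # share of variance in mpg explained by wt

# A negative slope means heavier cars tend to have lower mpg
slope < 0
```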
43
What does Time-Series Analysis track?
Track evolution of data over time, models trends
44
What is the technique used in Discriminant Analysis?
1. Use a linear combination of variables as 'predictors' 2. Can take in factor scores ## Footnote Classifies observations into predefined groups. A type of supervised analysis technique
45
List out all 9 core packages inside `Tidyverse`
1. ggplot2 (Plotting) 2. dplyr (Filtering, Selecting, Mutating -> Data Manipulation) 3. tidyr (Reshaping and Cleaning -> Data Tidying) 4. readr (import csv -> Reading Data) 5. purrr (list/vectors -> Functional Programming) 6. tibble (Modern data frames) 7. stringr (Working with strings) 8. forcats (Working with factors) 9. lubridate (Working with dates and times)
46
How to create a lookup table for `A = Airline, B=Weather,C=Security`?
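```
# Named vector: names are the codes, values are the categories
lookup_tbl <- c("A" = "Airline", "B" = "Weather", "C" = "Security")

# Use the lookup table
df <- data.frame(cancellation = c("A", "A", "C", "B"))
# Index the named vector by code to create the new column
df$cancellation_cat <- unname(lookup_tbl[df$cancellation])
```
## Footnote
This is a base R technique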
47
How to mutate a new column for `A = Airline, B=Weather,C=Security`? | Using `dplyr` package (Left Join and Case When)
```
library(dplyr)
library(tibble)

# Data frame
df <- tibble(cancellation = c("A", "A", "C", "B"))

# Method 1: Left Join
# Convert the lookup table into a data frame
lookup_df <- tibble(
  cancellation_code = c("A", "B", "C"),
  cancellation_cat = c("Airline", "Weather", "Security")
)
# Left join the data frames
df |> left_join(lookup_df, by = c("cancellation" = "cancellation_code"))

# Method 2: Case When
df |>
  mutate(
    cancellation_cat = case_when(
      cancellation == "A" ~ "Airline",
      cancellation == "B" ~ "Weather",
      cancellation == "C" ~ "Security"
    )
  )
```
48
List out 5 common functions that the dplyr package provides
1. Subset columns with `select()` with helper functions 2. Subset rows with `filter(), distinct() and slice()` 3. Sort rows by variables with `arrange()` and occasionally with `desc()` 4. Create new columns with `mutate()` 5. Aggregate data with `summarise()` and summary functions ## Footnote Standard Flow 1. Start with `tbl` or `df`, then use dplyr verbs 2. Do not subset with the base R convention 3. Try to maintain the **data pipeline**
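The five verbs can be chained into one data pipeline. This is a sketch on the built-in `mtcars` data, assuming dplyr is installed and R is 4.1 or later (for `|>`); the `kpl` column and the `wt < 4` filter are made up for illustration.

```r
library(dplyr)

result <- mtcars |>
  select(mpg, cyl, wt) |>                        # subset columns
  filter(wt < 4) |>                              # subset rows
  mutate(kpl = mpg * 0.425144) |>                # new column: km per litre
  group_by(cyl) |>
  summarise(mean_kpl = mean(kpl), n = n()) |>    # aggregate per group
  arrange(desc(mean_kpl))                        # sort, highest first
```

Each step receives the output of the previous one, so no intermediate variables or base R subsetting are needed.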
49
Which pipe operator is native to R?
|> ## Footnote `%>%` needs the `magrittr` package (part of tidyverse). Both the `|>` and `%>%` pipe operators mean "then"
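A small sketch of what "then" means: the pipe passes the left-hand value as the first argument of the call on its right. The native pipe requires R 4.1 or later.

```r
# Nested call: read inside-out
direct <- sum(sqrt(c(1, 4, 9)))

# Piped call: take the vector, then take square roots, then sum
piped <- c(1, 4, 9) |> sqrt() |> sum()

identical(direct, piped)
```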
50
How to filter the flights data using `filter()`?
```
library(dplyr)
data(flights, package = "nycflights13")

filter(flights, month == 2, day == 1)
# or
flights |> filter(month == 2, day == 1)
```
## Footnote
subset() and filter() have similar features, but filter() is more optimised: `subset(flights, month == 2 & day == 1)`
51
How to group the flights by month and summarize the mean departure delay? (Use dplyr package)
```
library(dplyr)
data(flights, package = "nycflights13")

flights |>
  group_by(month) |>
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))  # na.rm goes inside mean()
```
## Footnote
group_by() + summarise() = aggregate() in base R. `na.rm = TRUE` ignores missing values
52
How to group the flights by month and summarize the mean departure delay? (Use base R)
```
data(flights, package = "nycflights13")
aggregate(dep_delay ~ month, data = flights, FUN = mean, na.rm = TRUE)
```
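To check that the dplyr and base R approaches agree without installing the flights data, the same pattern can be run on the built-in `mtcars` dataset. This is a sketch that assumes dplyr is installed; grouping `mpg` by `cyl` stands in for grouping `dep_delay` by `month`.

```r
library(dplyr)

# dplyr: group_by() + summarise()
by_dplyr <- mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# base R: aggregate() with a formula
by_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean, na.rm = TRUE)

# Both return one row per cyl value, in ascending order of cyl
all.equal(by_dplyr$mean_mpg, by_base$mpg)
```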
53
How to select the relevant columns only? (year, month, day)
```
library(dplyr)
data(flights, package = "nycflights13")

select(flights, year, month, day)
# or
flights |> select(year, month, day)
```
## Footnote
The `select()` function drops every column except the ones that are specifically named
54
How to create a new column, air time in hours, that divides air time by 60?
```
library(dplyr)
data(flights, package = "nycflights13")

flights |>
  mutate(
    air_time_hrs = air_time / 60
  )
```
55
List out 5 imputation techniques
1. Mean/Median/Mode Imputation 2. K-nearest neighbours (KNN) Imputation 3. Multiple Imputation (MICE) 4. Regression Imputation 5. Hot-Deck Imputation
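The first technique is simple enough to sketch in base R. The vector below is made up for illustration; the other techniques (KNN, MICE, regression, hot-deck) usually rely on dedicated packages and are not shown.

```r
# Hypothetical measurements with two missing values
x <- c(4, NA, 10, 6, NA)

# Mean imputation: replace each NA with the mean of the observed values
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Median imputation: same idea, less sensitive to outliers
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
```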
56
Why use `group_by()` with `mutate()`?
Perform summary calculations (mean, mode, median) per group while keeping every row
```
flights %>%
  group_by(carrier) %>%
  mutate(
    speed = distance / (air_time / 60),
    avg_speed = mean(speed, na.rm = TRUE)
  )
```
## Footnote
carrier ... avg_speed
AA ... 446.67
AA ... 446.67
DL ... 700
DL ... 700
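The difference from `summarise()` is easiest to see on a tiny data frame. This sketch assumes dplyr is installed; the `toy` data is hypothetical and is not taken from the flights dataset.

```r
library(dplyr)

# Hypothetical toy data: two flights per carrier
toy <- data.frame(
  carrier  = c("AA", "AA", "DL", "DL"),
  distance = c(400, 500, 700, 1400),   # miles
  air_time = c(60, 60, 60, 120)        # minutes
)

# group_by() + mutate(): the group mean is repeated on every row
with_mutate <- toy |>
  group_by(carrier) |>
  mutate(
    speed = distance / (air_time / 60),
    avg_speed = mean(speed, na.rm = TRUE)
  ) |>
  ungroup()

# group_by() + summarise(): one row per group
with_summarise <- toy |>
  group_by(carrier) |>
  summarise(avg_speed = mean(distance / (air_time / 60), na.rm = TRUE))
```

`with_mutate` keeps all 4 rows, while `with_summarise` collapses to 2.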
57
How to add a column Age, and a row with data ID=4, Name=David, Age=40?
```
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(Age = c(25, 30, 35))

# Column bind first, so the new row has the same columns as the data frame
df_cbind <- cbind(df1, df2)
# Row bind: rbind() needs matching column names on both sides
df_rbind <- rbind(df_cbind, data.frame(ID = 4, Name = "David", Age = 40))
```
58
What is the keyword for Histogram in question? (3)
1. Distribution 2. Frequency 3. Skewness ## Footnote Visualize the **frequency distribution** of **continuous variables**
59
What is the keyword for Boxplot in question? (6)
1. Distribution 2. Spread 3. Extreme Values 4. Outliers 5. Quartiles/Median 6. Across Different ## Footnote Comparing **spread / distribution of data** across **multiple categories**
60
What is the keyword for Scatterplot in question? (2)
1. Relationship 2. Correlation ## Footnote Visualizing **covariance** and reveal **relationship/correlation**
61
What is the keyword for Column/Bar Chart in question? (5)
1. Counts 2. Frequencies ## Footnote Visualize **numerical differences** between **categorical variables**
62
What is the keyword for Line Chart in question? (4)
1. Dates 2. Time 3. Evolution 4. Trends ## Footnote **Evolution of data** over a **continuous interval** (usually time)
63
What is the keyword for Pie Chart in question? (5)
1. Proportion 2. Percentage Share ## Footnote Visualizes the **proportion** of **different categories**
64
What is the basic syntax for ggplot2?
```
ggplot(data = , mapping = ) + geom_...()
```
| The ggplot() function is inside the ggplot2 package
## Footnote
Need at least these 3 things; the `data=` and `mapping=` argument names can be omitted.
geom_...() can be:
1. geom_bar(fill="steelblue") -> Bar Chart
2. geom_point(color="darkred",alpha=0.7) -> Scatterplot
3. geom_line(color="steelblue") -> Line Chart
4. geom_boxplot(fill="lightblue") -> Box Plot
5. geom_histogram(binwidth=2,fill="orange", color="black") -> Histogram
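A minimal sketch of the template filled in for four of the geoms, assuming ggplot2 is installed and using the built-in `mtcars` data. The object names `p_scatter`, `p_hist`, `p_box` and `p_bar` are only for illustration.

```r
library(ggplot2)

# Scatterplot: two continuous variables
p_scatter <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(color = "darkred", alpha = 0.7)

# Histogram: one continuous variable (argument names omitted)
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "orange", color = "black")

# Box plot: a continuous variable across categories
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue")

# Bar chart: counts per category
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue")
```

Printing any of these objects (for example `p_scatter`) draws the plot.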
65
Fill up the code
```
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  _1_(
    _2_ = "Fuel Efficiency vs. Vehicle Weight",
    x = "Weight (1000 lbs)",
    _3_ = "Miles Per Gallon (MPG)",
    _4_ = "Source: mtcars dataset"
  )
```
1. labs 2. title 3. y 4. caption
66
What is this chart?
```
ggplot(cyl_counts, aes(x = "", y = n, fill = cyl)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = n), position = position_stack(vjust = 0.5)) + # Add labels
  labs(title = "Proportion of Cars by Cylinder Type", fill = "Cylinders") +
  theme_void()
```
Pie Chart ## Footnote The **coord_polar(theta="y")** call changes the stacked bar chart into a circle
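The pie chart code uses a `cyl_counts` data frame that the card does not define. One plausible way to build it, shown as a hypothetical reconstruction that assumes dplyr is installed, is to count the cars per cylinder type in `mtcars`:

```r
library(dplyr)

# Hypothetical reconstruction of cyl_counts: one row per cylinder type,
# with the number of cars in column n
cyl_counts <- mtcars |>
  count(cyl) |>
  mutate(cyl = factor(cyl))   # a factor gives discrete fill colours
```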
67
Fill up the blanks
```
# 2. Scatterplot + a linear model trend line
ggplot(mtcars, aes(_1_ = wt, y = mpg)) +
  _2_() + # This draws the dots
  _3_(_4_ = "lm", se = FALSE, color = "red") + # This draws the red line
  labs(title = "Plot with geom_point() and geom_smooth()")
```
1. x 2. geom_point 3. geom_smooth 4. method
68
What is the keyword for Count / Dot in question? (5)
1. Categories (Education Level, Employment Type) ## Footnote Analyzing the relationship **between 2 categories**
69
What is the keyword that helps us to choose the most appropriate variable to analyze for a dependent variable? (4)
1. Relationship 2. Effects on the target 3. Coefficient Estimates 4. Importance
70
How to use `aggregate()` to find the mean Sepal.Length for each Species in the iris dataset?
`aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)`
## Footnote
     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588
71
How to explicitly use the `count()` function from dplyr instead of base R?
`dplyr::count()`
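The `package::function()` form calls the function without attaching the package. A short sketch on the built-in `mtcars` data, assuming dplyr is installed:

```r
# Count rows per cylinder type; the result has columns cyl and n
cyl_n <- dplyr::count(mtcars, cyl)

# sort = TRUE puts the largest group first
cyl_sorted <- dplyr::count(mtcars, cyl, sort = TRUE)
```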
72
How to use `aggregate()` to count the number of cars for each cyl (cylinders) value in the mtcars dataset?
``` aggregate(mpg ~ cyl, data=mtcars, FUN=length) ```
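In the `aggregate()` result the counts sit in a column still named `mpg`, which is easy to misread. As a sketch, base R's `table()` gives the same counts directly:

```r
# aggregate(): counts the mpg values within each cyl group
by_aggregate <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)

# table(): counts the rows for each distinct cyl value
by_table <- table(mtcars$cyl)

# Same numbers either way
all(by_aggregate$mpg == as.vector(by_table))
```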
73
73
74
Need more code like in mock test str(flights), color=carrier
75