Probability and Statistics Flashcards

(129 cards)

1
Q

What are descriptive statistics used for in a data analytics role?

A

To summarize data in terms of mean, median, interquartile range, standard deviation, or skewness

Descriptive statistics help in understanding how numerical and categorical data are distributed.

2
Q

Why is it important to understand descriptive statistics?

A

Because machine learning algorithms are based on statistical models

Understanding your data profile is critical for selecting the appropriate algorithm.

3
Q

What is the focus of descriptive statistics?

A

Describing useful information about a given data set

It does not require presenting all of the data.

4
Q

Name some Python packages used to calculate descriptive statistics.

A
  • Pandas
  • NumPy
  • SciPy

These packages provide functions to compute various descriptive statistics.
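
As a minimal sketch of how these packages fit together, the example below computes a few descriptive statistics on a small made-up Series. The data and variable names are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up values, used only to illustrate the calls.
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

mean_value = values.mean()            # pandas: arithmetic mean
median_value = np.median(values)      # NumPy: middle value of the ordered data
mode_value = values.mode().iloc[0]    # pandas: most frequent value
skew_value = stats.skew(values)       # SciPy: sample skewness

print(values.describe())              # count, mean, std, min, quartiles, max
print(mean_value, median_value, mode_value, skew_value)
```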

5
Q

What does the mean represent in descriptive statistics?

A

The average value of a dataset

It is calculated by summing all values and dividing by the number of values.

6
Q

What does the median indicate in a dataset?

A

The middle value when the data is ordered

It is less affected by outliers compared to the mean.
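
A short sketch of this robustness, assuming a made-up set of values to which one extreme value is added: the mean shifts noticeably while the median stays where it was.

```python
import pandas as pd

# Made-up values; the last entry of the second Series is a deliberate outlier.
typical = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 12, 100])

print("mean:  ", typical.mean(), "->", with_outlier.mean())
print("median:", typical.median(), "->", with_outlier.median())
```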

7
Q

What is the interquartile range (IQR)?

A

The range between the first quartile (Q1) and the third quartile (Q3)

It measures the spread of the middle 50% of the data.

8
Q

What does standard deviation measure?

A

The amount of variation or dispersion in a dataset

A low standard deviation indicates that data points tend to be close to the mean.

9
Q

What does skewness indicate about a dataset?

A

The asymmetry of the distribution of values

Positive skew indicates a longer tail on the right, while negative skew indicates a longer tail on the left.

10
Q

True or false: Plotting data can help in understanding its summary better.

A

TRUE

Visualizations can reveal patterns and insights that may not be apparent from summary statistics alone.

11
Q

Name the popular descriptive statistics for representing a ‘typical’ value for a dataset.

A
  • Mean (average)
  • Median
  • Mode

These measures help summarize the central tendency of the data.

12
Q

Why do we study descriptive statistics?

A

Because data analytics is based on statistical models

Understanding your data profile is critical for selecting the statistical test that best fits your data.

13
Q

What does the term distribution mean in data analytics or statistics?

A

A probability distribution

A distribution describes how the values of a variable are spread across their range and how often each value occurs.

14
Q

What is a variable?

A

Any characteristic, behaviour, category, or number that can be measured or counted

Variables are fundamental in data analysis and statistics.

15
Q

What are the two main types of numerical variables?

A
  • Discrete
  • Continuous

Numerical variables can be either whole numbers or values within a range.

16
Q

Define a continuous variable.

A

A variable that may contain any value in a range

Examples include spending amounts or time measured in seconds.

17
Q

Define a discrete variable.

A

A variable that has only particular valid values

Examples include shoe sizes or the number of times you went to the market.

18
Q

What are categorical variables?

A

Variables selected from a group of labels

Examples include coin toss outcomes or marital status.

19
Q

What is cardinality in the context of categorical variables?

A

The number of different labels a categorical variable can have

For example, a coin toss has two outcomes: heads or tails.

20
Q

What is the difference between ordinal and nominal variables?

A
  • Ordinal: Related to an order (e.g., days of the week)
  • Nominal: No specific order (e.g., preferred color)

Ordinal variables have a ranking, while nominal variables do not.

21
Q

True or false: A test result can be encoded as a number, such as 0 for fail and 1 for pass, making it a discrete variable.

A

FALSE

It is a categorical variable that was encoded, not a discrete variable.

22
Q

What is an example of a situation where a categorical variable is encoded as a number?

A

A test result encoded as 0 for fail and 1 for pass

This illustrates how categorical data can be represented numerically.

23
Q

What is a unique ID in the context of data variables?

A

A set of numbers generated for identification purposes

IDs are typically categorical variables, even if they are numeric.

24
Q

What is the purpose of descriptive statistics?

A

Describing and summarising the data

This includes a quantitative approach (numerical summary) and graphical representation (plots).

25
What type of analysis is performed when describing **one variable**?
Univariate analysis ## Footnote This focuses on a single variable's characteristics.
26
What type of analysis is conducted when studying **two variables** at once?
Bivariate analysis ## Footnote This examines the relationship between two variables.
27
What is the term for studying **more than two variables**?
Multivariate analysis ## Footnote This involves analyzing multiple variables simultaneously.
28
Descriptive statistics can be broken down into two major studies: **central tendency** and _______.
Variability ## Footnote These studies help summarize the dataset's characteristics.
29
Define **population** in the context of data analysis.
The complete set of elements (people, items, events) that the analysis is about ## Footnote Collecting data from an entire population is often impractical because of its size.
30
What is a **sample**?
A subset of a population ## Footnote A sample preserves the statistical characteristics of the population when it is representative (for example, randomly drawn) and large enough.
31
What happens to the error when the **sample size** increases?
The sampling error decreases ## Footnote A larger sample generally gives more precise inferences about the population, provided the sample is representative.
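One common way to quantify this effect, not named in the cards, is the standard error of the mean: the sample standard deviation divided by the square root of the sample size. The sketch below assumes a hypothetical helper `standard_error` and a made-up standard deviation of 10.

```python
import numpy as np

def standard_error(sample_std: float, n: int) -> float:
    """Standard error of the mean for a sample of size n (hypothetical helper)."""
    return sample_std / np.sqrt(n)

# Quadrupling the sample size halves the standard error.
se_small = standard_error(10.0, 100)
se_large = standard_error(10.0, 400)
print(se_small, se_large)
```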
32
What are **outliers**?
Data points that differ from most of the data ## Footnote Outliers can indicate errors or new behaviors in the data.
33
What might cause **outliers** in data?
* New natural behavior of the data * Error in the data collection process * Human error or bias * Equipment not being calibrated ## Footnote Understanding the context of data collection is crucial for interpreting outliers.
34
True or false: Outliers can exist in data analysis.
TRUE ## Footnote Recognizing the presence of outliers is important for accurate data interpretation.
35
What is the purpose of **central tendency** in data analysis?
To provide figures that summarize the data ## Footnote Central tendency includes measures such as mean, median, and mode.
36
What are the **measures of variability** mentioned in the text?
* Variance * Standard Deviation * Skewness * Kurtosis * Percentiles and Quartiles * Range ## Footnote These measures describe how the data is spread; skewness and kurtosis additionally describe the shape of the distribution.
37
What does **variance** show?
The spread of the data around the mean ## Footnote It indicates how far/close the data points are from the mean.
38
What is the symbol for **standard deviation**?
σ ## Footnote σ denotes the population standard deviation; s is commonly used for a sample. Standard deviation measures the dispersion of the data around the mean in either direction.
39
How can you calculate **standard deviation** in a DataFrame or Series?
* Using the square root of variance * Using the method .std() ## Footnote np.sqrt() can be used for the square root calculation.
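The two routes can be sketched as follows, using a made-up Series. One detail worth knowing: pandas `.std()` and `.var()` default to the sample formula (`ddof=1`), so pass `ddof=0` if the population formula is wanted.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the population standard deviation is exactly 2.
scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_from_method = scores.std()               # sample std (ddof=1 by default)
std_from_variance = np.sqrt(scores.var())    # square root of sample variance
population_std = scores.std(ddof=0)          # population formula

print(std_from_method, std_from_variance, population_std)
```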
40
What does **skewness** measure?
The asymmetry of the data ## Footnote A distribution is symmetric when it looks the same to the left and right of the center point.
41
What happens in a distribution with **positive skewness**?
The tail on the right side is longer ## Footnote This indicates that there are more extreme values on the higher end.
42
What does **negative skewness** indicate?
The tail of the left side of the distribution is longer ## Footnote This suggests more extreme values on the lower end.
43
How can you calculate **skewness** in a DataFrame or Series?
Use .skew() ## Footnote This method provides the skewness value for the data.
44
What does **kurtosis** relate to?
Studying the tails of the distribution ## Footnote It reflects how heavy the tails are, and therefore how prone the distribution is to outliers.
45
What does high **kurtosis** indicate?
Significant tails or many outliers ## Footnote This suggests a distribution with extreme values.
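A small sketch of skewness and kurtosis on made-up data. `.skew()` is the method named in the cards; `.kurt()` is the corresponding pandas method for (excess) kurtosis and is an addition here, not something the cards mention.

```python
import pandas as pd

# Made-up data: one large value creates a long right tail.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
left_tailed = 21 - right_tailed              # mirror image: long left tail
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])   # no extreme values

print("skew (right tail):", right_tailed.skew())   # positive
print("skew (left tail): ", left_tailed.skew())    # negative
print("kurtosis (heavy tail):", right_tailed.kurt())
print("kurtosis (flat):      ", flat.kurt())
```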
46
What is a **percentile**?
The value below which a given percentage of observations fall ## Footnote For example, the 50th percentile is the value below which 50% of the observations may be found.
47
What are the three **quartiles** mentioned?
* First quartile (Q1 or 25th percentile) * Second quartile (Q2 or 50th percentile or median) * Third quartile (Q3 or 75th percentile) ## Footnote Quartiles divide the ordered data into four equal parts.
48
What is the **interquartile range (IQR)**?
The difference between the third and first quartiles (Q3 - Q1) ## Footnote It gives the range that contains the middle 50% of the data.
49
How can you calculate **quartiles** in a DataFrame or Series?
Use .quantile() ## Footnote This method provides the quartile values for the data.
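A minimal sketch of `.quantile()` on a made-up Series, also deriving the interquartile range from the results. pandas interpolates linearly between data points by default.

```python
import pandas as pd

# Made-up values, already ordered for readability.
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Passing a list returns one value per requested quantile.
q1, q2, q3 = data.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

print(q1, q2, q3, iqr)
```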
50
What is the **range** in statistics?
The difference between the maximum and minimum values ## Footnote It provides a measure of the spread of the data.
51
How can you find the **minimum value** in your numerical data?
Use .min() ## Footnote This method retrieves the smallest value in the dataset.
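Combining `.min()` with its counterpart `.max()` gives the range described in the previous card. A short sketch on made-up values:

```python
import pandas as pd

# Made-up temperature readings.
temps = pd.Series([12.5, 15.0, 9.5, 21.0, 18.5])

minimum = temps.min()
maximum = temps.max()
value_range = maximum - minimum   # range = max - min

print(minimum, maximum, value_range)
```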
52
What is the purpose of a **statistical test**?
To measure a test statistic that explains the relationship between variables ## Footnote Statistical tests are essential for analyzing data and drawing conclusions.
53
Name the **types of statistical tests** considered in data analytics.
* Normality tests * Correlation tests * Parametric tests * Non-parametric tests ## Footnote These tests are grouped based on their objectives.
54
What is the **Shapiro-Wilk test** used for?
To evaluate if data is normally distributed ## Footnote It is a type of normality test.
55
List three **correlation tests** mentioned.
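* Pearson's * Spearman's * Chi-Squared Test ## Footnote Pearson's and Spearman's measure correlation between numeric or ranked variables; the chi-squared test evaluates association between categorical variables.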
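As an illustrative sketch, assuming SciPy's `pearsonr` and `spearmanr` and made-up data: when y grows monotonically but not linearly with x, Spearman's rank correlation is 1 while Pearson's linear correlation is below 1.

```python
import numpy as np
from scipy import stats

# Made-up data: y increases with x, but along a curve rather than a line.
x = np.array([1, 2, 3, 4, 5])
y = x ** 3

pearson_r, pearson_p = stats.pearsonr(x, y)        # linear relationship
spearman_rho, spearman_p = stats.spearmanr(x, y)   # monotonic (rank) relationship

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```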
56
What are **parametric tests** used for?
To evaluate if data from distinct groups are similar or different ## Footnote Examples include Student's t-test and ANOVA.
57
Name two **non-parametric tests** mentioned.
* Mann-Whitney U Test * Wilcoxon Test ## Footnote Non-parametric tests do not assume a specific data distribution.
58
What does the **Kruskal-Wallis test** evaluate?
If data from distinct groups are similar or different ## Footnote It is a non-parametric alternative to ANOVA.
59
Why do we need to use **statistical tests**?
* Determine differences or similarities between groups * Evaluate if a predictor variable has statistical importance to a target variable ## Footnote Statistical tests provide insights into data relationships.
60
Fill in the blank: A **test statistic** is a number that explains the _______ between your variables in the test.
relationship ## Footnote Understanding the relationship is crucial for interpreting statistical tests.
61
What does the **Chi-Squared Test** measure?
Whether there is a significant difference between expected and observed frequencies in categorical variables ## Footnote It assesses how well observed data fits with expected data.
62
What is the **null hypothesis** in hypothesis testing?
No difference in frequency or proportion of occurrences in each category (as in a chi-squared test) ## Footnote More generally, it is the default position of no effect or no difference.
63
What is the **alternate hypothesis** in hypothesis testing?
There is a difference in frequency or proportion of occurrences in each category (as in a chi-squared test) ## Footnote It typically represents the research question being tested.
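The chi-squared cards above can be sketched with SciPy's `chi2_contingency`, which compares observed counts in a contingency table with the counts expected if the two categorical variables were unrelated. The table below is made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 table of observed counts, e.g. rows = group, columns = outcome.
observed = np.array([[30, 10],
                     [10, 30]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print("expected counts under H0:\n", expected)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.5f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```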
64
What is the purpose of **hypothesis testing**?
Forming opinions or conclusions from collected data ## Footnote It involves comparing observed data to expected data.
65
What is the **Significance Level** (alpha)?
Probability of rejecting the null hypothesis when it is true (a Type I error) ## Footnote Commonly set at 5%, meaning a 5 in 100 chance of a false positive.
66
What does a **test statistic** represent?
A number summarizing how far the observed data departs from what the null hypothesis predicts ## Footnote It varies in calculation depending on the type of statistical test.
67
What is a **p-value**?
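The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true ## Footnote A smaller p-value indicates stronger evidence against the null hypothesis. It is not the probability that the null hypothesis is true.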
68
If the **p-value** is lower than the alpha, what can be concluded?
Enough evidence to reject the null hypothesis ## Footnote This indicates that the observed effect is statistically significant.
69
What does the **Shapiro-Wilk test** assess?
Whether a given data set is normally distributed ## Footnote The null hypothesis states that the population is normally distributed.
70
What happens if the p-value in the Shapiro-Wilk test is less than the chosen alpha level?
Reject the null hypothesis ## Footnote This suggests that the data is not normally distributed.
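A hedged sketch of this decision rule using SciPy's `shapiro`. The two samples are randomly generated for illustration (the seed, sizes, and distribution parameters are arbitrary assumptions), and alpha is set to 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=50, scale=5, size=200)   # drawn from a normal distribution
skewed_sample = rng.exponential(scale=5, size=200)      # clearly not normal

alpha = 0.05
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    stat, p_value = stats.shapiro(sample)
    # H0: the population the sample comes from is normally distributed.
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{name}: W = {stat:.3f}, p = {p_value:.4f} -> {decision}")
```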
71
What are the two main types of **statistical tests**?
* Parametric tests * Nonparametric tests ## Footnote The choice between these tests depends on the normality of the data.
72
When should you use a **parametric test**?
When the data is **normally distributed** ## Footnote Parametric tests assume that the underlying data follows a normal distribution.
73
When should you use a **nonparametric test**?
When the data is **not normally distributed** ## Footnote Nonparametric tests do not assume a specific distribution of the data.
74
What is a **t-test** used for?
To test the difference between **two sample means** ## Footnote It tests if the difference in the means is 0.
75
Who developed the **t-test**?
William Sealy Gosset of **Guinness's Brewery** ## Footnote Gosset published under the pseudonym "Student", which is why the t-test is also known as Student's t-test.
76
What is the null hypothesis in a **t-test**?
There is **no difference** between the population means of the two groups ## Footnote The alternative hypothesis states that the means differ.
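A minimal sketch of an independent two-sample t-test with SciPy's `ttest_ind`. The measurements are made up, and the test assumes both groups are roughly normally distributed.

```python
from scipy import stats

# Made-up measurements from two independent groups.
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [6.1, 5.9, 6.0, 6.2, 5.8]

# H0: the two population means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```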
77
What is a **Paired Student’s t-test** used for?
To test the difference between the **means of two related samples** ## Footnote The samples should be dependent or paired.
78
What type of samples are required for a **Paired Student’s t-test**?
The samples should be **dependent (paired)** ## Footnote An example is testing the same group before and after an intervention.
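For paired data such as a before-and-after measurement on the same group, SciPy provides `ttest_rel`. The scores below are made up; each position in the two lists refers to the same person.

```python
from scipy import stats

# Made-up test scores for the same six people, before and after an intervention.
before = [70, 65, 80, 75, 60, 72]
after = [74, 70, 83, 80, 66, 75]

# H0: the mean of the paired differences (after - before) is zero.
t_stat, p_value = stats.ttest_rel(after, before)

print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
```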
79
What does **ANOVA** stand for?
Analysis of **Variance** ## Footnote ANOVA compares the means of three or more groups by analyzing the variation between and within groups.
80
What is required for the data in an **ANOVA** test?
The data in each group should be approximately **normally distributed** ## Footnote ANOVA also assumes independent observations and similar variances across groups.
81
What is an example dataset that could be used in an **ANOVA** test?
Pain threshold levels across different people with varying **hair colors** ## Footnote This dataset can show mean variations among groups.
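A one-way ANOVA can be sketched with SciPy's `f_oneway`. The group names echo the hair-color example, but the numbers are invented purely for illustration and are not real measurements.

```python
from scipy import stats

# Invented pain-threshold scores for three groups (illustration only).
light_blond = [1, 2, 3, 4, 5]
dark_blond = [2, 3, 4, 5, 6]
brunette = [10, 11, 12, 13, 14]

# H0: all group means are equal.
f_stat, p_value = stats.f_oneway(light_blond, dark_blond, brunette)

print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
```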
82
What is the **Mann-Whitney U Test** used for?
To determine differences between two groups where at least one group is not normally distributed ## Footnote It is a nonparametric test and requires independent (unpaired) samples.
83
True or false: The **Mann-Whitney U Test** is a parametric test.
FALSE ## Footnote The Mann-Whitney U Test is a nonparametric test.
84
What type of samples are required for the **Mann-Whitney U Test**?
* Independent samples * Unpaired samples ## Footnote This test is used when at least one group is not normally distributed.
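A sketch of the Mann-Whitney U test using SciPy's `mannwhitneyu` on two made-up independent samples, one of which contains an extreme value that would make a normality assumption doubtful.

```python
from scipy import stats

# Made-up independent samples; group_b contains an extreme value (40).
group_a = [1, 2, 3, 4, 5, 6, 7, 8]
group_b = [9, 10, 11, 12, 13, 14, 15, 40]

# H0: the two groups come from the same distribution.
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"U = {u_stat}, p = {p_value:.5f}")
```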
85
What is a **Wilcoxon Test** used for?
To analyze paired samples when at least one sample is not normally distributed ## Footnote It is a non-parametric test and requires dependent (paired) samples.
86
What type of samples are required for the **Wilcoxon Test**?
* Dependent samples * Paired samples ## Footnote This test is suitable for matched pairs, such as before and after measurements.
87
Give an example of when to use a **Wilcoxon Test**.
When examining the difference between scores on a test before and after a training intervention ## Footnote This involves the same group being tested twice.
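The before-and-after scenario can be sketched with SciPy's `wilcoxon`, which works on the paired differences. The scores are made up; each position in the two lists refers to the same participant.

```python
from scipy import stats

# Made-up scores for the same eight participants before and after training.
before = [60, 62, 65, 70, 58, 75, 68, 72]
after = [63, 67, 66, 76, 62, 82, 76, 81]

# H0: the paired differences are symmetrically distributed around zero.
w_stat, p_value = stats.wilcoxon(after, before)

print(f"W = {w_stat}, p = {p_value:.4f}")
```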
88
What is the **Kruskal-Wallis test** used for?
To determine differences between three or more groups when at least one distribution is not normally distributed ## Footnote It is a nonparametric alternative to a one-way ANOVA.
89
True or false: The **Kruskal-Wallis test** is a parametric alternative to a one-way ANOVA.
FALSE ## Footnote The Kruskal-Wallis test is a nonparametric test.
90
What is a key characteristic of the **Kruskal-Wallis test**?
It is used for three or more groups ## Footnote It assesses whether there are differences among groups when at least one distribution is not normally distributed.
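A sketch of the Kruskal-Wallis test with SciPy's `kruskal` on three made-up groups. Because the test works on ranks, the extreme value in the third group does not distort the result the way it could in an ANOVA.

```python
from scipy import stats

# Made-up groups; group_3 contains an extreme value (100).
group_1 = [1, 2, 3, 4, 5]
group_2 = [6, 7, 8, 9, 10]
group_3 = [11, 12, 13, 14, 100]

# H0: all groups come from the same distribution.
h_stat, p_value = stats.kruskal(group_1, group_2, group_3)

print(f"H = {h_stat:.2f}, p = {p_value:.5f}")
```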
91
What is **A/B testing**?
A method of comparing two versions of a product to determine which one performs better ## Footnote Widely used in web development, marketing, and product design.
92
In A/B testing, what are the two versions typically referred to as?
* Version A * Version B ## Footnote These versions are shown to different segments of users simultaneously.
93
What is the purpose of **A/B testing**?
To make data-driven decisions that can lead to improvements in user experience and business outcomes ## Footnote It helps identify which version of a product performs better based on specific metrics.
94
What is an example of an **A/B test** involving background color?
Testing a webpage with: * Version A: White background * Version B: Light blue background ## Footnote The objective is to determine which background color leads to a higher conversion rate.
95
What is the first step in designing an **A/B test**?
Identify the objective clearly ## Footnote The objective should be specific, measurable, attainable, relevant, and time-bound (SMART).
96
What is a potential objective for an **E-commerce website** in A/B testing?
Increase the conversion rate on product pages ## Footnote For example, testing whether adding customer reviews increases sales.
97
What is a potential objective for an **email marketing campaign** in A/B testing?
Improve the open rate of a promotional email ## Footnote For example, testing two different subject lines to see which leads to a higher open rate.
98
What is a potential objective for a **mobile app** in A/B testing?
Enhance user engagement ## Footnote For example, testing whether adding a daily motivational quote feature increases average time spent on the app.
99
What is a potential objective for **brick-and-mortar retail** in A/B testing?
Increase sales of a product in the store ## Footnote For example, testing whether showing an ad at the store's entrance increases sales.
100
What is the significance of **randomization** in A/B testing?
Visitors are randomly assigned to different versions to ensure unbiased results ## Footnote This helps in accurately measuring the performance of each version.
101
What is the importance of **sample size** in A/B testing?
To ensure a large enough sample size to detect a meaningful difference ## Footnote For example, having 10,000 visitors in each version can provide reliable data.
102
True or false: A/B testing can only be applied to webpages.
FALSE ## Footnote A/B testing can be applied to various products, including apps and marketing campaigns.
103
What did Google test in their A/B test regarding link colors?
41 different shades of blue for its link colors ## Footnote The objective was to identify the most effective color for increasing click-through rates.
104
What was the objective of Facebook's A/B test on ad formats?
To increase user interaction with ads ## Footnote Users were randomly assigned to see one of two ad formats.
105
What did Amazon test in their A/B testing regarding product pages?
Different layouts of the product detail page ## Footnote The objective was to see which layout led to higher conversion rates.
106
What is the first step after clarifying the **objective** in A/B testing?
Create hypotheses ## Footnote A hypothesis is a statement that can be tested and will either be supported or rejected by the test results.
107
Define a **hypothesis** in the context of A/B testing.
A statement that can be tested ## Footnote Each hypothesis should be specific and measurable.
108
What are the two main approaches in A/B testing for hypothesis statements?
* 2-sample cases * 1-sample cases ## Footnote A two-sample hypothesis test compares the performance metric of two different versions.
109
What does the **null hypothesis (H0)** state in a two-sample A/B test?
No significant difference between conversion rates ## Footnote Any observed difference is due to random chance.
110
What is the **alternative hypothesis (H1)** in A/B testing?
There is a significant difference between conversion rates ## Footnote The observed difference is not due to random chance.
111
In a one-sample test, what does the null hypothesis state?
No significant difference between the sample statistic and the known industry standard ## Footnote Any observed difference is due to random chance.
112
What is the **Minimum Detectable Effect (MDE)**?
The smallest improvement you want to detect ## Footnote For example, a 5 percent increase in the conversion rate.
113
What is the significance level (alpha) typically set at in A/B testing?
0.05 ## Footnote This means you are willing to accept a 5 percent chance of a false positive.
114
What does **power (1 - Beta)** represent in A/B testing?
The probability of detecting a true effect ## Footnote Typically set at 0.8, meaning an 80% chance.
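The MDE, alpha, and power together determine how many users each group needs. The sketch below uses a standard normal-approximation formula for comparing two proportions. The function name, the 10% baseline rate, and the reading of the MDE as an absolute change in percentage points are all assumptions made for illustration.

```python
import math
from scipy.stats import norm

def sample_size_per_group(baseline: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided two-proportion test.

    baseline: conversion rate of the control version (e.g. 0.10)
    mde: minimum detectable effect as an absolute change (e.g. 0.02)
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n_large_effect = sample_size_per_group(0.10, 0.02)
n_small_effect = sample_size_per_group(0.10, 0.01)
print(n_large_effect, n_small_effect)
```

Halving the MDE roughly quadruples the required sample, which is why the smallest effect worth detecting should be agreed before the test starts.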
115
What is the purpose of **randomisation** in A/B testing?
To randomly assign users to control or treatment groups ## Footnote This helps eliminate biases and confounding variables.
116
What should be maintained to improve the statistical power of an A/B test?
Equal group sizes ## Footnote For a fixed total number of users, splitting them evenly between groups generally gives the most statistical power.
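One simple way to get random assignment with equal group sizes is to shuffle the user list and split it in half. This is a sketch with made-up user IDs and an arbitrary seed, not a description of how any particular experimentation platform assigns users.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed only so the example is repeatable

user_ids = np.arange(1000)             # made-up user IDs 0..999
shuffled = rng.permutation(user_ids)   # random order, each user appears once

control = shuffled[:500]               # sees version A
treatment = shuffled[500:]             # sees version B

print(len(control), len(treatment))
```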
117
What is crucial during the **implementation phase** of A/B testing?
Setting up and launching the A/B test correctly ## Footnote Any errors can compromise the integrity of the test results.
118
What **key performance indicators (KPIs)** are commonly used in A/B testing?
* Conversion rates * Click-through rates * Engagement metrics * Retention rates ## Footnote These metrics measure the success of the test.
119
What does **statistical analysis** help determine in A/B testing?
Whether differences between groups are statistically significant ## Footnote It evaluates whether observed differences could plausibly be due to random chance.
120
What is a common threshold for statistical significance in A/B testing?
p-value less than 0.05 ## Footnote Indicates that the results are statistically significant.
121
What does a **confidence interval** provide in A/B testing?
A range of plausible values for the true effect of the treatment ## Footnote A 95% confidence level is typical.
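The last three cards can be tied together in one sketch: a two-proportion z-test giving a p-value, plus a 95% confidence interval for the difference in conversion rates. The visitor and conversion counts are made up, and the normal approximation used here is one common choice among several.

```python
import math
from scipy.stats import norm

# Made-up results: 10,000 visitors per version.
n_a, conv_a = 10_000, 1_000   # version A: 10.0% conversion
n_b, conv_b = 10_000, 1_100   # version B: 11.0% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# z-test: under H0 both versions share one pooled conversion rate.
pooled = (conv_a + conv_b) / (n_a + n_b)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z_stat = diff / se_pooled
p_value = 2 * norm.sf(abs(z_stat))    # two-sided p-value

# 95% confidence interval for the difference (unpooled standard error).
se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
margin = norm.ppf(0.975) * se_diff
ci_low, ci_high = diff - margin, diff + margin

print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI for the difference: ({ci_low:.4f}, {ci_high:.4f})")
```

An interval that excludes zero agrees with a p-value below 0.05, and its width shows how precisely the effect has been estimated.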
122
What should be assessed when interpreting A/B test results?
* Magnitude of the effect * Business impact * Cost and feasibility * Long-term implications ## Footnote This ensures informed decisions that drive business outcomes.
123
True or false: Focusing solely on **statistical significance** is sufficient for interpreting A/B test results.
FALSE ## Footnote Always assess practical implications and real-world impact.
124
What common pitfall involves overlooking external factors in A/B testing?
Ignoring external factors ## Footnote These can influence test results, such as seasonality or market trends.
125
What does **insufficient sample size** lead to in A/B testing?
Unreliable results ## Footnote Ensure the sample size is large enough to detect a meaningful effect.
126
What is the risk of **cherry-picking results** in A/B testing?
Misleading conclusions drawn from selectively reporting only favourable results ## Footnote Report all results transparently and conduct thorough analysis.
127
What should be evaluated to avoid a **short-term focus** in A/B testing?
Long-term impact of changes ## Footnote Conduct follow-up tests if necessary.
128
Fill in the blank: A simple business problem might be: The webpage ‘contact us’ form is too long, so the hypothesis is that reducing the form inputs from 20 to 10 will _______.
increase completed forms ## Footnote This can be evaluated by A/B testing the two forms.
129
What statistical method might be used to evaluate a hypothesis about tool maintenance in a factory?
Regression model ## Footnote This model predicts the best time for maintenance based on data collected.