Statistics Flashcards

Question 1

Q

What is the Central Limit Theorem and why is it important?

Answer

A

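The Central Limit Theorem states that if we repeatedly draw random samples of a sufficiently large size from a population, the distribution of the sample means (the sampling distribution) will be approximately normal. Its mean tends to the population mean, and its variance equals the population variance divided by the sample size. This holds regardless of the shape of the population distribution, provided the population variance is finite. It is important because it lets us use normal-based methods, such as confidence intervals and hypothesis tests for a mean, even when the population itself is not normal.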

https://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/
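
The theorem can be checked with a small simulation. The sketch below is only an illustration: the exponential population, the sample size of 50, and the 5000 repeats are assumptions chosen for the demo, not part of the card.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with rate 1 (mean 1, variance 1), clearly non-normal.
POP_MEAN = 1.0
POP_VAR = 1.0

def sample_mean(n):
    """Mean of one random sample of size n drawn from the assumed population."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50          # size of each sample
repeats = 5000  # number of samples drawn
means = [sample_mean(n) for _ in range(repeats)]

mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)

print(f"mean of sample means:     {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"variance of sample means: {var_of_means:.4f} (population variance / n = {POP_VAR / n:.4f})")
```

The mean of the sample means should land close to 1, and their variance close to 1/50, even though the underlying population is strongly skewed.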

Question 2

Q

What is sampling?

Answer

A

Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.

https://searchbusinessanalytics.techtarget.com/definition/data-sampling

Question 3

Q

Why is data sampling important?

Answer

A

It enables data scientists and other data professionals to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly while still producing accurate findings.

Question 4

Q

How is data sampling useful?

Answer

A

It is useful for data sets that are too large to analyze efficiently in full.
Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entire data set or population.

Example: in big data analytics applications or surveys.

Question 5

Q

What should be considered when data sampling and why?

Answer

A

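The size of the required data sample and the possibility of introducing a sampling error. A sample that is too small or not representative can lead to conclusions that do not hold for the full population.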

Question 6

Q

What are the different sampling methods?

Answer

A

Simple random sampling
Stratified sampling
Cluster sampling
Multistage sampling
Systematic sampling

Question 7

Q

What is simple random sampling?

Answer

A

Randomly selecting subjects from the whole population.
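
A minimal sketch of simple random sampling using Python's standard library. The population of 100 numeric IDs is hypothetical.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

population = list(range(1, 101))  # hypothetical population: subject IDs 1..100

# random.sample draws without replacement; every subject is equally likely to be chosen.
sample = random.sample(population, k=10)
print(sample)
```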

Question 8

Q

What is stratified sampling?

Answer

A

Subsets (strata) of the data set or population are created based on a common factor, and a sample is drawn from each stratum using a random sampling method. Remember to sample proportionally, so that the share of each stratum in the sample matches its share of the population.
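
Proportional stratified sampling might be sketched as follows. The records, the stratum labels "A", "B", "C", and the helper name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical population of (id, stratum) records: 60 in "A", 30 in "B", 10 in "C".
population = (
    [(i, "A") for i in range(60)]
    + [(i, "B") for i in range(60, 90)]
    + [(i, "C") for i in range(90, 100)]
)

def stratified_sample(records, fraction):
    """Randomly draw the same fraction of records from every stratum."""
    strata = defaultdict(list)
    for rec in records:
        strata[rec[1]].append(rec)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(population, 0.2)
counts = {s: sum(1 for _, st in sample if st == s) for s in "ABC"}
print(counts)  # {'A': 12, 'B': 6, 'C': 2}
```

The 60/30/10 split of the population is preserved in the sample as 12/6/2.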

Question 9

Q

What is cluster sampling?

Answer

A

A larger data set is divided into subsets, or clusters, based on a defined factor, and then a random sample of clusters is analyzed. The sampling unit is the whole cluster: instead of sampling individuals from each group, a researcher studies whole clusters.
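
A minimal sketch of cluster sampling, assuming a hypothetical population of 8 schools with 5 students each. Whole schools are selected at random, and every student in a selected school is kept.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical clusters: 8 schools, each with 5 students.
clusters = {f"school_{c}": [f"student_{c}_{i}" for i in range(5)] for c in range(8)}

# The sampling unit is the cluster, so we sample cluster names, not individuals.
chosen = random.sample(sorted(clusters), k=3)

# Every member of each chosen cluster enters the sample.
sample = [member for name in chosen for member in clusters[name]]
print(chosen, len(sample))
```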

Question 10

Q

What is multistage sampling?

Answer

A

A more complicated form of cluster sampling.

The larger population is first divided into a number of clusters.
Second-stage clusters are then broken out based on a secondary factor, and those clusters are sampled and analyzed.
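
A two-stage version might look like the sketch below. The hierarchy (regions, then schools, then students) and the numbers picked at each stage are assumptions made for illustration.

```python
import random

random.seed(4)  # fixed seed so the run is repeatable

# Hypothetical hierarchy: 4 regions, each with 6 schools, each with 10 students.
regions = {
    f"region_{r}": {
        f"school_{r}_{s}": [f"student_{r}_{s}_{i}" for i in range(10)]
        for s in range(6)
    }
    for r in range(4)
}

# First stage: randomly pick regions (the first-level clusters).
stage1 = random.sample(sorted(regions), k=2)

sample = []
for region in stage1:
    # Second stage: within each chosen region, randomly pick schools.
    stage2 = random.sample(sorted(regions[region]), k=3)
    for school in stage2:
        sample.extend(regions[region][school])  # analyze the chosen second-stage clusters

print(len(sample))  # 2 regions * 3 schools * 10 students = 60
```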

Question 11

Q

What is systematic sampling?

Answer

A

Setting an interval at which to extract data from the larger population.

Example: every 10th row in a data set.
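
The every-10th-row example can be sketched with list slicing. The random starting offset is a common refinement and is an assumption here, not something the card specifies.

```python
import random

random.seed(5)  # fixed seed so the run is repeatable

rows = list(range(100))  # hypothetical data set of 100 rows
interval = 10

# Assumed refinement: start at a random offset inside the first interval.
start = random.randrange(interval)

sample = rows[start::interval]  # every 10th row from the starting offset
print(sample)
```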

Question 12

Q

What are the non-probability sampling methods?

Answer

A

Convenience sampling
Consecutive sampling
Purposive/judgmental sampling
Quota sampling

Question 13

Q

What is the difference between a type I and a type II error?

Answer

A

Type I: the null hypothesis is true but is rejected (a false positive)

Type II: the null hypothesis is false but erroneously fails to be rejected (a false negative)
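
Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test of "mean = 0" with known standard deviation 1, a significance level of 0.05, samples of size 30, and an alternative mean of 0.3; all of these are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

random.seed(6)  # fixed seed so the run is repeatable

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96

def rejects_null(true_mean, n=30, sigma=1.0):
    """Two-sided z-test of H0: mean = 0 on one simulated sample."""
    xbar = fmean(random.gauss(true_mean, sigma) for _ in range(n))
    z = xbar / (sigma / n ** 0.5)
    return abs(z) > Z_CRIT

trials = 4000
# Type I: H0 is true (mean really is 0) but the test rejects it.
type1_rate = sum(rejects_null(0.0) for _ in range(trials)) / trials
# Type II: H0 is false (mean is 0.3) but the test fails to reject it.
type2_rate = sum(not rejects_null(0.3) for _ in range(trials)) / trials

print(f"type I rate:  {type1_rate:.3f}")
print(f"type II rate: {type2_rate:.3f}")
```

The type I rate should sit near the chosen alpha, while the type II rate depends on the sample size and on how far the true mean is from the null value.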

Question 14

Q

What is linear regression?

Answer

A

A method that models the relationship between a single dependent variable (Y) and one or more predictors (X) as a linear function.
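
For a single predictor, the least-squares line can be computed directly. The five data points below are hypothetical, and ordinary least squares is assumed as the fitting method.

```python
from statistics import fmean

# Hypothetical data, roughly y = 2x + 1 with some noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

x_bar, y_bar = fmean(x), fmean(y)

# Ordinary least squares for one predictor:
# slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

print(f"y-hat = {intercept:.2f} + {slope:.2f} * x")  # y-hat = 1.09 + 1.97 * x
```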

Question 15

Q

What are the assumptions required for linear regression?

Answer

A

Linearity: The relationship between X and the mean of Y is linear.
Independence: Observations are independent of each other. (For multiple regression, minimal collinearity between explanatory variables is a separate, additional requirement.)
Normality: The errors, or residuals (actual y minus predicted y-hat), are normally distributed.
Homoscedasticity: The variance of the residuals is the same for any value of X.

Question 16

Q

Define p-value

Answer

A

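The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. Equivalently, it is the smallest significance level (alpha) at which the null hypothesis would be rejected.
In regression, a low p-value for a coefficient is evidence that the coefficient differs from zero; it does not by itself measure how important the variable is for predicting the response/dependent variable (Y).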
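
As a rough illustration of how a p-value is computed, the sketch below converts a z statistic into a two-sided p-value under a standard normal null distribution. The helper name `two_sided_p_value` and the choice of a z-test are assumptions; regression software typically uses a t distribution instead.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z, in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(1.96)
print(f"{p:.4f}")  # close to 0.05, the usual significance threshold
```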

Question 17

Q

Define coefficient

Answer

A

The coefficient value signifies how much the mean of the dependent variable changes given a 1-unit shift in the independent variable while holding other variables in the model constant.

Question 18

Q

Define R-squared

Answer

A

Statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables
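
R-squared can be computed from actual and predicted values as one minus the ratio of residual to total sum of squares. The data and predictions below are hypothetical.

```python
from statistics import fmean

def r_squared(y_actual, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = fmean(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_actual)
    return 1 - ss_res / ss_tot

# Hypothetical observed values and model predictions.
y = [3.1, 4.9, 7.2, 8.8, 11.0]
y_hat = [3.06, 5.03, 7.00, 8.97, 10.94]

r2 = r_squared(y, y_hat)
print(f"{r2:.4f}")
```

A value near 1 means the predictions account for almost all of the variance in Y; predicting the mean of Y for every point gives 0.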

Question 19

Q

What is statistical interaction?

Answer

A

The effect of one independent variable may depend on the level of the other independent variable.

The effect of one factor (input/independent variable) on the dependent variable (output variable) differs among levels of another factor.
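
A small numeric sketch of interaction, assuming a model with a product term, y = b0 + b1*x1 + b2*x2 + b3*x1*x2. The coefficient values are made up for illustration.

```python
def predicted_y(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Hypothetical model with an interaction term b3 * x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Effect of raising x1 from 0 to 1, measured at two different levels of x2.
effect_when_x2_is_0 = predicted_y(1, 0) - predicted_y(0, 0)
effect_when_x2_is_1 = predicted_y(1, 1) - predicted_y(0, 1)

print(effect_when_x2_is_0)  # 2.0
print(effect_when_x2_is_1)  # 3.5
```

Because b3 is not zero, the effect of x1 changes with the level of x2; with b3 = 0 both effects would equal b1.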

Question 20

Q

What is selection bias?

Answer

A

Also called sampling bias: the data gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see.

Active selection bias occurs when a subset of the data is systematically (non-randomly) excluded from analysis.

Question 21

Q

What is an example of a data set with a non-Gaussian distribution?

Answer

A

Weibull distribution, found with life data such as survival times of a product
Log-normal distribution, found with length data such as heights
Largest-extreme-value distribution, found with data such as the longest down-time each day
Exponential distribution, found with waiting-time data such as the time between successive events
Poisson distribution, found with rare events such as number of accidents
Binomial distribution, found with “proportion” data such as percent defectives or the possible numbers of successes on n trials for independent events that each have a probability of p occurring.

Question 22

Q

What are some causes when data is not normally distributed?

Answer

A

Extreme Values/Outliers - It is important that outliers are identified as truly special causes before they are eliminated. Extreme values should only be explained and removed from the data if there are more of them than expected under normal conditions.
Overlap of Two or More Processes - If two or more data sets that would be normally distributed on their own are overlapped, data may look bimodal or multimodal – it will have two or more most-frequent values. The remedial action for these situations is to determine which X’s cause bimodal or multimodal distribution and then stratify the data.
Insufficient Data Discrimination - Round-off errors or measurement devices with poor resolution can make truly continuous and normally distributed data look discrete and not normal. Insufficient data discrimination – and therefore an insufficient number of different values – can be overcome by using more accurate measurement systems or by collecting more data.
Sorted Data - Collected data might not be normally distributed if it represents simply a subset of the total output a process produced. This can happen if data is collected and analyzed after sorting.
Values Close to Zero or a Natural Limit - If a process has many values close to zero or a natural limit, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make data normal. In this method, all data is raised, or transformed, to a certain exponent, indicated by a Lambda value. When comparing transformed data, everything under comparison must be transformed in the same way.
Data Follows a Different Distribution
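
The Box-Cox transformation mentioned above can be sketched in a few lines. This uses the standard one-parameter form, (x^lambda - 1) / lambda with the natural log at lambda = 0, which is an assumption about the exact variant intended; the function name `box_cox` is hypothetical.

```python
import math

def box_cox(values, lam):
    """Standard one-parameter Box-Cox transform; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]       # limiting case as lambda -> 0
    return [(v ** lam - 1) / lam for v in values]  # power transform

# The same lambda must be applied to everything being compared.
print(box_cox([1, 4, 9], 0.5))  # [0.0, 2.0, 4.0]
```

In practice lambda is usually chosen by maximum likelihood rather than by hand; statistical libraries provide routines for that step.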

(22 cards)