What is the Central Limit Theorem and why is it important?
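States that if we repeatedly draw random samples of a sufficiently large size from a population, the distribution of the sample means (the sampling distribution) will be approximately normal. The mean of that distribution tends to the population mean, and its variance equals the population variance divided by the sample size. This holds regardless of the shape of the population distribution, provided the population has a finite variance. It is important because it justifies normal-based inference, such as confidence intervals and hypothesis tests for a mean, even when the underlying data are not normal.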
https://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/
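A quick simulation can make the theorem concrete. The sketch below is only an illustration: the exponential population (mean 1, variance 1), the sample size of 50, and the names sample_means, clt_center and clt_spread are all assumptions chosen for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_means(n, trials=5000):
    """Draw `trials` samples of size n from a skewed (exponential)
    population with mean 1 and variance 1; return each sample's mean."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]

means = sample_means(n=50)
clt_center = statistics.fmean(means)     # should sit near the population mean, 1.0
clt_spread = statistics.variance(means)  # should sit near 1 / 50 = 0.02
print(clt_center, clt_spread)
```

Even though the population is strongly skewed, a histogram of means would look roughly bell-shaped, centered near 1 with variance near 0.02.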
What is sampling?
Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.
https://searchbusinessanalytics.techtarget.com/definition/data-sampling
Why is data sampling important?
It enables data scientists and other data professionals to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly while still producing accurate findings.
How is data sampling useful?
For data sets that are too large to efficiently analyze in full.
Identifying and analyzing a representative sample is more efficient and cost effective than surveying the entirety of the data or population.
Example: in big data analytics applications or surveys.
What should be considered when data sampling and why?
The size of the required data sample and the possibility of introducing a sampling error.
What are the different sampling methods?
What is simple random sampling?
Randomly selecting subjects from the whole population.
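As a minimal sketch, simple random sampling can be done with the standard library. The population of 100 subject IDs and the sample size of 10 are made up for illustration.

```python
import random

rng = random.Random(42)             # fixed seed for a repeatable draw
population = list(range(1, 101))    # hypothetical subject IDs 1..100
srs = rng.sample(population, k=10)  # 10 distinct subjects, each equally likely
print(srs)
```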
What is stratified sampling?
Subsets (strata) of the data set or population are created based on a common factor, and a random sample is drawn from each stratum. Remember to sample proportionally, so that each stratum's share of the sample matches its share of the population.
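One way to sketch proportional stratified sampling is shown here. The helper stratified_sample, the urban/rural strata and the 10% sampling fraction are hypothetical choices for the example, not part of any library.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng):
    """Group records into strata by key(record), then draw the same
    fraction from every stratum (at least one record per stratum)."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(1)
# Hypothetical population: 80 urban and 20 rural records.
people = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
picked = stratified_sample(people, key=lambda r: r[0], fraction=0.1, rng=rng)
print(picked)
```

With a 10% fraction the sample holds 8 urban and 2 rural records, which preserves the 80/20 split of the population.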
What is cluster sampling?
A larger data set is divided into subsets, or clusters, based on a defined factor, and then a random sample of clusters is analyzed. The sampling unit is the whole cluster: instead of sampling individuals from each group, a researcher studies whole clusters.
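A small sketch of cluster sampling, assuming a made-up population of 10 schools (the clusters) with 5 students each. Whole schools are drawn at random and every student in a drawn school is kept.

```python
import random

rng = random.Random(7)
# Hypothetical data: 10 schools, each a cluster of 5 students.
schools = {
    f"school_{s}": [f"s{s}_student_{i}" for i in range(5)]
    for s in range(10)
}
chosen_schools = rng.sample(sorted(schools), k=3)  # the cluster is the sampling unit
cluster_sample = [
    student for name in chosen_schools for student in schools[name]
]
print(chosen_schools, len(cluster_sample))
```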
What is multistage sampling?
A more complicated form of cluster sampling.
The larger population is first divided into a number of clusters.
Second-stage clusters are then broken out based on a secondary factor, and those clusters are sampled and analyzed.
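The two stages can be sketched as follows. The districts, schools and student counts are invented for illustration: districts are the first-stage clusters and schools within a district are the second-stage clusters.

```python
import random

rng = random.Random(3)
# Hypothetical data: 6 districts, each with 4 schools of 10 students.
districts = {
    f"district_{d}": {
        f"d{d}_school_{s}": [f"d{d}_s{s}_student_{i}" for i in range(10)]
        for s in range(4)
    }
    for d in range(6)
}

# Stage 1: randomly choose districts.
stage1 = rng.sample(sorted(districts), k=2)
# Stage 2: within each chosen district, randomly choose schools.
stage2 = {d: rng.sample(sorted(districts[d]), k=2) for d in stage1}

multistage_sample = [
    student
    for d, picked_schools in stage2.items()
    for school in picked_schools
    for student in districts[d][school]
]
print(len(multistage_sample))
```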
What is systematic sampling?
Setting an interval at which to extract data from the larger population.
Example: every 10th row in a dataset.
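The every-10th-row example can be sketched with list slicing. The 100-row dataset is hypothetical, and the start is fixed at row 10 for simplicity; in practice the starting row is often chosen at random within the first interval.

```python
rows = list(range(1, 101))  # hypothetical dataset: row numbers 1..100
interval = 10
systematic = rows[interval - 1::interval]  # rows 10, 20, ..., 100
print(systematic)
```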
What are the non-probability sampling methods?
What is the difference between type I vs type II error?
Type I (false positive): the null hypothesis is true but is rejected
Type II (false negative): the null hypothesis is false but is not rejected
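Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test of H0: mean = 0 with known standard deviation 1, alpha = 0.05, samples of size 30, and a made-up true mean of 0.3 for the case where the null is false.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(0)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

def rejects_null(true_mean, n=30):
    """Two-sided z-test of H0: mean = 0, standard deviation known to be 1."""
    sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5
    return abs(z) > z_crit

trials = 4000
# H0 is true (mean really is 0): every rejection is a Type I error.
type1_rate = sum(rejects_null(0.0) for _ in range(trials)) / trials
# H0 is false (mean is 0.3): every failure to reject is a Type II error.
type2_rate = sum(not rejects_null(0.3) for _ in range(trials)) / trials
print(type1_rate, type2_rate)
```

The Type I rate should land near alpha, while the Type II rate depends on the effect size and the sample size.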
What is linear regression?
Models the relationship between a single dependent variable (Y) and one or more predictors (X) as a linear function of the predictors.
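For a single predictor, the least-squares line has a closed form. The following is a minimal sketch with a hypothetical helper fit_line and made-up data that lie exactly on y = 2x + 1.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # generated from y = 2x + 1, no noise
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```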
What are the assumptions required for linear regression?
Define p-value
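The smallest significance level (alpha) at which the null hypothesis that the coefficient is zero would be rejected; equivalently, the probability of seeing an estimate at least this extreme if the true coefficient were zero.
The lower the p-value, the stronger the evidence that the variable is associated with the response/dependent variable (Y). A low p-value does not by itself measure how important the variable is or how large its effect is.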
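As a rough sketch, a two-sided p-value can be computed from a test statistic using the standard normal distribution. This assumes a large-sample z approximation; regression software typically uses a t distribution instead. The helper name two_sided_p is hypothetical.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Probability of a standard normal value at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_value = two_sided_p(1.96)  # a statistic of 1.96 sits right at the 5% threshold
print(round(p_value, 3))     # 0.05
```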
Define coefficient
The coefficient value signifies how much the mean of the dependent variable changes given a 1-unit shift in the independent variable while holding other variables in the model constant.
Define R-squared
Statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables
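R-squared can be computed as 1 minus the ratio of residual variation to total variation. A small sketch with made-up observations and the predictions from a least-squares line fitted to them (x = 1..5); the helper r_squared is hypothetical.

```python
from statistics import fmean

def r_squared(ys, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

observed = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]  # fitted line y = 2.2 + 0.6x at x = 1..5
r2 = r_squared(observed, predicted)
print(round(r2, 3))  # 0.6
```

Here the model explains about 60% of the variance in the observed values.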
What is statistical interaction?
The effect of one independent variable may depend on the level of the other independent variable.
The effect of one factor (input/independent variable) on the dependent variable (output variable) differs among levels of another factor.
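In a regression setting, an interaction is often modeled with a product term. The sketch below uses a hypothetical model y = b0 + b1*x1 + b2*x2 + b3*x1*x2 with made-up coefficients to show that the effect of x1 changes with the level of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Hypothetical model with an interaction term b3 * x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Effect of raising x1 by one unit, at two different levels of x2.
effect_at_x2_0 = predict(1, 0) - predict(0, 0)  # b1 = 2.0
effect_at_x2_1 = predict(1, 1) - predict(0, 1)  # b1 + b3 = 3.5
print(effect_at_x2_0, effect_at_x2_1)
```

If b3 were zero, both effects would equal b1 and there would be no interaction.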
What is selection bias?
Also called sampling bias: the data gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see.
Active selection bias occurs when a subset of the data is systematically (non-randomly) excluded from analysis.
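A toy illustration of how systematic exclusion distorts an estimate. The values 1..100 and the rule that values of 50 or less are never recorded are invented for the example.

```python
from statistics import fmean

values = list(range(1, 101))                   # hypothetical population values 1..100
full_mean = fmean(values)                      # 50.5
observed_only = [v for v in values if v > 50]  # values of 50 or less are never recorded
biased_mean = fmean(observed_only)             # 75.5
print(full_mean, biased_mean)
```

Because the exclusion is not random, collecting more of the same filtered data would not correct the estimate.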
What is an example of a data set with a non-Gaussian distribution?
What are some causes when data is not normally distributed?