LOS 10a: Define simple random sampling and a sampling distribution
LOS 10b: Explain sampling error
In a simple random sample, each member of the population has the same probability or likelihood of being included in the sample. In practice, random samples are generated using random number tables or computer random-number generators. Systematic sampling is often used to generate approximately random samples. In systematic sampling, every kth member in the population list is selected until the desired sample size is reached.
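The two selection schemes can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names, the `seed` parameter, and the random starting point for the systematic sample are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement; every member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Select every k-th member of the list until n members are chosen."""
    k = len(population) // n              # sampling interval
    if k < 1:
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    start = rng.randrange(k)              # assumed: random start in first interval
    return [population[start + i * k] for i in range(n)]

# Hypothetical population: members numbered 1 to 100
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=42)
sys_sample = systematic_sample(population, 10, seed=42)
```

With 100 members and a sample size of 10, the systematic sample picks every 10th member of the list.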
Sampling Error
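Sampling error is the error caused by observing a sample instead of the entire population when drawing conclusions about population parameters. It is the difference between a sample statistic and the corresponding population parameter.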
Sampling Distribution
This is the probability distribution of a given sample statistic under repeated sampling of the population. If we sample the population repeatedly, each sample will generally give a different mean. The distribution of these sample means is called the sampling distribution of the mean.
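Repeated sampling is easy to simulate. The sketch below builds a hypothetical population (normal with mean 100 and standard deviation 15, both chosen only for illustration), draws 1,000 samples of 50, and records each sample mean and its sampling error.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population, generated once and then treated as fixed
population = [rng.gauss(100, 15) for _ in range(10_000)]
mu = statistics.fmean(population)         # population parameter

def sample_mean(n):
    """Mean of one simple random sample of size n."""
    return statistics.fmean(rng.sample(population, n))

# The collection of sample means approximates the sampling distribution
sample_means = [sample_mean(50) for _ in range(1_000)]

# Sampling error of each sample: sample statistic minus population parameter
sampling_errors = [m - mu for m in sample_means]

print(round(mu, 2), round(statistics.fmean(sample_means), 2))
```

Each sample mean differs from `mu`, but the errors average out close to zero across samples.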
LOS 10c: Distinguish between simple random and stratified random sampling
Stratification is the process of grouping members of the population into relatively homogeneous subgroups, or strata, before drawing samples. The strata should be mutually exclusive and collectively exhaustive. Once this is accomplished, random sampling is applied within each stratum and the number of observations drawn from each stratum is based on the size of the stratum relative to the population. This often improves the representativeness of the sample by reducing the sampling error.
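A minimal sketch of proportional allocation follows. The strata names and sizes are hypothetical, and the rounding rule is an assumption: with awkward proportions the stratum counts may not sum exactly to the requested sample size.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Sample each stratum in proportion to its share of the population.

    strata: dict mapping stratum name -> list of members
            (assumed mutually exclusive and collectively exhaustive)
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(n * len(members) / total)   # proportional allocation
        sample[name] = rng.sample(members, k) # simple random sample within stratum
    return sample

# Hypothetical bond universe split into three rating buckets
strata = {
    "investment_grade_high": list(range(500)),
    "investment_grade_low": list(range(500, 800)),
    "high_yield": list(range(800, 1000)),
}
result = stratified_sample(strata, 50, seed=1)
sizes = {name: len(members) for name, members in result.items()}
```

Here the strata hold 50%, 30%, and 20% of the population, so a sample of 50 takes 25, 15, and 10 members respectively.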
LOS 10d: Distinguish between time-series and cross-sectional data
Time-series data consists of observations measured over a period of time, spaced at uniform intervals.
Cross-sectional data refers to data collected by observing many subjects at the same point in time.
A data set can contain both time-series and cross-sectional data. Examples:
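As a minimal illustration, the structure below holds annual returns for three companies over three years. The company names and numbers are made up for the example and carry no meaning. Slicing by company gives a time series; slicing by year gives a cross-section.

```python
# Hypothetical annual returns (%) for three made-up companies
panel = {
    "Company A": {2019: 5.1, 2020: -2.3, 2021: 8.4},
    "Company B": {2019: 3.0, 2020: 1.2, 2021: 6.7},
    "Company C": {2019: -1.5, 2020: 4.8, 2021: 2.2},
}

# Time series: one subject observed over several periods
time_series = panel["Company A"]

# Cross-section: many subjects observed at the same point in time
cross_section = {name: returns[2020] for name, returns in panel.items()}
```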
LOS 10e: Explain the central limit theorem and its importance
The central limit theorem allows us to make accurate statements about the population mean and variance using the sample mean and variance, regardless of the distribution of the population, as long as the sample size is adequate (generally taken to be 30 or more).
The important properties of the central limit theorem are:
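The results usually attached to the theorem (sample means that are approximately normal, centered on the population mean, with standard deviation close to the population standard deviation divided by the square root of n) can be checked by simulation. The sketch below assumes a strongly skewed population, an exponential distribution with mean 1 and standard deviation 1, chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)
n, trials = 40, 5_000

# Each trial: mean of n draws from a right-skewed exponential population
means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

mean_of_means = statistics.fmean(means)   # should be close to the population mean, 1
sd_of_means = statistics.stdev(means)     # should be close to 1 / sqrt(n)
theoretical_se = 1.0 / math.sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean;
# roughly 0.95 if the sampling distribution is close to normal
share = sum(abs(m - 1.0) <= 1.96 * theoretical_se for m in means) / trials

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3), round(share, 3))
```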
LOS 10f: Calculate and interpret the standard error of the sample mean
The standard deviation of the distribution of sample means is known as the standard error of the sample mean.
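When the population variance, σ², is known, the standard error of the sample mean is calculated as:
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$, where σ is the population standard deviation and n is the sample size.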
Practically speaking, population variances are almost never known, so we estimate the standard error of the sample mean using the sample's standard deviation, s:
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$
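Both versions of the calculation are one-liners. The function names and the example inputs below are assumptions for illustration; `statistics.stdev` computes the sample standard deviation with an n - 1 denominator.

```python
import math
import statistics

def standard_error_known(sigma, n):
    """Standard error of the sample mean when the population std dev is known."""
    return sigma / math.sqrt(n)

def standard_error_estimated(sample):
    """Standard error of the sample mean estimated from the sample itself."""
    s = statistics.stdev(sample)          # sample standard deviation (n - 1)
    return s / math.sqrt(len(sample))

# Hypothetical inputs: sigma = 20 and n = 100 give a standard error of 2.0
se_known = standard_error_known(20, 100)

# Hypothetical sample of eight observations
se_estimated = standard_error_estimated([2, 4, 4, 4, 5, 5, 7, 9])
```

Quadrupling the sample size halves the standard error, because n enters through its square root.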
LOS 10h: Distinguish between a point estimate and a confidence interval estimate of a population parameter.
A point estimate involves the use of sample data to calculate a single value that serves as an approximation for an unknown population parameter. For example, the sample mean is a point estimate of the population mean. The formula used to calculate a point estimate is known as an estimator; for the sample mean it is:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
A confidence interval uses sample data to calculate a range of possible values that an unknown population parameter can take, with a given probability of (1 − α), where α is called the level of significance and (1 − α) is the degree of confidence.
A confidence interval has the following structure:
point estimate ± (reliability factor × standard error)
LOS 10g: Identify and describe desirable properties of an estimator
Unbiasedness - an unbiased estimator is one whose expected value is equal to the parameter being estimated. The expected value of the sample mean equals the population mean, so the sample mean is an unbiased estimator of the population mean.
Efficiency - an efficient unbiased estimator is one that has the lowest variance among all unbiased estimators of the same parameter.
Consistency - a consistent estimator is one for which the probability of obtaining estimates close to the value of the population parameter increases as the sample size increases.
LOS 10i: Describe the properties of Student’s t-distribution and calculate and interpret its degrees of freedom
Student’s t-distribution is a bell-shaped probability distribution with the following properties:
The t-distribution is used in the following scenarios:
LOS 10j: Calculate and interpret a confidence interval for a population mean, given a normal distribution with 1) a known population variance, 2) an unknown population variance, or 3) an unknown variance and a large sample size
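The confidence interval for the population mean when the population follows a normal distribution and its variance is known is calculated as:
$\bar{x} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$, where $\bar{x}$ is the sample mean, $z_{\alpha/2}$ is the standard normal reliability factor, σ is the population standard deviation, and n is the sample size.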
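A minimal sketch of the known-variance case, using `statistics.NormalDist` from the Python standard library to obtain the reliability factor. The function name and the example inputs (sample mean 80, σ = 15, n = 36) are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean of a normal population, sigma known."""
    alpha = 1 - confidence                        # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)       # reliability factor, about 1.96 at 95%
    half_width = z * sigma / math.sqrt(n)         # reliability factor x standard error
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical example: standard error is 15 / 6 = 2.5
lower, upper = z_confidence_interval(80, 15, 36)
print(round(lower, 2), round(upper, 2))
```

With these inputs the interval is roughly 80 ± 4.9.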
When the variance of a normally distributed population is not known, we use the t-distribution to construct confidence intervals.
When the population is normally distributed, we:
When the distribution of the population is nonnormal, the construction of an appropriate confidence interval depends on the sample size.
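A sketch of the unknown-variance case. The Python standard library has no t-distribution, so this example assumes the caller looks up the critical value in a t-table for n - 1 degrees of freedom and passes it in. The function name and the sample values are hypothetical.

```python
import math
import statistics

def t_confidence_interval(sample, t_critical):
    """Confidence interval for the mean when the population variance is unknown.

    t_critical: two-tailed critical value for len(sample) - 1 degrees of
                freedom, taken from a t-table by the caller (assumption).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
    return mean - t_critical * se, mean + t_critical * se

# Hypothetical sample of 8 observations: 7 degrees of freedom,
# 95% two-tailed critical value from a t-table is 2.365
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t_lower, t_upper = t_confidence_interval(sample, 2.365)
print(round(t_lower, 3), round(t_upper, 3))
```

The same sample with the z-value of 1.96 would give a narrower interval, which is the sense in which the t-distribution is the more conservative choice for small samples.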
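The choice of reliability factor across these cases can be written as a small lookup. The sketch follows the rule as it is usually presented for this topic; the function name and the cutoff of 30 for a "large" sample are assumptions.

```python
def reliability_factor(normal_population, variance_known, n, large_sample=30):
    """Return 'z', 't', or None when no reliable interval is available."""
    large = n >= large_sample
    if not normal_population and not large:
        return None        # nonnormal population, small sample: not available
    if variance_known:
        return "z"         # known variance: z-statistic
    return "t"             # unknown variance: t-statistic
                           # (z is an acceptable approximation when n is large)

cases = {
    "normal, known variance, n=10": reliability_factor(True, True, 10),
    "normal, unknown variance, n=10": reliability_factor(True, False, 10),
    "nonnormal, known variance, n=10": reliability_factor(False, True, 10),
    "nonnormal, known variance, n=50": reliability_factor(False, True, 50),
    "nonnormal, unknown variance, n=50": reliability_factor(False, False, 50),
}
```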
LOS 10k: Describe the issues regarding selection of the appropriate sample size, data-mining bias, sample selection bias, survivorship bias, look-ahead bias, and time-period bias
As sample size increases, the standard error decreases, which makes larger samples desirable. In practice, however, two considerations may work against increasing the sample size:
Types of Biases
Data mining is the practice of developing a model by extensively searching through a data set for statistically significant relationships until a pattern that works is discovered. Because so many hypotheses are tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all.
This most commonly occurs when:
Warning signs that data mining bias might exist are:
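The data-mining effect described above can be reproduced with pure noise. In this sketch, 200 hypothetical trading rules each produce 60 returns drawn from a distribution whose true mean is zero, and each rule is tested for a nonzero mean. The counts, the seed, and the cutoff of 2.0 (approximately the 5% two-tailed critical value for 59 degrees of freedom) are assumptions for the example.

```python
import math
import random
import statistics

rng = random.Random(7)
n_obs, n_rules = 60, 200

significant = 0
for _ in range(n_rules):
    # Returns of a rule with no real edge: pure noise around zero
    returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    se = statistics.stdev(returns) / math.sqrt(n_obs)
    t_stat = statistics.fmean(returns) / se
    if abs(t_stat) > 2.0:                 # looks significant at about the 5% level
        significant += 1

print(significant, "of", n_rules, "rules look significant by chance alone")
```

About 5% of the rules should pass the test even though none of them has any real predictive power.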
Sample selection bias results from the exclusion of certain assets from a study due to the unavailability of data.
Some databases may suffer from survivorship bias.
Sample selection bias is most severe in studies of hedge fund returns. Because hedge funds are not required to publish their returns, they tend to report only the best ones.
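A small simulation shows how dropping failed funds inflates measured performance. Everything here is hypothetical: 500 funds with five annual returns each drawn from the same distribution, and an assumed rule that a fund disappears from the database if any single year loses more than 20%.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical funds: identical return distribution, so no fund has real skill
funds = [[rng.gauss(0.05, 0.15) for _ in range(5)] for _ in range(500)]

def survived(history):
    """Assumed closure rule: a fund is dropped after any year worse than -20%."""
    return min(history) > -0.20

fund_means = [statistics.fmean(h) for h in funds]
survivor_means = [statistics.fmean(h) for h in funds if survived(h)]

all_mean = statistics.fmean(fund_means)            # true average performance
survivor_mean = statistics.fmean(survivor_means)   # what a survivors-only database shows

print(round(all_mean, 4), round(survivor_mean, 4), len(survivor_means))
```

The survivors-only average overstates the performance of the full set of funds, even though every fund was drawn from the same distribution.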
Look-ahead bias arises when a study uses information that was not available on the test date, for example estimates of revenues.
Time-period bias arises if a test is based on a certain time period, which may make the results obtained from the study time-period specific.