What are descriptive statistics used for in a data analytics role?
To summarize data in terms of mean, median, interquartile range, standard deviation, or skewness
Descriptive statistics help in understanding how numerical and categorical data are distributed.
Why is it important to understand descriptive statistics?
Because machine learning algorithms are based on statistical models
Understanding your data profile is critical for selecting the appropriate algorithm.
What is the focus of descriptive statistics?
Describing useful information about a given dataset
It does not require presenting all of the data.
Name some Python packages used to calculate descriptive statistics.
These packages provide functions to compute various descriptive statistics.
What does the mean represent in descriptive statistics?
The average value of a dataset
It is calculated by summing all values and dividing by the number of values.
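Example (a minimal sketch; the sample values and the helper name arithmetic_mean are made up for illustration): the calculation described above, checked against Python's standard-library statistics.mean.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum all values and divide by the number of values."""
    return sum(values) / len(values)

scores = [2, 4, 4, 5, 10]           # hypothetical sample data
manual = arithmetic_mean(scores)    # (2 + 4 + 4 + 5 + 10) / 5
library = mean(scores)              # same calculation via the standard library
```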
What does the median indicate in a dataset?
The middle value when the data is ordered
It is less affected by outliers compared to the mean.
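Example (a minimal sketch with invented income figures): adding one extreme value moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]       # hypothetical values, in thousands
with_outlier = incomes + [1000]      # add a single extreme value

# How far each measure moves once the outlier is included
mean_shift = mean(with_outlier) - mean(incomes)
median_shift = median(with_outlier) - median(incomes)
```

With an even number of values, statistics.median returns the average of the two middle values.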
What is the interquartile range (IQR)?
The difference between the third quartile (Q3) and the first quartile (Q1)
It measures the spread of the middle 50% of the data.
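Example (a minimal sketch on made-up data): computing the IQR with statistics.quantiles from the standard library. Quartile conventions differ between tools, so other packages may return slightly different Q1 and Q3 for the same data; this sketch uses the function's default method.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]   # hypothetical sample data

# n=4 splits the data into four groups and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1   # spread of the middle 50% of the data
```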
What does standard deviation measure?
The amount of variation or dispersion in a dataset
A low standard deviation indicates that data points tend to be close to the mean.
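Example (a minimal sketch; both datasets are invented): two datasets with the same mean but different spread. It uses statistics.pstdev, the population standard deviation; statistics.stdev would give the sample version instead.

```python
from statistics import mean, pstdev

tight = [9, 10, 11]    # values close to the mean
wide = [0, 10, 20]     # values far from the mean

same_mean = mean(tight) == mean(wide)   # both means are 10

tight_sd = pstdev(tight)   # small: points cluster around the mean
wide_sd = pstdev(wide)     # large: points are widely dispersed
```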
What does skewness indicate about a dataset?
The asymmetry of the distribution of values
Positive skew indicates a longer tail on the right, while negative skew indicates a longer tail on the left.
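Example (a minimal sketch): the standard library has no skewness function, so this assumes the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). The function name and the data are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (an assumed, common definition)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 20]          # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]       # mirror image: long tail on the left

right_skew = skewness(right_tailed)   # positive
left_skew = skewness(left_tailed)     # negative
```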
True or false: Plotting data can help in understanding its summary better.
TRUE
Visualizations can reveal patterns and insights that may not be apparent from summary statistics alone.
Name the popular descriptive statistics for representing a ‘typical’ value for a dataset.
Mean, median, and mode
These measures help summarize the central tendency of the data.
Why do we study descriptive statistics?
Because data analytics is based on statistical models
Understanding your data profile is critical for selecting the test that best fits your data.
What does the term distribution mean in data analytics or statistics?
The way the values of a variable are spread across a range, formally described by a probability distribution
It shows which values occur and how often each occurs.
What is a variable?
Any characteristic, behaviour, category, or number that can be measured or counted
Variables are fundamental in data analysis and statistics.
What are the two main types of numerical variables?
Discrete and continuous
Numerical variables can be either whole numbers or values within a range.
Define a continuous variable.
A variable that may contain any value in a range
Examples include spending amounts or time measured in seconds.
Define a discrete variable.
A variable that has only particular valid values
Examples include shoe sizes or the number of times you went to the market.
What are categorical variables?
Variables whose values are selected from a fixed group of labels
Examples include coin toss outcomes or marital status.
What is cardinality in the context of categorical variables?
The number of different labels a categorical variable can have
For example, a coin toss has two outcomes: heads or tails.
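Example (a minimal sketch with invented observations): cardinality is the count of distinct labels, which a Python set gives directly.

```python
coin_tosses = ["heads", "tails", "heads", "heads"]                         # hypothetical data
marital_status = ["single", "married", "single", "divorced", "married"]    # hypothetical data

# A set keeps one copy of each label, so its length is the cardinality
coin_cardinality = len(set(coin_tosses))
marital_cardinality = len(set(marital_status))
```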
What is the difference between ordinal and nominal variables?
Ordinal variables have a ranking, while nominal variables do not.
True or false: A test result can be encoded as a number, such as 0 for fail and 1 for pass, making it a discrete variable.
FALSE
It is a categorical variable that was encoded, not a discrete variable.
What is an example of a situation where a categorical variable is encoded as a number?
A test result encoded as 0 for fail and 1 for pass
This illustrates how categorical data can be represented numerically.
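Example (a minimal sketch; the mapping and the results are made up): encoding pass/fail as 1/0 changes only the representation. The codes still stand for labels, so counting per label remains the natural summary.

```python
from collections import Counter

results = ["pass", "fail", "pass", "pass"]   # hypothetical test results
encoding = {"fail": 0, "pass": 1}            # assumed mapping, as in the card above

encoded = [encoding[r] for r in results]     # numeric codes, still categorical

# Reverse the mapping to recover the labels, then count each category
decoding = {code: label for label, code in encoding.items()}
counts = Counter(decoding[c] for c in encoded)
```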
What is a unique ID in the context of data variables?
A set of numbers generated for identification purposes
IDs are typically categorical variables, even if they are numeric.
What is the purpose of descriptive statistics?
Describing and summarising the data
This includes a quantitative approach (numerical summary) and graphical representation (plots).