Mean
Sum of all values divided by the number of values
Median
Middle number in a sorted sequence
Mode
Number that occurs most often within a set
Range
Difference between highest and lowest values
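A minimal sketch of the four summary statistics above, using Python's standard `statistics` module. The sample list is made up for illustration.

```python
import statistics

# Made-up sample data, for illustration only.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)            # sum of values / number of values
median = statistics.median(values)        # middle value once the data is sorted
mode = statistics.mode(values)            # value that occurs most often
value_range = max(values) - min(values)   # highest value minus lowest value

print(mean, median, mode, value_range)
```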
Standard Deviation is a measure used to
quantify the amount of variation of data values
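As a sketch, the population standard deviation can be computed step by step: take the mean, average the squared distances from it, then take the square root. The data is made up, and dividing by n (population) rather than n - 1 (sample) is an assumption of this example.

```python
import math
import statistics

# Made-up data, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)
# Variance: average squared distance of each value from the mean.
variance = sum((x - mean) ** 2 for x in values) / len(values)
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# The standard library's population standard deviation should agree.
library_std_dev = statistics.pstdev(values)
print(std_dev, library_std_dev)
```

For a sample rather than a whole population, `statistics.stdev` divides by n - 1 instead.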
Histogram (2 points)
Name the distribution
normal
Name the distribution
right skewed (named for the side the tail extends toward)
Name type of distribution
Multimodal
Draw
Scatter plots show…
how much one variable is affected by another
Correlations show
how strongly pairs of variables are related
What is the measure of correlation?
correlation coefficient r
1 is perfect positive correlation
0 is no correlation
-1 is perfectly negative correlation
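A minimal sketch of the Pearson correlation coefficient r, written out from its definition. The function name `pearson_r` and the three data sets are made up to show the 1, 0 and -1 cases.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises in step with x
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls in step with x
r_none = pearson_r(xs, [2, 1, 0, 1, 2])        # no linear relationship

print(r_positive, r_negative, r_none)
```

Note that r only measures linear relationships: the third data set has a clear V shape but an r of about 0.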
An outlier is
an observation that lies an abnormal distance from other values in a random sample
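One common way to make "abnormal distance" concrete is the 1.5 x IQR rule used for box plots. The sketch below is an assumption about how that rule might be applied and uses made-up data; it is not necessarily one of the methods these notes have in mind.

```python
import statistics

# Made-up data with one suspicious value, for illustration only.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# Quartiles split the sorted data into four equal parts.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(outliers)
```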
How do you identify outliers? (BPRD)
Handling missing values on a small scale (< 5%)
Drop or omit
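A minimal sketch of the drop-or-omit approach: rows are removed only when the share of incomplete rows is under the 5% threshold. The row layout, the `is_complete` helper and the data are made up for illustration.

```python
# Made-up data set: 40 rows, one of which has a missing score.
rows = [{"id": i, "score": float(i)} for i in range(40)]
rows[7]["score"] = None

def is_complete(row):
    """True when no field in the row is missing (None)."""
    return all(value is not None for value in row.values())

# Share of rows that have at least one missing value.
missing_share = sum(not is_complete(row) for row in rows) / len(rows)

if missing_share < 0.05:
    # Small scale: dropping the incomplete rows loses little information.
    cleaned = [row for row in rows if is_complete(row)]
else:
    # Larger scale: keep the rows and handle the gaps another way.
    cleaned = rows

print(missing_share, len(cleaned))
```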
Methods for handling missing values on a larger scale (MKFS)
3 types of invalid data
missing data values
invalid values that suggest true values
invalid values that provide no information regarding true values
What is scaling?
scaling features to lie between a given minimum and maximum value
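A sketch of min-max scaling, which maps the smallest value to the new minimum and the largest to the new maximum. The function name `min_max_scale` and its parameters are made up for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they lie between new_min and new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot scale a feature whose values are all equal")
    old_span = old_max - old_min
    new_span = new_max - new_min
    return [new_min + (v - old_min) * new_span / old_span for v in values]

# Made-up feature values, for illustration only.
feature = [10, 20, 30, 50]
scaled_0_1 = min_max_scale(feature)               # default range 0 to 1
scaled_sym = min_max_scale(feature, -1.0, 1.0)    # range -1 to 1

print(scaled_0_1, scaled_sym)
```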
Transformation is…
converting data from one format or structure into another format or structure
Feature selection is…
the process of selecting a subset of relevant features for use in model construction
4 Reasons for using feature selection (REIR)
reduces the complexity of a model
enables the machine learning algorithm to train faster
improves accuracy if the right subset is chosen
reduces overfitting
5 methods for dimensionality reduction
Dimensionality reduction…
creates new combinations of attributes