REVERSED
from py_stringmatching import similarity_measure as sm
What is the Python library for computing similarity measures?
REVERSED
What is concept hierarchy generation?
REVERSED
Effective if data is clustered but not if data is “smeared”
When is data reduction through clustering useful and when is it not useful?
REVERSED
Random error or variance in a measured variable
What is noise in data?
REVERSED
What are the 4 heuristic methods for selecting the subset in attribute subset selection?
REVERSED
Quantifies the local density of a data point with the use of a neighbourhood of size k
-Introduces a smoothing parameter: the reachability distance RD
$RD_k(x, y) = \max\{k\text{-dist}(y),\ \mathrm{dist}(x, y)\}$, where $k\text{-dist}(y)$ is the distance between y and its k-th nearest neighbour
-The local reachability density of point x is:
$LRD_k(x) = k \,/ \sum_{y \in KNN(x)} RD_k(x, y)$
-The local outlier factor LOF is:
$LOF_k(x) = \frac{1}{k} \sum_{y \in KNN(x)} \frac{LRD_k(y)}{LRD_k(x)}$
-Generally, LOF > 1 means x has a lower density than its neighbours, so x is a candidate outlier
What is the local outlier factor for outlier detection?
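The three formulas on this card can be traced with a small brute-force sketch. It assumes the standard definition in which the k-distance inside the reachability distance is taken at the neighbour y, uses Euclidean distance, and breaks distance ties by index. The function name `lof_scores` and the sample points are illustrative only.

```python
import math

def lof_scores(points, k):
    """Brute-force LOF for a small list of points (tuples of floats)."""
    n = len(points)
    # For every point, the k nearest other points as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    # k-dist: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    def lrd(i):
        # RD_k(x, y) = max(k-dist(y), dist(x, y)), summed over the neighbours y of x.
        total = sum(max(k_dist[j], d) for d, j in neighbours[i])
        return k / total

    lrds = [lrd(i) for i in range(n)]
    # LOF_k(x) = (1/k) * sum over neighbours y of LRD_k(y) / LRD_k(x).
    return [
        sum(lrds[j] for _, j in neighbours[i]) / (k * lrds[i]) for i in range(n)
    ]

# Four points on a unit square plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
```

In this toy data the four square corners score about 1, while the isolated point scores well above 1 because its neighbours sit in a much denser region.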
REVERSED
lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)
How do you compute the Levenshtein similarity between strings s1 and s2 in Python?
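To see what a Levenshtein similarity score measures, here is a dependency-free sketch. It assumes the similarity is the edit distance normalised by the longer string's length, 1 - distance / max(len(s1), len(s2)); that normalisation is an assumption about how such a score is usually defined, not a copy of the library's code. Both function names are hypothetical.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

def levenshtein_similarity(s1, s2):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(s1, s2) / longest

sim = levenshtein_similarity("kitten", "sitting")
```

Identical strings score 1.0, and the score falls towards 0 as more edits are needed relative to the string length.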
REVERSED
What are examples of data quality metrics? (5)
REVERSED
Novelty detection checks whether a new observation fits the existing data or would be considered an outlier; outlier detection looks for outliers within the existing data itself
What is the difference between outlier detection and novelty detection?
REVERSED
Attributes that duplicate much or all of the information contained in one or more other attributes
What are redundant attributes?
REVERSED
Transform the multivariate outlier detection task into a univariate outlier detection problem
What is the general approach for outlier detection with multivariate data?
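One way to picture this reduction: map every multivariate point to a single number, then apply a univariate rule to that number. The sketch below assumes Euclidean distance to the centroid as the mapping (Mahalanobis distance is a common alternative) and a z-score threshold of 2 as the univariate rule. Both choices, and the name `multivariate_outliers`, are illustrative assumptions.

```python
import math
import statistics

def multivariate_outliers(points, threshold=2.0):
    """Return indices of points whose distance to the centroid is unusually large."""
    dims = len(points[0])
    centroid = [statistics.fmean(p[d] for p in points) for d in range(dims)]
    # Step 1: reduce each multivariate point to one value (its distance).
    distances = [math.dist(p, centroid) for p in points]
    # Step 2: univariate outlier rule on the distances (assumed: z-score).
    mu = statistics.fmean(distances)
    sigma = statistics.pstdev(distances)
    if sigma == 0:
        return []
    return [i for i, d in enumerate(distances) if (d - mu) / sigma > threshold]

points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5),
    (1.0, 1.5), (2.0, 1.5), (1.5, 1.0), (1.5, 2.0), (10.0, 10.0),
]
flagged = multivariate_outliers(points)
```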
REVERSED
What is correlation analysis for discretisation?
REVERSED
Fit a model to the data and store the model parameters instead of the data
What is model based data reduction?
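As a minimal sketch of the idea, the example below fits a straight line by ordinary least squares and keeps only the two fitted parameters in place of 100 data points. Linear regression is just one possible model, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# 100 (x, y) pairs that lie exactly on a line.
xs = list(range(100))
ys = [3.0 * x + 2.0 for x in xs]

# The reduced representation: two numbers instead of 100 pairs.
slope, intercept = fit_line(xs, ys)

def approx_y(x):
    """Reconstruct an approximate y value from the stored model."""
    return slope * x + intercept
```

With real, noisy data the stored model only approximates the original values, so the reduction is lossy.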
REVERSED
Problem of identifying and linking/grouping different representations of the same real-world object
What is entity resolution?
REVERSED
df.corr()
How do you find the correlation matrix for a DataFrame in Python?
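By default `df.corr()` computes pairwise Pearson correlation between the numeric columns. The dependency-free sketch below shows that calculation for a small table held as a dict of lists. The helper names and column names are hypothetical, and pandas-specific behaviour such as missing-value handling is not reproduced.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

table = {
    "length_cm": [1.0, 2.0, 3.0, 4.0],
    "length_mm": [10.0, 20.0, 30.0, 40.0],  # duplicates length_cm
    "rank": [4.0, 3.0, 2.0, 1.0],           # perfectly anti-correlated
}
matrix = correlation_matrix(table)
```

A coefficient close to 1 or -1 between two columns, as for `length_cm` and `length_mm` here, is a common sign that one of them is redundant.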
REVERSED
global, contextual, collective
What are the three kinds of outliers?
REVERSED
Do not assume an a priori statistical model; determine the model from the input data instead
e.g. histogram and kernel density estimation
What are non-parametric methods for outlier detection?
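A histogram-based detector can be sketched in a few lines: bin the values, then flag any value that falls in a sparsely populated bin. The bin width and the minimum-fraction cutoff are assumed tuning parameters, and `histogram_outliers` is a hypothetical name.

```python
import math
from collections import Counter

def histogram_outliers(values, bin_width, min_fraction):
    """Flag values whose histogram bin holds less than min_fraction of the data."""
    # The histogram is built from the data itself; no distribution is assumed.
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return [
        v for v in values
        if counts[math.floor(v / bin_width)] / n < min_fraction
    ]

values = [10, 11, 12, 11, 10, 12, 11, 13, 12, 11, 95]
outliers = histogram_outliers(values, bin_width=5, min_fraction=0.1)
```

The result depends strongly on the bin width: bins that are too narrow flag normal points, and bins that are too wide hide outliers.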
REVERSED
What are the three types of outlier detection methods?
REVERSED
Simple random sampling may have poor performance in the presence of skew
When does simple random sampling have poor performance?
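To illustrate the problem with skew, the sketch below compares a simple random sample with a stratified sample on data where one class makes up only 2% of the records. Stratified sampling is brought in here as a commonly used remedy and is an addition to this card; the helper `stratified_sample`, the class labels, and the sample sizes are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, rng):
    """Draw up to per_stratum records from each group defined by key."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

rng = random.Random(0)  # fixed seed so the run is repeatable

# Skewed data: 980 "common" records and only 20 "rare" ones.
records = [("common", i) for i in range(980)] + [("rare", i) for i in range(20)]

# Simple random sample: the rare class can easily be missed entirely.
simple = rng.sample(records, 10)

# Stratified sample: every class is represented by construction.
stratified = stratified_sample(records, key=lambda r: r[0], per_stratum=5, rng=rng)
```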
REVERSED
checking permitted characters
finding type-mismatched data
What is data validation?
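Both checks on this card can be sketched with the standard library. The permitted character set (letters, digits, underscore) and the helper names are assumptions chosen for illustration.

```python
import re

# Assumed rule: only letters, digits and underscores are permitted.
PERMITTED = re.compile(r"[A-Za-z0-9_]+")

def has_only_permitted_chars(value):
    """True if every character of the string is in the permitted set."""
    return PERMITTED.fullmatch(value) is not None

def type_mismatches(values, expected_type):
    """Return (index, value) pairs whose type differs from expected_type."""
    return [
        (i, v) for i, v in enumerate(values) if not isinstance(v, expected_type)
    ]

ok = has_only_permitted_chars("user_42")
bad = has_only_permitted_chars("user 42!")
mismatched = type_mismatches([1, 2, "3", 4.0], int)
```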
REVERSED
What do we need in a definition of data quality? (3)
REVERSED
Assumes that the normal data is generated by a parametric distribution with the parameter theta
What are parametric methods for outlier detection?
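A minimal sketch of a parametric detector, assuming the normal data follows a Gaussian so that theta is the pair (mean, standard deviation): estimate theta from the data, then flag values that lie too many standard deviations from the mean. The cutoff of 3 and the name `gaussian_outliers` are illustrative assumptions.

```python
import statistics

def gaussian_outliers(values, threshold=3.0):
    """Flag values more than threshold standard deviations from the mean."""
    # Estimate theta = (mu, sigma) of the assumed Gaussian from the data.
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

values = [48, 49, 50, 51, 52] * 4 + [120]
outliers = gaussian_outliers(values)
```

Because the outliers themselves inflate the estimated mean and standard deviation, this simple version can miss outliers in small or heavily contaminated samples.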
REVERSED
# fill each NA with the value before it
data.fillna(method='pad')  # or method='ffill'
# fill each NA with the value after it
data.fillna(method='bfill')  # or method='backfill'
# set a limit on the number of consecutive forward or backward fills
data.fillna(method='pad', limit=1)
What are the 2 different methods for filling NAs in Python?
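The same fills can be tried on a small Series. This sketch uses the `ffill()` and `bfill()` methods rather than `fillna(method=...)`, since recent pandas releases deprecate the `method` argument of `fillna`. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each NA takes the last valid value before it.
forward = data.ffill()

# Backward fill: each NA takes the next valid value after it.
# The trailing NA has nothing after it, so it stays NA.
backward = data.bfill()

# limit=1 fills at most one consecutive NA per gap.
limited = data.ffill(limit=1)
```

With `limit=1`, the second NA in the two-value gap is left unfilled, which is useful when long gaps should not be papered over.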
REVERSED
What makes data “dirty”? (2)