What is Data Science
process of building, cleaning, and structuring datasets in order to analyse them and extract meaning
Process of Data Science
key principles in DS
What does the discussion of probability include
- random experiments that produce a set of possible outcomes (there can be infinitely many outcomes)
elements of probability model (uncertainty of experiment)
conditional probability
probability of event A given that event B has occurred: P(A | B) = P(A and B) / P(B), so P(B) is the DENOMINATOR.
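A minimal sketch of conditional probability, P(A | B) = P(A and B) / P(B), on one roll of a fair six-sided die. The die and the events A and B are assumptions chosen for illustration, not part of the notes.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is at least 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

# P(A | B) = P(A and B) / P(B); P(B) is the denominator.
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 2/3
```

Knowing that B occurred raises the probability of A from 1/2 to 2/3 here, because two of the three outcomes in B are even.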
independent
A and B are independent if the occurrence of B provides no information about A. Equivalently, the probability of the intersection of A and B equals the product: P(A and B) = P(A) * P(B)
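A small check of the product rule for independence, assuming two fair coin flips (a made-up example). The events are "first flip is heads" and "second flip is heads".

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair coin flips, 4 equally likely outcomes.
outcomes = set(product("HT", repeat=2))
A = {o for o in outcomes if o[0] == "H"}  # first flip is heads
B = {o for o in outcomes if o[1] == "H"}  # second flip is heads

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Independent exactly when P(A and B) equals P(A) * P(B).
independent = prob(A & B) == prob(A) * prob(B)
print(independent)  # True
```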
Variable?
a variable is any characteristic observed in a study. It summarises ALL outcomes of a random process
quantitative variable
there is a meaningful distance between any two data points
types of categorical variable
- nominal
types of quantitative variable
- continuous (possible values form an interval)
distribution of a variable (probability distribution)
list of the possible outcomes together with the probability associated with each
Cumulative probability distribution
probability that the discrete variable is less than or equal to a particular value.
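A sketch of a probability distribution and its cumulative version for a discrete variable, assuming a fair six-sided die (an illustrative choice). The names `pmf` and `cum` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

# Assumed example: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]

# Probability distribution: each possible outcome with its probability.
pmf = {v: Fraction(1, 6) for v in values}

# Cumulative distribution: P(X <= v), a running sum of the probabilities.
cum = dict(zip(values, accumulate(pmf[v] for v in values)))

print(cum[4])  # 2/3
```

The probabilities sum to 1, so the cumulative value at the largest outcome is always 1.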
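probability density function (used for a continuous variable, as it is impossible to list all values and the probability of each value)
A probability density function (PDF) is a curve for a continuous variable: the area under the curve over an interval is the probability that the value falls within that interval. The height of the curve at a single point is a density, not a probability.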
cumulative distribution function
Cumulative distribution function (CDF) is the probability that the variable is less than or equal to a particular value.
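A sketch relating the two functions, assuming a standard normal variable (the choice of distribution is an assumption). The probability of an interval is computed once as a difference of CDF values and once as the area under the PDF.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal variable at x (a height, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal variable, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(-1 <= X <= 1) as a difference of two CDF values.
p_interval = normal_cdf(1) - normal_cdf(-1)

# The same probability approximated as the area under the PDF
# (midpoint rule with n thin rectangles).
n = 10_000
width = 2 / n
area = sum(normal_pdf(-1 + (i + 0.5) * width) for i in range(n)) * width

print(round(p_interval, 4))  # 0.6827
```

Both routes agree to many decimal places, which is the sense in which the CDF accumulates the area under the PDF.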
modal category?
category with the highest frequency
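A quick sketch of finding the modal category from frequency counts. The observations are made up for illustration.

```python
from collections import Counter

# Made-up observations of a categorical variable.
colours = ["red", "blue", "red", "green", "blue", "red"]

# Frequency of each category.
counts = Counter(colours)

# The modal category is the one with the highest frequency.
modal_category, frequency = counts.most_common(1)[0]
print(modal_category, frequency)  # red 3
```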
Bar plot (common way to display categorical variable)
One vertical bar for each possible category that could occur,
with the height proportional to the frequency of that category.
Histogram (quantitative variable)
Weakness of range?
sensitive to extreme observations
variance definition
average of the squared deviations from the mean
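A worked sketch of the variance on made-up data. It is an assumption here that "average" means dividing by n (population variance). The sample variance, which divides by n - 1, is shown for comparison.

```python
# Made-up data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Squared deviation of each observation from the mean.
squared_devs = [(x - mean) ** 2 for x in data]

# Population variance: average of the squared deviations (divide by n).
population_variance = sum(squared_devs) / len(data)

# Sample variance: divide by n - 1 instead.
sample_variance = sum(squared_devs) / (len(data) - 1)

print(population_variance)  # 4.0
```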
empirical rules of SD
interquartile range
range between upper and lower quartiles (robust to outliers)
5 number summary
min, lower quartile, median (X0.5), upper quartile, max (min and max are taken without considering outliers)
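A sketch of the five-number summary and the interquartile range on made-up data, using Python's `statistics.quantiles`. Quartile conventions differ between textbooks and tools. The "inclusive" method is one choice, assumed here, and this sketch does not screen out outliers before taking the min and max.

```python
import statistics

# Made-up data for illustration.
data = [1, 3, 5, 7, 9, 11, 13]

# Quartiles under the "inclusive" convention (other conventions give
# slightly different values on small datasets).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Five-number summary: min, lower quartile, median, upper quartile, max.
summary = (min(data), q1, median, q3, max(data))

# Interquartile range: distance between upper and lower quartiles.
iqr = q3 - q1

print(summary)
print(iqr)
```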
when does an association exist?
an association exists if a particular value of one variable (the response/dependent variable) is more likely to occur with certain values of another variable (the explanatory/independent variable)
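A sketch of checking for an association between two categorical variables by comparing conditional proportions. The records, the variable names, and the helper `conditional_proportion` are all made up for illustration.

```python
from collections import Counter

# Made-up (explanatory, response) pairs: smoker status and cough status.
records = (
    [("smoker", "cough")] * 30
    + [("smoker", "no cough")] * 20
    + [("non-smoker", "cough")] * 10
    + [("non-smoker", "no cough")] * 40
)
counts = Counter(records)

def conditional_proportion(response, explanatory):
    """Proportion of a response value among records with one explanatory value."""
    total = sum(c for (e, _), c in counts.items() if e == explanatory)
    return counts[(explanatory, response)] / total

p_cough_smoker = conditional_proportion("cough", "smoker")
p_cough_non_smoker = conditional_proportion("cough", "non-smoker")

# The response value "cough" is more likely for one explanatory value
# than the other, which is what an association means.
print(p_cough_smoker, p_cough_non_smoker)  # 0.6 0.2
```

If the two proportions were equal, the response would be no more likely under either explanatory value, and this data would show no association.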