What is data cleaning and why is it important?
What are the stages of data cleaning?
What are the key properties of data cleaning?
How can we deal with missing data?
How do we normalise data?
Ensure data is stored in the correct type (int, string, etc.)
Correct casing, get rid of whitespace, rename columns
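A minimal sketch of the two points above, assuming a small made-up dataframe (the column names and values here are hypothetical, not from any real dataset). It fixes the column names, tidies the text values and converts a numeric column that was read in as strings.

```python
import pandas as pd

# Hypothetical raw data: untidy column names, stray whitespace,
# inconsistent casing and prices stored as strings
raw = pd.DataFrame({
    " Product Name ": ["  apple", "BANANA ", "Cherry"],
    "PRICE": ["1.50", "0.25", "3.00"],
})

# Rename columns: strip whitespace, lower-case, replace spaces with underscores
clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# Correct casing and get rid of whitespace in the text values
clean["product_name"] = clean["product_name"].str.strip().str.lower()

# Store the price as a float rather than a string
clean["price"] = clean["price"].astype(float)

print(clean.dtypes)
```

If a column may contain values that cannot be converted, pd.to_numeric(..., errors="coerce") is one alternative to astype; it turns bad values into NaN instead of raising an error.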
What techniques can identify missing and irregular data?
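Fill empty values with the mean value
ave_price = df["price"].mean()
print(ave_price)
# fillna returns a new object, so assign the result back; fill only the price column
df["price"] = df["price"].fillna(ave_price)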
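Check how many values are null in each column, and in the whole dataframe
df.isnull().sum()        # null count per column
df.isnull().sum().sum()  # null count for the whole dataframe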
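As a rough illustration of finding missing data, the sketch below builds a small hypothetical dataframe (the columns price and colour are made up for the example) and counts nulls per column, counts them overall, and pulls out the rows that contain at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "colour": ["red", None, "blue", "green"],
})

# Number of nulls in each column
nulls_per_column = df.isnull().sum()

# Number of nulls in the whole dataframe
total_nulls = int(df.isnull().sum().sum())

# Rows that have a null in at least one column
rows_with_nulls = df[df.isnull().any(axis=1)]

print(nulls_per_column)
print(total_nulls)
print(rows_with_nulls)
```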
Is it best to fill with mean, median or mode?
Mean: preferred if the data is numeric and not skewed.
Median: preferred if the data is numeric and skewed.
Mode: preferred if the data is a string (object); it can also be used for numeric data, typically when the values are discrete.
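The guidance above on mean, median and mode can be sketched as follows. The dataframe is hypothetical: price contains one very large value, so it is treated as skewed and filled with the median, while the text column colour is filled with the mode.

```python
import pandas as pd

# Hypothetical data: price is skewed by one large value, colour is text
df = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, None, 500.0],
    "colour": ["red", "blue", None, "red", "red"],
})

# Numeric and skewed: fill with the median
median_price = df["price"].median()
df["price"] = df["price"].fillna(median_price)

# String data: fill with the mode
# mode() returns a Series because there can be ties, so take the first value
most_common_colour = df["colour"].mode()[0]
df["colour"] = df["colour"].fillna(most_common_colour)

print(df)
```

One way to judge skew is df["price"].skew(); what counts as "skewed enough" to prefer the median is a judgment call and depends on the data.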
Datetime object in Pandas
import pandas as pd
data["date"] = pd.to_datetime(data["date"])
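A short sketch of the conversion on a hypothetical dataframe (the date strings are made up). It assumes that unparseable values should become missing rather than raise an error, which is what errors="coerce" does, and then uses the .dt accessor to pull out parts of the date.

```python
import pandas as pd

# Hypothetical data: dates stored as strings, one of them invalid
data = pd.DataFrame({"date": ["2021-01-15", "2021-02-20", "not a date"]})

# Convert to datetime; values that cannot be parsed become NaT (missing)
data["date"] = pd.to_datetime(data["date"], errors="coerce")

# Once the column is a datetime type, date parts are available via .dt
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month

print(data.dtypes)
print(data)
```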