Data Cleansing Flashcards

Basic Summary of Data Cleansing Interview Questions (26 cards)

1
Q

What are the types of data sources mentioned?

A
  • Technical
  • Data structure: databases (on-premises, cloud, legacy)
  • Records
  • Reputational
  • Limited
  • Special
  • Social data
  • Public datasets
  • Sourced media
  • Transactional

These categories represent various origins and types of data used in analysis.

2
Q

What is data collection?

A

The process of gathering raw data from various sources for analysis and decision-making.

Data collection is essential for obtaining accurate information.

3
Q

Why is data collection important?

A

It provides accurate information needed for analysis, planning, and decision-making.

Accurate data is crucial for effective decision-making.

4
Q

What are common methods of data collection?

A
  • Interviews
  • Questionnaires
  • Observations
  • Surveys
  • System records

These methods help gather data from various sources.

5
Q

What are primary data?

A

Data collected directly from the source for a specific purpose.

Primary data is often more reliable for specific research needs.

6
Q

What are secondary data?

A

Data already collected by others and reused for analysis.

Secondary data can save time and resources in research.

7
Q

What tools are used for data collection?

A
  • Forms
  • Surveys
  • Mobile apps
  • Sensors
  • Databases
  • Spreadsheets

These tools facilitate the gathering of data efficiently.

8
Q

What problems can occur during data collection?

A
  • Incomplete data
  • Inaccurate responses
  • Bias
  • Data duplication

These issues can compromise the quality of the collected data.

9
Q

What is data accuracy?

A

The degree to which data correctly represents real-world values.

High data accuracy is essential for reliable analysis.

10
Q

What is data validity?

A

The degree to which data measures what it is intended to measure.

Valid data is crucial for drawing correct conclusions.

11
Q

What is data consistency?

A

The property that data is uniform and does not contradict itself across systems.

Consistent data is vital for maintaining integrity in datasets.

12
Q

What is data cleansing (data cleaning)?

A

The process of identifying and correcting or removing errors and inconsistencies in data.

Data cleansing is essential for improving data quality.

13
Q

Why is data cleansing important?

A

It improves data quality, accuracy, and reliability for analysis.

Clean data leads to better decision-making.

14
Q

What types of errors require data cleansing?

A
  • Missing values
  • Duplicate records
  • Incorrect formats
  • Outliers

Addressing these errors is crucial for data integrity.
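
Of the error types above, outliers are the least obvious to detect. A minimal sketch, assuming the common 1.5 × IQR rule (one convention among several, not something the card prescribes) and made-up sample values. A flagged value is only a candidate for review, since an extreme value may still be correct.

```python
from statistics import quantiles

# Hypothetical sample values; 95 looks suspicious next to the rest.
values = [10, 12, 11, 13, 12, 95]

# Quartiles from the standard library (default "exclusive" method).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Assumed rule: flag anything more than 1.5 * IQR outside the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]
```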

15
Q

What is missing data?

A

Data fields with no recorded values.

Missing data can skew analysis results.

16
Q

How can missing data be handled?

A
  • Deleting records
  • Filling values manually
  • Using averages or estimates

Proper handling of missing data is necessary for accurate analysis.
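
Two of the options above, deleting records and filling with an average, can be sketched in plain Python. The records, the `age` field, and the helper names are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical sample: None marks a missing age.
records = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "Cy", "age": 40},
]

def drop_missing(rows, field):
    """Strategy 1: delete records whose field has no value."""
    return [row for row in rows if row[field] is not None]

def fill_with_mean(rows, field):
    """Strategy 2: replace missing values with the mean of the known ones."""
    known = [row[field] for row in rows if row[field] is not None]
    fill = mean(known)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]

dropped = drop_missing(records, "age")
filled = fill_with_mean(records, "age")
```

Deleting is simpler but shrinks the dataset. Filling keeps every record but adds estimated values, so it is worth recording which values were filled.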

17
Q

What is duplicate data?

A

Repeated records representing the same data item.

Duplicate data can lead to inflated results and inaccuracies.

18
Q

How do you remove duplicate data?

A

By using unique identifiers and data-matching techniques.

Effective removal of duplicates is essential for data integrity.
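
A minimal sketch of the idea, assuming email address serves as the matching field and that trimming and lowercasing is enough to match records. The `dedupe` helper and the sample data are hypothetical, and real data-matching is often fuzzier than this.

```python
def dedupe(rows, key_fields):
    """Keep the first record for each matching key; drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        # Matching key: trim and lowercase so " ANA@x.com" matches "ana@x.com".
        key = tuple(str(row[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "ben@example.com"},
    {"id": 3, "email": " ANA@example.com"},
]
unique_customers = dedupe(customers, ["email"])
```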

19
Q

What is data standardization?

A

Converting data into a consistent format (e.g., date or text format).

Standardized data is easier to analyze and compare.
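
For dates, one possible sketch converts several input formats to ISO 8601 (YYYY-MM-DD). The list of accepted formats is an assumption. Ambiguous inputs such as 05/03/2024 are read as day/month here, which may not suit every dataset.

```python
from datetime import datetime

# Assumed input formats, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

dates = ["2024-03-05", "05/03/2024", "March 5, 2024", "soon"]
standardized = [to_iso_date(d) for d in dates]
```

Values that match no format come back as `None` rather than being guessed, so they can be reviewed separately.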

20
Q

What is data validation?

A

Checking data against rules to ensure correctness.

Validation helps maintain data quality.
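
A rule-based check might look like the sketch below. The two rules (an age range and a rough email pattern) and the field names are hypothetical. Real rules would come from the dataset's requirements.

```python
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def valid_age(value):
    # Assumed rule: a whole number between 0 and 120.
    return isinstance(value, int) and 0 <= value <= 120

def valid_email(value):
    # Rough shape check only; it does not prove the address exists.
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None

RULES = {"age": valid_age, "email": valid_email}

def validate(row):
    """Return the names of fields that break a rule (empty list if valid)."""
    return [field for field, check in RULES.items() if not check(row.get(field))]

good = validate({"age": 34, "email": "ana@example.com"})
bad = validate({"age": -5, "email": "not-an-email"})
```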

21
Q

What is data normalization in cleansing?

A

Organizing data to reduce redundancy and improve consistency.

Normalization is key for efficient data management.
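
As an illustration, the sketch below splits a flat orders table, where customer details repeat on every row, into a customers table and an orders table. The table and field names are made up for the example.

```python
# Hypothetical flat table: customer details repeat on every order row.
orders_flat = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Ana", "total": 25.0},
    {"order_id": 2, "customer_id": 10, "customer_name": "Ana", "total": 40.0},
    {"order_id": 3, "customer_id": 11, "customer_name": "Ben", "total": 15.0},
]

# Customers table: each customer stored once, keyed by customer_id.
customers = {
    row["customer_id"]: {"customer_name": row["customer_name"]}
    for row in orders_flat
}

# Orders table: keeps only a reference to the customer.
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "total": r["total"]}
    for r in orders_flat
]
```

With the name stored once, a correction to a customer's name is made in one place instead of on every order row.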

22
Q

How are data collection and data cleansing related?

A

Collected data must be cleansed to ensure it is accurate and usable.

The cleansing process is vital after data collection.

23
Q

What happens if data is not cleansed?

A

It can lead to incorrect analysis and poor decisions.

Unclean data can severely impact outcomes.

24
Q

When should data cleansing be done?

A

After data collection and regularly during data use.

Ongoing cleansing ensures data remains reliable.

25
Q

What tools are used for data cleansing?

A
  • Excel
  • SQL
  • Python
  • R
  • Data management software

These tools facilitate effective data cleansing.

26
Q

Give an example of data cleansing.

A

Removing duplicate customer records and correcting wrong phone number formats.

This example illustrates practical data cleansing techniques.
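
That example can be sketched in a few lines of Python. The customer records are invented, and the rule that a valid phone number has exactly 10 digits is an assumption that would differ by country.

```python
import re

def clean_phone(raw):
    """Keep digits only; return None unless exactly 10 digits remain (assumed rule)."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None

customers = [
    {"name": "Ana Lee", "phone": "(555) 010-1234"},
    {"name": "Ana Lee", "phone": "555.010.1234"},
    {"name": "Ben Roy", "phone": "555-0199"},
]

cleaned = {}
for c in customers:
    phone = clean_phone(c["phone"])
    key = (c["name"].lower(), phone)
    # First record for each (name, phone) pair wins; later repeats are dropped.
    cleaned.setdefault(key, {"name": c["name"], "phone": phone})

result = list(cleaned.values())
```

The phone numbers are cleaned before matching, so the two Ana Lee records, which differ only in formatting, collapse into one.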