data nerds
50-90% of project time is spent mastering the data
ex: scrubbing, cleaning, cleansing
schema
table (rectangle)
attributes
nuggets
unified modeling language (UML)
relational database
relational databases ensure data…
4 types of attributes
primary key
unique identifier
foreign key
attribute that points to the primary key in another table
composite keys
combo of two foreign keys that together identify a row; used for line items
descriptive attributes
include everything else
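quick sketch of the 4 attribute types, using Python's built-in sqlite3. The tables (customers, products, orders, order_lines) and their columns are made up for illustration and do not come from the lecture.

```python
import sqlite3

# Hypothetical tables chosen only to illustrate the four attribute types.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique identifier
    name        TEXT                   -- descriptive attribute
);
CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    description TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
    order_date  TEXT
);
CREATE TABLE order_lines (
    order_id    INTEGER REFERENCES orders(order_id),
    product_id  INTEGER REFERENCES products(product_id),
    quantity    INTEGER,
    PRIMARY KEY (order_id, product_id)  -- composite key for line items
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO products VALUES (10, 'Widget')")
conn.execute("INSERT INTO orders VALUES (100, 1, '2023-07-06')")
conn.execute("INSERT INTO order_lines VALUES (100, 10, 3)")

def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Same (order_id, product_id) pair twice: the composite key blocks it.
dup_line = rejected("INSERT INTO order_lines VALUES (100, 10, 5)")
# Order pointing at a customer that does not exist: the foreign key blocks it.
orphan = rejected("INSERT INTO orders VALUES (101, 999, '2023-07-06')")
```

both dup_line and orphan should come back True: the keys are what let the database refuse duplicate or dangling rows.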
data dictionaries
legend/log with a full description of each column
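a data dictionary can be as simple as a lookup from column name to description. Minimal sketch below; the column names, the fields kept per column, and the undocumented helper are all hypothetical.

```python
# Hypothetical data dictionary for an "orders" table; every name here is made up.
data_dictionary = {
    "order_id": {
        "type": "integer", "key": "primary",
        "description": "Unique identifier for each order",
    },
    "customer_id": {
        "type": "integer", "key": "foreign",
        "description": "Points to customers.customer_id",
    },
    "order_date": {
        "type": "date", "key": None,
        "description": "Date the order was placed (YYYY-MM-DD)",
    },
}

def undocumented(columns, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    return [c for c in columns if c not in dictionary]

missing = undocumented(["order_id", "customer_id", "ship_date"], data_dictionary)
print(missing)  # ['ship_date']
```

checking a received file's columns against the dictionary is a cheap way to spot fields nobody has described yet.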
Big 4
currently love Alteryx
- extract
- transform
- load
aka cleaning the data
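extract / transform / load can be sketched in plain Python (this is not how Alteryx does it, just the same three stages by hand). The sample data, the invoices table, and the function names are assumptions for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source contents; in practice this would be a real export file.
raw = "invoice_id,amount\n001, 25.50 \n002,(10.00)\n"

def extract(text):
    """Extract: read the rows out of the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace and turn (10.00) into -10.00."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if amount.startswith("(") and amount.endswith(")"):
            amount = "-" + amount[1:-1]
        cleaned.append((row["invoice_id"].strip(), float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the analysis database."""
    conn.execute("CREATE TABLE invoices (invoice_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO invoices VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT SUM(amount) FROM invoices").fetchone()[0]
print(total)  # 15.5
```

invoice_id stays text on purpose so that "001" keeps its zeros; whether to keep or strip them depends on how the IDs are used downstream.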
requesting data process
step 1: determine purpose and scope of data request
step 2: obtain the data – questions
step 2: obtain the data – methods
** think about tables: which tables are related & which attributes are in which table
** if someone else pulls the data, more complex bc you need to explain the request, and bigger orgs have a lot of approval processes
step 3: validate data for completeness and integrity
check that data transferred correctly
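one common way to check a transfer is to compare a record count and a control total against figures supplied by whoever sent the data, plus a scan for blanks. Sketch below; the validate function, the field names, and the expected figures are all hypothetical.

```python
def validate(rows, expected_count, expected_total, amount_field="amount"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Completeness: did every record arrive?
    if len(rows) != expected_count:
        problems.append(f"row count {len(rows)} != expected {expected_count}")

    # Completeness: are any fields blank?
    blanks = [i for i, row in enumerate(rows)
              if any(v in ("", None) for v in row.values())]
    if blanks:
        problems.append(f"blank values in rows {blanks}")

    # Integrity: does the control total match the sender's figure?
    total = sum(row[amount_field] or 0 for row in rows)
    if abs(total - expected_total) > 0.005:
        problems.append(f"control total {total:.2f} != expected {expected_total:.2f}")

    return problems

# Made-up received data: one record is missing and one ID is blank.
received = [
    {"invoice_id": "001", "amount": 25.50},
    {"invoice_id": "002", "amount": -10.00},
    {"invoice_id": "", "amount": 4.00},
]
issues = validate(received, expected_count=4, expected_total=29.50)
for issue in issues:
    print(issue)
```

with this sample, all three checks flag a problem; a clean transfer returns an empty list.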
step 4: clean the data
make data consistent
- remove headings or subtotals
- clean leading zeros and nonprint characters
- format neg numbers
- correct inconsistencies
–> dates (6/7/2023 or 7/6/2023 or 2023-07-06)
–> numbers (1 or I, 6 or six)
–> international character encoding (“” or <>)
–> languages and measures (currency signs)
–> human error (23 or 32)
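a few of the cleaning fixes above, sketched as small Python helpers. The function names are made up, and the code assumes "clean leading zeros" means stripping them and that negatives arrive as (1,200.50) or 1,200.50-; adjust if the source data differs.

```python
from datetime import datetime

def clean_text(value):
    """Drop nonprinting characters and surrounding whitespace."""
    return "".join(ch for ch in value if ch.isprintable()).strip()

def strip_leading_zeros(value):
    """Turn '00123' into '123' (assumes the zeros are padding, not part of an ID)."""
    return clean_text(value).lstrip("0") or "0"

def clean_number(value):
    """Turn '(1,200.50)' or '1,200.50-' into -1200.5."""
    text = clean_text(value).replace(",", "").replace("$", "")
    negative = (
        (text.startswith("(") and text.endswith(")"))
        or text.startswith("-")
        or text.endswith("-")
    )
    number = float(text.strip("()-"))
    return -number if negative else number

def clean_date(value, source_format="%m/%d/%Y"):
    """Convert a date to YYYY-MM-DD; the source format must be known, not guessed."""
    return datetime.strptime(clean_text(value), source_format).date().isoformat()

print(clean_date("6/7/2023"))              # 2023-06-07 (read as month/day)
print(clean_date("6/7/2023", "%d/%m/%Y"))  # 2023-07-06 (read as day/month)
print(clean_number("(1,200.50)"))          # -1200.5
print(strip_leading_zeros("00123"))        # 123
```

the date example shows why 6/7/2023 is a problem: the same string gives two different dates depending on which convention the source used, so the format has to come from the data owner. Human errors like 23 vs 32 cannot be fixed by formatting rules and need a check against source documents.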
step 5: load data for data analysis
import data and make sure functions work properly