What’s the difference between structured and unstructured data?
Structured data: data that can be easily organized into data tables (e.g., stock returns)
Unstructured data: data that can’t be easily organized into tables (e.g., social media images, text messages)
What are the 4 V’s of big data in investment management?
What are the 5 steps in building an ordinary machine learning model (for structured data)? DDDDM
Ordinary ML Model Building Steps
- Step 1: Determining the modeling task/output
- Step 2: Data collection
- Step 3: Data preparation and wrangling (data are cleaned, e.g., by removing outliers.)
- Step 4: Data exploration
- Step 5: Model training (the model begins working, and necessary adjustments are made.)
What are the 4 steps in building a text machine learning model? DDTT
Text ML Model Building Steps
- Step 1: Text problem formulation (define objective).
- Step 2: Data (text) collection
- Step 3: Text preparation and wrangling (data needs to be cleansed & preprocessed.)
- Step 4: Text exploration
What is data cleansing/preparation vs data wrangling/preprocessing?
Data Cleansing (Preparation): Improve data quality by removing or correcting errors. (Raw data contain errors such as inaccuracies and duplications and must be cleaned up before use in an ML model. Data cleansing involves identifying and dealing with these errors.)
Data Wrangling (Preprocessing): Transform raw data into a usable format for analysis. (Once cleansed, the data are prepared for use as inputs to an ML model. This involves addressing outliers, identifying useful variables, and formatting the data appropriately.)
What are the 6 possible errors when preparing data to be read by computers? IIIIND
What are the 5 techniques used in the data wrangling (preprocessing) step? FAFSC
What is the difference between trimming and winsorization (for structured data)?
Trimming: removing all extreme values and outliers (e.g., the top 5% and bottom 5% of observations are removed).
Winsorization: replacing extreme values with the maximum or minimum of the data not containing outliers.
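The contrast between the two techniques can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `trim` and `winsorize`, the `pct` parameter (share of observations treated as extreme in each tail), and the sample returns are all assumptions made for the example.

```python
def trim(values, pct):
    """Drop the lowest and highest `pct` share of observations.

    Returns the surviving observations in sorted order.
    """
    ordered = sorted(values)
    k = int(len(ordered) * pct)  # observations removed from each tail
    return ordered[k:len(ordered) - k]


def winsorize(values, pct):
    """Replace tail observations with the min/max of the non-extreme data.

    The number of observations is unchanged; only extreme values are capped.
    """
    ordered = sorted(values)
    k = int(len(ordered) * pct)
    low = ordered[k]                      # minimum of the data without outliers
    high = ordered[len(ordered) - k - 1]  # maximum of the data without outliers
    return [min(max(v, low), high) for v in values]


# Hypothetical daily returns with one extreme value in each tail.
returns = [-0.40, -0.02, -0.01, 0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.55]

trimmed = trim(returns, 0.10)          # 8 observations remain
winsorized = winsorize(returns, 0.10)  # 10 observations remain, tails capped
```

Trimming shrinks the sample, while winsorization keeps every observation and only caps the extremes, which matters when the sample is small.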
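What’s the difference between normalization and standardization, and what is the formula for each (for structured data)?
Normalization (rescales values to the range 0 to 1): (Xi - Xmin) / (Xmax - Xmin)
Standardization (centers values at 0 and scales them by the standard deviation): (Xi - μ) / σ
Xi = observed value of the variable
Xmin, Xmax = minimum and maximum values of the variable
μ = mean
σ = standard deviation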
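The two rescaling formulas can be checked with a short Python sketch. The function names and sample values are illustrative assumptions, and the population standard deviation (`statistics.pstdev`) is used here only as one possible choice, since the notes do not say whether the population or sample version is intended.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max normalization: (Xi - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Standardization: (Xi - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical observations of a single variable.
x = [2.0, 4.0, 6.0, 8.0]

x_norm = normalize(x)   # smallest value maps to 0, largest to 1
x_std = standardize(x)  # result has mean 0 and standard deviation 1
```

Normalization is sensitive to outliers because it depends on the minimum and maximum, while standardization depends on the mean and standard deviation.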
What is text processing?
process of taking unstructured data and turning it into structured data.
What are 4 common steps of Text preparation (cleansing)?
What is text wrangling (preprocessing)?
Process of splitting text into tokens (a token is typically a word).
What are the 4 techniques used in the process of normalizing text data? LSSL
What is a bag of words after the normalization process?
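As a rough illustration of splitting text into tokens, the sketch below uses a regular expression that keeps runs of letters and digits and discards punctuation and whitespace. The `tokenize` function and its pattern are assumptions for this example; real tokenizers handle many more cases (contractions, hyphens, numbers with decimals).

```python
import re


def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return re.findall(r"[A-Za-z0-9]+", text)


tokens = tokenize("Stock prices closed higher today.")
```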
What is document term matrix?
The bag of words doesn’t tell us the sequence of words so what can we use instead?
What are unigrams (1-grams), bigrams (2-grams), and trigrams (3-grams)?
Unigram
Stock prices closed higher today > “stock” “prices” “closed” “higher” “today”
Bigram
Stock prices closed higher today > “stock_prices” “prices_closed” “closed_higher” “higher_today”
Trigram
Stock prices closed higher today > “stock_prices_closed” “prices_closed_higher” “closed_higher_today”
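The unigram, bigram, and trigram examples above can be reproduced with a small Python sketch. The `ngrams` function is a hypothetical helper written for this illustration; it assumes simple whitespace tokenization and lowercasing, and joins the words of each n-gram with an underscore to match the notation used in the examples.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as underscore-joined lowercase tokens."""
    tokens = text.lower().split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Stock prices closed higher today"

unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```

A sentence with 5 tokens yields 5 unigrams, 4 bigrams, and 3 trigrams, since each n-gram is a sliding window of n consecutive tokens.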
What are the 3 activities conducted as part of data exploration for structured data?
What is exploratory data analysis (EDA) for structured data?
e.g., One-dimensional data can be summarized with the mean and median.
e.g., Two-dimensional data can be summarized with a correlation matrix.
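The two summaries mentioned above can be sketched in plain Python. The helper names `pearson` and `correlation_matrix` and the sample return series are assumptions made for illustration; the correlation shown is the Pearson correlation coefficient.

```python
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical return series; stock_b is exactly twice stock_a.
data = {
    "stock_a": [0.01, 0.02, -0.01, 0.03],
    "stock_b": [0.02, 0.04, -0.02, 0.06],
}

# One-dimensional summaries for a single variable.
avg_a = mean(data["stock_a"])
med_a = median(data["stock_a"])

# Two-dimensional summary across variables.
corr = correlation_matrix(data)
```

Because `stock_b` is a positive multiple of `stock_a` in this made-up data, their correlation is 1 up to floating-point rounding.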
What is feature selection for structured data?
What is feature engineering for structured data?
What is exploratory data analysis for unstructured data (aka text analytics)?
What are 3 common applications of text analytics?
Text classification — This uses supervised ML methods to place text into different classes.
Topic modeling — This creates clusters of text using unsupervised ML methods.
Sentiment analysis — This is similar to text classification in that the text is grouped (positive, neutral, or negative). Either supervised or unsupervised ML methods can be used in this analysis.
What is a collection of text called?