Big Data Projects Flashcards by Shannon Smith

What’s the difference between structured and unstructured data?

Structured data: data that can be easily organized into data tables (e.g., stock returns).

Unstructured data: data that can’t be easily organized into tables (e.g., social media images, text messages).

What are the 4 V’s of big data in investment management?

Volume: The quantity of data being collected
Variety: The array of available data sources
Velocity: The speed at which data are created
Veracity: The credibility and reliability of data sources (e.g., an estimated 20% of internet content is spam and 10-15% of social media content is fake)

What are the 5 steps in building an ordinary machine learning model (for structured data)? DDDDM

Ordinary ML Model Building Steps
- Step 1: Determining the modeling task/output
- Step 2: Data collection
- Step 3: Data preparation and wrangling (data are cleaned, e.g., by removing outliers)
- Step 4: Data exploration
- Step 5: Model training (the model begins working and necessary adjustments are made)

What are the 4 steps in building a text machine learning model? DDTT

Text ML Model Building Steps
- Step 1: Text problem formulation (define objective).
- Step 2: Data (text) collection
- Step 3: Text preparation and wrangling (text needs to be cleansed and preprocessed)
- Step 4: Text exploration

What is data cleansing/preparation vs data wrangling/preprocessing?

Data Cleansing (Preparation): Improve data quality by removing or correcting errors. (Raw data contain errors such as inaccuracies and duplications and must be cleaned up before use in an ML model. Data cleansing involves identifying and dealing with these errors.)

Data Wrangling (Preprocessing): Transform raw data into a usable format for analysis. (Once cleansed, the data are prepared for use as inputs in an ML model. This activity involves addressing outliers, identifying useful variables, and formatting data appropriately so they can be used as inputs in an ML model.)

What are 6 possible errors when preparing data to be read by computers? IIIIND

Incompleteness: Some data are missing.
Invalidity: Some data don’t make sense (e.g., a house price of -100K).
Inaccuracy: Some data are wrong but look reasonable (e.g., a sale price recorded as $500K when the house actually sold for $450K).
Inconsistency: The same item is recorded with conflicting values (e.g., one row says a house was sold for $500K, but another part of the dataset says the same house was sold for $450K).
Non-uniformity: Data are provided in different formats.
Duplication: Two different entries have the same data.

What are the 5 techniques used in the data wrangling (preprocessing) step? FAFSC

Feature extraction: creating a new variable from an existing variable for better analysis (e.g., using the date of birth to create a variable for age).
Aggregation: combining variables that contain similar information (e.g., price return + roll return + collateral return can be aggregated into total return for valuing commodities).
Filtration: removing rows that are not applicable to the project.
Selection: removing data columns that are not needed.
Conversion: making appropriate adjustments to data so they are relevant for analytical purposes (e.g., values of properties in various countries are not comparable unless stated in common currency terms).

What is the difference between trimming and winsorization (for structured data)?

Trimming: removing all extreme values and outliers (e.g., the top 5% and bottom 5% of observations are removed).

Winsorization: replacing extreme values with the maximum or minimum of the data that remain once the outliers are excluded.
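
A minimal Python sketch of the two treatments, assuming a 5% cutoff in each tail as in the example above. The function names, the `tail_pct` parameter, and the sample returns are hypothetical and only for illustration.

```python
def trim(values, tail_pct=5):
    """Drop the lowest and highest tail_pct percent of observations."""
    ordered = sorted(values)
    k = len(ordered) * tail_pct // 100  # observations cut from each tail
    return ordered[k:len(ordered) - k]


def winsorize(values, tail_pct=5):
    """Replace each tail with the nearest value that is not in a tail."""
    ordered = sorted(values)
    k = len(ordered) * tail_pct // 100
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    # Clamp every observation into the [low, high] range
    return [min(max(v, low), high) for v in values]


# Hypothetical returns with one extreme value in each tail
returns = [-40, -3, -2, -1, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 90]
print(trim(returns))       # 18 observations remain: -40 and 90 are dropped
print(winsorize(returns))  # still 20 observations: -40 becomes -3, 90 becomes 8
```

Trimming shrinks the sample, while winsorization keeps the sample size and only caps the extremes.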

What’s the difference between normalization and standardization, and what is the formula for each (for structured data)?

Normalization rescales each observation to a range of 0 to 1 based on where it falls in the range from lowest (minimum) to highest (maximum)

Normalized Xi = (Xi - Xmin) / (Xmax - Xmin)

Standardization produces a scaled value based on how many standard deviations an observation differs from the mean

Standardized Xi = (Xi - μ) / σ

Xi = value of the variable
μ = mean
σ = standard deviation
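
The two formulas above can be sketched in Python as follows. The function names and sample prices are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the card does not say whether the sample or population version is meant.

```python
from statistics import mean, pstdev


def normalize(values):
    """Rescale to 0..1: (Xi - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Rescale to standard deviations from the mean: (Xi - mean) / stdev."""
    mu, sigma = mean(values), pstdev(values)  # population stdev is an assumption
    return [(x - mu) / sigma for x in values]


prices = [10.0, 20.0, 30.0, 40.0, 50.0]
print(normalize(prices))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(prices))  # values centered on 0; the middle price maps to 0.0
```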

What is text processing?

process of taking unstructured data and turning it into structured data.

What are 4 common steps of Text preparation (cleansing)?

Removing html tags: Many web pages include html markup tags, which do not contribute to the text being analyzed.
Removing punctuation: In most cases, punctuation does not help in text analysis and should be removed (e.g., commas).
Removing numbers: Numbers are replaced with a placeholder (e.g., 12 million becomes /number/ million); otherwise the computer may treat each number as a separate word.
Removing white spaces: Extra white spaces are unnecessary and should be deleted.
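
A rough Python sketch of the four cleansing steps using regular expressions. The function name `cleanse` and the sample sentence are hypothetical. As an assumption of this sketch, numbers are replaced before punctuation is removed so that the slashes in the /number/ placeholder survive, which means any slash already in the text is also kept.

```python
import re


def cleanse(text):
    """Apply the four text cleansing steps to a raw string."""
    text = re.sub(r"<[^>]+>", " ", text)             # remove html tags
    text = re.sub(r"\d[\d,.]*", " /number/ ", text)  # replace numbers with a placeholder
    text = re.sub(r"[^\w\s/]", " ", text)            # remove punctuation (slashes kept)
    text = re.sub(r"\s+", " ", text).strip()         # collapse extra white space
    return text


raw = "<p>Revenue rose   to 12 million, up 8%!</p>"
print(cleanse(raw))  # Revenue rose to /number/ million up /number/
```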

What is text wrangling (preprocessing)?

process of separating text into tokens (a token is typically a word)

What are the 4 techniques used in the process of normalizing text data? LSSL

lowercasing (“Money” becomes “money” so all occurrences look the same)
stop words (removing words like “the” and “is” to reduce the number of tokens)
stemming (converting a word into its base form, e.g., “increasing” or “increase” becomes “increas”)
lemmatization (converting a word into its morphological root, e.g., “better” and “best” are changed to “good”)
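
A small Python sketch of the first two techniques, lowercasing and stop word removal. The stop word list here is a tiny hypothetical one, and stemming and lemmatization are left out because they normally rely on a linguistic library rather than a few lines of code.

```python
# Tiny hypothetical stop word list; real lists are much longer
STOP_WORDS = {"the", "is", "a", "an", "of", "to"}


def normalize_tokens(text):
    """Lowercase the text, split it on white space, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


print(normalize_tokens("The Money is in the Bank"))  # ['money', 'in', 'bank']
```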

What is a bag of words (after the normalization process)?

a technique that represents text data as a collection of words, ignoring word order and context, while focusing on the frequency of each word’s occurrence

What is document term matrix?

a table used in text analysis (e.g., in natural language processing) in which each row is a document, each column is a token, and each cell holds the number of times that token occurs in that document
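
As an illustration, a document term matrix can be built from word counts in a few lines of Python. This is only a sketch: the function name and the three sample documents are hypothetical, and tokens are produced by simple white space splitting.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per token."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    # A Counter returns 0 for tokens that a document does not contain
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["stock prices rose", "bond prices fell", "stock stock split"]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['bond', 'fell', 'prices', 'rose', 'split', 'stock']
for row in dtm:
    print(row)
# [0, 0, 1, 1, 0, 1]
# [1, 1, 1, 0, 0, 0]
# [0, 0, 0, 0, 1, 2]
```

Each row is the bag of words for one document: word order is lost and only the counts remain.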

The bag of words doesn’t tell us the sequence of words so what can we use instead?

We can use n-grams, which keep short sequences of adjacent words together as single tokens.

What are unigrams, bigrams, and trigrams (n-grams with n = 1, 2, and 3)?

n-gram: a sequence of n consecutive words combined to make one token

Unigram
Stock prices closed higher today > “stock” “prices” “closed” “higher” “today”

Bigram
Stock prices closed higher today > “stock_prices” “prices_closed” “closed_higher” “higher_today”

Trigram
Stock prices closed higher today > “stock_prices_closed” “prices_closed_higher” “closed_higher_today”
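
The examples above can be reproduced with a short Python sketch. The function name `ngrams` is hypothetical, and the text is assumed to be already cleansed, so splitting on white space is enough to tokenize it.

```python
def ngrams(text, n):
    """Join every run of n consecutive words into one underscore-separated token."""
    tokens = text.lower().split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Stock prices closed higher today"
print(ngrams(sentence, 1))  # ['stock', 'prices', 'closed', 'higher', 'today']
print(ngrams(sentence, 2))  # ['stock_prices', 'prices_closed', 'closed_higher', 'higher_today']
print(ngrams(sentence, 3))  # ['stock_prices_closed', 'prices_closed_higher', 'closed_higher_today']
```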

What are the 3 activities conducted as part of data exploration for structured data?

exploratory data analysis
feature selection
feature engineering.

What is exploratory data analysis (EDA) for structured data?

Exploratory data analysis (EDA) for structured data involves creating data visualizations and summary statistics to summarize and explore information

e.g., one-dimensional data can be summarized with the mean and median

e.g., two-dimensional data can be summarized with a correlation matrix

What is feature selection for structured data?

selecting only the most relevant features (independent variables) for use in the machine learning model.

What is feature engineering for structured data?

transforming existing features (independent variables), either by creating new features that are more descriptive and meaningful or by breaking existing features down.

What is exploratory data analysis for unstructured data (aka text analytics)?

identifies patterns in unstructured data

What are 3 common applications of text analytics?

Text classification — This uses supervised ML methods to place text into different classes.

Topic modeling — This creates clusters of text using unsupervised ML methods.

Sentiment analysis — This is similar to text classification in that the text is grouped (positive, neutral, or negative). Either supervised or unsupervised ML methods can be used in this analysis.

What is a collection of text called?

corpus (which is a sequence of tokens)

What is term frequency?

Term frequency = (number of times a token/word appears) / (total number of tokens/words in the data set)

What is topic modeling?

- removes words that are used too often and words that are not used enough: it eliminates the words with the highest term frequency (e.g., “is”, “the”) and the lowest term frequency (e.g., brand names or unique words not used often)

What are 3 techniques used to help with token/feature selection for text data?

- document frequency = (# of documents containing the token) / (total # of documents)
- chi-square: tokens that are strongly associated with one class get a higher chi-square statistic for that class (e.g., a word that is useful for identifying a running class scores higher for that class than a word that is useful for a baking class)
- mutual information: gives a scale based on how important a word is to a class (e.g., “running” will have a higher value for a running class but a lower value for a baking class; a word that can be used in both classes, such as “stand”, has low mutual information and is less useful)
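
Term frequency and document frequency, as defined in these cards, can be sketched in Python as below. The function names and the tokenized sample documents are hypothetical. Chi-square and mutual information are not shown because they also need class labels.

```python
def term_frequency(token, documents):
    """Occurrences of the token divided by the total number of tokens in the data set."""
    all_tokens = [t for doc in documents for t in doc]
    return all_tokens.count(token) / len(all_tokens)


def document_frequency(token, documents):
    """Number of documents containing the token divided by the number of documents."""
    return sum(token in doc for doc in documents) / len(documents)


# Hypothetical data set of three already-tokenized documents
docs = [
    ["stock", "prices", "rose"],
    ["bond", "prices", "fell"],
    ["stock", "stock", "split"],
]
print(term_frequency("stock", docs))      # 3 of 9 tokens
print(document_frequency("stock", docs))  # 2 of 3 documents
```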

What is feature engineering for unstructured data?

- creating structured data from the unstructured data

What are 4 techniques used for feature engineering?

- Numbers: values with different numbers of digits can be converted into different tokens (e.g., 846 > /number3/)
- N-grams: used to enhance the distinctiveness of certain words (e.g., the word “issue” is used in articles about any topic, but the bigram “bond_issue” most likely indicates that an article is about financial markets)
- Named entity recognition: classifies tokens based on context (e.g., classify “March” as a month unless it is a last name)
- Parts of speech (POS): programs tag tokens according to their classification (noun, verb, preposition, etc.)

What are the three main tasks involved in model training? MPT

The objective of ML model training is to discover patterns in a training dataset that can be generalized to out-of-sample data. The model should work on out-of-sample data so that it is not overfit.
1) Method selection: the appropriateness of an ML model for a given project depends on factors such as the type of learning (supervised or unsupervised), the type of data (numerical, text, etc.), and the size (length and width) of the dataset.
2) Performance evaluation: quantifying the effectiveness of the model
3) Tuning: improving the model

What is confusion matrix and what is the layout?

- organizes results into 4 boxes (true positives, false positives, false negatives, and true negatives)
- top left box = true positive
- top right box = false positive (Type I error)
- bottom left box = false negative (Type II error)
- bottom right box = true negative

What is precision and recall and formula for precision and recall in the confusion matrix?

- precision: share of predicted positives that are actually positive (true positives out of the total number of predicted positives)
P = TP / (TP + FP)
- recall: share of actual positives that are correctly predicted (true positives out of the left side of the confusion matrix: true positives and false negatives)
R = TP / (TP + FN)

What is accuracy and formula for the confusion matrix?

- total number of true positives and true negatives out of all positives and negatives
A = (TP + TN) / (TP + TN + FP + FN)

What is formula for f1 for confusion matrix?

F1 = (2 * P * R) / (P + R)
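
The precision, recall, accuracy, and F1 formulas from these cards can be checked with a short Python sketch. The function name and the counts in the example are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four confusion-matrix metrics from the raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
# precision: 0.8000
# recall: 0.6667
# accuracy: 0.7000
# f1: 0.7273
```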

Big Data Projects Flashcards

(34 cards)