Big Data Projects Flashcards

(45 cards)

1
Q

What is volume in big data

A

This refers to the quantity of data; big data requires a very large amount of information.

2
Q

What is variety in big data

A

This refers to the diversity of the data sources.
You need user-generated content, images, traditional data sources, etc.

3
Q

What is velocity of big data
What is another V that you need to consider

A

Velocity is the speed that data is created and collected.
Social media is an example of high-velocity data because it is generated very quickly.
You also have to consider validity. This refers to the reliability of data.

4
Q

What are the 5 steps involved in data analysis

A

1. Conceptualisation of the modeling task
2. Data collection
3. Data preparation and wrangling
4. Data exploration
5. Model training

5
Q

What is involved in detail with conceptualising the modeling task

A

This is when you define the problem and work out what you want to get out of the model.
You also identify who will use the model and how it will be incorporated into business processes.

6
Q

Explain the concept of data collection in detail
What is this step called when you are using unstructured data

A

You have to identify and gather data from internal and external sources.
This step is often referred to as curation when referencing unstructured data.
This is also the stage where web scraping, annotating target variables (labelling what you want the output to be), and other data analysis processes happen.

7
Q

Explain data preparation and wrangling

A

Data preparation and wrangling is when you clean and organise the raw data.
Cleansing is when you address missing values and identify invalid or out-of-range values.
Wrangling is when you transform data through extraction and aggregation to make sure it is suitable for the model. For unstructured text this means removing punctuation and similar elements.

8
Q

What are the three main things in data exploration

A

The three main things in data exploration are:
Exploratory data analysis
Feature selection
Feature engineering

9
Q

What is the meaning of exploratory data analysis

A

Exploratory data analysis uses statistics and visualisations to understand data properties and the relationships within the data.

10
Q

What is feature selection

A

Feature selection is when you choose only the most relevant attributes to reduce model complexity; this also helps you to reduce noise.

11
Q

What is feature engineering

A

Feature engineering is when you create new variables by transforming or combining existing ones.
This includes combining several variables into a single score which gives the probability or indicator that you want.

12
Q

What is model training in big data

A

Model training involves selecting the correct machine learning algorithm based on the task that you want to complete.
The researcher then evaluates the algorithm based on the training data set and tunes the algorithm’s parameters in order to increase the performance.

13
Q

What is the longest and most difficult part of a data analysis project

A

Usually the data wrangling element, where you have to synthesise lots of data into a usable form.

14
Q

What are the 3 main steps of data collection

A

Data identification - working out what data is worth collecting
Data collection - data gathered through internal databases, vendors, or APIs
Data documentation - using a README file to record where the data is stored; this is the starting point for identifying errors

15
Q

What is involved in the cleansing stage of the data wrangling process
What are the specific errors that you have to address with cleansing

A

Data cleansing is when you try to reduce the errors in the raw data set.
Missing values
Invalid values - which fall outside the logical range
Inaccurate values - simply wrong
Non-uniform values - resulting from inconsistent formatting
Duplicate observations - redundant copies of the same record

16
Q

What is data wrangling

A

When you process data through transformation and scaling, so that it is ready to use

17
Q

What are the main data wrangling/data transformation types

A

Extraction - when you create a new usable variable from the raw data, e.g. creating a years-of-employment metric from a start date
Aggregation - when you combine multiple related variables into a single variable using weightings
Filtration - when you remove observations that are irrelevant to the task
Selection - when you remove entire features that aren't needed
Conversion - when you change data types, such as converting nominal data to ordinal forms

18
Q

What are some of the common transformations used to handle outliers

A

Trimming - when you remove the outlier observations from the data set entirely
Winsorisation - when you take the outlier values and make them equal to a maximum or minimum allowed value
Harmonisation - when you take the outlier values and make them equal to the value at a certain percentile, e.g. the 95th
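As a sketch of the difference, here is a pure-Python comparison of trimming and winsorisation. The nearest-rank percentile and the 10th/90th cutoffs are illustrative assumptions, not fixed rules.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100) of a list of numbers."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def winsorise(values, lower_p=10, upper_p=90):
    """Clamp outliers to the lower/upper percentile values."""
    lo, hi = percentile(values, lower_p), percentile(values, upper_p)
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower_p=10, upper_p=90):
    """Drop observations outside the percentile bounds entirely."""
    lo, hi = percentile(values, lower_p), percentile(values, upper_p)
    return [v for v in values if lo <= v <= hi]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]  # 100 is an obvious outlier
print(winsorise(data))  # the 100 is clamped down to 5
print(trim(data))       # the 100 is removed entirely
```

Note the design difference: winsorisation keeps the sample size constant, trimming shrinks it.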

19
Q

What is the meaning of scaling as it relates to making data homogeneous

A

Normalisation rescales variables to fall between 0 and 1; it is sensitive to outliers.
Standardisation centres values at a mean of 0 and scales them in standard deviation units. It assumes a normal distribution but is less sensitive to outliers.
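The two scaling techniques can be sketched in a few lines of plain Python; the four-value sample is hypothetical.

```python
import math

def normalise(values):
    """Min-max normalisation: rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Z-score standardisation: centre at mean 0, scale by standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

data = [2.0, 4.0, 6.0, 8.0]
print(normalise(data))    # smallest value maps to 0.0, largest to 1.0
print(standardise(data))  # mean 0, unit standard deviation
```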

20
Q

What are the three main phases for the conversion of unstructured text based data into a structured format that a computer can ingest.

A

Preparation
Wrangling
Exploration

21
Q

Explain the cleansing process as it relates to text-based data.

A

Remove the HTML tags - you do this using regex, which identifies specific character patterns
Remove the punctuation - most punctuation is deleted; however, if specific symbols are needed they are replaced with annotations
Remove the numbers - digits are typically removed or replaced with generic annotations, unless their specific value is needed
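The three cleansing steps can be sketched with regular expressions. The sample sentence and the /percentSign/ and /number/ annotation tokens are illustrative assumptions.

```python
import re

raw = "<p>Revenue rose 12% in Q3&nbsp;2020!</p>"

text = re.sub(r"<[^>]+>", "", raw)           # strip HTML tags
text = text.replace("&nbsp;", " ")           # strip a common HTML entity
text = re.sub(r"%", " /percentSign/", text)  # keep a needed symbol as an annotation
text = re.sub(r"\d+", "/number/", text)      # replace digits with a generic annotation
text = re.sub(r"[!?.,:;]", "", text)         # delete remaining punctuation
print(text)  # "Revenue rose /number/ /percentSign/ in Q/number/ /number/"
```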

22
Q

What is the process of text wrangling in text based machine learning
What is stemming and what is lemmatisation
What is tokenisation

A

In text wrangling, the cleansed text is normalised into a standard format:
Lowercasing
Removing stop words - these include 'the' and 'is', which have little value to a model
Stemming is when you reduce a word to a common base: 'integrate' and 'integrating' become 'integrat'
Lemmatisation is a more advanced and resource-intensive process that transforms words into their lemma (morphological root)
Tokenisation is when you split text into its individual words, which are referred to as tokens
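The wrangling steps above can be sketched in plain Python. The stop-word list and the suffix-stripping "stemmer" are toy assumptions, not a real algorithm such as Porter stemming.

```python
STOP_WORDS = {"the", "is", "a", "and", "to"}  # toy list for illustration

def stem(token):
    """Naive suffix stripping to approximate a common base form."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def wrangle(sentence):
    tokens = sentence.lower().split()               # lowercasing + tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                # stemming

print(wrangle("The market is integrating quickly"))  # ['market', 'integrat', 'quickly']
```

As in the card, both 'integrate' and 'integrating' reduce to the same base 'integrat'.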

23
Q

What is the structuring stage when dealing with text-based machine learning
What are the structuring steps

A

Bag of words - this is when you put all the unique tokens in a list without regard for their original sequence
Document term matrix - the final step, where the unstructured data becomes a matrix. Each row represents a document and each column represents a token. The values within the matrix record how often each document contains each token.
If the sequence of the words matters, then the tokens can be grouped into multi-word phrases.
These are called bigrams if two words always go together (market_up).
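A document term matrix can be built from tokenised documents in a couple of lines; the two tiny documents are hypothetical.

```python
docs = [
    ["market", "up", "market"],
    ["market", "down"],
]

# Bag of words: all unique tokens, order disregarded
vocab = sorted({tok for doc in docs for tok in doc})

# Document term matrix: one row per document, one column per token,
# each cell counting how often that document contains that token
dtm = [[doc.count(tok) for tok in vocab] for doc in docs]

print(vocab)  # ['down', 'market', 'up']
print(dtm)    # [[0, 2, 1], [1, 1, 0]]
```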

24
Q

What is text exploration as it relates to machine learning using text.

A

Text exploration is when you evaluate the data set to see what the best model for training might be.
One method is visualisation, e.g. word clouds: tokens which appear more frequently are displayed in a larger font, which helps analysis.
Another is feature selection: to reduce model noise you can identify the highest-frequency and lowest-frequency words.
Another is document frequency: the number of documents containing a specific token divided by the total number of documents.

25
Q

What is the meaning of feature engineering

A

Feature engineering assigns tags to tokens based on their entity type.
26
Q

What is parts of speech

A

Parts of speech uses dictionaries to assign structural tags. This means you can give each word a grammatical label, which then allows a machine to construct sentences because it knows what element of the sentence each part relates to.
27
Q

What is data exploration for unstructured data

A

Unstructured text means you have to initially process the data in order to identify contextually important elements. This involves:
Tokenisation
Summary statistics
Visualisation
28
Q

What is the meaning of feature selection when it comes to unstructured data
What does chi-square do
What is mutual information

A

Feature selection aims to create a parsimonious model by selecting a subset of tokens from the bag of words (BOW). This reduces feature-induced noise and improves predictive accuracy. You remove words like 'the'.
Chi-square ranks the tokens by their usefulness in distinguishing between classes. Tokens with a higher chi-square statistic are highly associated with particular classes and very good for analysis.
Mutual information measures how much a token contributes to specific classes. If a token appears in all classes its mutual information index is 0; if it is included in only a few, the index gets closer to 1.
29
Q

What is an N-gram in feature engineering

A

This is a multi-word pattern in which, where useful, the word order is preserved.
30
Q

What is named entity recognition as it relates to feature engineering

A

Named entity recognition searches for key token values, in the context in which they are used, against an internal library and assigns a NER tag to each token. This lets you know what category the token should go in.
31
Q

What is parts of speech in feature engineering

A

Parts of speech uses language structure dictionaries to assign tags to text. This lets you know what grammatical meaning each word has.
32
Q

What can cause model fitting errors

A

The number of features
The size of the data sample
33
Q

What is model method selection based on

A

Supervised or unsupervised learning
Type of data - for numerical data you might use a classification and regression tree (CART) method; for text you might use a GLM
Size of data - large data sets can be handled with support vector machines
34
Q

What can model fitting errors be caused by

A

Model fitting errors can be caused by the size of the sample or the number of features.
35
Q

How do you evaluate the performance of a machine learning algorithm for classification models
What are the 4 metrics you can use to quantify performance

A

For classification tasks, the main tool is the confusion matrix. This classifies results into 4 outcomes:
True positive
True negative
False positive
False negative
The 4 metrics are precision, recall, accuracy, and the F1 score.
36
Q

What is a type 1 error
What is a type 2 error

A

A type 1 error is incorrectly predicting the positive class (a false positive).
A type 2 error is when you fail to identify a positive instance (a false negative).
37
Q

What is the definition of precision and how is it calculated
When would you want a model that is highly precise

A

Precision measures how accurate the positive predictions are: the number of true positives as a proportion of all the positive outcomes the model predicted.
Precision = TP/(TP + FP)
You prioritise precision when the cost of a false positive is very high.
38
Q

What is the definition of recall and when are you likely to prioritise strong recall

A

Recall is the model's ability to find all positive instances: the proportion of all the actual positives that the model was able to pick up.
Recall = TP/(TP + FN)
High recall is prioritised when the cost of a type 2 error (failing to identify something) is high.
39
Q

What is the definition of accuracy and what is the calculation for it

A

Accuracy is the number of true positives and true negatives the model identified as a percentage of all the predictions it made.
Accuracy = (TP + TN)/(TP + TN + FP + FN)
40
Q

What is the definition of the F1 score and what is it used for

A

The F1 score is the harmonic mean of the precision and recall metrics:
F1 = (2 × P × R)/(P + R)
41
Q

What is the receiver operating characteristic and what is it used to measure
What is the calculation for the false positive rate
What does the area under the curve represent

A

The ROC curve plots the trade-off between the true positive rate (recall) and the false positive rate.
False positive rate = FP/(FP + TN)
The area under the curve (AUC) is a single metric between 0 and 1 that measures predictive accuracy. An AUC of 0.5 means the model is just making random guesses; the more convex the curve, the better the model fit and the greater its predictive accuracy.
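The four confusion-matrix metrics and the false positive rate can be computed directly from the four outcome counts; the counts below are hypothetical.

```python
# Hypothetical counts from a confusion matrix
TP, FP, TN, FN = 60, 10, 20, 10

precision = TP / (TP + FP)                   # accuracy of positive predictions
recall    = TP / (TP + FN)                   # share of actual positives found
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # share of all predictions correct
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
fpr       = FP / (FP + TN)                   # false positive rate (ROC x-axis)

print(precision, recall, accuracy, f1, fpr)
```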
42
Q

What is the performance analysis for continuous data

A

For continuous data there are three main metrics:
Root mean squared error
Mean absolute error
Mean absolute percentage error
43
Q

What is the meaning of root mean squared error

A

Root mean squared error is a metric which summarises the average prediction error in a sample.
RMSE = √(Σ(predicted − actual)²/n)
44
Q

What is the mean absolute error and the mean absolute percentage error

A

The mean absolute error is just what it says on the tin: you take the absolute differences between predicted and actual values and average them.
The mean absolute percentage error is the same, but using the absolute errors expressed as percentages of the actual values.
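All three continuous-data error metrics can be sketched with a hypothetical three-point sample:

```python
import math

actual    = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

rmse = math.sqrt(sum(e ** 2 for e in errors) / n)              # root mean squared error
mae  = sum(abs(e) for e in errors) / n                         # mean absolute error
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n * 100  # percentage

print(rmse, mae, mape)
```

RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE does.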
45
Q

What is the fitting curve

A

The fitting curve plots prediction error against model complexity: the model is revised until it reaches an acceptable level of performance, but if you make it too complex you introduce variance error.
The error curve is U-shaped, and you want to reach the lowest point on the curve, which gives the optimal-complexity model.
The two tails of the curve are where you have bias error (too simple) and where you have variance error (too complex).