Big Data Projects Flashcards

(45 cards)

1
Q

What is volume in big data

A

This refers to the quantity of data; big data requires a very large amount of information.

2
Q

What is variety in big data

A

This refers to the diversity of the data sources.
You need user-generated content, images, traditional data sources, etc.

3
Q

What is velocity of big data
What is another V that you need to consider

A

Velocity is the speed that data is created and collected.
Social media is an example of high-velocity data because it is generated very quickly.
You also have to consider validity. This refers to the reliability of data.

4
Q

What are the 5 steps involved in data analysis

A

1. Conceptualisation of the modeling task
2. Data collection
3. Data preparation and wrangling
4. Data exploration
5. Model training

5
Q

What is involved in detail with conceptualising the modeling task

A

This is when you define the problem and work out what you want to get out of the model.
You also identify who will use the model and how it will be incorporated into business processes.

6
Q

Explain the concept of data collection in detail
What is this step called when you are using unstructured data

A

You have to identify and gather data from internal and external sources.
This step is often referred to as curation when referencing unstructured data.
This is also the stage where web scraping, annotating target variables (labelling what you want the output to be), and other data analysis processes happen.

7
Q

Explain data preparation and wrangling

A

Data preparation and wrangling is when you clean and organise the raw data.
Cleansing is when you address missing values and identify invalid or out-of-range values.
Wrangling is when you transform data through extraction and aggregation to make sure it is suitable for the model. For unstructured text this means removing punctuation and similar elements.

8
Q

What are the three main things in data exploration

A

The three main things in data exploration are:
Exploratory data analysis
Feature selection
Feature engineering

9
Q

What is the meaning of exploratory data analysis

A

Exploratory data analysis uses statistics and visualisations to understand data properties and the relationships within the data.

10
Q

What is feature selection

A

Feature selection is when you choose only the most relevant attributes to reduce model complexity; this also helps you to reduce noise.

11
Q

What is feature engineering

A

Feature engineering is when you create new variables by transforming or combining existing ones.
This includes combining several variables into a single score which gives the probability or indicator that you want.

12
Q

What is model training in big data

A

Model training involves selecting the correct machine learning algorithm based on the task that you want to complete.
The researcher then evaluates the algorithm based on the training data set and tunes the algorithm’s parameters in order to increase the performance.

13
Q

What is the longest and most difficult part of a data analysis project

A

Usually the data wrangling element, where you have to synthesise lots of data into a usable form.

14
Q

What are the 3 main steps of data collection

A

Data identification - working out what data is worth collecting
Data collection - data gathered through internal databases, vendors, or APIs
Data documentation - using a README file to record where the data is stored; this is the starting point for identifying errors

15
Q

What is involved in the cleansing stage of the data wrangling process
What are the specific errors that you have to address with cleansing

A

Data cleansing is when you try to reduce the errors in the raw data set.
Missing values
Invalid values - which fall outside the logical range
Inaccurate values - simply wrong
Non-uniform values - resulting from inconsistent formatting
Duplicate observations - redundant copies of the same record

16
Q

What is data wrangling

A

When you process data through transformation and scaling, so that it is ready to use

17
Q

What are the main data wrangling/data transformation types

A

Extraction - when you create a new usable variable from the raw data, e.g. creating a years-of-employment metric from a start date
Aggregation - when you combine multiple related variables into a single variable using weightings
Filtration - when you remove observations that are irrelevant to the task
Selection - when you remove entire features that aren't needed
Conversion - when you change data types, such as converting nominal data to ordinal forms

18
Q

What are some of the common transformations used to handle outliers

A

Trimming - when you remove the outlier observations from the data set entirely
Winsorisation - when you take the outlier values and make them equal to a maximum or minimum allowed value
Harmonisation - when you take the outlier values and make them equal to the value at a certain percentile, e.g. the 95th
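As a sketch of the difference, here is a pure-Python comparison of trimming and winsorisation. The nearest-rank percentile and the 10th/90th cutoffs are illustrative assumptions, not fixed rules.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100) of a list of numbers."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def winsorise(values, lower_p=10, upper_p=90):
    """Clamp outliers to the lower/upper percentile values."""
    lo, hi = percentile(values, lower_p), percentile(values, upper_p)
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower_p=10, upper_p=90):
    """Drop observations outside the percentile bounds entirely."""
    lo, hi = percentile(values, lower_p), percentile(values, upper_p)
    return [v for v in values if lo <= v <= hi]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]  # 100 is an obvious outlier
print(winsorise(data))  # the 100 is clamped down to 5
print(trim(data))       # the 100 is removed entirely
```

Note the design difference: winsorisation keeps the sample size constant, trimming shrinks it.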

19
Q

What is the meaning of scaling as it relates to making data homogeneous

A

Normalisation rescales variables to fall between 0 and 1; it is sensitive to outliers.
Standardisation centres values at a mean of 0 and scales them in standard deviation units. It assumes a normal distribution but is less sensitive to outliers.
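The two scaling techniques can be sketched in a few lines of plain Python; the four-value sample is hypothetical.

```python
import math

def normalise(values):
    """Min-max normalisation: rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Z-score standardisation: centre at mean 0, scale by standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

data = [2.0, 4.0, 6.0, 8.0]
print(normalise(data))    # smallest value maps to 0.0, largest to 1.0
print(standardise(data))  # mean 0, unit standard deviation
```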

20
Q

What are the three main phases for the conversion of unstructured text based data into a structured format that a computer can ingest.

A

Preparation
Wrangling
Exploration

21
Q

Explain the cleansing process as it relates to text-based data.

A

Remove the HTML tags - you do this using regex, which identifies specific character patterns
Remove the punctuation - most punctuation is deleted; however, if specific symbols are needed they are replaced with annotations
Remove the numbers - digits are typically removed or replaced with generic annotations, unless their specific value is needed
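The three cleansing steps can be sketched with regular expressions. The sample sentence and the /percentSign/ and /number/ annotation tokens are illustrative assumptions.

```python
import re

raw = "<p>Revenue rose 12% in Q3&nbsp;2020!</p>"

text = re.sub(r"<[^>]+>", "", raw)           # strip HTML tags
text = text.replace("&nbsp;", " ")           # strip a common HTML entity
text = re.sub(r"%", " /percentSign/", text)  # keep a needed symbol as an annotation
text = re.sub(r"\d+", "/number/", text)      # replace digits with a generic annotation
text = re.sub(r"[!?.,:;]", "", text)         # delete remaining punctuation
print(text)  # "Revenue rose /number/ /percentSign/ in Q/number/ /number/"
```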

22
Q

What is the process of text wrangling in text based machine learning
What is stemming and what is lemmatisation
What is tokenisation

A

In text wrangling, the cleansed text is normalised into a standard format:
Lowercasing
Removing stop words - these include 'the' and 'is', which have little value to a model
Stemming is when you reduce a word to a common base: 'integrate' and 'integrating' become 'integrat'
Lemmatisation is a more advanced and resource-intensive process that transforms words into their lemma (morphological root)
Tokenisation is when you split text into its individual words, which are referred to as tokens
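The wrangling steps above can be sketched in plain Python. The stop-word list and the suffix-stripping "stemmer" are toy assumptions, not a real algorithm such as Porter stemming.

```python
STOP_WORDS = {"the", "is", "a", "and", "to"}  # toy list for illustration

def stem(token):
    """Naive suffix stripping to approximate a common base form."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def wrangle(sentence):
    tokens = sentence.lower().split()               # lowercasing + tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                # stemming

print(wrangle("The market is integrating quickly"))  # ['market', 'integrat', 'quickly']
```

As in the card, both 'integrate' and 'integrating' reduce to the same base 'integrat'.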

23
Q

What is the structuring stage when dealing with text-based machine learning
What are the structuring steps

A

Bag of words - this is when you put all the unique tokens in a list without regard for their original sequence
Document term matrix - the final step, where the unstructured data becomes a matrix. Each row represents a document and each column represents a token. The values within the matrix record how often each document contains each token.
If the sequence of the words matters, then the tokens can be grouped into multi-word phrases.
These are called bigrams if two words always go together (market_up).
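A document term matrix can be built from tokenised documents in a couple of lines; the two tiny documents are hypothetical.

```python
docs = [
    ["market", "up", "market"],
    ["market", "down"],
]

# Bag of words: all unique tokens, order disregarded
vocab = sorted({tok for doc in docs for tok in doc})

# Document term matrix: one row per document, one column per token,
# each cell counting how often that document contains that token
dtm = [[doc.count(tok) for tok in vocab] for doc in docs]

print(vocab)  # ['down', 'market', 'up']
print(dtm)    # [[0, 2, 1], [1, 1, 0]]
```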

24
Q

What is text exploration as it relates to machine learning using text.

A

Text exploration is when you evaluate the data set to see what the best model for training might be.
One method is visualisation, e.g. word clouds: tokens which appear more frequently are displayed in a larger font, which helps analysis.
Another is feature selection: to reduce model noise you can identify the highest-frequency and lowest-frequency words.
Another is document frequency: the number of documents containing a specific token divided by the total number of documents.

25
Q

What is the meaning of feature engineering

A

Feature engineering assigns tags to tokens based on their entity type.
26
Q

What is parts of speech

A

Parts of speech uses dictionaries to assign structural tags. This means you can give each word a grammatical label, which then allows a machine to construct sentences because it knows what element of the sentence each part relates to.
27
Q

What is data exploration for unstructured data

A

Unstructured text means you have to initially process the data in order to identify contextually important elements. This involves:
Tokenisation
Summary statistics
Visualisation
28
Q

What is the meaning of feature selection when it comes to unstructured data
What does chi-square do
What is mutual information

A

Feature selection aims to create a parsimonious model by selecting a subset of tokens from the bag of words (BOW). This reduces feature-induced noise and improves predictive accuracy. You remove words like 'the'.
Chi-square ranks the tokens by their usefulness in distinguishing between classes. Tokens with a higher chi-square statistic are highly associated with particular classes and very good for analysis.
Mutual information measures how much a token contributes to specific classes. If a token appears in all classes its mutual information index is 0; if it is included in only a few, the index gets closer to 1.
29
Q

What is an N-gram in feature engineering

A

This is a multi-word pattern in which, where useful, the word order is preserved.
30
Q

What is named entity recognition as it relates to feature engineering

A

Named entity recognition searches for key token values, in the context in which they are used, against an internal library and assigns a NER tag to each token. This lets you know what category the token should go in.
31
Q

What is parts of speech in feature engineering

A

Parts of speech uses language structure dictionaries to assign tags to text. This lets you know what grammatical meaning each word has.
32
Q

What can cause model fitting errors

A

The number of features
The size of the data sample
33
Q

What is model method selection based on

A

Supervised or unsupervised learning
Type of data - for numerical data you might use a classification and regression tree (CART) method; for text you might use a GLM
Size of data - large data sets can be handled with support vector machines
34
Q

What can model fitting errors be caused by

A

Model fitting errors can be caused by the size of the sample or the number of features.
35
Q

How do you evaluate the performance of a machine learning algorithm for classification models
What are the 4 metrics you can use to quantify performance

A

For classification tasks, the main tool is the confusion matrix. This classifies results into 4 outcomes:
True positive
True negative
False positive
False negative
The 4 metrics are precision, recall, accuracy, and the F1 score.
36
Q

What is a type 1 error
What is a type 2 error

A

A type 1 error is incorrectly predicting the positive class (a false positive).
A type 2 error is when you fail to identify a positive instance (a false negative).
37
Q

What is the definition of precision and how is it calculated
When would you want a model that is highly precise

A

Precision measures how accurate the positive predictions are: the number of true positives as a proportion of all the positive outcomes the model predicted.
Precision = TP/(TP + FP)
You prioritise precision when the cost of a false positive is very high.
38
Q

What is the definition of recall and when are you likely to prioritise strong recall

A

Recall is the model's ability to find all positive instances: the proportion of all the actual positives that the model was able to pick up.
Recall = TP/(TP + FN)
High recall is prioritised when the cost of a type 2 error (failing to identify something) is high.
39
Q

What is the definition of accuracy and what is the calculation for it

A

Accuracy is the number of true positives and true negatives the model identified as a percentage of all the predictions it made.
Accuracy = (TP + TN)/(TP + TN + FP + FN)
40
Q

What is the definition of the F1 score and what is it used for

A

The F1 score is the harmonic mean of the precision and recall metrics:
F1 = (2 × P × R)/(P + R)
41
Q

What is the receiver operating characteristic and what is it used to measure
What is the calculation for the false positive rate
What does the area under the curve represent

A

The ROC curve plots the trade-off between the true positive rate (recall) and the false positive rate.
False positive rate = FP/(FP + TN)
The area under the curve (AUC) is a single metric between 0 and 1 that measures predictive accuracy. An AUC of 0.5 means the model is just making random guesses; the more convex the curve, the better the model fit and the greater its predictive accuracy.
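The four confusion-matrix metrics and the false positive rate can be computed directly from the four outcome counts; the counts below are hypothetical.

```python
# Hypothetical counts from a confusion matrix
TP, FP, TN, FN = 60, 10, 20, 10

precision = TP / (TP + FP)                   # accuracy of positive predictions
recall    = TP / (TP + FN)                   # share of actual positives found
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # share of all predictions correct
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
fpr       = FP / (FP + TN)                   # false positive rate (ROC x-axis)

print(precision, recall, accuracy, f1, fpr)
```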
42
Q

What is the performance analysis for continuous data

A

For continuous data there are three main metrics:
Root mean squared error
Mean absolute error
Mean absolute percentage error
43
Q

What is the meaning of root mean squared error

A

Root mean squared error is a metric which summarises the average prediction error in a sample.
RMSE = √(Σ(predicted − actual)²/n)
44
Q

What is the mean absolute error and the mean absolute percentage error

A

The mean absolute error is just what it says on the tin: you take the absolute differences between predicted and actual values and average them.
The mean absolute percentage error is the same, but using the absolute errors expressed as percentages of the actual values.
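All three continuous-data error metrics can be sketched with a hypothetical three-point sample:

```python
import math

actual    = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

rmse = math.sqrt(sum(e ** 2 for e in errors) / n)              # root mean squared error
mae  = sum(abs(e) for e in errors) / n                         # mean absolute error
mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n * 100  # percentage

print(rmse, mae, mape)
```

RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE does.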
45
Q

What is the fitting curve

A

The fitting curve plots prediction error against model complexity: the model is revised until it reaches an acceptable level of performance, but if you make it too complex you introduce variance error.
The error curve is U-shaped, and you want to reach the lowest point on the curve, which gives the optimal-complexity model.
The two tails of the curve are where you have bias error (too simple) and where you have variance error (too complex).