What is volume in big data
Volume refers to the quantity of data; big data requires very large amounts of information
What is variety in big data
Variety refers to the diversity of the data sources
You need user-generated content, images, traditional data sources, etc.
What is velocity of big data
What is another V that you need to consider
Velocity is the speed at which data is created and collected.
Social media is an example of high velocity, as its data is generated very quickly.
You also have to consider validity. This refers to the reliability of data.
What are the 5 steps involved in data analysis
1. Conceptualisation of the modeling task
2. Data collection
3. Data preparation and wrangling
4. Data exploration
5. Model training
What is involved in conceptualising the modeling task
This is when you define the problem and work out what you want to get out of the model
You then identify who will use the model and how it will be incorporated into business processes
Explain the concept of data collection in detail
What is this step called when you are using unstructured data
You have to identify and gather data from internal and external sources
This step is often referred to as curation when referencing unstructured data.
This step is also the stage where web scraping, annotating target variables (to define the desired outputs), and other data-gathering processes happen.
Explain data preparation and wrangling
Data preparation and wrangling is when you clean and organise the raw data
Cleansing is when you address missing values and identify invalid or out-of-range values.
Wrangling is when you transform data through extraction and aggregation; this is done to make sure that it is suitable for the model. For unstructured text this means removing punctuation and similar elements
What are the three main things in data exploration
The three main things in data exploration are
Exploratory data analysis
Feature selection
Feature engineering
What is the meaning of exploratory data analysis
Exploratory data analysis uses statistics and visualisations to understand data properties and the relationships within the data
What is feature selection
Feature selection is when you choose only the most relevant attributes to reduce model complexity; this also helps you to reduce noise
What is feature engineering
Feature engineering is when you create new variables by transforming or combining existing ones
This includes combining several variables into a single score which gives the probability or indicator that you want.
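As a sketch, a combined score like this could be engineered in Python. The variable names, weights, and cap below are entirely hypothetical:

```python
# Toy feature-engineering example: combine several raw variables into a
# single risk score. All names and weights here are invented.
def engineer_credit_score(income, debt, years_employed):
    """Combine raw attributes into one default-risk indicator in [0, 1]."""
    debt_to_income = debt / income if income > 0 else 1.0
    stability = min(years_employed / 10, 1.0)  # cap tenure effect at 10 years
    # Weighted combination: a higher debt load raises the score,
    # employment stability lowers it. Weights are illustrative only.
    score = 0.7 * debt_to_income + 0.3 * (1 - stability)
    return round(min(score, 1.0), 3)

print(engineer_credit_score(income=60000, debt=15000, years_employed=5))  # 0.325
```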
What is model training in big data
Model training involves selecting the correct machine learning algorithm based on the task that you want to complete.
The researcher then evaluates the algorithm on the training data set and tunes the algorithm's parameters in order to increase performance.
What is the longest and most difficult part of a data analysis project
Usually the data wrangling element, where you have to synthesise lots of data into a usable form
What are the 3 main steps of data collection
Data identification - working out what data is worth collecting
Data collection - data gathered through internal databases, vendors, or APIs
Data documentation - when you use a README file to record where the data is stored and the starting point for identifying errors
What is involved in the cleansing stage of data preparation and wrangling
What are the specific errors that you have to address with cleansing
Data cleansing is when you try to reduce the errors in the raw data set.
Missing values
Invalid values - values that fall outside the logical range
Inaccurate values - values that are simply wrong
Non-uniform values - resulting from inconsistent formatting
Duplicate observations - redundant copies of the same record
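A minimal Python sketch of how these error checks might look, using a made-up `age` field with a logical range of 0-120:

```python
# Minimal cleansing sketch: flag missing, invalid (out-of-range), and
# duplicate observations in a list of records. Field names are made up.
def cleanse(records, field, low, high):
    seen, clean, issues = set(), [], []
    for rec in records:
        value = rec.get(field)
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if value is None:
            issues.append(("missing", rec))
        elif not (low <= value <= high):
            issues.append(("invalid", rec))
        elif key in seen:
            issues.append(("duplicate", rec))
        else:
            seen.add(key)
            clean.append(rec)
    return clean, issues

rows = [{"age": 34}, {"age": None}, {"age": 210}, {"age": 34}]
clean, issues = cleanse(rows, "age", low=0, high=120)
print(len(clean), [kind for kind, _ in issues])  # 1 ['missing', 'invalid', 'duplicate']
```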
What is data wrangling
When you process data through transformation and scaling so that it is ready to use
What are the main data wrangling/data transformation types
Extraction - when you create a new usable variable from the raw data, e.g. creating a years-of-employment metric from a starting date
Aggregation - when you combine multiple related variables into a single variable using weightings
Filtration - when you remove observations (rows) that are irrelevant to the task
Selection - when you remove entire features (columns) that aren't needed
Conversion - when you change data types, such as converting nominal data to ordinal form
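The five transformation types can be sketched on a toy dataset; every field name, weight, and threshold below is invented for illustration:

```python
from datetime import date

# Toy dataset; all fields and weights are hypothetical.
employees = [
    {"name": "A", "start": date(2015, 1, 1), "perf": 4, "tenure_bonus": 2, "dept": "IT"},
    {"name": "B", "start": date(2021, 6, 1), "perf": 3, "tenure_bonus": 1, "dept": "HR"},
]

as_of = date(2024, 1, 1)
for e in employees:
    # Extraction: derive a new variable (years employed) from the raw start date.
    e["years_employed"] = (as_of - e["start"]).days // 365
    # Aggregation: collapse related variables into one weighted score.
    e["rating"] = round(0.8 * e["perf"] + 0.2 * e["tenure_bonus"], 2)
    # Conversion: map the nominal department label to an ordinal code.
    e["dept_code"] = {"HR": 0, "IT": 1}[e["dept"]]

# Filtration: remove observations irrelevant to the task (tenure under 3 years).
kept = [e for e in employees if e["years_employed"] >= 3]
# Selection: drop entire features that are no longer needed.
kept = [{k: e[k] for k in ("name", "rating", "dept_code")} for e in kept]
print(kept)  # [{'name': 'A', 'rating': 3.6, 'dept_code': 1}]
```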
What are some of the common transformations that are used to deal with handling outliers
Trimming - when you remove the outlier observations from the data set entirely
Winsorisation - when you take the outlier values and make them equal to a maximum or minimum allowed value
Harmonisation - when you take the outlier values and make them equal to the value at a certain percentile, e.g. the 95th percentile
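Trimming and winsorisation can be sketched in a few lines of Python, here with hypothetical bounds of 0 and 10:

```python
# Sketch of trimming vs. winsorisation on a small sample with one outlier.
def trim(values, low, high):
    """Trimming: drop observations outside [low, high] entirely."""
    return [v for v in values if low <= v <= high]

def winsorise(values, low, high):
    """Winsorisation: clamp each value into [low, high] instead of dropping it."""
    return [min(max(v, low), high) for v in values]

data = [1, 2, 3, 4, 100]
print(trim(data, 0, 10))       # [1, 2, 3, 4] - the outlier is removed
print(winsorise(data, 0, 10))  # [1, 2, 3, 4, 10] - the outlier is capped
```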
What is the meaning of scaling as it relates to making data homogeneous
Normalisation rescales variables to fall between 0 and 1. It is sensitive to outliers.
Standardisation centres values at a mean of 0 and scales them in standard deviation units. It assumes a normal distribution but is less sensitive to outliers.
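Both scaling methods can be sketched with the standard library (using the population standard deviation for standardisation):

```python
# Min-max normalisation vs. standardisation (z-scores), stdlib only.
from statistics import mean, pstdev

def normalise(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardise(xs):
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

data = [2, 4, 6, 8]
print(normalise(data))                           # values rescaled into [0, 1]
print([round(z, 2) for z in standardise(data)])  # centred at mean 0, in sd units
```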
What are the three main phases for converting unstructured text-based data into a structured format that a computer can ingest
Preparation
Wrangling
Exploration
Explain the cleansing process as it relates to text-based data
Remove the HTML tags. You do this using regex, which identifies specific character patterns.
Remove the punctuation. Most of the punctuation is deleted; however, if specific symbols are needed then they are replaced with annotations.
Remove the numbers. Digits are typically removed or replaced with generic annotations, unless their specific value is needed.
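A sketch of these three cleansing steps using Python's `re` module; the sample sentence and the annotation names are illustrative:

```python
# Sketch of the three text-cleansing steps using regular expressions.
import re

raw = "<p>Revenue rose 5% to $1,200 in Q3!</p>"

# 1. Strip HTML tags (regex matches the tag character pattern).
text = re.sub(r"<[^>]+>", "", raw)
# 2. Replace symbols whose meaning matters with annotations,
#    then delete the remaining punctuation.
text = re.sub(r"%", " /percentSign/ ", text)
text = re.sub(r"\$", " /dollarSign/ ", text)
text = re.sub(r"[!,.]", "", text)
# 3. Replace digits with a generic annotation.
text = re.sub(r"\d+", "/number/", text)

cleaned = " ".join(text.split())
print(cleaned)  # Revenue rose /number/ /percentSign/ to /dollarSign/ /number/ in Q/number/
```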
What is the process of text wrangling in text-based machine learning
What is stemming and what is lemmatisation
What is tokenisation
Text wrangling is when the text is normalised into a standard format:
Lowercasing all the text
Removing stop words, such as 'the' and 'is', which have little value to a model
Stemming is when you strip a word down to a common base. 'Integrate' and 'integrating' both become 'integrat'.
Lemmatisation is a more advanced and resource-intensive process that transforms words into their lemma (morphological root)
Tokenisation is when you split text into its individual words, which are referred to as tokens
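A toy version of this pipeline in Python. The stop-word list is abbreviated and the suffix-stripping "stemmer" is a crude stand-in for a real one such as NLTK's Porter stemmer:

```python
# Toy normalisation pipeline: lowercase, tokenise, remove stop words, stem.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of"}

def stem(token):
    """Crude stemmer: strip a common suffix if a reasonable base remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def wrangle(sentence):
    tokens = sentence.lower().split()                    # tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

print(wrangle("The analyst is integrating the quarterly reports"))
# ['analyst', 'integrat', 'quarterly', 'report']
```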
What is the structuring stage of dealing with text-based machine learning
What are the structuring steps
Bag of words - this is when you put all the unique tokens into a list without regard for their original sequence
Document term matrix - the final step, where the unstructured data becomes a matrix. Each row represents a document and each column represents a token. The values within the matrix show how often each document contains the token identified by that column.
If the sequence of the words matters then the tokens can be grouped into multi-word phrases.
These are called bigrams when two words go together, e.g. market_up
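Building a bag of words and a document-term matrix from already-tokenised documents can be sketched as:

```python
# Sketch: bag of words and document-term matrix from tokenised documents.
docs = [
    ["market", "up", "rates", "down"],
    ["market", "down", "market", "volatile"],
]

# Bag of words: all unique tokens, order disregarded.
bag = sorted({token for doc in docs for token in doc})

# Document-term matrix: one row per document, one column per token;
# each cell counts how often that document contains that token.
dtm = [[doc.count(token) for token in bag] for doc in docs]

print(bag)   # ['down', 'market', 'rates', 'up', 'volatile']
print(dtm)   # [[1, 1, 1, 1, 0], [1, 2, 0, 0, 1]]
```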
What is text exploration as it relates to machine learning using text
Text exploration is when you evaluate the data set using various methods to see what the best model for training might be
One method is visualisation, e.g. word clouds, where tokens that appear more frequently are displayed in a larger font, which aids analysis.
Another is feature selection: to reduce model noise you can remove very high-frequency and very low-frequency words.
Another is document frequency: the number of documents containing a specific token divided by the total number of documents.
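Document frequency can be computed directly from tokenised documents; the sample tokens are made up:

```python
# Sketch: document frequency per token = share of documents containing it.
docs = [
    ["rates", "up"],
    ["rates", "down"],
    ["earnings", "up"],
]

tokens = {t for doc in docs for t in doc}
doc_freq = {
    t: sum(1 for doc in docs if t in doc) / len(docs) for t in sorted(tokens)
}
print(doc_freq)  # e.g. 'rates' appears in 2 of 3 documents
```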