Advanced Data Analytics, Coursera Flashcards

(149 cards)

1
Q

Data Professional

A

A term used to describe any individual who works with data and/or has data skills.

2
Q

Machine Learning

A

An alternative approach to automation, expressing the way you want a task done by using data instead of explicit instructions.

AKA: The use and development of algorithms and statistical models to teach computer systems to analyze patterns in data.

3
Q

Data Science vs Data Analytics

A

Data science is an entire field dedicated to making data more useful. A data scientist is a professional who uses raw data to develop new ways to model data and understand the unknown. Often, their job responsibilities incorporate various components of computer science, predictive analytics, statistics, and machine learning. The collections of information that data scientists work with can be quite large, requiring expertise to organize and navigate.

Data analytics is a subfield of the larger data science discipline. The aim of data analytics is to create methods to capture, process, and organize data to uncover actionable insights for current problems. Analysts focus on processing the information stored in existing datasets and establishing the best way to present this data. Data analysts rely on statistics and data modeling to solve problems and offer recommendations that can lead to immediate improvements.

4
Q

R programming

A

Used by researchers and academics
Can create complex statistical models

5
Q

Jupyter Notebooks

A

An open-source web application used to create and share documents that contain live code, equations, visualizations, and narrative text

Allows you to run code in real time and helps identify errors easily.

6
Q

Data stewardship

A

The practices of an organization that ensure that data is accessible, usable, and safe

7
Q

Edge computing

A

A way of distributing computational tasks over a bunch of nearby processors (i.e., computers) that is good for speed and resiliency and does not depend on a single source of computational power

8
Q

Machine learning:

A

The use and development of algorithms and statistical models to teach computer systems to analyze patterns in data

9
Q

Metrics

A

Methods and criteria used to evaluate data

10
Q

Python

A

A general-purpose programming language

11
Q

Technical Data Professionals

A

Machine Learning Engineers & Statisticians:
-Expertise in mathematics, statistics, and computing.
-Build models and make predictions.

Advanced Data Analyst:
-Explore datasets to identify directions worth pursuing.

12
Q

Strategic Data Professionals

A

Business Intelligence (BI) Professionals
Technical Project Managers

-Interpret information for an organization’s operations, finance, research, and development
-Work aligns with business strategy

Seek solutions to problems through data analytics

13
Q

Open Data

A

Data that is available to the public and free to use, with guidance on how to navigate the datasets and acknowledge the source.

14
Q

Personally Identifiable Information (PII)

A

Information that permits the identity of an individual to be inferred by either direct or indirect means.

Examples: Biometric records, usernames, social security, or national identification numbers. (Information that’s often associated with medical, financial, and employment records.)

15
Q

Aggregate Information

A

Data from a significant number of users that has eliminated personal information.

16
Q

Sample

A

A segment of the population that is representative of the entire population.

17
Q

Data Anonymization

A

The process of protecting people’s private or sensitive data by eliminating PII.

18
Q

Data Aggregation

A

Process of collecting and combining details from a significant number of users in terms of totals or summary.

19
Q

Data that is often anonymized:

A

telephone numbers, names, license plates and license numbers, social security numbers, IP addresses, medical records, email addresses, photographs, & account numbers.

20
Q

General Data Protection Regulation (GDPR)

A

European Union Law.
The GDPR is described on their website as the toughest privacy and security law in the world. It imposes obligations onto organizations anywhere, so long as they target or collect data related to people in the European Union.

21
Q

Lei Geral de Proteção de Dados Pessoais (LGPD)

A

Brazil’s Law for the protection of personal data

The LGPD is a data protection law that governs how companies collect, use, disclose, and process personal data belonging to people in Brazil. LGPD applies to companies that process data about individuals in Brazil.

22
Q

California Consumer Privacy Act (CCPA)

A

Privacy rights for California’s consumers.

The CCPA gives consumers more control over the personal information that businesses collect about them. These regulations provide guidance on how to implement the law.

Additionally, states like Colorado, Utah, Virginia, New York, and Connecticut have enacted similar legislation to protect consumer privacy in their states.

23
Q

RACI

A

Responsible, Accountable, Consulted, Informed

Responsible: Responsible for performing the work necessary or making the decisions that are directly related to completing a task within a project. There can be several roles or groups responsible for a task.

Accountable: These individuals must approve the work performed by those who are “responsible”. As a general rule, there is usually a single person in this role, often a manager or project lead.

Consulted: Those assigned to offer input on a task. There should be a clear and open line of two-way communication between those assigned to “responsible” and “consulted”. There can be several people in this role. In many situations, they are referred to as subject matter experts (SMEs).

Informed: Those in this role need to be kept aware of progress and concerns of those working on a project. Those who are “informed” tend to be in higher levels of senior leadership. They need to understand insights from the projects rather than details of how the specific tasks are performed.

Note: On any given RACI chart, not every letter will be assigned; not all tasks include every letter (e.g., for access to data, you could mark the BI Engineer, Analytics Team Manager, and Data Engineer “R” and the Data Scientist “C”).

24
Q

Data Scientist

A

Professionals who work closely with analytics to provide meaningful insights that help improve current business operations.

25
Data Engineer
Professionals concerned with infrastructure who are responsible for developing and managing databases. They often work alongside data scientists to build custom pipelines to manage the analysis and organization of raw data. Data compliance is part of developing and managing databases, and the data engineer is responsible for ensuring compliance.
*Data compliance is the act of handling and managing personal and sensitive data in a way that adheres to regulatory requirements, industry standards, and internal policies involving data security and privacy.
Responsibilities:
-Make data accessible
-Ensure the data ecosystem produces reliable results
-Deal with infrastructure for data across the enterprise
26
Analytics Team Manager/Insights Team Manager/Analytics Team Director/Head of Data/Data Science Director
Professionals who build and support a team of data scientists and analysts. Often they will lead a company's analytics department. In this role, they supervise different projects to develop and implement strategies that convert raw data into business insights.
Responsibilities:
-Supervise the analytical strategy of an organization
-Manage multiple groups of customers and stakeholders
-Often a hybrid between the data scientist and the decision maker (a rare combo of skills, which makes this position hard to fill)
27
Business Intelligence (BI) Manager/Business Intelligence (BI) Analyst
Professionals who use their knowledge of business trends and databases to organize information and make it accessible.
28
Data compliance
the act of handling and managing personal and sensitive data in a way that adheres to regulatory requirements, industry standards and internal policies involving data security and privacy.
29
Data Professional Responsibilities
Vary by company and greatly depend on the structure of the team, but typically encompass the following:
-Look for patterns and trends within big datasets
-Uncover the stories inside data
-Help guide decision making
-Translate key information into visuals
30
Data Professional Titles
Junior Data Scientist, Data Scientist -- Entry Level, Associate Data Scientist, Data Science Associate, BI Analyst, BI Manager, Data Scientist, Data Engineer, etc.
31
Data Science
The discipline of making data useful
32
Python
A general purpose programming language
33
Tableau
A business intelligence and analytics platform that helps people visualize, understand, and make decisions with data.
34
Main areas covered by data professions:
Statistical inference, machine learning, and data analytics.
Statistical inference refers to the use of statistics to draw conclusions about an unknown aspect of a population based on a random sample.
Machine learning: The use and development of algorithms and statistical models to teach computer systems to analyze patterns in data.
Data analytics: Creates methods to capture, process, and organize data to uncover actionable insights for current problems. Analysts focus on processing the information stored in existing datasets and establishing the best way to present this data. Data analysts rely on statistics and data modeling to solve problems and offer recommendations that can lead to immediate improvements.
35
Data Frame
A table used to organize data
36
LLM
Large Language Model. A type of AI algorithm that uses deep learning techniques to identify patterns in text and map how different words and phrases relate to each other. This allows LLMs to predict what word should come next. LLMs can generate human-like text in response to a wide range of prompts and questions. Examples: Gemini & ChatGPT
37
AI Limitations (/the human role)
Intuition: AI models are trained on data, and they can only make decisions based on the patterns they observe in the data. Humans can use their intuition and personal experience to make decisions that are not explicitly programmed into the AI model. For this reason, it's important to always verify a model's output before relying on it.
Deal with ambiguity: AI models are good at solving problems that are well-defined and have clear parameters. However, humans can identify and understand complex problems that are not well-defined and have ambiguous parameters by considering key details offered in the context of the project.
Interpersonal communication: AI models can generate reports and presentations, but they cannot communicate with stakeholders in the nuanced way that humans can. Humans can explain the results of their analysis to fit the needs of specific stakeholders, and use their emotional intelligence to address concerns.
Creativity: AI models are good at following instructions, but they are not imaginative like humans. Humans can be creative in their approach to data analysis, and imagine new and innovative solutions to complex problems.
Critical thinking: Humans can think critically about their data and identify potential biases and ethical issues. AI models are usually trained on real-world data that contains biases and are therefore likely to reflect those biases in model outputs.
Leadership: Humans can be leaders, and they can motivate and inspire others. AI may have difficulty understanding the nuances of human emotion, motivation, and communication. This limits AI's ability to effectively run an organization.
Factuality: Generative AI models are trained to output text based on patterns in language. Sometimes the model output may be very well-composed and, as a result, seem reliable, but may not be factual. As noted above, it's important to always verify model output.
38
How can data professionals use AI?
Data professionals can use AI to help automate tasks, make predictions, generate insights, and communicate findings. They can leverage AI to be more productive in their work and more impactful in their organizations. Overall, AI is a powerful tool for data professionals, but it is not without limitations. For this reason, human oversight and intervention is critical when working with AI and related tools.
For example, data professionals can use AI to:
-Create predictive models to help accurately forecast future events or outcomes
-Automate time-consuming tasks such as data cleaning, coding, and report writing
-Analyze extremely large datasets
-Improve the quality of data by identifying and correcting errors
-Generate insights from data that would not be obvious to humans
-Provide guidance on tasks such as choosing the right algorithms and interpreting results
-Facilitate collaboration among team members
Tools like Gemini and ChatGPT can help data professionals in a variety of ways. A data professional might ask Gemini or ChatGPT to:
-Clean a dataset by removing missing values, outliers, and duplicate data
-Create interactive data visualizations such as dashboards and heatmaps
-Recommend a specific algorithm for a particular task based on the data professional's input
-Create a shared document to facilitate a brainstorming session among a team of data professionals
39
Artificial Intelligence (AI)
AI refers to the development of computer systems able to perform tasks that normally require human intelligence. For example, practical applications of AI include voice assistants, self-driving vehicles, automated recommendation systems, and more.
40
Best Practices when writing prompts for LLMs:
Be clear and concise in your instructions: It is important to be clear and concise in your instructions so the LLM can understand how to help you. Details are great—just make sure they're useful and relevant. Avoid giving the LLM unnecessary information.
Be precise: When posing a question to an LLM, be precise about the input (if any) and the desired output.
Include a description of the LLM's role: This reinforces the purpose of your prompt. For example, you can tell the LLM to assume the role of a data scientist by writing "Act as a data scientist" or "You are a data scientist."
Provide context: Providing context allows the LLM to understand the nuances of the relevant issue and generate more informed responses.
Try multiple prompts: Trying different prompts can provide different perspectives on a problem and enable the LLM to generate a variety of useful responses.
41
Data Science Workflows/Examples of how data professionals can use LLMs
Data cleaning: LLMs can automate tasks such as data cleaning and coding. For example, you can ask an LLM to clean a dataset by removing missing values, outliers, and duplicate data.
Exploratory data analysis (EDA): LLMs can perform exploratory data analysis (EDA) on datasets. For example, you can ask an LLM to create data visualizations, identify patterns and trends, and calculate summary statistics.
Modeling: LLMs can build and evaluate models. For example, you can ask an LLM to build a machine learning model to predict an outcome, and evaluate the performance of the model.
Interpreting results: LLMs can interpret the results of models. For example, you can ask an LLM to explain the features that are most important for a model, or generate insights from the results of a model.
Collaboration: LLMs can help you collaborate with teammates. For example, you can ask an LLM to create a shared document for a brainstorming session with a team of data professionals.
42
PACE
Plan:
-What are the goals of the project?
-What strategies will be needed?
-What will be the business or operational impacts of this plan?
Tasks: Research business data, define the project scope, develop a workflow, and assess project and/or stakeholder needs.
Analyze:
-Acquire data from primary and secondary sources
-Clean, organize, and transform the data for analysis
-Engage in EDA
Tasks: Format the database, scrub the data, and convert the data into usable formats.
Construct:
-Build and revise machine learning models
-Uncover relationships in the data
-Apply statistical inferences about data relationships
Tasks: Select the modeling approach, build models, and build machine learning algorithms.
Execute:
-Present findings to internal and external stakeholders
-Answer questions
-Consider differing viewpoints
Tasks: Share results, present findings to other stakeholders, and address feedback.
43
Exploratory data analysis (EDA)
Exploratory Data Analysis is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. It is used by data scientists to analyze and investigate datasets, often employing data visualization methods. EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
44
Identify Assumptions & Limitations about the Data
Questions to ask yourself
Assumption check:
-Is there something that I am taking for granted?
-Am I assuming something here that I shouldn't?
-Can I determine if the assumption is correct?
Limitation check:
-Is the data complete? Are there missing values or sections?
-Are the datasets formatted correctly?
-Is this a sufficient sample size to conduct an analysis of an entire population or group?
-What are the biases present in the data set?
-Does this data contain personally identifiable information? What steps will I take to protect this information?
45
The Business Side of Data
Speak the language of your audience:
-Why has this person contacted you?
-What does your stakeholder want from this interaction?
-What's important to them, their team, or their organization?
When interacting with stakeholders:
-Break down technical concepts into simpler terms
-Use shorter sentences so main ideas are easier to understand and remember
-Use direct language and minimize embellishments or unnecessary detail
-Pay attention to diverse backgrounds and respect the lived experience of others
-Avoid jargon, acronyms, and technical "buzzwords" that could lead to confusion
Invite questions and welcome feedback:
-Merge your passion for finding solutions with the goals of the project
-Continue to strive for greater understanding of the results
-Elicit feedback and questions to improve communication about your projects
-Consider opportunities to reflect on your communication skills
-Analyze feedback. Is it valid? Does the person have a complete understanding of the goals of the project or data analytical process? If not, set up an additional meeting to clarify.
Be the connection to the data:
-Focus on the objective to help others better understand your data process
-Tell the story of the data with a compelling and cohesive narrative
-Respond to questions in a timely manner
-Demonstrate your value to the team
-Find opportunities to address stakeholder questions
Let your visualizations help you tell the story:
-Be sure that your visuals tell the story within the data
-Design visuals for inclusivity
-Use labels and text to clarify, not clutter
-Use fonts that are easy to read
-Use high contrast, shading, and other customizations to communicate your message clearly
-Offer handouts, slides, and other material in accessible formats
-Keep visuals simple. When deciding what to include in a presentation, less is more.
Build positive professional relationships:
-Focus on what matters to your audience
-Invite feedback and discussion
-Be a trusted subject matter expert who communicates clearly and inclusively
-Cultivate positive interactions to strengthen working relationships and improve morale
-When a stakeholder contacts you, be accessible and engaged in your communication
Share findings:
-Craft results to the needs of your stakeholders. Communicate why this data will help them achieve their goals.
-Determine the visuals and/or dashboards that are the most effective. What data will you need to show and how do you want stakeholders to interact with it?
-Think about the design carefully. A simple yet visually appealing approach to visualizations is always the best.
-Use a hierarchy of data in your visualizations/dashboards. Information that is most important should be easily accessible, but you should provide a path for more details.
What should I keep in mind when I share results?
-What information is the most important to my audience?
-What is the most efficient way to share with the tools available and the time I'm allotted?
-What can I do to make the key points effectively?
46
Tips for Presentations
-Structure your presentation. Be sure there is a logical structure: a beginning, middle, and end.
-Presentation slides are not scripts. Don't read or put complete paragraphs on presentation slides.
-Make sure your data can be understood visually and consider potential accessibility challenges for your audience.
-Focus most on the points your data illustrates.
-Share one—and only one—major point from each chart.
-Label chart components clearly.
-Visually highlight "Aha!" zones.
-Write a slide title that reinforces the data's point.
47
Time is Money Tips/Respect Others' Time
-Use direct language
-Minimize wordiness
-Avoid unnecessary details
-Always strive for clarity
-Use proper grammar and punctuation
-Keep vocabulary simple and avoid technical language
-Break complex ideas into shorter sentences to make concepts easier to understand and remember
48
Project Proposal
Main function: To outline objectives and requirements. Project proposals present ideas in detailed and actionable segments called milestones. Proposals are commonly created with input from team members and other stakeholders. They may be shared with clients or executives to gain approval and inform them of the project's path to completion.
49
Common Sections of a Project Proposal
-Project title: Should be brief and purposeful.
-Project objective: 1-3 sentences explaining what the project is trying to achieve.
-Milestones: Groupings of tasks within a project, breaking the work into manageable goals.
-Tasks: Tasks detail the work that needs to be completed within a milestone.
-Outcomes: Completed actions or results that allow a project to continue.
-Deliverables: Items that can be shared among team members or with stakeholders.
-Stakeholders: Individuals or groups who are directly involved and have a vested interest in the success of a project.
-Estimated time: At the beginning of a project, the time needed to complete milestones is estimated. As a project develops, these estimates will often need to be updated to account for adjustments to timelines or changes in team members.
50
Executive Summary
A document used to update decision makers who may not be directly involved in the tasks of a project. They can also be used to help new team members quickly become acquainted with a project. There are many ways to present the information within a summary, including software options built specifically for that purpose.
51
PACE workflow:
A framework that provides an initial structure to guide the process of data analytics; PACE stands for plan, analyze, construct, and execute
52
Plan stage:
Stage of the PACE workflow where the scope of a project is defined and the informational needs of the organization are identified
53
Analyze stage:
Stage of the PACE workflow where the necessary data is acquired from primary and secondary sources and then cleaned, reorganized, and analyzed
54
Construct stage:
Stage of the PACE workflow where data models and machine learning algorithms are built, interpreted, and revised to uncover relationships within the data and help unlock insights from those relationships
55
Execute stage:
Stage of the PACE workflow where a data professional will present findings with internal and external stakeholders, answer questions, consider different viewpoints, and make recommendations
56
Chief Data Officer:
An executive-level data professional who is responsible for the consistency, accuracy, relevancy, interpretability, and reliability of the data a team provides
57
Experiential Learning
Understanding through doing.
58
Transferable Skill
A capability or proficiency that can be applied from one job to another
59
user churn
the number of existing customers lost over a given period of time. Example: the number of users who have uninstalled an app or stopped using the app
60
Descriptive statistics:
Stats that summarize or describe features of a data set, such as its central tendency or dispersion.
Central tendency: a single value that attempts to describe a set of data by identifying the central position within that set of data. It could be the mean/avg, median, or mode.
Mode: most frequent value in a data set.
Notes:
-Mean/avg can be susceptible to the influence of outliers. The data can be skewed by outliers.
-Median: less affected by outliers and skewed data. (Standard to use the median whenever tests of normality show that the data is non-normal/skewed.)
Dispersion: a way of describing how spread out a set of data is.
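To make these ideas concrete, here is a minimal pandas sketch (the values in the series are made up for illustration):
import pandas as pd
# Hypothetical scores with one outlier (100) that pulls the mean upward
scores = pd.Series([2, 3, 3, 4, 5, 100])
print(scores.mean())    # 19.5 -> central tendency, sensitive to the outlier
print(scores.median())  # 3.5  -> central tendency, less affected by the outlier
print(scores.mode()[0]) # 3    -> most frequent value
print(scores.std())     # dispersion: standard deviation
print(scores.max() - scores.min())  # dispersion: range (98)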
61
Jupyter Notebooks, command and edit mode
Command mode: Used to interact with the notebook as a whole and perform actions like adding, moving, and deleting cells. Edit mode: used to type code or markdown text in a particular cell.
62
Jupyter Notebooks, Modular/interactive computing
You can write and execute code in small, manageable chunks, which are called cells.
63
Jupyter Notebooks, cells
Small, manageable, individual chunks of code within a notebook.
64
Markdown
A markup language that lets you add formatting elements to plain text.
65
Data Type
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform.
66
Variable Algorithm Questions
What's the variable's name? What's the variable's type? What's the variable's starting value?
67
Assignment
The process of storing a value in a variable
68
Expression
A combination of numbers, symbols, or other variables that produce a result when evaluated.
69
Dynamic Typing
Variables can point to objects of any data type
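A tiny illustrative sketch tying the assignment, expression, and dynamic typing cards together (the variable names are arbitrary):
# Assignment: store a value in a variable
count = 10
# Expression: a combination of numbers, symbols, and variables that produces a result when evaluated
total = count * 2 + 5   # evaluates to 25
# Dynamic typing: the same variable can later point to an object of a different data type
count = "ten"           # count now refers to a string instead of an integer
print(type(count))      # <class 'str'>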
70
Naming Restrictions
Rules built into the syntax of the language itself that must be followed
71
Naming Conventions
Consistent guidelines that describe the content, creation date, and version of a file in its name
72
Executive Summary
A concise document that summarizes a longer report or proposal, highlighting the main points and findings for decision makers.
73
Regression Model
A statistical process for estimating the relationships among variables, often used to predict outcomes based on input data.
74
What does it mean that the data was log-transformed?
A log transformation means the analyst replaced the original values with their logarithm (usually log10 or natural log). So, instead of working with: 10, 100, 1000, 10000 they work with: 1, 2, 3, 4
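As a quick illustration (the DataFrame and column names are hypothetical), a base-10 log transform in pandas/NumPy might look like:
import numpy as np
import pandas as pd
df = pd.DataFrame({'views': [10, 100, 1000, 10000]})
# Replace the raw values with their base-10 logarithm
df['log_views'] = np.log10(df['views'])
print(df['log_views'].tolist())  # [1.0, 2.0, 3.0, 4.0]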
75
Why do people log-transform data?
Data is typically log-transformed when it is:
-highly skewed (long right tail)
-spanning multiple orders of magnitude
-multiplicative rather than additive
-dominated by extreme outliers
Common examples:
-income
-view counts (YouTube, TikTok)
-population sizes
-sales revenue
-biological measurements (gene expression, viral load)
76
Why use log-transformed data?
Without a log transform:
-a few massive values dominate the analysis
-trends look flat or meaningless
-models perform poorly
With a log transform:
-differences become comparable
-patterns become linear
-variance stabilizes
-visualization becomes readable
77
How do you interpret log-transformed data differently?
After log-transforming:
-Differences are multiplicative
-A +1 increase ≈ 10× increase (for log10)
-A straight line = exponential growth in original units
Explanation: Logs turn multiplication into addition:
-120 → 1,200 → 12,000
-log₁₀: 2.08 → 3.08 → 4.08
That's why trends become easier to see.
78
What is pre-filtered data?
Pre-filtered means that rows or values were removed before analysis. That is, the dataset you see is not raw; it's already been cleaned or restricted.
Data is often pre-filtered to:
-Remove missing or invalid values
-Exclude outliers
-Focus on a specific subgroup
-Apply minimum thresholds
-Improve data quality or relevance
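A minimal sketch of pre-filtering in pandas, assuming hypothetical column names and thresholds:
import pandas as pd
df = pd.read_csv('example_filepath/file')  # raw data
# Pre-filter: drop rows with missing revenue, keep one subgroup, apply a minimum threshold
filtered = df.dropna(subset=['revenue'])
filtered = filtered[(filtered['country'] == 'US') & (filtered['revenue'] >= 1000)]
# Any analysis of `filtered` now describes this subset, not the full population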
79
Why use pre-filtered data?
Data is often pre-filtered to:
-Remove missing or invalid values
-Exclude outliers
-Focus on a specific subgroup
-Apply minimum thresholds
-Improve data quality or relevance
80
Why does it matter if pre-filtered data is used?
Pre-filtering affects:
-Sample size
-Means and medians
-Variance
-Generalizability
-Whether results are biased
*If data is pre-filtered, you cannot interpret results as applying to the full population.
81
What does the following mean? Data were log-transformed and pre-filtered
It usually means: -Rows not meeting criteria were removed first -Remaining numeric variables were log-scaled before analysis
82
Log-Transformed Data vs Pre-Filtered Data
Log-transformed = values were rescaled using logarithms to manage skew and extreme ranges Pre-filtered = some data was removed before analysis Both are common, useful, and powerful — but they change how results must be interpreted
83
Explain logarithmic math.
Example 1:
Video | Views | log10(Views)
A | 120 | 2.08
B | 250 | 2.40
Example 2:
log10(100) = 2 because...
log10(1000) = 3 because...
When you see: log10(120) = 2.08
The statement means: 10^2.08 ≈ 120
Note: ≈ means approximately equal to
log10(100) = 2 because 10^2 = 100
log10(1000) = 3 because 10^3 = 1000
84
Intuition shortcut for logarithmic values (very useful)
Each +1 in log₁₀ means 10× bigger:
log₁₀ value | Real value
2.00 | 100
2.08 | 120
2.30 | 200
3.00 | 1,000
So:
2.08 just means "a bit bigger than 100"
2.30 means "about double 100"
3.00 means "ten times 100"
85
EDA
Exploratory Data Analysis: The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often employing data wrangling and visualization methods.
86
6 main practices of EDA
Discover: Data professionals familiarize themselves with the data, so they can start conceptualizing how to use it.
Structure: The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled.
Clean: The process of removing errors that may distort your data or make it less useful (e.g., missing values, misspellings, duplicate values, or extreme outliers).
Join: The process of augmenting or adjusting data by adding values from other datasets (i.e., you might add more value or context to the data by adding more information from other data sources).
Validate: The process of verifying that the data is consistent and high quality (i.e., checking for misspellings, inconsistent number and date formats, and confirming that the cleaning process didn't create more issues).
Present: Make your cleaned dataset or data visualizations available to others for analysis or further modeling.
87
Bias (in data structuring)
Organizing data in groupings, categories, or variables that don't accurately represent the whole dataset.
88
Data Visualization
A graph, chart, diagram, or dashboard that is created as a representation of information
89
Six main practices of EDA are ______.
iterative and non-sequential.
Discover: Data professionals familiarize themselves with the data, so they can start conceptualizing how to use it.
Structure: The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled.
Clean: The process of removing errors that may distort your data or make it less useful (e.g., missing values, misspellings, duplicate values, or extreme outliers).
Join: The process of augmenting or adjusting data by adding values from other datasets (i.e., you might add more value or context to the data by adding more information from other data sources).
Validate: The process of verifying that the data is consistent and high quality (i.e., checking for misspellings, inconsistent number and date formats, and confirming that the cleaning process didn't create more issues).
Present: Make your cleaned dataset or data visualizations available to others for analysis or further modeling.
90
Key principles of the EDA process
The following two principles are inherently part of the EDA process:
Human augmentation: This principle ensures humans are inserted throughout the AI or machine learning algorithm systems for oversight. Thorough EDA, performed by data scientists, is perhaps one of the best ways to limit bias, imbalance, and inaccuracies being fed into an algorithm.
Bias evaluation: Without human interference, bias is too easily injected and reproduced in machine learning models. Performing methodical EDA processes will lead data scientists to be aware of and act on biases and imbalances in the data.
91
92
json
(jay-son) JSON stands for JavaScript Object Notation. Data storage files saved in a JavaScript-based text format. They may contain nested objects within them. (Think of nested objects as expandable file folders.) For example, a file may have all the ingredients for a recipe, and under each ingredient it may list weight, calories, and price.
Note: You can use pandas to import, read, and write JSON files:
import json
import pandas as pd
pd.read_json()
DataFrame.to_json()
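For instance, a small sketch of reading and writing JSON with pandas (the file names are hypothetical):
import pandas as pd
# Read a JSON file into a DataFrame
df = pd.read_json('recipes.json')
# ...clean or transform the data...
# Write the DataFrame back out as JSON
df.to_json('recipes_clean.json')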
93
First-party data
Data that was gathered from inside your organization
94
Second-party data
Data that was gathered outside your organization but directly from the original source
95
Third-party data
Data gathered outside your organization and aggregated.
96
How to import a CSV file into Python (two of the ways):
# Import a CSV file into Python and define the mode (i.e., read, write, append, or create a new file):
with open('file_path/file_name', mode='r') as file:
    # assign the result to a variable name; in this case, "data"
    data = file.read()
*When defining the mode, you use one of the following options:
'r' read
'w' write
'a' append
'+' create a new file
# Import a CSV file into a dataframe using pandas:
import pandas as pd
df = pd.read_csv('example_filepath/file')
97
DataFrame.head()
The head() function will display the number of dataset rows you input in the argument field. For the "X" in the argument field, input the number of rows you want displayed in a Python notebook. The default is 5 rows.
Example:
df.head(10)  # this will print out the first 10 rows of a table
98
DataFrame.info()
The info() function will display a summary of the dataset, including the range index, dtypes (or data types), column headers, and memory usage. Leaving the argument field blank will return a full summary. As an option, in the argument field you can type "show_counts=True," which forces the non-null counts to be shown.
Example:
df.info()
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #  Column              Dtype
--  ------              -----
 0  date                object
 1  number_of_strikes   int64
 2  center_point_geom   object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB
99
Dataframe.describe()
The describe() function will return descriptive statistics of the entire dataset, including total count, mean, minimum, maximum, dispersion, and distribution. Leaving the argument field blank will default to returning a summary of the data frame’s statistics. As an option, you can use “include=[X]” and “exclude=[X]” which will limit the results to specific data types, depending on what you input in the brackets. Once executed, the describe() function looks like this: Example: df_joined.describe()
100
DataFrame.shape
'Shape' returns a tuple representing the dimensions of the dataset by number of rows and columns. The code will look something like this:
df.shape
(3401012, 3)
101
int64
If you see this, it means the data contains 64-bit integers, i.e., whole numbers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 (roughly between negative nine quintillion and positive nine quintillion).
102
Strings
Sequences of characters or integers that are unchangeable
103
Why would a data professional use the following methods? describe(), sample(), size, shape
To learn about a dataset
104
str.slice(stop=x)
Example: str.slice(stop=3) str.slice will omit the text after the first three letters.
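Applied to a pandas Series of strings (hypothetical values), this might look like:
import pandas as pd
months = pd.Series(['January', 'February', 'March'])
# Keep only the first three characters of each string
print(months.str.slice(stop=3).tolist())  # ['Jan', 'Feb', 'Mar']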
105
plt.bar()
Pyplot's plt.bar() function takes positional arguments of x and height, representing data for the x- and y-axes.
Example:
plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'], label="Number of strikes")
plt.plot()
plt.xlabel("Months(2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by months")
plt.legend()
plt.show()
106
DataFrameGroupBy.sample(n=None, frac=None, replace=False, weights=None, random_state=None)
Return a random sample of items from each group. You can use random_state for reproducibility.
Parameters:
n : int, optional
Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.
frac : float, optional
Fraction of items to return. Cannot be used with n.
replace : bool, default False
Allow or disallow sampling of the same row more than once.
weights : list-like, optional
Default None results in equal probability weighting. If passed a list-like, then values must have the same length as the underlying DataFrame or Series object and will be used as sampling probabilities after normalization within each group. Values must be non-negative with at least one positive element within each group.
random_state : int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional
If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
Example:
# Create a 'years_until_unicorn' column using the companies dataset, which has a column for year joined (i.e., year a company became a unicorn) and year founded.
companies_sample['years_until_unicorn'] = companies_sample['Year Joined'] - companies_sample['Year Founded']
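The example above does not actually call .sample(), so here is a separate minimal sketch of sampling one row per group (the DataFrame and values are made up):
import pandas as pd
df = pd.DataFrame({
    'Industry': ['Fintech', 'Fintech', 'Health', 'Health', 'Health'],
    'Company':  ['A', 'B', 'C', 'D', 'E'],
})
# Draw one random row per industry; random_state makes the draw reproducible
sampled = df.groupby('Industry').sample(n=1, random_state=42)
print(sampled)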
107
Method Chaining
Code can be written vertically because each line returns an object that the next line can act on. This is called method chaining. Every method in the chain returns a new pandas object (usually a DataFrame or Series), so the next method can immediately be called on it.
Example:
The fully written-out code:
step1 = companies_sample[["Industry", "years_till_unicorn"]]
step2 = step1.groupby("Industry")
step3 = step2.max()
grouped = step3.sort_values(by="years_till_unicorn")
is the same as the chained code:
grouped = (companies_sample[["Industry", "years_till_unicorn"]]
    .groupby("Industry")
    .max()
    .sort_values(by="years_till_unicorn")
)
108
plt.title()
To add a title to a plot created with pandas plotting functionality, which uses Matplotlib as its backend, you can use the plt.title() function from the matplotlib.pyplot module
109
plt.xlabel
The plt.xlabel() function from Matplotlib can be used with Pandas plotting to set the label for the x-axis. This function should be called after generating the plot to modify the current active axes.
110
plt.ylabel
The plt.ylabel() function from Matplotlib can be used with Pandas plotting to set the label for the y-axis. This function should be called after generating the plot to modify the current active axes.
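Putting the three labeling functions together after a pandas plot (column names and values are hypothetical):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'strikes': [120, 90, 150]})
df.plot(x='month', y='strikes', kind='bar')  # pandas plotting, Matplotlib backend
plt.title('Lightning strikes by month')      # title on the current active axes
plt.xlabel('Month')                          # x-axis label
plt.ylabel('Number of strikes')              # y-axis label
plt.show()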
111
plt.xticks
The plt.xticks() function from Matplotlib is used to get or set the x-axis tick locations and labels. When working with Pandas plots, which use Matplotlib as their default backend, you can use plt.xticks() to customize the axis appearance after calling the pandas.DataFrame.plot() method. Example: # Rotate labels on the x-axis as a way to avoid overlap in the positions of the text plt.xticks(rotation=45, horizontalalignment='right')
112
plt.show()
To display plots generated from a pandas DataFrame or Series, you need to use the plt.show() function from the Matplotlib library, which is the backend plotting library pandas uses.
113
dt
Note that it's not uncommon to import the datetime module from Python's standard library as dt. You may have encountered this yourself. In such a case, dt is being used as an alias. The pandas .dt Series accessor (as demonstrated in the last example) is a different thing entirely.
Example:
print(my_series.dt.year)
print(my_series.dt.month)
print(my_series.dt.day)
Output:
0    2023
1    2023
2    2023
dtype: int64
0    1
1    4
2    6
dtype: int64
0    20
1    27
2    15
dtype: int64
114
strftime
# Create four new columns. strftime is short for "string format time." We will use this method on the datetime data in the week column, and it will extract the information we specify, formatted as a string.
Example: let's create four new columns: week, month, quarter, and year. You can find a full list of available codes to use in the strftime format codes documentation. We will use %Y for year, %V for week number, and %q for quarter.
df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')
115
pyplot as plt
Helpful package for creating bar, line, and pie charts. A package in the matplotlib library. To use: import matplotlib.pyplot as plt
116
pandas
functions and commands that help you work with data sets. To use: import pandas as pd
117
seaborn
visualization library that produces charts To use: import seaborn as sns
118
Why use .sum() (pandas)?
This is the aggregation step. After grouping, pandas needs to know: "When there are multiple rows per week, how do I combine them?" .sum() answers that question.
Practice question: If you were grouping a table by the column week using the following code, what does sum() do?
df_by_week_2018 = df[df['year'] == '2018'].groupby(['week']).sum().reset_index()
df_by_week_2018.head()
Answer: sum() adds up all numeric columns within each week and collapses many rows per week into one row per week.
119
What does .reset_index() do in the following code?
df_by_week_2018 = df[df['year'] == '2018'].groupby(['week']).sum().reset_index()
df_by_week_2018.head()
.reset_index()
-Moves week out of the index
-Turns it back into a regular column
-Creates a clean, flat DataFrame
Result:
week | strikes
1 | 355
2 | 190
Step | Purpose
groupby('week') | Define grouping
sum() | Collapse rows → one per week
reset_index() | Make the result usable
You usually need all three when:
-You're aggregating data
-You plan to inspect, plot, or reuse the result
120
Sorting
The process of arranging data into meaningful order
121
Extracting
The process of retrieving data from a dataset or source for further processing
122
Filtering
The process of selecting a smaller part of your dataset based on specified parameters and using it for viewing or analysis.
123
Slicing
A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints.
124
Grouping
Aggregating individual observations of a variable into groups
125
Merging
Method to combine two different data frames along a specified starting column
126
df.merge()
A method available to the DataFrame class. Use df.merge() to take columns or indices from other dataframes and combine them with the one to which you're applying the method.
Example:
df1.merge(df2, how='inner', on=['month','year'])
127
pd.concat()
pd.concat()
A pandas function to combine series and/or dataframes.
Use pd.concat() to join columns, rows, or dataframes along a particular axis.
Example:
df3 = pd.concat([df1.drop(['column_1','column_2'], axis=1), df2])
128
df.join()
A method available to the DataFrame class. Use df.join() to combine columns with another dataframe either on an index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list. Example: df1.set_index('key').join(df2.set_index('key'))
129
df[[columns]]
Use df[[columns]] to extract/select columns from a dataframe. Example: df[['animal', 'legs']]
130
df.select_dtypes()
A method available to the DataFrame class. Use df.select_dtypes() to return a subset of the dataframe’s columns based on the column dtypes (e.g., float64, int64, bool, object, etc.). Example: df2 = df.select_dtypes(include=['int64'])
131
df[condition]
Use df[condition] to create a Boolean mask, then apply the mask to the dataframe to filter according to selected condition. Example: df[df['class']=='Aves']
132
df.sort_values()
A method available to the DataFrame class. Use df.sort_values() to sort data according to selected parameters. Example: df.sort_values(by=['legs'], ascending=False)
133
df.iloc[]
Use df.iloc[] to slice a dataframe based on an integer index location.
Examples:
df.iloc[5:10, 2:] → selects only rows 5 through 9, at columns 2+
df.iloc[5:10] → selects only rows 5 through 9, all columns
df.iloc[1, 2] → selects value at row 1, column 2
df.iloc[[0, 2], [2, 4]] → selects only rows 0 and 2, at columns 2 and 4
134
df.loc[]
Use df.loc[] to slice a dataframe based on a label or Boolean array. Example: df.loc[:, ['color', 'class']]
135
Histograms
Histograms are commonly used to illustrate the shape of a distribution, including the presence of any outliers, the center of the distribution, and the spread of the data. Histograms are typically represented by a series of bars, where each bar represents a range of values. Bar height represents the frequency or count of the data points within that range. Histograms are an essential tool for understanding the characteristics of a dataset. They provide a visual representation of the data’s distribution and enable data professionals to identify patterns, trends, or outliers within the data. Histograms can also help data professionals choose appropriate statistical tests and models for the data and determine whether the data meets any assumptions required for the analysis. Histograms are widely used in any field and any situation that requires any kind of data analysis, including finance, health care, engineering, and social sciences.
136
Common Shapes of Histograms
Symmetric: A symmetric histogram has a bell-shaped curve with a peak in the middle, indicating that the data is evenly distributed around the mean. This is also known as a normal, or Gaussian, distribution. Skewed: A skewed histogram has a longer tail on one side than the other. A right-skewed histogram has a longer tail on the right side, indicating that there are more data points on the left side of the histogram. A left-skewed distribution has a longer tail on the left side, indicating more data points on the right side. Bimodal: A bimodal histogram has two distinct peaks, indicating that the data has two modes. Uniform: A uniform histogram has a flat distribution, indicating that all data points are evenly distributed.
137
How do you generate a histogram in matplotlib?
Use the hist() function in the pyplot module. The function can take many different arguments, but the primary ones are:
x: A sequence of values representing the data you want to plot. It can be a list, tuple, NumPy array, pandas series, and so on.
bins: The number of bins you want to sort your data into. The default value is 10, but this parameter can be an int, sequence, or string. If you use a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin. In other words, if bins = [1, 3, 5, 7], then the first bin is [1–3) (including 1, but excluding 3) and the second [3–5). The last bin, however, is [5–7], which includes 7. A string refers to a predefined binning strategy supported by numpy. Refer to the documentation for more information.
Example:
# Plot histogram with matplotlib pyplot
plt.hist(df['seconds'], bins=range(40, 101, 5))
plt.xticks(range(35, 101, 5))
plt.yticks(range(0, 61, 10))
plt.xlabel('seconds')
plt.ylabel('count')
plt.title('Old Faithful geyser - time between eruptions')
plt.show();
138
How do you generate a histogram in seaborn?
Use the sns.histplot() function. sns.histplot() can take many arguments. Here are some important ones:
x: A sequence of values representing the data you want to plot. It can be a list, tuple, NumPy array, pandas series, and so on.
bins: The number of bins you want to sort your data into. The default value is 10, but this parameter can be an int, sequence, or string. If you use a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin. In other words, if bins = [1, 3, 5, 7], then the first bin is [1–3) (including 1, but excluding 3) and the second [3–5). The last bin, however, is [5–7], which includes 7. A string refers to a predefined binning strategy supported by numpy. Refer to the documentation for more information.
binrange: Lowest and highest value for bin edges; can be used either with bins or binwidth; defaults to data extremes.
binwidth: Width of each bin; overrides bins but can be used with binrange.
Example:
# Plot histogram with seaborn
ax = sns.histplot(df['seconds'], binrange=(40, 100), binwidth=5, color='#4285F4', alpha=1)
ax.set_xticks(range(35, 101, 5))
ax.set_yticks(range(0, 61, 10))
plt.title('Old Faithful geyser - time between eruptions')
plt.show();
139
What does the following do? df.drop_duplicates().shape
Check for duplicates. Run data_frame.shape first. If the shape of the data is different after running data_frame.drop_duplicates(), you will know there were duplicate rows.
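A short sketch of this check, assuming a DataFrame named df is already loaded:
print(df.shape)                    # e.g., (3401012, 3)
print(df.drop_duplicates().shape)  # fewer rows here means duplicate rows were present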
140
What kind of graph represents the counts of samples based on a particular feature?
A histogram is a graphical representation of a frequency distribution, which shows how frequently each value in a dataset or variable occurs.
141
Frequency distribution
A frequency distribution is a table or graph that shows how often values occur within specific intervals.
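One way to build a simple frequency distribution in pandas (the values and bin edges are made up):
import pandas as pd
seconds = pd.Series([45, 47, 52, 55, 58, 61, 63, 88, 91, 95])
# Count how many values fall into each 10-second interval
freq = pd.cut(seconds, bins=range(40, 101, 10)).value_counts().sort_index()
print(freq)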
142
.dt.year == 2021 .dt.month == 6 .dt.day == 14 python, pandas
Use these when you have a column of type datetime64[ns] and you want to extract just the year, month, or day.
Example: The code below will filter results so it shows only companies that joined in 2021. If you didn't use .dt.year, no results would show, because the datetime includes year, month, and day, so it will never equal just 2021.
filtered_companies = companies[companies['Date Joined'].dt.year == 2021]
143
.reset_index When to use? python, pandas
groupby() puts your grouping keys into the index. reset_index() turns them back into columns. That's it.
Almost every "Do I need reset_index()?" question reduces to: Do I want my group labels as columns or as an index?
Example:
# Group `companies_2020_2021` by `Quarter Joined`.
# Aggregate by computing the average `Valuation` of companies that joined per quarter of each year.
# Save the resulting DataFrame in a new variable.
companies_by_quarter_2020_2021 = companies_2020_2021.groupby(by="Quarter Joined")["Valuation"].mean().reset_index().rename(columns={"Valuation": "Average Valuation"})
*General rule: If the next method uses columns=, you almost always need reset_index() first.
TL;DR
groupby() → group keys become the index
reset_index() → index becomes columns
Need to rename, plot, merge, or export? → reset
Doing more math? → don't reset yet
144
Confidence Interval
In statistics, the parameters of a population are often estimated based on a sample. Examples of parameters that can be estimated: the mean or the variance.
Example: You want to know the height of all professional basketball players in the US. For this, you draw a sample. The mean of the sample is likely different from that of the population. If you drew many samples (something that you are unlikely to do), each sample is likely to show a different mean. You will then have a range (the highest mean from the samples to the lowest mean from the samples) in which the true value will lie with a high probability. This range is known as the confidence interval.
You will often hear someone state that the confidence interval is 95% (sometimes you may hear 99%). This means that you can be 95% sure that the true parameter lies within this interval (or 99% sure). For example, imagine looking at a normal distribution (bell curve). If a confidence interval of 95% is selected, 95% of all values lie within the lower limit (far left of curve) and upper limit (far right of the curve).
Note: The confidence interval can be calculated for many different statistical parameters, not only for the mean value. The confidence interval simply states which range the parameter lies within with a certain probability (e.g., 95%).
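As a rough sketch, one common way to compute a 95% confidence interval for a sample mean uses the t-distribution (the heights below are fabricated for illustration):
import numpy as np
from scipy import stats
heights = np.array([198, 201, 195, 210, 204, 199, 207, 202])  # sample heights in cm
mean = heights.mean()
sem = stats.sem(heights)  # standard error of the mean
# 95% confidence interval for the population mean
ci_low, ci_high = stats.t.interval(0.95, len(heights) - 1, loc=mean, scale=sem)
print(mean, (ci_low, ci_high))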
145
Name a common bias in time data
If you're comparing recent results to older ones, you may not have all the necessary results for recent years because it's too early to collect all of them. For example, say that you're looking at the number of years it took US-based companies to hit unicorn status (i.e., hit a $1B valuation) and your dataset includes companies founded from 1900 to the current year. The trouble is that the companies founded in recent years (let's say 2020 to 2025) will only include the fast-growing companies (i.e., the companies that hit unicorn status within that time period). It will not yet include the slower-growing companies that were founded in those recent years. As such, the data will suggest that companies founded in recent years hit unicorn status (i.e., a $1B valuation) faster than companies founded back in the day. But that is misleading, as it's simply too early on. In 10, 15, or 20 years, there will be more companies founded in the 2020 to 2025 time period that have since hit unicorn status. This will lower the average number of years that it took companies founded in this time period to hit unicorn status.
146
IQR
Interquartile Range
It shows the spread of the middle 50% of the data (the range of that data).
Example: IQR in a box plot
IQR = Q3 (quartile 3) - Q1 (quartile 1)
6 = 90 - 84
So, the IQR is 6.
YouTube video: https://www.youtube.com/watch?v=QGSwRH0WgBg
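A small NumPy sketch of computing the IQR (the data values are made up):
import numpy as np
data = np.array([78, 82, 84, 85, 87, 88, 90, 93])
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1  # spread of the middle 50% of the data
print(q1, q3, iqr)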
147
Tukey's rule
Tukey's rule is for boxplots.
Lower bound: Q1 - 1.5 * IQR
Upper bound: Q3 + 1.5 * IQR
Any observations beyond these bounds are flagged as potential outliers.
Q1: Quartile 1
Q3: Quartile 3
IQR: Interquartile range. Shows the middle 50% spread of the data. Calculated as Q3 - Q1.
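Continuing the sketch from the IQR card, Tukey's fences could be computed as follows (the data is made up; 150 plays the role of a suspicious value):
import numpy as np
data = np.array([78, 82, 84, 85, 87, 88, 90, 93, 150])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr   # 75.0
upper_bound = q3 + 1.5 * iqr   # 99.0
# Flag observations beyond the bounds as potential outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers)  # [150]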
148
How to deal with outliers, general guidelines
Delete them: If you are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then you are more likely to decide to delete outliers. Of the three choices, you'll use this one the least.
Reassign them: If the dataset is small and/or the data will be used for modeling or machine learning, you are more likely to choose a path of deriving new values to replace the outlier values.
Leave them: For a dataset that you plan to do EDA/analysis on and nothing else, or for a dataset you are preparing for a model that is resistant to outliers, it is most likely that you are going to leave them in.
149