Advanced Data Analytics, Coursera Flashcards

(149 cards)

1
Q

Data Professional

A

A term used to describe any individual who works with data and/or has data skills.

2
Q

Machine Learning

A

An alternative approach to automation, expressing the way you want a task done by using data instead of explicit instructions.

AKA: The use and development of algorithms and statistical models to teach computer systems to analyze patterns in data.

3
Q

Data Science vs Data Analytics

A

Data science is an entire field dedicated to making data more useful. A data scientist is a professional who uses raw data to develop new ways to model data and understand the unknown. Often, their job responsibilities incorporate various components of computer science, predictive analytics, statistics, and machine learning. The collections of information that data scientists work with can be quite large, requiring expertise to organize and navigate.

Data analytics is a subfield of the larger data science discipline. The aim of data analytics is to create methods to capture, process, and organize data to uncover actionable insights for current problems. Analysts focus on processing the information stored in existing datasets and establishing the best way to present this data. Data analysts rely on statistics and data modeling to solve problems and offer recommendations that can lead to immediate improvements.

4
Q

R programming

A

Used by researchers and academics
Can create complex statistical models

5
Q

Jupyter Notebooks

A

An open-source web application used to create and share documents that contain live code, equations, visualizations, and narrative text

Allows you to run code in real time and helps identify errors easily.

6
Q

Data stewardship

A

The practices of an organization that ensure that data is accessible, usable, and safe

7
Q

Edge computing

A

A way of distributing computational tasks over a bunch of nearby processors (i.e., computers) that is good for speed and resiliency and does not depend on a single source of computational power

8
Q

Machine learning:

A

The use and development of algorithms and statistical models to teach computer systems to analyze patterns in data

9
Q

Metrics

A

Methods and criteria used to evaluate data

10
Q

Python

A

A general-purpose programming language

11
Q

Technical Data Professionals

A

Machine Learning Engineers & Statisticians:
-Expertise in mathematics, statistics, and computing.
-Build models and make predictions.

Advanced Data Analyst:
-Explore datasets to identify directions worth pursuing.

12
Q

Strategic Data Professionals

A

Business Intelligence (BI) Professionals
Technical Project Managers

-Interpret information for an organization’s operations, finance, research, and development
-Work aligns with business strategy

Seek solutions to problems through data analytics

13
Q

Open Data

A

Data that is available to the public and free to use, with guidance on how to navigate the datasets and acknowledge the source.

14
Q

Personally Identifiable Information (PII)

A

Information that permits the identity of an individual to be inferred by either direct or indirect means.

Examples: Biometric records, usernames, social security, or national identification numbers. (Information that’s often associated with medical, financial, and employment records.)

15
Q

Aggregate Information

A

Data from a significant number of users that has eliminated personal information.

16
Q

Sample

A

A segment of the population that is representative of the entire population.

17
Q

Data Anonymization

A

The process of protecting people’s private or sensitive data by eliminating PII.

18
Q

Data Aggregation

A

Process of collecting and combining details from a significant number of users in terms of totals or summary.

19
Q

Data that is often anonymized:

A

telephone numbers, names, license plates and license numbers, social security numbers, IP addresses, medical records, email addresses, photographs, & account numbers.

20
Q

General Data Protection Regulation (GDPR)

A

European Union Law.
The GDPR is described on their website as the toughest privacy and security law in the world. It imposes obligations onto organizations anywhere, so long as they target or collect data related to people in the European Union.

21
Q

Lei Geral de Proteção de Dados Pessoais (LGPD)

A

Brazil’s Law for the protection of personal data

The LGPD is a data protection law that governs how companies collect, use, disclose, and process personal data belonging to people in Brazil. LGPD applies to companies that process data about individuals in Brazil.

22
Q

California Consumer Privacy Act (CCPA)

A

Privacy rights for California’s consumers.

The CCPA gives consumers more control over the personal information that businesses collect about them. These regulations provide guidance on how to implement the law.

Additionally, states like Colorado, Utah, Virginia, New York, and Connecticut have enacted similar legislation to protect consumer privacy in their states.

23
Q

RACI

A

Responsible, Accountable, Consulted, Informed

Responsible: Responsible for performing the work necessary or making the decisions that are directly related to completing a task within a project. There can be several roles or groups responsible for a task.

Accountable: These individuals must approve the work performed by those who are “responsible”. As a general rule, there is usually a single person in this role, often a manager or project lead.

Consulted: Those assigned to offer input on a task. There should be a clear and open line of two-way communication between those assigned to “responsible” and “consulted”. There can be several people in this role. In many situations, they are referred to as subject matter experts (SMEs).

Informed: Those in this role need to be kept aware of progress and concerns of those working on a project. Those who are “informed” tend to be in higher levels of senior leadership. They need to understand insights from the projects rather than details of how the specific tasks are performed.

Note: On any given RACI chart, not every letter will be assigned; not all tasks include every letter (e.g., for access to data, you could mark the BI Engineer, Analytics Team Manager, and Data Engineer “R” and the Data Scientist “C”).

24
Q

Data Scientist

A

Professionals who work closely with analytics to provide meaningful insights that help improve current business operations.

25
Data Engineer
Professionals concerned with infrastructure who are responsible for developing and managing databases. They often work alongside data scientists to build custom pipelines to manage the analysis and organization of raw data. Data compliance is part of developing and managing databases, and the data engineer is responsible for ensuring compliance.
*Data compliance is the act of handling and managing personal and sensitive data in a way that adheres to regulatory requirements, industry standards, and internal policies involving data security and privacy.
Responsibilities:
-Make data accessible
-Ensure the data ecosystem produces reliable results
-Deal with infrastructure for data across the enterprise
26
Analytics Team Manager/Insights Team Manager/Analytics Team Director/Head of Data/Data Science Director
Professionals who build and support a team of data scientists and analysts. Often they will lead a company's analytics department. In this role, they supervise different projects to develop and implement strategies that convert raw data into business insights.
Responsibilities:
-Supervise the analytical strategy of an organization
-Manage multiple groups of customers and stakeholders
-Often a hybrid between the data scientist and the decision maker (a rare combo of skills, which makes this position hard to fill)
27
Business Intelligence (BI) Manager/Business Intelligence (BI) Analyst
Professionals who use their knowledge of business trends and databases to organize information and make it accessible.
28
Data compliance
the act of handling and managing personal and sensitive data in a way that adheres to regulatory requirements, industry standards and internal policies involving data security and privacy.
29
Data Professional Responsibilities
Vary by company and greatly depend on the structure of the team, but typically encompass the following:
-Look for patterns and trends within big datasets
-Uncover the stories inside data
-Help guide decision making
-Translate key information into visuals
30
Data Professional Titles
Junior Data Scientist, Data Scientist -- Entry Level, Associate Data Scientist, Data Science Associate, BI Analyst, BI Manager, Data Scientist, Data Engineer, etc.
31
Data Science
The discipline of making data useful
32
Python
A general purpose programming language
33
Tableau
A business intelligence and analytics platform that helps people visualize, understand, and make decisions with data.
34
Main areas covered by data professions:
Statistical inference, machine learning, and data analytics.
Statistical inference refers to the use of statistics to draw conclusions about an unknown aspect of a population based on a random sample.
Machine learning: The use and development of algorithms and statistical models to teach computer systems to analyze patterns in data.
Data analytics: Creates methods to capture, process, and organize data to uncover actionable insights for current problems. Analysts focus on processing the information stored in existing datasets and establishing the best way to present this data. Data analysts rely on statistics and data modeling to solve problems and offer recommendations that can lead to immediate improvements.
35
Data Frame
A table used to organize data
36
LLM
Large Language Model. A type of AI algorithm that uses deep learning techniques to identify patterns in text and map how different words and phrases relate to each other. This allows LLMs to predict what word should come next. LLMs can generate human-like text in response to a wide range of prompts and questions. Examples: Gemini & ChatGPT
37
AI Limitations (/the human role)
Intuition: AI models are trained on data, and they can only make decisions based on the patterns they observe in the data. Humans can use their intuition and personal experience to make decisions that are not explicitly programmed into the AI model. For this reason, it's important to always verify a model's output before relying on it.
Deal with ambiguity: AI models are good at solving problems that are well-defined and have clear parameters. However, humans can identify and understand complex problems that are not well-defined and have ambiguous parameters by considering key details offered in the context of the project.
Interpersonal communication: AI models can generate reports and presentations, but they cannot communicate with stakeholders in the nuanced way that humans can. Humans can explain the results of their analysis to fit the needs of specific stakeholders, and use their emotional intelligence to address concerns.
Creativity: AI models are good at following instructions, but they are not imaginative like humans. Humans can be creative in their approach to data analysis, and imagine new and innovative solutions to complex problems.
Critical thinking: Humans can think critically about their data and identify potential biases and ethical issues. AI models are usually trained on real-world data that contains biases and are therefore likely to reflect those biases in model outputs.
Leadership: Humans can be leaders, and they can motivate and inspire others. AI may have difficulty understanding the nuances of human emotion, motivation, and communication. This limits AI's ability to effectively run an organization.
Factuality: Generative AI models are trained to output text based on patterns in language. Sometimes the model output may be very well-composed and, as a result, seem reliable, but may not be factual. As noted above, it's important to always verify model output.
38
How can data professionals use AI?
Data professionals can use AI to help automate tasks, make predictions, generate insights, and communicate findings. They can leverage AI to be more productive in their work and more impactful in their organizations. Overall, AI is a powerful tool for data professionals, but it is not without limitations. For this reason, human oversight and intervention is critical when working with AI and related tools.
For example, data professionals can use AI to:
-Create predictive models to help accurately forecast future events or outcomes
-Automate time-consuming tasks such as data cleaning, coding, and report writing
-Analyze extremely large datasets
-Improve the quality of data by identifying and correcting errors
-Generate insights from data that would not be obvious to humans
-Provide guidance on tasks such as choosing the right algorithms and interpreting results
-Facilitate collaboration among team members
Tools like Gemini and ChatGPT can help data professionals in a variety of ways. A data professional might ask Gemini or ChatGPT to:
-Clean a dataset by removing missing values, outliers, and duplicate data
-Create interactive data visualizations such as dashboards and heatmaps
-Recommend a specific algorithm for a particular task based on the data professional's input
-Create a shared document to facilitate a brainstorming session among a team of data professionals
39
Artificial Intelligence (AI)
AI refers to the development of computer systems able to perform tasks that normally require human intelligence. For example, practical applications of AI include voice assistants, self-driving vehicles, automated recommendation systems, and more.
40
Best Practices when writing prompts for LLMs:
Be clear and concise in your instructions: It is important to be clear and concise in your instructions so the LLM can understand how to help you. Details are great—just make sure they're useful and relevant. Avoid giving the LLM unnecessary information.
Be precise: When posing a question to an LLM, be precise about the input (if any) and the desired output.
Include a description of the LLM's role: This reinforces the purpose of your prompt. For example, you can tell the LLM to assume the role of a data scientist by writing "Act as a data scientist" or "You are a data scientist."
Provide context: Providing context allows the LLM to understand the nuances of the relevant issue and generate more informed responses.
Try multiple prompts: Trying different prompts can provide different perspectives on a problem and enable the LLM to generate a variety of useful responses.
41
Data Science Workflows/Examples of how data professionals can use LLMs
Data cleaning: LLMs can automate tasks such as data cleaning and coding. For example, you can ask an LLM to clean a dataset by removing missing values, outliers, and duplicate data.
Exploratory data analysis (EDA): LLMs can perform exploratory data analysis (EDA) on datasets. For example, you can ask an LLM to create data visualizations, identify patterns and trends, and calculate summary statistics.
Modeling: LLMs can build and evaluate models. For example, you can ask an LLM to build a machine learning model to predict an outcome, and evaluate the performance of the model.
Interpreting results: LLMs can interpret the results of models. For example, you can ask an LLM to explain the features that are most important for a model, or generate insights from the results of a model.
Collaboration: LLMs can help you collaborate with teammates. For example, you can ask an LLM to create a shared document for a brainstorming session with a team of data professionals.
42
PACE
Plan:
-What are the goals of the project?
-What strategies will be needed?
-What will be the business or operational impacts of this plan?
Tasks: Research business data, define the project scope, develop a workflow, and assess project and/or stakeholder needs.
Analyze:
-Acquire data from primary and secondary sources
-Clean, organize, and transform the data for analysis
-Engage in EDA
Tasks: Format the database, scrub the data, and convert the data into usable formats.
Construct:
-Build and revise machine learning models
-Uncover relationships in the data
-Apply statistical inferences about data relationships
Tasks: Select the modeling approach, build models, and build machine learning algorithms.
Execute:
-Present findings to internal and external stakeholders
-Answer questions
-Consider differing viewpoints
Tasks: Share results, present findings to other stakeholders, and address feedback.
43
Exploratory data analysis (EDA)
Exploratory Data Analysis is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. It is used by data scientists to analyze and investigate datasets, often employing data visualization methods. EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
44
Identify Assumptions & Limitations about the Data
Questions to ask yourself
Assumption check:
-Is there something that I am taking for granted?
-Am I assuming something here that I shouldn't?
-Can I determine if the assumption is correct?
Limitation check:
-Is the data complete? Are there missing values or sections?
-Are the datasets formatted correctly?
-Is this a sufficient sample size to conduct an analysis of an entire population or group?
-What are the biases present in the data set?
-Does this data contain personally identifiable information? What steps will I take to protect this information?
45
The Business Side of Data
Speak the language of your audience:
-Why has this person contacted you?
-What does your stakeholder want from this interaction?
-What's important to them, their team, or their organization?
When interacting with stakeholders:
-Break down technical concepts into simpler terms
-Use shorter sentences so main ideas are easier to understand and remember
-Use direct language and minimize embellishments or unnecessary detail
-Pay attention to diverse backgrounds and respect the lived experience of others
-Avoid jargon, acronyms, and technical "buzzwords" that could lead to confusion
Invite questions and welcome feedback:
-Merge your passion for finding solutions with the goals of the project
-Continue to strive for greater understanding of the results
-Elicit feedback and questions to improve communication about your projects
-Consider opportunities to reflect on your communication skills
-Analyze feedback. Is it valid? Does the person have a complete understanding of the goals of the project or data analytical process? If not, set up an additional meeting to clarify.
Be the connection to the data:
-Focus on the objective to help others better understand your data process
-Tell the story of the data with a compelling and cohesive narrative
-Respond to questions in a timely manner
-Demonstrate your value to the team
-Find opportunities to address stakeholder questions
Let your visualizations help you tell the story:
-Be sure that your visuals tell the story within the data
-Design visuals for inclusivity
-Use labels and text to clarify, not clutter
-Use fonts that are easy to read
-Use high contrast, shading, and other customizations to communicate your message clearly
-Offer handouts, slides, and other material in accessible formats
-Keep visuals simple. When deciding what to include in a presentation, less is more.
Build positive professional relationships:
-Focus on what matters to your audience
-Invite feedback and discussion
-Be a trusted subject matter expert who communicates clearly and inclusively
-Cultivate positive interactions to strengthen working relationships and improve morale
-When a stakeholder contacts you, be accessible and engaged in your communication
Share findings:
-Craft results to the needs of your stakeholders. Communicate why this data will help them achieve their goals.
-Determine the visuals and/or dashboards that are the most effective. What data will you need to show and how do you want stakeholders to interact with it?
-Think about the design carefully. A simple yet visually appealing approach to visualizations is always the best.
-Use a hierarchy of data in your visualizations/dashboards. Information that is most important should be easily accessible, but you should provide a path for more details.
What should I keep in mind when I share results?
-What information is the most important to my audience?
-What is the most efficient way to share with the tools available and the time I'm allotted?
-What can I do to make the key points effectively?
46
Tips for Presentations
-Structure your presentation. Be sure there is a logical structure: a beginning, middle, and end.
-Presentation slides are not scripts. Don't read or put complete paragraphs on presentation slides.
-Make sure your data can be understood visually and consider potential accessibility challenges for your audience.
-Focus most on the points your data illustrates.
-Share one—and only one—major point from each chart.
-Label chart components clearly.
-Visually highlight "Aha!" zones.
-Write a slide title that reinforces the data's point.
47
Time is Money Tips/Respect Others' Time
-Use direct language
-Minimize wordiness
-Avoid unnecessary details
-Always strive for clarity
-Use proper grammar and punctuation
-Keep vocabulary simple and avoid technical language
-Break complex ideas into shorter sentences to make concepts easier to understand and remember
48
Project Proposal
Main function: To outline objectives and requirements. Project proposals present ideas in detailed and actionable segments called milestones. Proposals are commonly created with input from team members and other stakeholders. They may be shared with clients or executives to gain approval and inform them of the project's path to completion.
49
Common Sections of a Project Proposal
-Project title: Should be brief and purposeful.
-Project objective: 1-3 sentences explaining what the project is trying to achieve.
-Milestones: Groupings of tasks within a project, breaking the work into manageable goals.
-Tasks: Tasks detail the work that needs to be completed within a milestone.
-Outcomes: Completed actions or results that allow a project to continue.
-Deliverables: Items that can be shared among team members or with stakeholders.
-Stakeholders: Individuals or groups who are directly involved and have a vested interest in the success of a project.
-Estimated time: At the beginning of a project, the time needed to complete milestones is estimated. As a project develops, these estimates will often need to be updated to account for adjustments to timelines or changes in team members.
50
Executive Summary
A document used to update decision makers who may not be directly involved in the tasks of a project. They can also be used to help new team members quickly become acquainted with a project. There are many ways to present the information within a summary, including software options built specifically for that purpose.
51
PACE workflow:
A framework that provides an initial structure to guide the process of data analytics; PACE stands for plan, analyze, construct, and execute
52
Plan stage:
Stage of the PACE workflow where the scope of a project is defined and the informational needs of the organization are identified
53
Analyze stage:
Stage of the PACE workflow where the necessary data is acquired from primary and secondary sources and then cleaned, reorganized, and analyzed
54
Construct stage:
Stage of the PACE workflow where data models and machine learning algorithms are built, interpreted, and revised to uncover relationships within the data and help unlock insights from those relationships
55
Execute stage:
Stage of the PACE workflow where a data professional will present findings with internal and external stakeholders, answer questions, consider different viewpoints, and make recommendations
56
Chief Data Officer:
An executive-level data professional who is responsible for the consistency, accuracy, relevancy, interpretability, and reliability of the data a team provides
57
Experiential Learning
Understanding through doing.
58
Transferable Skill
A capability or proficiency that can be applied from one job to another
59
user churn
the number of existing customers lost over a given period of time. Example: the number of users who have uninstalled an app or stopped using the app
60
Descriptive statistics:
Stats that summarize or describe features of a data set, such as its central tendency or dispersion.
Central tendency: a single value that attempts to describe a set of data by identifying the central position within that set of data. It could be the mean/avg, median, or mode.
Mode: most frequent value in a data set.
Notes:
-Mean/avg can be susceptible to the influence of outliers. The data can be skewed by outliers.
-Median: less affected by outliers and skewed data. (Standard to use the median whenever tests of normality show that the data is non-normal/skewed.)
Dispersion: a way of describing how spread out a set of data is.
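To make these ideas concrete, here is a minimal pandas sketch (the values in the series are made up for illustration):
import pandas as pd
# Hypothetical scores with one outlier (100) that pulls the mean upward
scores = pd.Series([2, 3, 3, 4, 5, 100])
print(scores.mean())    # 19.5 -> central tendency, sensitive to the outlier
print(scores.median())  # 3.5  -> central tendency, less affected by the outlier
print(scores.mode()[0]) # 3    -> most frequent value
print(scores.std())     # dispersion: standard deviation
print(scores.max() - scores.min())  # dispersion: range (98)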
61
Jupyter Notebooks, command and edit mode
Command mode: Used to interact with the notebook as a whole and perform actions like adding, moving, and deleting cells. Edit mode: used to type code or markdown text in a particular cell.
62
Jupyter Notebooks, Modular/interactive computing
You can write and execute code in small, manageable chunks, which are called cells.
63
Jupyter Notebooks, cells
Small, manageable, individual chunks of code within a notebook.
64
Markdown
A markup language that lets you add formatting elements to plain text.
65
Data Type
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform.
66
Variable Algorithm Questions
What's the variable's name? What's the variable's type? What's the variable's starting value?
67
Assignment
The process of storing a value in a variable
68
Expression
A combination of numbers, symbols, or other variables that produce a result when evaluated.
69
Dynamic Typing
Variables can point to objects of any data type
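A tiny illustrative sketch tying the assignment, expression, and dynamic typing cards together (the variable names are arbitrary):
# Assignment: store a value in a variable
count = 10
# Expression: a combination of numbers, symbols, and variables that produces a result when evaluated
total = count * 2 + 5   # evaluates to 25
# Dynamic typing: the same variable can later point to an object of a different data type
count = "ten"           # count now refers to a string instead of an integer
print(type(count))      # <class 'str'>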
70
Naming Restrictions
Rules built into the syntax of the language itself that must be followed
71
Naming Conventions
Consistent guidelines that describe the content, creation date, and version of a file in its name
72
Executive Summary
A concise document that summarizes a longer report or proposal, highlighting the main points and findings for decision makers.
73
Regression Model
A statistical process for estimating the relationships among variables, often used to predict outcomes based on input data.
74
What does it mean that the data was log-transformed?
A log transformation means the analyst replaced the original values with their logarithm (usually log10 or natural log). So, instead of working with: 10, 100, 1000, 10000 they work with: 1, 2, 3, 4
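As a quick illustration (the DataFrame and column names are hypothetical), a base-10 log transform in pandas/NumPy might look like:
import numpy as np
import pandas as pd
df = pd.DataFrame({'views': [10, 100, 1000, 10000]})
# Replace the raw values with their base-10 logarithm
df['log_views'] = np.log10(df['views'])
print(df['log_views'].tolist())  # [1.0, 2.0, 3.0, 4.0]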
75
Why do people log-transform data?
Data is typically log-transformed when it is:
-highly skewed (long right tail)
-spanning multiple orders of magnitude
-multiplicative rather than additive
-dominated by extreme outliers
Common examples:
-income
-view counts (YouTube, TikTok)
-population sizes
-sales revenue
-biological measurements (gene expression, viral load)
76
Why use log-transformed data?
Without a log transform:
-a few massive values dominate the analysis
-trends look flat or meaningless
-models perform poorly
With a log transform:
-differences become comparable
-patterns become linear
-variance stabilizes
-visualization becomes readable
77
How do you interpret log-transformed data differently?
After log-transforming:
-Differences are multiplicative
-A +1 increase ≈ 10× increase (for log10)
-A straight line = exponential growth in original units
Explanation: Logs turn multiplication into addition:
-120 → 1,200 → 12,000
-log₁₀: 2.08 → 3.08 → 4.08
That's why trends become easier to see.
78
What is pre-filtered data?
Pre-filtered means that rows or values were removed before analysis. That is, the dataset you see is not raw; it's already been cleaned or restricted.
Data is often pre-filtered to:
-Remove missing or invalid values
-Exclude outliers
-Focus on a specific subgroup
-Apply minimum thresholds
-Improve data quality or relevance
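A minimal sketch of pre-filtering in pandas, assuming hypothetical column names and thresholds:
import pandas as pd
df = pd.read_csv('example_filepath/file')  # raw data
# Pre-filter: drop rows with missing revenue, keep one subgroup, apply a minimum threshold
filtered = df.dropna(subset=['revenue'])
filtered = filtered[(filtered['country'] == 'US') & (filtered['revenue'] >= 1000)]
# Any analysis of `filtered` now describes this subset, not the full population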
79
Why use pre-filtered data?
Data is often pre-filtered to:
-Remove missing or invalid values
-Exclude outliers
-Focus on a specific subgroup
-Apply minimum thresholds
-Improve data quality or relevance
80
Why does it matter if pre-filtered data is used?
Pre-filtering affects:
-Sample size
-Means and medians
-Variance
-Generalizability
-Whether results are biased
*If data is pre-filtered, you cannot interpret results as applying to the full population.
81
What does the following mean? Data were log-transformed and pre-filtered
It usually means: -Rows not meeting criteria were removed first -Remaining numeric variables were log-scaled before analysis
82
Log-Transformed Data vs Pre-Filtered Data
Log-transformed = values were rescaled using logarithms to manage skew and extreme ranges Pre-filtered = some data was removed before analysis Both are common, useful, and powerful — but they change how results must be interpreted
83
Explain logarithmic math.
Example 1:
Video | Views | log10(Views)
A | 120 | 2.08
B | 250 | 2.40
Example 2:
log10(100) = 2 because...
log10(1000) = 3 because...
When you see: log10(120) = 2.08
The statement means: 10^2.08 ≈ 120
Note: ≈ means approximately equal to
log10(100) = 2 because 10^2 = 100
log10(1000) = 3 because 10^3 = 1000
84
Intuition shortcut for logarithmic values (very useful)
Each +1 in log₁₀ means 10× bigger:
log₁₀ value | Real value
2.00 | 100
2.08 | 120
2.30 | 200
3.00 | 1,000
So:
2.08 just means "a bit bigger than 100"
2.30 means "about double 100"
3.00 means "ten times 100"
85
EDA
Exploratory Data Analysis: The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often employing data wrangling and visualization methods.
86
6 main practices of EDA
Discover: Data professionals familiarize themselves with the data, so they can start conceptualizing how to use it.
Structure: The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled.
Clean: The process of removing errors that may distort your data or make it less useful (e.g., missing values, misspellings, duplicate values, or extreme outliers).
Join: The process of augmenting or adjusting data by adding values from other datasets (i.e., you might add more value or context to the data by adding more information from other data sources).
Validate: The process of verifying that the data is consistent and high quality (i.e., checking for misspellings, inconsistent number and date formats, and confirming that the cleaning process didn't create more issues).
Present: Make your cleaned dataset or data visualizations available to others for analysis or further modeling.
87
Bias (in data structuring)
Organizing data in groupings, categories, or variables that don't accurately represent the whole dataset.
88
Data Visualization
A graph, chart, diagram, or dashboard that is created as a representation of information
89
Six main practices of EDA are ______.
iterative and non-sequential.
Discover: Data professionals familiarize themselves with the data, so they can start conceptualizing how to use it.
Structure: The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled.
Clean: The process of removing errors that may distort your data or make it less useful (e.g., missing values, misspellings, duplicate values, or extreme outliers).
Join: The process of augmenting or adjusting data by adding values from other datasets (i.e., you might add more value or context to the data by adding more information from other data sources).
Validate: The process of verifying that the data is consistent and high quality (i.e., checking for misspellings, inconsistent number and date formats, and confirming that the cleaning process didn't create more issues).
Present: Make your cleaned dataset or data visualizations available to others for analysis or further modeling.
90
Key principles of the EDA process
The following two principles are inherently part of the EDA process:
Human augmentation: This principle ensures humans are inserted throughout the AI or machine learning algorithm systems for oversight. Thorough EDA, performed by data scientists, is perhaps one of the best ways to limit bias, imbalance, and inaccuracies being fed into an algorithm.
Bias evaluation: Without human interference, bias is too easily injected and reproduced in machine learning models. Performing methodical EDA processes will lead data scientists to be aware of and act on biases and imbalances in the data.
91
92
json
(jay-son) JSON stands for JavaScript Object Notation. Data storage files saved in a JavaScript-based text format. They may contain nested objects within them. (Think of nested objects as expandable file folders.) For example, a file may have all the ingredients for a recipe, and under each ingredient it may list weight, calories, and price.
Note: You can use pandas to import, read, and write JSON files:
import json
import pandas as pd
pd.read_json()
DataFrame.to_json()
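For instance, a small sketch of reading and writing JSON with pandas (the file names are hypothetical):
import pandas as pd
# Read a JSON file into a DataFrame
df = pd.read_json('recipes.json')
# ...clean or transform the data...
# Write the DataFrame back out as JSON
df.to_json('recipes_clean.json')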
93
First-party data
Data that was gathered from inside your organization
94
Second-party data
Data that was gathered outside your organization but directly from the original source
95
Third-party data
Data gathered outside your organization and aggregated.
96
How to import a CSV file into Python (two of the ways):
# Import a CSV file into Python and define the mode (i.e., read, write, append, or create a new file):
with open('file_path/file_name', mode='r') as file:
    # assign the result to a variable name; in this case, "data"
    data = file.read()
*When defining the mode, you use one of the following options:
'r' read
'w' write
'a' append
'+' create a new file
# Import a CSV file into a dataframe using pandas:
import pandas as pd
df = pd.read_csv('example_filepath/file')
97
DataFrame.head()
The head() function will display the number of dataset rows you input in the argument field. For the "X" in the argument field, input the number of rows you want displayed in a Python notebook. The default is 5 rows.
Example:
df.head(10)  # this will print out the first 10 rows of a table
98
DataFrame.info()
The info() function will display a summary of the dataset, including the range index, dtypes (or data types), column headers, and memory usage. Leaving the argument field blank will return a full summary. As an option, in the argument field you can type "show_counts=True," which forces the non-null counts to be shown.
Example:
df.info()
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #  Column              Dtype
--  ------              -----
 0  date                object
 1  number_of_strikes   int64
 2  center_point_geom   object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB
99
Dataframe.describe()
The describe() function will return descriptive statistics of the entire dataset, including total count, mean, minimum, maximum, dispersion, and distribution. Leaving the argument field blank will default to returning a summary of the data frame’s statistics. As an option, you can use “include=[X]” and “exclude=[X]” which will limit the results to specific data types, depending on what you input in the brackets. Once executed, the describe() function looks like this: Example: df_joined.describe()
100
DataFrame.shape
'Shape' returns a tuple representing the dimensions of the dataset by number of rows and columns. The code will look something like this:
df.shape
(3401012, 3)
101
int64
If you see this, it means the data contains 64-bit integers, i.e., whole numbers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 (roughly between negative nine quintillion and positive nine quintillion).
102
Strings
Sequences of characters or integers that are unchangeable
103
Why would a data professional use the following methods? describe(), sample(), size, shape
To learn about a dataset
104
str.slice(stop=x)
Example: str.slice(stop=3) str.slice will omit the text after the first three letters.
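Applied to a pandas Series of strings (hypothetical values), this might look like:
import pandas as pd
months = pd.Series(['January', 'February', 'March'])
# Keep only the first three characters of each string
print(months.str.slice(stop=3).tolist())  # ['Jan', 'Feb', 'Mar']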
105
plt.bar()
Pyplot's plt.bar() function takes positional arguments of x and height, representing data for the x- and y-axes.
Example:
plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'], label="Number of strikes")
plt.plot()
plt.xlabel("Months(2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by months")
plt.legend()
plt.show()
106
DataFrameGroupBy.sample(n=None, frac=None, replace=False, weights=None, random_state=None)
Return a random sample of items from each group. You can use random_state for reproducibility.
Parameters:
n : int, optional
Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.
frac : float, optional
Fraction of items to return. Cannot be used with n.
replace : bool, default False
Allow or disallow sampling of the same row more than once.
weights : list-like, optional
Default None results in equal probability weighting. If passed a list-like, then values must have the same length as the underlying DataFrame or Series object and will be used as sampling probabilities after normalization within each group. Values must be non-negative with at least one positive element within each group.
random_state : int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional
If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
Example:
# Create a 'years_until_unicorn' column using the companies dataset, which has a column for year joined (i.e., year a company became a unicorn) and year founded.
companies_sample['years_until_unicorn'] = companies_sample['Year Joined'] - companies_sample['Year Founded']
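The example above does not actually call .sample(), so here is a separate minimal sketch of sampling one row per group (the DataFrame and values are made up):
import pandas as pd
df = pd.DataFrame({
    'Industry': ['Fintech', 'Fintech', 'Health', 'Health', 'Health'],
    'Company':  ['A', 'B', 'C', 'D', 'E'],
})
# Draw one random row per industry; random_state makes the draw reproducible
sampled = df.groupby('Industry').sample(n=1, random_state=42)
print(sampled)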
107
Method Chaining
Code can be written vertically because each line returns an object that the next line can act on. This is called method chaining. Every method in the chain returns a new pandas object (usually a DataFrame or Series), so the next method can immediately be called on it.
Example:
The fully written-out code:
step1 = companies_sample[["Industry", "years_till_unicorn"]]
step2 = step1.groupby("Industry")
step3 = step2.max()
grouped = step3.sort_values(by="years_till_unicorn")
is the same as the chained code:
grouped = (companies_sample[["Industry", "years_till_unicorn"]]
    .groupby("Industry")
    .max()
    .sort_values(by="years_till_unicorn")
)
108
plt.title()
To add a title to a plot created with pandas plotting functionality, which uses Matplotlib as its backend, you can use the plt.title() function from the matplotlib.pyplot module
109
plt.xlabel
The plt.xlabel() function from Matplotlib can be used with Pandas plotting to set the label for the x-axis. This function should be called after generating the plot to modify the current active axes.
110
plt.ylabel
The plt.ylabel() function from Matplotlib can be used with Pandas plotting to set the label for the y-axis. This function should be called after generating the plot to modify the current active axes.
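Putting the three labeling functions together after a pandas plot (column names and values are hypothetical):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'strikes': [120, 90, 150]})
df.plot(x='month', y='strikes', kind='bar')  # pandas plotting, Matplotlib backend
plt.title('Lightning strikes by month')      # title on the current active axes
plt.xlabel('Month')                          # x-axis label
plt.ylabel('Number of strikes')              # y-axis label
plt.show()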
111
plt.xticks
The plt.xticks() function from Matplotlib is used to get or set the x-axis tick locations and labels. When working with Pandas plots, which use Matplotlib as their default backend, you can use plt.xticks() to customize the axis appearance after calling the pandas.DataFrame.plot() method. Example: # Rotate labels on the x-axis as a way to avoid overlap in the positions of the text plt.xticks(rotation=45, horizontalalignment='right')
112
plt.show()
To display plots generated from a pandas DataFrame or Series, you need to use the plt.show() function from the Matplotlib library, which is the backend plotting library pandas uses.
113
dt
Note that it's not uncommon to import the datetime module from Python's standard library as dt. You may have encountered this yourself. In such a case, dt is being used as an alias. The pandas .dt Series accessor (as demonstrated in the last example) is a different thing entirely.
Example:
print(my_series.dt.year)
print(my_series.dt.month)
print(my_series.dt.day)
Output:
0    2023
1    2023
2    2023
dtype: int64
0    1
1    4
2    6
dtype: int64
0    20
1    27
2    15
dtype: int64
114
strftime
# Create four new columns. strftime is short for "string format time." We will use this method on the datetime data in the week column, and it will extract the information we specify, formatted as a string.
Example: let's create four new columns: week, month, quarter, and year. You can find a full list of available codes to use in the strftime format codes documentation. We will use %Y for year, %V for week number, and %q for quarter.
df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')
115
pyplot as plt
Helpful package for creating bar, line, and pie charts. A package in the matplotlib library. To use: import matplotlib.pyplot as plt
116
pandas
functions and commands that help you work with data sets. To use: import pandas as pd
117
seaborn
visualization library that produces charts To use: import seaborn as sns
118
Why use .sum() (pandas)?
This is the aggregation step. After grouping, pandas needs to know: "When there are multiple rows per week, how do I combine them?" .sum() answers that question.
Practice question: If you were grouping a table by the column week using the following code, what does sum() do?
df_by_week_2018 = df[df['year'] == '2018'].groupby(['week']).sum().reset_index()
df_by_week_2018.head()
Answer: sum() adds up all numeric columns within each week and collapses many rows per week into one row per week.
119
What does .reset_index() do in the following code?
df_by_week_2018 = df[df['year'] == '2018'].groupby(['week']).sum().reset_index()
df_by_week_2018.head()
.reset_index()
-Moves week out of the index
-Turns it back into a regular column
-Creates a clean, flat DataFrame
Result:
week | strikes
1 | 355
2 | 190
Step | Purpose
groupby('week') | Define grouping
sum() | Collapse rows → one per week
reset_index() | Make the result usable
You usually need all three when:
-You're aggregating data
-You plan to inspect, plot, or reuse the result
120
Sorting
The process of arranging data into meaningful order
121
Extracting
The process of retrieving data from a dataset or source for further processing
122
Filtering
The process of selecting a smaller part of your dataset based on specified parameters and using it for viewing or analysis.
123
Slicing
A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints.
124
Grouping
Aggregating individual observations of a variable into groups
125
Merging
Method to combine two different data frames along a specified starting column
126
df.merge()
A method available to the DataFrame class. Use df.merge() to take columns or indices from other dataframes and combine them with the one to which you're applying the method.
Example:
df1.merge(df2, how='inner', on=['month','year'])
127
pd.concat()
pd.concat()
A pandas function to combine series and/or dataframes.
Use pd.concat() to join columns, rows, or dataframes along a particular axis.
Example:
df3 = pd.concat([df1.drop(['column_1','column_2'], axis=1), df2])
128
df.join()
A method available to the DataFrame class. Use df.join() to combine columns with another dataframe either on an index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list. Example: df1.set_index('key').join(df2.set_index('key'))
129
df[[columns]]
Use df[[columns]] to extract/select columns from a dataframe. Example: df[['animal', 'legs']]
130
df.select_dtypes()
A method available to the DataFrame class. Use df.select_dtypes() to return a subset of the dataframe’s columns based on the column dtypes (e.g., float64, int64, bool, object, etc.). Example: df2 = df.select_dtypes(include=['int64'])
131
df[condition]
Use df[condition] to create a Boolean mask, then apply the mask to the dataframe to filter according to selected condition. Example: df[df['class']=='Aves']
132
df.sort_values()
A method available to the DataFrame class. Use df.sort_values() to sort data according to selected parameters. Example: df.sort_values(by=['legs'], ascending=False)
133
df.iloc[]
Use df.iloc[] to slice a dataframe based on an integer index location.
Examples:
df.iloc[5:10, 2:] → selects only rows 5 through 9, at columns 2+
df.iloc[5:10] → selects only rows 5 through 9, all columns
df.iloc[1, 2] → selects value at row 1, column 2
df.iloc[[0, 2], [2, 4]] → selects only rows 0 and 2, at columns 2 and 4
134
df.loc[]
Use df.loc[] to slice a dataframe based on a label or Boolean array. Example: df.loc[:, ['color', 'class']]
135
Histograms
Histograms are commonly used to illustrate the shape of a distribution, including the presence of any outliers, the center of the distribution, and the spread of the data. Histograms are typically represented by a series of bars, where each bar represents a range of values. Bar height represents the frequency or count of the data points within that range. Histograms are an essential tool for understanding the characteristics of a dataset. They provide a visual representation of the data’s distribution and enable data professionals to identify patterns, trends, or outliers within the data. Histograms can also help data professionals choose appropriate statistical tests and models for the data and determine whether the data meets any assumptions required for the analysis. Histograms are widely used in any field and any situation that requires any kind of data analysis, including finance, health care, engineering, and social sciences.
136
Common Shapes of Histograms
Symmetric: A symmetric histogram has a bell-shaped curve with a peak in the middle, indicating that the data is evenly distributed around the mean. This is also known as a normal, or Gaussian, distribution. Skewed: A skewed histogram has a longer tail on one side than the other. A right-skewed histogram has a longer tail on the right side, indicating that there are more data points on the left side of the histogram. A left-skewed distribution has a longer tail on the left side, indicating more data points on the right side. Bimodal: A bimodal histogram has two distinct peaks, indicating that the data has two modes. Uniform: A uniform histogram has a flat distribution, indicating that all data points are evenly distributed.
137
How do you generate a histogram in matplotlib?
Use the hist() function in the pyplot module. The function can take many different arguments, but the primary ones are:
x: A sequence of values representing the data you want to plot. It can be a list, tuple, NumPy array, pandas series, and so on.
bins: The number of bins you want to sort your data into. The default value is 10, but this parameter can be an int, sequence, or string. If you use a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin. In other words, if bins = [1, 3, 5, 7], then the first bin is [1–3) (including 1, but excluding 3) and the second [3–5). The last bin, however, is [5–7], which includes 7. A string refers to a predefined binning strategy supported by numpy. Refer to the documentation for more information.
Example:
# Plot histogram with matplotlib pyplot
plt.hist(df['seconds'], bins=range(40, 101, 5))
plt.xticks(range(35, 101, 5))
plt.yticks(range(0, 61, 10))
plt.xlabel('seconds')
plt.ylabel('count')
plt.title('Old Faithful geyser - time between eruptions')
plt.show();
138
How do you generate a histogram in seaborn?
Use the sns.histplot() function. sns.histplot() can take many arguments. Here are some important ones:
x: A sequence of values representing the data you want to plot. It can be a list, tuple, NumPy array, pandas series, and so on.
bins: The number of bins you want to sort your data into. The default value is 10, but this parameter can be an int, sequence, or string. If you use a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin. In other words, if bins = [1, 3, 5, 7], then the first bin is [1–3) (including 1, but excluding 3) and the second [3–5). The last bin, however, is [5–7], which includes 7. A string refers to a predefined binning strategy supported by numpy. Refer to the documentation for more information.
binrange: Lowest and highest value for bin edges; can be used either with bins or binwidth; defaults to data extremes.
binwidth: Width of each bin; overrides bins but can be used with binrange.
Example:
# Plot histogram with seaborn
ax = sns.histplot(df['seconds'], binrange=(40, 100), binwidth=5, color='#4285F4', alpha=1)
ax.set_xticks(range(35, 101, 5))
ax.set_yticks(range(0, 61, 10))
plt.title('Old Faithful geyser - time between eruptions')
plt.show();
139
What does the following do? df.drop_duplicates().shape
Check for duplicates. Run data_frame.shape first. If the shape of the data is different after running data_frame.drop_duplicates(), you will know there were duplicate rows.
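A short sketch of this check, assuming a DataFrame named df is already loaded:
print(df.shape)                    # e.g., (3401012, 3)
print(df.drop_duplicates().shape)  # fewer rows here means duplicate rows were present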
140
What kind of graph represents the counts of samples based on a particular feature?
A histogram is a graphical representation of a frequency distribution, which shows how frequently each value in a dataset or variable occurs.
141
Frequency distribution
A frequency distribution is a table or graph that shows how often values occur within specific intervals.
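One way to build a simple frequency distribution in pandas (the values and bin edges are made up):
import pandas as pd
seconds = pd.Series([45, 47, 52, 55, 58, 61, 63, 88, 91, 95])
# Count how many values fall into each 10-second interval
freq = pd.cut(seconds, bins=range(40, 101, 10)).value_counts().sort_index()
print(freq)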
142
.dt.year == 2021 .dt.month == 6 .dt.day == 14 python, pandas
Use these when you have a column of type datetime64[ns] and you want to extract just the year, month, or day.
Example: The code below will filter results so it shows only companies that joined in 2021. If you didn't use .dt.year, no results would show, because the datetime includes year, month, and day, so it will never equal just 2021.
filtered_companies = companies[companies['Date Joined'].dt.year == 2021]
143
.reset_index When to use? python, pandas
groupby() puts your grouping keys into the index. reset_index() turns them back into columns. That's it.
Almost every "Do I need reset_index()?" question reduces to: Do I want my group labels as columns or as an index?
Example:
# Group `companies_2020_2021` by `Quarter Joined`.
# Aggregate by computing the average `Valuation` of companies that joined per quarter of each year.
# Save the resulting DataFrame in a new variable.
companies_by_quarter_2020_2021 = companies_2020_2021.groupby(by="Quarter Joined")["Valuation"].mean().reset_index().rename(columns={"Valuation": "Average Valuation"})
*General rule: If the next method uses columns=, you almost always need reset_index() first.
TL;DR
groupby() → group keys become the index
reset_index() → index becomes columns
Need to rename, plot, merge, or export? → reset
Doing more math? → don't reset yet
144
Confidence Interval
In statistics, the parameters of a population are often estimated based on a sample. Examples of parameters that can be estimated: the mean or the variance.
Example: You want to know the height of all professional basketball players in the US. For this, you draw a sample. The mean of the sample is likely different from that of the population. If you drew many samples (something that you are unlikely to do), each sample is likely to show a different mean. You will then have a range (the highest mean from the samples to the lowest mean from the samples) in which the true value will lie with a high probability. This range is known as the confidence interval.
You will often hear someone state that the confidence interval is 95% (sometimes you may hear 99%). This means that you can be 95% sure that the true parameter lies within this interval (or 99% sure). For example, imagine looking at a normal distribution (bell curve). If a confidence interval of 95% is selected, 95% of all values lie within the lower limit (far left of curve) and upper limit (far right of the curve).
Note: The confidence interval can be calculated for many different statistical parameters, not only for the mean value. The confidence interval simply states which range the parameter lies within with a certain probability (e.g., 95%).
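As a rough sketch, one common way to compute a 95% confidence interval for a sample mean uses the t-distribution (the heights below are fabricated for illustration):
import numpy as np
from scipy import stats
heights = np.array([198, 201, 195, 210, 204, 199, 207, 202])  # sample heights in cm
mean = heights.mean()
sem = stats.sem(heights)  # standard error of the mean
# 95% confidence interval for the population mean
ci_low, ci_high = stats.t.interval(0.95, len(heights) - 1, loc=mean, scale=sem)
print(mean, (ci_low, ci_high))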
145
Name a common bias in time data
If you're comparing recent results to older ones, you may not have all the necessary results for recent years because it's too early to collect all of them. For example, say that you're looking at the number of years it took US-based companies to hit unicorn status (i.e., hit a $1B valuation) and your dataset includes companies founded from 1900 to the current year. The trouble is that the companies founded in recent years (let's say 2020 to 2025) will only include the fast-growing companies (i.e., the companies that hit unicorn status within that time period). It will not yet include the slower-growing companies that were founded in those recent years. As such, the data will suggest that companies founded in recent years hit unicorn status (i.e., a $1B valuation) faster than companies founded back in the day. But that is misleading, as it's simply too early on. In 10, 15, or 20 years, there will be more companies founded in the 2020 to 2025 time period that have since hit unicorn status. This will lower the average number of years that it took companies founded in this time period to hit unicorn status.
146
IQR
Interquartile Range
It shows the spread of the middle 50% of the data (the range of that data).
Example: IQR in a box plot
IQR = Q3 (quartile 3) - Q1 (quartile 1)
6 = 90 - 84
So, the IQR is 6.
YouTube video: https://www.youtube.com/watch?v=QGSwRH0WgBg
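A small NumPy sketch of computing the IQR (the data values are made up):
import numpy as np
data = np.array([78, 82, 84, 85, 87, 88, 90, 93])
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1  # spread of the middle 50% of the data
print(q1, q3, iqr)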
147
Tukey's rule
Tukey's rule is for boxplots.
Lower bound: Q1 - 1.5 * IQR
Upper bound: Q3 + 1.5 * IQR
Any observations beyond these bounds are flagged as potential outliers.
Q1: Quartile 1
Q3: Quartile 3
IQR: Interquartile range. Shows the middle 50% spread of the data. Calculated as Q3 - Q1.
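Continuing the sketch from the IQR card, Tukey's fences could be computed as follows (the data is made up; 150 plays the role of a suspicious value):
import numpy as np
data = np.array([78, 82, 84, 85, 87, 88, 90, 93, 150])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr   # 75.0
upper_bound = q3 + 1.5 * iqr   # 99.0
# Flag observations beyond the bounds as potential outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers)  # [150]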
148
How to deal with outliers, general guidelines
Delete them: If you are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then you are more likely to decide to delete outliers. Of the three choices, you'll use this one the least.
Reassign them: If the dataset is small and/or the data will be used for modeling or machine learning, you are more likely to choose a path of deriving new values to replace the outlier values.
Leave them: For a dataset that you plan to do EDA/analysis on and nothing else, or for a dataset you are preparing for a model that is resistant to outliers, it is most likely that you are going to leave them in.
149