Pandas Flashcards

(110 cards)

1
Q

How do we read in a csv using pandas?

A

pd.read_csv('data/file.csv')

2
Q

How do we access a column in pandas?

A

df['Column']

3
Q

How do we get how many unique names there were in each year?

A

df['Year'].value_counts()

4
Q

How do we query for a given year?

A

df[(df['Year'] == 1800)]

5
Q

If we want to get the number of unique names for a certain year, how do we do that?

A

df[(df['Year'] == 1800)].value_counts('Name')
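
A minimal sketch of the counting and filtering cards above, assuming a small made-up table with Name, Year, and Count columns (the data is hypothetical, not the deck's dataset). It also uses .nunique(), which is not on the cards, to get the number of distinct names as a single number:

```python
import pandas as pd

# Hypothetical baby-names table: one row per (Name, Year) pair.
df = pd.DataFrame({
    "Name": ["Mary", "Anna", "Mary", "Lena", "Anna"],
    "Year": [1800, 1800, 1801, 1801, 1801],
    "Count": [10, 7, 12, 5, 6],
})

# Number of rows for each year. This equals the number of unique names per
# year only because each name appears at most once per year in this table.
rows_per_year = df["Year"].value_counts()

# Boolean query: keep only the rows from 1800.
in_1800 = df[df["Year"] == 1800]

# Number of distinct names recorded in 1800.
unique_names_1800 = in_1800["Name"].nunique()
```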

6
Q

How many babies were recorded per year?

A

df.groupby('Year')['Count'].sum()

7
Q

What does the following statement give us?

(df
    .assign(first_letter=df['Name'].str[0])
    .query('first_letter == "L"')
    .groupby('Year')
    ['Count']
    .sum()
    .plot())

A

We are plotting a graph that shows the number of babies born with an “L” name per year
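
To see what the chain computes before it is plotted, here is a sketch that runs the same steps without the final .plot() call, assuming a tiny hypothetical table (the names and counts are made up):

```python
import pandas as pd

# Hypothetical data with the same column names as the card.
df = pd.DataFrame({
    "Name": ["Lena", "Mary", "Lena", "Lucy", "Mary"],
    "Year": [1800, 1800, 1801, 1801, 1801],
    "Count": [5, 10, 6, 4, 12],
})

l_per_year = (
    df
    .assign(first_letter=df["Name"].str[0])  # add a column holding each name's first letter
    .query('first_letter == "L"')            # keep only the names starting with "L"
    .groupby("Year")["Count"]                # one group per year, Count column only
    .sum()                                   # total babies per year
)
```

The result is a Series indexed by Year, which is what .plot() would draw as a line.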

8
Q

How would we make a function to create a name graph for any specific name?

A

def name_graph(name):
    # Plot the total Count per Year for the rows whose Name equals `name`.
    return (df
        .query(f'Name == "{name}"')
        .groupby('Year')
        ['Count']
        .sum()
        .plot(title=f'Number of Babies Born Named "{name}" Per Year'))

9
Q

What does df.head(2) return?

A

the first two rows of the dataframe df

10
Q

What does this code do?

whoa = np.random.choice([True, False], size=len(dogs))
(dogs[whoa]
    .groupby('size')
    .max()
    .get('longevity')
)

A

whoa is a random Boolean array with one entry per row of dogs. dogs[whoa] keeps only the rows where whoa is True, which is a random subset of the rows (about half of them). We then group by the size column, take the maximum of each column within each group, and retrieve the longevity column. This gives us the max longevity per dog size within that random subset.
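
A sketch of the same idea on a made-up dogs table (the breeds and numbers are hypothetical). It swaps np.random.choice for a seeded generator, np.random.default_rng(0), only so that the sketch is repeatable:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names follow the card.
dogs = pd.DataFrame({
    "breed": ["Beagle", "Pug", "Boxer", "Mastiff", "Corgi", "Great Dane"],
    "size": ["small", "small", "large", "large", "small", "large"],
    "longevity": [12.3, 11.0, 8.8, 6.5, 12.5, 6.9],
})

rng = np.random.default_rng(0)                     # seeded so reruns pick the same rows
whoa = rng.choice([True, False], size=len(dogs))   # one random True/False per row

subset = dogs[whoa]                                # keep the rows where whoa is True
max_longevity = subset.groupby("size").max().get("longevity")
```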

11
Q

What is numpy’s main object?

A

the array

12
Q

What are the two traits of numpy arrays?

A

they are homogeneous (all values are of the same type) and potentially multidimensional

13
Q

What does np.arange(10) give us?

A

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
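
np.arange also accepts a start and a step. A short sketch of the three common call shapes (the variable names are just for illustration):

```python
import numpy as np

first_ten = np.arange(10)            # 0 up to, but not including, 10
stepped = np.arange(2, 10, 3)        # start at 2, step by 3, stop before 10
bin_edges = np.arange(0, 3.4, 0.5)   # float steps work too: 0, 0.5, ..., 3.0
```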

14
Q

How do we import pandas into a jupyter notebook?

A

import pandas as pd

15
Q

What does df.get('Population') return?

A

a series

16
Q

What is a series?

A

like an array but with an index

17
Q

When can we perform arithmetic operations with two series?

A

whenever they have the same length and the same index; the operation is done element by element, like with arrays, and pandas matches the elements up by index label
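
A sketch of why the index matters, using three made-up Series. When the indexes differ, pandas still matches by label and fills in NaN wherever a label is missing from one side:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["x", "y", "z"])   # same index as a
c = pd.Series([10, 20, 30], index=["y", "z", "w"])   # different index

same_index = a + b    # element-wise sum, label by label
mixed_index = a + c   # labels "x" and "w" appear on one side only, so they become NaN
```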

18
Q

How do we assign a new column?

A

df.assign(new_col=df.get('Population') / df.get('Land Area'))

19
Q

When assigning a new column, what do we not want to do?

A

put quotes around the name of the new column

20
Q

If we want to use the dataframe with a newly assigned column, what must we do?

A

assign it to a variable like new_df = df.assign(…)
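
A sketch showing that .assign leaves the original dataframe untouched, assuming a made-up two-row table (the State labels and numbers are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["A", "B"],
    "Population": [1000, 500],
    "Land Area": [10, 25],
})

# The new column's name is written as a keyword, without quotes.
new_df = df.assign(Density=df.get("Population") / df.get("Land Area"))

# df itself is unchanged; only new_df has the Density column.
has_density_before = "Density" in df.columns
has_density_after = "Density" in new_df.columns
```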

21
Q

What kind of methods can we use on series?

A

.min(), .max(), .mean()

22
Q

How would we get the median of a series?

A

df.get('Density').median()

23
Q

How do we get descriptive values of a column?

A

using .describe() on the specific column

24
Q

What is the syntax to sort a dataframe?

A

.sort_values(by='column_name')

25
What order does .sort_values() use by default?
ascending order
26
How do we specify that we want to sort a dataframe by descending order?
.sort_values(by='col_name', ascending=False)
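
A sketch of both sort orders on a made-up table (hypothetical values). Like .assign, .sort_values returns a new dataframe and leaves the original as it was:

```python
import pandas as pd

df = pd.DataFrame({"State": ["A", "B", "C"], "Density": [20.0, 100.0, 55.5]})

ascending = df.sort_values(by="Density")                     # smallest first (default)
descending = df.sort_values(by="Density", ascending=False)   # largest first
```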
27
How do we retrieve specific information from a series?
using .iloc
28
What does .iloc stand for?
integer location
29
How do we navigate to a particular entry of a series?
since a series is like an array, we use .iloc[integer_position], and positions start at 0
30
How do we change the index of a dataframe to a specific column's values?
.set_index('col_name')
31
What does .set_index() return?
a copy of the dataframe; if we want to keep using it, we have to assign it to a variable
32
How do we access the element of a series with a particular row label?
using .loc[]
33
For this example, get the density of Pennsylvania.
df.get('Density').loc['Pennsylvania']
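
A sketch that puts .set_index, .loc, and .iloc side by side, assuming a three-row table whose density values are made up for illustration:

```python
import pandas as pd

# Hypothetical values; only the column names follow the cards.
df = pd.DataFrame({
    "State": ["Ohio", "Pennsylvania", "Texas"],
    "Density": [288.0, 291.0, 112.0],
})

by_state = df.set_index("State")      # returns a copy indexed by State
density = by_state.get("Density")     # a Series with state names as row labels

by_label = density.loc["Pennsylvania"]   # look up by row label
by_position = density.iloc[1]            # look up by integer position (0-based)
```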
34
When will the integer position and row label be the same?
when the dataframe still has the default index 0, 1, 2, ..., such as right after we read in the csv
35
How do we query to get only the region in the west?
df[df.get('Region') == 'West']
36
What do we have to do if we want to use the dataframe that we queried into?
assign it to a new variable
37
What is a query?
code that extracts rows from a dataframe for which certain condition(s) are true
38
Why do we use queries to filter DataFrames?
to keep only the rows that satisfy given conditions
39
What do we use .shape for?
to get the number of rows and columns in a given DataFrame
40
What type of object is .shape?
an attribute, not a method: it describes the DataFrame, so we access it without parentheses
41
What does .shape[0] return?
the number of rows
42
What does .shape[1] return?
the number of columns
43
Give an example of what df.shape would return.
(31, 6)
44
How do we write a query with multiple conditions?
using & for "and" and | for "or", and wrapping each condition in parentheses within the brackets []
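
A sketch of both operators on a made-up table (the State labels and populations are hypothetical). The parentheses around each condition are required because & and | bind more tightly than == and >:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Region": ["West", "West", "South", "Midwest"],
    "Population": [8_000_000, 1_000_000, 6_000_000, 3_000_000],
})

# "and": both conditions must hold for a row to be kept.
west_and_big = df[(df.get("Region") == "West") & (df.get("Population") > 5_000_000)]

# "or": a row is kept if at least one condition holds.
west_or_south = df[(df.get("Region") == "West") | (df.get("Region") == "South")]
```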
45
How do we select rows in specific positions?
using .take(sequence_of_integer_positions), passing the positions of the rows we want as a list or array, e.g. [0, 1, 2]
46
What is another way, other than typing out the positions, to pass them to .take()?
.take(np.arange(3))
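
A sketch of .take with a typed-out list and with np.arange, on a made-up five-row table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"State": list("ABCDE"), "Density": [20.0, 100.0, 55.5, 7.0, 61.2]})

first_three = df.take([0, 1, 2])            # positions typed out
same_rows = df.take(np.arange(3))           # the same positions from np.arange
every_other = df.take(np.arange(0, 5, 2))   # positions 0, 2, 4
```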
47
How does .groupby('col_name') work?
it groups together the rows that share the same value in the specified column
48
What happens to the column that we grouped by?
it becomes the index
49
What do we have to use right after .groupby()?
an aggregation method such as .sum(), .mean(), .median(), .count(), .max(), or .min()
50
In general, how does .groupby() work?
it aggregates all rows with the same value in a specified column into a single row in the resulting DataFrame
51
What are some keywords we look for to know that we need to .groupby()?
per, for each, indexed by
52
How does the aggregation method work in the .groupby()?
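it is applied separately to each column; if the aggregation doesn't make sense for a column, that column disappears from the result (in recent pandas versions, this may require passing numeric_only=True to the aggregation method)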
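
A sketch of the groupby cards on a made-up table (the State labels and populations are hypothetical). numeric_only=True is passed explicitly so that the non-numeric State column is dropped regardless of the pandas version:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Region": ["West", "West", "South", "South"],
    "Population": [8, 2, 6, 5],
})

# One row per region; Region becomes the index and Population is summed.
per_region = df.groupby("Region").sum(numeric_only=True)

# .count() counts the rows in each group; any remaining column gives the group size.
states_per_region = df.groupby("Region").count().get("State")
```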
53
How do we drop old columns?
use .drop(columns=list_of_column_labels)
54
How do we get several specific columns in a DataFrame?
df.get(['col1', 'col2', 'col3'])
55
What happens if the list we pass into .get() only contains one column name?
we get a DataFrame with only that specific column
56
What is an individual?
a person/place/thing for which data is recorded, found in the rows
57
What is another name for an individual?
an observation
58
What is a variable?
something that is recorded for each individual, found in columns
59
What is another name for variable?
a feature
60
What are the two main types of variables?
numerical and categorical
61
What is a numerical variable?
a variable whose values it makes sense to do arithmetic with
62
What is a categorical variable?
a variable whose values fall into categories, which may or may not have some order to them
63
What type of visualization do we use if we are trying to visualize numerical vs. numerical variables?
scatter plot
64
What type of visualization do we use if we are trying to visualize sequential numerical (time) vs. numerical variables?
line plot
65
What type of visualization do we use if we are trying to visualize categorical vs. numerical variables?
bar chart
66
What type of visualization do we use if we are trying to visualize the distribution of a single numerical variable?
histogram
67
How would we plot the relationship between 'distance' and 'magnitude'?
df.plot(kind='scatter', x='Distance', y='Magnitude');
68
What is the x in .plot()?
the column to plot on the horizontal axis
69
What is the y in .plot()?
the column to plot on the vertical axis
70
What is the syntax for a line plot?
df.plot(kind='line', x='x_col', y='y_col');
71
What is the syntax for a bar chart?
df.plot(kind='barh', x='categorical_col', y='numerical_col');
72
What does the h in 'barh' stand for?
horizontal; if we want a vertical bar chart, we just omit the h and use kind='bar'
73
How would we plot multiple plots on the same axes?
df.get(['magnitude', 'radius']).plot(kind='barh');
74
How do we implement a density histogram?
df.plot(kind='hist', y='Radius', density=True, bins=np.arange(0, 3.4, 0.5), ec='w');
75
What do we generally use .count() for?
to count the non-missing values in each column; most often after .groupby(), to count the number of rows in each group
76
What happens when calling .plot and omitting the y=column_name?
all other numerical columns will be plotted
77
What is the distribution of a variable?
consists of all the values of the variable that occur in the data, along with their frequencies
78
What do distributions help us understand?
how often a variable takes on a certain value
79
How do we assign a title to a chart?
pass the title='title_here' argument to .plot()
80
What are some other optional arguments to .plot()?
legend, figsize, xlabel, and ylabel
81
How would we specify getting the series after using groupby?
df.groupby('col_name').mean().get('othercol_name')
82
What kind of chart is not the right choice for the distribution of a numerical variable?
a bar chart
83
What should the horizontal axis of a histogram be?
numerical, not categorical
84
What is binning?
the act of counting the number of numerical values that fall within ranges defined by two endpoints
85
What do we call the ranges in a histogram?
bins
86
How do we know if a value falls in a bin?
if it is greater than or equal to the left endpoint and less than the right endpoint
87
What do density histograms visualize?
the distribution of a single numerical variable by placing numbers into bins
88
What is the syntax to create a density histogram?
df.plot(kind='hist', y=col_name, density=True)
89
What do we use ec='w' for?
it sets the edge color of the bars to white; it's recommended for histograms because it makes it easier to see where bins start and end
90
What is the default number of bins that pandas will bin our data into?
10 equally sized bins
91
What is the total area of a histogram?
the bars of a density histogram have a combined total area of 1
92
What is the area of a bar equal to?
the proportion of all data points that fall into that bin
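
The bin and area cards can be checked numerically. This sketch uses np.histogram to compute the bar heights instead of drawing the chart, on a made-up list of radii (the values and bin edges are assumptions for illustration):

```python
import numpy as np

radii = np.array([0.2, 0.5, 0.7, 1.1, 1.4, 1.6, 2.3, 2.9])   # hypothetical data
edges = np.arange(0, 3.5, 0.5)                               # bins [0, 0.5), [0.5, 1), ...

# density=True scales the heights so that the bars' total area is 1.
heights, edges = np.histogram(radii, bins=edges, density=True)

widths = np.diff(edges)    # width of each bin
areas = heights * widths   # area of each bar = proportion of values in that bin
```

The value 0.5 lands in the second bin, not the first, because each bin includes its left endpoint and excludes its right one.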
93
What happens when we use 'some string'.split(' ')?
we end up getting a list like ['some', 'string'] which we can then index into
94
How would we apply a specific function to every element of a specific column in a DataFrame?
df.get('col').apply(func_name)
95
What type of method is the .apply method?
a Series method
96
What is the output of .apply?
also a Series
97
What is an incorrect way of passing the function into .apply()?
calling it, as in .apply(func_name()); we pass just the function's name, without parentheses
98
What other kind of functions can we use .apply with?
built-in functions like abs
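
A sketch of .apply with a function we define ourselves and with the built-in abs. The helper first_word and the data are hypothetical, made up for this example:

```python
import pandas as pd

def first_word(text):
    # Split on spaces and keep the first piece.
    return text.split(" ")[0]

df = pd.DataFrame({
    "Phrase": ["some string", "another example"],
    "Change": [-3, 5],
})

# Pass the function itself (no parentheses); pandas calls it on every element.
first_words = df.get("Phrase").apply(first_word)
magnitudes = df.get("Change").apply(abs)
```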
99
How does .reset_index() work?
turns the index of a DataFrame into a column, and resets the index back to the default of 0, 1, 2, 3, etc.
100
How would we group on multiple columns?
we would pass a list of column names to .groupby
101
What is the syntax for passing a list of column names to .groupby?
df.groupby(['col_1', 'col_2', ..., 'col_k'])
102
What is the order that we will be grouping by when passing a list into .groupby?
first by 'col_1', then within each group by 'col_2', and so on
103
What would the resulting DataFrame's rows look like?
one row per unique combination of entries in the specified columns
104
What do we usually want to do when we create a MultiIndex?
.reset_index() to flatten our DataFrame back to normal
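
A sketch of grouping on two columns and then flattening the result, assuming a made-up table with Year, Sex, and Count columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1800, 1800, 1800, 1801],
    "Sex": ["F", "F", "M", "F"],
    "Count": [10, 7, 4, 12],
})

# One row per unique (Year, Sex) combination; the index is a MultiIndex.
grouped = df.groupby(["Year", "Sex"]).sum()

# Turn Year and Sex back into ordinary columns with a default 0, 1, 2 index.
flat = grouped.reset_index()
```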
105
What does .merge do?
"merges" two DataFrames into one
106
What is the syntax to apply .merge?
left_df.merge(right_df, left_on='left_col_name', right_on='right_col_name')
107
What should left_on and right_on be?
column names; they don't have to be the same
108
What will the resulting DataFrame contain?
a single row for every match between the two columns
109
What happens to rows in either DataFrame without a match?
they disappear, since .merge performs an inner join by default
110
What can we use if the names of the columns we want to merge on are both the same?
we can just use on='col'
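
A sketch of the merge cards on two made-up tables (all names and ids are hypothetical). Rows without a match on either side are dropped, and on= works once both tables share the column name:

```python
import pandas as pd

left = pd.DataFrame({"student": ["Ann", "Bo", "Cy"], "dorm_id": [1, 2, 4]})
right = pd.DataFrame({"id": [1, 2, 3], "dorm": ["North", "South", "East"]})

# Different column names on each side: use left_on and right_on.
merged = left.merge(right, left_on="dorm_id", right_on="id")

# Same column name on both sides: on= is enough.
right_renamed = right.rename(columns={"id": "dorm_id"})
merged_on = left.merge(right_renamed, on="dorm_id")
```

Cy (dorm_id 4) and the East dorm (id 3) have no match, so neither appears in the result.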