What tools are commonly used for summarizing data?
What are the four typical steps in the data analysis process?
What is a population in statistics?
A population is the complete set of all entities of interest in a study. It includes every possible member that fits the criteria you’re studying.
Studying an entire population is often impractical because of time, cost, and logistics.
What is a sample, and why do we use samples instead of populations?
A sample is a subset of the population that is used to collect data. Sampling allows researchers to make estimates or generalizations about the population without needing to gather data from every member.
A sample must be representative of the population to avoid biased results.
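As a rough illustration of estimating a population value from a sample, the sketch below draws a simple random sample and compares its mean with the population mean. The population (made-up ages), the sample size of 500, and the fixed seed are all assumptions chosen for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 people (made-up values)
population = [random.randint(18, 90) for _ in range(10_000)]

# Simple random sample: every member has the same chance of selection
sample = random.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

With a random sample, the two means should usually be close. A sample drawn in a biased way (for example, only the youngest members) would not give a reliable estimate.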
What is a dataset?
A dataset is an organized collection of data, typically arranged in a rectangular table format with variables in columns and observations in rows.
What are variables in a dataset?
Variables are the characteristics or attributes that are measured in a study. In a dataset, they are represented by columns.
What are observations in a dataset?
An observation is a single instance or row in a dataset. Each observation contains data for all the variables measured on one subject.
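One way to picture the row-and-column layout is a small table held in code. In this sketch the names and values are hypothetical: each dictionary is one observation (row), and each key is one variable (column).

```python
# Hypothetical dataset: each dict is one observation (row),
# each key is one variable (column). Values are made up.
dataset = [
    {"name": "Ana", "age": 34, "eye_color": "brown"},
    {"name": "Ben", "age": 27, "eye_color": "blue"},
    {"name": "Cara", "age": 45, "eye_color": "green"},
]

variables = list(dataset[0].keys())  # column names
n_observations = len(dataset)        # number of rows

# Reading one variable across all observations gives a column
ages = [row["age"] for row in dataset]

print(variables)
print(n_observations)
print(ages)
```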
Why is the distinction between population and sample not always important when analyzing datasets?
The distinction between population and sample becomes crucial when you want to generalize findings beyond your dataset (such as making predictions or decisions). However, if your focus is strictly on analyzing the data at hand without making broader generalizations, whether the data comes from a full population or a sample doesn’t immediately affect the analysis process.
What are the 3 types of data?
A variable is numeric if meaningful arithmetic can be performed on it.
What are numeric variables?
Numeric Variables are numbers that represent quantities or measurements where mathematical operations like addition, subtraction, or averaging make sense.
For example, age, income, or temperature are numeric because it makes sense to perform calculations on them.
What are categorical variables?
Categorical Variables represent categories or groups that label data points. These can be numbers, but their purpose is to identify rather than quantify.
For example, eye color, brands, or types of fruit are categorical variables. Even when categories are coded as numbers, mathematical operations on them don’t make sense.
Why are phone numbers, ZIP codes, and Social Security numbers categorical?
Their primary role is identification, not measurement, so they are categorical data even though they are written with digits.
What are dummy variables?
Dummy variables allow categorical data to be used in numerical models by converting categories into 0s and 1s (a series of binary values).
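A minimal sketch of the idea, using only plain Python: the helper `to_dummies` and the eye-color values are hypothetical, and the function simply builds one 0/1 column per distinct category.

```python
def to_dummies(values):
    """Return one 0/1 column per distinct category in values."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


eye_colors = ["brown", "blue", "brown", "green"]  # made-up data
dummies = to_dummies(eye_colors)

for category, column in dummies.items():
    print(category, column)
```

Each observation has a 1 in exactly one column. In regression models, one of the dummy columns is usually dropped so that the remaining columns are not perfectly redundant.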
What is binning?
Binning is a data preprocessing technique where continuous numerical data is grouped into discrete intervals, or “bins.” This helps to simplify data, reduce noise, and make patterns more visible in analyses, especially when dealing with large ranges of values.
Binning turns continuous data into categorical data by grouping values into ranges such as 0-10.
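Binning can be sketched in a few lines. This example assumes non-negative integer values and a fixed bin width of 10, so the bins are labeled 0-9, 10-19, and so on; the helper `bin_value` and the ages are hypothetical.

```python
def bin_value(value, width=10):
    """Label the fixed-width interval that contains a non-negative integer."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


ages = [3, 17, 25, 29, 41]  # made-up data
binned = [bin_value(age) for age in ages]

print(binned)
```

After binning, the labels can be counted like any other categorical variable. The choice of bin width is a judgment call: wider bins smooth out more detail, narrower bins keep more of it.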
What is VLOOKUP (Vertical Lookup)?
VLOOKUP (Vertical Lookup) is a function in Excel that allows you to search for a specific value in the first column of a table and return a corresponding value from a specified column in the same row.
How does VLOOKUP work?
VLOOKUP stands for Vertical Lookup because it searches for values vertically down the first column of the table.
VLOOKUP syntax:
VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])
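To make the lookup logic concrete, here is a rough Python sketch of the exact-match case (the behavior when `range_lookup` is FALSE). It is an analogy, not Excel's implementation: the function `vlookup`, the product table, and the use of `None` in place of Excel's #N/A error are assumptions for illustration, and details such as Excel's handling of text matching are left out.

```python
def vlookup(lookup_value, table, col_index_num):
    """Exact-match lookup: scan the first column from top to bottom and
    return the value from the 1-based column col_index_num of that row."""
    for row in table:
        if row[0] == lookup_value:
            return row[col_index_num - 1]
    return None  # stands in for Excel's #N/A when nothing matches


# Hypothetical table: product code, product name, unit price
products = [
    ["A101", "Widget", 2.50],
    ["B202", "Gadget", 7.25],
    ["C303", "Gizmo", 4.00],
]

price = vlookup("B202", products, 3)
missing = vlookup("Z999", products, 2)

print(price)
print(missing)
```

As in Excel, the search only looks at the first column of the table, and the first matching row wins.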
2 types of datasets:
What is cross-sectional data?
Cross-sectional data is data collected from multiple subjects (such as people, companies, or countries) at a single point in time or over a very short period. It provides a “snapshot” of the situation at that specific moment.
It is used to compare and analyze differences between subjects.
What is time series data?
Time series data involves tracking one or more variables over a sequence of time periods. It captures how something changes over time, focusing on temporal trends rather than comparing different subjects.
It is used to study trends, make forecasts, and understand how variables evolve.
What are the key differences between cross-sectional and time series data?
Data Collection:
* Cross-Sectional: Collected at a single point in time from multiple subjects.
* Time Series: Collected at multiple time points, focusing on the same variable(s) over time.
Focus of analysis:
* Cross-Sectional: Analyzes differences or relationships between subjects at one point in time.
* Time Series: Analyzes patterns, trends, and changes in data over time.
Applications:
* Cross-Sectional: Useful for comparing groups, identifying relationships, or describing a population at a specific moment.
* Time Series: Useful for identifying trends, detecting seasonality, and making predictions about the future.
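The contrast can be sketched with two small structures. All store names, periods, and figures below are hypothetical: the first holds several subjects at one point in time, the second holds one subject across several periods.

```python
# Cross-sectional: several subjects, one time point (made-up figures)
cross_section = {"Store A": 120, "Store B": 95, "Store C": 143}

# Time series: one subject, several time points (made-up figures)
time_series = [
    ("2023-Q1", 100),
    ("2023-Q2", 110),
    ("2023-Q3", 125),
    ("2023-Q4", 120),
]

# Cross-sectional question: which subject is highest at this moment?
top_store = max(cross_section, key=cross_section.get)

# Time series question: how did the value change from period to period?
values = [value for _, value in time_series]
changes = [later - earlier for earlier, later in zip(values, values[1:])]

print(top_store)
print(changes)
```

The cross-sectional comparison ignores time entirely, while the time series calculation depends on the order of the periods.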
How can categorical variables be summarized, and why is counting important?
Summarizing categorical variables involves counting the occurrences of each category and presenting these counts as raw numbers or percentages. Since categorical variables represent groups or labels rather than quantities, arithmetic operations are inappropriate, making counting the most straightforward and meaningful way to summarize these variables.
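A minimal sketch of counting categories and converting the counts to percentages, using Python's standard `collections.Counter`. The survey responses are made up for the example.

```python
from collections import Counter

responses = ["yes", "no", "yes", "yes", "undecided", "no"]  # made-up data

counts = Counter(responses)  # occurrences of each category
total = sum(counts.values())

# Share of each category, as a percentage rounded to one decimal place
percentages = {
    category: round(100 * n / total, 1) for category, n in counts.items()
}

for category, n in counts.most_common():
    print(f"{category}: {n} ({percentages[category]}%)")
```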
What are categorical variables?
Categorical variables represent data that can be divided into distinct groups or categories. They do not hold numerical values that have a meaningful order or scale but instead describe characteristics or attributes like gender, region, or opinion.
What are the steps for summarizing categorical variables?