What is a corpus (plural: corpora), and how does it differ from a data set used in other kinds of research?
A corpus is a collection of language data that is representative of some aspect of language production and use. It differs from other data sets because it is BIG and SHARED.
How does a corpus get made?
Method 1: Curation
Method 2: Data in the wild
Method 3: Scraping
What is the impact of the distribution of available
corpora on the research being done or the language technology being
created?
The distribution of available corpora strongly shapes both research and the development of language technologies, as models trained on skewed or high-resource datasets often perform poorly on underrepresented languages, dialects, or domains.
This imbalance can reinforce social and cultural biases, limit technological access for certain communities, and influence which research questions and languages receive attention.
Consequently, narrow or uneven corpora constrain theoretical insights, reduce model generalizability, and perpetuate cycles of neglect in language technology development.
What is web scraping?
Web scraping is the automated process of extracting data from websites for research or
analysis.
How do you scrape responsibly?
What do you think is the “right” way of collecting data, keeping in mind both legal and ethical
concerns?
The “right” way to collect data involves balancing legal compliance with ethical responsibility. Legally, data collection should adhere to privacy laws and regulations, which typically means obtaining informed consent, protecting personally identifiable information, and following rules on data storage and sharing.
Ethically, researchers should ensure transparency about how data will be used, avoid exploiting vulnerable populations, minimize harm, respect cultural and linguistic contexts, and strive for fairness and inclusivity, especially when creating corpora that will shape language technologies.
What do data protection laws do, and what else must researchers consider?
These laws determine how Personally Identifiable Information (PII) – such as names,
addresses, or metadata – must be handled and anonymized. In addition to privacy,
researchers must consider:
What is Beautiful Soup?
BeautifulSoup is a Python library for parsing and cleaning HTML files. It helps turn raw web pages into usable text data for analysis.
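A minimal sketch of turning HTML into plain text, assuming the beautifulsoup4 package is installed (pip install beautifulsoup4). The HTML string and the variable names are made up for illustration; a real scraper would download the page first.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML page standing in for a downloaded web page.
raw_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Geese</h1>
    <p>I am a <b>goose</b>.</p>
    <p>Honk!</p>
  </body>
</html>
"""

# Parse the HTML using Python's built-in parser.
soup = BeautifulSoup(raw_html, "html.parser")

# Keep only the text of each <p> tag, dropping the markup around it.
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(soup.title.string)
print(paragraphs)
```

Here find_all collects every matching tag and get_text strips the tags inside it, which is usually the first cleaning step before the text goes into a corpus.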
What format do command lines generally take? (Code)
command_name --flags (flag_value) argument
What command is used to change directory?
cd
What is the shortcut for the home directory?
~
What symbol indicates your current directory?
A single period (.)
What symbol indicates your parent directory?
A double period (..)
What symbol is used for a parent’s parent directory?
Double periods can be stacked:
../../
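The same symbols can be tried out from Python, which may help when checking where a path really points. This is only a sketch: the starting directory below is hypothetical, and posixpath is used on the assumption of Unix-style paths so the result is the same on any operating system.

```python
import os
import posixpath

# "~" expands to the home directory of whoever runs the script.
home = os.path.expanduser("~")

# A hypothetical starting directory, used only for illustration.
start = "/home/user/projects/corpus"

# "." stays in place, ".." goes up one level, "../.." goes up two.
same_dir = posixpath.normpath(posixpath.join(start, "."))
parent = posixpath.normpath(posixpath.join(start, ".."))
grandparent = posixpath.normpath(posixpath.join(start, "../.."))

print(home)
print(same_dir)     # /home/user/projects/corpus
print(parent)       # /home/user/projects
print(grandparent)  # /home/user
```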
What is a function in Python, and what are functions useful for?
A function is a reusable block of code that performs a specific task.
Functions help organize code, making it easier to read and debug. Functions can take inputs (arguments) and return outputs.
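As a small illustration, here is one possible function that takes an input and returns an output. The name count_words is made up for this example.

```python
def count_words(sentence):
    """Return the number of whitespace-separated words in a sentence."""
    words = sentence.split()
    return len(words)


# The function can be reused with different inputs.
total = count_words("I am a goose")
print(total)  # 4
```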
What is an if statement?
An if statement is used to execute a block of code if a specified
condition is true. It is the basic way to make decisions in your program.
What is an if-else statement?
You can use else to specify a block of code that will execute if the
condition is false.
What is elif?
elif (short for else if) allows you to check multiple conditions in a single if statement.
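The three keywords can be seen working together in a short sketch. The function name and the length thresholds are arbitrary choices for illustration.

```python
def describe_length(word):
    """Label a word as short, medium, or long (thresholds are arbitrary)."""
    if len(word) < 4:
        # Runs only when the first condition is true.
        return "short"
    elif len(word) < 8:
        # Checked only when the first condition was false.
        return "medium"
    else:
        # Runs when none of the conditions above were true.
        return "long"


print(describe_length("is"))        # short
print(describe_length("goose"))     # medium
print(describe_length("sentence"))  # long
```

Python checks the conditions from top to bottom and runs only the first block whose condition is true.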
What is a string (str)?
Used for text, stored as the literal characters written
e.g., "Hello", "I am a goose", "3"
Strings are defined with single or double quotes
What is an integer (int)?
Used for numbers that do not have decimals
Example: 0, -5, 15
What is a float?
Used for numbers with decimals
Example: 3.14, -0.5, 2.0
What is a boolean (bool)?
Takes either the value True or False
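One way to check which type a value has is the built-in type function. The values below are just examples; note that "3" in quotes is a string, not a number.

```python
values = ["Hello", "3", 15, -5, 3.14, True]

# Show each value next to the name of its type.
for value in values:
    print(repr(value), type(value).__name__)

type_names = [type(value).__name__ for value in values]

# A string that looks like a number must be converted before doing maths.
converted = int("3") + 1
print(converted)  # 4
```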
What is a list?
A collection with an order.
Lists can be indexed, meaning you can reference an item according to its position in the list. Positions are counted from 0.
Example:
list1 = ["This", "is", "a", "sentence"]
print(list1[3])  # prints: sentence
What is a tuple?
It is very similar to a list, but it cannot be edited after it is created (it is immutable). This also tends to make tuples slightly more memory efficient than lists.
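The difference can be seen in a short sketch. The variable names are made up, and the exact sizes reported by sys.getsizeof depend on the Python version, so treat the memory comparison as a rough indication only.

```python
import sys

words_list = ["This", "is", "a", "sentence"]
words_tuple = ("This", "is", "a", "sentence")

# Both can be indexed in the same way.
print(words_list[0], words_tuple[0])

# A list can be edited in place.
words_list[3] = "list"

# Editing a tuple raises a TypeError instead.
try:
    words_tuple[3] = "tuple"
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(words_list)
print(tuple_changed)

# Size in bytes of each container (values vary between Python versions).
print(sys.getsizeof(words_list), sys.getsizeof(words_tuple))
```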