final Flashcards

(173 cards)

1
Q

What is a corpus (plural: corpora), and how does it differ from a data set used in other kinds of research?

A

A corpus is a collection of language data that is representative of some aspect of language production and use. It differs from other data sets because it is BIG and SHARED.

2
Q

How does a corpus get made

A

Method 1: Curation

  • Researchers collect specific types of samples via lab elicitation, etc.

Method 2: Data in the wild

  • A collection of data exists for some non-NLP application

Method 3: Scraping

  • Language exists in a non-curated form, and you organize it
3
Q

What is the impact of the distribution of available
corpora on the research being done or the language technology being
created?

A

The distribution of available corpora strongly shapes both research and the development of language technologies, as models trained on skewed or high-resource datasets often perform poorly on underrepresented languages, dialects, or domains.

This imbalance can reinforce social and cultural biases, limit technological access for certain communities, and influence which research questions and languages receive attention.

Consequently, narrow or uneven corpora constrain theoretical insights, reduce model generalizability, and perpetuate cycles of neglect in language technology development.

4
Q

What is web scraping?

A

Web scraping is the automated process of extracting data from websites for research or
analysis.

5
Q

How do you scrape responsibly?

A
  1. Check whether the site expects scraping (many offer APIs)
  2. Review the site’s terms of service
  3. When possible, contact the webmaster for permission – especially for smaller or
    community-based sites
6
Q

What do you think is the "right" way of collecting data, keeping in mind both legal and ethical
concerns?

A

The “right” way to collect data involves balancing legal compliance with ethical responsibility. Legally, data collection should adhere to privacy laws and regulations, such as obtaining informed consent, protecting personally identifiable information, and following rules on data storage and sharing.

Ethically, researchers should ensure transparency about how data will be used, avoid exploiting vulnerable populations, minimize harm, respect cultural and linguistic contexts, and strive for fairness and inclusivity, especially when creating corpora that will shape language technologies.

7
Q

What do data protection laws do, and what else must researchers consider?

A

These laws determine how Personally Identifiable Information (PII) – such as names,
addresses, or metadata – must be handled and anonymized. In addition to privacy,
researchers must consider:

  • Copyright: Whether data can legally be reused
  • Patents and trademarks: Which may restrict data use
  • Data sovereignty: Who controls how information about individuals is shared across jurisdictions
8
Q

What is Beautiful Soup

A

BeautifulSoup is a Python library for parsing and cleaning HTML files. It helps turn raw web pages into usable text data for analysis.

9
Q

What is the format command lines generally take (Code)

A

command_name -flags (flag_value) argument

10
Q

What command is used to change directory

A

cd

11
Q

What is the shortcut for home directory

A

~

12
Q

What symbol indicates your current directory

A

A single period

13
Q

What symbol indicates your parent directory

A

Double period

14
Q

What symbol is for a parent’s parent directory

A

Double periods can be stacked:

../../

15
Q

What is a function in Python and what are they useful for

A

A function is a reusable block of code that performs a specific task.

Functions help organize code, making it easier to read and debug. Functions can take inputs (arguments) and return outputs.

16
Q

What is an If statement

A

An if statement is used to execute a block of code if a specified
condition is true. It is the basic way to make decisions in your program

17
Q

What is an If-Else Statement

A

You can use else to specify a block of code that will execute if the
condition is false.

18
Q

What is Elif

A

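elif (short for "else if") allows you to check multiple conditions in a single if statement. Each elif condition is tested only if the conditions before it were false.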
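
The three cards above (if, if-else, elif) can be seen together in one small sketch. The function name describe and the thresholds are made up for illustration.

```python
def describe(n):
    """Classify a number using if / elif / else."""
    if n < 0:
        # Runs only when the first condition is true.
        return "negative"
    elif n == 0:
        # Checked only because the first condition was false.
        return "zero"
    else:
        # Runs when none of the conditions above were true.
        return "positive"


labels = [describe(-5), describe(0), describe(15)]
print(labels)
```

Only one branch runs per call: Python tests the conditions from top to bottom and stops at the first one that is true.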

19
Q

What is a String (str)

A

Used for literal text, i.e., the orthography exactly as written (a sequence of characters)

e.g., “Hello” “I am a goose” “3”

They are defined with single or double quotes

20
Q

What is an Integer (int)

A

used for numbers that do not have decimals

Example: 0, -5, 15

21
Q

What is a float

A

used for numbers with decimals

22
Q

What is a boolean (bool)

A

takes either the value True or False

23
Q

What is a list

A

a collection with an order

They can be indexed, meaning you can reference some data according to its position in the list

Example:

list1 = ["This", "is", "a", "sentence"]

print(list1[3])  # indexing starts at 0, so this prints "sentence"

24
Q

What is a tuple

A

It is very similar to a list, but it is immutable: it cannot be edited after it is created (this also makes tuples more memory efficient)

25
What is a dictionary
A collection of key-value pairs, where one value (the key) is used to look up the other (the value)
26
Why is clear prose important in code
Clear prose in code—meaning readable, well-structured, and well-documented writing—is important because it makes the code easier to understand, maintain, and debug. Code is often read more frequently than it is written, and clear prose helps other developers (and your future self) grasp the logic, purpose, and constraints of the program quickly.
27
Documentation types
  • Lab book: Notes, bugs, and to-dos during development.
  • Inline comments: Explain complex regions of code.
  • README: Includes setup instructions and usage.
  • Manual: Detailed explanations and references.
28
What is a bug
An unintended behavior of your code: something is wrong with it
29
What is an exception
Generally, a bug that we know about; it is also called an error
30
5 Steps of Debugging
  1. Identify an error, via (a) an error message or (b) unintended output
  2. Hypothesize the issue and a solution
  3. Make changes one at a time
  4. Observe changes to the output or error message
  5. Resolve the bug and make a note in the lab book
31
What is Interactive Debugging (pdb)
Pause and inspect code during execution
32
What is a language model (3 things)
A language model can do three things:
  1. Compute the probability of a sentence or a sequence of words of length n. Formula: P(W) = P(w1, w2, w3, ..., wn)
  2. Compute the probability of an upcoming word. Formula: P(w4|w1, w2, w3)
  3. Compute the probability of a missing word. Formula: P(w3|w1, w2, w4)
33
Why assign probabilities to sentences (3 things)
1. Machine translation 2. Spelling correction 3. speech recognition
34
What's a token
A token is an individual instance of a word
35
What's a type
A type is a distinct word form; tokens of the same form are grouped into one type
36
What's a unigram, bigram, and trigram
  • Unigram: A single word considered on its own. Example: “The cat sat” → unigrams: “The,” “cat,” “sat.”
  • Bigram: A sequence of two consecutive words. Example: “The cat sat” → bigrams: “The cat,” “cat sat.”
  • Trigram: A sequence of three consecutive words. Example: “The cat sat” → trigram: “The cat sat.”
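
The unigram, bigram, and trigram examples above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name ngrams is made up, and tokenizing by splitting on whitespace is a simplifying assumption.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The cat sat".split()  # naive whitespace tokenization

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```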
37
How can we estimate n gram probabilities
We can estimate these probabilities through maximum likelihood estimation. For each possible n-gram in a training corpus, we obtain the count of that n-gram, then normalize that count by the count of its context (the preceding n-1 words), such that it lies between 0 and 1
38
What is add-one estimation (Laplace smoothing same thing)
In n-gram models (like unigrams, bigrams, trigrams), if a word sequence never appeared in the training corpus, its probability would be 0, which can break the model. Add-one smoothing fixes this by adding 1 to every count before calculating probabilities, ensuring no sequence has a zero probability.
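
A minimal sketch of bigram estimation with and without add-one smoothing. The toy corpus and the function names mle and add_one are made up for illustration. Using the unigram count of the previous word as the context count is a simplification, since a word at the very end of a sentence is counted even though nothing follows it.

```python
from collections import Counter

# Toy training corpus (an assumption for this sketch).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)
vocab_size = len(unigram_counts)


def mle(prev, word):
    """Maximum likelihood estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def add_one(prev, word):
    """Add 1 to the bigram count and the vocabulary size to the context count."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)


print(mle("the", "cat"), mle("the", "sat"))
print(add_one("the", "cat"), add_one("the", "sat"))
```

The unseen bigram "the sat" has probability 0 under the plain estimate but a small non-zero probability after smoothing. The smoothed probabilities of all words following "the" still sum to 1.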
39
What are regular expressions (regexes) and what are they used for
A regular expression describes a pattern of acceptable strings of characters. Regexes are used to filter, search, or replace text in programming, text editing, and data processing.
40
What matches any single character
/./
41
What matches any two characters
/../
42
What could /d.g/ be
dog, dig, dug...
43
What symbol is used to escape special characters
Backslash \ Example: /Dr\. Ng/ matches “Dr. Ng”.
44
What does * mean
zero or more repetitions
45
What does + mean
one or more repetitions
46
What does ? mean
zero or one repetition
47
What would {m,n} mean
between m and n repetitions Example a{1,3} --> 1, 2, or 3 a's
48
What does [abc] match
“a”, “b”, or “c”
49
What does [^abc] match
matches any character except a, b, or c
50
what does [A-Z] match
any uppercase letter
51
what does [0-9] match
any digit
52
What does the pipe do
Matches one of multiple options. Example: /dog|cat|mouse/ → “dog”, “cat”, or “mouse”.
53
What does \d match
any digit [0-9]
54
What does \D match
any non-digit
55
What does \w match
alphanumeric or underscore
56
What does \W match
any character that is not alphanumeric or an underscore
57
What is the Chomsky hierarchy
describes a taxonomy of formal languages
58
What is library os used for
Interact with operating system and file paths
59
What is glob used for
Find pathnames using wildcard
60
what is pandas used for
Store many kinds of data in tables
61
What is numpy used for
Store numbers in matrices, apply formulae efficiently
62
what is re used for
Filter and modify data using regexes
63
what is nltk used for
General utilities for text analysis
64
What is collections used for
Count data
65
What does join() do
A method that applies to a string object and takes as its argument a list of strings. It concatenates the strings in the list into one string, using the string object as the separator
66
What does strip() do
A method that applies to a string object and removes characters (whitespace by default) from the beginning and end
67
What does split() do
A method that applies to a string object and creates a list of substrings, dividing the string at some separator (whitespace by default)
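
A short sketch of join(), strip(), and split() in use. The example strings are made up.

```python
words = ["This", "is", "a", "sentence"]

# join() is called on the separator and takes the list as its argument.
joined = " ".join(words)

# strip() with no argument removes whitespace from both ends.
stripped = "  padded text \n".strip()

# split() divides the string at each occurrence of the separator.
parts = "a,b,c".split(",")

print(joined)
print(stripped)
print(parts)
```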
68
What does re.search() do
* Find the first occurrence of a pattern anywhere in the string. Returns a match object or None. * two arguments: regex, text to search
69
What does re.match() do
Match the pattern only at the beginning of a string. Returns a match object or None. Args: regex, text to search
70
What does re.findall() do
Returns all non-overlapping matches of the pattern as a list Arguments: regex, data
71
What does re.sub() do
  • Replaces all occurrences of the pattern with new text.
  • Three arguments: the pattern to match (a regex), what to replace it with, and the text to search. The replacement is a string (it may contain backreferences such as \1) or a function, not a regex.
  • Returns a single string.
72
What does re.split() do
Split a string by a regex pattern
73
What does re.compile() do
Precompiles a regex so that it can be saved to a variable and reused. This lets you split up long lines, which improves readability
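
The re functions on the cards above, side by side in one sketch. The example sentence and variable names are made up.

```python
import re

text = "Dr. Ng saw 3 dogs and 12 cats."

# re.compile() stores the pattern in a variable for reuse.
number = re.compile(r"\d+")

first = number.search(text)             # first match anywhere in the string
at_start = re.match(r"\d+", text)       # None: the string does not start with a digit
all_numbers = number.findall(text)      # every non-overlapping match, as a list
masked = number.sub("N", text)          # replace every match with "N"
pieces = re.split(r"\s+and\s+", text)   # split wherever the pattern matches

print(first.group(), at_start, all_numbers)
print(masked)
print(pieces)
```

Note the difference between search() and match(): both return a match object or None, but match() only succeeds if the pattern matches at the very beginning of the string.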
74
What is text normalization
refers to the process of converting text from a corpus into a more convenient, standardized form.
75
what are the 4 subtypes of text normalization
1.Tokenizing (segmenting) words 2.Normalizing word formats 3.Segmenting sentences 4.Removing undesirable parts of text data
76
6 problems with tokenization
1. Unseen Tokens (Solution Unk) 2. Multiple sentences - How do we know when a sentence ends (Solution: BOS/EOS) 3. Case (Skipped solution) 4. Words aren't great (Lemmatization/BPE) 5. Anonymization 6. Other filtering (Abusive language, hate speech, etc).
77
What is used to handle unknown words
We create an out-of-vocabulary (OOV) token, and it is called UNK
78
1 strategy for UNK
Replace words in the training data with UNK based on their frequency.
  • E.g., replace any word that appears fewer than 100 times
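
A minimal sketch of the frequency-based strategy above. The function name replace_rare, the toy data, and the token spelling "<UNK>" are assumptions for illustration. The threshold is set to 2 rather than 100 only because the toy corpus is tiny.

```python
from collections import Counter


def replace_rare(sentences, min_count, unk="<UNK>"):
    """Replace every word seen fewer than min_count times with the unknown token."""
    counts = Counter(w for sent in sentences for w in sent)
    return [
        [w if counts[w] >= min_count else unk for w in sent]
        for sent in sentences
    ]


train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
replaced = replace_rare(train, min_count=2)
print(replaced)
```

At test time, any word that is not in the training vocabulary would be mapped to the same token, so the model has a probability estimate for it.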
79
What are BOS and EOS tokens
Beginning of sentence (BOS) and end of sentence (EOS) tokens are added to the text at sentence boundaries. They allow us to process texts longer than a single sentence while maintaining the important sentence boundaries
80
what is lemmatization
Lemmatization refers to the task of determining the root (i.e., lemma) of some word form. Example: the word forms sing, sang, sung, and singing all share the lemma sing. Two word forms share a lemma if they have the same stem and belong to broadly the same part of speech
81
What is Byte Pair Encoding
Byte Pair Encoding (BPE) is a subword tokenization method that breaks words into smaller, frequent units by iteratively merging the most common pairs of characters or subwords. This reduces vocabulary size, handles rare or unseen words, and allows language models to represent text more efficiently.
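
A minimal sketch of the BPE merge loop described above, not a full tokenizer. The toy word frequencies and the function names are made up, and each word starts out as a tuple of single characters. Ties between equally frequent pairs are broken by whichever pair was counted first, which is an arbitrary choice.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of pair becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy vocabulary: word (as a tuple of characters) -> frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After a few merges, the frequent ending "est" becomes a single subword unit shared by "newest" and "widest".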
82
What is a versioned repository (work on code with others)
A versioned repository is a central storage location that tracks every change made to project files over time. It allows developers to see who changed what, when, and why, and to easily revert to older states if needed, which is crucial for collaboration
83
What can virtual environments do and what is the initial problem
Problem: Different projects may require conflicting package versions. Solution: Use venv to isolate Python versions and packages per project.
84
What is the purpose of Python classes and what are the 2 benefits
Encapsulate data and behaviors for objects. Benefits: * Standardized handling of similar data types. * Methods allow operations directly on objects.
85
What is a morpheme
* Smallest meaningful component of language. * Formed from one or more sounds/segments. Can combine into words such as: – Read – Reading – Rereading – Unrereadable
86
Determiner
Select nouns (the, a, an), quantifiers (one, some, many).
87
Noun
Take determiners, modified by adjectives, singular/plural.
88
Verbs
Heart of sentence; take auxiliaries (can, will, have), objects; conjugated.
89
Adjectives
Modify nouns, can take suffixes (-ish, -ier, -iest).
90
Adverbs
Modify verbs, often formed with -ly.
91
What is TF IDF and how does it work
TF-IDF (Term Frequency-Inverse Document Frequency) is used to measure how important a word is in a document relative to a collection of documents (a corpus). It is a common baseline model for embedding words. In TF-IDF, words (a.k.a. terms) are represented by a simple function of the counts of nearby words, given a corpus of documents
92
So to break it down what is term frequency
The term frequency for word t is the number of times t appears in document d: tf(t, d) = count(t, d)
93
What then is document frequency
The document frequency of a word t is the number of documents in which t occurs.
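
A minimal sketch of term frequency, document frequency, and a TF-IDF weight. The toy documents and function names are made up. The cards do not give the weighting formula, so the version here (raw count times log10 of N / df) is one common choice and should be read as an assumption.

```python
import math
from collections import Counter

# Toy corpus: document id -> list of tokens (an assumption for this sketch).
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat".split(),
    "d3": "a goose honked".split(),
}


def tf(term, doc_tokens):
    """Number of times term appears in the document."""
    return Counter(doc_tokens)[term]


def df(term):
    """Number of documents in which term occurs."""
    return sum(1 for tokens in docs.values() if term in tokens)


def tf_idf(term, doc_id):
    """Assumed weighting: tf * log10(N / df), with 0 for unseen terms."""
    if df(term) == 0:
        return 0.0
    return tf(term, docs[doc_id]) * math.log10(len(docs) / df(term))


print(tf("the", docs["d1"]), df("the"))
print(tf_idf("the", "d1"), tf_idf("cat", "d1"))
```

Although "the" appears more often in d1 than "cat", it also appears in more documents, so it receives a lower weight.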
94
What is Syntax
Rules for arranging words hierarchically
95
What is headedness
Phrases have heads that select arguments
96
what is a cfg
A CFG (context-free grammar) is a formal grammar used to describe the syntax of a language. It consists of a set of rules that define how non-terminal symbols can be replaced by sequences of non-terminal and terminal symbols
97
What is a fcfg
An FCFG extends a CFG by adding features or attributes (like number, gender, tense) to symbols, allowing more detailed and precise grammatical descriptions.
98
What is constituency parsing
In constituency parsing, the goal is to make constituent structures explicit. If a CFG is able to generate a sentence, we say that it accepts the sentence
99
top down parsing
Start from the root (S) and expand rules toward terminals.
100
Bottom up parsing
Start from words (terminals) and combine them into higher-level structures.
101
breadth first
Explore all nodes level by level before going deeper.
102
depth first
Explore one branch fully before backtracking to others.
103
best first
Start with what works, and start over if the parse fails later on
104
exhaustive
work through all potential candidates in parallel
105
what is Chomsky normal form
Chomsky Normal Form (CNF) is a way of structuring context-free grammar rules so that each rule produces either two nonterminal symbols or one terminal symbol. This enforces binary branching and allows no rules that produce the empty string.
  • Two nonterminal symbols: A → B C
  • One terminal symbol: A → a
106
what is cky parsing
a bottom-up parsing algorithm for context-free grammars in Chomsky Normal Form that efficiently determines whether a string can be generated by a grammar and constructs its possible parse trees using a dynamic programming table.
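
A minimal sketch of a CKY recognizer, which only answers whether the grammar accepts a sentence and does not build the parse trees. The toy grammar, the lexicon, and the function name cky_recognize are made up for illustration. The grammar is already in Chomsky Normal Form.

```python
from itertools import product

# Hypothetical toy grammar in CNF.
# Binary rules map a pair of right-hand-side symbols to possible parents.
binary_rules = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Lexical rules map a word (terminal) to its possible categories.
lexical_rules = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}


def cky_recognize(words, start="S"):
    """Return True if the grammar can generate the word sequence."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that can span words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(lexical_rules.get(word, ()))
    # Fill in longer spans from shorter ones (bottom-up).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # every split point
                for left, right in product(table[i][k], table[k][j]):
                    table[i][j] |= binary_rules.get((left, right), set())
    return start in table[0][n]


accepted = cky_recognize("the dog chased the cat".split())
rejected = cky_recognize("dog the chased".split())
print(accepted, rejected)
```

Each cell of the table is computed once and reused, which is the dynamic programming part of the algorithm.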
107
What is a server
At minimum, a server is one computer on a network
108
What is distributed computing
If we have multiple computers on the server/cluster, we can break down computationally intensive tasks into smaller jobs and split them between the computers.
  • Pro: Distributed computing speeds up processing by splitting tasks across multiple machines.
  • Con: Distributed computing can be harder to manage because it depends on complex coordination and stable networks.
109
what is parallel computing
Parallel computing splits work on one machine across its processors.
  • Benefit: no risk of loss between machines
  • Limitation: you can't make a machine bigger
110
What is machine learning
Machine learning systems learn and improve automatically through experience. * Machine learning algorithms automate the learning process * Machine learning systems (a.k.a. models) are "trained" on examples of data (i.e., experience; a.k.a. observations) to make predictions about data they haven't seen before
111
what is the goal of machine learning
The goal of machine learning is to create a system that generalizes to new data.
112
What is training data
Contains training examples—the experience the system should learn from
113
What is development data
For iterative error analysis and for hyperparameter tuning
114
What is test data
Held-out data used to evaluate the system's performance once it's developed. However, it is unethical to look at test data, especially while training and developing the model
115
K fold cross validation
K-fold cross-validation is a technique used to evaluate the performance of a machine learning model by splitting the dataset into k equal parts (folds). The model is trained on k-1 folds and tested on the remaining fold; this process is repeated k times, each time using a different fold as the test set, and the results are averaged to give a more reliable estimate of model performance.
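
A minimal sketch of how the k train/test splits can be produced. The function name k_fold_splits is made up, and the data is a list of integers standing in for real examples. Folds are built by taking every k-th item, so their sizes differ by at most one when the data does not divide evenly. Real data would usually be shuffled first.

```python
def k_fold_splits(examples, k):
    """Yield (train, test) pairs, using each fold as the test set once."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


data = list(range(10))
splits = list(k_fold_splits(data, k=5))

for train, test in splits:
    print(len(train), len(test), test)
```

In practice, a model would be trained on each train portion and scored on the matching test portion, and the k scores would then be averaged.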
116
what 2 approaches does machine learning fall into
  1. Supervised: Learning involves labeled training data. It's expensive to procure labeled training data, but supervised systems are easier to design and train well.
  2. Unsupervised: Learning involves unlabeled training data. It's easier to procure unlabeled training data, but unsupervised systems are harder to design and train well.
117
what is overfitting
Overfitting occurs when a model memorizes the details and noise of its training data instead of extrapolating patterns, thereby failing to generalize to new data
118
underfitting
Underfitting occurs when a model performs poorly on the training data
119
Why elegant code
Elegant code is important because it is clear, concise, and easy to understand, which makes it easier to maintain, debug, and extend. It reduces the likelihood of errors, facilitates collaboration with other developers, and often performs efficiently, reflecting thoughtful problem-solving and good design principles.
120
mkdir
make a new folder
121
-p
used with mkdir to allow it to make folders inside directories that don't exist yet (the missing parent directories are created too)
122
cp
copies a file from one place to another
123
cd
changes the working directory
124
mv
moves a file from one place to another
125
rm
delete a file
126
cat
print contents of a file
127
head
print first lines of a file
128
tail
print last lines of a file
129
pwd
print the path of the current working directory.
130
Iteration
apply the same process to each item in a collection
131
What is the goal of writing
Mutual conceptual pacts: we select the information to include and exclude, and the format to present it in, depending on the identity and needs of our intended audience
132
what is try/except
A try...except block is used to catch and handle runtime errors (exceptions), allowing the program to continue running instead of crashing. The code that might cause an error is placed inside the try block, and the code to execute if an exception occurs is placed inside the except block.
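
A short sketch of try/except. The function name safe_divide and the choice to return None on failure are made up for illustration.

```python
def safe_divide(a, b):
    """Divide a by b, returning None instead of crashing on division by zero."""
    try:
        # Code that might raise an exception goes in the try block.
        return a / b
    except ZeroDivisionError:
        # Runs only if the try block raised this specific exception.
        return None


ok = safe_divide(6, 3)
failed = safe_divide(1, 0)
print(ok, failed)
```

Naming the specific exception (here ZeroDivisionError) means that other, unexpected errors are still reported rather than silently hidden.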
133
markov assumption
it means that the probability of a system moving to a particular next state depends only on its current state, not on the entire sequence of events that led up to it.
134
Shebang
An invocation at the top of a script (e.g., #!/usr/bin/env python3) that tells the command prompt which executable should be used to run it
135
bag of words
This approach to modeling language ignores word order.
136
All of the possible elements of a command in the command line.
executable/command, flags, and arguments
137
one hot encoding
Each word is represented as a vector with a 1 at the index of the lexeme in the vocabulary and 0 elsewhere
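
A minimal sketch of one-hot encoding with plain Python lists. The three-word vocabulary and the function name one_hot are made up.

```python
vocab = ["cat", "dog", "goose"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector


encoded = [one_hot(w) for w in vocab]
print(encoded)
```

Every vector is as long as the vocabulary, and no two words share a non-zero position, so one-hot vectors carry no information about similarity between words.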
138
What are dependency grammars
Describe directed grammatical relations (i.e., argument structure) between words in a sentence. It focuses not on grammaticality ("What is a possible sentence?") but on the relationships between words, given some string ( grammatical or not)
139
When might we prefer to work with constituency grammars, such as CFGs? When might we prefer to work with dependency grammars?
We prefer constituency grammars (like CFGs) when modeling hierarchical phrase structure and syntactic recursion is important, while dependency grammars are preferred when focusing on direct word-to-word relationships, head–dependent structure, and efficient parsing, especially in free-word-order languages.
140
Internationalization
Adapting a software or process so that it can be used by an international audience
141
Localization
Tuning a software or process to work well for a specific community
142
Markdown
a software-agnostic syntax for specifying formatting
143
LaTeX
A generic typesetting syntax, designed for the typesetting program TeX. Overleaf is an example of an editor for it
144
Neural language model
Neural network ⨉ language model = neural language model Like n-gram language models, we can train neural networks to predict upcoming words, given prior word context. They can also predict missing words
145
PEP8
PEP stands for Python Enhancement Proposal; PEP 8 is the style guide for Python code
146
command
The program being run. When you run python your_script.py, python is the command (the interpreter), and your_script.py is an argument passed to the Python interpreter, which then runs that script as your program.
147
Argument
Input data for the command. In cp source_file.txt destination_folder/, source_file.txt and destination_folder/ are positional arguments. They tell the command what files to operate on
148
flag
A flag is a specific type of optional argument that is Boolean in nature.
149
[ngl]o!
no! go! or lo!
150
[A-z0-9]
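any plain letter (upper or lowercase) or a digit, but also the six ASCII characters that fall between Z and a ([ \ ] ^ _ `). To match only letters and digits, use [A-Za-z0-9]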
151
lea(n|d)
lean or lead
152
\s
any whitespace character
152
\b
word boundary
152
\S
any non-whitespace character
152
\B
non-word boundary
153
word embeddings
can encode many dimensions of meaning but those dimensions may not actually be meaningful
154
Probabilistic Context free grammar (PCFG)
A CFG in which each rule is assigned a probability. A PCFG is said to be consistent if the probabilities of all sentences in a language L sum to 1
155
non terminal
Nonterminals = abstract categories (S, NP, VP, Aux) used during derivation
156
terminal
Terminals = the final words you see in the sentence
157
Chain Rule
The product rule of probability allows us to compute the likelihood of joint events using conditional probabilities. Extending this gives us the chain rule, which decomposes a sentence’s probability into smaller, conditional parts
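
Written out, the chain rule described above takes the following form. The notation follows the P(W) = P(w1, w2, ..., wn) formula from the language model card. The three-word line is a worked instance of the general formula.

```latex
P(W) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

P(w_1, w_2, w_3) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)
```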
158
treebanks
Annotated corpora (e.g., the Penn Treebank) used to estimate rule probabilities empirically.
159
CPU
what is doing the majority of the work on your computer now
160
RAM
where we store things we want to remember, but not save forever
161
Constituency Tests
* Help us see where phrase boundaries are * Help us see what kind of phrase is allowed as a certain argument
162
Hard Drive
where to put things that you want to persist
163
Replacement Tests
Replace a candidate phrase with a known phrase of the same type
164
Movement tests
Can you move around the constituent, e.g., as a topicalization?
165
Cleft tests
Split up the constituent candidates with “It is _ that”
166
Q/A
Can my constituent candidate answer a question?
167
coordination
* Can the constituent candidate be coordinated with a phrase of a known type.
168
Tree Adjoining Grammar (TAG)
Similar to CFGs, it is a formal grammar that uses trees as its basic building block and combines them through substitution and adjunction to generate larger, hierarchically structured sentences
169
Why typesetting matters
Good typesetting makes projects easier to read and navigate, which reduces intimidation when others interact with unfamiliar codebases and workflows. By improving clarity and accessibility, it encourages collaboration, reuse, and acknowledgment, leading to stronger scientific practice and professional recognition.
170
Job scheduler
instead of ssh-ing onto a specific node and running a script, let the server handle sending the code and data to a specific computer