final Flashcards

Question

What is a dictionary?

Answer 1

A collection of pairs of values, where one (the key) can be used to look up the other (the value)

Answer 2

Clear prose in code—meaning readable, well-structured, and well-documented writing—is important because it makes the code easier to understand, maintain, and debug. Code is often read more frequently than it is written, and clear prose helps other developers (and your future self) grasp the logic, purpose, and constraints of the program quickly.

Answer 3

* Lab book: Notes, bugs, and to-dos during development.
* Inline comments: Explain complex regions of code.
* README: Includes setup instructions and usage.
* Manual: Detailed explanations and references.

Answer 4

Something is wrong with your code: it produces unintended behavior

Answer 5

Generally, a bug that we know about. It is also called an error

Answer 6

1. Identify an error
   a) error message
   b) unintended output
2. Hypothesize issue and solution
3. Make changes one at a time
4. Observe changes to output or error message
5. Resolve bug and make note in lab book

Answer 7

Pause and inspect code during execution

Answer 8

* Computes the probability of a sentence or a sequence of words of length n. Formula: P(W) = P(w1, w2, w3, ..., wn)
* Computes the probability of an upcoming word. Formula: P(w4|w1, w2, w3)
* Computes a missing word. Formula: P(w3|w1, w2, w4)

Answer 9

1. Machine translation
2. Spelling correction
3. Speech recognition

Answer 10

A token is an individual instance of a word

Answer 11

tokens of the same form are grouped into types

Answer 12

* Unigram: A single word considered on its own. Example: “The cat sat” → unigrams: “The,” “cat,” “sat.”
* Bigram: A sequence of two consecutive words. Example: “The cat sat” → bigrams: “The cat,” “cat sat.”
* Trigram: A sequence of three consecutive words. Example: “The cat sat” → trigram: “The cat sat.”
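A minimal sketch of how these n-grams could be extracted from a tokenized sentence. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting stands in for a real tokenizer here.
tokens = "The cat sat".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```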

Answer 13

We can estimate these probabilities through maximum likelihood estimation. For each possible n-gram in a training corpus, we obtain the count of that n-gram, then normalize that count so that it lies between 0 and 1 (for example, by dividing a bigram's count by the count of its first word).
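As a rough illustration, the count-then-normalize step for bigrams might look like the sketch below. The function name `bigram_mle` and the toy corpus are hypothetical, and the normalizer used is the number of times the first word starts a bigram.

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1 as the first word of a bigram)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

# Hypothetical toy corpus, only for illustration.
toy_corpus = "the cat sat on the mat".split()
probs = bigram_mle(toy_corpus)

for bigram, p in probs.items():
    print(bigram, p)
```

Any bigram that never occurs in the corpus is simply absent from `probs`, which corresponds to a probability of 0.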

Answer 14

In n-gram models (like unigrams, bigrams, trigrams), if a word sequence never appeared in the training corpus, its probability would be 0, which can break the model. Add-one smoothing fixes this by adding 1 to every count before calculating probabilities, ensuring no sequence has a zero probability.
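A small sketch of add-one smoothing for bigrams, under the usual assumption that the denominator grows by the vocabulary size V so the probabilities for each context still sum to 1. The function name and toy corpus are hypothetical.

```python
from collections import Counter

def add_one_bigram_prob(w1, w2, tokens):
    """P(w2 | w1) with 1 added to every bigram count."""
    vocab = set(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    # Adding 1 to each of the V possible continuations adds V to the denominator.
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + len(vocab))

# Hypothetical toy corpus, only for illustration.
toy_corpus = "the cat sat on the mat".split()

seen = add_one_bigram_prob("the", "cat", toy_corpus)    # bigram occurs in the corpus
unseen = add_one_bigram_prob("the", "sat", toy_corpus)  # bigram never occurs
print(seen, unseen)
```

The unseen bigram now gets a small non-zero probability instead of 0.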

Answer 15

Describes a pattern of acceptable strings of characters. Used to filter, search, or replace text in programming, text editing, and data processing.

Answer 16

dog, dig, dug...

Answer 17

Backslash \. Example: /Dr\. Ng/ matches “Dr. Ng”.

Answer 18

zero or more repetitions

Answer 19

one or more repetitions

Answer 20

zero or one repetition

Answer 21

Between m and n repetitions. Example: a{1,3} --> 1, 2, or 3 a's
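The repetition cards above can be checked with a short sketch using Python's `re` module. Only `a{1,3}` appears in the cards; pairing the other three descriptions with `*`, `+`, and `?` is an assumption, since the card prompts are not shown. The helper `fully_matches` is hypothetical.

```python
import re

def fully_matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, a string that should match, a string that should not)
cases = [
    (r"a*", "", "b"),           # zero or more repetitions
    (r"a+", "aaa", ""),         # one or more repetitions
    (r"a?", "a", "aa"),         # zero or one repetition
    (r"a{1,3}", "aa", "aaaa"),  # between 1 and 3 repetitions
]

for pattern, good, bad in cases:
    print(pattern, fully_matches(pattern, good), fully_matches(pattern, bad))
```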

Answer 22

“a”, “b”, or “c”

Answer 23

matches any character except a, b, or c

Answer 24

any uppercase letter

Answer 25

Matches one of multiple options. Example: /dog|cat|mouse/ → “dog”, “cat”, or “mouse”.

Answer 26

any digit [0-9]

Answer 27

any non-digit

Answer 28

alphanumeric or underscore

Answer 29

any character that is not alphanumeric or an underscore

Answer 30

describes a taxonomy of formal languages

Answer 31

Interact with operating system and file paths

Answer 32

Find pathnames using wildcard

Answer 33

Store many kinds of data in tables

Answer 34

Store numbers in matrices, apply formulae efficiently

Answer 35

Filter and modify data using regexes

Answer 36

General utilities for text analysis

Answer 37

Count data

Answer 38

A method that applies to a string object, and takes as its argument a list of strings

Answer 39

A method that applies to a string object and removes characters from the beginning and end

Answer 40

A method that applies to a string object and creates a list of substrings that are divided by some partition
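The three string-method cards above read like descriptions of Python's `str.join`, `str.strip`, and `str.split`. The card prompts are not shown, so that mapping is an assumption. A brief sketch:

```python
# join: called on the separator string, takes a list of strings
words = ["natural", "language", "processing"]
joined = " ".join(words)

# strip: removes characters (whitespace by default) from both ends
stripped = "  hello world \n".strip()

# split: divides a string into a list of substrings at a separator
parts = "a,b,c".split(",")

print(joined)
print(stripped)
print(parts)
```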

Answer 41

* Find the first occurrence of a pattern anywhere in the string. Returns a match object or None.
* Two arguments: regex, text to search

Answer 42

Match the pattern only at the beginning of a string. Returns a match object or None. Args: regex, text to search

Answer 43

Returns all non-overlapping matches of the pattern as a list. Arguments: regex, data

Answer 44

* Replace all occurrences of the pattern with new text.
* 3 arguments: the pattern to match, what to replace it with, and the text to search (the pattern is a regex; the replacement can refer back to matched groups)
* Returns a single string

Answer 45

Split a string by a regex pattern

Answer 46

Precompile a regex so that it can be saved to a variable. This lets you split up long lines, which improves readability
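The six regex cards above appear to describe `re.search`, `re.match`, `re.findall`, `re.sub`, `re.split`, and `re.compile` from Python's `re` module. The prompts are not shown, so that pairing is an assumption. A short sketch on a made-up sentence:

```python
import re

text = "Dr. Ng saw 3 dogs and 12 cats."

first = re.search(r"\d+", text)       # first match anywhere in the string
at_start = re.match(r"\d+", text)     # only matches at the beginning of the string
numbers = re.findall(r"\d+", text)    # all non-overlapping matches, as a list
masked = re.sub(r"\d+", "N", text)    # replace every match, returns one string
pieces = re.split(r"\s+", "a  b\tc")  # split on runs of whitespace

# A precompiled pattern can be stored in a variable and reused.
digit_run = re.compile(r"\d+")
reused = digit_run.findall(text)

print(first.group(), at_start, numbers)
print(masked)
print(pieces, reused)
```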

Answer 47

refers to the process of converting text from a corpus into a more convenient, standardized form.

Answer 48

1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
4. Removing undesirable parts of text data

Answer 49

1. Unseen tokens (Solution: UNK)
2. Multiple sentences: how do we know when a sentence ends? (Solution: BOS/EOS)
3. Case (Skipped solution)
4. Words aren't great (Solution: Lemmatization/BPE)
5. Anonymization
6. Other filtering (abusive language, hate speech, etc.)

Answer 50

We create an out-of-vocabulary (OOV) token, and it's called UNK

Answer 51

Replace words in the training data with the UNK token based on their frequency.
* E.g., replace any word that appears fewer than 100 times
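Frequency-based replacement could be sketched as follows. The function name `replace_rare`, the `<UNK>` spelling, and the toy sentence are assumptions; the threshold is lowered to 2 so the effect is visible on a tiny example.

```python
from collections import Counter

def replace_rare(tokens, min_count, unk="<UNK>"):
    """Replace every token seen fewer than min_count times with the unk symbol."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

# Hypothetical toy data, only for illustration.
toy_tokens = "the cat saw the dog and the cat".split()
replaced = replace_rare(toy_tokens, min_count=2)
print(replaced)
```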

Answer 52

Beginning of sentence (BOS) and end of sentence (EOS) tokens are added to the text at sentence boundaries. They allow us to process texts longer than a single sentence while maintaining the important sentence boundaries

Answer 53

Lemmatization refers to the task of determining the root (i.e., lemma) of some word form. Example: the word forms sing, sang, sung, and singing all share the lemma sing. Two word forms share a lemma if they have the same stem and belong to broadly the same part of speech

Answer 54

Byte Pair Encoding (BPE) is a subword tokenization method that breaks words into smaller, frequent units by iteratively merging the most common pairs of characters or subwords. This reduces vocabulary size, handles rare or unseen words, and allows language models to represent text more efficiently.
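A simplified sketch of the merge-learning loop, assuming a toy word-frequency table and omitting details that real implementations usually add (such as end-of-word markers). All names and counts below are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of pair into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Start from single characters and apply num_merges greedy merges."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical word counts, only for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

On this toy data the frequent ending shared by "newest" and "widest" is fused into a single subword within the first two merges.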

Answer 55

A versioned repository is a central storage location that tracks every change made to project files over time, allowing developers to see who changed what, when, and why, and to easily revert to older states if needed. This is crucial for collaboration

Answer 56

Problem: Different projects may require conflicting package versions. Solution: Use venv to isolate Python versions and packages per project.

Answer 57

Encapsulate data and behaviors for objects. Benefits:
* Standardized handling of similar data types.
* Methods allow operations directly on objects.

Answer 58

* Smallest meaningful component of language.
* Formed from one or more sounds/segments.
* Can combine into words such as: Read, Reading, Rereading, Unrereadable

Answer 59

Select nouns (the, a, an), quantifiers (one, some, many).

Answer 60

Take determiners, modified by adjectives, singular/plural.

Answer 61

Heart of sentence; take auxiliaries (can, will, have), objects; conjugated.

Answer 62

Modify nouns, can take suffixes (-ish, -ier, -iest).

Answer 63

Modify verbs, often formed with -ly.

Answer 64

TF-IDF (Term Frequency–Inverse Document Frequency) measures how important a word is in a document relative to a collection of documents (a corpus). It is a common baseline model for embedding words: words (a.k.a. terms) are represented by a simple function of the counts of nearby words, given a corpus of documents

Answer 65

The term frequency for word t is the number of times t appears in document d: tf(t, d) = count(t, d)

Answer 66

The document frequency of a word t is the number of documents in which t occurs.
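One way the two quantities might be combined is sketched below, assuming a raw-count term frequency and idf = log10(N / df), where N is the number of documents. Other weighting variants exist, and the cards do not say which one is intended. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency: raw count of each term in this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy documents, only for illustration.
toy_docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran away".split(),
]
weights = tf_idf(toy_docs)
for w in weights:
    print(w)
```

A term that occurs in every document (here "the") gets weight 0, while a term found in only one document gets the largest weight.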

Answer 67

Rules for arranging words hierarchically

Answer 68

Phrases have heads that select arguments

Answer 69

A CFG is a formal grammar used to describe the syntax of a language. It consists of a set of rules that define how non-terminal symbols can be replaced by sequences of terminal and non-terminal symbols

Answer 70

An FCFG extends a CFG by adding features or attributes (like number, gender, tense) to symbols, allowing more detailed and precise grammatical descriptions.

Answer 71

In constituency parsing, the goal is to make constituent structures explicit. If a CFG is able to generate a sentence, we say that it accepts the sentence

Answer 72

Start from the root (S) and expand rules toward terminals.

Answer 73

Start from words (terminals) and combine them into higher-level structures.

Answer 74

Explore all nodes level by level before going deeper.

Answer 75

Explore one branch fully before backtracking to others.

Answer 76

Start with what works, and start over if the parse fails later on

Answer 77

work through all potential candidates in parallel

Answer 78

Chomsky Normal Form (CNF) is a way of structuring context-free grammar rules so that each rule either produces two nonterminal symbols or one terminal symbol, enforces binary branching, and contains no empty strings.
* Two nonterminal symbols: A → B C
* One terminal symbol: A → a

Answer 79

a bottom-up parsing algorithm for context-free grammars in Chomsky Normal Form that efficiently determines whether a string can be generated by a grammar and constructs its possible parse trees using a dynamic programming table.
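A compact sketch of the table-filling idea, assuming the card describes the CKY algorithm. This version only recognizes whether a sentence is accepted and does not build parse trees. The toy grammar and the names `lexical` and `binary` are hypothetical.

```python
from itertools import product

def cky_recognize(words, lexical, binary, start="S"):
    """Return True if a grammar in Chomsky Normal Form accepts the word sequence."""
    n = len(words)
    # table[i][j] holds the nonterminals that can span words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(lexical.get(word, ()))  # rules of the form A -> a
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # every split point inside the span
                for b, c in product(table[i][k], table[k][j]):
                    table[i][j] |= binary.get((b, c), set())  # rules A -> B C
    return start in table[0][n]

# Hypothetical toy grammar in CNF.
lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

accepted = cky_recognize("the dog chased the cat".split(), lexical, binary)
rejected = cky_recognize("chased the dog".split(), lexical, binary)
print(accepted, rejected)
```

Storing backpointers in each cell alongside the nonterminal would be one way to recover the parse trees.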

Answer 80

At minimum, it is one computer on a network

Answer 81

If we have multiple computers on the server/cluster, we can break down computationally intensive tasks into smaller jobs and split them between the computers.
* Pro: Distributed computing speeds up processing by splitting tasks across multiple machines.
* Con: Distributed computing can be harder to manage because it depends on complex coordination and stable networks.

Answer 82

Parallel computing splits up work on one machine across its processors
* Benefit: no risk of loss between machines
* Limitation: you can't make a machine bigger

Answer 83

Machine learning systems learn and improve automatically through experience.
* Machine learning algorithms automate the learning process
* Machine learning systems (a.k.a. models) are "trained" on examples of data (i.e., experience; a.k.a. observations) to make predictions about data they haven't seen before

Answer 84

The goal of machine learning is to create a system that generalizes to new data.

Answer 85

Contains training examples—the experience the system should learn from

Answer 86

* For iterative error analysis
* For hyperparameter tuning

Answer 87

Held-out data to evaluate the system's performance once it's developed. However, it is unethical to look at test data, especially while training and developing the model

Answer 88

K-fold cross-validation is a technique used to evaluate the performance of a machine learning model by splitting the dataset into k equal parts (folds). The model is trained on k–1 folds and tested on the remaining fold; this process is repeated k times, each time using a different fold as the test set, and the results are averaged to give a more reliable estimate of model performance.
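The splitting step might be sketched like this, assuming the data fits in a list and no shuffling is needed. The function name `k_fold_splits` is hypothetical, and training and scoring a model on each split is left out.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs, using each of the k folds as the test set once."""
    folds = [items[i::k] for i in range(k)]  # every k-th item goes to the same fold
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical data set of ten examples.
data = list(range(10))
splits = list(k_fold_splits(data, k=5))

for train, test in splits:
    print("train:", train, "test:", test)
```

Each example lands in exactly one test fold, so averaging the k scores uses every example for evaluation exactly once.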

Answer 89

1. Supervised
   * Learning involves labeled training data
   * It's expensive to procure labeled training data
   * But supervised systems are easier to design and train well
2. Unsupervised
   * Learning involves unlabeled training data
   * It's easier to procure unlabeled training data
   * But unsupervised systems are harder to design and train well

Answer 90

Overfitting occurs when a model memorizes the details and noise of its training data instead of extrapolating patterns, thereby failing to generalize to new data

Answer 91

Underfitting occurs when a model performs poorly on the training data

Answer 92

Elegant code is important because it is clear, concise, and easy to understand, which makes it easier to maintain, debug, and extend. It reduces the likelihood of errors, facilitates collaboration with other developers, and often performs efficiently, reflecting thoughtful problem-solving and good design principles.

Answer 93

make a new folder

Answer 94

used with mkdir to allow it to create parent directories that don't exist yet

Answer 95

copies a file from one place to another

Answer 96

changes the working directory

Answer 97

moves a file from one place to another

Answer 98

delete a file

Answer 99

print contents of a file

Answer 100

print first lines of a file

Answer 101

print last lines of a file

Answer 102

print the path of the current working directory

Answer 103

apply the same process to each item in a collection

Answer 104

Mutual conceptual pacts. We select the information to include and exclude, and the format in which to present it, depending on the identity and needs of our intended audience

Answer 105

A try...except block is used to catch and handle runtime errors (exceptions), allowing the program to continue running instead of crashing. The code that might cause an error is placed inside the try block, and the code to execute if an exception occurs is placed inside the except block.
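A minimal illustration of the pattern. The function `safe_divide` and its choice to return None on failure are assumptions made for the example.

```python
def safe_divide(numerator, denominator):
    """Divide two numbers, returning None instead of crashing on division by zero."""
    try:
        # Code that might raise an exception goes in the try block.
        return numerator / denominator
    except ZeroDivisionError:
        # Code that runs only if that exception occurs goes in the except block.
        return None

print(safe_divide(6, 3))
print(safe_divide(1, 0))
```

Naming a specific exception such as ZeroDivisionError, rather than using a bare except, keeps unrelated errors visible.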

Answer 106

it means that the probability of a system moving to a particular next state depends only on its current state, not on the entire sequence of events that led up to it.

Answer 107

An invocation at the top of a script that tells the command prompt which executable should be used to run it

Answer 108

This approach to modeling language ignores word order.

Answer 109

executable/command, flags, and arguments

Answer 110

The shape of a word is a vector with a 1 at the index of the lexeme in the vocabulary and zeros elsewhere
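A tiny sketch of such a vector, assuming the vocabulary is an ordered list. The function name `one_hot` and the three-word vocabulary are hypothetical.

```python
def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Hypothetical vocabulary, only for illustration.
vocabulary = ["cat", "dog", "mat"]
dog_vector = one_hot("dog", vocabulary)
print(dog_vector)
```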

Answer 111

Describe directed grammatical relations (i.e., argument structure) between words in a sentence. It focuses not on grammaticality ("What is a possible sentence?") but on the relationships between words, given some string (grammatical or not)

Answer 112

We prefer constituency grammars (like CFGs) when modeling hierarchical phrase structure and syntactic recursion is important, while dependency grammars are preferred when focusing on direct word-to-word relationships, head–dependent structure, and efficient parsing, especially in free-word-order languages.

Answer 113

Adapting a software or process so that it can be used by an international audience

Answer 114

Tuning a software or process to work well for a specific community

Answer 115

a software-agnostic syntax for specifying syntax

Answer 116

generic typesetting syntax, designed for the typesetting program TeX --> Overleaf was an example

Answer 117

Neural network ⨉ language model = neural language model. Like n-gram language models, we can train neural networks to predict upcoming words, given prior word context. They can also predict missing words

Answer 118

Python Enhancement Proposals

Answer 119

The program being run. When you run python your_script.py, python is the command (the interpreter), and your_script.py is an argument passed to the interpreter, which then runs it as your program.

Answer 120

Input data for the command. In cp source_file.txt destination_folder/, source_file.txt and destination_folder/ are positional arguments. They tell the command what files to operate on

Answer 121

A flag is a specific type of optional argument that is Boolean in nature.
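To make the distinction concrete, here is a sketch using Python's standard `argparse` module. The program name `copytool` and the `--verbose` flag are hypothetical; the file names echo the cp example from the earlier card.

```python
import argparse

parser = argparse.ArgumentParser(prog="copytool")
# Positional arguments: required, identified by their order.
parser.add_argument("source")
parser.add_argument("destination")
# A flag: an optional argument that is either present (True) or absent (False).
parser.add_argument("-v", "--verbose", action="store_true")

# Passing a list stands in for what would normally come from the command line.
args = parser.parse_args(["source_file.txt", "destination_folder/", "--verbose"])
print(args.source, args.destination, args.verbose)

quiet_args = parser.parse_args(["source_file.txt", "destination_folder/"])
print(quiet_args.verbose)
```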

Answer 122

no! go! or lo!

Answer 123

any plain letter (upper or lowercase) or a digit

Answer 124

lean or lead

Answer 125

any whitespace character

Answer 126

word boundary

Answer 127

any non-whitespace character

Answer 128

non-word boundary

Answer 129

can encode many dimensions of meaning but those dimensions may not actually be meaningful

Answer 130

A PCFG is said to be consistent if the probabilities of all sentences in a language L sum to 1

Answer 131

Nonterminals = abstract categories (S, NP, VP, Aux) used during derivation

Answer 132

Terminals = the final words you see in the sentence

Answer 133

The product rule of probability allows us to compute the likelihood of joint events using conditional probabilities. Extending this gives us the chain rule, which decomposes a sentence’s probability into smaller, conditional parts
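Written out, with A and B standing for any two events and w_1, ..., w_n for the words of a sentence (notation chosen here to match the language-model card earlier in the deck), the two rules are:

```latex
% Product rule for two events
P(A, B) = P(A)\, P(B \mid A)

% Chain rule applied to a sentence of n words
P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{k=1}^{n} P(w_k \mid w_1, \ldots, w_{k-1})
```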

Answer 134

Annotated corpora (e.g., the Penn Treebank) used to estimate rule probabilities empirically.

Answer 135

what is doing the majority of the work on your computer now

Answer 136

where we store things we want to remember, but not save forever

Answer 137

* Help us see where phrase boundaries are
* Help us see what kind of phrase is allowed as a certain argument

Answer 138

where to put things that you want to persist

Answer 139

Replace a candidate phrase with a known phrase of the same type

Answer 140

Can you move around the constituent, e.g., as a topicalization?

Answer 141

Split up the constituent candidates with “It is _ that”

Answer 142

Can my constituent candidate answer a question?

Answer 143

Can the constituent candidate be coordinated with a phrase of a known type?

Answer 144

Similar to CFGs, it is a formal grammar that uses trees as its basic building block and combines them through substitution and adjunction to generate larger, hierarchically structured sentences

Answer 145

Good typesetting makes projects easier to read and navigate, which reduces intimidation when others interact with unfamiliar codebases and workflows. By improving clarity and accessibility, it encourages collaboration, reuse, and acknowledgment, leading to stronger scientific practice and professional recognition.

Answer 146

instead of ssh-ing onto a specific node and running a script, let the server handle sending the code and data to a specific computer

(173 cards)