Language implementation systems must always…
analyze source code.
What are most syntax analysis based on?
A formal description of the syntax of the source language (BNF).
What are the two parts of the syntax analysis portion of a language processor?
A low-level part (lexical analyzer, a finite automaton based on a regular grammar) and a high-level part (syntax analyzer, a push-down automaton based on a context-free grammar or BNF).
Lexical Analyzer
A pattern matcher for character strings, acting as a “front-end” for the parser.
What are the three approaches to building a lexical analyzer?
Finite State Automata (FSA)
A mathematical model for design descriptions of processes such as lexical analysis. They can model the design description of recognizing string patterns based on regular expressions. There are a finite number of states and transitions between states.
Accepting
A subset of states is called accepting. Accepting state corresponds to
a single token type.
Describe the formal mechanism for a FSA.
Begin at the start state, then process the input by character, each character triggering a new transition. If the FSA reaches an accepting state, then the token type has been determined.
Deterministic FSA (DFA)
Contains a set of states with an input alphabet, unique start state, unique end symbol ($), and a final or accepting state. Can only be deterministic if for each state there is only on outgoing arc from that state for each input symbol.
What are examples of DFA?
Catenation, repetition, and alternation.
What is a configuration on a DSA?
Consists of a state and the remaining input.
What is a “move?”
A move consists of traversing the arc exiting the state that
corresponds to the leftmost input symbol, thereby consuming it. If no arc exists, either the state is final or there is an error.
When is an input accepted?
It is accepted when, starting with the start state, the automaton
consumes all the input and halts in a final state.
How should a state diagram be designed?
Do not include a transition from every state on every character (too large!) Combine transitions to simplify the diagram. For example, when recognizing an identifier, let all uppercase and lowercase letters be equivalent, then include a character class.
How should a state diagram handle reserved words?
Reserved words can be combined with identifiers. A table look-up can be used later on to determine if an identifier is actually a reserved word.
Lexers
Machines recognizing multiple patterns at once. Instead of “end of input,” “not part of this pattern” is considered the end of the pattern. The pattern is then accepted, then return to the start state. The character is still pushed back or saved in the case that it is the beginning of another pattern.
What are two implementation problems with lexers?
Sometimes it cannot be determined if a token has ended without a look-ahead (nowadays most languages can be tokenized with a one-character look-ahead). Additionally, some tokens may be a proper substring of another token. This conflict is usually resolved by looking for the longest match.
How can a look-ahead be performed during lexing?
Look-aheads typically look forward one character. In C++ streams, peek() can be used to access the next character and read() reads it. If it belongs to the next token, putback() can be used to put the character back.
How are lexers used by a parser?
Parsing is done by reading a token at a time and matching as they are read. The lexer is usually a function that is called by the parser to obtain the tokens (getToken(), or yylex() if using an automatic lexer generator).
What is the input/output of a lexer? A parser?
A lexer takes a sequence of characters and returns tokens. A parser takes tokens and returns a parse tree.
What is the regular expression for an identifier?
IDENT ::= Letter ( Letter |Digit | _ | $)*
Letter ::= [a-z A-Z]
Digit ::= [0-9]
What is the regular expression for an integer literal?
ICONST ::= [0-9]+
What is the regular expression for a Boolean constant?