Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
어휘분석(Lexical Analysis) Overview • Main task: to read input characters and group them into “tokens.” • Secondary tasks: – Skip comments and whitespace; – Correlate error messages with source program (e.g., line number of error). Overview (cont’d) Lexical Analysis: Terminology • token: a name for a set of input strings with related structure. Example: “identifier,” “integer constant” • pattern: a rule describing the set of strings associated with a token. Example: “a letter followed by zero or more letters, digits, or underscores.” • lexeme: the actual input string that matches a pattern. Example: count Examples Input: count = 123 Tokens: identifier : Rule: “letter followed by …” Lexeme: count assg_op : Rule: = Lexeme: = integer_const : Rule: “digit followed by …” Lexeme: 123 Why token? • Language syntax is specified with token types, not words (lexeme) 1. goal expr 2. expr expr op term 3. | term 4. term number 5. | id 6. op 7. + | – S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7} Specifying Tokens: regular expressions • Terminology: – alphabet : a finite set of symbols – string : a finite sequence of alphabet symbols – language : a (finite or infinite) set of strings. • Regular Operations on languages: Union: R S = { x | x R or x S} Concatenation: RS = { xy | x R and y S} Kleene closure: R* = R concatenated with itself 0 or more times = {} R RR RRR = strings obtained by concatenating a finite number of strings from the set R. Regular Expressions A pattern notation for describing certain kinds of sets over strings: Given an alphabet : – is a regular exp. (denotes the language {}) – for each a , a is a regular exp. (denotes the language {a}) – if r and s are regular exps. denoting L(r) and L(s) respectively, then so are: • (r) | (s) ( denotes the language L(r) L(s) ) • (r)(s) ( denotes the language L(r)L(s) ) • (r)* ( denotes the language L(r)* ) Common Extensions to r.e. Notation • • • • • One or more repetitions of r : r+ A range of characters : [a-zA-Z], [0-9] An optional expression: r? Any single character: . Giving names to regular expressions, e.g.: – – – – letter = [a-zA-Z_] digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ident = letter ( letter | digit )* Integer_const = digit+ Examples of Regular Expressions Identifiers: Letter (a|b|c| … |z|A|B|C| … |Z) Digit (0|1|2| … |9) Identifier Letter ( Letter | Digit )* Numbers: [0-9] [0-9]+ [0-9]* [1-9][0-9]* ([1-9][0-9]*)|0 -?[0-9]+ [0-9]*\.[0-9]+ ([0-9]+)|([0-9]*\.[0-9]+) [eE][-+]?[0-9]+ ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? -?( ([0-9]+) | ([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)? ) Examples of Regular Expressions Numbers: Integer (+|-|) (0| (1|2|3| … |9)(Digit *) ) Decimal Integer . Digit * Real ( Integer | Decimal ) E (+|-|) Digit * Complex ( Real , Real ) Exercise of Regular Expressions • 문자열 [a-z] [a-zA-Z] [a-zA-Z][a-zA-Z0-9]* [a-zA-Z0-9] • 스트링 – "this is a string" \".*\" <- wrong!!! why? \"[^"]*\" • 몇가지 연습 0과 1로 이루어진 문자열 중에서... – 0으로 시작하는 문자열 – 0으로 시작해서 0으로 끝나는 문자열 – 0과 1이 번갈아 나오는 문자열 – 0이 두번 계속 나오지 않는 문자열 0[01]* 0[01]*0 ______ ______ Recognizing Tokens: Finite Automata A finite automaton is a 5-tuple (Q, , T, q0, F), where: – is a finite alphabet; – Q is a finite set of states; – T: Q Q is the transition function; – q0 Q is the initial state; and – F Q is a set of final states. Finite Automata: An Example A (deterministic) finite automaton (DFA) to match C-style comments: Example 2 Consider the problem of recognizing register names Register r (0|1|2| … | 9) (0|1|2| … | 9)* • Allows registers of arbitrary number • Requires at least one digit RE corresponds to a recognizer (or DFA) (0|1|2| … 9) (0|1|2| … 9) r S0 S1 S2 accepting state Recognizer for Register Transitions on other inputs go to an error state, se Example 2 (continued) DFA operation • • Start in state S0 & take transitions on each input character DFA accepts a word x iff x leaves it in a final state (S2 ) (0|1|2| … 9) r S0 (0|1|2| … 9) S1 S2 accepting state Recognizer for Register So, • r17 takes it through s0, s1, s2 and accepts • r takes it through s0, s1 and fails • a takes it straight to se Example 2 (continued) To be useful, recognizer must turn into code Char next character State s0 while (Char EOF) State (State,Char) Char next character if (State is a final state ) then report success else report failure Skeleton recognizer r 0,1,2,3,4,5, 6,7,8,9 s0 s1 se se s1 se s2 se s2 se s2 se se se se se Table encoding RE All others What if we need a tighter specification? r Digit Digit* allows arbitrary numbers • Accepts r00000 • Accepts r99999 • What if we want to limit it to r0 through r31 ? Write a tighter regular expression – Register r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) ) – Register r0|r1|r2| … |r31|r00|r01|r02| … |r09 Produces a more complex DFA • Has more states • Same cost per transition • Same basic implementation Tighter register specification (continued) The DFA for Register r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) ) (0|1|2| … 9) S2 S3 0,1,2 S0 r S1 3 S5 0,1 4,5,6,7,8,9 S4 • Accepts a more constrained set of registers • Same set of actions, more states S6 Tighter register specification (continued) r 0,1 2 3 4-9 All others s0 S1 se se se se se s1 se s2 s2 s5 s4 se s2 se s3 s3 s3 s3 se s3 se se se se se se s4 se se se se se se s5 se s6 se se se se s6 se se se se se se se se se se se se se Table encoding RE for the tighter register specification Automating Scanner Construction • RE→ NFA (Thompson’s construction) – Build an NFA for each term – Combine them with ε-moves • NFA → DFA (subset construction) – Build the simulation • DFA → Minimal DFA – Hopcroft’s algorithm • DFA →RE (Not part of the scanner construction) – All pairs, all paths problem – Take the union of all paths from s0 to an accepting state