System Programming
Chapter 1: Language Processors
• Introduction
• Language Processing Activities
• Fundamentals of Language Processing
• Fundamentals of Language Specification
• Language Processing Development Tools

Introduction
• Why a language processor? There is a difference between the manner in which a software designer describes ideas (how the software should behave) and the manner in which those ideas are implemented on the CPU. This difference is called the semantic gap (between the application domain and the execution domain).
• Consequences of the semantic gap:
1. Large development times
2. Large development efforts
3. Poor quality of software
• These issues are tackled by software engineering through the use of a programming language (PL), which splits the semantic gap into a specification gap (application domain to PL domain) and an execution gap (PL domain to execution domain).

Language Processors
• A language processor is software which bridges a specification or execution gap.
• The program input to a language processor is called the source program, and its output the target program.
• The languages in which these are written are called the source language and the target language, respectively.

A Spectrum of Language Processors
• A language translator bridges an execution gap to the machine language of a computer system.
• A detranslator bridges the same execution gap as a language translator, but in the reverse direction.
• A preprocessor is a language processor which bridges an execution gap but is not a language translator. Example: a C++ preprocessor accepts a C++ program and produces an equivalent C program (reporting errors), whereas a C++ translator produces a machine language program.
• A language migrator bridges the specification gap between two PLs.

Interpreters
• An interpreter is a language processor which bridges an execution gap without generating a machine language program.
• With an interpreter the execution gap vanishes totally.
• In the interpreter model the interpreter domain spans the PL domain and the execution domain, which is why no execution gap remains.
• A problem-oriented language narrows the specification gap (its PL domain is close to the application domain) at the cost of a larger execution gap.
• A procedure-oriented language has a wider specification gap but a narrower execution gap.

Language Processing Activities
• Divided into those that bridge the specification gap and those that bridge the execution gap:
1. Program generation activities
2. Program execution activities

Program Generation
• A program generator is software which accepts the specification of a program to be generated and generates a program in the target PL. The specification gap now lies between the application domain and the program generator domain.
• Example: a screen handling program whose specification reads
Employee name : char : start(line=2, position=25) end(line=2, position=80)
Married : char : start(line=10, position=25) end(line=10, position=27) default('Yes')
(Figure: screen displayed by the screen handling program, with fields Employee Name, Address, Married (default Yes), Age and Gender.)

Program Execution
• Two models of program execution: program translation and program interpretation.

Program Translation
• The program translation model bridges the execution gap by translating a program written in a PL (the source program) into a machine language program (the target program).

Program Interpretation
• An interpreter reads the source program and stores it in its memory. During interpretation it determines the meaning of each statement.
• The instruction execution cycle of a CPU:
1. Fetch the instruction.
2. Decode the instruction and determine the operation to be performed.
3. Execute the instruction.
• Analogously, the interpretation cycle consists of:
1. Fetch the statement.
2. Analyze the statement and determine its meaning.
3. Execute the meaning of the statement.
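The interpretation cycle above can be sketched as a loop that fetches, analyzes and executes one statement at a time. This is an illustrative sketch only: the tiny two-operation statement language (assign, add) is invented for the example, not taken from the text.

```python
# Minimal sketch of the interpretation cycle: fetch a statement,
# analyze it to determine its meaning, then execute that meaning.
# The tiny assignment/addition language here is invented for illustration.

def interpret(program):
    env = {}                      # interpreter's memory for variables
    pc = 0                        # "PC" points at the next source statement
    while pc < len(program):
        stmt = program[pc]        # 1. fetch the statement
        pc += 1
        op, *args = stmt          # 2. analyze: determine its meaning
        if op == "assign":        # 3. execute the meaning
            name, value = args
            env[name] = env.get(value, value)  # value may be a variable or literal
        elif op == "add":
            dest, a, b = args
            env[dest] = env.get(a, a) + env.get(b, b)
        else:
            raise ValueError(f"unknown statement {op!r}")
    return env

# Usage: no target program is ever generated; the source is executed directly.
result = interpret([("assign", "profit", 40),
                    ("add", "total", "profit", 60)])
```

Note how the source program is retained in source form and each statement is analyzed during its interpretation, exactly as the characteristics above describe.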
(Figure (a): interpretation — the interpreter keeps the source program and its data in memory and maintains a PC over source statements. Figure (b): program execution — the CPU executes a machine language program held in memory along with its data.)
• Characteristics of interpretation:
1. The source program is retained in source form itself; no target program is generated.
2. A statement is analyzed afresh during each interpretation of it.

Fundamentals of Language Processing
• Language processing = analysis of the source program (SP) + synthesis of the target program (TP).
• Analysis consists of three steps:
1. Lexical rules identify the valid lexical units.
2. Syntax rules identify the valid statements.
3. Semantic rules associate meaning with valid statements.
• Example: percent_profit := (profit * 100) / cost_price;
– Lexical analysis identifies :=, * and / as operators, 100 as a constant, and the remaining strings as identifiers.
– Syntax analysis identifies the statement as an assignment statement.
– Semantic analysis determines the meaning of the statement: assign (profit x 100) / cost_price to percent_profit.
• The synthesis phase consists of:
1. Creation of data structures (memory allocation)
2. Generation of target code (code generation)

MOVER  AREG, PROFIT
MULT   AREG, 100
DIV    AREG, COST_PRICE
MOVEM  AREG, PERCENT_PROFIT
...
PERCENT_PROFIT  DW  1
PROFIT          DW  1
COST_PRICE      DW  1

• Performing the analysis phase and the synthesis phase statement by statement in a single pass is not feasible due to:
1. Forward references
2. Memory requirements

Forward Reference
• A forward reference is a reference to an entity which precedes its definition in the program:
percent_profit := (profit * 100) / cost_price;
...
long profit;

Language Processor Passes
• Pass I performs analysis of the SP and notes relevant information.
• Pass II performs synthesis of the target program.
• Pass I analyzes the SP and generates an intermediate representation (IR), which is given as input to Pass II to generate the target code.

Intermediate Representation (IR)
• An IR reflects the effect of some, but not all, analysis and synthesis tasks performed during language processing. The front end transforms the source program into the IR; the back end transforms the IR into the target program.
• Desirable properties of an IR:
1. Ease of use
2. Processing efficiency
3. Memory efficiency

Toy Compiler
• The front end performs lexical, syntax and semantic analysis of the SP. Analysis involves:
1. Determining the validity of a source statement.
2. Determining the content of a source statement.
3. Constructing a suitable representation.
• Output of the front end:
1. Tables of information
2. Intermediate code (IC): a sequence of IC units, each representing the meaning of one action in the SP.
• Example:
i : integer;
a, b : real;
a := b + i;
Symbol table:
No.  Symbol  Type  Length  Address
1    i       int
2    a       real
3    b       real
4    i*      real
5    temp    real
Intermediate code:
1. Convert (Id, #1) to real, giving (Id, #4)
2. Add (Id, #4) to (Id, #3), giving (Id, #5)
3. Store (Id, #5) in (Id, #2)

Lexical Analysis (Scanning)
• Identifies the lexical units in a source statement.
• Classifies the units into different classes, e.g. identifiers, constants, reserved identifiers, and enters them into different tables.
• A token consists of a class code and a number in class, written Code #no, e.g. Id #10.
• Example: the statement a := b + i; yields the token string
a      :=     b      +      i      ;
Id #2  Op #5  Id #3  Op #3  Id #1  Op #10

Syntax Analysis (Parsing)
• Determines the statement class, e.g. assignment statement, if statement. For example, a, b : real; is recognized as a declaration, and a := b + i; as an assignment with := at the root of the parse tree, a as its left subtree and b + i as its right subtree.

Semantic Analysis
• Determines the meaning of the SP.
• Results in the addition of information such as type and length to the tables, and adds meaning information to the IC tree.
• Processing of a := b + i; proceeds as follows:
1. Type information is added to the IC tree.
2. The rules of assignment indicate that the expression on the RHS should be evaluated first.
3. The rules of addition indicate that i should be converted before the addition:
i. Convert i to real, giving i*.
ii. Add i* to b, giving temp.
iii. Store temp in a.
(Figures (a)-(c): the IC tree for a := b + i, with a and b of type real and i of type int, transformed step by step until the RHS is reduced to the temporary temp of type real.)
(Figure: front end of the toy compiler — scanning reports lexical errors and produces tokens; parsing reports syntax errors and produces a parse tree; semantic analysis reports semantic errors and produces IC; together with the symbol table, constant table and other tables this forms the IR.)

Back End
• Memory allocation: the address of each symbol is calculated from its type, length and dimensionality, e.g.
No.  Symbol  Type  Length  Address
1    i       int           2000
2    a       real          2001
3    b       real          2002
• Code generation:
1. Determine the places where results should be kept: registers or memory locations.
2. Determine which instructions should be used for type conversion operations.
3. Determine which addressing modes should be used for accessing variables.
Target code:
CONV_R  AREG, I
ADD_R   AREG, B
MOVEM   AREG, A
(Figure: back end of the toy compiler — memory allocation followed by code generation, using the IC, symbol table, constant table and other tables of the IR to produce the target program.)

Data Structures for Language Processing

Classification of Data Structures
• Classification criteria:
1. Nature: linear or non-linear.
2. Purpose: search or allocation.
3. Lifetime: whether used during language processing or during target program execution.

Linear vs. Non-linear
S.N.  Linear                                       Non-linear
1     Linear arrangement in memory                 Non-linear arrangement in memory
2     Requires a contiguous area of memory         Can use discontinuous areas of memory
3     Size of the data structure must be           No need to predict the size
      predictable
4     Wastage of memory                            Efficient use of memory
5     Very good search efficiency                  Poor search efficiency

Search Data Structures
• Search data structures are used during language processing to maintain attribute information concerning the different entities in the source program.
• The key field is the symbol field, containing the name of an entity.
• An entry format is a set of fields common to an entry. Entry format types:
1. Fixed length entries: each entry in the symbol table has fields for all attributes specified in the programming language.
2. Variable length entries: the entry occupied by a symbol has fields only for the attributes specified for symbols of its class.
3. Hybrid entries: a fixed part and a variant part, each containing a set of fields.
• Entries in the symbol table of a compiler have the following fields:
– Fixed part: the symbol field and a class/tag field.
– Variant part: fields depending on the tag value (symbol class):
Tag value (symbol class)   Variant part fields (attributes)
Variable                   type, length, dimension information
Label                      statement number
Procedure name             address of parameter list, number of parameters
Function name              type of return value, length of return value, address of parameter list, number of parameters

Fixed and Variable Length Entries
• An entry may be declared as a record or structure of the language in which the language processor is being implemented.
• A fixed length entry format consists of:
1. the fields in the fixed part of the entry, and
2. every field of every variant part of the entry.
For the symbol table above, a fixed length entry has the fields: 1. symbol, 2. class, 3. type, 4. length, 5. dimension information, 6. parameter list address, 7. number of parameters, 8. type of return value, 9. length of return value, 10. statement number.
• A variable length entry format consists of:
1. the fields in the fixed part of the entry, including the tag field, and
2. { fi | fi ∈ SFvj } if tag = vj.
This leads to a compact organization in which no memory wastage occurs; e.g. the entry for a label has only the fields 1. name, 2. class, 3. statement number.
• A hybrid entry format has a fixed part containing a length field and a pointer to the variable part.

Operations on Search Data Structures
1. ADD: add the entry of a symbol to the table.
2. SEARCH: search for and locate the entry of a symbol.
3. DELETE: delete the entry of a symbol.
The entry for a symbol is created only once but may be searched for a large number of times during the processing of a program. The deletion operation is not very common.

Generic Search Procedure
• To search for and locate the entry of symbol s in a search data structure:
1. Make a prediction concerning the entry of the search data structure which symbol s may be occupying; let this be entry e.
2. Let se be the symbol occupying the eth entry. Compare s with se; exit with success if the two match. (Each comparison is called a probe.)
3. Repeat steps 1 and 2 until it can be concluded that the symbol does not exist in the search data structure.

Table Organizations
1. Sequential search organization
2. Binary search organization
3. Hash table organization

Sequential Search Organization
• Suppose that there are n elements in the array.
• If the searched symbol is equally likely to occupy any of the n positions, the average number of comparisons in a successful search is
(1 + 2 + ... + n) / n
It is known that
1 + 2 + ... + n = n(n + 1) / 2
Therefore the average number of comparisons made by sequential search in the successful case is
(n + 1) / 2

Binary Search Organization
• The binary search algorithm assumes that the items in the array being searched are sorted.
• The algorithm begins at the middle of the array. If the item for which we are searching is less than the item in the middle, we know that the item cannot be in the second half of the array, and we examine the "middle" element of the remaining portion.
• The process continues, with each comparison cutting in half the portion of the array where the item might be.
• Algorithm (binary search for symbol s among entries 1..f):
1. start := 1; end := f;
2. While start ≤ end
a. e := (start + end) / 2; exit with success if s = se.
b. If s < se then end := e - 1; else start := e + 1.
3. Exit with failure.

Hash Table Organization
• A hash table is a data structure that stores things and allows insertions, lookups, and deletions to be performed in O(1) expected time.
• An algorithm converts an object, typically a string, to a number; the number is then compressed according to the size of the table and used as an index (key → hash code generator → number → compression → index). For example, the key "Smith" might hash to index 7, where the full record is stored.
• There is the possibility of distinct items being mapped to the same index. This is called a collision and must be resolved.
• Collision: a collision occurs when the hashing algorithm produces an address for an insertion key and that address is already occupied.

Collision Resolution
• There are two kinds of collision resolution:
1. Chaining makes each entry a linked list, so that when a collision occurs the new entry is added to the end of the list.
2. Open addressing uses probing to discover an empty spot.
• With chaining, the table does not have to be resized.
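The chaining scheme can be sketched as follows. This is a minimal sketch, assuming Python's built-in hash as the hash code generator and a list per slot in place of a linked list; the class and method names are invented for illustration.

```python
# Sketch of a chained hash table: each slot holds a (possibly empty)
# list of (key, value) pairs, and colliding keys are appended to the
# slot's list. Names and sizing here are illustrative.

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.slots = [[] for _ in range(capacity)]
        self.count = 0

    def _index(self, key):
        # hash code generation followed by compression to the table size
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collision: chain at the end of the list
        self.count += 1

    def lookup(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return None                       # not in the table

    def load_factor(self):                # LF = n / c, as defined in the text
        return self.count / len(self.slots)

# Usage: store symbol names and their addresses, as a symbol table might.
t = ChainedHashTable(capacity=4)
for name, addr in [("i", 2000), ("a", 2001), ("b", 2002)]:
    t.insert(name, addr)
```

With 3 objects in a table of capacity 4, the load factor is 0.75, and no resizing is ever required regardless of how many collisions occur.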
With open addressing, the table must be resized when the number of elements exceeds its capacity.

Collision Resolution: Chaining
• With chaining, each entry is the head of a (possibly empty) linked list. When a new object hashes to an entry, it is added to the end of that list.
• A particularly bad hash function could create a table with only one non-empty list containing all the elements; uniform hashing is therefore very important with chaining.
• The load factor of a chained hash table indicates how many objects should be found at each location, assuming reasonably uniform hashing. The load factor LF = n/c, where n is the number of objects stored and c is the capacity of the table.
• With uniform hashing, a load factor of 2 means we expect to find no more than two objects in one slot. A load factor less than 1 means that we are wasting space.

Open Addressing
• Another way to resolve collisions.
• In open addressing, each position in the array is in one of three states: EMPTY, DELETED, or OCCUPIED.
• If a position is OCCUPIED, it contains a legitimate value (key and data); otherwise it contains no value.
• Positions are initially EMPTY. If the value in a position is deleted, the position is marked DELETED, not EMPTY; the need for this distinction is explained below.
• The algorithm for inserting a value V into the table:
1. Compute the position at which V ought to be stored, P = H(key(V)).
2. If position P is not OCCUPIED, store V there.
3. If position P is OCCUPIED, compute another position in the table; set P to this value and repeat steps 2-3.

Linear Probing
• The simplest probing method is linear probing: P = (1 + P) mod TABLE_SIZE. This has the virtues that it is very fast to compute and that it probes (i.e. looks at) every position in the hash table. To see its fatal drawback, consider the following example.
• In this example the table is of size 10, keys are 2-digit integers, and H(k) = k mod 10.
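The probing rule just described can be sketched as follows, using the same table size and hash function. This is a sketch rather than a complete implementation: the DELETED state and the full-table case are omitted for brevity.

```python
# Minimal sketch of open addressing with linear probing.
# Table size 10 and H(k) = k mod 10, as in the example; the DELETED
# state and a full-table guard are omitted for brevity.

TABLE_SIZE = 10
EMPTY = None

def insert(table, key):
    p = key % TABLE_SIZE              # P = H(key)
    while table[p] is not EMPTY:      # position occupied: probe the next one
        p = (1 + p) % TABLE_SIZE      # P = (1 + P) mod TABLE_SIZE
    table[p] = key

table = [EMPTY] * TABLE_SIZE
for k in [15, 17, 8, 35, 25, 75]:
    insert(table, k)
# 35 collides with 15 at position 5 and lands in 6; 25 probes
# positions 5-8 and lands in 9; 75 wraps around to position 0.
```

Tracing the loop reproduces the insertion sequence worked through below.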
• Initially all the positions in the table are EMPTY.
• If we now insert 15, 17, and 8, there are no collisions; they occupy positions 5, 7 and 8.
• When we insert 35, we compute P = H(35) = 35 mod 10 = 5. Position 5 is occupied (by 15), so we must look elsewhere for a position in which to store 35. Using linear probing, the position we try next is (5 + 1) mod 10 = 6. Position 6 is empty, so 35 is inserted there.
• Now suppose we insert 25. We first try position 5 (25 mod 10); it is occupied. We next try position 6; it too is occupied. Next we try position 7; it is occupied, as is position 8. Position 9 is next; it is empty, so 25 is inserted there.
• Now suppose we insert 75. The first position we try is 5; then we try positions 6, 7, 8 and 9. Finally we try position 0 ((9 + 1) mod 10); it is empty, so 75 is stored there.
• Open addressing has the advantage that the amount of memory needed is fixed. It has two disadvantages:
1. As we have just seen, RETRIEVE is very inefficient when the key does not occur in the table.
2. The other disadvantage relates to the load factor defined earlier: with open addressing, the amount you can store in the table is, of course, limited by the size of the table, and, what is worse, as the load factor gets large, all the operations degrade to linear time.

Hash Table Organizations Compared
Organization    Advantages                                 Disadvantages
Chaining        Unlimited number of elements;              Overhead of multiple linked lists
                unlimited number of collisions
Re-hashing      Fast re-hashing; fast access through       Maximum number of elements must be known;
                use of main table space                    multiple collisions may become probable
Overflow area   Fast access; collisions don't use          Two parameters which govern performance
                primary table space                        need to be estimated

Allocation Data Structures
• The most basic allocation data structures are:
1. Stack
2. Heap
• When a program is loaded into memory, it is organized into three areas of memory, called segments: the text segment, the stack segment, and the heap segment. The text segment (sometimes also called the code segment) is where the compiled code of the program itself resides: the machine language representation of the program steps to be carried out, including all functions making up the program, both user-defined and system.
• The remaining two areas of system memory are where the compiler may allocate storage for data.

Program Address Space
• Any program you run has some memory associated with it, which is divided into:
– Code segment
– Data segment (holds global data)
– Stack (where local variables and other temporary information are stored)
– Heap
(Figure: the heap grows downwards and the stack grows upwards, with the data segment and code segment below them.)

Stack
• The stack is where memory is allocated for automatic variables within functions. A stack is a Last In First Out (LIFO) storage device, where new storage is allocated and deallocated at only one "end", called the top of the stack.

Local Variables: Stack Allocation
• For a declaration of the form int a;, a variable with identifier "a" and some memory allocated to it is created on the stack. The attributes of a are:
– Name: a
– Data type: int
– Scope: visible only inside the function in which it is defined; disappears once we exit the function
– Address: the address of the memory location reserved for it (note: memory is allocated on the stack for a even before it is initialized)
– Size: typically 4 bytes
– Value: set once the variable is initialized
• Since the memory allocated for the variable is fixed at the outset, we cannot use the stack in cases where the amount of memory required is not known in advance. This motivates the need for the heap.

Pointers
• We know what a pointer is.
Let us say we have declared a pointer int *p;. The attributes of p are:
– Name: p
– Data type: integer address
– Scope: local or global
– Address: an address in the data segment or stack segment
– Size: 32 bits on a 32-bit architecture
• We saw how fixed memory allocation is done on the stack; now we want to allocate dynamically. Consider the declaration int *p;: the compiler knows that we have a pointer p that may store the starting address of a variable of type int. To point p to a dynamic variable we use a statement of the form p = new int;.

Allocation Data Structure: Stack
• A linear data structure: allocation and deallocation are performed in LIFO manner, and only the last entry is accessible at any time.
• Two pointers are used:
– Stack base (SB): points to the first word of the stack area.
– Top of stack (TOS): points to the last entry allocated.
• Stack operations: when an entry is pushed on the stack, TOS is incremented by l (where l is the size of the entry); when an entry is popped, i.e. deallocated, TOS is decremented by l.

Extended Stack Model
• One extra pointer is used, known as the record base pointer; it points to the first word of the last record in the stack.

Heap
• In general, perhaps 50 percent of the computer's total memory space might be unused at any given moment. The operating system owns and manages this unused memory, which is collectively known as the heap. The heap is extremely important because it is available for use by applications during execution, via the C functions malloc (memory allocate) and free. The heap allows programs to allocate memory exactly when they need it during the execution of a program, rather than pre-allocating it with a specifically sized array declaration.
• The heap segment provides more stable storage of data for a program: memory allocated in the heap remains in existence for the duration of the program. Therefore, global variables (storage class external) and static variables are allocated on the heap.
The memory allocated in the heap area, if initialized to zero at program start, remains zero until the program makes use of it; thus the heap area need not contain garbage.
• The heap is a non-linear data structure which permits allocation and deallocation of entities in random order.
• An allocation request returns a pointer to the allocated area in the heap; a deallocation request must present a pointer to the area to be deallocated.

Heap Memory Management
• Identifying free memory areas:
1. Reference counts
2. Garbage collection
• Reuse of memory:
1. First fit technique
2. Best fit technique

First Fit Technique
• In the first fit technique, the free list is traversed sequentially to find the first free block whose size is larger than or equal to the amount requested.
• Once the block is found, it is removed from the free list and the free list is suitably manipulated:
– If every block is of fixed size, the block is allocated as is.
– If block size is not fixed, the block is split into two portions: the first portion remains in the free list and the second is allocated.
• Example. The status of memory is as follows (U: used, A: available):
0k: Block1 (20K, U) | 20k: free (10K, A) | 30k: Block2 (30K, U) | 60k: free (30K, A) | 90k: Block3 (20K, U) | 110k: free (10K, A) | 120k: Block4 (40K, U) | 160k: free (20K, A)
• 20K block allocation: the first free block of size >= 20K starts at address 60k. This 30K block is split into two parts of 10K and 20K, and the second part (at address 70k) is allocated.
• 10K block allocation: the first free block of size >= 10K starts at address 20k; it is allocated whole.
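The first fit technique above can be sketched over a free list of (address, size) pairs, seeded with the free areas of the example layout (sizes in KB). The function name and list representation are invented for illustration.

```python
# Sketch of first fit over a free list of (address, size) pairs (sizes
# in KB). The free areas mirror the example layout in the text.

def first_fit(free_list, request):
    """Allocate `request` KB; return the start address, or None."""
    for i, (addr, size) in enumerate(free_list):
        if size >= request:                     # first block that is large enough
            if size == request:
                free_list.pop(i)                # exact fit: remove the block
            else:
                # split: the first portion stays in the free list,
                # the second portion is allocated
                free_list[i] = (addr, size - request)
                addr = addr + (size - request)
            return addr
    return None                                 # no block is large enough

# Free areas from the example: 10K at 20k, 30K at 60k, 10K at 110k, 20K at 160k.
free_list = [(20, 10), (60, 30), (110, 10), (160, 20)]
a1 = first_fit(free_list, 20)   # splits the 30K block at 60k; allocates at 70k
a2 = first_fit(free_list, 10)   # exact fit: the 10K block at 20k
```

The two calls reproduce the example: the 20K request splits the 30K block at 60k and is placed at 70k, and the 10K request consumes the 10K block at 20k.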
• After these allocations the layout is:
0k: Block1 (20K) | 20k: new block (10K) | 30k: Block2 (30K) | 60k: free (10K) | 70k: new block (20K) | 90k: Block3 (20K) | 110k: free (10K) | 120k: Block4 (40K) | 160k: free (20K)

Best Fit Technique
• In the best fit method, we obtain the smallest free block whose size is greater than or equal to the requested size. Starting from the same initial status of memory as above:
• 20K block allocation: the smallest free block of size >= 20K starts at address 160k; it is allocated whole.
• 10K block allocation: the smallest free block of size >= 10K starts at address 20k.
• Worst fit: this method is the opposite of best fit — obtain the largest free block whose size is greater than or equal to the requested size.
• Next fit: the search of the free list starts from the node following the previously allocated node; this reduces search time.

Fundamentals of Language Specification
• Programming language grammars build a language in layers: an alphabet yields words/strings, which yield sentences/statements, which form the language.

Terminal Symbols, Alphabet and Strings
• The alphabet of a language L is represented by the Greek symbol Σ, e.g. Σ = {a, b, ..., z, 0, 1, ..., 9}.
• A string is a finite sequence of symbols,
e.g. α = axy.

Productions
• A production, also called a rewriting rule, has the form
<nonterminal symbol> ::= string of terminals and nonterminals
e.g.
<Noun Phrase> ::= <Article> <Noun>
<Article> ::= a | an | the
<Noun> ::= boy | apple

Grammar
• A grammar G of a language LG is a quadruple (Σ, SNT, S, P), where
– Σ is the alphabet (the set of terminal symbols)
– SNT is the set of nonterminal symbols
– S is the distinguished (start) symbol
– P is the set of productions

Derivation
• Let production P1 of grammar G be of the form P1 : A ::= α, and let β be a string such that β = γAθ. Then replacing A by α in β yields the string γαθ; we say that β directly derives γαθ.
• Example grammar:
<Sentence> ::= <Noun Phrase> <Verb Phrase>
<Noun Phrase> ::= <Article> <Noun>
<Verb Phrase> ::= <Verb> <Noun Phrase>
<Article> ::= a | an | the
<Noun> ::= boy | apple
<Verb> ::= ate
• Derivation of "the boy ate an apple":
<Sentence>
=> <Noun Phrase> <Verb Phrase>
=> <Article> <Noun> <Verb Phrase>
=> <Article> <Noun> <Verb> <Noun Phrase>
=> <Article> <Noun> <Verb> <Article> <Noun>
=> the <Noun> <Verb> <Article> <Noun>
=> the boy <Verb> <Article> <Noun>
=> the boy ate <Article> <Noun>
=> the boy ate an <Noun>
=> the boy ate an apple

Binding and Binding Times
• A program entity pei in a program P has a number of attributes: its kind (variable, procedure, reserved identifier, etc.), type, size, dimensionality, memory address, and so on.
• Definition: a binding is the association of an attribute of a program entity with a value. Binding time is the time at which a binding is performed.
• Different binding times:
1. Language definition time of L
2. Language implementation time of L
3. Compilation time of P
4. Execution init time of a procedure
5. Execution time of a procedure
• Example (Pascal):
program bindings(input, output);
var i : integer;
    a, b : real;
procedure proc(x : real; j : integer);
var info : array[1..10, 1..5] of integer;
    p : ^integer;
begin
    new(p);
end;
begin
    proc(a, i);
end.
1. The keywords program, procedure, begin and end are bound to their meanings at language definition time.
2. The size of type integer is bound to n bytes at language implementation time.
3. Each variable is bound to its type at compilation time.
4. Memory addresses are allocated to the variables at execution init time.
5. Values are bound to memory addresses at execution time.

Types of Binding
• Static binding: a binding performed before the execution of a program begins.
• Dynamic binding: a binding performed after the execution of a program has begun.

Language Processor Development Tools (LPDTs)
• An LPDT requires two inputs:
1. Specification of a grammar of language L
2. Specification of semantic actions
• Two widely used LPDTs:
1. LEX (lexical analyzer generator)
2. YACC (parser generator)
• Each contains translation rules of the form
<string specification> { <semantic action> }
• From a lexical specification, LEX produces a scanner; from a syntax specification, YACC produces a parser; together they form a front end that transforms a source program in L into an IR.

LEX
• A LEX specification consists of two components: the specification of strings such as identifiers and constants, and the specification of semantic actions. It has three sections: symbol definitions (1st section), translation rules, i.e. specification plus action (2nd section), and routines (3rd section).
%{
/* symbol definitions: 1st section */
%}
letter  [A-Za-z]
digit   [0-9]
%%
begin                        {return(BEGIN);}
end                          {return(END);}
{letter}({letter}|{digit})*  {yylval = enter_id(); return(ID);}
{digit}+                     {yylval = enter_num(); return(NUM);}
%%
enter_id()  { /* enters the id in the symbol table */ }
enter_num() { /* enters the number in the constants table */ }

YACC
• A string specification in YACC resembles a grammar production; YACC performs reductions according to the grammar. Example:
%%
E : E '+' T  {$$ = gencode('+', $1, $3);}
  | T        {$$ = $1;}
  ;
T : T '*' V  {$$ = gencode('*', $1, $3);}
  | V        {$$ = $1;}
  ;
V : id       {$$ = gendesc($1);}
  ;
%%
gencode(operator, operand_1, operand_2) { /* ... */ }
gendesc(symbol) { /* ... */ }

Thank You!!!