Writing a Compiler

Part One: Introduction

Terminology

A compiler is a program that translates a source program written in a source language (such as Pascal, PL/I, C, or C++) into an object language (by default, machine language).

    Source Program --> Compiler --> Target Program
                          |
                          v
                        Errors

Machine, Assembly, and High-Level Languages

Machine language is the native language of the computer on which the program runs; indeed, people sometimes call it "native code".

A typical machine-language instruction in the IBM 370 family of computers looks like this:

    0001100000110101

This instruction causes the computer to copy the contents of General Register 5 into General Register 3.

In the early days of computing, people programmed in binary and wrote out bit strings like the one shown above.

The first and most primitive programming-language translators were assemblers. These let the programmer write, in assembly language,

    LR 3, 5

instead of the bit string shown above (LR = 00011000, 3 = 0011, 5 = 0101).

A single line of assembly-language code normally corresponds to a single machine-language instruction.

Languages like Pascal, C, PL/I, Fortran, and Cobol are known as high-level languages. They have the property that a single statement, such as

    x := y + z;

corresponds to more than one machine-language instruction. If this Pascal statement is to be run on an IBM 370, it will be translated into the sequence:

    L  3, Y    Load the working register with Y
    A  3, Z    Add Z
    ST 3, X    Store the result in X

Since a typical high-level language statement may correspond to perhaps 10 assembly-language instructions, it follows that we can be roughly 10 times as productive if we program in Pascal or C instead of assembly language. In other words, high-level programming languages increase productivity.

The high-level language that the compiler accepts as its input is called the source language. The source-language program that is fed to the compiler is known as source code.
The computer on which the program is to be run is known as the target machine (in most cases, the same computer on which the program was compiled).

It is sometimes useful to have a compiler that generates code for a machine different from the one on which the compiler runs. Such a compiler is known as a cross-compiler.

Compilers, Assemblers and Interpreters

An assembler takes assembly language, rather than a higher-level language, as its source language.

A compiler translates the high-level program we give it into machine language.

An interpreter executes the program directly. An example in BASIC:

    10 y = 10
    20 z = 15
    30 x = y + z
    40 print x
    50 end

A program in BASIC (Beginner's All-purpose Symbolic Instruction Code) is not compiled. Instead, the BASIC interpreter analyzes the program line by line and, rather than translating the code, uses the computer to carry out the operations specified.

Advantages of interpreters: immediate response (you see the results immediately) and flexibility (for example, expanding an array into a matrix while the program is executing).

Disadvantage: low speed of execution. Example (a loop):

    10 for i = 1 to 100
    20 x(i) = y(i) + z(i)
    30 next i

The interpreter has to analyze statement 20 a hundred times over in the course of the loop. However, this analysis takes less time than a full translation would.

The interpretive approach is most valuable during software development.

The Environment of the Compiler

The compiler environment consists of all the system software needed to develop a new program or application (an integrated development environment, IDE).

The text editor is the software used to create the source program file, so it is part of the compiler environment.

The preprocessor is software that cleans up the source code and prepares it for the compiler, so it too is part of the compiler environment.

The object file produced by the compiler is not ready to run.
For example, if your program contains a statement like

    y := sqrt(x);

then that square root has to be computed somewhere. The same holds for functions such as logarithms, character-string operations, input-output handling, dynamic-memory operations, and external functions. These functions are provided in a run-time library (a collection of object modules for computing them).

Hence we need another step, in which all the required run-time library services are identified and loaded into memory along with the object module for the program. This process of loading and inserting addresses is known as linking. The linker generates an executable program.

"Compiler environment" refers to where the compiler fits into the overall process of writing and executing a program (the integrated development environment, IDE):

    Source Program (from the text editor)
      --> Preprocessor --> New Source Program
      --> Compiler     --> Object Program
      --> Linker (with the Run-Time Library) --> Relocatable Code
      --> Loader       --> Executable Code

The linker produces relocatable code, which is laid out as if it started at memory location zero, while the loader produces executable code that starts at an actual location X in memory (actual-address code).

Phases of a Compiler

The compiler is composed of two major parts:

1. Analysis part: includes the first four phases.
2. Synthesis part: includes the remaining phases.

    Source Program
      --> Lexical Analyzer
      --> Syntax Analyzer
      --> Semantic Analyzer
      --> Intermediate Code Generator
      --> Code Optimizer
      --> Object Code Generator
      --> Object Program

    The Symbol Table Handler and the Error Handler work alongside these phases.

Symbol table: a data structure used to store information about variables.

Error handler: the part of the compiler that reports compilation errors.

Lexical Analysis (Scanning)

The lexical analyzer (scanner) breaks the source code up into meaningful units (tokens). The process is sometimes called tokenizing; its output is represented as a list of tokens.
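As a rough illustration of tokenizing, the sketch below scans one Pascal-like statement in Python. The token class abbreviations (kw, id, c, op, p, b) follow the ones used in these notes; the `tokenize` function, the `TOKEN_SPEC` table, and the regular expressions are assumptions made for this example and do not come from any particular compiler.

```python
import re

# Hypothetical keyword set: just enough for the sample statement.
KEYWORDS = {"for", "to", "do"}

# Hypothetical token classes, tried in this order.
TOKEN_SPEC = [
    ("c", r"\d+"),              # constants
    ("word", r"[A-Za-z_]\w*"),  # keywords or identifiers
    ("op", r":="),              # assignment operator
    ("b", r"[\[\]]"),           # brackets
    ("p", r";"),                # punctuation
    ("ws", r"\s+"),             # white space, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbols) for a Pascal-like statement."""
    symbols = {}  # identifier name -> internal name such as "id1"
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "ws":
            continue  # remove white space
        if kind == "word":
            lowered = text.lower()  # case conversion for a case-insensitive language
            if lowered in KEYWORDS:
                tokens.append(("kw", lowered))
            else:
                # Identifier decoding: reuse the internal name if already known.
                internal = symbols.setdefault(lowered, f"id{len(symbols) + 1}")
                tokens.append((internal, lowered))
        else:
            tokens.append((kind, text))
    return tokens, symbols


tokens, symbols = tokenize("for i := 1 to max do x[i] := 0;")
for kind, text in tokens:
    print(f"{text:4} {kind}")
```

A real scanner would also handle comments, string constants, and compiler directives; those are left out here to keep the sketch short.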
Example

Source:

    for i := 1 to max do x[i] := 0;

Analysis:

    Keywords (kw):     for, to, do
    Identifiers (id):  i, max, x
    Constants (c):     1, 0
    Operators (op):    :=
    Punctuation (p):   ;
    Brackets (b):      [, ]

Output:

    Token:  for  i    :=  1  to  max  do  x    [  i    ]  :=  0  ;
    Class:  kw   id1  op  c  kw  id2  kw  id3  b  id1  b  op  c  p

Functions:

1. Tokenizing: The programmer is required to separate many parts of a statement with blanks or tab characters (white space). These blanks make it easier for the compiler to determine where one token ends and the next begins.
2. Identifier decoding: The scanner replaces the actual variable names with their internal representations within the symbol table. Example:

       Source: Distance := rate * time;
       Output: id1 := id2 * id3;

3. Removing excess white space: The scanner may remove excess white space or, in some cases, all of it.
4. Identifying comments: Comments are removed.
5. Case conversion: For case-insensitive languages, the scanner converts letters to a uniform case.
6. Identifying string values (for example, "example"): The scanner takes care of constant data, such as string constants and numeric constants, keeping them in the symbol table under specific code names for later use.
7. Interpretation of compiler directives: These are instructions to the compiler, not to the machine (for example, #include <iostream>).
8. Communication with the symbol table handler: The scanner builds the symbol table by collecting the variables one by one during tokenizing and inserting them into the table. At the end of the lexical analysis phase the symbol table is built and ready to be used by the next phases.

Syntactic Analysis (Parsing)

The syntactic analyzer (parser) determines the structure of the program and of the individual statements, and detects syntax errors in statements.

Example:

    for i := 1 to max do x[i] := 0;

Analysis:

    The structure:     for loop
    Loop counter:      i
    Limits of loop:    1, max
    Body of the loop:  single assignment statement

The term parsing comes from linguistics and draws heavily on generative grammars.
Programming languages can also be described by grammars, and the design of a parser takes the grammar as its starting point. Parse trees are an important representation of statements in programming languages. The output of the parser is a parse tree.

Example grammar for the assignment statement:

    S -> id := E ;
    E -> E * E | E + E | E - E | E / E | id | const

Example:

    Source of the parser: id1 := id2 * id3;
    Output: parse tree

                 S
          /    |     |    \
        id    :=     E     ;
        |         /  |  \
       id1       E   *   E
                 |       |
                 id      id
                 |       |
                id2     id3

Semantic Analysis

The semantic analyzer ensures that valid program statements conform to the meaning constraints (semantic rules), such as type matching and scope rules.

Functions:

Declarations and scope rules. For declarations, this phase checks whether the variables and identifiers used in the program have been declared. For scope rules, it checks whether each variable is used within its proper scope.

Ex1: Declaration:

    int main( )
    {
        int x = 5, y = 10;
        z = x + y;   // z is an undeclared identifier
        cout << z;
        return 0;
    }

Ex2: Scope rules:

    int main( )
    {
        f1( );       // error: f1 is used before it is declared
        return 0;
    }
    void f1( )
    {
        cout << "Welcome";
    }

Type checking.

Ex1: Type checking:

    int main( )
    {
        int x = 5, z;
        float y = 10.897;
        z = x + y;   // type mismatch
        cout << z;
        return 0;
    }

Storage allocation.

Ex: In Pascal:

    var a: array [1..10] of integer;
    a[11] := 30;   { out of range }

Intended meaning of overloaded operators (operators with more than one use).

Ex: In C++:

    int *i;               // * means pointer
    int x = 6, y = 5, z;
    z = x * y;            // * means multiplication
    i = &z;
    cout << *i;           // * is the dereferencing operator

Automatic type conversion.

Ex:

    int x = 5;
    float y;
    y = x + 10;   // type conversion from int to float

Intermediate Code Generation

The intermediate code generator creates an internal representation of the program that reflects the information uncovered by the parser.
Every high-level language statement corresponds to several machine-language instructions, so the statement has to be broken down into small pieces corresponding to these instructions. This is done in intermediate code generation.

Intermediate code is code at a level between the high-level form and machine language. It is a form in which the small pieces corresponding to machine instructions are visible, but which is not yet machine language or even assembly language. The reason for having an intermediate language is that optimization is still to be done.

The most widely used representation for intermediate code is three-address code (3AC). It is so called because each instruction in that language contains no more than three addresses (variables or values), and takes the following form:

    result := operand operator operand

Example:

    Source: x := a*y + z;
    Output: T1 := a * y
            T2 := T1 + z
            x := T2

Code Optimization

Code optimization (code enhancement) is the process of identifying and removing redundant operations from the intermediate code to make it more efficient. Efficiency is measured by code size and CPU execution time.

Example:

    Source: the intermediate code given above
    Output: T1 := a * y
            x := T1 + z

Exercise 1: Convert the following statement into 3AC and then optimize it.

    Source: x = (a+b) * (a+b)

Exercise 2: Convert the following statements into 3AC and then optimize them.

    Source: a = a/2 * (b+5)
            c = a*2
            d = b+5
            c = c-d
            e = a/2 * (b+5)
            f = e+c

Object Code Generation

The object code generator translates the optimized intermediate program into the language of the target machine.

Example (IBM System/370):

    Source: the intermediate code given above
    Output: LE  4, A    Load A in floating-point register 4
            ME  4, Y    Multiply by Y
            AE  4, Z    Add Z
            STE 4, X    Store in X

Whereas the other phases are language dependent, this phase is machine dependent.
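The three-address code and optimization examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it borrows Python's own `ast` module to parse the right-hand side (so the expression must also be valid Python syntax), and the helper names `to_3ac`, `optimize`, and `show` are hypothetical. The optimizer handles just two cases, reusing a repeated subexpression and folding the final copy, which is enough for single-expression assignments like the ones in these notes.

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_3ac(target, expr):
    """Translate `target := expr` into (result, left, op, right) tuples."""
    code = []
    counter = 0

    def walk(node):
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = walk(node.left)
            right = walk(node.right)
            counter += 1
            temp = f"T{counter}"  # temporary supplied on demand
            code.append((temp, left, OPS[type(node.op)], right))
            return temp
        raise ValueError("unsupported expression")

    result = walk(ast.parse(expr, mode="eval").body)
    code.append((target, result, None, None))  # final copy: target := Tn
    return code


def optimize(code):
    """Reuse repeated subexpressions, then fold the final copy away.

    Assumes every name is assigned at most once, as in code from to_3ac.
    """
    seen = {}   # (left, op, right) -> temp that already holds the value
    alias = {}  # redundant temp -> temp that replaces it
    out = []
    for result, left, op, right in code:
        left = alias.get(left, left)
        right = alias.get(right, right)
        key = (left, op, right)
        if op is not None and key in seen:
            alias[result] = seen[key]  # drop the duplicate computation
            continue
        if op is not None:
            seen[key] = result
        out.append((result, left, op, right))
    # Fold `x := Tn` into the instruction that computed Tn.
    if len(out) >= 2 and out[-1][2] is None and out[-1][1] == out[-2][0]:
        target = out[-1][0]
        _, left, op, right = out[-2]
        out[-2:] = [(target, left, op, right)]
    return out


def show(code):
    """Render tuples in the `result := operand operator operand` form."""
    return [f"{r} := {a}" if op is None else f"{r} := {a} {op} {b}"
            for r, a, op, b in code]


code = to_3ac("x", "a*y + z")
print("\n".join(show(code)))            # the unoptimized 3AC
print("\n".join(show(optimize(code))))  # the optimized 3AC
```

For `x := a*y + z` the two printouts correspond to the unoptimized and optimized listings shown in the examples above. A production optimizer works on whole blocks of code and must track reassignments, which this sketch deliberately ignores.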
This phase is one of the most difficult parts of compiler writing, because it is machine dependent.

The questions to be considered in this phase are mostly associated with the order in which machine instructions are to be generated and how the machine's registers are to be used.

Compiler Passes

A pass consists of reading a version of the program from a file and writing a new version of it to an output file. A pass normally comprises more than one phase.

The compiler makes one or more passes through the program.

Single-pass compilers tend to be fastest.

Multiple-pass compilers are necessary for two reasons:

1. Certain questions raised earlier in the program may remain unanswered until the rest of the program has been read (for example, forward references).
2. There may not be enough memory available to hold all the intermediate results obtained in the course of compilation.

On each pass a new portion of the compiler may be loaded into memory, overwriting the portions whose tasks are completed.

Two-Pass Compiler (Front End and Back End)

Since lexical analysis, syntax analysis, and intermediate code generation are closely related, the parser is sometimes put in charge of these phases (in the driver's seat). These three phases, together with some of the optimization phase, are called the front end of the compiler (programming-language dependent), while the rest are called the back end of the compiler (machine dependent).

System Support

Symbol Table Handler

The symbol table is maintained by a procedure known as the symbol-table handler.

The symbol-table handler builds and maintains a symbol table based on the defining occurrences and scopes.

The symbol table is the central repository of information about identifiers created by the programmer. For each identifier, the table contains its name, its attributes, and various other information.
It might take the following form:

    Id Name   Type   Value   Scope   Location   Line#
    x         int    20      S1      main       20

Example:

    int x = 20;                  // definition statement (insert x into the symbol table)
    if (x <= 5) cout << "ok";    // usage statement (look x up in the symbol table)

In strongly typed languages, where everything must be declared by the programmer before use, the lexical analyzer and the parser, working together, must make sure that:

- every declared identifier (defining occurrence) is entered in the symbol table, and
- every identifier used subsequently (applied occurrence) has been declared.

Example:

    X = y + z - 5

    T1 := y + z     // T1 and T2 are not user-defined variables;
    T2 := T1 - 5    // they are introduced by the intermediate code generator
    X := T2

One of the symbol-table handler's functions is to provide temporary variables for the intermediate-code generator on demand and to add them to the symbol table.

Error Handler

Error handling implements the compiler's response to errors in the code it is compiling.

When an error is detected, the error handler must tell the user about it (what kind of error it is and where it occurred). It might also apply some makeshift fix-up, patching the error during compilation so that the compiler can continue through the rest of the source program and find all the errors.

The way errors are handled depends in part on how the compiler is intended to be used (e.g., for teaching purposes, or inside Pascal's integrated development environment).

Writing a Compiler

Which programming language is suitable for writing the compiler itself?
Old Compilers Written Using Assembly Language

Writing the compiler:

    Source for FORTRAN Compiler (in assembly)
      --> Assembler --> FORTRAN Compiler Object Module
      --> Linker    --> Executable FORTRAN Compiler

Using the compiler:

    User's source code
      --> FORTRAN Compiler --> User's Object Module
      --> Linker           --> User's Executable Program

Using High-Level Languages

If you have a compiler for C and wish to write a compiler for PL/I, you can write the PL/I compiler in C.

Writing the compiler:

    Source for PL/I Compiler (in C)
      --> C Compiler --> PL/I Compiler Object Module
      --> Linker     --> Executable PL/I Compiler

Using the compiler:

    User's source code (in PL/I)
      --> PL/I Compiler --> User's Object Module
      --> Linker        --> User's Executable Program

Using the Same Language

In some cases, a compiler may be written in its own language. There are two ways of doing this:

I. Bootstrapping Compilers: Suppose you require a compiler for C. You start off by writing a minimal compiler in assembly language. This compiler supports only those operations needed in writing a compiler. Then, in a second step, you write a full compiler in C, using only those features of the language supported by the minimal compiler. You then compile the full compiler using the minimal compiler. Bootstrapping is used for producing new compiler versions for the same target machine.

Writing the minimal compiler (first step):

    Source for Minimal C Compiler (in assembly)
      --> Assembler --> Minimal C Compiler Object Module
      --> Linker    --> Executable Minimal C Compiler

Writing the full compiler (second step):

    Source for Full C Compiler (in minimal C)
      --> Minimal C Compiler --> Full C Compiler Object Module
      --> Linker             --> Executable Full C Compiler

Using the compiler:

    User's source code (in full C)
      --> Full C Compiler --> User's Object Module
      --> Linker          --> User's Executable Program

II. Cross Compilers: Suppose you have a satisfactory C compiler on the VAX and wish to write a C compiler for the Macintosh. You can write the compiler in C and run it on the VAX as a cross-compiler (it runs on the VAX and produces object code for the Macintosh).
Writing the cross-compiler (on the VAX):

    Source for Mac C Compiler (in C)
      --> C Compiler (on the VAX) --> Mac C Compiler Object Module
      --> Linker                  --> Mac C Compiler (for VAX)

The compiler we get from this step is the cross-compiler. To get a compiler we can actually use on the Macintosh, we feed the very same source code into the cross-compiler.

Using the cross-compiler:

    Source for Mac C Compiler (in C)
      --> Mac C Cross-Compiler --> Mac C Compiler Object Module
      --> Linker               --> Mac C Compiler (for Mac)

Using the compiler (on the Macintosh):

    User's source code (in C)
      --> Mac C Compiler --> User's Object Module
      --> Linker         --> User's Executable Program

A useful tool for many of these approaches is software for generating parts of the compiler. Such tools are sometimes called "compiler compilers". Lexical analysis and syntax analysis are supported by these tools (e.g., Lex and Yacc).

Lex constructs tables for lexical scanners; the user gives it definitions of tokens and Lex returns the tables needed for the scanner.

Yacc (Yet Another Compiler Compiler) accepts a grammar for the programming language and generates an LALR(1) parser for the language. LALR is a variant of LR parsing, where LR stands for Left-to-right scan, Rightmost derivation in reverse.

Retargetable Compilers

A compiler that can be modified to be used with a new target machine is said to be "retargetable".

Approaches for building retargetable compilers:

- The cross-compiler approach (explained above).
- The front-end/back-end approach.
- Writing a compiler for an imaginary machine: The imaginary machine was a stack-based machine whose language was known as p-code. The compiler compiled Pascal to p-code, and the compiler itself was supplied in p-code form. To install this system on a given machine, you wrote a p-code interpreter for the machine. The result, after installation of the p-code interpreter, was a stack-based virtual computer whose machine language appeared to be p-code. (Examples: p-code for Pascal; the UCSD p-System, from the Institute for Information Systems at the University of California, San Diego, which developed UCSD Pascal; the Java virtual machine.)
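To make the idea of a stack-based imaginary machine concrete, here is a minimal interpreter sketch in Python. The instruction names (`LOAD`, `PUSH`, `ADD`, `SUB`, `MUL`, `STORE`) and the `run` function are invented for this illustration; they are not the actual p-code or Java virtual machine instruction sets.

```python
def run(program, memory):
    """Interpret a tiny, made-up stack instruction set (not real p-code)."""
    stack = []
    for op, *args in program:
        if op == "PUSH":      # push a constant
            stack.append(args[0])
        elif op == "LOAD":    # push the value of a variable
            stack.append(memory[args[0]])
        elif op == "STORE":   # pop the top of the stack into a variable
            memory[args[0]] = stack.pop()
        elif op in ("ADD", "SUB", "MUL"):
            right = stack.pop()
            left = stack.pop()
            if op == "ADD":
                stack.append(left + right)
            elif op == "SUB":
                stack.append(left - right)
            else:
                stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return memory


# x := a*y + z, expressed for the stack machine
program = [
    ("LOAD", "a"),
    ("LOAD", "y"),
    ("MUL",),
    ("LOAD", "z"),
    ("ADD",),
    ("STORE", "x"),
]
print(run(program, {"a": 2, "y": 3, "z": 4})["x"])  # prints 10
```

Porting such a system to a new machine means rewriting only an interpreter like `run`; the compiler that emits the stack code stays unchanged, which is the point of the approach.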
