Writing a Compiler

Part One
Introduction
Terminology
 A compiler is a program that translates a source program
written in a source language (like Pascal, PL/I, C, C++, …)
into an object language (default: machine language).
[Figure: Source Program → Compiler → Target Program; the compiler also reports Errors]
Machine, Assembly, and High-Level Languages
 Machine language is the native language of the computer
on which the program is run; indeed, people sometimes call
it "native code".
 A typical machine-language instruction in the IBM 370
family of computers looks like this:
0001100000110101
 This instruction causes the computer to copy the contents
of General Register 5 into General Register 3.
 In the early days of computing, people programmed in
binary, and wrote out the bit string as shown above.
 The first and most primitive programming language
translators were assemblers; these permitted the programmer
to write (in assembly language) LR 3, 5 instead of the
bit string shown above (LR = 00011000, 3 = 0011, 5 = 0101).
 A single line of assembly-language code normally
corresponds to a single machine-language instruction.
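The one-to-one correspondence between an assembly line and a machine instruction can be sketched in a few lines of Python. This is only an illustration of the LR 3, 5 example above: the function name assemble_rr and the one-entry opcode table are assumptions made for the sketch, not part of any real assembler.

```python
# Opcode table: only the LR encoding is taken from the notes.
OPCODES = {"LR": 0b00011000}

def assemble_rr(line):
    """Encode a register-to-register instruction such as 'LR 3, 5' as 16 bits."""
    mnemonic, operands = line.split(maxsplit=1)
    r1, r2 = (int(token) for token in operands.split(","))
    if not (0 <= r1 <= 15 and 0 <= r2 <= 15):
        raise ValueError("register numbers must fit in 4 bits")
    # 8-bit opcode, then two 4-bit register fields.
    word = (OPCODES[mnemonic] << 8) | (r1 << 4) | r2
    return format(word, "016b")

print(assemble_rr("LR 3, 5"))  # 0001100000110101
```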
 Languages like Pascal, C, PL/I, Fortran, and Cobol are
known as high-level languages.
 They have the property that a single statement, such as:
x:= y + z; corresponds to more than one machine language
instruction.
 If the previous Pascal statement is to be run on an IBM 370,
then it will be translated into the sequence:
L   3, Y    Load the working register with Y
A   3, Z    Add Z
ST  3, X    Store the result in X
 Since a typical high-level language statement may
correspond to perhaps 10 assembly-language instructions,
it follows that we can be roughly 10 times as productive if we
program in Pascal or C instead of assembly language
(high-level programming languages increase productivity).
 The high-level language that the compiler accepts as its
input is called the source language.
 The source-language program that is fed to the compiler is
known as source code.
 The computer on which the program is to be run is known
as the target machine (in most cases the same computer on
which the program was compiled).
 It is useful to have a compiler that generates code for a
machine that is different from the machine on which a
compiler runs. Such a compiler is known as a
cross-compiler.
Compilers, Assemblers and Interpreters
 The assembler takes assembly language as its source
language, instead of a higher-level language.
 A compiler translates the high-level program we give it into
machine language.
 An interpreter executes the program. An example (BASIC):
10 y = 10
20 z = 15
30 x = y + z
40 print x
50 end
 A program in BASIC (Beginner's All-Purpose Symbolic Instruction Code)
is not compiled. Instead, BASIC analyzes the program line
by line and, rather than translating the code, uses the
computer to carry out the operations specified.
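A very small interpreter for the five-line BASIC program above might look like the following Python sketch. It is an assumption-laden illustration, not how any real BASIC works internally: it handles only the assignment, print and end statements that appear in the example, and the names PROGRAM and run_basic are made up here.

```python
PROGRAM = """\
10 y = 10
20 z = 15
30 x = y + z
40 print x
50 end"""

def run_basic(program):
    """Analyze and execute each line in turn; return the printed values."""
    variables, output = {}, []

    def value(token):
        # A token is either an integer literal or a variable name.
        return int(token) if token.isdigit() else variables[token]

    lines = sorted(program.strip().splitlines(),
                   key=lambda text: int(text.split()[0]))
    for line in lines:
        words = line.split()[1:]          # drop the line number
        if words[0] == "end":
            break
        if words[0] == "print":
            output.append(value(words[1]))
        elif len(words) >= 3 and words[1] == "=":
            result = value(words[2])
            if len(words) == 5 and words[3] == "+":
                result += value(words[4])
            variables[words[0]] = result
        else:
            raise SyntaxError(f"cannot interpret: {line}")
    return output

print(run_basic(PROGRAM))  # [25]
```

Note that nothing is translated: each line is analyzed and then carried out immediately.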
 Advantages of interpreters: immediate response (the
results are seen immediately) and flexibility (for example,
an array can be expanded into a matrix while the program is running).
 Disadvantage: low speed of execution. Example: loops:
10 for i = 1 to 100
20 x(i) = y(i) + z(i)
30 next i
BASIC will have to analyze statement number 20 a hundred
times over in the course of the loop.
 However, analyzing a statement takes less time than the full
translation process.
 The importance of the interpretation approach comes in the
software development process.
The Environment of the Compiler: all the system
software we need to develop a new program or
application (a software Integrated Development
Environment (IDE)).
 The text editor is the software that we use to create the source
program file, so the text editor is part of the compiler
environment.
 The preprocessor is software that cleans up the source code
and prepares it for use by the compiler, so the preprocessor
is also part of the compiler environment.
 The object file produced by the compiler is not ready to run.
For example, if your program contains a statement like
y := sqrt(x); then that square root (like logs,
character-string operations, input-output handling,
dynamic-memory operations, and external functions) has to be computed.
These functions are provided in a run-time library (a
collection of object modules for computing these functions).
 Hence, we need another step, in which all the required
run-time library services are identified and loaded into memory
along with the object module for the program. This process
of loading and inserting addresses is known as linking.
 The linker generates an executable program.
 The compiler environment refers to where the compiler fits into
the overall process of writing and executing a program
(the Integrated Development Environment, IDE).
[Figure: Source Program (from text editor) → Preprocessor → New Source Program → Compiler → Object Program → Linker (together with the Run-Time Library) → Relocatable Code → Loader → Executable Code]
The linker produces the relocatable code, whose
addresses start at location zero (addresses relative
to zero), while the loader produces the executable
code, which starts at location X inside the memory
(actual address code).
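The loader's job of turning zero-based addresses into actual addresses can be pictured with a short Python sketch. The instruction names and the tuple layout below are hypothetical and chosen only for illustration; real object formats carry explicit relocation information.

```python
# Relocatable code: addresses are relative to location zero (hypothetical format).
RELOCATABLE = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", None)]

def load(relocatable, base):
    """Return code whose address operands are shifted by the load address."""
    absolute = []
    for opcode, address in relocatable:
        if address is not None:
            address += base        # zero-based address -> actual address
        absolute.append((opcode, address))
    return absolute

print(load(RELOCATABLE, 1000))
```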
Phases of Compiler
 The compiler is composed of two major parts:
1. Analysis part: which includes the first four phases.
2. Synthesis part: which includes the rest of the phases.
[Figure: Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Object Code Generator → Object Program, with the Symbol Table Handler and the Error Handler alongside the phases]
Symbol Table: a data structure used to store information
about variables.
Error Handler: is a part of the compiler code that displays
the compilation errors.
Lexical Analysis (Scanning)
 The lexical analyzer (scanner) breaks the source code up
into meaningful units (tokens). The process is sometimes
called tokenizing (its output is represented as a list of tokens).
 Example
Source:
for i := 1 to max do
x[i] := 0;
Analysis:
Keywords (kw): for, to, do
Identifiers (id): i, max, x
Constants (c): 1, 0
Operators (op): :=
Punctuation (p): ;
Brackets (b): [, ]
Output:
Token: for  i    :=  1  to  max  do  x    [  i    ]  :=  0  ;
Class: kw   id1  op  c  kw  id2  kw  id3  b  id1  b  op  c  p
 Functions:
1. Tokenizing: The programmer is required to
separate many parts of a statement with blanks or
tab characters (white space); this white space makes it
easier for the compiler to determine where one token
ends and the next begins.
2. Identifier decoding: Replaces the actual variable
names with their internal representations
within the symbol table.
Example:
Source:
Distance := rate * time;
Output:
id1 := id2 * id3;
3. Removing excess white space: The scanner may remove
excess white space or, in some cases, all of it.
4. Identifying comments: Comments are recognized and removed.
5. Case conversion: For case-insensitive languages,
letters are converted to a uniform case.
6. Identifying string values (for example, "example"):
Constant data, such as string constants or numeric
constants, are kept in the symbol table under
specific code names for future use.
7. Interpretation of compiler directives: Instructions to
the compiler, not to the machine (for example,
#include <iostream>).
8. Communication with the symbol table handler: The
scanner builds the symbol table by collecting the
variables one by one while tokenizing and inserting them
into the symbol table. At the end of the lexical analysis
phase the symbol table is built and ready to be
used by the next phases.
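Several of the functions listed above (tokenizing, identifier decoding, white-space removal, case conversion, and filling the symbol table) can be sketched together in Python. This is a minimal illustration under stated assumptions: the token classes follow the kw/id/c/op/p/b abbreviations of the example, only the keywords for, to and do are known, and the names tokenize, KEYWORDS and TOKEN_RE are invented for the sketch.

```python
import re

KEYWORDS = {"for", "to", "do"}
# One alternative per token class: constant, word, operator, punctuation, bracket.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(:=|[-+*/])|(;)|([\[\]]))")

def tokenize(source):
    """Return (tokens, symbol_table) for a small Pascal-like fragment."""
    symbol_table = {}      # identifier name -> internal name (id1, id2, ...)
    tokens = []
    source = source.rstrip()
    position = 0
    while position < len(source):
        match = TOKEN_RE.match(source, position)   # leading white space is skipped
        if match is None:
            raise SyntaxError(f"unexpected character at position {position}")
        constant, word, operator, punctuation, bracket = match.groups()
        if constant:
            tokens.append(("c", constant))
        elif word:
            word = word.lower()                    # case conversion
            if word in KEYWORDS:
                tokens.append(("kw", word))
            else:
                if word not in symbol_table:       # first occurrence: new entry
                    symbol_table[word] = f"id{len(symbol_table) + 1}"
                tokens.append(("id", symbol_table[word]))
        elif operator:
            tokens.append(("op", operator))
        elif punctuation:
            tokens.append(("p", punctuation))
        else:
            tokens.append(("b", bracket))
        position = match.end()
    return tokens, symbol_table

tokens, table = tokenize("for i := 1 to max do\nx[i] := 0;")
print([kind for kind, _ in tokens])
print(table)
```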
Syntactic Analysis (Parsing)
 The syntactic analyzer (parser) determines the structure of
the program and of the individual statements and detects
syntax errors in statements.
Example:
for i := 1 to max do
x[i] := 0;
Analysis:
The structure:
for loop
Loop counter: i
Limits of loop: 1, max
Body of the loop: single assignment statement
 The term parsing comes from linguistics and draws
heavily on generative grammars. Programming languages
can also be described by grammars, and the design of a
parser takes the grammar as a starting point. Parse trees
are an important representation of statements in
programming languages. The output of the parser is a
parse tree.
Example for the assignment statement grammar:
S → id := E ;
E → E * E | E + E | E - E | E / E | id | const
Example:
Source of the parser: id1 := id2 * id3;
Output (parse tree):
S
  id1
  :=
  E
    E
      id (id2)
    *
    E
      id (id3)
  ;
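A parser for the assignment grammar can be sketched as a small recursive-descent routine in Python. Two assumptions are worth stating: the grammar as written is ambiguous, so the sketch gives * and / higher precedence than + and - (the usual convention, not something the grammar itself says), and it returns a compact tree of nested tuples rather than the full parse tree with every E node. The name parse_assignment is made up for the illustration.

```python
def parse_assignment(tokens):
    """Parse  id := E ;  from a list of token strings into nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expect(expected):
        token = advance()
        if token != expected:
            raise SyntaxError(f"expected {expected!r}, found {token!r}")

    def operand():                      # E -> id | const
        token = advance()
        if token is None or not token.isalnum():
            raise SyntaxError(f"expected id or const, found {token!r}")
        return token

    def term():                         # handles * and /
        node = operand()
        while peek() in ("*", "/"):
            node = (advance(), node, operand())
        return node

    def expression():                   # handles + and -
        node = term()
        while peek() in ("+", "-"):
            node = (advance(), node, term())
        return node

    target = operand()
    expect(":=")
    tree = (":=", target, expression())
    expect(";")
    if peek() is not None:
        raise SyntaxError("unexpected tokens after ';'")
    return tree

print(parse_assignment(["id1", ":=", "id2", "*", "id3", ";"]))
```

A statement with a missing operand or a missing semicolon raises SyntaxError, which corresponds to the parser's role of detecting syntax errors.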
Semantic Analysis
 The semantic analyzer ensures that valid program
statements conform to the meaning constraints (semantic
rules), such as type matching and scope rules.
 Functions:
 Declarations and scope rules. For the declarations
this phase checks whether the variables and
identifiers used in the program have
already been declared. Scope rules mean that this
phase checks whether each variable is used within its
proper scope.
Ex1: Declaration:
int main( )
{ int x = 5, y =10;
z = x + y; // z is undeclared identifier
cout << z;
return 0; }
Ex2: Scope rules:
int main( )
{
f1( ); // error (scope rules): f1 is used before it is declared
return 0; }
void f1 ( )
{ cout << "Welcome"; }
 Type checking
Ex1: Type checking:
int main( )
{ int x = 5, z;
float y =10.897;
z = x + y; // type mismatch: float value assigned to an int
cout << z;
return 0; }
 Storage allocation
Ex: In Pascal:
Var a:array [ 1..10] of integer;
a[11] := 30; { out of range }
 Intended meaning of overloaded (more than one
use) operators
Ex: In C++
int *i; // * means pointer
int x = 6, y = 5, z;
z = x * y; // * means multiplication
i = &z;
cout<< *i; // dereferencing operator
 Automatic type conversion
Ex:
int x=5;
float y;
y = x + 10; // type conversion from int into float
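Two of the checks above, undeclared identifiers and type mismatch, can be sketched in Python over a simplified statement list. The statement format (tuples tagged "declare" or "assign") and the function name check are assumptions made for this illustration; a real semantic analyzer works on the parse tree and the symbol table.

```python
def check(statements):
    """Return a list of semantic error messages.

    statements: ('declare', name, type) or ('assign', target, [operand names]).
    """
    declared, errors = {}, []
    for line_number, statement in enumerate(statements, start=1):
        if statement[0] == "declare":
            _, name, type_name = statement
            declared[name] = type_name
            continue
        _, target, operands = statement
        missing = [name for name in [target] + operands if name not in declared]
        for name in missing:
            errors.append(f"line {line_number}: '{name}' is an undeclared identifier")
        if not missing and declared[target] == "int" and any(
                declared[name] == "float" for name in operands):
            errors.append(f"line {line_number}: type mismatch: "
                          f"float value assigned to int '{target}'")
    return errors

# Mirrors the declaration example: z is used without being declared.
print(check([("declare", "x", "int"), ("declare", "y", "int"),
             ("assign", "z", ["x", "y"])]))
```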
Intermediate Code Generation
 The intermediate code generator creates an internal
representation of the program that reflects the information
uncovered by the parser.
 Every high-level language statement corresponds to
several machine-language instructions, so the statement
has to be broken down into small pieces corresponding to
these instructions. This is done in intermediate code
generation.
 The intermediate code is code at a level between the
high-level form and machine language. It is a form in which the
small pieces corresponding to machine instructions are
visible, but which is not yet machine language or even
assembly language. This is because optimization is still to
be done (the reason an intermediate language is
needed).
 The most widely used representation for intermediate code
is three-address code (3AC). It is called three-address code
because each instruction in that language must not
contain more than three addresses (variables or values),
and takes the following form:
result := operand operator operand
 Example:
Source:
x := a*y + z;
Output:
T1 := a * y
T2 := T1 + z
x := T2
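The translation shown in this example can be sketched in Python as a walk over an expression tree that emits one three-address instruction per operator. The tree format (nested tuples of the form (operator, left, right)) and the function name to_three_address are assumptions of the sketch.

```python
def to_three_address(tree):
    """tree: (':=', target, expr); expr is a name or (op, left, right)."""
    code = []

    def emit(node):
        if isinstance(node, str):          # a variable or constant: nothing to emit
            return node
        operator, left, right = node
        left_name = emit(left)             # operands first, so their temporaries exist
        right_name = emit(right)
        temp = f"T{len(code) + 1}"         # one new temporary per operator
        code.append(f"{temp} := {left_name} {operator} {right_name}")
        return temp

    _, target, expression = tree
    result = emit(expression)
    code.append(f"{target} := {result}")
    return code

for line in to_three_address((":=", "x", ("+", ("*", "a", "y"), "z"))):
    print(line)
```

For x := a*y + z this reproduces the three lines shown above.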
Code Optimization
 Code optimization (code enhancement) is the process of
identifying and removing redundant operations from the
intermediate code to make the code more efficient
(efficiency depends on code size and CPU execution time).
 Example:
Source: The intermediate code given above
Output:
T1 := a * y
x := T1 + z
 Example 2: Convert the following statement into 3AC
and then optimize it.
Source: x = (a+b) * (a+b)
Output:
 Example 3: Convert the following statements into 3AC
and then optimize them.
Source: a = a/2 * (b+5)
c = a*2
d = b+5
c = c-d
e = a/2 * (b+5)
f = e+c
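One common optimization that applies to statements like x = (a+b) * (a+b) is common-subexpression elimination: if the same expression has already been computed and its operands have not changed, the earlier result is reused. The Python sketch below is a minimal version for straight-line code. It assumes that instructions are tuples (dest, left, op, right), that names starting with "T" are compiler temporaries assigned only once, and the function name is invented here.

```python
def eliminate_common_subexpressions(code):
    """code: list of (dest, left, op, right) three-address instructions."""
    available = {}   # (left, op, right) -> name currently holding that value
    replaced = {}    # removed temporary -> name used in its place
    result = []
    for dest, left, op, right in code:
        left = replaced.get(left, left)
        right = replaced.get(right, right)
        key = (left, op, right)
        if key in available and dest.startswith("T"):
            replaced[dest] = available[key]        # reuse the earlier result
            continue
        result.append((dest, left, op, right))
        # Assigning to dest invalidates expressions that read or are held in dest.
        available = {k: v for k, v in available.items()
                     if dest not in (k[0], k[2]) and v != dest}
        if dest not in (left, right):
            available[key] = dest
    return result

unoptimized = [("T1", "a", "+", "b"), ("T2", "a", "+", "b"), ("x", "T1", "*", "T2")]
for dest, left, op, right in eliminate_common_subexpressions(unoptimized):
    print(f"{dest} := {left} {op} {right}")
```

The invalidation step matters for sequences such as the second exercise, where a is reassigned between two textually identical expressions and the second one must therefore be recomputed.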
Object Code Generation
 The object code generator translates the optimized
intermediate program into the language of the target
machine.
 Example: (IBM System/370)
Source: The intermediate code given above
Output:
LE  4, A    Load A in floating-point register 4
ME  4, Y    Multiply by Y
AE  4, Z    Add Z
STE 4, X    Store in X
 Whereas the other phases are language dependent, this
phase is machine dependent.
 This phase is one of the most difficult parts of compiler
writing, because it is machine dependent.
 The questions to be considered in this phase are mostly
associated with the order in which machine instructions
are to be generated and how the machine’s registers are to
be used.
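For the simple case in the example, where each intermediate instruction continues from the result of the previous one, object code generation can be sketched in Python as follows. This is a deliberately narrow illustration: it uses only the mnemonics that appear in the example (LE, ME, AE, STE), keeps everything in one register, and refuses anything that is not such a chain. The function name generate and the tuple format are assumptions.

```python
OPCODES = {"*": "ME", "+": "AE"}     # mnemonics taken from the example above

def generate(code, register=4):
    """code: chain of (dest, left, op, right); returns assembly lines."""
    if not code:
        return []
    lines = []
    previous = None
    for dest, left, op, right in code:
        if previous is None:
            lines.append(f"LE {register}, {left.upper()}")    # load first operand
        elif left != previous:
            raise NotImplementedError("this sketch only handles a chain of operations")
        lines.append(f"{OPCODES[op]} {register}, {right.upper()}")
        previous = dest               # the result stays in the register
    lines.append(f"STE {register}, {previous.upper()}")
    return lines

for line in generate([("T1", "a", "*", "y"), ("x", "T1", "+", "z")]):
    print(line)
```

Because T1 lives only in the register, it is never stored; deciding such things in general is the register-usage question mentioned above.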
Compiler Passes

A pass consists of reading a version of the program from a
file and writing a new version of it to an output file. A
pass normally comprises more than one phase.
 The compiler makes one or more passes through the
program.
 Single-pass compilers tend to be fastest.
 Multiple-pass compilers are necessary for two
reasons:
1. Certain questions raised earlier in the program
may remain unanswered until the rest of the
program has been read (for example, forward references).
2. There may not be enough memory available to
hold all the intermediate results obtained in the
course of compilation.
 On each pass a new portion of the compiler may be loaded
into memory, overwriting the portions whose tasks are
completed.
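The forward-reference problem can be made concrete with a small Python sketch that contrasts a single pass with two passes. The program representation (a list of ("def", name) and ("call", name) pairs) and both function names are hypothetical, chosen only to show why the second pass helps.

```python
def single_pass(program):
    """Report calls to names not yet defined at the point of the call."""
    defined, errors = set(), []
    for kind, name in program:
        if kind == "def":
            defined.add(name)
        elif name not in defined:
            errors.append(name)
    return errors

def two_pass(program):
    """Pass 1 collects every definition; pass 2 checks the calls."""
    defined = {name for kind, name in program if kind == "def"}
    return [name for kind, name in program
            if kind == "call" and name not in defined]

# f1 is called before it is defined (a forward reference); g is never defined.
program = [("def", "main"), ("call", "f1"), ("call", "g"), ("def", "f1")]
print(single_pass(program))  # ['f1', 'g']
print(two_pass(program))     # ['g']
```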
Two-Pass Compiler (Front End and Back End):
 Since lexical analysis, syntax analysis and intermediate
code generation are closely related, the parser is sometimes put
in charge of these phases (in the driver’s seat). These three
phases, together with some of the optimization phase, are
called the front end of the compiler (programming
language-dependent), while the rest are called the
back end of the compiler (machine-dependent).
System Support
Symbol table handler
 The symbol table is maintained by a procedure known as
the symbol-table handler.
 The symbol table handler builds and maintains a symbol
table based on the defining occurrences and scopes.
 The symbol table is the central repository of information
about identifiers created by the programmer. For each, this
table contains its name and attributes and various other
information. It might take the following form:
Id Name   Type   Value   Scope   Location   Line#
X         int    20      S1      main       20
Example:
int x = 20;               // definition statement (insert x into the symbol table)
if (x <= 5) cout << "ok"; // usage statement (check x in the symbol table)
 In strongly-typed languages, where everything must be
declared by the programmer before use, the lexical
analyzer and the parser, working together, must make sure
that:
 every declared identifier (defining occurrence) is
entered in the symbol table, and
 every identifier used subsequently (applied
occurrence) has been declared.
Example:
X := y + z - 5
T1 := y + z    // T1 and T2 are not user-defined variables;
T2 := T1 - 5   // they are introduced by the intermediate code generator
X := T2
 One of the symbol table handler’s functions is to provide
temporary variables for the intermediate-code generator
upon demand and to add them to the symbol table.
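The duties of the symbol table handler described above (entering declarations, checking later uses, and handing out temporaries) can be sketched as a small Python class. The class and method names, and the choice of attributes stored per entry, are assumptions for illustration only.

```python
class SymbolTable:
    """A minimal symbol table handler: declare, look up, and create temporaries."""

    def __init__(self):
        self.entries = {}
        self.temp_count = 0

    def declare(self, name, type_name, scope, line):
        """Defining occurrence: enter the identifier and its attributes."""
        if name in self.entries:
            raise KeyError(f"'{name}' is already declared")
        self.entries[name] = {"type": type_name, "scope": scope, "line": line}

    def lookup(self, name):
        """Applied occurrence: the identifier must already be in the table."""
        if name not in self.entries:
            raise KeyError(f"'{name}' is undeclared")
        return self.entries[name]

    def new_temp(self, type_name):
        """Provide a temporary (T1, T2, ...) for the intermediate-code generator."""
        self.temp_count += 1
        name = f"T{self.temp_count}"
        self.entries[name] = {"type": type_name, "scope": "temporary", "line": None}
        return name

table = SymbolTable()
table.declare("x", "int", "main", 20)
print(table.lookup("x")["type"])                     # int
print(table.new_temp("int"), table.new_temp("int"))  # T1 T2
```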
Error Handler
 Error handling implements the compiler's response to
errors in the code it is compiling.
 When an error is detected, the error handler must tell the
user about it (what kind of error it is and where it occurred).
It might also apply some makeshift fix-up (patching the error
during compilation) in order to
enable the compiler to continue through the rest of the
program and find all the errors.
 The way errors are handled depends in part on how the
compiler is intended to be used (e.g., teaching purposes, or
Pascal’s integrated development environment).
Writing a Compiler:
 Which programming language is suitable for
writing the compiler itself as a program?
Old Compilers Written in Assembly Language:
 Writing the compiler:
[Figure: Source for FORTRAN compiler (in assembly) → Assembler → FORTRAN Compiler Object Module → Linker → Executable FORTRAN Compiler]
Using the compiler:
[Figure: User's source code → FORTRAN Compiler → User's Object Module → Linker → User's Executable Program]
Using High-Level Languages
 If you have a compiler for C and wish to write a compiler for
PL/I, you can write the PL/I compiler in C.
 Writing the compiler:
[Figure: Source for PL/I compiler (in C) → C Compiler → PL/I Compiler Object Module → Linker → Executable PL/I Compiler]
Using the compiler:
[Figure: User's source code (in PL/I) → PL/I Compiler → User's Object Module → Linker → User's Executable Program]
Using the Same Language:
 In some cases, a compiler may be written in its own
language. There are two ways of doing this:
I. Bootstrapping Compilers: Suppose you require a
compiler for C. You start off by writing a minimal compiler
in assembly language. This compiler supports only those
operations needed in writing a compiler. Then, in a second
step, you write a full compiler in C, using only those
features of the language supported by the minimal
compiler. You then compile the full compiler using the
minimal compiler. Bootstrapping is also used for producing
new compiler versions for the same target machine.

Writing the minimal compiler (first step):
[Figure: Source for Minimal C Compiler (in assembly) → Assembler → Minimal Compiler Object Module → Linker → Executable Minimal C Compiler]
Writing the full compiler (second step):
[Figure: Source for Full C Compiler (in the minimal C language) → Minimal C Compiler → Full C Compiler Object Module → Linker → Executable Full C Compiler]
Using the compiler:
[Figure: User's source code (Full C) → Full C Compiler → User's Object Module → Linker → User's Executable Program]
II. Cross Compilers: Suppose you have a satisfactory C
compiler on the VAX and wish to write a C compiler for the
Macintosh. You can write the compiler in C and run it on the
VAX as a cross compiler (it runs on the VAX and produces object
code for the Macintosh).

Writing the cross compiler (on VAX):
[Figure: Source for Mac C Compiler (in C) → C Compiler → Mac C Compiler Object Module → Linker → Mac C Compiler (for VAX)]
The compiler we get from this step is the cross-compiler.
To get a compiler we can actually use on the Macintosh, we
feed the very same source code into the cross-compiler.
 Using the Cross-Compiler:
[Figure: Source for Mac C Compiler (in C) → Mac C Cross Compiler → Mac C Compiler Object Module → Linker → Mac C Compiler (for Mac)]
Using the compiler (on Macintosh):
[Figure: User's source code (in C) → Mac C Compiler → User's Object Module → Linker → User's Executable Program]
 A useful tool for many of these approaches is software for
generating parts of the compiler. These tools are sometimes
called “Compiler Compilers”. Lexical analysis and syntax
analysis are supported by these tools (e.g., Lex and Yacc).
 Lex constructs tables for lexical scanners; the user gives it
definitions of tokens and Lex returns the tables needed for
the scanner.
 Yacc (Yet Another Compiler Compiler) accepts a grammar
for the programming language and generates an LR parser
(Left-to-right scan, Rightmost derivation in reverse) for the
language.
Retargetable Compilers
 A compiler that can be modified to be used with a new
target machine is said to be “retargetable”.
 Approaches to building retargetable compilers:
 The cross-compiler approach (was explained).
 The front-end back-end approach.
 Writing a compiler for an imaginary machine: The
imaginary machine was a stack-based machine whose
language was known as p-code. The entire compiler
was written in p-code, and the compiler compiled
Pascal to p-code. To install this system on a given
machine, you wrote a p-code interpreter for the
machine. The result, after installation of the p-code
interpreter, was a stack-based virtual computer whose
machine language appeared to be p-code.
(Examples: p-code for Pascal; the UCSD p-System, from the
University of California, San Diego (UCSD) Institute
for Information Systems, which developed UCSD Pascal; and
the Java virtual machine.)
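The idea of installing a system by writing an interpreter for an imaginary stack machine can be sketched in Python. The instruction names below (load, push_const, add, mul, store) are hypothetical and are not the actual p-code or Java bytecode instruction sets; the sketch only shows the stack-based style of execution.

```python
def run_stack_machine(program, variables):
    """Interpret a list of stack-machine instructions; return the variables."""
    stack = []
    for operation, *arguments in program:
        if operation == "push_const":
            stack.append(arguments[0])
        elif operation == "load":
            stack.append(variables[arguments[0]])
        elif operation == "store":
            variables[arguments[0]] = stack.pop()
        elif operation in ("add", "mul"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if operation == "add" else left * right)
        else:
            raise ValueError(f"unknown instruction: {operation}")
    return variables

# x := a * y + z, expressed for the imaginary machine.
program = [("load", "a"), ("load", "y"), ("mul",),
           ("load", "z"), ("add",), ("store", "x")]
print(run_stack_machine(program, {"a": 2, "y": 3, "z": 4})["x"])  # 10
```

Porting such a system to a new machine then means rewriting only this interpreter, not the compiler that produces the instructions.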