Download Assemblers, linkers, and loaders Assemblers

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Assemblers, linkers, and loaders
• You need to know
– A little more about how an assembler works
• Standard “two pass” assembler generating code
• “Lexical Analysis”
Assemblers, linkers, and loaders
– “lexical analysis is the process of converting a sequence of characters into a sequence of tokens, i.e. meaningful character strings”
» In an assembler, you are reading text and identifying instruction mnemonics, directives, octal numbers, …
» Compilers also start with lexical analysis – but have a much richer vocabulary of “meaningful character strings”
Mostly an introduction to how an assembler works
– A few more details of linking and loading
• Quite a lot already covered indirectly when considering .a archives and .so libraries in last section
1
2
Assembler
Absolute assembler
• That assembler added to the PDP‐11 simulator was an “absolute assembler”
• You used a simple assembler in those PDP‐11 exercises
Magic?
; Program to copy and determine length of string
.origin 1000
001000 012701
start: mov #msga, r1
001002 001024
001004 012702
mov #msgb, r2
001006 001076
001010 005000
clr r0
001012 112122
l1: movb (r1)+, (r2)+
001014 001402
beq done
001016 005200
inc r0
001020 000774
br l1
001022 000000
done: halt
msga: .string "A string whose length is to be determined"
001024 020101
001026 072163
…
001074 000144
msgb: .string "Different string that should get overwritten"
001076 064504
001100 063146
…
001150 067145
001152 000000
.end start
Source text of program
– Which means that it works with explicit addresses for where code and variables go.
– Working with explicit addresses makes things simpler – but is very limiting
• It really is only appropriate when
1.
2.
Coding for a simple machine with no operating system (like the PDP‐11 in the simulator)
The entire program exists as a single source file – no use of libraries
• Will start by looking at how an absolute assembler works
– Subsequently look at a “relocatable assembler” that allows for use of libraries and OS facilities.
“Assembled” code
3
What did the “absolute assembler” do?
How is the assembler performing magic working?
• Read a text file
• It starts with a “symbol table” that has entries with details of instruction mnemonics
– Text!
• Things like
– Comments
– Assembler directives (e.g. the origin directive that set the start address for the subsequent group of instructions or data definitions)
– Mnemonic instruction names – mov, beq, …
– Combinations of register names, r1, and addressing mode tags – r1+, (r1), @(r1), …
– Names invented for variables, label names on instructions
• But as seen in the input they are all simply character sequences
• Perform magic!
• Output a representation of the bit patterns, representing instructions or initialized data variables, that are to be placed in specified memory locations
5
nabg
4
– mov
• “2 address instruction”
• Bit pattern 0001 ssssss dddddd – a 4‐bit op‐code, 6‐bits for addressing mode of source address, 6‐bits for addressing mode of destination
• May use one or two following words in memory to hold address data
– br
• “instruction with branch offset bits”
• Bit pattern 00000001 oooooooo – an 8‐bit op‐code, 8‐bits for an offset address
• Will require a name of a label as next input element
– inc
• “1 address instruction”
• Bit pattern 0000101001– an 10‐bit op‐code, 6‐bits for addressing mode of destination
• Will require addressing data as next element
6
1
How is the assembler performing magic working?
How is the assembler performing magic working?
• It has a set of built in processing rules
• It has a set of built in processing rules
1. Discard comments
2. Use origin directive to set a “location counter” whose value determines the address where the next instruction or data element is to go
3. Process other directives
• E.g .string
– Convert quoted string into a sequence of bytes (rembering to swap byte order!), and add 1 or 2 nul bytes;
– Work out how many bytes used, and update the location counter
4. Recognize instructions and check addressing modes
•
•
•
• .blkw
– Consume next element (should be an octal constant defining number of words to be left empty)
– Update location counter
E.g. clr r0
–
–
clr ‐ a one address instruction
r0 – mode and register information;
it’s register mode on register r0
This is a correctly formed instruction that with its data will occupy one‐word of memory, update location counter by 2
Could (but will not at this stage) generate the bit pattern
–
–
–
clr ‐ 10‐bit instruction format 0000101000
r0 – register mode 000, register r0 000
So instruction word would be 0000101000000000
7
Instructions referencing labels
How is the assembler performing magic working?
• Assembly code has labels on instructions and data elements
• Instructions reference these labels
• Bit pattern generated for the instruction will depend on address of referenced label
• It has a set of built in processing rules
4. Recognize instructions and check addressing modes
•
E.g. beq done
–
–
–
8
beq – an instruction that will need an 8‐bit offset ‐ the amount to branch
Value of offset will be difference in location value of “done” and current location counter
Ooops!
» Don’t know where done is!
» Haven’t met this symbol!
– beq done –
• need the offset, the number of bytes between current location counter and the still unknown address associated with label done
– mov #msgb, r2
• #msgb – immediate addressing, next word following the mov instruction is supposed to contain the address associated with label msgb – but that is still unknown!
9
Assembler must use 2 passes
10
Assembler must use 2 passes
• First pass
• Second pass
– Basically simple checking of input
– Generate the bit patterns that represent the code and data
• Are all instructions and addressing modes appropriate
• Keep track of location counter
– So have to carefully consider addressing modes of each instruction along with space allocation by directives like .string.
– Recognize labels and add them to symbol table
start: mov #msg1, r1
• Add entry: symbol type = “label”, name = “start”, value=1000
l1: movb (r1)+, (r2)+
• Add entry: symbol type = “label”, name = “l1”, value=1012
11
nabg
12
2
Line‐by‐line processing
Magic: step‐by‐step
• The example program
• Of course, assembly language code is much simpler than code in a high level language
• Each line of code can be processed in isolation –
don’t have to maintain any context (apart from location counter)
– (Think about a high level language with if … then … elsif … then … elsif
… then … else … constructs that span multiple lines – need to remember that have to match up if and else etc)
; Program to copy and determine length of string
.origin 1000
start: mov #msga, r1
mov #msgb, r2
clr r0
l1: movb (r1)+, (r2)+
beq done
inc r0
br l1
done: halt
msga: .string "A string whose length is to be determined"
msgb: .string "Different string that should get
overwritten"
.end start
13
14
Magic: step‐by‐step : Pass 1, step 1
; Program to copy and determine length of string
start: mov #msga, r1
• First non‐white space character is “;”
start:
– this is a comment;
– Discard entire line
.origin 1000
• This is an “origin” assembler directive
– Consume next input token which must be an octal number (not requiring a leading 0)
– Set a variable “location” to this value,
this “location” variable defines the address where the next instruction or data element will be placed
Processing rules: condition (matched lexical element) and action
Magic: step‐by‐step : Pass 1, step 2a
• This is a label, it will get referenced elsewhere in program;
it’s associated with current address location i.e. 1000
– Create an entry in symbol table
• Symbol: Type=label, Name=start, Value=1000
15
16
Magic: step‐by‐step : Pass 1, step 3
Magic: step‐by‐step : Pass 1, step 2b
start: mov #msga, r1
start: mov
mov #msgb, r2
mov
• mov is a symbol, •
– Look it up in symbol table
mov is a symbol, –
Symbol: –
–
–
– Type=instruction with 2 addresses, – Name=mov, – Value=0001 (4‐bit)
–
– Instruction symbol, that’s ok; check the two addresses
start: mov #msga,
Type=instruction with 2 addresses, Name=mov, Value=0001 (4‐bit)
Instruction symbol, that’s ok; check the two addresses
mov #msgb,
• #msga, that would be “immediate addressing mode” – there will be an address placed in the next word, remember to update the location counter appropriately
• r1, that would be “register addressing”
• Location counter will now be 1004
17
nabg
Look it up in symbol table
•
• Symbol: • #msgb, that would be “immediate addressing mode” – there will be an address placed in the next word, remember to update the location counter appropriately
• r2, that would be “register addressing”
• Location counter will now be 1010
18
3
Magic: step‐by‐step : Pass 1, step 4
clr r0
l1: movb (r1)+, (r2)+
l1:
clr
•
clr is a symbol, • This is a label, it will get referenced elsewhere in program;
it’s associated with current address location i.e. 1012
– Look it up in symbol table
•
Magic: step‐by‐step : Pass 1, step 5a
Symbol: – Type=instruction with 1 address, – Name=clr, – Value=0000101000 (10‐bit)
– Instruction symbol, that’s ok; check the one addresses
clr r0
– Create an entry in symbol table
• r0, that would be “register addressing”
• Location counter will now be 1012
• Symbol: Type=label, Name=l1, Value=1012
19
20
Magic: step‐by‐step : Pass 1, step 6
Magic: step‐by‐step : Pass 1, step 5b
l1: movb (r1)+, (r2)+
l1: movb
beq done
beq
• movb is a symbol, •
beq is a symbol, – Look it up in symbol table
– Look it up in symbol table
•
• Symbol: Symbol: – Type=instruction with offset, – Name=beq, – Value=00000011(8‐bit)
– Type=instruction with 2 addresses, – Name=movb, – Value=1001 (4‐bit)
– Instruction symbol, that’s ok; check the one addresses
– Instruction symbol, that’s ok; check the two addresses
l1: movb (r1)+, (r2)+
beq done
• done hopefully we will find this label before we have to generate code, but it looks ok for now
• Location counter will now be 1016
• (r1)+, that would be “auto increment mode” • (r2)+ that would be “auto increment mode” • Location counter will now be 1014
21
Magic: step‐by‐step : Pass 1, step 7
22
Magic: step‐by‐step : Pass 1, step 8
inc r0
br l1
inc
•
br
•
inc is a symbol, – Look it up in symbol table
•
br is a symbol, – Look it up in symbol table
•
Symbol: – Type=instruction with 1 address, – Name=inc, – Value=0000101010 (10‐bit)
– Type=instruction with offset, – Name=br, – Value=00000001(8‐bit)
– Instruction symbol, that’s ok; check the one addresses
– Instruction symbol, that’s ok; check the one addresses
inc r0
br l1
• r0, that would be “register addressing”
• Location counter will now be 1020
• l1 hopefully we will find this label before we have to generate code, but it looks ok for now (actually, we have already found it, but makes no difference)
• Location counter will now be 1022
23
nabg
Symbol: 24
4
Magic: step‐by‐step : Pass 1, step 9a
Magic: step‐by‐step : Pass 1, step 9b
done: halt
done:
done: halt
done: halt
• This is a label, it will get referenced elsewhere in program;
it’s associated with current address location i.e. 1022
• halt is a symbol, – Look it up in symbol table
• Symbol: – Type=instruction with 0 addresses, – Name=halt, – Value=0000000000000000 (16‐bit)
– Instruction symbol, that’s ok; no addresses to check
– Create an entry in symbol table
• Location counter will now be 1024
• Symbol: Type=label, Name=done, Value=1022
25
26
Magic: step‐by‐step : Pass 1, step 10a
Magic: step‐by‐step : Pass 1, step 10b
msga: .string "A string whose length is to be determined"
msga:
msga: .string "A string whose length is to be determined"
msga: .string
• This is a label, it will get referenced elsewhere in program;
it’s associated with current address location i.e. 1024
• String directive
– Create an entry in symbol table
– Require next element to be a quoted string
– Determine length of string
• If odd, add 1 to have null byte at end
• If even add 2 (to have null word at end)
– Update location counter, now 1076
• Symbol: Type=label, Name=msga, Value=1024
27
28
Magic: step‐by‐step : Pass 1, step 11a
Magic: step‐by‐step : Pass 1, step 11b
msgb: .string "Different string that should get overwritten"
msgb:
msgb: .string "Different string that should get overwritten"
msgb: .string
• This is a label, it will get referenced elsewhere in program;
it’s associated with current address location i.e. 1076
• String directive
– Create an entry in symbol table
– Require next element to be a quoted string
– Determine length of string
• If odd, add 1 to have null byte at end
• If even add 2 (to have null word at end)
• Symbol: Type=label, Name=msgb, Value=1076
29
nabg
30
5
Magic: step‐by‐step : Pass 2, step 1
Magic: step‐by‐step : Pass 1, step 12
.end start
; Program to copy and determine length of string
• End directive
• 1st pass complete
• First non‐white space character is “;”
– Some limited syntax checking done, all those input statements looked ok
– this is a comment;
– Discard entire line
.origin 1000
• This is an “origin” assembler directive
– Consume next input token which must be an octal number (not requiring a leading 0)
– Set a variable “location” to this value,
this “location” variable defines the address where the next instruction or data element will be placed
• (Would have stopped on an error such as an instruction requiring one address being given two address elements)
– Labels all defined and added to symbol table.
31
32
Magic: step‐by‐step : Pass 1, step 2a
Magic: step‐by‐step : Pass 2, step 2b
start: mov #msga, r1
start: mov #msga, r1
start: mov
start:
•
mov is a symbol, –
• This is a label, skip over
Look it up in symbol table
•
Symbol: –
–
–
Type=instruction with 2 addresses, Name=mov, Value=0001 (4‐bit)
– Start composing instruction word 0001ssssssdddddd
start: mov #msga,
•
•
•
•
•
#msga, that would be “immediate addressing mode” – that was mode 2 register 7 so ssssss becomes 010111
r1, that would be “register addressing”, mode 0 register 1 so dddddd becomes 0000001
Complete instruction word is now 0001010111000001, output into location 1000;
msga Look up value in symbol table for label named msga – it’s 1024; so output 1024 into location 1002
Location counter will now be 1004
33
34
Magic: step‐by‐step : Pass 2, step 6
Magic: step‐by‐step : Pass 2, steps 3, 4, 5
mov #msgb, r2
beq done
beq
• Similar to last line
•
clr r0
•
Symbol: – Type=instruction with offset, – Name=beq, – Value=00000011(8‐bit)
– 0000101000 dddddd
• Work out address mode and register, mode 0, register 1 – Instruction symbol, that’s ok; check the one addresses
– 0000101000 000001
beq done
• Store instruction word in next address
• done get value from symbol table, work out signed 8‐bit offset, combine with op‐code; store instruction in next address
• Location counter will now be 1016
l1: movb (r1)+, (r2)+
Similar
35
nabg
beq is a symbol, – Look it up in symbol table
• Look up clr in symbol table –
36
6
Magic: step‐by‐step : Pass 2, step 7
Magic: step‐by‐step : Pass 2, step 8
br l1
inc r0
br
inc
•
•
inc is a symbol, •
br is a symbol, – Look it up in symbol table
– Look it up in symbol table
•
Symbol: Symbol: – Type=instruction with offset, – Name=br, – Value=00000001(8‐bit)
– Type=instruction with 1 address, – Name=inc, – Value=0000101010 (10‐bit)
– Instruction symbol, that’s ok; check the one addresses
– Instruction symbol, that’s ok; check the one addresses
br l1
inc r0
• l1 another of these offset instructions, now can get address of l1 from symbol table so can work out how many bytes to jump back, this is signed offset
• Compose instruction word and store
• Location counter will now be 1022
• r0, that would be “register addressing”
• Compose instruction word and save in next address
• Location counter will now be 1020
37
Magic: step‐by‐step : Pass 2, step 9
38
Magic: step‐by‐step : Pass 2, step 10
msga: .string "A string whose length is to be determined"
msga:
done: halt
done:
• Ignore the label
• Process the string by copying bytes from string into successive memory locations,
remembering to swap the byte order before storing them!
Keep updating location counter as go along, and remember to zero out the nul byte (odd length strings) or nul word (even length strings)
• Ignore the label
done: halt
• Look up 0‐address halt instruction in symbol table and output appropriate bits
39
Magic: step‐by‐step : Pass 2, step 11
40
Magic: step‐by‐step : Pass 2, step 12
msgb: .string "Different string that should get overwritten"
.end start
• Similar processing to last string
• End directive
• 2nd pass complete
– Code all generated
– Record start address.
41
nabg
42
7
Writing an assembler
Pseudo‐code : 1
• Most of the code of an assembler program has to be crafted,
– it is inevitably very specific to the machine and its instruction set and the assembler directives that provide features such as string encoding
– PDP‐11 issues
function assemble() {
initializeSymbolTable(); // Loads predefined symbols
try {
pass1();
}
catch (errormessage) { /* error report and terminate */ … }
showSymbols(); // Print symbol table – useful reference for programmer debugging code
try {
pass2();
}
catch (errormessage) {/* error report and terminate */ … }
• Lots of different instruction formats and addressing modes
–
–
–
–
–
Branch instructions (8‐bit op‐code, 8‐bit offset)
“sob” (looping) instruction – a unique format applying only to this instruction
“register/address” instructions (e.g. MUL)
Single address instructions
Double address instructions
}
• Odd directives like that .string that had to re‐order bytes
43
44
Pseudo‐code : 2
Example input
• Pass‐1, processing code line by line
function pass1() {
stop = false;
phase = 1;
initializeInput();
getNextLine();
while (!stop) {
processElements1();
getNextLine();
}
}
loop: cmp r3,count
; while(r3<count) …
• Has:
– Label, “loop”
– Instruction cmp
• Need to check 2 address elements to determine whether following memory locations needed for address data
– cmp r3,r4
» That would be a single 2‐byte instruction
– cmp r3,count
» It’s going to need 4‐bytes, 2 for the instruction and 2 for the address of “count”
– cmp count,limit
» That would require 6‐bytes!
– Here it’s r3,count; so 4‐bytes
– Comment
45
46
Pseudo‐code : 3
Pseudo‐code : 4 : handling a label
function processElements1() {
function p1_process_label(anobj) {
var element = getNextElement();
badAddress(); // Trying to define a label – first check that have a current location defined!
// Add the label to the symbol table
var name = anobj.value;
var ndx = findSymbol(name); // Better not exist already!
Process the loop: label
if (ndx >= 0) { throw "Redefining label " + name; }
definition, putting an entry
var obj = new Object();
into symbol table.
obj.type = "label";
obj.name = name;
obj.value = currentaddress;
addSymbol(obj);
// Now process rest of line ‐
// meaningful things are some directives (e.g. .string), instructions, or nothing
// how about recursive call
processElements1();
if (element === null) return;
if (element.type === "comment") { return; }
else
if (element.type === "constant") { p1_process_constant(element); }
else
if (element.type === "directive") { p1_process_directive(element); }
else
if (element.type === "label") { p1_process_label(element); }
else
if (element.type === "number") { p1_process_comma_values(); }
else
if (element.type === "name") { p1_process_comma_values(); }
else
if (element.type === "instruction") { p1_process_instruction(element); }
}
}
loop: cmp r3,count ; while(r3<count) …
getNextElement() will return a “label” element consuming “loop:” from input
47
nabg
Recursion: call to getNextElement() in processElements1() will
find cmp – an element that is “instruction” type;
resulting in call to { p1_process_instruction(element); 48
8
Pseudo‐code : 5a : handling an instruction
Pseudo‐code : 5b
if (addrtype === "special") { // deal with instructions like rti, rts, trap, … that have special formats and which
// may require further analysis in pass 1
p1_handle_special(elem);
}
if (addrtype === "sob") {
// here treat as a simple two byte instruction, no processing
// in pass 2 will need to check register and offset
return;
}
// Remaining instructions must have address data
// Several possibilities like one‐address, register and address, two address
// consume all input that could be part of address specification
function p1_process_instruction(elem) {
badAddress();
var name = elem.value;
var ndx = findSymbol(name);
var instruction = getSymbol(ndx);
var addrtype = instruction.addrtype;
// Instructions are two or more bytes ‐ it is the "more" that is hard work
updateCurrentAddress(2);
if ((addrtype === "0addr") || (addrtype === "offset")) {
// Easy ‐ it was simply a two byte instruction, so have completed dealing with it
// Not checking any offset address part in this pass.
return;
}
loop: cmp r3,count ; while(r3<count) …
…
cmp will be recognized as a 2‐address instruction
var addrpart = grabAddressPart();
if (addrpart === "") { throw "Missing address part of instruction on line " + linenum; }
49
loop: cmp r3,count ; while(r3<count) …
The substring r3,count will be taken as the addrpart
Pseudo‐code : 5c
Pseudo‐code : 5d
loop: cmp r3,count
• Got to check those address parts in r3,count
to determine whether any extra words needed for address data
// And a safety check, input line should now be empty or a comment
var elem2 = getNextElement();
if (!emptyOrComment(elem2)) { throw "Extraneous data after instruction on line " + linenum; }
if (addrtype === "regaddr") p1_process_regaddr_instruction(addrpart);
else
if (addrtype === "addrreg") p1_process_addrreg_instruction(addrpart);
else
if (addrtype === "1addr") p1_process_1addr_instruction(addrpart);
else
function p1_process_2addr_instruction(addrpart) {
// Should have two address parts, separated by comma
var parts = addrpart.split(",");
// Process each individually
p1_process_1addr_instruction(parts[0]);
p1_process_1addr_instruction(parts[1]);
if (addrtype === "2addr") p1_process_2addr_instruction(addrpart);
else {
throw “Unrecognized address type " + addrtype;
}
}
}
loop: cmp r3,count ; while(r3<count) …
The substring ; while(r3<count) … will be recognized as a comment
51
Pseudo‐code : 5d :
Examining a single address part
52
Lots of special cases
• Series of tests for all possibilities allowed by assembly language syntax, including:
– Simple register, e.g. r0: no extra memory words
– Register indirect, e.g. (r1): no extra memory words
– Register indirect (!) @r1: no extra memory words
– Auto‐increment (r1)+: no extra memory words
– Immediate #msga: one extra memory word
– Indexed data(r0): one extra memory word
–…
53
nabg
50
• With all the different addressing modes to consider the code, for both Pass‐1 and Pass‐2, have to have lots of special cases to correctly determine locations and later generate code.
• Custom coding unavoidable.
54
9
getNextElement
• In contrast to rest of assembler with its highly customized code, the code for something like getNextElement() is relatively easy to standardize.
• What it has to do:
Lexical Analysis
getNextElement()
– Consume characters from input line, packing them into some other string variable
– Recognize what type of element they are from the characters read
Used when processing lines (in both Pass 1 and Pass 2)
Returns “label” | “instruction” | “comment” | “directive” …
• Possibly identified by 1st character
• Possibly identified by last character
• …
55
56
Recognising a comment
Recognising a directive
•
•
•
•
• Skip any leading white space characters
• Find “;” – know that dealing with comment
• Consume all characters until end‐of‐line
Skip any leading white space characters
Find “.” – know that possibly dealing with directive
Consume all letters until first non‐letter
Check that the substring built matches a name in a table of directives
“a‐z”
Whitespace
“not \n”
Whitespace
Start
Start
Handling comment
;
Return
“ERROR”
Any
Handling possible
directive
.
Not found
Whitespace
\n
Check in list of directives
Return
comment
Return
“directive”
Found
57
Recognizing a name
as label‐declaration | instruction | label
Recognizing an octal number
Whitespace
Error
0‐7
58
Whitespace
Start
0‐7
any
(building number)
Start
(building name)
a‐z
Whitespace
Return label‐
declaration
Listed as
instruction
Return “instruction”
59
Not present in table
Whitespace
:
Return number
nabg
Error
a‐z0‐9_
any
Lookup in symbol table
Listed as known label
Return name
60
10
Finite state machines (almost)
Finite state machine
• Those element recognizers are variants on a “finite state machine”
• Wikipedia says:
– “A finite‐state machine (FSM) or finite‐state automaton, or simply a state machine, is a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition.”
– A computing device that can be in one of a number of states which makes transitions between states based on inputs
– Initial state:
• Input is space – stay in same state
• Input is 0‐7 – transition to state where try to recognize an octal number
• Input is . – transition to state where try to recognize a directive
• Input is ; ‐ transition to state where try to recognize a comment
• Input is a‐z – transition to state where try to recognize label‐
declaration, or instruction, or label name
• Input is anything else – oops an error
They aren’t true finite state machines as some of the states involve complex processing – such as checking whether a name is in the symbol
table – to make decision on next state. A true finite‐state‐machine works entirely on inputs determining state changes.
61
62
Or, for the more mathematically inclined, …
Creating the getNextElement() function
• Typically, someone writing an assembler program would have sketched out the complete state chart and used this to guide coding
Not \n
Consuming comment
;
Return comment
\n
0‐7
Whitespace
Building a number
whitespace
Return number
0‐7
other
Start
.
a‐z
63
Creating the getNextElement() function
• A high‐level pseudo‐code outline could then be composed (using conditionals and “gotos” that would easily convert to “cmp” and “br” or “jmp” instructions in the assembly language implementation)
Start: ch = getNextCharacter();
if(isWhiteSpace(ch))goto Start;
if(ch in 0-7) goto Number;
if(ch is “;”) goto Comment;
if(ch is “.”) goto Directive;
if(ch in a-z) goto Name;
goto Error;
Comment: ch = getNextCharacter();
if(ch not “\n”) goto Comment;
return Comment-symbol;
Number: ch = getNextCharacter();
if(ch in 0-7) goto Number;
if(isWhiteSpace(ch)) return Number-symbol;
goto Error;
65
(Assembler programs were generally written in assembly language)
nabg
a‐z
Return error
other
Possible directive
Not found
whitespace
Label‐declaration | instuction | name
Check in list of directives
etc
64
Return directive
Auto‐generated lexical scanner?
• Code for such a “lexical scanner” is highly stereotyped.
• It is relatively easy to write a program that takes as input a set of definitions of the expected input elements (octal‐
number, comment, directive, …) and generates the scanner code that recognizes elements.
• Lexical scanner generators were one of the first automated programming systems created –
– We briefly look at one (flex – the Linux version of lex from Unix (~1975)) in the next section on compiled languages
66
11
Lexical Scanner in an assembler
• Because lexical scanners can have a nice mathematical basis –
a finite state machine – they often get a disproportionate amount of attention in “Computer SCIENCE” presentations on assemblers
• The real hard work is in the custom code that deals with bizarre instruction formats and instructions that take 2‐, 4‐, 6‐, 8‐ or more bytes
Generated binary code
67
68
Assemblers and
Loaders
Assembler outputs generated binary
• The assembler that has been added to Schmidt’s PDP‐11 simulator places the generated bit‐patterns directly into the integer array that the simulator uses to represent the memory of the computer.
• In a real system, a binary output “file” (well, in the old days, a binary paper tape) would be generated containing address information and bit patterns that would later be used by the loader program that places the program in memory
Assembly – pass 1 “Create Symbol Table”
Assembler program
Your assembly language program source tape
Start
Loop
Lft
0200
0210
0216
Symbol table generated and
saved in memory
This area of memory typically used for data
by current program
Real loader program – can be overwritten
Rim boot loader – hopefully never overwritten
69
Assemblers and
Loaders
Assemblers and
Loaders
Assembly – pass 2 “Generate code”
Load your program
010000110101010
001101011101010
010101001010101
011101010101010
Assembler program
Your assembly language program source tape (read for 2nd time)
Start
Loop
Lft
0200
0210
0216
70
Computer Memory
Binary tape with bit
patterns for the instructions
making up your program
Binary tape with bit
patterns for the instructions
making up your program
Bit patterns for your instructions
placed in memory
This area of memory typically used for data
by current program
Real loader program – can be overwritten
Real loader program – can be overwritten
Rim boot loader – hopefully never overwritten
Rim boot loader – hopefully never overwritten
71
nabg
72
12
No fixed address
• The .origin directive, as used in the code for our assembly language exercises, fixes the memory locations where the code and data elements will be located.
• But on a computer with a real (multi‐tasking) operating system, the location for the program isn’t determined until execution requested.
• So, the generated binary files cannot include actual addresses – and the bit patterns for some of the instructions must remain incomplete until load time.
Relocatable code and linking with libraries
73
That example program again
Loading the relocatable file
• Suppose the start address wasn’t known, what could an assembler fill in?
Absolute code
(Note how branch instructions that use offsets won’t require any more work at load time – an advantage of ‘position independent code’)
74
012701
??????
012702
??????
050000
112122 Relocatable code
001402
005200
000774
000000
020101
072163
…
msga=start+24;
insert at start+2
msgb=start+76;
insert at start+6
75
• If a file were created with that relocatable code represented in some manner,
a “loader” could
012701
– Pick a start address where code to go
• E.g. 2400
– Copy the binary data into that location
and successive memory words
– Process the relocation data:
• msga must have value 2424
– Insert 2424 in 2402
• msgb must have value 2476
– Insert 2476 in 2406
– Code could then be executed
Relocatable binary file
Changes to the assembler
• Files with generated “relocatable” code will have such a structure
– Segment with the code –
• 0‐valued bytes filling in places where addresses have to be filled in by the loader
– Segment with relocation data –
• ~names of labels (labels on instructions as in loops, branches, subroutine calls; labels (names) of static variables)
• Their values as offsets from a start address that only the loader will know
• A set of values identifying the places where the address must be written when known (usually, variables and subroutines are referenced at multiple places in the code – so it’s a set of places);
again these locations would be defined as offsets from the start.
77
nabg
000000 → 0002424
012702
000000 → 0002476
050000
112122 Relocatable code
001402
005200
000774
000000
020101
072163
…
msga=start+24;
insert at start+2
msgb=start+76;
insert at start+6
76
• Pass‐1
– It no longer has actual addresses to work with
– Now, its “location” variable is not an address, it is a byte offset from the start of the code.
• Pass‐2
– Label:
• Don’t discard (as did previously); instead put an entry in the relocation data that is being generated separately from the code (and which will be appended to the binary code that is written to a file)
– Record in relocation data: label‐name, value offset from start
– Any address (address modes such as indexed – data(r1); immediate addressing #msg)
• Don’t attempt to insert an address, leave a zero word
• Add an entry in relocation data
– Record in relocation data : label‐referenced, value offset from start
– At end of pass‐2, append all relocation data to binary data as written to file
78
13
That example program again
But what about using libraries
Pass‐2
Output code to file
• Building a relocatable code file
(No .origin directive)
Pass‐1
Construct symbol table working with offsets
start @ offset 0
l1 @ offset 12
done @ offset 22
msga @ offset 24
msgb @ offset 76
(values in branch instructions can be determined
from differences in offsets of current location and
referenced location)
Append necessary relocation data
012701
??????
012702
??????
050000
Relocatable code
112122
001402
005200
000774
000000
020101
072163
…
msga=start+24;
insert at start+2
msgb=start+76;
79
insert at start+6
• Assume that we have a slightly more sophisticated environment on our PDP‐11 (it’s got the OS that you are going to write as an Xmas holiday exercise)
– There is a minimal OS kernel that provides ‘supervisor calls’ for things like reading or writing a byte
– There are a bunch of standard functions written in assembler (which utilize the OS’s supervisor calls where necessary)
• Readline – called with address of program buffer where input to be placed (use read‐svc)
• Writeline – called with address of message to be printed (use write‐svc)
• Itoa – called with integer value and address of buffer where ascii
representation to go
• Atoi – called with address of buffer with ascii string, returns integer
• …
80
Your code using the libraries
Simple way
• The simplest way to use the library functions was to append the source of the libraries onto your program text!
• The assembler would see both your code and the library functions, and calls to library functions would be handled just the same as calls to subroutines defined in your code.
• Assembler could build either an absolute binary file (if permitted and origin directives set) or a relocatable file which would be sorted out by the loader
• So, you are writing your program using the libraries –
start: call writeline
msg1
call readline
buff1
call atoi
buff1
…
• How is the assembler to deal with those calls to the library functions?
– All subroutines from library would appear at offsets from the start of program, loader could determine their addresses and fill these address into code as needed.
• Obviously, not the most desirable approach.
– (It was used in the 1950s)
81
82
But if the source code is not appended …
Declaring “externals”
• In pass‐1, the assembler wouldn’t be troubled by the fact that “readline” and “writeline” were not defined, it would easily determine that “call writeline” would occupy two words – one for the instruction, one for the address of the writeline
subroutine – which was all it needed to do.
• Pass 2 would be troubled!
• Programmer had to identify the names of library functions so that assembler could process them correctly (rather than see them as errors)
• Assembly program source code would start with a set of “extern” declarations
– It looks up the name “writeline” and finds that this label has never been defined;
probably a programmer’s error, terminate.
extern
extern
extern
extern
readline
writeline
atoi
itoa
• Assembler would simply add these names to its symbol table as being labels for subroutines at unknown offsets
83
nabg
84
14
Resulting relocatable file –
will need to link with library binaries
The library code
• Your code would be assembled to give a relocatable file as before
• The relocatable files with the library code would have been created in much the same by assembling the library source.
• There is one small refinement needed –
– Extra entries in the relocation table
•
•
•
•
– The library code provides those subroutines like readline, and atoi –
msga – start+40, used at start+2,
count – start+70, used at start+6, start+22, start+30
writeline – unknown, used at start+10
…
• these names will appear as labels in the source of the library code
• These names are referenced as externals in your code and need to be handled by the linker
– Of course the library source code will have other labels –
• You must then “link” this relocatable file with relocatable files that contain the library code in order to get a relocatable executable with all the code that the load can then place in memory.
• labels on instructions referenced in branches or jumps, • entry points to private auxiliary subroutines, • names for static variables and constants belonging to the library
– These will need to be in the relocation data for the library code
– But we don’t want these (private implementation only) names getting confused with names of things in your program!
• Everybody (including you and the author of the library code) uses loop:, l1:, l2:, done:, temp:, …
85
More elaborate relocation data
86
Using libraries : the “code”
• Your code
• The information stored in the “relocation data” section of a binary file generated for a library (and actually for your code as well) needs to distinguish which names are “local” to that file and which are “global” and are to be used when linking multiple files
– Names that are “local” to files can appear in multiple files – the linker will not get them confused
– Names that are “global” must be unique (so don’t use the name of a subroutine in the library for one of your subroutines)
• Assembly language systems had various ways of allowing a programmer to identify which names were to be “global”
– One approach – declare them as “entry” points
extern writeline
extern readline
…
start: call init
…
loop: call writeline
msg1
…
…
init:mov count,r0
…
…
msg1: .string “…”
count: 400
…
• Library file
entry writeline
entry readline
…
writeline: mov (sp),r0
…
loop: cmp (r0),endline
…
…
…
endline: .string “\n”
…
87
Using libraries : the files with relocatable binary
Yours: Library: Code
004767
000000
…
…
Relocation data
Code
012600
…
…
Relocation data
local1 (a.k.a. “count”) @offset 40
used @offset 10, @offset 16
local2 (a.k.a. “msga”) @offset 66
used @offset 2
writeline @unknown-extern
used @offset 2
…
Linker builds a “relocatable executable”
• The linker will build a new file by appending the code sections, yours and that of each library you use, and constructs a new relocation table
local1 (a.k.a. “loop”) @offset 160
used @offset 10, @offset 16
local2 (a.k.a. “endline”) @offset 364
used @offset 2
writeline entry@offset 0
not referenced in this code
readline entry@offset 204
not referenced in this code
– Still need a relocation table, because we still don’t know the “start” address where the program is going to be loaded
– The linker works out new offsets relocation symbols for all appended files
• Say your code and data take up 640 bytes
• Library with writeline placed immediately after your code – so offset for writeline is now 640 rather than 0
…
89
nabg
88
90
15
Loader again
Relocatable executable
Code
004767
000000
…
…
012600
….
Relocation data
local1 (a.k.a. “count”) @offset 40
used @offset 10, @offset 16
local2 (a.k.a. “msga”) @offset 66
used @offset 2
writeline @offset 640
used @offset 2
Yours
I/O library
local1b (a.k.a.
used @offset
local2b(a.k.a.
used @offset
…
“loop”) @offset 1020
650, @offset 656
“endline”) @offset 1224
642
• Loader can work as before with combined “relocatable executable”
– Takes actual start address and uses relocation data entries to resolve final address – Insert these finalized addresses into actual code where needed
Both are included in the one binary file.
91
92
Mixing code and data?
• The approach described so far would have been used back in the 1970s
• One consequence –
– Code and data are intermixed
One more refinement
• Consider your assembly language code
–
–
–
–
You had your code
You might have some memory locations for integer constants
You would have had fixed message strings
You would have had memory locations for variables such as counts
• The library would be similar
– Code, then maybe space allocated for counts or for buffer arrays
• When simply appended together to form a “relocatable executable” you end up with
– Code, data, code, data, …
93
Bad idea!
More elaborate relocatable files
• Programmers have learnt by bitter experience that code and data should be kept separately
– Then if you do overrun an array, you don’t store data over code – data that might later be interpreted as an instruction sequence!
• As explained in the earlier section on operating systems, now expect to have different “segments”
– “text” segment for the code
– “static” segment for the kinds of data elements that we’ve been considering
– (and, of course, stack and heap segments)
95
nabg
94
• Relocatable binary files for code and libraries and relocatable executable files now have more complex structures with separate sections for code and data
• Assembler creates these when writing the file
– Code – output to code section along with relocation table for “entries” and local labels for loops etc
– Data – all the variables, strings etc kept aside and then written to a “data” segment in the relocatable file – followed by the relocation data for these elements
• Linker combines files when creating relocatable executable
– First combines all the code to get code section and its relocation data
– Then combines all the data to get static data section and its relocation data
96
16
ELF
ELF
• ELF ‐ Executable and Linkable Format
– Structure of relocatable code files and relocatable executables is pretty much standardised for Unix/Linux environments: Linux, Solaris, NetBSD, MINIX, ..., also Android and others
– .o files from C/C++ compilations
.a files with library code for static linking
and linked executables
all use “ELF” files
• The structure of an ELF file is a more sophisticated variant of the structure just suggested
– http://www.skyfree.org/linux/references/ELF_Format.pdf
– http://wiki.osdev.org/ELF
– https://refspecs.linuxbase.org/elf/gabi4+/ch4.intro.html
–
– “code” and relocation information
– “static data” and associated relocation information
97
nabg
• You may learn a little more about the structure and use of ELF files in CSCI212
• Lots of references on the web:
(I’ve never needed that detailed a view) 98
17