UNIX Internals
(with a focus on the Linux kernel)
Contents
Part I. UNIX Operating System
1. Introduction
2. Process Management
3. Memory Management
4. File System
5. Synchronization & IPC
6. I/O System (Device Driver)
Part II. Detailed Study: LINUX Kernel Internals
1. Where is everything?
System call Implementation
Device Driver using Module Programming
2. Linux internals
References
U. Vahalia, “UNIX Internals: The New Frontiers”, Prentice Hall, 1996.
H. M. Deitel, “Operating Systems”, 2nd edition, Addison-Wesley, 1990.
A. Silberschatz and P. Galvin, “Operating System Concepts”, 5th edition, Addison-Wesley, 1998.
M. Singhal and N. G. Shivaratri, “Advanced Concepts in Operating Systems”, McGraw-Hill, 1994.
M. J. Bach, “The Design of the UNIX Operating System”, Prentice Hall, 1986.
M. Beck et al., “Linux Kernel Internals”, 2nd edition, Addison-Wesley, 1997.
M. K. McKusick, K. Bostic, M. Karels and J. Quarterman, “The Design and Implementation of the 4.4BSD Operating System”, Addison-Wesley, 1996.
B. Goodheart and J. Cox, “The Magic Garden Explained”, Prentice Hall, 1994.
I. Introduction
What is UNIX Operating System?
Brief History
Kernel Architecture
Features of UNIX Operating System
What is UNIX Operating System?
X window
csh
vi
du
who
kernel
Network
Admin.
Package
wc
telnet
Hardware
ps
grep
a.out
sort
gcc
ls
What’s the similarity between an onion and UNIX?
RDBMS
What is UNIX Operating System? (Cont`)
User Programs
User Programs
Trap
User level
Libraries
Kernel level
System Call Interface
File System Management
Process
Management
Buffer Cache
IPC
Context
Device Drivers
Memory Management
Hardware Control (Interrupts handling, etc)
HW level
Hardware
(Source : The design of the UNIX OS)
What is UNIX Operating System? (Cont`)
UNIX Operating System is a Resource Manager
Physical Resource
CPU, Memory, Disk, Network…
Abstract Resource
process, thread, page, file, inode, message, security, …
UNIX Operating System is the Computing Environments
provide resources’ service to users
system call, API
abstraction is just a set of data structure in kernel level
Brief History
Before UNIX
Multics: 1965, AT&T (Bell Labs), General Electric, MIT
Epoch
1969, Ken Thompson, “Space Travel” on PDP-7
Dennis Ritchie
s5fs, ed, shell (ancestor of the Bourne shell)
1973: “The UNIX Time-Sharing System” (presented at SOSP; published in CACM in 1974)
BSD
Bill Joy, Chuck Haley (graduate students)
ex, csh, paging based virtual memory system, TCP/IP, ffs, socket
1993: 4.4BSD (final version; followed by the company BSDI)
AT&T System V
Version 1,2,…,7, System III, System V, … SVR4.2/ESMP
region based virtual memory, IPC, remote file sharing, STREAMS
Brief History (Cont`)
Commercial UNIX
XENIX (MS, SCO), SCO UNIX (SCO), AIX (IBM, Journaling FS),
HP-UX (HP), ULTRIX (DEC, the first MP), OSF/1 (Digital), ….
SunOS (Sun Microsystems, VFS, NFS), Solaris, Unixware
(Novell)
Mach
the first micro-kernel
chorus, Exo-kernel, SPIN, L4, ….
http://ssrnet.snu.ac.kr/~choijm/current_os.html
standard
SVID(System V Interface Definition), POSIX (IEEE), X/OPEN (Inc.)
UI (SUN, AT&T : Solaris), OSF (OSF/1)
Linux
Performance oriented
Philosophy of COPYLEFT
Kernel Architecture
Monolithic Kernel
traditional UNIX, SVR4, Solaris, Linux, ….
process
process
process
System Call
OS Functionality
Integrated Kernel
OS Personality
Hardware
Kernel Architecture (Cont`)
Monolithic Kernel
process
process
read()
fork()
System Call
sys_read()
sys_fork()
File System
bread()
Buffer Cache
Process Management
copy_mm()
OS Personality
hd_request()
Disk Device Driver
Memory Manager
copy_thread()
do_hd_io()
Hardware
CPU
Kernel Architecture (Cont`)
Micro-Kernel
Mach, Chorus, L3/L4, SPIN, QNX, Windows NT …
process
Server
Server
System Call
Microkernel
Hardware
Server
OS Functionality
Kernel Architecture (Cont`)
Micro-Kernel
process
read()
File System
Server
Process
Server
System Call
sys_read()
hd_request()
Microkernel
Hardware
what is the advantage of micro kernel ?
….
Windows-NT Architecture
Windows-NT
Applications
OS/2
Client
Logon
Process
NT Executive
POSIX
Server
Win32
Server
Security
Server
Object
Manager
POSIX
Client
OS/2
Server
Message
Protected
Subsystem
(Servers)
Win32
Client
Security
Ref. Monitor
Trap
User mode
System Services
I/O Manager
Kernel mode
Process
Manager
File System
Cache Manager
Device Drivers
LPC
Facility
VM
Mgt.
Network
Drivers
Kernel
Hardware Abstraction Layer(HAL)
HW Control
Hardware
(Source : Inside Windows NT)
Features
What is Good about UNIX
Open system
free
Small is beautiful philosophy
file: just stream of bytes
Simple and Coherent
data, device, pipe, socket, memory, process, … can be treated as a single
abstraction (file)
Portability
high-level language
new paradigm: OO, client-server model, clustering, PDA, MM Server
True Parallelism
Multitasking (Time Sharing), Multiprogramming, Multiprocessor, MPP
Features (Cont`)
What is Wrong with UNIX
Too many variant
dumping ground
Not small and simple any more
uncontrolled growth
Building-block approach
inappropriate for beginner
Lack of GUI
not now
Ritchie’s words, “It takes a genius to understand and appreciate UNIX’s simplicity”
II. Process Management
Overview
What is process?
process state transition
context
scheduling
kernel entry point
interrupt, trap, system call
signal
What is Process?
Definition
an instance of a running program (runnable program)
an execution environment of a program
scheduling entity
a control flow and address space
PCB (Process Control Block) : proc. table and U area
Manipulation of Process
create, destroy
context
state transition
dispatch (context switch)
sleep, wakeup
swap
Process State Transition
user
running
syscall,
interrupt
fork
initial
(idle)
return from
syscall or
interrupt
kernel
running
fork
swtch
zombie
exit
wait
sleep, lock
swtch
ready
to run
wakeup, unlock
swap
asleep
swap
suspended
ready
suspended
asleep
(Source : UNIX Internals)
Process State Transition (Cont`)
Flow of execution : execution mode (cf: address space)
Kernel execution
process A execution
Kernel execution
process B creation
Interrupt or Trap
cause change of
execution modes
process C execution
Kernel execution
process B execution
Kernel execution
(Source : Magic Garden)
Context
context : system context, address (memory) context, H/W context
memory
proc table
file table
segment table
page table
fd
Registers (TSS)
eip
sp
eflags
eax
swap
cs
disk
….
U area
….
Context : system context
System context
proc. Table
identification: pid, process group id, …
family relation
state
sleep channel: sleep queue
scheduling information : p_cpu, p_pri, p_nice, ..
signal handling information
address (memory) information
U area
stores hardware context when the process is not running currently
UID, GID
arguments, return values, and error status for system call
signal catch function
file descriptor
usage statistics
The exact contents may differ depending on the version and variant of UNIX
Context : address context
fork example
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int  glob = 6;                          /* global: lives in the data segment */
char buf[] = "a write to stdout\n";

int main(void)
{
    int   var;                          /* automatic: lives on the stack */
    pid_t pid;

    var = 88;
    write(STDOUT_FILENO, buf, sizeof(buf)-1);
    printf("before fork\n");

    if ((pid = fork()) == 0) {          /* child */
        glob++; var++;
    } else
        sleep(2);                       /* parent */

    printf("pid = %d, glob = %d, var = %d\n", (int)getpid(), glob, var);
    exit(0);
}
(Source : Adv. programming in the UNIX Env., pgm 8.1)
What output does this program produce?
Context : address context (Cont`)
fork internal : compile results
gcc
test.c
header
text
0xffffffff
0xbfffffff
…
movl %eax, [glob]
addl %eax, 1
movl [glob], %eax
...
glob, buf
data
kernel
bss
stack
var, pid
stack
0x0
data
text
a.out : ELF format
Executable and Linking Format
user’s perspective (virtual address)
Context : address context (Cont`)
fork internal : before fork (after run a.out)
memory
proc T.
segment T.
text
var, pid
pid = 11
stack
glob, buf
data
cf) we assume that there is no paging mechanism in this figure.
Context : address context (Cont`)
fork internal : after fork
proc T.
memory glob, buf
segment T.
data
pid = 11
text
var, pid
stack
proc T.
segment T.
glob, buf
pid = 12
data
stack
address space : basic protection barrier
var, pid
Context : address context (Cont`)
fork internal : with COW (Copy on Write) mechanism
after “glob++” operation
after fork with COW
memory
proc T.
segment T.
pid = 11
proc T.
text
segment T.
pid = 11
text
stack
proc T.
data
stack
segment T.
proc T.
pid = 12
segment T.
pid = 12
data
data
Context : address context (Cont`)
execve internal
memory
proc T.
segment T.
data
pid = 11
text
a.out
stack
text
data
stack
header
text
data
bss
stack
Context : hardware context
time sharing (multitasking)
Where am I ??
time quantum
process 1
…
process 2
process 3
Context : hardware context (Cont`)
brief reminder of the 80x86 architecture
ALU
Control Unit
IN
OUT
Registers
• eip, eflags
• eax, ebx, ecx, edx, esi, edi, …
• cs, ds, ss, es, ...
• cr0, cr1, cr2, cr3, GDTR, TR, ...
Context : hardware context (Cont`)
context swtch
save
context
Proc T.
TSS
eip
sp
eflags
eax
CPU
Proc T.
cs
U area
restore
context
TSS
eip
sp
eflags
eax
cs
U area
Context : hardware context (Cont`)
context swtch : pseudo-code in UNIX
…
/* need context swtch */
if (save_context())
{
    /* pick another process to run from ready queue */
    ….
    restore_context(new process)
    /* The control does not arrive here, NEVER !!! */
}
/* resuming process executes from here !!! */
…...
(Source : The Design of the UNIX OS)
trick : register (eg, eax in 80*86 CPU)
Think about the difference between context switch and system call.
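The "returns twice" trick in the pseudo-code can be tried in user space with the standard C setjmp/longjmp pair. This is only an analogy, not kernel code: `saved_ctx` and `demo_switch` are hypothetical names, and setjmp returns 0 on the saving path and nonzero on the resumed path, which is the opposite sense of save_context() in the slide.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf saved_ctx;   /* hypothetical stand-in for the saved hardware context */

/* Returns how many times execution passed through the save point. */
int demo_switch(void)
{
    volatile int passes = 0;            /* volatile: value must survive longjmp */

    if (setjmp(saved_ctx) == 0) {       /* first return: context was just saved */
        passes++;
        longjmp(saved_ctx, 1);          /* "restore": control never comes back here */
    }
    /* second return from the same setjmp call: the resumed path */
    passes++;
    assert(passes == 2);
    return passes;
}
```

The return value lives in a register (eax on 80x86), which is how one call site can tell the two returns apart.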
Process Scheduling
Process scheduling
allocate CPU resource among the competing processes
criteria : fairness, efficiency (response time vs. throughput)
types of processes
Interactive
Batch (Computation-Intensive)
Real-time
video,hospital
types of scheduling
Preemptive scheduling
other processes can take CPU away from the current running process
Non preemptive scheduling(Windows98)
other processes can not take CPU away from the current running process
Scheduling criteria
CPU utilization
Throughput
completed processes / time
Turnaround time
from process start to finish
Waiting time
total time spent in the ready queue
Response time
time from job submission until the first response begins
Process Scheduling (Cont`)
Existing Policies
FCFS (First Come First Served)
RR (Round-Robin)
SJF (Shortest Job First)
Multilevel Feedback Queue
EDF (Earliest Deadline First)
RM (Rate Monotonic)
Fair Queuing
Gang Scheduling
Causality Scheduling
Process migration
bank
time quantum (10-100 milliseconds)
multiple queues
Process Scheduling (Cont`)
UNIX : Round Robin with multilevel Feedback Queue
Round-Robin
Ready Queue
P3
P2
P1
CPU
Process Scheduling (Cont`)
Multilevel Feedback Queue
Ready Queue 1
P8
P7
P6
CPU
P4
CPU
Ready Queue 2
P5
•higher priority
•less time quantum
…….
Ready Queue n
P3
P2
CPU
P1
Process Scheduling (Cont`)
Round-Robin : real implementation
scheduling information in proc. table : p_pri, p_cpu, p_nice
every clock tick : increments p_cpu for current running process
every second : p_cpu = p_cpu * decay factor (generally 1/2)
p_pri = PUSER + p_cpu/2 + p_nice
Example of System III
3 process, PUSER=50, p_nice = 0, clock ticks 60 at every second
          P1              P2              P3
second    p_pri  p_cpu    p_pri  p_cpu    p_pri  p_cpu
0         50     0        50     0        50     0
1         65     30       50     0        50     0
2         57     15       65     30       50     0
3         53     7        57     15       65     30
4         66     33       53     7        57     15
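The System III numbers can be reproduced with a small simulation. This is a sketch under assumptions not stated on the slide: the selected process runs for the whole second (60 ticks), ties are broken round-robin, and integer division is used throughout. `struct proc_entry`, `pick` and `simulate` are hypothetical names.

```c
#include <assert.h>

#define NPROC 3
#define PUSER 50
#define HZ    60                        /* clock ticks per second, as in the example */

struct proc_entry { int p_cpu, p_pri, p_nice; };

/* Lowest p_pri value wins; ties go to the process after the last runner. */
static int pick(const struct proc_entry *p, int last)
{
    int best = -1;
    for (int k = 1; k <= NPROC; k++) {
        int i = (last + k) % NPROC;
        if (best < 0 || p[i].p_pri < p[best].p_pri)
            best = i;
    }
    return best;
}

/* Advance the table by `seconds` one-second steps. */
void simulate(struct proc_entry *p, int seconds)
{
    int last = NPROC - 1;               /* so that P1 wins the initial tie */
    assert(seconds >= 0);
    for (int s = 0; s < seconds; s++) {
        int cur = pick(p, last);
        p[cur].p_cpu += HZ;             /* one increment per clock tick */
        for (int i = 0; i < NPROC; i++) {
            p[i].p_cpu /= 2;            /* decay factor 1/2, once per second */
            p[i].p_pri = PUSER + p[i].p_cpu / 2 + p[i].p_nice;
        }
        last = cur;
    }
}
```

Starting from three entries with p_pri = 50 and p_cpu = 0, four steps end with P1 at (66, 33), P2 at (53, 7) and P3 at (57, 15), matching the row for second 4.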
Process Scheduling (Cont`)
Example of BSD
decay factor : (2*load_average) / (2*load_average + 1)
p_pri = PUSER + (p_cpu/4) + (2*p_nice)
clock tick is 10msec
time quantum is 10 clock ticks
Example of Mach
decay factor : 5/8
p_usrpri = PUSER + (3.8*(max(1,M/P) ) * p_cpu )/T + 0.5 * p_nice
Example of SVR4
support REAL-TIME class process
class independent scheduler / class dependent scheduler
Example of LINUX
support REAL-TIME process
select a process that has the highest value of “priority + counter”
“counter” of the current process decreases at each clock tick.
Process Scheduling (Cont`)
Range of Process Priorities
Kernel Mode
Priority
Swapper
P
Waiting for Disk I/O
P
P
P
Waiting for Buffer
Waiting for Inode
P
Waiting for TTY IO
User Mode
Priority
Waiting for Child Exit
P
P
User Level 0 (50)
P
P
User Level 1
P
……
P
User Level n
P
P
(Source : The Design of the UNIX OS)
Kernel Entry Point
Interrupt
Trap
system call
device
kernel
process
MM
HWM
PM
FS
DD
Interrupt Handling
Interrupt
a mechanism by which peripheral devices notify the UNIX operating system of
an asynchronous event
Real time Clock
Kernel
CPU
IVT
disk
PIC
tty
network
interrupt handlers
0
clock()
clock()
1
nmi()
disk_intr()
2
tty_intr()
3
disk_intr()
4
net_intr()
cdrom
….
what’s the difference between polling and interrupt?
announcement of successful applicants
Interrupt Handling (Cont`)
interrupt handling mechanism
similar to the steps of receiving a letter while on the telephone
step
if in user mode, switch to kernel mode
save context of current process (make new context layer)
determine interrupt source
find interrupt vector and call interrupt handler
…. interrupt handling…..
restore saved context
what if another interrupt is triggered while handling an interrupt?
Interrupt Handling (Cont`)
clock interrupt handler ( timer_interrupt() in Linux )
clock()
{
    restart clock                   /* will interrupt again */
    if (callout table not empty)    /* e.g., timer_list in LINUX */
        adjust time and schedule callout function if necessary
    if (profiling on)
        count program counter at time of interrupt
    gather statistics per process and system
    update CPU usage for the current running process
    if (one second elapsed) {
        alarm handling
        calculate the p_pri for all process
        reschedule if necessary
        wake up swapper or page daemon if necessary
    }
}
(Source : The Design of the UNIX OS)
Trap Handling
trap : a synchronous event caused by the instruction currently being executed (exception, system call)
IVT
0
div_by_zero()
1
invalid_opcode()
2
overflow()
3
segment_fault ()
4
page_fault ()
….
20
clock()
21
nmi()
22
tty_intr()
23
disk_intr()
24
net_intr()
….
80
system_call()
….
System Call Handling
system call : an example of trap
Kernel
trap
sys_call_table (sysent[])
IVT
0
div_by_zero()
0
sys_no_syscall()
1
invalid_opcode()
1
sys_exit()
2
overflow()
2
sys_fork()
3
segment_fault ()
3
sys_read ()
4
page_fault ()
4
system_call()
sys_write ()
….
80
system_call()
47
….
….
sys_getpid()
….
255
sys_no_syscall()
47
sys_fork()
sys_read()
System Call Handling (Cont`)
invoke system call
Kernel
process
main()
{
….
fork()
}
libc.a
….
fork()
{
….
movl $2, eax
trap $80
….
}
….
read()
{
…
}
IVT
sys_call_table (sysent[])
0 div_by_zero()
0 sys_no_sys()
1 in_opcode()
1 sys_exit()
2 overflow()
2 sys_fork()
sys_fork()
3 seg_fault ()
4 page_fault ()
3 sys_read ()
4 sys_write ()
sys_read()
….
….
80 system_call()
….
47 sys_getpid()
….
255 sys_no_sys()
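The path from library wrapper to system call number can be observed from user space. The sketch below assumes Linux with glibc, where `syscall()` and the `SYS_getpid` constant are available; `raw_getpid_matches` is a hypothetical helper name. It passes the call number directly, as the libc stub does before trapping, and compares the result with the ordinary wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the raw system call and the libc wrapper agree. */
int raw_getpid_matches(void)
{
    long  raw     = syscall(SYS_getpid);    /* number selects the kernel table entry */
    pid_t wrapped = getpid();               /* usual route through the libc wrapper */

    assert(raw > 0);
    return raw == (long)wrapped;
}
```

The concrete numbers differ between architectures and kernel versions, so the indices in the figure should be read as illustrative.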
System Call Handling (Cont`)
how to make a new system call
coding new system call function in kernel space
allocate syscall_number (and an empty slot in sys_call_table[])
and registering
kernel rebuild
reconfigure library
ar, ranlib
coding your program with new system call
Signal
a mechanism for notifying a process of an asynchronous event
types of signal : SIGKILL, SIGINT, SIGBUS, SIGUSR1, ….
action : abort, exit, ignore, stop, user level catch function
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void sig_handler(int signo)
{
    signal(SIGUSR1, sig_handler);               /* reinstall */
    printf("received signal %d\n", signo);      /* handle the signal */
    …..
}

int main(void)
{
    signal(SIGUSR1, sig_handler);               /* install the handler */
    ….
    for ( ; ; )
        pause();
}
what’s the difference among interrupt, trap, and signal?
Signal (Cont`)
register signal handler (signal catch function )
send signal
signal detection : state transition from kernel running to user running
call signal handler
variables for signal in task structure in LINUX
int sigpending : is signal received or not?
struct signal_struct *sig
sigset_t signal, blocked
typedef struct {
unsigned long sig[_NSIG_WORDS];
} sigset_t; /* asm-i386/signal.h */
struct sigaction /* asm-i386/signal.h */
struct signal_struct /* sched.h */
count
action[_NSIG]
siglock
sa_handler
sa_flags
sa_restorer
sa_mask
III. Memory Management
Memory Hierarchy
hierarchy
register
CPU cache
Main Memory
• larger capacity
• lower speed
• lower cost
Secondary Storage
Server (or INTERNET)
caching is more and more important (how to keep consistency?)
Memory Management Strategy
Three strategies
Fetch strategy: when is a process (page) brought into memory?
demand fetch
prefetch (agent in Web)
Placement strategy: where is a process (page) placed in memory?
first fit, best fit, worst fit
Replacement strategy: which process (page) is evicted from memory?
LRU, LFU, MRU, …
History of Memory Management System
single user system (stone age of memory management)
overlay
fixed partition multiprogramming system
absolute assembler, relocating assembler
variable partition multiprogramming system
coalescing , compaction
virtual memory system
paging
segmentation (segment, region, vm_object)
paging/segmentation
Overlay
for processes larger than the allocated memory
e.g., a 2-pass assembler
symbol table (20K)
common routines (30K)
overlay driver (10K)
pass 1 (70K)
pass 2 (80K)
History (Cont`)
variable partition multiprogramming system
Scenario
• fork P1 (40K)
• fork P2 (20K)
• fork P3 (10K)
• fork P4 (20K)
• fork P5 (40K)
• fork P6 (20K)
• fork P7 (70K)
• exit P1
• exit P3
• exit P4
• exit P6
memory and kernel internals
0
kernel
free memory map
100
P1
140
P2
P3
P4
P5
160
170
190
230
100
140
40
160
190
30
230
250
20
320
400
80
P6
250
P7
320
400
Memory Management Strategy : Placement
memory and kernel internals
0
kernel
free memory map
100
P1
140
160
170
190
P2
P3
P4
P5
100
140
40
160
190
30
230
250
20
320
400
80
230
250
P6
P7
320
Where to go??
400
Scenario
• fork P1 (40K)
• fork P2 (20K)
• fork P3 (10K)
• fork P4 (20K)
• fork P5 (40K)
• fork P6 (20K)
• fork P7 (70K)
• exit P1
• exit P3
• exit P4
• exit P6
• fork P8 (25K)
Memory Management Strategy : Placement
memory
0
kernel
kernel internals
100
free memory map
P1
140
160
170
190
P2
P3
P4
P5
first fit
best fit
Scenario
100
140
40
fork P8 (25K)
160
190
30
230
250
20
320
400
80
230
250
P6
P7
worst fit
320
400
issue : fragmentation
employed at swap management, KMA (kernel memory allocator)
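The three placement policies can be compared on the free memory map from the figure, read here as holes at 100 (40K), 160 (30K), 230 (20K) and 320 (80K). This is a minimal sketch; `struct hole`, `enum fit_policy` and `place` are hypothetical names, and hole splitting and coalescing after allocation are left out.

```c
#include <assert.h>
#include <stddef.h>

struct hole { int start; int size; };       /* one free-map entry, in K */
enum fit_policy { FIRST_FIT, BEST_FIT, WORST_FIT };

/* Return the start address of the hole chosen for a request of `req` K,
   or -1 if no hole is large enough. */
int place(const struct hole *map, size_t n, int req, enum fit_policy pol)
{
    int chosen = -1;

    assert(req > 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].size < req)
            continue;                       /* hole too small */
        if (chosen < 0) {
            chosen = (int)i;
            if (pol == FIRST_FIT)
                break;                      /* first adequate hole wins */
        } else if (pol == BEST_FIT && map[i].size < map[chosen].size) {
            chosen = (int)i;                /* tighter fit */
        } else if (pol == WORST_FIT && map[i].size > map[chosen].size) {
            chosen = (int)i;                /* larger leftover */
        }
    }
    return chosen < 0 ? -1 : map[chosen].start;
}
```

For "fork P8 (25K)", first fit selects the hole at 100, best fit the hole at 160, and worst fit the hole at 320.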
Virtual Memory
virtual memory : separate virtual address and physical address
virtual address
kernel stack
0xffffffff
kernel
kernel bss
stack
kernel data
kernel text
bss
data
text
page
0x0
Virtual Memory (Cont`)
virtual address : Linux case
0xffffffff
kernel
0xc0000000 env_end
arg_end
arg_start
start_stack
stack
shared memory
bss
data
text
bss
end_bss
end_data
data
text
end_code
start_code
brk
shared C library
bss
end_data
data
end_code
0x0
other shared library
program
text
start_code
(Source : Linux Internals)
Virtual Memory (Cont`)
physical memory
consists of kernel and a set of processes
physical memory
0x4ffffff
P4
P3
P2
P1
kernel
0x0
Virtual Memory (Cont`)
physical memory
a collection of page frame (4K or 8K)
physical memory
P1
page frame n
page frame n-1
….
P2
page frame 5
page frame 4
page frame 3
page frame 2
page frame 1
P3
Virtual Memory (Cont`)
address translation
segment table
origin register
virtual address
v = (s, p, d)
offset
segment page
number
number
p
d
s
b
+
s'
+
segment table
p'
page frame
number
p'
page table
offset
d
physical address
Virtual Memory (Cont`)
address translation : table structure
V segment start address (s’) L R W E A
segment table
V page frame number (p’)
D R U W COW
page table
cf) disk block descriptor per each page table entry
swap (fs) number block number type (fill 0, demand fill)
Virtual Memory (Cont`)
execve (final)
memory
nK
n-1 K
proc T.
segment T.
1
1
0
1
0
0
0
0
0
0
1
0
4K
28 K
20 K
12 K
32 K
28 K
24 K
20 K
16 K
12 K
8K
4K
0K
page T.
T2
a.out
0K
text
D1
12 K
S1
T1
header
48 K
data
stack
Virtual Memory (Cont`)
anonymous
pages of segment
SVR 4.0 virtual memory structure
struct proc
p_as
struct as
seg_list
hint
struct hat
struct seg
as_ptr
private
s_ops
base
size
struct
segvn_data
private
data
as_ptr
private
s_ops
base
size
anon_map
vnode
resident
pages of file
virtual address space
as_ptr
private
s_ops
base
size
text
data
stack
as_ptr
private
s_ops
base
size
u area
Virtual Memory (Cont`)
BSD (Mach) virtual memory structure
struct task
vm_map
struct vm_map
first hint last
struct vm_map_entry
struct vm_object
struct vm_page
resident
page list
struct pmap
Virtual Memory (Cont`)
Linux virtual memory structure
task_struct
mm
mm_struct
count
pgd
mmap
vm_area_struct
vm_end
vm_start
vm_flag
vm_inode
vm_end
vm_area_struct
vm_end
vm_start
vm_flag
vm_inode
vm_end
Data
Code
Virtual Memory (Cont`)
advantage of virtual memory
large address space
no need of placement strategy
flexible memory object sharing among the processes
P1
segment T.
1
1
0
1
0
4K
28 K
memory
20 K
page T.
P2
segment T.
1
1
0
1
8K
28 K
40 K
page T.
no free lunch : disadvantage of virtual memory
address translation
Virtual Memory (Cont`)
address translation with TLB (Translation Lookaside Buffer)
segment table
origin register
virtual address
v = (s, p, d)
offset
segment page
number
number
p
d
s
b
+
s p
p'
s'
TLB (associative memory)
+
segment table
p'
page frame
number
p'
page table
offset
d
physical address
Virtual Memory (Cont`)
HAT (Hardware Address Translation)
isolate all hardware dependent code
HAT in SVR4, pmap in BSD, pgd in Linux, ...
responsible for all address translation, transparently
case study : 80*86 CPU
segment descriptor
table (GDT, LDT)
virtual address
16bit
segment descriptor
32bit
offset
segment translation
32bit
linear address
cf) 80*86 reminds
GDT - available for all tasks
- segment for OS code data
- descriptor for LDT, TSS
LDT - for a specific task
IDT - interrupt service routine
Virtual Memory (Cont`)
HAT (Hardware Address Translation):Paging
case study : 80*86 CPU
31
linear address
22 21
12 11
0
DIR PAGE offset
31
11 0
31
11 0
31
PFN
11
PFN
PFN
0
offset
physical address
page directory
page table
CR3
control register:
Page Directory Base Register
31
11
page table entry
PFN
6 5
2 1 0
DR
UWP
•D: Dirty
•R: referenced
•U:User/Supervisor
•W:Read/Write
•P:Present(valid)
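The 10/10/12 split of a 32-bit linear address and the flag bits of a page table entry can be written out directly. This is a sketch for classic two-level 80x86 paging with 4K pages only; `struct linear`, `split_linear` and the `PTE_*` macro names are hypothetical, while the bit positions follow the figure.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of a page table entry, by bit position. */
#define PTE_P 0x001u    /* bit 0: Present (valid)   */
#define PTE_W 0x002u    /* bit 1: Read/Write        */
#define PTE_U 0x004u    /* bit 2: User/Supervisor   */
#define PTE_R 0x020u    /* bit 5: Referenced        */
#define PTE_D 0x040u    /* bit 6: Dirty             */

struct linear {
    uint32_t dir;       /* bits 31..22: index into the page directory */
    uint32_t page;      /* bits 21..12: index into the page table     */
    uint32_t offset;    /* bits 11..0 : byte offset within the page   */
};

struct linear split_linear(uint32_t addr)
{
    struct linear r;

    r.dir    = addr >> 22;
    r.page   = (addr >> 12) & 0x3FFu;
    r.offset = addr & 0xFFFu;
    /* the three fields together reconstruct the original address */
    assert(((r.dir << 22) | (r.page << 12) | r.offset) == addr);
    return r;
}
```

For example, 0xC0101234 splits into directory index 0x300, page table index 0x101 and offset 0x234.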
Replacement Strategy
Which page can be evicted from memory ?
memory
replacement policy
p2
p4
p1
p3
p7
page fault for p8
p8
disk
goal : reduce the number of page fault and thrashing
Replacement Strategy (Cont`)
basic principle of replacement : locality
temporal locality : stack, tree traverse, counting variable
spatial locality : array, sequential code, file reference
replacement policy
FIFO (First In First Out)
LRU (Least Recently Used)
LFU (Least Frequently Used)
NUR (Not Used Recently)
MRU (Most Recently Used)
Working Set
Second Chance(FIFO+reference bit)
Replacement Strategy (Cont`)
example : FIFO, LRU, LFU
scenario : page reference order
system internals
p1, p2, p3, p1, p4, p2, p1, p3, p4, p7, p8
memory
p2
p4
p1
p3
p7
disk
p8
guess which page will be evicted from memory under the LRU policy?
which policy is the best policy?
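The scenario can be replayed in code to check a guess. The sketch below assumes five page frames, as drawn in the figure, and counts LFU frequency only while a page is resident, with ties going to the lowest frame index. `last_victim` and `enum repl_policy` are hypothetical names.

```c
#include <assert.h>

#define NFRAMES 5                       /* assumed from the figure: five resident pages */
enum repl_policy { REPL_FIFO, REPL_LRU, REPL_LFU };

/* Replay `refs`; return the page evicted by the most recent replacement,
   or -1 if no replacement was needed. */
int last_victim(const int *refs, int n, enum repl_policy pol)
{
    int page[NFRAMES], loaded[NFRAMES], used[NFRAMES], count[NFRAMES];
    int nres = 0, victim = -1;

    assert(n >= 0);
    for (int t = 0; t < n; t++) {
        int f = -1;
        for (int i = 0; i < nres; i++)
            if (page[i] == refs[t])
                f = i;
        if (f >= 0) {                   /* hit: update recency and frequency */
            used[f] = t;
            count[f]++;
            continue;
        }
        if (nres < NFRAMES) {
            f = nres++;                 /* a free frame is still available */
        } else {
            f = 0;                      /* pick the frame with the smallest key */
            for (int i = 1; i < NFRAMES; i++) {
                int ki = pol == REPL_FIFO ? loaded[i] : pol == REPL_LRU ? used[i] : count[i];
                int kf = pol == REPL_FIFO ? loaded[f] : pol == REPL_LRU ? used[f] : count[f];
                if (ki < kf)
                    f = i;
            }
            victim = page[f];
        }
        page[f] = refs[t];
        loaded[f] = used[f] = t;
        count[f] = 1;
    }
    return victim;
}
```

With the reference order 1, 2, 3, 1, 4, 2, 1, 3, 4, 7, 8, the fault on p8 evicts p1 under FIFO, p2 under LRU and p7 under LFU.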
Replacement Strategy (Cont`)
Project I : program a simulator for FIFO, LRU, and LFU policy and
compare their performance.
assume
- memory consists of 20 page frames
- a range of page number is 0 ~ 49
- number of references is 300
program the 3 policies - use linked list for FIFO and LRU
- use priority tree for LFU if possible
- use hash to fast find a page
compare the performance and discuss it
Replacement Strategy (Cont`)
Example of real implementation in UNIX : buffer cache
head
lru list header
hash queue header
tail
(page_no % 5 ) = 0
10
45
(page_no % 5 ) = 1
21
26
(page_no % 5 ) = 2
2
(page_no % 5 ) = 3
33
28
(page_no % 5 ) = 4
24
19
30
3
43
(Source : The Design of the UNIX OS)
Replacement Strategy (Cont`)
example : NUR
used by pagedaemon (two-handed clock algorithm)
V page frame number (p’)
possible
combination
D R U W COW
0
0
1
1
0
1
1
0
replace page having (0,0)
combination first
Swapper vs. PageDaemon
swapping and paging
replace some object from memory when memory is almost full.
swapping
object : process
swap in/ swap out
swap space management
similar to variable partition multiprogramming
paging
object : page
page fault handling
IV. File System
Overview of File System
process 1
….
process 2
process n
User mode
System mode
Virtual File System
ffs
nfs
ext2fs
ntfs
buffer cache
….
mmfs
procfs
File System
device driver
User Interface
System call
open
read/write
close
dup
link
pipe, mkfifo
mkdir, readdir
mknod
stat
mount
sync, fsck
User Interface (Cont`)
file descriptor, file table, inode (vnode)
proc table
fd
segment table
file table
vnode
inode
TSS
U area
User Interface (Cont`)
fork vs open
fork
proc table
open same file
fd
vnode
proc table
file table
vnode
file table
parent
proc table
fd
parent
fd
file table
child
how about dup?
Disk system
physical view
platter, arm, head
cylinder, track, sector
seek time, rotational latency, transmission time
logical view (a viewpoint of UNIX)
disk is a collection of disk blocks
the disk block size is usually equal to the page frame size
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
….
Structure of File
disk block allocation
want to create a file with size of 14 K
assume - disk block size is 4 K.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
sequential allocation
non sequential allocation
block chain, indexed block, FAT
..
Structure of File (Cont`)
non sequential allocation
block chain
new file name
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
Structure of File (Cont`)
non sequential allocation
index block
new file name
…...
index block
what if the index block is full ?
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
Structure of File (Cont`)
non sequential allocation
FAT (File Allocation Table)
FAT
new file name
4
5
NIL
12
11
6
9
21
34
NIL
UN
NIL
7
UN
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
what are the advantages and disadvantages of block chain, index block, and FAT ?
Structure of File (Cont`)
sequential allocation
new file name
start
size
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
..
what are the advantages and disadvantages of sequential and non sequential allocation ?
Structure of File (Cont`)
inode in Unix File System
inode
type (4bit) u g s r w x r w x r w x
i_inode_number
i_mode
i_nlink
i_uid, gid
i_rdev
i_atime, ctime, mtime
S_IFSOCK
S_IFLNK
S_IFREG
S_IFBLK
S_IFDIR
S_IFCHR
S_IFIFO
direct
….
indirect
Structure of File (Cont`)
inode in Unix File System: find block
assume the size of disk block is 4K
which block is related if f_offset is 10000 ? (or 47000 )
file table
inode
4
7
12
18
f_offset
24
direct
33
….
41
indirect
165
169
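The arithmetic behind the question can be sketched as follows. The number of direct entries is passed as a parameter because the figure does not state it; 10 is used in the example as an assumption (the classic layout in The Design of the UNIX OS), and 4-byte block numbers are also assumed. `struct blkpos` and `locate` are hypothetical names.

```c
#include <assert.h>

struct blkpos {
    int  level;     /* 0 = direct, 1 = single indirect, 2 = beyond single indirect */
    long index;     /* slot in the direct array or in the indirect block */
    long byte;      /* byte offset inside the data block */
};

/* Map a file offset to the inode slot that holds its disk block number. */
struct blkpos locate(long f_offset, long bsize, long ndirect)
{
    struct blkpos pos;
    long lbn, per_block;

    assert(f_offset >= 0 && bsize > 0 && ndirect > 0);
    lbn       = f_offset / bsize;       /* logical block number within the file */
    per_block = bsize / 4;              /* assumed: 4-byte block numbers */
    pos.byte  = f_offset % bsize;

    if (lbn < ndirect) {
        pos.level = 0;
        pos.index = lbn;
    } else if (lbn - ndirect < per_block) {
        pos.level = 1;
        pos.index = lbn - ndirect;
    } else {
        pos.level = 2;                  /* double indirect and beyond: not resolved here */
        pos.index = lbn - ndirect - per_block;
    }
    return pos;
}
```

With 4K blocks and 10 direct entries, offset 10000 falls in direct slot 2 at byte 1808, and offset 47000 falls in slot 1 of the single indirect block at byte 1944. The disk block number is then whatever the inode or indirect block stores in that slot.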
Structure of Directory
connect file name to disk block(s)
directory entry in UNIX FS
inode number
file name
directory entry in DOS
file name extension attributes time first block number
provide hierarchical structure for file system
inode 1 disk block 1
inode 3
i_mode
time
….
1
i_mode
time
….
7
1
1
3
4
5
6
7
9
..
.
usr
dev
etc
vmunix
var
mnt
disk block 7
1 ..
3 .
12 src
16 include
17 lib
20 bin
23 member
25 local
inode 23 disk block 39
i_mode
time
….
39
3 ..
23 .
32 jim
33 tom
37 mark
41 sooni
42 mjc
Structure of Directory (Cont`)
hierarchical view
/
usr
src
dev
include
etc
lib
jim
var
bin
member
tom
mark
mnt
vmunix
local
sooni
mjc
Structure of Directory (Cont`)
open example
open("/usr/member/sooni/test.c", O_RDONLY)
find inode using directory structure (namei())
allocate fd, file table and initialize
proc table
fd
file table
inode
f_offset
….
Structure of File System
file system: boot, super, inode, data block
/dev/hda
/dev/hdb
system
/dev/hda1
/dev/hda3
/dev/hda2
boot
super
i-node
disk blocks
Structure of File System (Cont`)
super block : manage information for file system
(cf: inode for file)
struct superblock
s_type
s_flag
s_dev
s_blocksize
s_magic
s_name
….
s_free_inode []
s_free_disk block []
free inode list (map)
...
free disk block list (map)
...
iget, iput
balloc, bfree
Structure of File System (Cont`)
super block
struct superblock
s_type
s_flag
s_dev
s_blocksize
s_magic
s_name
….
s_free_inode []
s_free_disk block []
29 27 26 24 21 20 19
61 57 56 54 51 50 48 46 45 43 42 41 39 38 37 34
disk block 29
disk block 61
……
Structure of File System (Cont`)
mount
vfsmntlist
“mount /dev/hda3 /mnt”
super block
for /dev/hda3
inode for /mnt
inode for root on FS of /dev/hda3
open("/mnt/test.c", O_RDONLY)
s_dev
s_blocksize
mounted point
root inode
...
vfsmount
mmt_sb
vfsmount
Inode for special file
inode structure for special file
pipe
no indirect block (unnamed pipe)
readers, writers, read pointer, write pointer
special device file
no direct, indirect block
device number : major number + minor number
major number : corresponding device type
used as index for device switch table
minor number : corresponding device unit
pass as argument to device driver
Existing File System
S5FS
first and conventional UNIX file system
FFS
support 255 characters file name
cylinder groups
fragments
LFS
optimized for small writes
suitable for RAID storage system
directory entry for ffs
i_no size file_name
fast file system structure
boot block
super block
cylinder group 1
(inode, disk blocks)
cylinder group 2
VxFS (Journaling File System)
fast recovery using internal logging
…...
Existing File System
ext2 File System
Linux default file system
similar to Berkeley’s FFS
inode : 12 direct block
used bitmap for free block and inode management
fault-tolerant features
Ext2 file system structure
super block
boot block
Group descriptor
Block group 0
Block bitmap
Block group 1
Inode bitmap
……
Inode table
Block group n
Data Blocks
Existing File System
NFS
stateless protocol
XDR (External Data Representation)
AFS, Coda File System
disconnected operation
Sprite File System
VFS
application
nfsd
VFS
to support various file system
nfs server
system call
strong consistency
nfs client
mfs
procfs
VFS
NFS
RPC stub
NFS
RPC stub
XDR
UFS
swap space management
swap space management
P1
stack
swap space
0
P1
P2
P3
P4
P5
P6
data
text
P2
stack
data
text
64M
swap space management
swap used map
Scenario
• swap out P1 (3M)
• swap out P2 (3M)
• swap out P3 (2M)
• swap out P4 (1M)
• swap out P5 (3M)
• swap out P6 (4M)
• swap in P2
• swap in P4
• swap in P5
swap used map
3
6
3
P1
8
12
4
P2
16 64 48
swap space
0
P3
P4
P5
P6
64M
why does UNIX manage swap space differently to the FS ?
V. Inter-Process Communication
Inter-Process Communication (IPC)
synchronization
pipes
communication via files
signal
System V IPC
message queue
shared memory
semaphore
IPC with sockets
synchronization
parallelism
multiprocessor (true parallelism) or time sharing (quasi-parallelism)
race condition : more than one process wants to access the same resource
shared resource
mutual exclusion
only one process can exclusively access a shared resource at a time
critical section : a portion of a program that accesses a shared resource
representative mechanism: ipl, lock, semaphore, test&set
deadlock
synchronization (Cont’)
example of race condition I
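#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static void charatatime(char *str);

int main(void)
{
    pid_t pid;

    if ((pid = fork()) == 0) {              /* child */
        charatatime("output from child\n");
    } else {                                /* parent */
        charatatime("output from parent\n");
    }
    exit(0);
}

/* write the string one character at a time, unbuffered */
static void charatatime(char *str)
{
    char *ptr; int c;

    setbuf(stdout, NULL);
    for (ptr = str; (c = *ptr++) != 0; )
        putc(c, stdout);
}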
(Source : Adv. programming in the UNIX Env. pgm 8.7)
guess what the results are?
outpuot utfprut froom chmild
parent
synchronization (Cont`)
system internals
task structure
fd
file structure
inode
f_pos
shared resource
fd
synchronization (Cont`)
example of race condition II
scenario
process P1 is currently being dispatched (removed from the ready queue)
disk interrupt occurs
disk interrupt handler wakes up process P2 and wants to insert it into the ready
queue
RQ
P2
RQ
RQ
P4
P1
P4
P1
P4
P1
P3
P3
P3
synchronization (Cont`)
ipl (interrupt priority level)
BSD            SVR4            Purpose
spl0           spl0            enable all interrupts
splsoftclock   spltimeout      disable functions scheduled by timers
splnet         -               disable network protocol processing
-              splstr          disable STREAMS interrupts
spltty         spltty          disable terminal interrupts
splbio         spldisk         disable disk interrupts
splclock       -               disable hardware clock interrupt
splhigh        spl7 or splhi   disable all interrupts
splx           splx            restore ipl to previously saved value
synchronization (Cont`)
lock
associate lock variable to each shared resource
lock before (unlock after) the critical section
spin_lock primitive
void spin_lock(spinlock_t *s) {
while (test_and_set (s) != 0)
;
}
void spin_unlock (spinlock_t *s) {
*s = 0;
}
(Source : UNIX internals)
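The spin_lock primitive above can be tried in user space. The following is a minimal sketch, assuming C11 atomics stand in for the hardware test_and_set and that the lock word lives in memory shared by forked workers; shared_t and demo_spinlock are illustrative names, not kernel interfaces.

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lock word and protected counter, placed in memory shared by all workers. */
typedef struct {
    atomic_flag locked;
    long counter;
} shared_t;

static void spin_lock(atomic_flag *s)
{
    /* test-and-set returns the previous value: true means the lock is held */
    while (atomic_flag_test_and_set_explicit(s, memory_order_acquire))
        ;   /* busy-wait */
}

static void spin_unlock(atomic_flag *s)
{
    atomic_flag_clear_explicit(s, memory_order_release);
}

/* Fork nproc workers that each add 1 to the counter iters times.
 * Returns the final counter value, or -1 if the mapping fails. */
long demo_spinlock(int nproc, long iters)
{
    shared_t *sh = mmap(NULL, sizeof *sh, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (sh == MAP_FAILED)
        return -1;
    atomic_flag_clear(&sh->locked);
    sh->counter = 0;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            for (long k = 0; k < iters; k++) {
                spin_lock(&sh->locked);
                sh->counter++;              /* critical section */
                spin_unlock(&sh->locked);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;   /* reap every worker */

    long total = sh->counter;
    munmap(sh, sizeof *sh);
    return total;
}
```

With the lock in place, demo_spinlock(4, 20000) should return 80000; removing the lock/unlock pair typically loses updates on a multiprocessor.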
synchronization (Cont`)
sleep_lock
1. process wants resource
2. is it locked?
   Yes: sleep on resource; when awakened by any process, go back to step 2
   No: lock the resource
3. use resource
4. unlock resource
5. does anyone want it?
   Yes: wake up all waiting processes
   No: nothing to wake up
6. continue other processing
spin lock or sleep lock, lock granularity, rw_lock (try_lock)
synchronization (Cont`)
semaphore
an object that can be accessed only through the P and V operations (plus an initialization operation)
semaphore primitive
void initsem (semaphore_t *sem, int val) {
*sem = val;
}
void P (semaphore_t *sem) {
*sem -= 1;
while (*sem < 0)
sleep;
}
void V (semaphore_t *sem) {
*sem += 1;
if (processes slept on sem queue)
wake up the processes slept on sem;
}
(Source : UNIX internals)
synchronization (Cont`)
semaphore : example
client: produce an item -> put the item into shared memory
server: remove an item from shared memory -> consume the item
(client and server communicate through the shared memory)
synchronization (Cont`)
semaphore : example
shared between client and server: sem1, sem2, shared memory
initsem(sem1, 5)
initsem(sem2, 0)
client: produce an item -> P(sem1) -> put the item into shared memory -> V(sem2)
server: P(sem2) -> remove an item from shared memory -> V(sem1) -> consume the item
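The client/server exchange with sem1 = 5 and sem2 = 0 can be sketched as a runnable program. This is an illustration under assumptions: P and V are built on System V semop(), the shared memory is an anonymous shared mapping, and the names SLOTS, ITEMS, semun_demo and demo_bounded_buffer are invented for the example.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLOTS 5     /* matches initsem(sem1, 5) */
#define ITEMS 20
enum { SEM1 = 0, SEM2 = 1 };   /* SEM1 counts free slots, SEM2 filled slots */

/* semctl() requires the caller to define this union. */
union semun_demo { int val; struct semid_ds *buf; unsigned short *array; };

static void sem_add(int semid, int num, int delta)
{
    struct sembuf op = { .sem_num = (unsigned short)num,
                         .sem_op = (short)delta, .sem_flg = 0 };
    if (semop(semid, &op, 1) == -1) {
        perror("semop");
        _exit(1);
    }
}
static void P(int semid, int num) { sem_add(semid, num, -1); }
static void V(int semid, int num) { sem_add(semid, num, +1); }

/* Client (child) produces 1..ITEMS, server (parent) consumes them.
 * Returns the sum of the consumed items, or -1 on setup failure. */
long demo_bounded_buffer(void)
{
    union semun_demo arg;
    long sum = 0;
    int *buf = mmap(NULL, SLOTS * sizeof(int), PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;
    arg.val = SLOTS; semctl(semid, SEM1, SETVAL, arg);
    arg.val = 0;     semctl(semid, SEM2, SETVAL, arg);

    pid_t pid = fork();
    if (pid < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {                        /* client */
        for (int i = 1; i <= ITEMS; i++) {
            P(semid, SEM1);                /* wait for a free slot */
            buf[(i - 1) % SLOTS] = i;      /* put the item into shared memory */
            V(semid, SEM2);                /* announce a filled slot */
        }
        _exit(0);
    }
    for (int i = 0; i < ITEMS; i++) {      /* server */
        P(semid, SEM2);                    /* wait for a filled slot */
        sum += buf[i % SLOTS];             /* remove the item */
        V(semid, SEM1);                    /* hand the slot back */
    }
    waitpid(pid, NULL, 0);
    semctl(semid, 0, IPC_RMID);
    munmap(buf, SLOTS * sizeof(int));
    return sum;
}
```

The two semaphores never let the client run more than five items ahead, so no slot is overwritten before it is read; the returned sum is 1 + 2 + ... + 20 = 210.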
synchronization (Cont`)
semaphore in the linux kernel
widely used for ‘wait until a condition is met’ (eg read disk blocks)
semaphore /* include/asm-i386/semaphore.h, kernel/sched.c */
declare semaphore for each shared resource
struct semaphore {
atomic_t count;
struct wait_queue *wait;
}
void down (struct semaphore *sem) {
while (sem->count <= 0)
sleep_on (&sem->wait);
sem->count--;
}
void up (struct semaphore *sem) {
sem->count++;
wake_up (&sem->wait);
}
[Figure: process 1 and process 2 each execute down(x); critical section; up(x) around the shared resource, where x is a struct semaphore *]
synchronization (Cont`)
semaphore in the linux kernel
sleep, wakeup /* include/linux/wait.h kernel/sched.c */
struct wait_queue {
    struct task_struct *task;
    struct wait_queue *next;
};
void sleep_on (struct wait_queue **queue) {
    struct wait_queue entry = {current, NULL};
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue (queue, &entry);
    schedule();
    remove_wait_queue(queue, &entry);
}
void wake_up (struct wait_queue **queue) {
    struct wait_queue *p = *queue;
    do {
        p->task->state = TASK_RUNNING;
        add_to_runqueue(p->task); p = p->next;
    } while (p != *queue);
}
interruptible_sleep_on(), wake_up_interruptible()
synchronization (Cont`)
Deadlock
a system state in which processes wait for events that will never occur
[Figure: resource allocation graph with process 1-4 and resource 1-4]
synchronization (Cont`)
Deadlock
deadlock prevention
deadlock avoidance
deadlock detection and correction
reduction of resource allocation graph
[Figure: step-by-step reduction of a resource allocation graph with processes P1, P2, P3 and resources R1, R2]
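Deadlock detection by graph reduction can be sketched in a few lines. The version below is a simplified illustration, assuming single-instance resources and at most one outstanding request per process; the array layout (holder[], wants[]) and the function names are assumptions made for the example.

```c
#include <stdbool.h>

#define NONE (-1)
#define MAXPROC 64

/* holder[r]: process that holds resource r, or NONE.
 * wants[p] : resource that process p is blocked on, or NONE.
 * Repeatedly removes ("reduces") every process whose request can be
 * satisfied and releases what it holds. Processes that can never be
 * reduced are deadlocked. Returns their count (holder[] is modified). */
int count_deadlocked(int nproc, int nres, int holder[], const int wants[])
{
    bool done[MAXPROC] = { false };
    int remaining = nproc;
    bool progress = true;

    if (nproc > MAXPROC)
        return -1;
    while (progress) {
        progress = false;
        for (int p = 0; p < nproc; p++) {
            if (done[p])
                continue;
            if (wants[p] != NONE && holder[wants[p]] != NONE)
                continue;                   /* still blocked */
            for (int r = 0; r < nres; r++)  /* p can finish: release all */
                if (holder[r] == p)
                    holder[r] = NONE;
            done[p] = true;
            remaining--;
            progress = true;
        }
    }
    return remaining;
}

/* P0 holds R0 and wants R1, P1 holds R1 and wants R0, P2 only holds R2. */
int demo_cycle(void)
{
    int holder[] = { 0, 1, 2 };
    int wants[]  = { 1, 0, NONE };
    return count_deadlocked(3, 3, holder, wants);
}

/* P0 holds R0 and wants R1, P1 holds R1 and waits for nothing. */
int demo_chain(void)
{
    int holder[] = { 0, 1 };
    int wants[]  = { 1, NONE };
    return count_deadlocked(2, 2, holder, wants);
}
```

demo_cycle() leaves the two processes of the circular wait unreduced, while demo_chain() reduces completely, so the graph is deadlock free.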
pipe
named pipe, unnamed pipe
pipe(fd[]), mkfifo(path, mode), mknod(path, mode, dev_t)
process 1
process 2
write fd
write fd
read fd
pipe
kernel
no indirect blocks in inode
rd_pointer, wr_pointer, number of readers, number of writers
file type in inode: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO
pipe
pipe(unnamed pipe)
limit
cannot broadcast
no object boundaries
cannot direct data to a specific reader
FIFO(named pipe)
FIFO file
must be explicitly deleted(unlink)
named
less secure than pipe
pipe (Cont`)
example of pipe : “% ls -l | more”
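for (;;) {
    read_command();
    parsing_command();
    if (fork() == 0) {              /* child of the shell */
        pipe(fd);
        if (fork() == 0) {          /* grandchild: runs "ls" */
            close(1);               /* stdout */
            dup(fd[1]);             /* stdout now writes into the pipe */
            close(fd[0]); close(fd[1]);
            exec("ls", …);
        }
        close(0);                   /* stdin */
        dup(fd[0]);                 /* stdin now reads from the pipe */
        close(fd[0]); close(fd[1]);
        exec("more", …);
    }
    wait();
}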
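The data path of the pipeline above can be exercised without exec. The sketch below is a minimal illustration: a child writes a message into an unnamed pipe and the parent reads it until end of file; pipe_roundtrip and demo_pipe are names invented for the example.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes msg into the pipe, the parent reads until EOF.
 * Returns the number of bytes read into out, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    int fd[2];
    size_t total = 0;
    ssize_t n;

    if (outsz == 0 || pipe(fd) == -1)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]); close(fd[1]);
        return -1;
    }
    if (pid == 0) {                 /* writer */
        close(fd[0]);
        if (write(fd[1], msg, strlen(msg)) < 0)
            _exit(1);
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);   /* the reader must close its write end to ever see EOF */
    while (total < outsz - 1 &&
           (n = read(fd[0], out + total, outsz - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}

/* Returns 1 when the message arrives intact. */
int demo_pipe(void)
{
    const char *msg = "ls -l | more";
    char buf[64];
    ssize_t n = pipe_roundtrip(msg, buf, sizeof buf);
    return n == (ssize_t)strlen(msg) && strcmp(buf, msg) == 0;
}
```

Closing the unused ends matters: as long as any write descriptor stays open, read() on the pipe blocks instead of returning 0.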
Communication via files
the oldest way of data exchanging among processes
P
P
file
a race condition may occur
one process reads the data before the other has finished modifying it
mandatory or advisory locking
lockf, flock, fcntl
fcntl(fd, cmd, arg)
flock structure
l_type: F_RDLCK, F_WRLCK, F_UNLCK, F_SHLCK, F_EXLCK
l_whence
l_start
l_len
l_pid
cmd: F_GETLK, F_SETLK, …...
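Advisory record locking with fcntl() can be sketched as follows. The example assumes a scratch file under /tmp and checks that a second process is refused a write lock while the first one holds it; set_lock and demo_lock_conflict are illustrative names.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Apply a lock of the given l_type to the whole file, without blocking. */
static int set_lock(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;           /* F_RDLCK, F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Returns 1 if the child is refused the lock held by the parent. */
int demo_lock_conflict(void)
{
    char path[] = "/tmp/lockdemoXXXXXX";
    int status = 0;
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    if (set_lock(fd, F_WRLCK) == -1) {
        close(fd); unlink(path);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(fd); unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* record locks belong to a process and are not inherited by fork */
        int r = set_lock(fd, F_WRLCK);
        _exit(r == -1 && (errno == EAGAIN || errno == EACCES) ? 0 : 1);
    }
    waitpid(pid, &status, 0);
    set_lock(fd, F_UNLCK);
    close(fd);
    unlink(path);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

F_SETLK fails immediately on a conflict; F_SETLKW would block instead, which is the variant that can run into the deadlock scenario.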
Communication via files (Cont`)
A deadlock scenario with file locking
file
P
P
In Linux, fcntl() with F_SETLKW detects this case and fails with the error EDEADLK (EDEADLOCK is a synonym on most architectures)
Signal
register signal handler (signal catch function )
send signal
signal detection : state transition from kernel running to user running
call signal handler
variables for signal in task structure
int sigpending : is signal received or not?
struct signal_struct *sig
sigset_t signal, blocked
typedef struct {
unsigned long sig[_NSIG_WORDS];
} sigset_t; /* asm-i386/signal.h */
struct sigaction /* asm-i386/signal.h */
sa_handler
sa_flags
sa_restorer
sa_mask
struct signal_struct /* sched.h */
count
action[_NSIG]
siglock
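From user level, the register / send / catch sequence looks like this. A minimal sketch using the POSIX sigaction() interface; the handler name on_usr1 and the function demo_signal are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_signal = 0;

/* Signal catch function: only async-signal-safe work belongs here. */
static void on_usr1(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Registers a handler for SIGUSR1, sends the signal to the calling
 * process and reports whether the handler ran (1) or not (0). */
int demo_signal(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;        /* register signal handler */
    sigemptyset(&sa.sa_mask);       /* block nothing extra while it runs */
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    got_signal = 0;
    raise(SIGUSR1);                 /* send signal to ourselves */
    return got_signal;
}
```

The fields set here (sa_handler, sa_mask, sa_flags) are the user-visible side of the struct sigaction entries that the kernel keeps in action[_NSIG].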
System V IPC
Message, Shared Memory, and Semaphore
Common properties
Key => id (cf: file name => fd)
In kernel, ***id_ds for System V IPC (eg: msqid_ds)
ipc_perm: key, uid, cuid, access mode, …
ipcs, ipcrm
Difference
message : suitable for object-oriented concepts
shared memory : fast
semaphore : for user level synchronization
System V IPC (Cont`)
message queue
msqid = sys_msgget (key, flag) /* create */
sys_msgsnd (msqid, msgp, msgsz, flag) /* send */
sys_msgrcv (msqid, msgp, msgsz, msgtype, flag) /* receive */
sys_msgctl(msqid, cmd, msqid_ds) /* control */
[Figure: sender processes (P) append msg entries to the queue headed by struct msqid_ds; receiver processes (P) remove them]
System V IPC (Cont`)
struct msqid_ds
P
P
P
msg_perm
msg_first
msg_last
msg_stime
msg_rtime
msg_ctime
wwait_queue
rwait_queue
msg_cbytes
msg_qnum
msg_qbytes
msg_lspid
msg_lrpid
msg_next
msg_type
msg_spot
msg_ts
msg_next
msg_type
msg_spot
msg_ts
msgtype in sys_msgrcv()
=0 : receive the first msg in the queue
>0 : receive the first msg of the given type in the queue
<0 : receive the first msg of the lowest type that is less than or equal to |msgtype|
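The three msgtype cases can be checked with a short program against the C library wrappers msgget(), msgsnd() and msgrcv(). This is a sketch, assuming a private queue and an invented message layout (struct demo_msg); demo_msgtype is an illustrative name.

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Every System V message starts with a long type field. */
struct demo_msg {
    long mtype;
    char mtext[16];
};

static int send_type(int qid, long type)
{
    struct demo_msg m;
    m.mtype = type;
    memset(m.mtext, 0, sizeof m.mtext);
    return msgsnd(qid, &m, sizeof m.mtext, 0);
}

/* Returns the type of the message that was received, or -1. */
static long recv_type(int qid, long msgtype)
{
    struct demo_msg m;
    if (msgrcv(qid, &m, sizeof m.mtext, msgtype, IPC_NOWAIT) == -1)
        return -1;
    return m.mtype;
}

/* Queues messages of type 3, 1, 2 (in that order) and receives them with
 * msgtype 2, -3 and 0. The three received types are packed as digits. */
int demo_msgtype(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1)
        return -1;
    if (send_type(qid, 3) == -1 || send_type(qid, 1) == -1 ||
        send_type(qid, 2) == -1) {
        msgctl(qid, IPC_RMID, NULL);
        return -1;
    }
    long a = recv_type(qid, 2);     /* >0: first message of type 2      */
    long b = recv_type(qid, -3);    /* <0: lowest type <= 3, i.e. 1     */
    long c = recv_type(qid, 0);     /* =0: first message left, type 3   */
    msgctl(qid, IPC_RMID, NULL);
    return (int)(a * 100 + b * 10 + c);
}
```

The result 213 shows that a positive msgtype skips earlier messages of other types and that a negative msgtype acts as a priority receive.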
System V IPC (Cont`)
shared memory
shmid = sys_shmget (key, size, flag)
sys_shmat (shmid, shmaddr, shmflag, raddr)
sys_shmdt (shmaddr)
sys_shmctl(shmid, cmd, shmid_ds)
struct shmid_ds
shm_perm
shm_segsz
shm_atime
shm_dtime
shm_ctime
shm_cpid
shm_lpid
shm_nattach
shm_npage
shm_pages /* for page table entries */
attaches
/* struct vm_area_struct */
System V IPC (Cont`)
using shared memory
vm area of
process A
vm area of
process B
kernel
stack
kernel
stack
0xa27e8000
0x77ed000
0xa27e0000
heap
data
text
heap
data
shared memory
region
text
0x77e5000
System V IPC (Cont`)
semaphore
semid = sys_semget (key, nsems, flag)
semop (semid, sops, nsops)
semctl(semid, semnum, cmd, *arg)
struct sembuf sops;
struct sembuf {
unsigned short sem_num;
short sem_op;
short sem_flg;
}
if (sem_op > 0)
V() operation
else if (sem_op < 0)
P() operation
else /* sem_op == 0 */
wait until the semaphore value becomes zero
struct semid_ds
sem_perm
sem_otime
sem_ctime
sem_base
sem_pending
……
sem_nsems
socket
socket
common interface for IPC and networking
Protocol family: UNIX, INET, AX25, IPX, Appletalk
layer structure of a network
BSD socket
INET
TCP
UDP
IP
PLIP
SLIP
parallel
port
serial
port
ETHERNET
Ethernet
card
ARP
socket (Cont`)
information for communication
5-tuple {protocol, local-addr, local-process, foreign-addr, foreign-process}
C library routines
socket() : protocol, make socket structure
bind() : assign local-addr and local-process
connect() : foreign-addr, foreign-process
listen() : waiting in server
accept() : make connection to a client
read(), write()
send(), sendto(), recv(), recvfrom()
cf) system call: sys_socketcall
/* net/socket.c */
socket (Cont`)
socket structure
file
….
f_dentry
….
f_pos
f_op
/* net/socket.c */
sock_lseek
sock_read
sock_write
NULL
sock_poll
sock_ioctl
NULL
sock_no_open
….
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
/* include/net/sock.h */
struct sock {
...
}
/* include/linux/net.h */
struct proto_ops {
family
dup, release,
bind, connect,
accept, listen,
...
getsockopt
setsockopt
sendmsg
recvmsg
}
/* for INET operation */
socket (Cont`)
connection oriented protocol
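server: socket() -> bind() -> listen() -> accept() (blocks until connection from a client) -> read() -> processing request -> write()
client: socket() -> connect() -> write() -> read()
connection established: client connect() with server accept()
data (request): client write() to server read()
data (reply): server write() to client read()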
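The connection-oriented sequence can be run on the loopback interface. The sketch below is an illustration under assumptions: server and client are a parent and a forked child, the port is chosen by the kernel, and the 4-byte "ping"/"pong" messages and the names read_full and demo_tcp_request_reply are invented for the example.

```c
#define _DEFAULT_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read exactly n bytes (a read on a TCP socket may return fewer). */
static int read_full(int fd, char *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, buf + got, n - got);
        if (r <= 0)
            return -1;
        got += (size_t)r;
    }
    return 0;
}

/* Returns 1 when the request and the reply both arrive intact. */
int demo_tcp_request_reply(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    char req[5] = { 0 };
    int ok = 0, status = 0;

    int lfd = socket(AF_INET, SOCK_STREAM, 0);          /* server: socket() */
    if (lfd == -1)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                                  /* kernel picks a port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(lfd, 1) == -1 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) == -1) {
        close(lfd);
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(lfd);
        return -1;
    }
    if (pid == 0) {                                     /* client */
        char reply[5] = { 0 };
        int cfd = socket(AF_INET, SOCK_STREAM, 0);
        if (cfd == -1 ||
            connect(cfd, (struct sockaddr *)&addr, sizeof addr) == -1)
            _exit(1);
        if (write(cfd, "ping", 4) != 4 || read_full(cfd, reply, 4) == -1)
            _exit(1);
        _exit(strcmp(reply, "pong") == 0 ? 0 : 1);
    }
    int sfd = accept(lfd, NULL, NULL);                  /* blocks until connect */
    if (sfd != -1) {
        if (read_full(sfd, req, 4) == 0 && strcmp(req, "ping") == 0)
            ok = (write(sfd, "pong", 4) == 4);          /* reply */
        close(sfd);
    }
    close(lfd);
    waitpid(pid, &status, 0);
    return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling listen() before fork() means the client's connect() is queued even if it runs before the server reaches accept().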
socket (Cont`)
connectionless protocol
server: socket() -> bind() -> recvfrom() (blocks until data received from a client) -> processing request -> sendto()
client: socket() -> bind() -> sendto() -> recvfrom()
data (request): client sendto() to server recvfrom()
data (reply): server sendto() to client recvfrom()
TLI
connection oriented protocol
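server: t_open() -> t_bind() -> t_listen() (wait for connection) -> t_accept() -> t_rcv() -> processing request -> t_snd()
client: t_open() -> t_bind() -> t_connect() -> t_snd() -> t_rcv()
connection request: client t_connect() to server t_listen()
data (request): client t_snd() to server t_rcv()
data (reply): server t_snd() to client t_rcv()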
VI. I/O System (Device Driver)
Role of a device driver
handle data movement between memory and peripheral devices
usually written by a third-party
P
P
P
P
system call interface
kernel
file system
device driver interface (through devsw table)
tty
driver
disk
driver
network
driver
Peripheral Device: General Structure
H/W configuration
extremely hardware dependent
controller
CSR (Control and Status Register)
- driver writes to the CSRs to issue commands to the device and
reads CSRs to obtain completion status or error condition
- memory mapped I/O, special in/out instructions (e.g. the 80x86 in/out instructions)
- programmed I/O (tty, modem, printer), DMA (disk)
internal buffer
device itself
Disk Driver
Disk I/O handling
convert logical disk block number into physical sector(s)
handle read/write requests, handle interrupt
disk scheduling
FCFS
SSTF (Shortest Seek Time First)
SCAN
C-SCAN
…..
DMA (channel)
RAID
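The effect of the scheduling policy on head movement can be computed directly. The sketch below compares FCFS and SSTF on a commonly used textbook request queue (head at cylinder 53); the queue and the function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Total head movement when requests are served in arrival order. */
long fcfs_seek(int head, const int req[], int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += labs((long)req[i] - head);
        head = req[i];
    }
    return total;
}

/* Total head movement when the closest pending request is served next.
 * Returns -1 if memory cannot be allocated. */
long sstf_seek(int head, const int req[], int n)
{
    long total = 0;
    if (n <= 0)
        return 0;
    bool *done = calloc((size_t)n, sizeof *done);
    if (done == NULL)
        return -1;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done[i])
                continue;
            if (best < 0 ||
                labs((long)req[i] - head) < labs((long)req[best] - head))
                best = i;
        }
        total += labs((long)req[best] - head);
        head = req[best];
        done[best] = true;
    }
    free(done);
    return total;
}

static const int demo_queue[] = { 98, 183, 37, 122, 14, 124, 65, 67 };

long demo_fcfs(void) { return fcfs_seek(53, demo_queue, 8); }
long demo_sstf(void) { return sstf_seek(53, demo_queue, 8); }
```

For this queue FCFS moves the head across 640 cylinders and SSTF across 236, at the price of possible starvation of far-away requests under SSTF.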
Terminal Driver
interactive : line discipline
canonical mode, raw mode (stty)
cblock
process
raw queue (clists)
tty_read
canon queue
tty_write
out queue
tty driver
interrupt
xbuf
rbuf
CSR
in/out
General structure of Device Driver
well defined entry point
top half, bottom half
character device driver
block device driver
open
open
close
close
read
in/out
strategy
write
ioctl
in/out
size
intr
intr
mmap
what’s the difference between character and block device driver?
Device Switch Table
devsw: table for registering the entry points of device drivers
struct cdevsw {
int (*d_open) ();
int (*d_close) ();
int (*d_read) ();
int (*d_write) ();
int (*d_ioctl) ();
int (*d_mmap) ();
int (*d_segmap) ();
int (*d_xpoll) ();
int (*d_xhalt) ();
struct streamtab *d_str;
struct ttytab *d_tty;
….
} cdevsw[];
struct bdevsw {
int (*d_open) ();
int (*d_close) ();
int (*d_strategy) ();
int (*d_size) ();
int (*d_xhalt) ();
….
} bdevsw[]
(Source : UNIX Internals)
Device Switch Table (Cont`)
Example of switch table
bdevsw
cdevsw
hd_open
hd_close
hd_strategy
con_open con_close con_read con_write con_ioctl
ht_open
ht_close
ht_strategy
tty_open
tty_close
tty_read
tty_write tty_ioctl
cd_open
cd_close
cd_strategy
ed_open
ed_close
ed_read
ed_write ed_ioctl
nulldev
nulldev
mm_read mm_write nulldev
hd_open
hd_close
hd_read hd_write nulldev
dev file
#ls -l /dev/
brw-r--r-- 0 1 hda1
brw-r--r-- 0 2 hda2
….
brw-r--r-- 0 11 hdb1
brw-r--r-- 1 0 tape
….
crw-r--r-- 1 0 tty0
crw-r--r-- 1 1 tty1
….
crw-r--r-- 5 0 rhda1
why do we access disks through character interface?
Device Switch Table (Cont`)
example : open
open("/dev/tty0", O_RDONLY)
proc table
fd
file table
inode
i_dev : c, 1,0
cdevsw
con_open con_close con_read con_write con_ioctl
tty_open
tty_close
tty_read
tty_write tty_ioctl
ed_open
ed_close
ed_read
ed_write ed_ioctl
nulldev
nulldev
mm_read mm_write nulldev
gd_open
gd_close
gd_read gd_write nulldev
(*cdevsw[getmajor(dev)].d_open) (dev, …)
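The indirect call through the switch table can be mimicked in user space. The sketch below is a simplified model, not kernel code: the 8-bit major/minor split, the table contents and every name ending in _demo are assumptions made for the example.

```c
#include <stddef.h>

/* Device number: major in the high bits, minor in the low 8 bits. */
#define MKDEV_DEMO(major, minor) (((unsigned)(major) << 8) | (unsigned)(minor))
#define MAJOR_DEMO(dev)          ((dev) >> 8)
#define MINOR_DEMO(dev)          ((dev) & 0xffu)

/* One row per major number, one function pointer per entry point. */
struct cdevsw_demo {
    int (*d_open)(unsigned dev, int flags);
};

/* Stand-in drivers: return a driver-specific base plus the minor number. */
static int con_open(unsigned dev, int flags)
{
    (void)flags;
    return 100 + (int)MINOR_DEMO(dev);
}

static int tty_open(unsigned dev, int flags)
{
    (void)flags;
    return 200 + (int)MINOR_DEMO(dev);
}

static int nulldev(unsigned dev, int flags)
{
    (void)dev; (void)flags;
    return 0;
}

static const struct cdevsw_demo cdevsw_demo[] = {
    { con_open },   /* major 0 */
    { tty_open },   /* major 1 */
    { nulldev  },   /* major 2 */
};

/* Generic open: pick the driver row by major number and call through it. */
int dev_open(unsigned dev, int flags)
{
    unsigned major = MAJOR_DEMO(dev);
    if (major >= sizeof cdevsw_demo / sizeof cdevsw_demo[0])
        return -1;  /* no such device */
    return (*cdevsw_demo[major].d_open)(dev, flags);
}
```

The caller never names a driver: dev_open(MKDEV_DEMO(1, 3), 0) reaches tty_open only because major 1 selects that row, which is why installing a driver amounts to filling in a table entry.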
Device Switch Table (Cont`)
install new device driver
write the new device driver and link it into the kernel
my_open(), my_read(), my_write(), my_close(), ….
register devsw table
make special file
# mknod /dev/mydrv [b|c] major_number minor_number
Device Switch Table (Cont`)
control flow
user mode
read()
kernel
queue
devsw table
wakeup
sleep
interrupt
handler
driver
IVT
device
where does the requesting process sleep?
STREAM
full-duplex data transfer and processing path
consists of a pair of queues
user application
STREAM head
user
kernel
W
R
W
R
STREAM module
W
R
W
R
STREAM driver
hardware
STREAM (Cont`)
user
user
STREAM head
STREAM head
TCP
UDP
IP
IP
token ring
ethernet
user
user
user
STREAM head STREAM head STREAM head
TCP
UDP
IP
ATM
Reusable Module
Multiplexing
DQDB
STREAM (Cont`)
STREAM features
transparency among the queues
reusable
multiplexing
message based communication
virtual copying
STREAM scheduler : priority bands
Part II. Detailed Study:
Linux Kernel Internals
Contents
why Linux?
where is everything (kernel source code) ?
kernel configure and compile
system call implementation
module programming
some important kernel data structures
References
M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, “Linux Kernel Internals, 2nd Ed”, Addison-Wesley, 1997
Fred Butzen, Christopher Hilton, “The LINUX Network”, The M&T Books Slackware Series, 1998
Remy Card, etc, “The Linux Kernel Book”, John Wiley & Sons, 1998
A. Rubini, “Linux Device Drivers”, O’Reilly, 1998
Anonymous, “Maximum Linux Security (A Hacker’s Guide To Protecting Your Linux Server and WS)”, SAMS Publishing, 1999
http://www.linux.org/
http://www.kernel.org/
http://kldp.org/
/usr/src/linux
Why Linux?
freely available
Linus Torvalds, Copyleft
1991 version 0.01 (November 1999, version 2.2.13)
Redhat, Debian, Slackware, Alzza
supported by many companies
Main characteristics
multi-tasking
multi-user access
multi-processor
support various architecture (80*86, sparc, mips, alpha, smp, ..)
demand load executables
paging
dynamic cache for hard disk
Why Linux? (Cont`)
main characteristics (cont`)
shared library
support for POSIX 1003.1
various formats for executable files
true 386 protected mode
emulating maths co-processor
support for national keyboards and fonts
support diverse file system (ext2, ..)
TCP/IP, SLIP, PPP
BSD sockets
System V IPC
Virtual Console
Why Linux? (Cont`)
drawbacks
monolithic kernel (micro-kernelization is being explored in many research projects)
not for beginners (for system programmers)
not well structured (performance-oriented)
Key attraction
‘experimenting’ with the system (handle the kernel by yourself)
supported by many companies
free: solution business & add on features
thanks to the INTERNET & GNU (special thanks to Anti-MS feeling)
Where is everything?
Linux Operating System Structure
user level
application
System Calls Interface
Central kernel
File System
ext2fs xiafs
minix nfs
iso9660
kernel level
proc
msdos
Buffer Cache
task management
scheduler
signals
memory management
loadable modules
Peripheral Manager
block
hd
network
Network Manager
ipv4
ethernet
…….
character
cdrom isdn
scsi
pci
Machine Interface
Machine
H/W level
(Source : the LINUX KERNEL book)
Where is everything? (Cont`)
source structure
based on version 2.2.5
under development : the contents described below may be changed
/usr/src/linux
    Documentation
    arch: alpha, arm, i386, m68k, mips, ppc, sparc
        arch/i386: boot, kernel, lib, math-emu, mm
    drivers: block, cdrom, char, net, pci, pnp, sbus, scsi, sound, video
    fs: coda, ext2, hpfs, msdos, nfs, ntfs, ufs
    include: asm-alpha, asm-arm, asm-i386, linux, net, scsi, video
    init
    ipc
    kernel
    lib
    mm
    net: 802, appletalk, decnet, ethernet, ipv6, sunrpc, unix, x25
    scripts
Where is everything? (Cont`)
main subdirectory
arch/
architecture dependent codes : arch/i386, arch/alpha, ….
arch/i386/boot/
– bootstrapping
– configure devices, memory
arch/i386/kernel/
– kernel entry point handling (trap/interrupt handling)
– context switch
arch/i386/mm/
– machine dependent memory management code
init/
all the functions needed to start the kernel
hand-made process 0 (init_task or task[0])
fork process 1, 2, 3, ...
Where is everything? (Cont`)
main subdirectory
kernel/ (arch/i386/kernel)
central section of the kernel
main system call implementation (fork, exit, etc.)
time management
scheduler
signal handling
mm/
virtual memory interface
paging, kernel memory management
fs/
virtual file system interface
implementations of the various file systems (ext2, nfs,...)
Where is everything? (Cont`)
main subdirectory
drivers/
drivers for hardware components
drivers/block/ : block-oriented driver(hard disks)
drivers/cdrom/ : proprietary CD-ROM drives
drivers/char/ : character-oriented driver (serial ports, tty, modem, ..)
drivers/net : network cards
drivers/pci/ : PCI bus access and control
drivers/scsi/ : SCSI interface
drivers/sound/ : sound card drivers
ipc/
classical inter-process communication
semaphores, shared memory, message queues
Where is everything? (Cont`)
main subdirectory
net/
various network protocol implementations : TCP/IP, ARP, ...
code for sockets to the UNIX and Internet domains
lib/
some standard kernel library functions (printk)
modules/
kernel module files
modules can be added to the kernel later (insmod, rmmod)
include/
commonly included kernel-specific header files
include/asm-i386/ : architecture-dependent header files for Intel CPU
include/linux/ : Linux kernel internal structure (task, inode)
Kernel Configuration and Compile
new kernel is generated in three steps
1. configure (Documentation/Configuration.help, see chapter 3 of “The
LINUX Network”)
make config (menuconfig, xconfig)
make oldconfig
2. depend
make dep (make clean:optional)
3. compile
make zImage
cf) - make zdisk (#dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0)
- make zlilo (#cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz)
/etc/lilo.conf
- #mkbootdisk --device /dev/fd0 zImage
Add New System Call
System Call : Control flow in Linux
Kernel
user process
sys_call_table /* arch/i386/kernel/entry.S */
do system call
real system call function
libc.a
idt_table /* arch/i386/kernel/traps.c*/
push args
save system call number
make trap
system call handler
system_call () /*arch/i386/kernel/entry.S */
catch trap through IDT
call real handler function
using sys_call_table
Add New System Call (Cont`)
IDT (Interrupt Descriptor Table)
define : include/asm-i386/desc.h, arch/i386/kernel/traps.c, irq.h
constructed during kernel initialization /*arch/i386/kernel/traps.c, irq.c*/
idt_table
0x0 divide_error() /* 0x0 - 0x1f: common trap handlers for 80x86 */
debug()
nmi()
….
segment_not_present()
….
page_fault ()
….
0x20 timer_interrupt() /* FIRST_EXTERNAL_VECTOR: device interrupt handlers (IRQ) */
hd_interrupt()
….
0x80 system_call() /* SYSCALL_VECTOR */
….
0xff
Add New System Call (Cont`)
sys_call_table
sys_call_table
syscall number : include/asm-i386/unistd.h
#define __NR_exit 1
#define __NR_fork 2
#define __NR_read 3
….
#define __NR_vfork 190
sys_call_table : arch/i386/kernel/entry.S
ENTRY(sys_call_table)
.long SYMBOL_NAME(sys_ni_syscall) /* 0 */
.long SYMBOL_NAME(sys_exit) /* 1 */
.long SYMBOL_NAME(sys_fork) /* 2 */
.long SYMBOL_NAME(sys_read) /* 3 */
….
.long SYMBOL_NAME(sys_vfork) /* 190 */
.rept NR_syscalls-190
table layout: 0 sys_ni_syscall(), 1 sys_exit(), 2 sys_fork(), 3 sys_read(), 4 sys_write(), ….., 190 sys_vfork(), …., 255
Add New System Call (Cont`)
put them altogether : example of fork
user process
main()
{
….
fork()
}
libc.a
fork()
{
….
movl $2, %eax
int $0x80
….
}
Kernel
IDT: 0x0 divide_error(), debug(), nmi(), …., 0x80 system_call(), ….
ENTRY(system_call) /* entry.S */
SAVE_ALL
….
call *SYMBOL_NAME(sys_call_table)(,%eax,4)
….
sys_call_table: …., 1 sys_exit(), 2 sys_fork(), 3 sys_read (), 4 sys_write (), ….
sys_fork() /* arch/i386/kernel/process.c */
…. /* kernel/fork.c */
Add New System Call (Cont`)
Syntax of real system call handler in Linux
asmlinkage int sys_fork(regs) /* arch/i386/kernel/process.c */
{
return do_fork(..);
}
int do_fork(..)
/* kernel/fork.c */
{
….
/* create new process */
}
asmlinkage int sys_read(fd, buf, count) /* fs/read_write.c */
{
…..
/* read data */
}
Add New System Call (Cont`)
Example: add new system call 1 (a deliberately simple example)
1. kernel modification
1-1. allocate syscall number : include/asm-i386/unistd.h
#define __NR_exit 1
….
#define __NR_vfork 190
#define __NR_mysyscall 191
1-2. register sys_call_table : arch/i386/kernel/entry.S
ENTRY(sys_call_table)
…..
.long SYMBOL_NAME(sys_mysyscall) /* 191 */
.rept NR_syscalls-191
Add New System Call (Cont`)
1-3. coding new system call handler
asmlinkage int sys_mysyscall(void)
{
    printk("Hello Linux, I'm in Kernel\n");
    return 0;
}
1-4. kernel rebuild
if you add a new source file, make must be told about it
eg) kernel/test.c
modify the following field in the Makefile of the kernel directory
O_OBJS = sched.o dma.o fork.o ….
… capability.o test.o
Add New System Call (Cont`)
2. make user program with new system call
2-1. make user program
#define _syscall0(type, name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name)); \
__syscall_return(type, __res); \
}
/* include/asm-i386/unistd.h */
#include <linux/unistd.h>
_syscall0(int, mysyscall);
main() {
int i;
i = mysyscall();
}
2-2. make library if possible
#ar, ranlib
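On current systems the _syscallN macros are no longer shipped in the user-visible headers; the C library's generic syscall() wrapper plays the same role. The sketch below is an assumption-laden illustration using an existing call (SYS_getpid) because the number for mysyscall exists only in a modified kernel; raw_getpid is an invented name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Enters the kernel by system call number, bypassing the getpid() wrapper.
 * A new call would be invoked the same way, with its own number (for
 * example 191 on the modified kernel above) in place of SYS_getpid. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

raw_getpid() returns the same value as getpid(), which shows that the library wrapper adds nothing beyond loading the number and trapping.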
Just Do It (seeing it a hundred times is no match for typing it once)
Add New System Call (Cont`)
add new system call2 : arguments passing
1. kernel modification
1-1 #define __NR_show_mult 192
1-2 .long SYMBOL_NAME(sys_show_mult) /* 192 */
.rept NR_syscalls-192
1-3 asmlinkage int sys_show_mult(int x, int y, int *res) {
    int error, compute;
    /* verify_area(): include/asm-i386/uaccess.h */
    if ((error = verify_area(VERIFY_WRITE, res, sizeof(*res))))
        return error;
    compute = x*y;
    put_user(compute, res); /* include/asm-i386/uaccess.h */
    return (0);
}
cf) copy_to_user(), copy_from_user() /* include/asm-i386/uaccess.h */
Add New System Call (Cont`)
add new system call2 : arguments passing
2-1. make user program
#include <stdio.h>
#include <errno.h>
#include <linux/unistd.h>
_syscall3(int, show_mult, int, x, int, y, int *, result);
int main() {
    int ret = 0;
    show_mult(2, 5, &ret);
    printf("Result : %d * %d = %d\n", 2, 5, ret);
    return 0;
}
/* what the _syscall3 line above expands to (include/asm-i386/unistd.h) */
int show_mult (int x, int y, int *result) {
    long __res;
    __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_show_mult),
          "b" ((long) (x)), "c" ((long) (y)),
          "d" ((long) (result)));
    if (__res < 0) {    /* a negative return value carries -errno */
        errno = -__res;
        __res = -1;
    }
    return __res;
}
Add New System Call (Cont`)
add new system call3 : some general system calls
getpid
asmlinkage int sys_getpid() {
    return current->pid;
}
NR_TASKS: number of total concurrent tasks
all tasks connected using doubly linked lists (next_task, next_run)
global variable: init_task, current
task[0]: init_task, task[1]: init process
nice
asmlinkage int sys_nice(new_priority) {
    ….
    current->priority = new_priority ;
}
pause
asmlinkage int sys_pause() {
    current->state = TASK_INTERRUPTIBLE;
    schedule();
}
Add New System Call (Cont`)
fork
/* arch/i386/kernel/process.c */
sys_fork()
/* kernel/fork.c */
do_fork()
/* arch/i386/kernel/process.c */
- p = alloc_task_struct()
- task structure initialize
- copy_mm()….
- copy_thread()
- wake_up_process(p)
- return (p->pid)
copy_thread()
….
- p->tss.eax = 0;
- p->tss.eip = ret_from_fork;
/* kernel/sched.c */
/* arch/i386/kernel/entry.S */
ret_from_sys_call()
wake_up_process()
- add_to_runqueue(p);
- current->need_resched = 1
/* kernel/sched.c */
schedule()
if (schedule parent)
else (schedule child)
Add New System Call (Cont`)
exit
/* kernel/exit.c */
sys_exit()
/* kernel/exit.c */
do_exit()
- sem_exit()
- exit_mmap()
- free_page_tables()
- exit_files()
- exit_thread()
….
….
- handling each child process
- current->state=TASK_ZOMBIE
- schedule()
/* kernel/signal.c */
notify_parent()
Add New System Call (Cont`)
Project II: add a new system call
get kernel information: process id, state, process execution time (system time and user time separately), the number of page faults, the number of open files, and so on
1. kernel modification
asmlinkage int sys_process_statistics(….) {
….
current->pid, min_flt, maj_flt, times.tms_utime, times.tms_stime
….
}
2. user program
Motivation of Module in LINUX
why do we use modules?
Linux is a monolithic kernel
trivial modifications require kernel to be recompiled
kernel is increasing in size by adding new features
many modules occupy permanent space in memory though they are used
rarely
module: steps toward micro-kernelized Linux
small and compact kernel
clean kernel
rapid kernel
solution business: components-based Linux
• e.g.: backup tape driver
What can be Modules ?
what can be modules?
possibly anything
current version (module type: registration functions; main operations)
file system: register_filesystem, unregister_filesystem; read_super, put_super
block device driver: register_blkdev, unregister_blkdev; open, release
character device driver: register_chrdev, unregister_chrdev; open, release
network device driver: register_netdev, unregister_netdev; open, close
exec domain: register_exec_domain, unregister_exec_domain; load_binary, personality
binary format: register_binfmt, unregister_binfmt; load_binary
….
cf: /lib/modules/x.x.x/*.o
How to manipulate modules?
how to manipulate modules?
compilation
# gcc -D__KERNEL__ -D_LINUX -DMODULE -c new_module.c
Enable loadable module support (CONFIG_MODULES) [Y/n/?]
…
MSDOS fs support (CONFIG_MSDOS_FS) [M/n/y/?]
insmod, lsmod, rmmod
#insmod fat
#lsmod
Module: #pages: Used by:
fat 6 0
#rmmod fat
kerneld: for on-demand loading
eg: mount -t msdos /dev/fd0 /mnt => transparent load fat & msdos modules
How to implement modules?
Module
basic two interfaces
init_module()
cleanup_module()
kernel
register_filesystem()
module
insmod
init_module()
register_blkdev()
cleanup_module()
rmmod
register_netdrv()
sock_register()
How to implement modules? (Cont`)
example1 : Hello world!!
/* hello.c */
#include <linux/kernel.h>
#include <linux/module.h>
int init_module() {
    printk("Hello world!! - I'm in kernel\n");
    return 0;
}
void cleanup_module () {
    printk("Bye world - I'm in kernel\n");
}
# gcc -D__KERNEL__ -D_LINUX -DMODULE -c hello.c
#insmod hello.o
#rmmod hello
How to implement modules? (Cont`)
example2 : simple device driver
/* time.c */
#include <linux/kernel.h>
#include <linux/module.h>
#define HOUR_MAJOR 60
#define HOUR_MINOR 0
struct file_operations time_fops = {
    NULL,
    time_read,
    NULL, NULL, NULL, NULL,
    NULL, time_open, NULL, NULL
};
int time_init() {
    register_chrdev(HOUR_MAJOR, "time", &time_fops);
    printk("time module loaded (major=%d)\n", HOUR_MAJOR);
    return 0;
}
int time_read(fd, buf, size) {
    …
    copy_to_user(buf, &CURRENT_TIME, ...); /* copy_to_user(to, from, n) */
}
int init_module () {
    return time_init();
}
int time_open(..) {
    ….
}
void cleanup_module() {
    unregister_chrdev(HOUR_MAJOR, "time");
    printk("time module unloaded \n");
}
How to implement modules? (Cont`)
example2 : simple device driver
#gcc -D__KERNEL__ -D_LINUX -DMODULE -c time.c
#mknod /dev/time c 60 0
#insmod time
#lsmod
Module: #pages: Used by:
time 1
#cat /dev/time /* print current time */
#rmmod time
how can the “cat” command invoke the time_read() function ?
How to implement modules? (Cont`)
example2 : simple device driver
register_blkdev()
init_module
/* include/linux/major.h */
time_init()
register_chrdev(HOUR_MAJOR, “time”, &time_fops);
register_chrdev()
- chrdevs[major].name = “time”
- chrdevs[major].fops = time_fops
How to implement modules? (Cont`)
example2 : simple device driver
open
sys_open()
- get_unused_fd()
- fd_install(fd, f)
filp_open()
/* fs/namei.c */
open_namei()
- struct file initialize
- f->f_op->open()
/* fs/device.c */
time_open()
chrdev_open()
pipe_open()
blkdev_open()
socket_open()
nfs_open()
inside chrdev_open():
- filp->f_op = get_chrfops(MAJOR(inode->i_rdev));
/* filp->f_op = chrdevs[major].fops */
- filp->f_op->open;
How to implement modules? (Cont`)
example2 : simple device driver
read
/* fs/read_write.c */
sys_read()
- f->f_op->read
nfs_read()
pipe_read()
time_read()
tty_read()
/* fs/block_dev.c */
block_read()
How to implement modules? (Cont`)
example3 : system call wrapper
#include <linux/kernel.h>
#include <linux/module.h>
#include <sys/syscall.h>
#include <linux/sched.h>
#include <asm/uaccess.h>
extern void *sys_call_table[];
int uid;
asmlinkage int (*original_call) (const char *, int, int);
asmlinkage int (*getuid_call) ( );
asmlinkage int our_sys_open(const char *, int, int); /* the replacement handler */
int init_module ( ) {
    original_call = sys_call_table[__NR_open];
    sys_call_table[__NR_open] = our_sys_open;
    printk("Spying on UID: %d\n", uid);
    getuid_call = sys_call_table[__NR_getuid];
    return 0;
}
void cleanup_module ( ){
    if (sys_call_table[__NR_open] != our_sys_open)
        printk("somebody else also replaced the open system call\n");
    sys_call_table[__NR_open] = original_call; /* always restore the original */
}
How to implement modules? (Cont`)
example3 : system call wrapper
asmlinkage int our_sys_open(const char *fname, int flags, int mode) {
    int i=0;
    char ch;
    if (uid == getuid_call()) {
        printk("opened file by %d: ", uid);
        do {
            get_user(ch, fname+i); /* fetch one byte from user space */
            i++;
            printk("%c", ch);
        } while (ch != 0);
        printk("\n");
    }
    return original_call(fname, flags, mode);
}
How to implement modules? (Cont`)
example4 : new file system
design super block
program file operations, program inode operations
registering : register_filesystem()
#ifdef CONFIG_MINIX_FS
register_filesystem(&(struct file_system_type)
{minix_read_super, “minix”, 1, NULL});
#endif
mount
struct file_system_type {
struct super_block *(*read_super) ();
char *name;
int requires_dev;
struct file_system_type *next;
} *file_system;
How to implement modules? (Cont`)
Project III
implement your own modules make file operations
make module interface
make driver
mknod (use pseudo device such as memory)
init_module()
cleanup_module()
mydrv_init()
mydrv
mydrv_open()
mydrv_interrupt()
mydrv_release()
mydrv_out()
mydrv_read()
mydrv_write()
mydrv_ioctl()
How to implement modules? (Cont`)
system call for modules
create_module
memory allocation for module (return load address)
a new element for module_list
init_module
physical loading of requesting module (module functions become an
integral part of kernel)
relocating module functions and solving references of kernel symbols
call module specific init_module function
delete_module
get_kernel_syms
to get kernel symbols
How to implement modules? (Cont`)
Kernel data structure for create_module()
module_list
module
module
next
ref
symtab
name
...
next
ref
symtab
name
...
size
size
references
symbol table
for this module
references
symbol table
for this module
Control flow of FS system call
file access under Linux /* include/linux/sched.h, fs.h */
inode
fs_struct
task structure
…
fs
files
...
count
umask
*root
*pwd
inode
file
f_mode
f_pos
f_flag
f_count
f_owner
f_inode
f_op
f_version
file_struct
count
close_on_exec
fd[0]
fd[1]
…
fd[255]
why do we need the file data structure ?
inode
file operation
routines
Control flow of FS system call (Cont`)
Why do we need file data structure
=> to support various types of files with a single coherent interface
open
/* fs/open.c */
sys_open()
- get_unused_fd()
- fd_install(fd, f)
/* fs/open.c */
filp_open()
/* fs/namei.c */
open_namei()
- struct file initialize
- f->f_op->open()
/* to support various file */
Control flow of FS system call (Cont`)
struct file /* include/linux/fs.h */
f_next, f_prev
f_dentry /* to access inode */
f_op
f_mode /* access type */
f_pos /* file offset */
f_count /* reference count */
f_flags
f_reada, f_ramax
...
file operation example
fs/ext2/file.c
ext2_file_lseek,
generic_file_read,
ext2_file_write
NULL, NULL,
ext2_file_ioctl
generic_file_mmap
NULL, …….
fs/ufs/file.c
ufs_file_lseek,
generic_file_read,
ufs_file_write
NULL, NULL,
NULL,
generic_file_mmap
NULL, …….
fs/nfs/file.c
NULL,
nfs_file_read,
nfs_file_write
NULL, NULL,
NULL,
nfs_file_mmap
nfs_file_open, ……
where is create()?
include/linux/fs.h
lseek()
read()
write()
readdir()
poll()
ioctl()
mmap()
open()
flush()
release()
fsync()
fasync()
…..
fs/pipe.c
pipe_lseek, pipe_read,
pipe_write
NULL, pipe_poll,
pipe_ioctl,
NULL,
pipe_rdwr_open, ...
/* net/socket.c */
sock_lseek
sock_read
sock_write
NULL
sock_poll
sock_ioctl
NULL
sock_no_open
….
fs/device.c
NULL,
NULL,
NULL,
NULL, NULL,
NULL,
NULL
blkdev_open, …….
Control flow of FS system call (Cont`)
open
/* fs/open.c */
System call layer
sys_open()
- get_unused_fd()
- fd_install(fd, f)
/* fs/open.c */
filp_open()
- struct file initialize
- f->f_op->open()
/* fs/namei.c */
open_namei()
VFS layer
Specific File layer
iget(), bread()
pipe_rdwr_open()
sock_no_open()
nfs_file_open()
blkdev_open()
chrdev_open()
Control flow of FS system call (Cont`)
read
System call handling
layer
/* fs/read_write.c */
sys_read()
- f->f_op->read
sock_read()
block_read()
pipe_read()
nfs_file_read()
VFS layer
/* mm/filemap.c */
generic_file_read()
tty_read()
Specific File layer
- try to find page in page cache, if (hit) OK.
- get_free_page()
- inode->i_op->readpage()
Control flow of FS system call (Cont`)
inode structure in Linux /* include/linux/fs.h, ext2_fs_i.h */
inode
task
….
fd[]
….
file
….
f_dentry
….
f_pos
f_op
dentry
d_inode
inode operation
routines
File specific information
….
i_ino
i_dev
i_count
i_mode
i_nlink
i_uid, gid
……
i_atime, ...
i_rdev
i_op
i_data[15]
i_flags
i_….
device driver
Control flow of FS system call (Cont`)
inode operation example
...
i_op
...
fs/ext2/file.c
ext2_file_operations,
NULL, NULL,
NULL, NULL,
...
generic_readpage
NULL
ext2_bmap,
…….
include/linux/fs.h
def_file_operation
create(), lookup()
link(), unlink(), symlink()
mkdir(), rmdir()
mknod(), rename(),
readlink(), followlink()
readpage(), writepage()
bmap(), truncate(),
…….
fs/ufs/file.c
fs/nfs/file.c
ufs_file_operations,
NULL, NULL,
NULL, NULL,
...
generic_readpage
NULL
ufs_bmap,
…….
nfs_file_operations,
NULL, NULL,
NULL, NULL,
...
nfs_readpage
nfs_writepage
NULL
…….
fs/dos/files.c
dos_file_operations,
NULL, NULL,
NULL, NULL,
…
dos_readpage,
dos_writepage,
NULL,
…….
fs/pipe.c
rdwr_pipe_fops,
NULL, NULL,
NULL, NULL,
...
fs/devices.c
def_blk_fops,
NULL, NULL,
NULL, NULL,
...
Control flow of FS system call (Cont`)
read
System call handling
layer
/* fs/read_write.c */
sys_read()
- f->f_op->read
sock_read()
pipe_read()
VFS layer
block_read()
/* mm/filemap.c */
generic_file_read()
tty_read()
Specific File layer
- try to find page in cache, if (hit) OK.
- inode->i_op->readpage()
nfs_readpage()
/* fs/buffer.c */
/* fs/ext2/inode.c */
ext2_bmap()
/* fs/ufs/inode.c */
ufs_bmap()
generic_readpage()
dos_readpage()
Specific FS layer
coda_readpage()
/* drivers/block/ll_rw_blk.c */
ll_rw_block()
/* drivers/block/hd.c */
hd_request
Device Driver layer
Device Driver Implementation in Linux
data structure
blkdevs, chrdevs for devsw
blk_dev_struct for block driver only
file_operations
/* fs/devices.c */
lseek
read, write, readdir
poll, ioctl, mmap,
open, flush, release
fsync, fasync
…..
struct device_struct {
name;
fops;
} chrdevs[], blkdevs[];
/* include/linux/blkdev.h */
struct blk_dev_struct {
request_fn;
queue;
request;
...
} blk_dev[];
Driver Implementation in Linux (Cont`)
buffer_head
b_dev
b_blocknr
b_state
b_count
b_size
...
b_next
b_data
data structure (cont`)
chrdevs[]
name
fops
file_operations
blkdev
request
rq_status
rq_dev
cmd
…
sem
bh
tail
next
request_fn
current_request
request
rq_status
rq_dev
cmd
…
sem
bh
tail
next
request
Driver Implementation in Linux (Cont`)
Example of structure of driver: IDE disks
hd_init()
hd_open()
hd_interrupt()
hd_release()
hd_out()
drivers/block/hd.c
hd_request()
check_status()
hd_ioctl()
NULL,
block_read,
block_write
NULL, NULL,
hd_ioctl,
NULL,
hd_open,
NULL
hd_release,
block_fsync
struct file_operations hd_fops
Driver Implementation in Linux (Cont`)
major number /* include/linux/major.h */
Major   Character devices         Block devices
0       -                         -
1       mem                       RAM disk
2       -                         floppy (fd*)
3       -                         IDE hard disk (hd*)
4       terminal                  -
5       terminal & AUX            -
6       Parallel Interface        -
7       virtual console (vcs*)    -
8       -                         SCSI hard disk (sd*)
9       SCSI tapes (st*)          -
………
23      -                         Mitsumi CD-ROM (mcd*)
….
Driver Implementation in Linux (Cont`)
initialization of disk driver
register_blkdev()
init_module
init process
/* drivers/block/hd.c */
hd_init()
/* include/linux/major.h */
- register_blkdev(HD_MAJOR, "hd", &hd_fops);
- blk_dev[HD_MAJOR].request_fn = hd_request
/* fs/devices.c */
register_blkdev()
- blkdevs[major].name = device name
- blkdevs[major].fops = fops
Driver Implementation in Linux (Cont`)
disk driver open
/* fs/open.c */
sys_open()
- get_unused_fd()
- fd_install(fd, f)
/* fs/open.c */
filp_open()
/* fs/namei.c */
open_namei()
- struct file initialize
- f->f_op->open()
/* drivers/block/hd.c */
/* fs/devices.c */
hd_open()
blkdev_open()
pipe_open()
chrdev_open()
socket_open()
nfs_open()
- filp->f_op = get_blkfops(MAJOR
(inode->i_rdev));
/* filp->f_op = blkdevs[major].fops */
- filp->f_op->open; /* hd_open */
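Registration and lookup by major number can be sketched as a plain table indexed by major. This is a user-space illustration under stated assumptions: the demo_ names, the table size, and the "reject a major already owned by another driver" rule are all hypothetical simplifications, not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEMO_MAJOR 64   /* assumption: table size chosen for this sketch */

/* Hypothetical one-slot operations table. */
struct demo_fops {
    int (*open)(void);
};

/* Same shape as struct device_struct on the slide: a name and a table. */
struct demo_device_struct {
    const char *name;
    const struct demo_fops *fops;
};

static struct demo_device_struct demo_blkdevs[MAX_DEMO_MAJOR];

/* Returns 0 on success, -1 if the major is out of range or already taken. */
int demo_register_blkdev(unsigned int major, const char *name,
                         const struct demo_fops *fops)
{
    assert(name != NULL && fops != NULL);
    if (major == 0 || major >= MAX_DEMO_MAJOR)
        return -1;
    if (demo_blkdevs[major].fops != NULL && demo_blkdevs[major].fops != fops)
        return -1;
    demo_blkdevs[major].name = name;
    demo_blkdevs[major].fops = fops;
    return 0;
}

/* Lookup used at open time: major number -> operations table. */
const struct demo_fops *demo_get_blkfops(unsigned int major)
{
    if (major >= MAX_DEMO_MAJOR)
        return NULL;
    return demo_blkdevs[major].fops;
}

/* Stand-in for a driver's open routine; the return value is arbitrary. */
static int demo_hd_open(void)
{
    return 3;
}

static const struct demo_fops demo_hd_fops = { demo_hd_open };
```

A driver would call demo_register_blkdev(3, "hd", &demo_hd_fops) once at initialization; the generic open path then only needs the major number taken from the inode.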
Driver Implementation in Linux (Cont`)
disk driver read
/* fs/read_write.c */
sys_read()
- f->f_op->read
/* mm/filemap.c */
nfs_read()
pipe_read()
generic_file_read()
tty_read()
/* fs/block_dev.c */
block_read()
- getblk(); /* buffer header */
/* drivers/block/ll_rw_blk.c */
ll_rw_block()
make_request()
- request structure initialize
add_request()
- call blk_dev[major].request_fn
/* drivers/block/hd.c */
hd_request()
- hd_out()
Driver Implementation in Linux (Cont`)
queue and requests (similar to message queue)
requests are sorted by sector number
inb, outb
/* include/linux/blkdev.h */
struct blk_dev_struct {
request_fn;
queue;
request;
...
} blk_dev[];
bread
block_read
struct request {
rq_status
rq_dev
cmd /* R/W */
error
sector, nr_sector
buffer, bh
sem
next
...
}
request_fn
hd_request
queue
buffer
cache
req
req
ll_rw_block
make_request
req
block
device
driver
do I/O
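Keeping requests sorted by sector number can be sketched as an ordered insert into a singly linked list. This is a simplified illustration: demo_request and demo_add_request are hypothetical names, and plain ascending order is an assumption made for the sketch (the kernel's own ordering rule is more involved).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced request: only the fields needed for ordering. */
struct demo_request {
    unsigned long sector;
    struct demo_request *next;
};

/* Insert req so that the list stays in ascending sector order.
 * Requests with equal sectors keep their arrival order. */
void demo_add_request(struct demo_request **head, struct demo_request *req)
{
    struct demo_request **pp;

    assert(head != NULL && req != NULL);
    pp = head;
    while (*pp != NULL && (*pp)->sector <= req->sector)
        pp = &(*pp)->next;
    req->next = *pp;
    *pp = req;
}
```

Sorting by sector lets the driver service requests in one sweep across the disk instead of seeking back and forth in arrival order.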
Driver Implementation in Linux (Cont`)
various disks and partitions
gendisk
gendisk_head
gendisk
gendisk
8
major
“sd”
name
minor_shift
max_p
part
….
real_devices
next
3
major
“ide0”
name
minor_shift
hd_struct
max_p
part
start_sect
….
nr_sects
real_devices
...
next
...
start_sect
nr_sects
Driver Implementation in Linux (Cont`)
tty driver
register_chrdev()
init_module
init process
drivers/char/tty_io.c
tty_lseek,
tty_read,
tty_write
NULL,
tty_poll
tty_ioctl,
NULL,
tty_open,
NULL
tty_release,
NULL
tty_fasync
/* drivers/char/tty_io.c */
tty_init()
/* include/linux/major.h */
- register_chrdev(TTY_MAJOR, "tty", &tty_fops);
/* fs/devices.c */
register_chrdev()
- chrdevs[major].name = device name
- chrdevs[major].fops = fops
Driver Implementation in Linux (Cont`)
Example of network driver : 3c509
different from disk and tty driver
not directly interface with VFS
/* drivers/net/3c509.c */
el3_init()
ip_output()
ip_rcv()
el3_open()
el3_start_xmit()
el3_out()
el3_stop()
el3_interrupt()
el3_release()
Driver Implementation in Linux (Cont`)
Example of network driver : 3c509
/* include/linux/netdevice.h */
struct device {
name
mem_end, mem_start
base addr /* port number */
…
init, destructor
….
device_addr
qdisc /* sk_buff */
….
open, stop
hard_start_xmit, hard_header
…
irq
}
init_module() in 3c509
/* drivers/net/3c509.c */
/* register_netdev() */
init port, irq, …
make dev structure
dev->init=el3_init
dev->open=el3_open
dev->hard_start_xmit =
el3_start_xmit
...
el3_open()
….
request_irq(dev->irq, el3_interrupt
Task Scheduling
LINUX scheduling
clock tick is 10msec, time quantum is 10 clock ticks
support REAL-TIME task
variables for scheduling in task structure
p_policy : task type /* include/linux/sched.h */
– SCHED_FIFO, SCHED_RR, SCHED_OTHER
p_priority
– set to DEF_PRIORITY (20) /* include/linux/sched.h */
– can be changed using sys_nice() or sys_setpriority();
p_counter
– decremented on each clock tick
– reset to priority (counter = priority) once the counters of all tasks reach zero
need_resched : set when rescheduling is needed on return from a system call or interrupt
rt_priority
– set using sched_setscheduler(pid, policy, sched_param) system call
– used to set real time tasks (static priority)
Task Scheduling (Cont`)
schedule() function /* kernel/sched.c */
need_resched
sleep_on
schedule
- schedule real-time tasks first (rt_priority)
- select the task with the highest value of
counter + priority (using the goodness function)
give an advantage to a task that last ran on this CPU (this_cpu)
give a slight advantage to a task that shares the current mm object
- if (p_counter == 0) for all tasks
p_counter = p_priority
- context switch : switch_to (current, next) /* arch/i386/kernel/process.c */
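The selection rule above can be sketched in a few lines of user-space C. This is a simplified model under stated assumptions: demo_task, demo_goodness and demo_schedule are hypothetical names, real-time tasks and the CPU/mm bonuses are left out, and a task whose counter is zero is given weight 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced task: only the two scheduling fields. */
struct demo_task {
    int counter;    /* ticks left in the current epoch */
    int priority;   /* value the counter is refilled with */
};

/* Higher is better; 0 means the task has used up its quantum. */
static int demo_goodness(const struct demo_task *t)
{
    if (t->counter == 0)
        return 0;
    return t->counter + t->priority;
}

/* Returns the index of the task to run next.
 * If every counter is zero, all counters are refilled from priority
 * and the selection is repeated once. */
size_t demo_schedule(struct demo_task *tasks, size_t n)
{
    assert(tasks != NULL && n > 0);
    for (int pass = 0; pass < 2; pass++) {
        size_t best = 0;
        int best_w = 0;

        for (size_t i = 0; i < n; i++) {
            int w = demo_goodness(&tasks[i]);
            if (w > best_w) {
                best_w = w;
                best = i;
            }
        }
        if (best_w > 0)
            return best;
        for (size_t i = 0; i < n; i++)   /* start a new epoch */
            tasks[i].counter = tasks[i].priority;
    }
    return 0;   /* only reached if every priority is zero */
}
```

Because the refill happens only when no task has ticks left, a task that sleeps keeps its unused ticks and is favoured when it wakes up within the same epoch.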
Task Scheduling (Cont`)
Example of scheduling
3 tasks
millisecond   T1 p_pri  T1 p_count.   T2 p_pri  T2 p_count.   T3 p_pri  T3 p_count.
0             20        20            20        20            20        20
10            20        10            20        20            20        20
20            20        10            20        10            20        20
30            20        10            20        10            20        10
40            20        0             20        10            20        10
50            20        0             20        0             20        10
60            20        20            20        20            20        20
Signal
a mechanism for notifying a process of an asynchronous event
types of signal : SIGKILL, SIGINT, SIGBUS, SIGUSR1, ….
action : abort, exit, ignore, stop, user level catch function
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void sig_handler(int signo)
{
    signal(SIGUSR1, sig_handler);            /* reinstall */
    printf("received signal %d\n", signo);   /* handle the signal */
    …..
}

int main(void)
{
    signal(SIGUSR1, sig_handler);            /* install the handler */
    ….
    for ( ; ; )
        pause();
}
what’s the difference among interrupt, trap, and signal?
Signal (Cont`)
register signal handler (signal catch function )
send signal
signal detection : state transition from kernel running to user running
call signal handler
variables for signal in task structure
int sigpending : is signal received or not?
struct signal_struct *sig
sigset_t signal, blocked
typedef struct {
unsigned long sig[_NSIG_WORDS];
} sigset_t; /* asm-i386/signal.h */
struct sigaction /* asm-i386/signal.h */
struct signal_struct /* sched.h */
count
action[_NSIG]
siglock
sa_handler
sa_flags
sa_restorer
sa_mask
Signal (Cont`)
register signal catch function
task
….
sig
signal, blocked
sigpending
….
signal_struct
count
action[_NSIG]
siglock
sigset_t
….
63
sigaction
sa_handler
sa_flags
sa_restorer
sa_mask
sigset_t
….
0
/* kernel/signal.c */
sys_signal(sig, handler)
do_sigaction(sig, new_sa, old_sa)
63
0
Signal (Cont`)
send signal
task
….
sig
signal, blocked
sigpending
….
signal_struct
count
action[_NSIG]
siglock
sigset_t
….
63
sigaction
sa_handler
sa_flags
sa_restorer
sa_mask
sigset_t
….
0
63
0
/* kernel/signal.c */
sys_kill(pid,sig)
kill_proc_info(sig, info, pid)
send_sig_info(sig, info, *t)
sigaddset(t->signal, sig);
t->sigpending = 1;
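The two steps of sending a signal (set the bit, raise the flag) can be sketched with a small bitmask type. This is a user-space illustration: all demo_ names are hypothetical, 64 signals numbered 1..64 are assumed, and the check against the blocked set before raising sigpending is an assumption added for the sketch.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_NSIG       64
#define DEMO_BPW        (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NSIG_WORDS ((DEMO_NSIG + DEMO_BPW - 1) / DEMO_BPW)

/* Same shape as sigset_t on the slide: an array of machine words. */
typedef struct {
    unsigned long sig[DEMO_NSIG_WORDS];
} demo_sigset_t;

/* Hypothetical, reduced task: only the signal-related fields. */
struct demo_sig_task {
    int sigpending;                  /* is there a deliverable signal? */
    demo_sigset_t signal, blocked;   /* pending set and blocked set */
};

/* Signal numbers start at 1, so signal n lives in bit n-1. */
static void demo_sigaddset(demo_sigset_t *set, int sig)
{
    assert(sig >= 1 && sig <= DEMO_NSIG);
    set->sig[(sig - 1) / DEMO_BPW] |= 1UL << ((sig - 1) % DEMO_BPW);
}

static int demo_sigismember(const demo_sigset_t *set, int sig)
{
    assert(sig >= 1 && sig <= DEMO_NSIG);
    return (int)((set->sig[(sig - 1) / DEMO_BPW] >> ((sig - 1) % DEMO_BPW)) & 1UL);
}

/* Record the signal as pending; flag the task only if it is not blocked. */
void demo_send_sig(struct demo_sig_task *t, int sig)
{
    demo_sigaddset(&t->signal, sig);
    if (!demo_sigismember(&t->blocked, sig))
        t->sigpending = 1;
}
```

Since one bit per signal is all that is stored, sending the same signal twice before it is handled leaves a single pending instance.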
Signal (Cont`)
signal handling
task
….
sig
signal, blocked
sigpending
….
signal_struct
count
action[_NSIG]
siglock
sigaction
sa_handler
sa_flags
sa_restorer
sa_mask
/* arch/i386/kernel/entry.S */
if (current->sigpending)
do_signal();
/* arch/i386/kernel/signal.c */
do_signal(regs, oldset)
signr = dequeue_signal()
handle SIG_IGN
or SIG_DFL
sigset_t
….
63
0
handle_signal()
sigset_t
….
63
setup stack frame
for signal handler
0
Signal (Cont`)
signal handling: state of stack for handling signal
memory
stack
memory
stack
- return address
- arguments
- return address
- arguments
- return address
to kernel
- return address
to sighandler
- arguments
Thread
Motivation (golf course)
Possibility of parallel processing
process is too heavy
process model
address space
P
P
P
CPU
P
P
process
time
(Source : UNIX internals)
Thread (Cont`)
thread model
address space
thread model
CPU
thread
time
(Source : UNIX internals)
task : a set of threads and a collection of resources (passive)
thread : hardware context, stack, thread information (id, scheduling, ..)
Thread (Cont`)
types of threads
kernel thread
LWP (lightweight process) : a kernel supported user thread
user thread : C-thread, P-thread
U
user level scheduler
U
U
U
L
L
K
K
U
U
process (or task)
L
K
K
K
thread scheduler
CPU
CPU
Thread (Cont`)
threads in Linux
struct thread: currently only one in task structure
sys_clone()
fully share the address context such as page directory
under development
can use user level thread (P thread)
/usr/include/pthread.h
pthread_create()
pthread_join()
pthread_mutex_init()
Thread (Cont`)
Example of thread programming
/* gcc prog.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define L 9

typedef struct {
    double volatile *p_s;        /* shared sum */
    pthread_mutex_t *p_s_lock;   /* lock protecting the shared sum */
    int n;                       /* start offset */
} DATA;

double x[L], y[L];
int cpu;                         /* number of threads, also the stride */

void *SMP_scalprod(void *arg)
{
    register double localsum;
    long i;
    DATA D = *(DATA *)arg;

    localsum = 0.0;
    for (i = D.n; i < L; i += cpu)
        localsum += x[i] * y[i];
    pthread_mutex_lock(D.p_s_lock);
    *(D.p_s) += localsum;
    pthread_mutex_unlock(D.p_s_lock);
    return (NULL);
}

int main(int argc, char *argv[]) {
    pthread_t *thread;
    void *retval;
    int i;
    DATA *A;
    volatile double s = 0;
    pthread_mutex_t s_lock;

    if (argc != 2) {
        printf("USAGE: %s, CPU number", argv[0]);
        exit(1);
    }
    cpu = atoi(argv[1]);
    thread = (pthread_t *)calloc(cpu, sizeof(pthread_t));
    A = (DATA *)calloc(cpu, sizeof(DATA));
    for (i = 0; i < L; i++)
        x[i] = y[i] = i;
    pthread_mutex_init(&s_lock, NULL);
    for (i = 0; i < cpu; i++) {
        A[i].n = i;              /* start offset */
        A[i].p_s = &s;
        A[i].p_s_lock = &s_lock;
        pthread_create(&thread[i], NULL, SMP_scalprod, &A[i]);
    }
    for (i = 0; i < cpu; i++)
        pthread_join(thread[i], &retval);
    printf("results = %f\n", s);
}
Data Structure for Virtual Memory
Linux virtual memory structure for each task
global view /* include/linux/sched.h, mm.h, include/asm-i386/page.h */
task_struct
mm
mm_struct
vm_area_struct
map_count
pgd
vm_end
vm_start
vm_flags
…..
mmap
31
11 0
PFN
page directory
vm_file
vm_offset
vm_ops
vm_next
vm area
(data or parts of data)
vm_area_struct
vm_end
vm_start
vm_flags
…..
vm_file
vm_offset
vm_ops
vm_next
vm_area
(text)
Data Structure for Virtual Memory (Cont`)
struct mm_struct
include/linux/sched.h
struct mm_struct {
struct vm_area_struct *mmap;
struct vm_area_struct *mmap_avl, *mmap_cache;
pgd_t *pgd;
atomic_t count; int map_count;
struct semaphore mmap_sem;
unsigned long context;
unsigned long start_code, end_code, start_data;
unsigned long end_data, start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm, def_flags;
unsigned long swap_cnt, swap_address;
void *segment;
}
include/asm-i386/page.h
typedef struct {unsigned long pgd;} pgd_t;
kernel
env_end
arg_end
arg_start
start_stack
stack
brk
end_data
end_code
start_code
bss
data
text
Data Structure for Virtual Memory (Cont`)
pgd_t
task_struct
mm_struct
mm
map_count
pgd
mmap
31
22 21 12 11
0
DIR PAGE offset
31
11 0
31
11 0
PFN
CR3
11
PFN
PFN
page directory
31
page table
0
offset
physical address
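The DIR / PAGE / offset split of a 32-bit virtual address (bits 31-22, 21-12 and 11-0) can be written out directly. This is a minimal sketch of the arithmetic only; the demo_ names are hypothetical and no real page table is walked.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12      /* offset: bits 11..0  */
#define DEMO_PGDIR_SHIFT  22      /* DIR:    bits 31..22 */
#define DEMO_PTRS_PER_TBL 1024u   /* 10 index bits per level */

struct demo_vaddr_parts {
    uint32_t dir;      /* index into the page directory */
    uint32_t page;     /* index into the page table */
    uint32_t offset;   /* byte offset inside the page */
};

/* Split a virtual address into its three fields. */
struct demo_vaddr_parts demo_split_vaddr(uint32_t vaddr)
{
    struct demo_vaddr_parts p;

    p.dir    = vaddr >> DEMO_PGDIR_SHIFT;
    p.page   = (vaddr >> DEMO_PAGE_SHIFT) & (DEMO_PTRS_PER_TBL - 1);
    p.offset = vaddr & ((1u << DEMO_PAGE_SHIFT) - 1);
    assert(p.dir < DEMO_PTRS_PER_TBL);
    return p;
}

/* Combine the page frame number found in the page table entry
 * with the offset to form the physical address. */
uint32_t demo_phys_addr(uint32_t pfn, uint32_t offset)
{
    return (pfn << DEMO_PAGE_SHIFT) | offset;
}
```

For example, the address 0x0804a123 splits into directory index 32, page-table index 74 and offset 0x123.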
Data Structure for Virtual Memory (Cont`)
struct vm_area_struct
need to handle segments (or parts of segment) differently: text/data, share/private
include/linux/mm.h
Virtual Memory Area
struct vm_area_struct {
struct mm_struct *vm_mm;
unsigned long vm_start, vm_end;
struct vm_area_struct *vm_next;
pgprot_t vm_page_prot; /* PAGE_SHARED (COPY, READONLY, KERNEL) */
unsigned short vm_flags;
short vm_avl_height;
struct vm_area_struct *vm_avl_left;
struct vm_area_struct *vm_avl_right;
struct vm_area_struct *vm_next_share;
struct vm_operations_struct *vm_ops;
unsigned long vm_offset;
struct file *vm_file;
unsigned long vm_pte; /* for SVR4 SM */
};
•open(vm_area)
•close(vm_area)
•do_mmap(file, addr, len,
prot, flags, off)
•unmap()
•protect()
•nopage()
•wppage()
•swapout()
•swapin()
Data Structure for Virtual Memory
execve (final) : usually demand paging under Linux
task_struct
mm
mm_struct
vm_area_struct
map_count
pgd
vm_end
vm_start
vm_flags
…..
vm_file
vm_offset
vm_ops
vm_next
a.out (ELF format)
p_type
p_offset
p_vaddr
p_filesz
p_memsz
p_flags
e_ident
…
e_phnum
mmap
program header 1
program header 2
……
code
data
…….
vm area
vm_area_struct
vm_end
vm_start
vm_flags
…..
open(vm_area),
close(vm_area)
do_mmap(file, addr, len,
prot, flags, off)
unmap()
protect()
nopage(), wppage()
…..
vm_file
vm_offset
vm_ops
vm_next
vm_area
Data Structure for Virtual Memory (Cont`)
struct vm_area_struct: AVL (Adelson-Velskii and Landis) tree
vm_area_struct
40007000
0804b000
0804a000
40087000
40009000
40005000
08053000
40008200
c0000000
400b9000
(Source : the LINUX KERNEL book)
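What the tree is used for is address lookup: given an address, find the area that may contain it. The sketch below does the same search over the sorted vm_next list instead of the AVL tree, which is slower but easier to read. demo_vma and demo_find_vma are hypothetical names, and the "first area whose vm_end is above the address" convention is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced area descriptor covering [vm_start, vm_end). */
struct demo_vma {
    unsigned long vm_start, vm_end;
    struct demo_vma *vm_next;   /* list sorted by ascending address */
};

/* Return the first area whose vm_end lies above addr, or NULL.
 * The caller still has to check vm_start <= addr to know whether
 * addr is really inside the area or falls in the gap before it. */
struct demo_vma *demo_find_vma(struct demo_vma *mmap, unsigned long addr)
{
    for (struct demo_vma *v = mmap; v != NULL; v = v->vm_next) {
        assert(v->vm_start < v->vm_end);
        if (addr < v->vm_end)
            return v;
    }
    return NULL;
}
```

A balanced tree gives the same answer in logarithmic time, which matters for tasks with many areas.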
Polling & Interrupt
polling mode
#define LP_B(minor) lp_table[(minor)].base /* IO address */
#define LP_S(minor) inb_p(LP_B((minor)) + 1) /* status port */
#define LP_CHAR(minor) lp_table[(minor)].chars /* busy timeout */
static int lp_char_polled(lpchar, minor)
{
int status = 0;
int count = 0;
….
status=LP_S(minor);
while ((status & LP_PBUSY) && count < LP_CHAR(minor)) {
count++;
if (need_resched)
schedule();
status=LP_S(minor);
};
….
do timeout error handling if necessary (off-line, out of paper, …)
outb_p(lpchar, LP_B(minor));
…
}
Polling & Interrupt (Cont`)
interrupt mode
lp_init()
{
….
request_irq(LP_IRQ, lp_interrupt, 0, "PRINTER");
….
}
static int lp_char(lpchar, minor) {
…
if(…)
outb_p(lpchar, LP_B(minor));
else
interruptible_sleep_on(&lp->lp_wait_q);
...
}
lp_interrupt(int irq, struct pt_regs *regs)
{
….
wake_up_interruptible(&lp->lp_wait_q);
….
}
Polling & Interrupt (Cont`)
Interrupt handling under Linux /* arch/i386/kernel/irq.h, irq.c */
Interrupt_descriptor[]
0
1
status
handler
action
depth
2
status
handler
action
depth
irqaction
handler
flags
name
dev_id
….
next
irqaction
handler
flags
name
dev_id
….
next
irqaction
handler
flags
name
dev_id
….
next
Polling & Interrupt (Cont`)
default IRQ of ISA PC
/* arch/i386/kernel/irq.h, irq.c */
IRQ   Assignment
0     System timer
1     Keyboard controller
2     Second IRQ controller
3     Serial port 2 (COM2)
4     Serial port 1 (COM1)
5     Line printer 2 (LPT2)
6     Floppy-disk controller (controls two disks)
7     Line printer 1 (LPT1)
8     Real-time clock
9     Redirected IRQ2
10    Unused
11    Unused
12    Motherboard (PS/2) mouse port
13    Mathematics coprocessor
14    Hard-disk (IDE) controller 1 (controls two disks)
15    Hard-disk (IDE) controller 2 (controls two disks)
Bottom Half Handling
What is bottom half
to handle long jobs during interrupt handling
top half : request_irq
bottom half : mark_bh(), init_bh() with bh_base data structure
bh_mask_count[32];
struct bh_struct {
void (*routine)();
void *data;
} bh_base[32];
enum {
TIMER_BH,
CONSOLE_BH,
…
KEYBOARD_BH,
…
}
Bottom Half Handling (Cont`)
example of bottom half
kbd_init()
{
….
request_irq(KEYBOARD_IRQ, kbd_interrupt, 0, "KBD");
bh_base[KEYBOARD_BH].routine = kbd_bh;
….
}
kbd_interrupt(int irq, struct pt_regs *regs)
{
….
mark_bh(KEYBOARD_BH);
….
}
kbd_bh() /* called from ret_from_syscall */
{
do KBD interrupt handling
}
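The split between marking a bottom half in the interrupt handler and running it later can be sketched with a routine table and a bitmask of pending entries. This is a user-space illustration: the demo_ names are hypothetical, and clearing the whole mask before running the routines is a simplification chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_BH 32

static void (*demo_bh_base[DEMO_NR_BH])(void);   /* routine per bottom half */
static unsigned long demo_bh_active;             /* one pending bit per entry */

/* Install the deferred routine for slot nr. */
void demo_init_bh(int nr, void (*routine)(void))
{
    assert(nr >= 0 && nr < DEMO_NR_BH);
    demo_bh_base[nr] = routine;
}

/* Called from the "top half": only records that work is pending. */
void demo_mark_bh(int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_BH);
    demo_bh_active |= 1UL << nr;
}

/* Called later, outside the interrupt handler: run each pending routine once. */
void demo_do_bottom_half(void)
{
    unsigned long active = demo_bh_active;

    demo_bh_active = 0;
    for (int nr = 0; nr < DEMO_NR_BH; nr++) {
        if ((active & (1UL << nr)) && demo_bh_base[nr] != NULL)
            demo_bh_base[nr]();
    }
}

/* Stand-in for kbd_bh(): just counts how often it ran. */
static int demo_kbd_runs;

static void demo_kbd_bh(void)
{
    demo_kbd_runs++;
}
```

Marking the same bottom half several times before it runs still results in a single call, so the deferred routine has to process everything that has accumulated.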
Bottom Half Handling (Cont`)
timer handling
To handle jobs that must be invoked at a specific time
struct timer_struct {
unsigned long expires;
void (*fn)(void);
} timer_table[];
init_timer()
add_timer()
del_timer()
Network in Linux
Network implementation
one of the basic demands of an operating system
applications
ftp, telnet, rlogin, NFS, e-mail, News
protocol
TCP/IP, OSI, IPX (developed by Novell), SNA, appletalk, X.25
devices
Ethernet(eth0, eth1), SLIP(sl0), PLIP (plip0)
Socket interface
Socket interface /* net/socket.c */
virtual interface
to support various protocol family
UNIX, INET, X25, IPX, APPLETALK, …
to support various socket types
Stream, Datagram, Raw, Reliable Delivered Message, ...
socket(), bind(), connect(), listen(), accept()
read(), write()
send(), sendto(), recv(), recvfrom()
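That plain read() and write() work on a socket descriptor can be checked from user space. The sketch below is an illustration, not part of the original material: demo_socket_echo is a hypothetical helper, and it uses socketpair() (a standard call not listed above) to obtain two connected AF_UNIX stream sockets without any bind/connect setup.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write msg into one end of a connected socket pair and read it back
 * from the other end. Returns the number of bytes read, or -1 on error. */
ssize_t demo_socket_echo(const char *msg, char *out, size_t outlen)
{
    int sv[2];
    ssize_t n;

    assert(msg != NULL && out != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    n = write(sv[0], msg, strlen(msg));   /* ordinary file write on a socket */
    if (n >= 0)
        n = read(sv[1], out, outlen);     /* ordinary file read on a socket */
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The same descriptor could also be passed to send() and recv(); read() and write() are simply the file-style entry points into the socket code.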
Layer model
layer structure of a network
BSD socket
INET socket
TCP
UDP
IP
PLIP
SLIP
parallel
port
serial
port
ETHERNET
Ethernet
card
ARP
Layer model (Cont`)
Encapsulation
TFTP message : [TFTP header][data]
UDP message : [UDP header][TFTP header][data]
IP packet : [IP header][UDP header][TFTP header][data]
Ethernet frame : [Ethernet header][IP header][UDP header][TFTP header][data][Ethernet trailer]
Details of each structure can be found in “The LINUX NETWORK” and
“UNIX network programming”
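Encapsulation can be sketched as a buffer with reserved headroom in which each layer prepends its header in front of the data already there. This is a simplified user-space model: demo_skb, demo_skb_init and demo_skb_push are hypothetical names, the buffer size is arbitrary, and the 4-byte strings stand in for real headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 128   /* assumption: arbitrary size for the sketch */

struct demo_skb {
    unsigned char buf[DEMO_BUF_SIZE];
    unsigned char *data;   /* start of the valid bytes inside buf */
    size_t len;            /* number of valid bytes */
};

/* Place the payload at the end of the buffer so that everything
 * in front of it is headroom for the headers of lower layers. */
void demo_skb_init(struct demo_skb *skb, const void *payload, size_t n)
{
    assert(skb != NULL && n <= DEMO_BUF_SIZE);
    skb->data = skb->buf + DEMO_BUF_SIZE - n;
    skb->len = n;
    memcpy(skb->data, payload, n);
}

/* Prepend a header. Returns 0 on success, -1 if the headroom is too small.
 * The payload itself is never copied again. */
int demo_skb_push(struct demo_skb *skb, const void *hdr, size_t n)
{
    if ((size_t)(skb->data - skb->buf) < n)
        return -1;
    skb->data -= n;
    skb->len += n;
    memcpy(skb->data, hdr, n);
    return 0;
}
```

Pushing the TFTP, UDP, IP and Ethernet headers in that order yields the frame layout shown above, with the outermost header first in memory.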
Layer model (Cont`)
Details of TCP/IP protocol
Ethernet frame : Destination ethernet address | Source ethernet address | Protocol | Data | Checksum
IP packet : Length | Protocol | Checksum | Source IP address | Destination IP address | Data
TCP message : Source TCP address | Destination TCP address | SEQ | ACK | Data
Important data structure
important data structure
VFS layer
struct file_operations
BSD socket layer
struct net_proto_family /* include/linux/net.h */
struct socket /* include/linux/net.h */
/* include/linux/fs.h */
inet layer
struct sock /* include/net/sock.h */
struct proto_ops /* include/linux/net.h */
transport layer
struct tcp_opt /* include/net/sock.h */
struct proto /* include/net/sock.h */
network layer
struct tcp_func /* include/net/tcp.h */
struct packet_type /* include/linux/netdevice.h */
device layer
struct device /* include/linux/netdevice.h */
struct sk_buff /* include/linux/skbuff.h */
Important data structure (cont`)
socket data structure
task
….
fd[]
….
/* include/linux/net.h */
file
….
f_dentry
….
f_pos
f_op
dentry
d_inode
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
INET, UNIX, IPX, X25, ..
/* include/net/sock.h */
struct sock {
...
}
/* include/linux/net.h */
struct proto_ops {
family
dup, release,
bind, connect,
accept, listen,
...
getsockopt
setsockopt
sendmsg
recvmsg
}
/* for INET operation */
Important data structure (cont`)
sock data structure
/* include/net/sock.h */
struct tcp_opt {
tcp_header_leng
rcv_next, snd_next,
/* sequence, error
handling information */
….
tcp_func
...
}
/* include/net/tcp.h */
struct tcp_func {
queue_xmit
send_check
….
}
/* for IP operation */
/* include/net/sock.h */
struct sock {
next, prev
daddr, dport
rcv_saddr, sport
...
rmem_alloc
receive_queue /* sk_buff */
wmem_alloc
send_queue
...
pair /* struct sock */
proto /* struct proto */
tp_pinfo
dst_cache /* struct dst_entry */
...
}
/* include/net/sock.h */
struct proto {
next, prev
close, bind, retransmit
connect, accept
…
sendmsg, recvmsg
…
name
}
/* for TCP or UDP operations */
/* include/net/dst.h */
struct dst_entry {
next
….
struct device *dev;
struct hh_cache *hh;
(*input)
(*output)
…
}
/* for device operation */
Important data structure (cont`)
network device data structure
/* include/net/sock.h */
struct sock {
...
dst_cache
...
}
/* include/linux/netdevice.h */
struct hh_cache {
hh_refcnt
hh_type
hh_output
…
}
/* for abstract device
operation */
/* include/net/dst.h */
struct dst_entry {
….
*dev;
*hh;
(*input)
(*output)
…
}
/* include/linux/netdevice.h */
struct device {
name
mem_end, mem_start
base addr /* port number */
irq
…
init, destructor
….
device_addr
Qdisc /* sk_buff */
….
open, stop
hard_start_xmit, hard_header
...
}
/* for actual network device operation */
Important data structure (cont`)
sk_buff data structure
for virtual copy
struct sock
/* include/linux/skbuff.h */
struct sk_buff {
next, prev
struct sock *sk;
….
dev
/* TP layer header */
union { th, uh, icmph, …} h;
/* Network layer header */
union { iph, ipv6h, arph, ..} nh
/* Data Link header */
union { ethernet, raw} mac;
struct dst_entry *dst;
…
data, head, tail, len
…
}
sk_buff
headers
data
...
sk_buff
headers
data
...
struct device
sk_buff
headers
data
...
Socket Create
socket create
/* include/linux/socket.h */
AF_UNIX, AF_INET, AF_IPX, ...
/* net/socket.c */
sys_socket(family, type, protocol)
SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ...
sock_create()
sock_alloc()
net_families[family]->create()
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
Socket Create (cont`)
protocol family registration
family
/* include/linux/socket.h */
AF_UNIX, AF_INET, AF_IPX, ...
registration
/* net/ipv4/af_inet.c */
struct net_proto_family inet_family_ops = {
PF_INET,
inet_create
}
/* include/linux/net.h */
struct net_proto_family {
family
create()
authentication
encryption, encrypt_net
}
struct net_proto_family *net_families[];
/* net/socket.c */
sock_register(net_proto_family *ops)
{
...
net_families[ops->family] = ops;
}
inet_proto_init()
{
…
sock_register(&inet_family_ops)
...
}
/* net/unix/af_unix.c */
/* net/ipx/af_ipx.c */
Socket Create (cont`)
/* include/linux/socket.h */
AF_UNIX, AF_INET, AF_IPX, ...
socket create
/* net/socket.c */
sys_socket(family, type, protocol)
SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ...
sock_create()
sock_alloc()
net_families[family]->create()
unix_create()
/* include/net/sock.h */
struct sock {
...
prot
net_pinfo
tp_pinfo
socket
sk_buff
….
}
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
inet_create()
sk_alloc()
switch (type)
sock->ops=&inet_stream_ops
or sock->ops=&inet_dgram_ops
…
sk->prot = &tcp_prot
Socket Create (cont`)
socket create
/* net/socket.c */
sys_socket(family, type, protocol)
sock_create()
get_fd()
get_empty_filp()
file->f_op=&socket_file_ops
associate d_inode with socket structure
/* net/socket.c */
struct file_operations socket_file_ops = {
sock_lseek
sock_read
sock_write
NULL /* readdir */
sock_poll
sock_ioctl
NULL /* mmap */
sock_no_open
NULL /* flush */
sock_close
NULL /* fsync */
sock_fasync
}
Socket Create (cont`)
after socket creation
task
….
fd[]
….
file
….
f_dentry
….
f_pos
f_op
/* include/linux/net.h */
struct socket {
state
type
flags
ops /* proto_ops */
sk /* struct sock */
files, inodes
next, wait, ….
}
dentry
VFS layer
d_inode
INET layer
/* include/net/sock.h */
struct sock {
next, prev
daddr, dport
rcv_saddr, sport
...
rmem_alloc
receive_queue /* sk_buff */
wmem_alloc
send_queue
...
pair /* struct sock */
prot /* struct proto */
tp_pinfo
dst_cache /* struct dst_entry */
...
}
TCP layer
IP layer
Driver
layer
Send Data
sending data through socket
compare with FS control flow…, that is a piece of pizza
/* net/ipv4/af_inet.c */
struct proto_ops inet_stream_ops = {
PF_INET
sock_no_dup
inet_release
inet_bind
inet_stream_connect
sock_no_socketpair
inet_accept
inet_getname
inet_poll
inet_ioctl
inet_listen
inet_shutdown
inet_getsockopt
inet_setsockopt
sock_no_fcntl
inet_sendmsg
inet_recvmsg
}
/* fs/read_write.c */
sys_write()
f->f_op->write
/* net/socket.c */
sock_write()
socki_lookup(d_inode)
make msg
sock_sendmsg()
sock->ops->sendmsg
/* net/ipv4/af_inet.c */
inet_sendmsg()
sk->prot->sendmsg
Send Data (cont`)
sending data through socket
/* net/ipv4/af_inet.c */
inet_sendmsg()
sk->prot->sendmsg
/* net/ipv4/tcp.c */
tcp_v4_sendmsg()
tcp_do_sendmsg()
copy data from user to sk_buff
/* net/ipv4/tcp_output.c */
tcp_send_skb()
tcp_transmit_skb()
make tcp header
sk->tp_pinfo.af_tcp.af_specific
->queue_xmit(skb)
/* net/ipv4/tcp_ipv4.c */
struct proto tcp_prot = {
next, prev
tcp_close
tcp_v4_connect
tcp_accept
NULL /* retransmit */
tcp_write_wakeup
tcp_read_wakeup
tcp_poll
tcp_ioctl
tcp_v4_init_sock
tcp_v4_destroy_sock
tcp_shutdown
tcp_getsockopt
tcp_setsockopt
tcp_v4_sendmsg
tcp_recvmsg
…
“TCP”
...
}
Send Data (cont`)
sending data through socket
/* net/ipv4/tcp_output.c */
tcp_transmit_skb()
sk->tp_pinfo.af_tcp.af_specific
->queue_xmit(skb)
/* net/ipv4/ip_output.c */
ip_queue_xmit()
build IP header
fragment handling
call ip_route_output()
/* dst_cache.output =
ip_output in ip_route_output */
sk->dst_cache->output()
/* net/ipv4/ip_output.c */
ip_output()
ip_finish_output(skb)
/* net/ipv4/tcp_ipv4.c */
struct tcp_func ipv4_specific = {
ip_queue_xmit
tcp_v4_send_check
tcp_v4_rebuild_header
tcp_v4_conn_request
tcp_v4_syn_recv_sock
tcp_v4_get_sock
sizeof(struct iphdr)
ip_setsockopt
ip_getsockopt
v4_addr2sockaddr
sizeof(struct sockaddr_in)
}
sk_alloc() => tcp_v4_init_sock()
tcp_v4_init_sock() {
…
sk->tp_pinfo.af_tcp.af_specific=&ipv4_specific
..
}
Send Data (cont`)
/* include/linux/netdevice.h */
struct hh_cache {
hh_refcnt
hh_type
hh_output
…
}
sending data through socket
/* include/net/ip.h */
ip_finish_output()
hh->hh_output(skb)
/* net/core/dev.c */
dev_queue_xmit()
hh->output =
neigh_ops->output =
dev_queue_xmit
/* net/ipv4/arp.c*/
input pkt into dev->qdisc
dev->hard_start_xmit()
/* drivers/net/3c509.c */
el3_start_xmit()
make ethernet frame
send frame using inb(), outb(), ...
init_module() in 3c509
/* drivers/net/3c509.c */
init port, irq, …
make dev structure
dev->open=el3_open
dev->hard_start_xmit =
el3_start_xmit
...
struct device {
name
rmem_end, rmem_start
mem_end, mem_start
base addr
irq
…
init, destructor
….
device_addr
qdisc
….
open, stop
hard_start_xmit,
hard_header
...
}
Send Data (cont`)
sending data through socket
struct sock
struct device
...
qdisc
...
...
send queue
...
sk_buff
headers
data
...
sk_buff
headers
data
...
sk_buff
headers
data
...
Protocol Layer
Device Layer
Send Data (Cont`)
Sending all together (TCP/IP & Ethernet)
cf) compared with the control flow of the FS, this path is far more complicated (the FS is a piece of cake)
VFS
BSD socket
inet socket
TCP
/* fs/read_write.c */
sys_write()
/* net/socket.c */
sock_write()
/* net/ipv4/af_inet.c */
inet_sendmsg()
/* net/ipv4/tcp_output.c */
tcp_send_skb()
/* net/ipv4/ip_output.c */
IP
Device
ip_queue_xmit()
/* drivers/net/3c509.c */
el3_start_xmit()
Linux kernel
Receive Data
receiving data through socket
/* net/ipv4/ip_input.c */
ip_local_deliver()
ip_forward(), ip_defrag()
skb->dst->input()
/* dst.input =
ip_local_deliver in ip_route_input() */
/* net/ipv4/ip_input.c */
ip_rcv()
make sk_buff in device structure
ptype->func()
/* net/core/dev.c */
net_bh()
/* include/linux/netdevice.h */
struct packet_type {
type
dev
func
….
}
/* net/ipv4/ip_output.c */
struct packet_type
ip_packet_type = {
ETH_P_IP, NULL,
ip_rcv,
...
}
mark_bh(NET_BH)
/* drivers/net/3c509.c */
el3_interrupt()
el3_open()
….
request_irq(dev->irq, el3_interrupt
Receive Data (cont`)
receiving data through socket
tcp_data_queue() /* sk_buff into sk */
wake up process
tcp_data()
check consistency, …
tcp_data()
/* net/ipv4/tcp_input.c */
tcp_rcv_state_process()
call tcp_rcv_established
or call tcp_rcv_state_process
/* net/ipv4/tcp_ipv4.c */
tcp_v4_rcv()
tcp_v4_do_rcv()
ipprot->handler()
/* net/ipv4/ip_input.c */
ip_local_deliver()
/* include/net/protocol.h */
struct inet_protocol {
handler
err_handler
...
name
}
/* net/ipv4/protocol.c */
struct inet_protocol
tcp_protocol {
tcp_v4_rcv
tcp_v4_err
….
TCP
}
Receive Data (cont`)
receiving data through socket
/* fs/read_write.c */
sys_read()
f->f_op->read
/* net/socket.c */
sock_read()
socki_lookup(d_inode)
make msg header
sock_recvmsg()
sock->ops->recvmsg
/* net/ipv4/af_inet.c */
inet_recvmsg()
sk->prot->recvmsg
/* net/ipv4/tcp.c */
tcp_recvmsg()
add_wait_queue(sk->sleep, {current, NULL})
tcp_data()
Receive Data (cont`)
Receiving all together (TCP/IP & Ethernet)
/* fs/read_write.c */
sys_read()
VFS
/* net/socket.c */
sock_read()
BSD socket
/* net/ipv4/af_inet.c */
inet_recvmsg()
inet socket
TCP
/* net/ipv4/tcp.c */
tcp_recvmsg()
wake up
/* net/ipv4/tcp_input.c */
tcp_rcv_state_process()
/* net/ipv4/ip_input.c */
sleep
IP
ip_rcv()
/* net/core/dev.c */
/* drivers/net/3c509.c */
Device
Linux kernel
el3_interrupt()
net_bh()
Conclusion in Network
Add new features
/* fs/read_write.c */
sys_write()
/* net/socket.c */
sock_write()
/* net/ipv4/af_inet.c */
inet_sendmsg()
secure_tcp()
/* net/ipv4/tcp_output.c */
tcp_send_skb()
/* net/ipv4/ip_output.c */
ip_queue_xmit()
compress_net()
virtual_ip()
/* drivers/net/3c509.c */
el3_start_xmit()
Linux kernel
Conclusion of Linux
an abstraction is just a set of data structures at the kernel level
process
struct task_struct
struct user
/* include/linux/sched.h */
/* include/asm-i386/user.h */
memory
struct vm_area_struct
/* include/linux/sched.h, include/asm-i386/page.h */
struct file, struct inode
/* include/linux/fs.h, ext2_fs_i.h */
file
file system
struct super_block
/* include/linux/fs.h, */
buffer
struct buffer_head
/* include/linux/fs.h */
device driver
struct device_struct
IPC
TCP/IP
/* fs/devices.c, drivers/* */
/* include/linux/ipc.h, sem.h, msg.h, shm.h */
/* include/linux/tcp.h, ip.h */