UNIX Internals (Focusing on the LINUX Kernel)

Contents
  Part I. UNIX Operating System
    1. Introduction
    2. Process Management
    3. Memory Management
    4. File System
    5. Synchronization & IPC
    6. I/O System (Device Driver)
  Part II. Detailed Study: LINUX Kernel Internals
    1. Where is everything? System call implementation; device driver using module programming
    2. Linux internals

References
  • U. Vahalia, “Unix Internals: The New Frontiers”, Prentice Hall, 1996.
  • H. M. Deitel, “Operating Systems”, 2nd edition, Addison-Wesley, 1990.
  • Silberschatz and Galvin, “Operating System Concepts”, 5th edition, Addison-Wesley, 1998.
  • Mukesh Singhal and Niranjan G. Shivaratri, “Advanced Concepts in Operating Systems”, McGraw-Hill, 1994.
  • Maurice J. Bach, “The Design of the UNIX Operating System”, Prentice Hall, 1986.
  • M. Beck et al., “Linux Kernel Internals”, 2nd edition, Addison-Wesley, 1997.
  • Marshall K. McKusick, K. Bostic, M. Karels and J. Quarterman, “The Design and Implementation of the 4.4 BSD Operating System”, Addison-Wesley, 1996.
  • Berny Goodheart and James Cox, “The Magic Garden Explained”, Prentice Hall, 1994.

I. Introduction
  • What is UNIX Operating System?
  • Brief History
  • Kernel Architecture
  • Features of UNIX Operating System

What is UNIX Operating System?
  [Figure: layered "onion" view. The hardware is at the center, surrounded by the kernel, then by utilities and applications such as csh, vi, du, who, wc, telnet, ps, grep, sort, gcc, ls, a.out, X window, RDBMS, and network/admin packages.]
  What’s the similarity between an onion and UNIX?

What is UNIX Operating System? (Cont`)
  [Figure (Source: The Design of the UNIX OS): kernel block diagram
   • User level: user programs and libraries; a trap enters the kernel
   • Kernel level: system call interface; file system management, buffer cache, device drivers; process management with IPC, context, and memory management; hardware control (interrupt handling, etc.)
   • HW level: hardware]

What is UNIX Operating System?
(Cont`)
  UNIX Operating System is a resource manager
    • Physical resources: CPU, memory, disk, network, …
    • Abstract resources: process, thread, page, file, inode, message, security, …
  UNIX Operating System is a computing environment
    • provides resource services to users through system calls and APIs
    • an abstraction is just a set of data structures at the kernel level

Brief History
  Before UNIX
    • Multics: 1965, AT&T (Bell Labs), General Electric, MIT
  Epoch
    • 1969: Ken Thompson, “Space Travel” on the PDP-7
    • Dennis Ritchie
    • s5fs, ed, shell (ancestor of the Bourne shell)
    • 1973: “The UNIX Time-Sharing System” (presented 1973; published in CACM, 1974)
  BSD
    • Bill Joy, Chuck Haley (graduate students)
    • ex, csh, paging-based virtual memory system, TCP/IP, ffs, socket
    • 1993: 4.4BSD (final version; afterwards the company BSDI)
  AT&T System V
    • Version 1, 2, …, 7, System III, System V, …, SVR4.2/ESMP
    • region-based virtual memory, IPC, remote file sharing, STREAMS

Brief History (Cont`)
  Commercial UNIX
    • XENIX (MS, SCO), SCO UNIX (SCO), AIX (IBM, journaling FS), HP-UX (HP), ULTRIX (DEC, the first MP), OSF/1 (Digital), …
    • SunOS (Sun Microsystems, VFS, NFS), Solaris, UnixWare (Novell)
  Mach
    • the first micro-kernel
    • Chorus, Exo-kernel, SPIN, L4, …
    • http://ssrnet.snu.ac.kr/~choijm/current_os.html
  Standards
    • SVID (System V Interface Definition), POSIX (IEEE), X/OPEN (Inc.)
    • UI (Sun, AT&T: Solaris), OSF (OSF/1)
  Linux
    • performance oriented
    • philosophy of COPYLEFT

Kernel Architecture
  Monolithic kernel: traditional UNIX, SVR4, Solaris, Linux, …
  [Figure: monolithic kernel. Processes enter through the system call layer into one integrated kernel that contains all OS functionality (the OS personality) and runs directly on the hardware.]

Kernel Architecture (Cont`)
  Monolithic kernel: example call paths inside the kernel
    • read() → system call → sys_read() → file system → bread() (buffer cache) → hd_request() (disk device driver) → do_hd_io() → hardware
    • fork() → system call → sys_fork() → process management → copy_mm() (memory manager), copy_thread() → CPU

Kernel Architecture (Cont`)
  Micro-kernel: Mach, Chorus, L3/L4, SPIN, QNX, Windows NT, …
  [Figure: micro-kernel. Processes and servers run above the system call interface; the OS functionality lives in the servers, and only the microkernel sits on the hardware.]

Kernel Architecture (Cont`)
  Micro-kernel: example call path
    • a process calls read(); the request goes through the system call interface (sys_read()) to the file system server, which issues hd_request(); the microkernel accesses the hardware
    • other servers (process server, …) are reached in the same way
  What is the advantage of a micro-kernel?

Windows NT Architecture
  [Figure (Source: Inside Windows NT)
   • User mode: applications (Win32 client, POSIX client, OS/2 client, logon process) exchange messages with the protected subsystems (servers): Win32 server, POSIX server, OS/2 server, security server; a trap enters kernel mode
   • Kernel mode, NT Executive: system services, object manager, security reference monitor, process manager, LPC facility, VM management, I/O manager (file systems, cache manager, device drivers, network drivers)
   • Kernel mode, below the Executive: kernel, Hardware Abstraction Layer (HAL), HW control
   • Hardware]

Features
  What is good about UNIX
    • Open system: free
    • "Small is beautiful" philosophy
    • Simple and coherent
        - file: just a stream of bytes
        - data, device, pipe, socket, memory, process, … can be treated as a single abstraction (file)
    • Portability: written in a high-level language
    • New paradigms: OO, client-server model, clustering, PDA, MM server
    • True parallelism: multitasking (time sharing), multiprogramming, multiprocessor, MPP

Features (Cont`)
  What is wrong with UNIX
    • Too many variants: a dumping ground
    • Not small and simple any more: uncontrolled growth
    • Building-block approach: inappropriate for beginners
    • Lack of GUI: no longer the case
  Ritchie’s words: “It takes a genius to understand and appreciate the UNIX’s simplicity”

II. Process Management

Overview
  • What is a process?
  • process state transition
  • context
  • scheduling
  • kernel entry points: interrupt, trap, system call
  • signal

What is Process?
  Definition
    • an instance of a running program (runnable program)
    • an execution environment of a program
    • a scheduling entity
    • a control flow and an address space
    • PCB (Process Control Block): proc table and U area
  Manipulation of a process
    • create, destroy
    • context
    • state transition: dispatch (context switch), sleep, wakeup
    • swap

Process State Transition
  [Figure (Source: UNIX Internals)
   • states: initial (idle), ready to run, kernel running, user running, asleep, zombie, suspended ready, suspended asleep
   • fork creates a process in the initial (idle) state
   • ready to run → kernel running: swtch
   • user running → kernel running: syscall, interrupt; back again: return from syscall or interrupt
   • kernel running → asleep: sleep, lock; asleep → ready to run: wakeup, unlock
   • kernel running → zombie: exit; the zombie disappears after wait
   • swap moves a process between ready to run / asleep and suspended ready / suspended asleep]

Process State Transition (Cont`)
  Flow of execution: execution mode (cf. address space)
  [Figure (Source: Magic Garden): a timeline that alternates between kernel execution and the execution of processes A, B, and C, including the creation of process B]
  An interrupt or trap causes a change of execution mode.

Context
  context: system context, address (memory) context, H/W context
  [Figure: in memory, the proc table entry leads to the file table (fd), the segment table and page tables, and the U area with the saved registers (TSS: eip, sp, eflags, eax, cs, …); part of the context may be on the swap disk]

Context : system context
  System context
    • proc table
        - identification: pid, process group id, …
        - family relation
        - state
        - sleep channel: sleep queue
        - scheduling information: p_cpu, p_pri, p_nice, …
        - signal handling information
        - address (memory) information
    • U area
        - stores the hardware context when the process is not currently running
        - UID, GID
        - arguments, return values, and error status for system calls
        - signal catch functions
        - file descriptors
        - usage statistics
  The details may differ according to the version and variant of UNIX.

Context : address context
  fork example (Source: Adv. Programming in the UNIX Env., pgm 8.1)

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int  glob = 6;                         /* lives in the data segment */
    char buf[] = "a write to stdout\n";

    int main(void)
    {
        int   var;                         /* lives on the stack */
        pid_t pid;

        var = 88;
        write(STDOUT_FILENO, buf, sizeof(buf) - 1);
        printf("before fork\n");

        if ((pid = fork()) == 0) {         /* child */
            glob++;
            var++;
        } else {
            sleep(2);                      /* parent */
        }
        printf("pid = %d, glob = %d, var = %d\n", getpid(), glob, var);
        exit(0);
    }

  Guess what output this program produces.

Context : address context (Cont`)
  fork internal: compile results (gcc test.c)
  [Figure: a.out is in ELF (Executable and Linking Format) and contains header, text, data, bss, and stack. From the user's perspective (virtual addresses 0x0 to 0xffffffff) the process sees text, data, bss, and stack, with the kernel above 0xbfffffff. glob and buf are in data; var and pid are on the stack. The text holds the code for glob++, e.g. movl %eax, [glob]; addl %eax, 1; movl [glob], %eax.]

Context : address context (Cont`)
  fork internal: before fork (after running a.out)
  [Figure: the proc table entry for pid = 11 points to a segment table that maps text, data (glob, buf), and stack (var, pid) in memory]
  cf) we assume that there is no paging mechanism in this figure.

Context : address context (Cont`)
  fork internal: after fork
  [Figure: pid = 11 and the new pid = 12 each have a proc table entry and a segment table; the text is shared, while each process has its own data (glob, buf) and stack (var, pid)]
  address space: the basic protection barrier

Context : address context (Cont`)
  fork internal: with the COW (Copy on Write) mechanism
  [Figure: two panels. After fork with COW, the segment tables of pid = 11 and pid = 12 point to the same text, data, and stack. After the "glob++" operation, the data is copied so that each process has its own data.]

Context : address context (Cont`)
  execve internal
  [Figure: pid = 11 keeps its proc table entry; its segment table now maps text, data, and stack built from the a.out file (header, text, data, bss, stack)]

Context : hardware context
  time sharing (multitasking)
  [Figure: the CPU runs process 1, process 2, process 3, … in turn, one time quantum each; at every switch the question is "Where am I?"]

Context : hardware context (Cont`)
  a brief reminder of the 80x86 architecture
  [Figure: CPU with ALU, control unit, IN/OUT, and registers]
    • eip, eflags
    • eax, ebx, ecx, edx, esi, edi, …
    • cs, ds, ss, es, …
    • cr0, cr1, cr2, cr3, GDTR, TR, …

Context : hardware context (Cont`)
  context switch
  [Figure: "save context" copies the CPU registers (eip, sp, eflags, eax, cs, …) into the TSS of the outgoing process, reached through its proc table entry and U area; "restore context" loads the registers from the TSS of the incoming process]

Context : hardware context (Cont`)
  context switch: pseudo-code in UNIX (Source: The Design of the UNIX OS)

    …
    /* need a context switch */
    if (save_context()) {
        /* pick another process to run from the ready queue */
        …
        restore_context(new process);
        /* control NEVER arrives here */
    }
    /* the resuming process executes from here */
    …

  trick: the return value of save_context() is passed in a register (e.g., eax on the 80x86 CPU), so it can differ between the original call and the later resumption
  Think about the difference between a context switch and a system call.

Process Scheduling
  Process scheduling
    • allocates the CPU among the competing processes
    • criteria: fairness, efficiency (response time vs. throughput)
  Types of processes
    • interactive
    • batch (computation-intensive)
    • real-time: video, hospital
  Types of scheduling
    • preemptive scheduling: other processes can take the CPU away from the currently running process
    • non-preemptive scheduling (Windows98): other processes cannot take the CPU away from the currently running process

Scheduling criteria
  • CPU utilization
  • throughput: completed processes per unit time
  • turnaround time: from process start to finish
  • waiting time: total time spent in the ready queue
  • response time: time from job submission until the response begins

Process Scheduling (Cont`)
  Existing policies
    • FCFS (First Come First Served): like the queue at a bank
    • RR (Round-Robin): time quantum of 10-100 msec
    • SJF (Shortest Job First)
    • Multilevel Feedback Queue: several queues
    • EDF (Earliest Deadline First)
    • RM (Rate Monotonic)
    • Fair Queuing
    • Gang Scheduling
    • Causality Scheduling
    • Process migration

Process Scheduling (Cont`)
  UNIX: Round-Robin with a multilevel feedback queue
  [Figure: Round-Robin. Processes P1, P2, P3 wait in a single ready queue for the CPU.]

Process Scheduling (Cont`)
  Multilevel feedback queue
  [Figure: ready queues 1 … n, each feeding the CPU (P6-P8 in queue 1, P4-P5 in queue 2, P1-P3 in queue n); queues nearer the top have higher priority and a shorter time quantum]

Process Scheduling (Cont`)
  Round-Robin: real implementation
    • scheduling information in the proc table: p_pri, p_cpu, p_nice
    • every clock tick: increment p_cpu of the currently running process
    • every second: p_cpu = p_cpu * decay factor (generally 1/2)
                    p_pri = PUSER + p_cpu/2 + p_nice
  Example of System III: 3 processes, PUSER = 50, p_nice = 0, 60 clock ticks per second

    second |     P1      |     P2      |     P3
           | p_pri p_cpu | p_pri p_cpu | p_pri p_cpu
    -------+-------------+-------------+-------------
       0   |   50     0  |   50     0  |   50     0
       1   |   65    30  |   50     0  |   50     0
       2   |   57    15  |   65    30  |   50     0
       3   |   53     7  |   57    15  |   65    30
       4   |   66    33  |   53     7  |   57    15

Process Scheduling (Cont`)
  Example of BSD
    • decay factor: (2*load_average) / (2*load_average + 1)
    • p_pri = PUSER + (p_cpu/4) + (2*p_nice)
    • a clock tick is 10 msec; the time quantum is 10 clock ticks
  Example of Mach
    • decay factor: 5/8
    • p_usrpri = PUSER + (3.8 * max(1, M/P) * p_cpu)/T + 0.5 * p_nice
  Example of SVR4
    • supports a REAL-TIME class
    • class-independent scheduler / class-dependent schedulers
  Example of LINUX
    • supports REAL-TIME processes
    • selects the process that has the highest value of "priority + counter"
    • the "counter" of the current process decreases at each clock tick

Process Scheduling (Cont`)
  Range of process priorities (Source: The Design of the UNIX OS)
    Kernel mode priorities
      • swapper
      • waiting for disk I/O
      • waiting for buffer
      • waiting for inode
      • waiting for TTY I/O
      • waiting for child exit
    User mode priorities
      • user level 0 (50)
      • user level 1
      • …
      • user level n
  [Figure: each priority level has its own queue of processes (P)]

Kernel Entry Point
  • interrupt: raised by a device
  • trap
  • system call: issued by a process
  [Figure: devices and processes enter the kernel, whose components are labeled PM, MM, FS, DD, and HWM]

Interrupt Handling
  Interrupt: a mechanism by which peripheral devices inform the UNIX operating system of an asynchronous event
  [Figure: devices (real-time clock, disk, tty, network, cdrom, …) raise interrupts through the PIC to the CPU; the kernel uses the IVT to find the interrupt handler: 0 clock(), 1 nmi(), 2 tty_intr(), 3 disk_intr(), 4 net_intr(), …]
  What’s the difference between polling and interrupt?
  (analogy: the announcement of successful applicants)

Interrupt Handling (Cont`)
  interrupt handling mechanism
    • similar to receiving a letter while on the telephone
  steps
    • if in user mode, change to kernel mode
    • save the context of the current process (make a new context layer)
    • determine the interrupt source
    • find the interrupt vector and call the interrupt handler
    • … interrupt handling …
    • restore the saved context
  What if another interrupt is triggered while handling an interrupt?

Interrupt Handling (Cont`)
  clock interrupt handler (timer_interrupt() in Linux) (Source: The Design of the UNIX OS)

    clock()
    {
        restart clock;                     /* will interrupt again */
        if (callout table not empty)       /* e.g., timer_list in LINUX */
            adjust time and schedule callout function if necessary;
        if (profiling on)
            count program counter at time of interrupt;
        gather statistics per process and system;
        update CPU usage for the current running process;
        if (one second elapsed) {
            alarm handling;
            calculate p_pri for all processes;
            reschedule if necessary;
            wake up swapper or page daemon if necessary;
        }
    }

Trap Handling
  trap: a synchronous software event, caused by the instruction being executed
  IVT
     0  div_by_zero()
     1  invalid_opcode()
     2  overflow()
     3  segment_fault()
     4  page_fault()
     …
    20  clock()
    21  nmi()
    22  tty_intr()
    23  disk_intr()
    24  net_intr()
     …
    80  system_call()
     …

System Call Handling
  system call: an example of a trap
  [Figure: a trap enters the kernel through IVT entry 80, system_call(), which indexes sys_call_table (sysent[]) to reach the handler, e.g. sys_fork() or sys_read()]
  sys_call_table (sysent[])
      0  sys_no_syscall()
      1  sys_exit()
      2  sys_fork()
      3  sys_read()
      4  sys_write()
      …
     47  sys_getpid()
      …
    255  sys_no_syscall()

System Call Handling (Cont`)
  invoke system call
  [Figure: main() in the process calls fork(); the fork() stub in libc.a executes "movl $2, eax" and "trap $80"; the kernel dispatches through IVT entry 80, system_call(), and sys_call_table entry 2 to sys_fork(). read() reaches sys_read() in the same way.]
System Call Handling (Cont`)
  how to make a new system call
    • code the new system call function in kernel space
    • allocate a syscall number (an empty slot in sys_call_table[]) and register the function
    • rebuild the kernel
    • reconfigure the library: ar, ranlib
    • code your program with the new system call

Signal
  a mechanism to inform a process of an asynchronous event
  types of signal: SIGKILL, SIGINT, SIGBUS, SIGUSR1, …
  action: abort, exit, ignore, stop, user-level catch function

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    void sig_handler(int signo)
    {
        signal(SIGUSR1, sig_handler);              /* reinstall */
        printf("received signal %d\n", signo);     /* handle the signal */
    }

    int main(void)
    {
        signal(SIGUSR1, sig_handler);              /* install the handler */
        for ( ; ; )
            pause();
    }

  What’s the difference among interrupt, trap, and signal?

Signal (Cont`)
  • register the signal handler (signal catch function)
  • send the signal
  • signal detection: at the state transition from kernel running to user running
  • call the signal handler
  variables for signals in the task structure in LINUX
    • int sigpending: has a signal been received or not?
    • struct signal_struct *sig
    • sigset_t signal, blocked

    typedef struct {
        unsigned long sig[_NSIG_WORDS];
    } sigset_t;                                    /* asm-i386/signal.h */

    • struct signal_struct (sched.h): count, action[_NSIG], siglock
    • struct sigaction (asm-i386/signal.h): sa_handler, sa_flags, sa_restorer, sa_mask

III. Memory Management

Memory Hierarchy
  hierarchy: register, CPU cache, main memory, secondary storage, server (or INTERNET)
    • further down the hierarchy: larger capacity, lower speed, lower cost
  caching is more and more important (how to keep consistency?)

Memory Management Strategy
  Three strategies
    • Fetch strategy: when is a process (page) brought into memory?
        - demand fetch
        - prefetch (agent in Web)
    • Placement strategy: where is a process (page) put in memory?
        - first fit, best fit, worst fit
    • Replacement strategy: which process (page) is evicted from memory?
        - LRU, LFU, MRU, …

History of Memory Management System
  • single user system (stone age of memory management)
      - overlay
  • fixed partition multiprogramming system
      - absolute assembler, relocating assembler
  • variable partition multiprogramming system
      - coalescing, compaction
  • virtual memory system
      - paging
      - segmentation (segment, region, vm_object)
      - paging/segmentation

Overlay
  • for a process that is larger than its allocated memory
  • e.g., a 2-pass assembler: symbol table (20K), common routines (30K), overlay driver (10K), pass 1 (70K), pass 2 (80K)

History (Cont`)
  variable partition multiprogramming system
  Scenario
    • fork P1 (40K)
    • fork P2 (20K)
    • fork P3 (10K)
    • fork P4 (20K)
    • fork P5 (40K)
    • fork P6 (20K)
    • fork P7 (70K)
    • exit P1
    • exit P3
    • exit P4
    • exit P6
  memory and kernel internals (addresses in K)
    memory layout: kernel 0-100, P1 100-140, P2 140-160, P3 160-170, P4 170-190, P5 190-230, P6 230-250, P7 250-320, free 320-400
    free memory map after the scenario
      start  end  size
       100   140    40
       160   190    30
       230   250    20
       320   400    80

Memory Management Strategy : Placement
  memory and kernel internals: the same memory layout and free memory map as on the previous slide
  Where to go?
  Scenario: the same sequence as before, followed by
    • fork P8 (25K)

Memory Management Strategy : Placement
  Scenario: fork P8 (25K) with the free memory map
      start  end  size
       100   140    40
       160   190    30
       230   250    20
       320   400    80
    • first fit: the first hole that is large enough (100-140)
    • best fit: the smallest hole that is large enough (160-190)
    • worst fit: the largest hole (320-400)
  issue: fragmentation
  employed in swap management and the KMA (kernel memory allocator)

Virtual Memory
  virtual memory: separates virtual addresses from physical addresses
  [Figure: a virtual address space from 0x0 to 0xffffffff, divided into pages; the user part holds text, data, bss, and stack, and the kernel part holds kernel text, kernel data, kernel bss, and kernel stack]

Virtual Memory (Cont`)
  virtual address: Linux case (Source: Linux Internals)
  [Figure: the kernel occupies 0xc0000000-0xffffffff. Below it, from the top down: environment and arguments (env_end, arg_end, arg_start), the stack (start_stack), shared memory, the shared C library and other shared libraries (each with text, data, bss), then the program's bss (end_bss, brk), data (end_data), and text (end_code, start_code) near 0x0.]

Virtual Memory (Cont`)
  physical memory consists of the kernel and a set of processes
  [Figure: physical memory from 0x0 to 0x4ffffff holding the kernel, P1, P2, P3, P4]

Virtual Memory (Cont`)
  physical memory: a collection of page frames (4K or 8K)
  [Figure: page frames 1 … n; the pages of P1, P2, and P3 are scattered across the frames]

Virtual Memory (Cont`)
  address translation
    • virtual address v = (s, p, d): segment number s, page number p, offset d
    • the segment table origin register holds b; entry b + s of the segment table gives s'
    • entry s' + p of the page table gives the page frame number p'
    • physical address = (p', d)

Virtual Memory (Cont`)
  address translation: table structure
    • segment table entry: V | segment start address (s') | L | R | W | E | A
    • page table entry: V | page frame number (p') | D | R | U | W | COW
    • cf) disk block descriptor for each page table entry: swap (fs) number | block number | type (fill 0, demand fill)

Virtual Memory (Cont`)
  execve (final)
  [Figure: proc table → segment table → page tables with valid bits; the valid entries map pages of text (T1, T2), data (D1), and stack (S1) to page frames in memory (4K, 12K, 20K, 28K, …); the pages come from the a.out file (header, text, data, stack)]

Virtual Memory (Cont`)
  SVR 4.0 virtual memory structure
  [Figure: struct proc (p_as) → struct as (seg_list, hint, struct hat) → a list of struct seg (as_ptr, private, s_ops, base, size), one each for the text, data, stack, and u area of the virtual address space; the private data of a segment is a struct segvn_data, whose vnode leads to the resident pages of the file and whose anon_map leads to the anonymous pages of the segment]

Virtual Memory (Cont`)
  BSD (Mach) virtual memory structure
  [Figure: struct task (vm_map) → struct vm_map (first, last, hint; struct pmap) → list of struct vm_map_entry → struct vm_object → resident page list of struct vm_page]

Virtual Memory (Cont`)
  Linux virtual memory structure
  [Figure: task_struct (mm) → mm_struct (count, pgd, mmap) → list of vm_area_struct (vm_start, vm_end, vm_flag, vm_inode), e.g. one for code and one for data]

Virtual Memory (Cont`)
  advantages of virtual memory
    • large address space
    • no need for a placement strategy
    • flexible sharing of memory objects among the processes
  [Figure: the segment tables and page tables of P1 and P2 map some of their pages to the same page frames in memory]
  no free lunch: the disadvantage of virtual memory is the cost of address translation

Virtual Memory (Cont`)
  address translation with a TLB (Translation Lookaside Buffer)
    • virtual address v = (s, p, d), as before
    • the TLB (associative memory) is searched with (s, p); on a hit it supplies the page frame number p' directly
    • on a miss, the segment table (b + s → s') and the page table (s' + p → p') are consulted
    • physical address = (p', d)

Virtual Memory (Cont`)
  HAT (Hardware Address Translation)
    • isolates all hardware-dependent code
    • HAT in SVR4, pmap in BSD, pgd in Linux, …
    • responsible for all address translation, transparently
  case study: 80x86 CPU, segmentation
    • virtual address = 16-bit segment selector + 32-bit offset
    • segment translation through the segment descriptor table (GDT, LDT) yields a 32-bit linear address
  cf) 80x86 reminder
    • GDT: available to all tasks; segments for OS code and data; descriptors for LDT and TSS
    • LDT: for a specific task
    • IDT: interrupt service routines

Virtual Memory (Cont`)
  HAT (Hardware Address Translation): paging
  case study: 80x86 CPU
    • linear address: bits 31-22 DIR, bits 21-12 PAGE, bits 11-0 offset
    • CR3 (Page Directory Base Register) points to the page directory; DIR selects the entry that holds the PFN of a page table; PAGE selects the entry that holds the PFN of the page
    • physical address: PFN in bits 31-12, offset in bits 11-0
    • page table entry: PFN in bits 31-12; D: dirty (bit 6), R: referenced (bit 5), U: user/supervisor (bit 2), W: read/write (bit 1), P: present, valid (bit 0)

Replacement Strategy
  Which page can be evicted from memory?
  [Figure: memory holds p2, p4, p1, p3, p7; a page fault occurs for p8, which is on disk; the replacement policy chooses the victim]
  goal: reduce the number of page faults and avoid thrashing

Replacement Strategy (Cont`)
  basic principle of replacement: locality
    • temporal locality: stack, tree traversal, counting variable
    • spatial locality: array, sequential code, file reference
  replacement policies
    • FIFO (First In First Out)
    • LRU (Least Recently Used)
    • LFU (Least Frequently Used)
    • NUR (Not Used Recently)
    • MRU (Most Recently Used)
    • Working Set
    • Second Chance (FIFO + reference bit)

Replacement Strategy (Cont`)
  example: FIFO, LRU, LFU
    • scenario (page reference order): p1, p2, p3, p1, p4, p2, p1, p3, p4, p7, p8
    • system internals: memory holds p2, p4, p1, p3, p7; p8 is on disk
  Guess which page will be evicted from memory under the LRU policy.
  Which policy is the best?

Replacement Strategy (Cont`)
  Project I: program a simulator for the FIFO, LRU, and LFU policies and compare their performance.
  assume
    - memory consists of 20 page frames
    - the page numbers range from 0 to 49
    - the number of references is 300
  program the 3 policies
    - use a linked list for FIFO and LRU
    - use a priority tree for LFU if possible
    - use a hash to find a page quickly
  compare the performance and discuss it

Replacement Strategy (Cont`)
  Example of a real implementation in UNIX: buffer cache (Source: The Design of the UNIX OS)
  [Figure: every buffer is linked both on a hash queue and on the LRU list (head … tail)]
    hash queue header      buffers
    (page_no % 5) = 0      10, 45, 30
    (page_no % 5) = 1      21, 26
    (page_no % 5) = 2      2
    (page_no % 5) = 3      33, 28, 3, 43
    (page_no % 5) = 4      24, 19

Replacement Strategy (Cont`)
  example: NUR, used by the pagedaemon (two-handed clock algorithm)
    • page table entry: V | page frame number (p') | D | R | U | W | COW
    • possible combinations of the D and R bits: (0,0), (0,1), (1,1), (1,0)
    • replace a page having the (0,0) combination first

Swapper vs. PageDaemon
  swapping and paging: remove some object from memory when memory is almost full
  swapping
    • object: process
    • swap in / swap out
    • swap space management: similar to variable partition multiprogramming
  paging
    • object: page
    • page fault handling

IV. File System

Overview of File System
  [Figure: processes 1 … n in user mode call into the Virtual File System in system mode; below it are the file systems (ffs, nfs, ext2fs, ntfs, mmfs, procfs, …), then the buffer cache, then the device driver]

User Interface
  System calls
    • open, read/write, close
    • dup
    • link
    • pipe, mkfifo
    • mkdir, readdir
    • mknod
    • stat
    • mount
    • sync, fsck

User Interface (Cont`)
  file descriptor, file table, inode (vnode)
  [Figure: proc table → U area (fd) → file table → vnode → inode; the proc table entry also leads to the segment table and the TSS]

User Interface (Cont`)
  fork vs. open
  [Figure: after fork, the fd entries of the parent and the child point to the same file table entry and vnode; when two processes open the same file, each gets its own file table entry, and both entries point to the same vnode]
  How about dup?

Disk system
  physical view
    • platter, arm, head
    • cylinder, track, sector
    • seek time, rotational latency, transmission time
  logical view (the viewpoint of UNIX)
    • a disk is a collection of disk blocks: 0, 1, 2, 3, …, 14, …
    • the disk block size is usually equal to the page frame size
Structure of File
  disk block allocation
    • want to create a file with a size of 14K
    • assume the disk block size is 4K (disk blocks 0, 1, 2, …, 16, …)
  • sequential allocation
  • non-sequential allocation: block chain, index block, FAT

Structure of File (Cont`)
  non-sequential allocation: block chain
  [Figure: the directory entry (new file name) points to the first block of the file; each block holds a pointer to the next block]

Structure of File (Cont`)
  non-sequential allocation: index block
  [Figure: the directory entry (new file name) points to an index block, which lists the blocks of the file]
  What if the index block is full?

Structure of File (Cont`)
  non-sequential allocation: FAT (File Allocation Table)
  [Figure: the directory entry (new file name) points to the first block; the FAT has one entry per disk block that holds the number of the next block of the file, NIL at the end of a file, or UN for an unused block. Sample FAT contents: 4 5 NIL 12 11 6 9 21 34 NIL UN NIL 7 UN]
  What are the advantages and disadvantages of block chain, index block, and FAT?

Structure of File (Cont`)
  sequential allocation
  [Figure: the directory entry holds the file name, start block, and size; the file occupies consecutive disk blocks]
  What are the advantages and disadvantages of sequential and non-sequential allocation?

Structure of File (Cont`)
  inode in the UNIX file system
    • fields: i_inode_number, i_mode, i_nlink, i_uid, gid, i_rdev, i_atime, ctime, mtime, direct block pointers, …, indirect block pointers
    • i_mode: type (4 bits) | u g s | r w x | r w x | r w x
    • types: S_IFSOCK, S_IFLNK, S_IFREG, S_IFBLK, S_IFDIR, S_IFCHR, S_IFIFO

Structure of File (Cont`)
  inode in the UNIX file system: find the block
    • assume the size of a disk block is 4K
    • which block is accessed if f_offset is 10000? (or 47000?)
  [Figure: file table entry (f_offset) → inode with direct pointers 4, 7, 12, 18, 24, 33, …, 41 and indirect pointers 165, 169]

Structure of Directory
  connects a file name to disk block(s)
    • directory entry in the UNIX FS: inode number | file name
    • directory entry in DOS: file name | extension | attributes | time | first block number
  provides a hierarchical structure for the file system
  [Figure: inode 1 (i_mode, time, …) → disk block 1; inode 3 → disk block 7; inode 23 → disk block 39]

    disk block 1 (/)      disk block 7 (/usr)     disk block 39 (/usr/member)
      1  .                  3  .                    23  .
      1  ..                 1  ..                    3  ..
      3  usr               12  src                  32  jim
      4  dev               16  include              33  tom
      5  etc               17  lib                  37  mark
      6  vmunix            20  bin                  41  sooni
      7  var               23  member               42  mjc
      9  mnt               25  local

Structure of Directory (Cont`)
  hierarchical view
    /
      usr
        src, include, lib, bin, local
        member
          jim, tom, mark, sooni, mjc
      dev
      etc
      var
      mnt
      vmunix

Structure of Directory (Cont`)
  open example: open("/usr/member/sooni/test.c", O_RD)
    • find the inode using the directory structure (namei())
    • allocate the fd and the file table entry, and initialize them
  [Figure: proc table → fd → file table (f_offset, …) → inode]

Structure of File System
  file system: boot block, super block, inodes, data blocks
  [Figure: a system with disks /dev/hda and /dev/hdb; /dev/hda is divided into the partitions /dev/hda1, /dev/hda2, /dev/hda3; each partition holds boot block | super block | i-nodes | disk blocks]

Structure of File System (Cont`)
  super block: manages the information for a file system (cf. an inode for a file)
    • struct superblock: s_type, s_flag, s_dev, s_blocksize, s_magic, s_name, …, s_free_inode[], s_free_disk_block[]
    • s_free_inode[]: free inode list (map); related routines: iget, iput
    • s_free_disk_block[]: free disk block list (map); related routines: balloc, bfree

Structure of File System (Cont`)
  super block: free disk block list
  [Figure: s_free_disk_block[] in the super block holds 29 27 26 24 21 20 19; disk block 29 holds the next part of the list, 61 57 56 54 51 50 48 46 45 43 42 41 39 38 37 34; disk block 61 holds the part after that, and so on]

Structure of File System (Cont`)
  mount: "mount /dev/hda3 /mnt"
  [Figure: vfsmntlist links the vfsmount structures; the vfsmount for /dev/hda3 leads to its super block (s_dev, s_blocksize, mounted point, root inode, …); the mounted point is the inode for /mnt, and the root inode is the inode for the root of the FS on /dev/hda3]
  open("/mnt/test.c", O_RD)
  (each vfsmount on the list points to its super block through mnt_sb)

Inode for special file
  inode structure for special files
    • pipe
        - no indirect block (unnamed pipe)
        - readers, writers, read pointer, write pointer
    • special device file
        - no direct or indirect blocks
        - device number: major number + minor number
        - major number: the device type; used as the index into the device switch table
        - minor number: the device unit; passed as an argument to the device driver

Existing File System
  S5FS
    • the first, conventional UNIX file system
  FFS
    • supports file names of up to 255 characters
    • cylinder groups
    • fragments
    • directory entry for ffs: i_no | size | file_name
    • fast file system structure: boot block | super block | cylinder group 1 (inodes, disk blocks) | cylinder group 2 | …
  LFS
    • optimizes small writes
    • suitable for RAID storage systems
  VxFS (journaling file system)
    • fast recovery using internal logging

Existing File System
  ext2 file system
    • the Linux default file system
    • similar to Berkeley’s FFS
    • inode: 12 direct blocks
    • uses bitmaps for free block and inode management
    • fault-tolerant features
  ext2 file system structure
    • boot block | block group 0 | block group 1 | … | block group n
    • each block group: super block | group descriptors | block bitmap | inode bitmap | inode table | data blocks

Existing File System
  NFS
    • stateless protocol
    • XDR (External Data Representation)
  AFS, Coda file system
    • disconnected operation
  Sprite file system
    • strong consistency
  VFS
    • supports various file systems
  [Figure: on the NFS client, an application's system call enters the VFS, which dispatches to UFS, mfs, procfs, or NFS; NFS uses the RPC stub and XDR to reach the NFS server, where nfsd enters the server's VFS and UFS]

swap space management
  [Figure: the images (stack, data, text) of swapped-out processes are stored one after another in a 64M swap space: P1, P2, P3, P4, P5, P6]

swap space management
  Scenario
    • swap out P1 (3M)
    • swap out P2 (3M)
    • swap out P3 (2M)
    • swap out P4 (1M)
    • swap out P5 (3M)
    • swap out P6 (4M)
    • swap in P2
    • swap in P4
    • swap in P5
  swap map after the scenario (labeled "swap used map" on the slide; the entries are the free extents, in M)
    start  end  size
      3     6     3
      8    12     4
     16    64    48
  Why does UNIX manage swap space differently from the FS?

V. Inter-Process Communication

Inter-Process Communication (IPC)
  • synchronization
  • pipes
  • communication via files
  • signal
  • System V IPC: message queue, shared memory, semaphore
  • IPC with sockets

synchronization
  parallelism
    • multiprocessor (true parallelism) or time sharing (quasi-parallelism)
  race condition: more than one process wants to access the same shared resource
  mutual exclusion
    • only one process at a time can access a shared resource
    • critical section: a portion of a program that accesses a shared resource
  representative mechanisms: ipl, lock, semaphore, test&set
  deadlock

synchronization (Cont’)
  example of race condition I (Source: Adv. Programming in the UNIX Env., pgm 8.7)

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void charatatime(const char *str);

    int main(void)
    {
        pid_t pid;

        if ((pid = fork()) == 0)           /* child */
            charatatime("output from child\n");
        else                               /* parent */
            charatatime("output from parent\n");
        exit(0);
    }

    /* write the string one character at a time, unbuffered */
    static void charatatime(const char *str)
    {
        const char *ptr;
        int c;

        setbuf(stdout, NULL);
        for (ptr = str; (c = *ptr++) != 0; )
            putc(c, stdout);
    }

  Guess what the results are.
  sample output: outpuot utfprut froom chmild parent

synchronization (Cont`)
  system internals
  [Figure: the fd entries in the task structures of both processes point to the same file structure (f_pos) and inode; the file structure is the shared resource]

synchronization (Cont`)
  example of race condition II
  scenario
    • process P1 is currently being dispatched (removed from the ready queue)
    • a disk interrupt occurs
    • the disk interrupt handler wakes up process P2 and wants to insert it into the ready queue
  [Figure: three snapshots of the ready queue RQ (P4, P1, P3) while P1 is being unlinked and P2 is being inserted at the same time]

synchronization (Cont`)
  ipl (interrupt priority level)

    BSD            SVR4            Purpose
    spl0           spl0            enable all interrupts
    splsoftclock   spltimeout      disable functions scheduled by timers
    splnet                         disable network protocol processing
                   splstr          disable STREAMS interrupts
    spltty         spltty          disable terminal interrupts
    splbio         spldisk         disable disk interrupts
    splclock                       disable hardware clock interrupt
    splhigh        spl7 or splhi   disable all interrupts
    splx           splx            restore ipl to previously saved value

synchronization (Cont`)
  lock
    • associate a lock variable with each shared resource
    • lock before (and unlock after) the critical section
  spin_lock primitive (Source: UNIX Internals)

    void spin_lock(spinlock_t *s)
    {
        while (test_and_set(s) != 0)
            ;                              /* busy-wait until the lock is free */
    }

    void spin_unlock(spinlock_t *s)
    {
        *s = 0;
    }

synchronization (Cont`)
  sleep_lock: a process wants the resource
    1. is it locked?
         yes: sleep on the resource until awakened by another process, then check again
         no:  lock the resource
    2. use the resource
    3. unlock the resource
    4. does anyone want it?
         yes: wake up all waiting processes
         no:  continue other processing
  spin lock or sleep lock, lock granularity, rw_lock (try_lock)

synchronization (Cont`)
  semaphore
    • an object that can be accessed only through the P and V (and sem_initialize) methods
  semaphore primitive (pseudocode)

    void initsem(semaphore_t *sem, int val)
    {
        *sem = val;
    }

    void P(semaphore_t *sem)            /* acquire */
    {
        *sem -= 1;
        while (*sem < 0)
            sleep;                      /* block until a V() wakes us */
    }

    void V(semaphore_t *sem)            /* release */
    {
        *sem += 1;
        if (processes slept on sem queue)
            wake up the processes slept on sem;
    }
    (Source: UNIX Internals)

Synchronization (Cont'd)
  semaphore: example
    client: produce an item; put the item into shared memory
    server: remove an item from shared memory; consume the item

Synchronization (Cont'd)
  semaphore: example with sem1, sem2
    initsem(sem1, 5)      /* free slots in shared memory */
    initsem(sem2, 0)      /* items waiting in shared memory */

    client                                server
    produce an item                       P(sem2)
    P(sem1)                               remove an item from shared memory
    put the item into shared memory       V(sem1)
    V(sem2)                               consume the item

Synchronization (Cont'd)
  semaphore in the Linux kernel
    widely used for "wait until a condition is met" (e.g. reading disk blocks)
    declare a semaphore for each shared resource
  semaphore    /* include/asm-i386/semaphore.h, kernel/sched.c */

    /* simplified from the kernel source */
    struct semaphore {
        atomic_t count;
        struct wait_queue *wait;
    };

    void down(struct semaphore *sem)
    {
        while (sem->count <= 0)
            sleep_on(&sem->wait);
        sem->count--;
    }

    void up(struct semaphore *sem)
    {
        sem->count++;
        wake_up(&sem->wait);
    }

    [Figure: process 1 and process 2 each run down(x); critical section; up(x) around the shared resource, with struct semaphore *x]

Synchronization (Cont'd)
  semaphore in the Linux kernel
  sleep, wakeup    /* include/linux/wait.h, kernel/sched.c */

    /* simplified from the kernel source */
    struct wait_queue {
        struct task_struct *task;
        struct wait_queue *next;
    };

    void sleep_on(struct wait_queue **queue)
    {
        struct wait_queue entry = { current, NULL };

        current->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue(queue, &entry);
        schedule();
        remove_wait_queue(queue, &entry);
    }

    void wake_up(struct wait_queue **queue)
    {
        struct wait_queue *p = *queue;

        do {
            p->task->state = TASK_RUNNING;
            add_to_runqueue(p->task);
            p = p->next;
        } while (p != *queue);
    }

  see also: interruptible_sleep_on(), wake_up_interruptible()

Synchronization (Cont'd)
  Deadlock
    a system state in which processes wait for events that never occur.
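The deadlock definition can be checked mechanically in the simple case where every blocked process waits for exactly one other process: the processes that wait forever are the ones on a cycle of that wait-for relation. A minimal sketch; the array encoding and the names `in_wait_cycle`, `NOT_WAITING` and `MAX_PROC` are assumptions made for illustration, not kernel code.

```c
#include <assert.h>

#define MAX_PROC    64
#define NOT_WAITING (-1)

/*
 * waits_for[i] is the index of the process holding the resource that
 * process i is blocked on, or NOT_WAITING if process i can run.
 * Returns 1 if following the chain from `start` leads back to `start`,
 * i.e. `start` sits on a wait-for cycle; returns 0 otherwise.
 */
int in_wait_cycle(const int waits_for[], int nproc, int start)
{
    int p = start;

    assert(nproc > 0 && nproc <= MAX_PROC);
    assert(start >= 0 && start < nproc);
    for (int steps = 0; steps < nproc; steps++) {
        p = waits_for[p];
        if (p == NOT_WAITING)
            return 0;               /* the chain ends at a runnable process */
        if (p == start)
            return 1;               /* came back to where we began */
    }
    return 0;                       /* chain entered a cycle that excludes start */
}
```

A process that returns 0 in the last case is still stuck behind a deadlock, but it is not one of the processes forming the cycle; breaking the cycle would release it.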
    [Figure: resource allocation graph over processes 1-4 and resources 1-4]

Synchronization (Cont'd)
  Deadlock
    deadlock prevention
    deadlock avoidance
    deadlock detection and correction
      reduction of the resource allocation graph
      [Figure: successive reductions of a resource allocation graph over P1, P2, P3 and R1, R2]

pipe
  named pipe, unnamed pipe
  pipe(fd[]), mkfifo(path, mode), mknod(path, mode, dev_t)
  [Figure: process 1 writes to the write fd, process 2 reads from the read fd; the pipe itself lives in the kernel]
  no indirect blocks in the inode
  rd_pointer, wr_pointer, number of readers, number of writers
  file types: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO

pipe
  pipe (unnamed pipe) limits
    cannot broadcast
    no object boundaries
    cannot direct data to a specific reader
  FIFO (named pipe)
    a FIFO file must be explicitly deleted (unlink)
    named
    less secure than a pipe

pipe (Cont'd)
  example of pipe: "% ls -l | more" (shell pseudocode)

    for (;;) {
        read_command();
        parsing_command();
        pipe(fd);
        if (fork() == 0) {              /* child: becomes "more" */
            close(stdin);
            dup(fd[0]);                 /* stdin now reads from the pipe */
            if (fork() == 0) {          /* grandchild: becomes "ls" */
                close(stdout);
                dup(fd[1]);             /* stdout now writes to the pipe */
                exec("ls", ...);
            }
            exec("more", ...);
        }
        wait();
    }

Communication via files
  the oldest way of exchanging data among processes
  [Figure: two processes P sharing one file]
  race conditions may occur
    reading data before the other process has finished modifying it
  mandatory or advisory locking
    lockf, flock, fcntl
    fcntl(fd, cmd, arg)
      cmd: F_GETLK, F_SETLK, ...
      flock structure: l_type, l_whence, l_start, l_len, l_pid
      l_type: F_RDLCK, F_WRLCK, F_UNLCK, F_SHLCK, F_EXLCK

Communication via files (Cont'd)
  a deadlock scenario with file locking
  [Figure: two processes P, each holding a lock on part of the file and waiting for the part the other holds]
  In Linux, fcntl() returns the error EDEADLK (also spelled EDEADLOCK)

Signal
  register signal handler (signal catch function)
  send signal
  signal detection: on the state transition from kernel running to user running
  call signal handler
  variables for signals in the task structure
    int sigpending: is a signal received or not?
    struct signal_struct *sig
    sigset_t signal, blocked

    typedef struct {
        unsigned long sig[_NSIG_WORDS];
    } sigset_t;                         /* asm-i386/signal.h */

    struct signal_struct                /* sched.h */
      count
      action[_NSIG]
      siglock
    struct sigaction                    /* asm-i386/signal.h */
      sa_handler
      sa_flags
      sa_restorer
      sa_mask

System V IPC
  Message, Shared Memory, and Semaphore
  Common properties
    key => id (cf: file name => fd)
    in the kernel, ***id_ds for System V IPC (e.g. msqid_ds)
    ipc_perm: key, uid, cuid, access mode, ...
    ipcs, ipcrm
  Differences
    message: suitable for the object-oriented concept
    shared memory: fast
    semaphore: for user-level synchronization

System V IPC (Cont'd)
  message queue

    msqid = sys_msgget(key, flag)                            /* create */
    sys_msgsnd(msqid, msgp, msgsz, flag)                     /* send */
    sys_msgrcv(msqid, msgp, msgsz, msgtype, flag)            /* receive */
    sys_msgctl(msqid, cmd, msqid_ds)                         /* control */

  [Figure: sender processes P -> struct msqid_ds holding a chain of msg -> receiver processes P]

System V IPC (Cont'd)
  struct msqid_ds
    msg_perm
    msg_first, msg_last     (head and tail of the msg list)
    msg_stime, msg_rtime, msg_ctime
    wwait_queue, rwait_queue
    msg_cbytes, msg_qnum, msg_qbytes
    msg_lspid, msg_lrpid
  each msg on the list
    msg_next, msg_type, msg_spot, msg_ts
  msgtype in sys_msgrcv()
    = 0 : receive the first msg in the queue
    > 0 : receive the first msg of the given type in the queue
    < 0 : receive the msg with the smallest type that is less than or equal to |msgtype|

System V IPC (Cont'd)
  shared memory

    shmid = sys_shmget(key, size, flag)
    sys_shmat(shmid, shmaddr, shmflag, raddr)
    sys_shmdt(shmaddr)
    sys_shmctl(shmid, cmd, shmid_ds)

  struct shmid_ds
    shm_perm
    shm_segsz
    shm_atime, shm_dtime, shm_ctime
    shm_cpid, shm_lpid
    shm_nattach
    shm_npage
    shm_pages       /* for page table entries */
    attaches        /* struct vm_area_struct */

System V IPC (Cont'd)
  using shared memory
  [Figure: vm areas of process A and process B (text, data, heap, stack, kernel). The shared memory region is mapped at 0xa27e0000-0xa27e8000 in one process and at 0x77e5000-0x77ed000 in the other]

System V IPC (Cont'd)
  semaphore

    semid = sys_semget(key, nsems, flag)
    semop(semid, sops, nsops)
    semctl(semid, semnum, cmd, *arg)

    struct sembuf sops;
    struct sembuf {
        unsigned short sem_num;
        short sem_op;
        short sem_flg;
    };

    if (sem_op > 0) V() operation
    else            P() operation

  struct semid_ds
    sem_perm
    sem_otime, sem_ctime
    sem_base
    sem_pending
    ......
    sem_nsems

socket
  socket
    common interface for IPC and networking
    protocol families: UNIX, INET, AX25, IPX, Appletalk
  layer structure of a network
    BSD socket
    INET
    TCP, UDP
    IP (with ARP)
    PLIP, SLIP, ETHERNET
    parallel port, serial port, Ethernet card

socket (Cont'd)
  information for communication
    5-tuple {protocol, local-addr, local-process, foreign-addr, foreign-process}
  C library routines
    socket() : protocol, make socket structure
    bind() : assign local-addr and local-process
    connect() : foreign-addr, foreign-process
    listen() : waiting in server
    accept() : make connection to a client
    read(), write()
    send(), sendto(), recv(), recvfrom()
    cf) system call: sys_socketcall    /* net/socket.c */

socket (Cont'd)
  socket structure
    file
      ...
      f_dentry
      ...
      f_pos
      f_op -> /* net/socket.c */
              sock_lseek, sock_read, sock_write, NULL, sock_poll, sock_ioctl, NULL, sock_no_open, ...

    /* include/linux/net.h */
    struct socket {
        state
        type
        flags
        ops         /* proto_ops */
        sk          /* struct sock */
        files, inodes
        next, wait, ...
    }

    /* include/net/sock.h */
    struct sock { ... }

    /* include/linux/net.h */
    struct proto_ops {
        family
        dup, release, bind, connect, accept, listen, ...
        getsockopt
        setsockopt
        sendmsg
        recvmsg
    }           /* for INET operation */

socket (Cont'd)
  connection-oriented protocol

    server                                      client
    socket()
    bind()
    listen()
    accept()                                    socket()
      blocks until connection from a client
      <--- connection established ---           connect()
    read()    <--- data (request) ---           write()
    processing request
    write()   --- data (reply) --->             read()

socket (Cont'd)
  connectionless protocol

    server                                      client
    socket()                                    socket()
    bind()                                      bind()
    recvfrom()
      blocks until data received from a client
              <--- data (request) ---           sendto()
    processing request
    sendto()  --- data (reply) --->             recvfrom()

TLI
  connection-oriented protocol

    server                                      client
    t_open()                                    t_open()
    t_bind()                                    t_bind()
    t_listen()
      wait for connection
              <--- connection request ---       t_connect()
    t_accept()
    t_rcv()   <--- data (request) ---           t_snd()
    processing request
    t_snd()   --- data (reply) --->             t_rcv()

VI. I/O System (Device Driver)

Role of a device driver
  handle data movement between memory and peripheral devices
  usually written by a third party
  [Figure: processes P -> system call interface -> kernel file system -> device driver interface (through devsw table) -> tty driver, disk driver, network driver]

Peripheral Device: General Structure
  H/W configuration
    extremely hardware dependent
  controller
    CSR (Control and Status Register)
      - the driver writes to the CSRs to issue commands to the device and reads the CSRs to obtain completion status or error conditions
      - memory-mapped I/O, or special in/out instructions (e.g. the 80*86 in/out instructions)
      - programmed I/O (tty, modem, printer), DMA (disk)
    internal buffer
  device itself

Disk Driver
  Disk I/O handling
    convert logical disk block number into physical sector(s)
    handle read/write requests, handle interrupts
    disk scheduling
      FCFS
      SSTF (Shortest Seek Time First)
      SCAN
      C-SCAN
      ...
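The difference between the first two scheduling policies in the list can be made concrete by counting head movement. A minimal sketch, assuming the request queue is a plain array of cylinder numbers; the function names `fcfs_seek` and `sstf_seek` and the limit `MAX_REQ` are hypothetical, not taken from any driver.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_REQ 64

/* FCFS: total head movement when requests are served in arrival order. */
long fcfs_seek(int head, const int req[], int n)
{
    long total = 0;

    for (int i = 0; i < n; i++) {
        total += labs((long)req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: total head movement when the closest pending request goes next. */
long sstf_seek(int head, const int req[], int n)
{
    int done[MAX_REQ] = {0};
    long total = 0;

    assert(n >= 0 && n <= MAX_REQ);
    for (int served = 0; served < n; served++) {
        int best = -1;

        for (int i = 0; i < n; i++) {
            if (done[i])
                continue;
            if (best < 0 ||
                labs((long)req[i] - head) < labs((long)req[best] - head))
                best = i;           /* ties keep the earlier arrival */
        }
        total += labs((long)req[best] - head);
        head = req[best];
        done[best] = 1;
    }
    return total;
}
```

For the queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at cylinder 53, this sketch gives 640 cylinders of movement for FCFS and 236 for SSTF. SSTF can starve far-away requests, which is the problem SCAN and C-SCAN address.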
    DMA (channel)
    RAID

Terminal Driver
  interactive: line discipline
  canonical mode, raw mode (stty)
  cblock
  [Figure: process -> tty_read <- canon queue <- raw queue (clists) <- rbuf <- interrupt; process -> tty_write -> out queue -> xbuf -> tty driver -> in/out on the CSR]

General structure of Device Driver
  well-defined entry points
  top half, bottom half

    character device driver     block device driver
    open                        open
    close                       close
    read                        strategy
    write                       size
    ioctl                       intr
    intr
    mmap
    (both reach the device through in/out on the CSR)

  what's the difference between a character and a block device driver?

Device Switch Table
  devsw: table for registering the entry points of device drivers

    struct cdevsw {
        int (*d_open)();
        int (*d_close)();
        int (*d_read)();
        int (*d_write)();
        int (*d_ioctl)();
        int (*d_mmap)();
        int (*d_segmap)();
        int (*d_xpoll)();
        int (*d_xhalt)();
        struct streamtab *d_str;
        struct ttytab *d_tty;
        ...
    } cdevsw[];

    struct bdevsw {
        int (*d_open)();
        int (*d_close)();
        int (*d_strategy)();
        int (*d_size)();
        int (*d_xhalt)();
        ...
    } bdevsw[];
    (Source: UNIX Internals)

Device Switch Table (Cont'd)
  Example of switch table

    bdevsw
      hd_open   hd_close   hd_strategy
      ht_open   ht_close   ht_strategy
      cd_open   cd_close   cd_strategy

    cdevsw
      con_open  con_close  con_read  con_write  con_ioctl
      tty_open  tty_close  tty_read  tty_write  tty_ioctl
      ed_open   ed_close   ed_read   ed_write   ed_ioctl
      nulldev   nulldev    mm_read   mm_write   nulldev
      hd_open   hd_close   hd_read   hd_write   nulldev

  dev file

    # ls -l /dev/
    mode         major  minor  name
    brw-r--r--   0      1      hda1
    brw-r--r--   0      2      hda2
    ...
    brw-r--r--   0      11     hdb1
    brw-r--r--   1      0      tape
    ...
    crw-r--r--   1      0      tty0
    crw-r--r--   1      1      tty1
    ...
    crw-r--r--   5      0      rhda1

  why do we access disks through the character interface?
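How a switch table turns a device number into a driver call can be sketched in user space. This is an illustration under stated assumptions, not kernel code: the 8-bit major / 8-bit minor encoding, the `SK_`-prefixed names and the stub drivers are all hypothetical.

```c
#include <assert.h>

/* Hypothetical 8-bit major / 8-bit minor encoding, for illustration only. */
#define SK_MKDEV(major, minor) ((unsigned)(((major) << 8) | (minor)))
#define SK_MAJOR(dev)          ((dev) >> 8)
#define SK_MINOR(dev)          ((dev) & 0xffu)

struct sk_cdevsw {
    int (*d_open)(unsigned dev);
    int (*d_close)(unsigned dev);
};

/* Stub drivers: each open returns a value that identifies which driver ran. */
static int con_open(unsigned dev)  { return 1000 + (int)SK_MINOR(dev); }
static int con_close(unsigned dev) { (void)dev; return 0; }
static int tty_open(unsigned dev)  { return 2000 + (int)SK_MINOR(dev); }
static int tty_close(unsigned dev) { (void)dev; return 0; }

/* The switch table: the major number is the row index. */
static struct sk_cdevsw sk_cdevsw[] = {
    { con_open, con_close },    /* major 0 */
    { tty_open, tty_close },    /* major 1 */
};
#define SK_NCDEV (sizeof(sk_cdevsw) / sizeof(sk_cdevsw[0]))

/* Index the table by major number and hand the whole dev to the driver,
 * which uses the minor number to pick the unit. */
int sk_dev_open(unsigned dev)
{
    assert(SK_MAJOR(dev) < SK_NCDEV);
    return (*sk_cdevsw[SK_MAJOR(dev)].d_open)(dev);
}

int sk_dev_close(unsigned dev)
{
    assert(SK_MAJOR(dev) < SK_NCDEV);
    return (*sk_cdevsw[SK_MAJOR(dev)].d_close)(dev);
}
```

The kernel proper never names a driver function; installing a driver amounts to filling in one row, which is why a new device needs only a table entry and a special file carrying the matching major number.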
Device Switch Table (Cont'd)
  example: open

    open("/dev/tty0", O_RDONLY)

    proc table (fd) -> file table -> inode (i_dev: c, 1, 0) -> cdevsw

    cdevsw
      con_open  con_close  con_read  con_write  con_ioctl
      tty_open  tty_close  tty_read  tty_write  tty_ioctl
      ed_open   ed_close   ed_read   ed_write   ed_ioctl
      nulldev   nulldev    mm_read   mm_write   nulldev
      gd_open   gd_close   gd_read   gd_write   nulldev

    (*cdevsw[getmajor(dev)].d_open)(dev, ...)

Device Switch Table (Cont'd)
  install a new device driver
    write the new device driver and link it into the kernel
      my_open(), my_read(), my_write(), my_close(), ...
    register it in the devsw table
    make the special file
      # mknod /dev/mydrv [b|c] major_number minor_number

Device Switch Table (Cont'd)
  control flow
  [Figure: user mode read() -> kernel -> devsw table -> driver (request queue, sleep) -> device; device interrupt -> IVT -> interrupt handler -> wakeup]
  where does the requesting process sleep?

STREAM
  full-duplex data transfer and processing path
  consists of pairs of queues
  [Figure: user application -> (user/kernel boundary) -> STREAM head (W, R) -> STREAM module (W, R) -> STREAM driver (W, R) -> hardware]

STREAM (Cont'd)
  Reusable Module
  [Figure: two stacks, user -> STREAM head -> TCP -> IP -> token ring, and user -> STREAM head -> UDP -> IP -> ethernet]
  Multiplexing
  [Figure: three users, each with its own STREAM head, over TCP and UDP, one IP, and the drivers ATM and DQDB]

STREAM (Cont'd)
  STREAM features
    transparency among the queues
    reusable
    multiplexing
    message-based communication
    virtual copying
    STREAM scheduler: priority bands

Part II. Detailed Study: Linux Kernel Internals

Contents
  why Linux?
  where is everything (kernel source code)?
  kernel configure and compile
  system call implementation
  module programming
  some important kernel data structures

References
  M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, "Linux Kernel Internals, 2nd Ed", Addison-Wesley, 1997
  Fred Butzen, Christopher Hilton, "The LINUX Network", The M&T Books Slackware Series, 1998
  Remy Card, etc, "The LINUX KERNEL Book", John Wiley & Son, 1998
  A.
Rubini, "LINUX Device Driver", O'REILLY, 1998
  Anonymous, "Maximum Linux Security (A Hacker's Guide To Protecting Your Linux Server and WS)", SAMS Publishing, 1999
  http://www.linux.org/
  http://www.kernel.org/
  http://kldp.org/
  /usr/src/linux

Why Linux?
  freely available
    Linus Torvalds, Copyleft
    1991: version 0.01 (November 1999: version 2.2.13)
    Redhat, Debian, Slackware, Alzza
    supported by many companies
  Main characteristics
    multi-tasking
    multi-user access
    multi-processor support
    various architectures (80*86, sparc, mips, alpha, smp, ..)
    demand-loaded executables
    paging
    dynamic cache for hard disk

Why Linux? (Cont'd)
  main characteristics (cont'd)
    shared libraries
    support for POSIX 1003.1
    various formats for executable files
    true 386 protected mode
    maths co-processor emulation
    support for national keyboards and fonts
    support for diverse file systems (ext2, ..)
    TCP/IP, SLIP, PPP
    BSD sockets
    System V IPC
    Virtual Console

Why Linux? (Cont'd)
  drawbacks
    monolithic kernel (micro-kernel versions are the subject of much current research)
    not for beginners (for system programmers)
    not well structured (performance-oriented)
  Key attractions
    "experimenting" with the system (handle the kernel by yourself)
    supported by many companies
    free: solution business & add-on features
    thanks to the INTERNET & GNU
    (special thanks to anti-MS feeling)

Where is everything?
  Linux Operating System Structure
    user level
      application
    kernel level
      System Calls Interface
      File System: ext2fs, xiafs, minix, nfs, iso9660, proc, msdos; Buffer Cache
      Central kernel: task management, scheduler, signals, memory management, loadable modules
      Peripheral Manager: block, character, network (hd, cdrom, isdn, scsi, pci)
      Network Manager: ipv4, ethernet, ...
      Machine Interface
    H/W level
      Machine
  (Source: The LINUX KERNEL Book)

Where is everything?
(Cont'd)
  source structure
    based on version 2.2.5
    under development: the contents described below may change

    /usr/src/linux
      Doc
      arch      alpha, arm, m68k, mips, ppc, sparc, i386
        i386    boot, kernel, lib, math-emu, mm
      driver    block, cdrom, char, net, pci, pnp, sbus, scsi, sound, video
      fs        coda, ext2, msdos, nfs, ntfs, ufs, hpfs
      include   asm-alpha, asm-arm, asm-i386, linux, net, scsi, video
      init
      ipc
      kernel
      lib
      mm
      net       802, appletalk, decnet, ethernet, ipv6, unix, sunrpc, x25
      scripts

Where is everything? (Cont'd)
  main subdirectory
    arch/
      architecture-dependent code: arch/i386, arch/alpha, ...
      arch/i386/boot/
        - bootstrapping
        - configure devices, memory
      arch/i386/kernel/
        - kernel entry point handling (trap/interrupt handling)
        - context switch
      arch/i386/mm/
        - machine-dependent memory management code
    init/
      all the functions needed to start the kernel
      hand-made process 0 (init_task or task[0])
      fork process 1, 2, 3, ...

Where is everything? (Cont'd)
  main subdirectory
    kernel/ (arch/i386/kernel)
      central section of the kernel
      main system call implementation (fork, exit, etc.)
      time management
      scheduler
      signal handling
    mm/
      virtual memory interface
      paging, kernel memory management
    fs/
      virtual file system interface
      implementations of the various file systems (ext2, nfs, ...)

Where is everything? (Cont'd)
  main subdirectory
    drivers/
      drivers for hardware components
      drivers/block/ : block-oriented drivers (hard disks)
      drivers/cdrom/ : proprietary CD-ROM drives
      drivers/char/ : character-oriented drivers (serial ports, tty, modem, ..)
      drivers/net/ : network cards
      drivers/pci/ : PCI bus access and control
      drivers/scsi/ : SCSI interface
      drivers/sound/ : sound card drivers
    ipc/
      classical inter-process communication
      semaphores, shared memory, message queues

Where is everything? (Cont'd)
  main subdirectory
    net/
      various network protocol implementations: TCP/IP, ARP, ...
      code for sockets in the UNIX and Internet domains
    lib/
      some standard kernel library functions (printk)
    modules/
      kernel module files
      modules can be added to the kernel later (insmod, rmmod)
    include/
      commonly included kernel-specific header files
      include/asm-i386/ : architecture-dependent header files for Intel CPUs
      include/linux/ : Linux kernel internal structures (task, inode)

Kernel Configuration and Compile
  a new kernel is generated in three steps
    1. configure (Documentation/Configuration.help, see chapter 3 of "The LINUX Network")
         make config (menuconfig, xconfig)
         make oldconfig
    2. depend
         make dep (make clean: optional)
    3. compile
         make zImage
         cf) - make zdisk (# dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0)
             - make zlilo (# cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz), /etc/lilo.conf
             - # mkbootdisk --device /dev/fd0 zImage

Add New System Call
  System Call: control flow in the Linux kernel
    user process
      do system call
    libc.a
      push args
      save system call number
      make trap
    idt_table                       /* arch/i386/kernel/traps.c */
    system call handler system_call()   /* arch/i386/kernel/entry.S */
      catch trap through IDT
      call the real handler function using sys_call_table
    sys_call_table                  /* arch/i386/kernel/entry.S */
    real system call function

Add New System Call (Cont'd)
  IDT (Interrupt Descriptor Table)
    defined in: include/asm-i386/desc.h, arch/i386/kernel/traps.c, irq.h
    constructed during kernel initialization    /* arch/i386/kernel/traps.c, irq.c */

    idt_table
      0x0                            divide_error()           common trap handlers for 80*86
                                     debug()
                                     nmi()
                                     ...
                                     segment_not_present()
                                     ...
                                     page_fault()
                                     ...
      0x20 (FIRST_EXTERNAL_VECTOR)   timer_interrupt()        device interrupt handlers (IRQ)
                                     hd_interrupt()
                                     ...
      0x80 (SYSCALL_VECTOR)          system_call()
                                     ...
      0xff

Add New System Call (Cont'd)
  sys_call_table
    syscall number: include/asm-i386/unistd.h

      #define __NR_exit       1
      #define __NR_fork       2
      #define __NR_read       3
      ...
      #define __NR_vfork    190

    sys_call_table: arch/i386/kernel/entry.S

      ENTRY(sys_call_table)
          .long SYMBOL_NAME(sys_ni_syscall)     /* 0 */
          .long SYMBOL_NAME(sys_exit)           /* 1 */
          .long SYMBOL_NAME(sys_fork)           /* 2 */
          .long SYMBOL_NAME(sys_read)           /* 3 */
          ...
          .long SYMBOL_NAME(sys_vfork)          /* 190 */
          .rept NR_syscalls-190

    resulting table
        0   sys_ni_syscall()
        1   sys_exit()
        2   sys_fork()
        3   sys_read()
        4   sys_write()
        ...
      190   sys_vfork()
        ...
      255

Add New System Call (Cont'd)
  putting it all together: example of fork

    user process
        main() { ... fork() }
    libc.a
        fork() { ... movl $2, %eax
                     int $0x80 ... }
    Kernel
      IVT
        0x0    divide_error()
               debug()
               nmi()
               ...
        0x80   system_call()
               ...
      ENTRY(system_call)            /* entry.S */
          SAVE_ALL
          ...
          call *SYMBOL_NAME(sys_call_table)(,%eax,4)
          ...
      sys_call_table
        ...
        1   sys_exit()
        2   sys_fork()
        3   sys_read()
        4   sys_write()
        ...
      sys_fork()                    /* arch/i386/kernel/process.c */
      -> /* kernel/fork.c */

Add New System Call (Cont'd)
  Syntax of real system call handlers in Linux

    asmlinkage int sys_fork(regs)               /* arch/i386/kernel/process.c */
    {
        return do_fork(..);
    }

    int do_fork(..)                             /* kernel/fork.c */
    {
        ...     /* create new process */
    }

    asmlinkage int sys_read(fd, buf, count)     /* fs/read_write.c */
    {
        ...     /* read data */
    }

Add New System Call (Cont'd)
  Example: add new system call 1 (a deliberately simple example)
  1. kernel modification
    1-1. allocate a syscall number: include/asm-i386/unistd.h

      #define __NR_exit         1
      ...
      #define __NR_vfork      190
      #define __NR_mysyscall  191

    1-2. register in sys_call_table: arch/i386/kernel/entry.S

      ENTRY(sys_call_table)
          ...
          .long SYMBOL_NAME(sys_mysyscall)      /* 191 */
          .rept NR_syscalls-191

Add New System Call (Cont'd)
    1-3. code the new system call handler

      asmlinkage int sys_mysyscall()
      {
          printk("Hello Linux, I'm in Kernel\n");
          return 0;
      }

    1-4. rebuild the kernel
      if you add a new file, you must make it known to the make utility
      e.g. for kernel/test.c, modify the following field of the Makefile in the kernel directory

      O_OBJS = sched.o dma.o fork.o ...
               ... capability.o test.o

Add New System Call (Cont'd)
  2.
make a user program that uses the new system call
    2-1. make the user program

      /* include/asm-i386/unistd.h */
      #define _syscall0(type, name) \
      type name(void) \
      { \
          long __res; \
          __asm__ volatile ("int $0x80" \
                            : "=a" (__res) \
                            : "0" (__NR_##name)); \
          __syscall_return(type, __res); \
      }

      #include <linux/unistd.h>

      _syscall0(int, mysyscall);

      main()
      {
          int i;

          i = mysyscall();
      }

    2-2. make a library if possible
      # ar, ranlib
  Just Do It (seeing it a hundred times is not as good as typing it once)

Add New System Call (Cont'd)
  add new system call 2: argument passing
  1. kernel modification
    1-1
      #define __NR_show_mult  192
    1-2
      .long SYMBOL_NAME(sys_show_mult)          /* 192 */
      .rept NR_syscalls-192
    1-3
      asmlinkage int sys_show_mult(int x, int y, int *res)
      {
          int error, compute;

          /* check that the user pointer is writable; include/asm-i386/uaccess.h */
          if ((error = verify_area(VERIFY_WRITE, res, sizeof(*res))))
              return error;
          compute = x * y;
          put_user(compute, res);               /* include/asm-i386/uaccess.h */
          return 0;
      }

      cf) copy_to_user(), copy_from_user()      /* include/asm-i386/uaccess.h */

Add New System Call (Cont'd)
  add new system call 2: argument passing
    2-1. make the user program

      #include <stdio.h>
      #include <errno.h>
      #include <linux/unistd.h>

      _syscall3(int, show_mult, int, x, int, y, int *, result);

      main()
      {
          int ret = 0;

          show_mult(2, 5, &ret);
          printf("Result : %d * %d = %d\n", 2, 5, ret);
      }

      /* what _syscall3 expands to (simplified); include/asm-i386/unistd.h */
      int show_mult(int x, int y, int *result)
      {
          long __res;

          __asm__ volatile ("int $0x80"
                            : "=a" (__res)
                            : "0" (__NR_show_mult), "b" ((long)(x)),
                              "c" ((long)(y)), "d" ((long)(result)));
          if (__res < 0) {                      /* kernel returned -errno */
              errno = -__res;
              __res = -1;
          }
          return __res;
      }

Add New System Call (Cont'd)
  add new system call 3: some general system calls
    getpid

      asmlinkage int sys_getpid()
      {
          return current->pid;
      }

      NR_TASKS: total number of concurrent tasks
      all tasks are connected in doubly linked lists (next_task, next_run)
      global variables: init_task, current
      task[0]: init_task, task[1]: init process

    nice

      asmlinkage int sys_nice(new_priority)
      {
          ...
          current->priority = new_priority;
      }

    pause

      asmlinkage int sys_pause()
      {
          current->state = TASK_INTERRUPTIBLE;
          schedule();
      }

Add New System Call (Cont'd)
  fork
    sys_fork()                      /* arch/i386/kernel/process.c */
    do_fork()                       /* kernel/fork.c */
      - p = alloc_task_struct()
      - task structure initialize
      - copy_mm() ...
      - copy_thread()
      - wake_up_process(p)
      - return (p->pid)
    copy_thread()                   /* arch/i386/kernel/process.c */
      ...
      - p->tss.eax = 0;
      - p->tss.eip = ret_from_fork;
    wake_up_process()               /* kernel/sched.c */
      - add_to_runqueue(p);
      - current->need_resched = 1
    ret_from_sys_call()             /* arch/i386/kernel/entry.S */
    schedule()                      /* kernel/sched.c */
      if (schedule parent) else (schedule child)

Add New System Call (Cont'd)
  exit
    sys_exit()                      /* kernel/exit.c */
    do_exit()                       /* kernel/exit.c */
      - sem_exit()
      - exit_mmap()
      - free_page_tables()
      - exit_files()
      - exit_thread()
      ...
      - handling each child process
      - current->state = TASK_ZOMBIE
      - schedule()
    notify_parent()                 /* kernel/signal.c */

Add New System Call (Cont'd)
  Project II: add a new system call that gets kernel information
    we want to know the process id, state, process execution time (system time and user time separately), the number of page faults, the number of open files, and so on
    1. kernel modification

      asmlinkage int sys_process_statistics(...)
      {
          ...
          current->pid, min_flt, maj_flt, times.tms_utime, times.tms_stime
          ...
      }

    2. user program

Motivation of Module in LINUX
  why do we use modules?
    Linux is a monolithic kernel
      trivial modifications require the kernel to be recompiled
      the kernel grows in size as new features are added
      many components occupy permanent space in memory though they are used rarely
    module: a step toward a micro-kernelized Linux
      small and compact kernel
      clean kernel
      rapid kernel
    solution business: components-based Linux
      e.g. backup tape driver

What can be Modules?
  what can be modules?
    possibly anything
    current version

      module type                register / unregister                              main operations
      file system                register_filesystem, unregister_filesystem         read_super, put_super
      block device driver        register_blkdev, unregister_blkdev                 open, release
      character device driver    register_chrdev, unregister_chrdev                 open, release
      network device driver      register_netdev, unregister_netdev                 open, close
      exec domain                register_exec_domain, unregister_exec_domain       load_binary, personality
      binary format              register_binfmt, unregister_binfmt                 load_binary
      ...

    cf: /lib/modules/x.x.x/*.o

How to manipulate modules?
  how to manipulate modules?
    compilation
      # gcc -D__KERNEL__ -D_LINUX -DMODULE -c new_module.c
    kernel configuration
      Enable loadable module support (CONFIG_MODULES) [Y/n/?]
      ...
      MSDOS fs support (CONFIG_MSDOS_FS) [M/n/y/?]
    insmod, lsmod, rmmod
      # insmod fat
      # lsmod
      Module:    #pages:    Used by:
      fat        6          0
      # rmmod fat
    kerneld: for on-demand loading
      e.g. mount -t msdos /dev/fd0 /mnt
        => transparently loads the fat & msdos modules

How to implement modules?
  Module
    two basic interfaces
      init_module()
      cleanup_module()
  [Figure: insmod calls the module's init_module(), which calls a kernel registration function such as register_filesystem(), register_blkdev(), register_netdrv() or sock_register(); rmmod calls the module's cleanup_module()]

How to implement modules? (Cont'd)
  example 1: Hello world!!

    /* hello.c */
    #include <linux/kernel.h>
    #include <linux/module.h>

    int init_module()
    {
        printk("Hello world!! - I'm in kernel\n");
        return 0;
    }

    void cleanup_module()
    {
        printk("Bye world - I'm in kernel\n");
    }

    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c hello.c
    # insmod hello.o
    # rmmod hello

How to implement modules?
(Cont'd)
  example 2: simple device driver

    /* time.c (sketch; elided parts are marked with ...) */
    #include <linux/kernel.h>
    #include <linux/module.h>

    #define HOUR_MAJOR 60
    #define HOUR_MINOR 0

    /* slots: lseek, read, write, readdir, poll, ioctl, mmap, open, flush, release */
    struct file_operations time_fops = {
        NULL, time_read, NULL, NULL, NULL, NULL, NULL, time_open, NULL, NULL
    };

    int time_init()
    {
        register_chrdev(HOUR_MAJOR, "time", &time_fops);
        printk("time module loaded (major=%d)\n", HOUR_MAJOR);
        return 0;
    }

    int time_read(fd, buf, size)
    {
        ...
        /* copy CURRENT_TIME out to the user buffer; the destination comes first */
        copy_to_user(buf, ..., ...);
    }

    int init_module()
    {
        return time_init();
    }

    int time_open(..)
    {
        ...
    }

    void cleanup_module()
    {
        unregister_chrdev(HOUR_MAJOR, "time");
        printk("time module unloaded \n");
    }

How to implement modules? (Cont'd)
  example 2: simple device driver

    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c time.c
    # mknod /dev/time c 60 0
    # insmod time
    # lsmod
    Module:    #pages:    Used by:
    time       1
    # cat /dev/time        /* print current time */
    # rmmod time

  how can the "cat" command invoke the time_read() function?

How to implement modules? (Cont'd)
  example 2: simple device driver
    register
      init_module
      time_init()
        register_chrdev(HOUR_MAJOR, "time", &time_fops);    /* major numbers: include/linux/major.h */
      register_chrdev()
        - chrdevs[major].name = "time"
        - chrdevs[major].fops = time_fops
      (the figure also shows register_blkdev())

How to implement modules? (Cont'd)
  example 2: simple device driver
    open
      sys_open()
        - get_unused_fd()
        - fd_install(fd, f)
      filp_open()
      open_namei()                  /* fs/namei.c */
        - struct file initialize
        - f->f_op->open()
          one of: chrdev_open(), pipe_open(), blkdev_open(), socket_open(), nfs_open()
      chrdev_open()                 /* fs/devices.c */
        - filp->f_op = get_chrfops(MAJOR(inode->i_rdev));   /* filp->f_op = chrdevs[major].fops */
        - filp->f_op->open;
      time_open()

How to implement modules? (Cont'd)
  example 2: simple device driver
    read
      sys_read()                    /* fs/read_write.c */
        - f->f_op->read
          one of: nfs_read(), pipe_read(), time_read(), tty_read(), block_read() /* fs/block_dev.c */

How to implement modules?
(Cont'd)
  example 3: system call wrapper

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <sys/syscall.h>
    #include <linux/sched.h>
    #include <asm/uaccess.h>

    extern void *sys_call_table[];

    int uid;                                    /* UID to spy on */

    asmlinkage int (*original_call)(const char *, int, int);
    asmlinkage int (*getuid_call)();
    asmlinkage int our_sys_open(const char *fname, int flags, int mode);

    int init_module()
    {
        original_call = sys_call_table[__NR_open];
        sys_call_table[__NR_open] = our_sys_open;
        printk("Spying on UID: %d\n", uid);
        getuid_call = sys_call_table[__NR_getuid];
        return 0;
    }

    void cleanup_module()
    {
        /* restore the entry only if nobody else replaced it after us */
        if (sys_call_table[__NR_open] == our_sys_open) {
            sys_call_table[__NR_open] = original_call;
        }
    }

How to implement modules? (Cont'd)
  example 3: system call wrapper

    asmlinkage int our_sys_open(const char *fname, int flags, int mode)
    {
        int i = 0;
        char ch;

        if (uid == getuid_call()) {
            printk("opened file by %d: ", uid);
            do {                                /* fname is a user-space pointer */
                get_user(ch, fname + i);
                i++;
                printk("%c", ch);
            } while (ch != 0);
            printk("\n");
        }
        return original_call(fname, flags, mode);
    }

How to implement modules? (Cont'd)
  example 4: new file system
    design the super block
    program the file operations, program the inode operations
    registering: register_filesystem()

      #ifdef CONFIG_MINIX_FS
          register_filesystem(&(struct file_system_type)
                              {minix_read_super, "minix", 1, NULL});
      #endif

    mount

      struct file_system_type {
          struct super_block *(*read_super)();
          char *name;
          int requires_dev;
          struct file_system_type *next;
      } *file_system;

How to implement modules? (Cont'd)
  Project III
    implement your own module
      make file operations
      make module interface
      make driver
      mknod (use a pseudo device such as memory)
    [Figure: module mydrv with init_module(), cleanup_module(), mydrv_init(), mydrv_open(), mydrv_release(), mydrv_read(), mydrv_write(), mydrv_ioctl(), mydrv_interrupt(), mydrv_out()]

How to implement modules?
(Cont'd)
  system calls for modules
    create_module
      memory allocation for the module (returns the load address)
      a new element on module_list
    init_module
      physical loading of the requested module (module functions become an integral part of the kernel)
      relocating module functions and resolving references to kernel symbols
      call the module-specific init_module function
    delete_module
    get_kernel_syms
      to get kernel symbols

How to implement modules? (Cont'd)
  Kernel data structure for create_module()
    module_list -> module -> module -> ...
    each module: next, ref, symtab, name, ..., size
      ref    -> references
      symtab -> symbol table for this module

Control flow of FS system call
  file access under Linux       /* include/linux/sched.h, fs.h */
    task structure
      ...
      fs    -> fs_struct: count, umask, *root, *pwd (-> inode)
      files -> files_struct: count, close_on_exec, fd[0], fd[1], ..., fd[255] (-> file)
      ...
    file
      f_mode, f_pos, f_flag, f_count, f_owner, f_inode (-> inode), f_op (-> file operation routines), f_version
  why do we need the file data structure?

Control flow of FS system call (Cont'd)
  why do we need the file data structure?
    => to support various types of files with a single coherent interface
  open
    sys_open()                  /* fs/open.c */
      - get_unused_fd()
      - fd_install(fd, f)
    filp_open()                 /* fs/open.c */
    open_namei()                /* fs/namei.c */
      - struct file initialize
      - f->f_op->open()         /* to support various files */

Control flow of FS system call (Cont'd)
  struct file                   /* include/linux/fs.h */
    f_next, f_prev
    f_dentry                    /* to access inode */
    f_op
    f_mode                      /* access type */
    f_pos                       /* file offset */
    f_count                     /* reference count */
    f_flags
    f_reada, f_ramax
    ...
  file operation example
    fs/ext2/file.c
      ext2_file_lseek, generic_file_read, ext2_file_write, NULL, NULL, ext2_file_ioctl, generic_file_mmap, NULL, ...
    fs/ufs/file.c
      ufs_file_lseek, generic_file_read, ufs_file_write, NULL, NULL, NULL, generic_file_mmap, NULL, ...
    fs/nfs/file.c
      NULL, nfs_file_read, nfs_file_write, NULL, NULL, NULL, nfs_file_mmap, nfs_file_open, ...
  where is create()?
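The "single coherent interface" idea, including the NULL slots seen in the operation tables, can be sketched in user space. The `sk_` names and the `zero_*` file type are hypothetical, and returning -EINVAL for a missing operation is an assumption about how the system call layer reacts, made only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct sk_file;

/* One slot per operation; a NULL slot means "this file type cannot do that". */
struct sk_file_operations {
    long (*read)(struct sk_file *f, char *buf, size_t n);
    long (*write)(struct sk_file *f, const char *buf, size_t n);
};

struct sk_file {
    long f_pos;                             /* file offset */
    const struct sk_file_operations *f_op;  /* chosen when the file is opened */
};

/* A hypothetical read-only file type that yields zero bytes. */
static long zero_read(struct sk_file *f, char *buf, size_t n)
{
    memset(buf, 0, n);
    f->f_pos += (long)n;
    return (long)n;
}

static const struct sk_file_operations zero_fops = { zero_read, NULL };

/* The generic layer never names a file type; it only follows f_op. */
long sk_read(struct sk_file *f, char *buf, size_t n)
{
    assert(f != NULL);
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -EINVAL;
    return f->f_op->read(f, buf, n);
}

long sk_write(struct sk_file *f, const char *buf, size_t n)
{
    assert(f != NULL);
    if (f->f_op == NULL || f->f_op->write == NULL)
        return -EINVAL;
    return f->f_op->write(f, buf, n);
}
```

Supporting a new kind of file then means supplying one more operations table; the generic read and write paths stay unchanged.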
    include/linux/fs.h (struct file_operations)
      lseek(), read(), write(), readdir(), poll(), ioctl(), mmap(), open(), flush(), release(), fsync(), fasync(), ...
    fs/pipe.c
      pipe_lseek, pipe_read, pipe_write, NULL, pipe_poll, pipe_ioctl, NULL, pipe_rdwr_open, ...
    net/socket.c
      sock_lseek, sock_read, sock_write, NULL, sock_poll, sock_ioctl, NULL, sock_no_open, ...
    fs/devices.c
      NULL, NULL, NULL, NULL, NULL, NULL, NULL, blkdev_open, ...

Control flow of FS system call (Cont'd)
  open
    System call layer
      sys_open()                /* fs/open.c */
        - get_unused_fd()
        - fd_install(fd, f)
    VFS layer
      filp_open()               /* fs/open.c */
        - struct file initialize
        - f->f_op->open()
      open_namei()              /* fs/namei.c */
        iget(), bread()
    Specific File layer
      pipe_rdwr_open(), sock_no_open(), nfs_file_open(), blkdev_open(), chrdev_open()

Control flow of FS system call (Cont'd)
  read
    System call handling layer
      sys_read()                /* fs/read_write.c */
    VFS layer
      - f->f_op->read
    Specific File layer
      sock_read(), block_read(), pipe_read(), nfs_file_read(), tty_read()
      generic_file_read()       /* mm/filemap.c */
        - try to find the page in the page cache, if (hit) OK.
        - get_free_page()
        - inode->i_op->readpage()

Control flow of FS system call (Cont'd)
  inode structure in Linux      /* include/linux/fs.h, ext2_fs_i.h */
    task: ..., fd[], ...  -> file
    file: ..., f_dentry, ..., f_pos, f_op  -> dentry
    dentry: d_inode  -> inode
    inode
      ...
      i_ino, i_dev, i_count, i_mode, i_nlink, i_uid, gid, ...
      i_atime, ...
      i_rdev                    (-> device driver)
      i_op                      (-> inode operation routines)
      i_data[15]                (file-specific information)
      i_flags
      i_...

Control flow of FS system call (Cont'd)
  inode operation example (... i_op ...)
    include/linux/fs.h (struct inode_operations)
      default_file_ops
      create(), lookup()
      link(), unlink(), symlink()
      mkdir(), rmdir()
      mknod(), rename(), readlink(), follow_link()
      readpage(), writepage()
      bmap(), truncate(), ...
    fs/ext2/file.c
      ext2_file_operations, NULL, NULL, NULL, NULL, ..., generic_readpage, NULL, ext2_bmap, ...
    fs/ufs/file.c
      ufs_file_operations, NULL, NULL, NULL, NULL, ..., generic_readpage, NULL, ufs_bmap, ...
    fs/nfs/file.c
      nfs_file_operations, NULL, NULL, NULL, NULL, ...
nfs_readpage nfs_writepage NULL ……. 204 fs/dos/files.c dos_file_operations, NULL, NULL, NULL, NULL, … dos_readpage, dos_writepage, NULL, ……. fs/pipe.c rdwr_pipe_fops, NULL, NULL, NULL, NULL, ... fs/device.c def_blk_fops, NULL, NULL, NULL, NULL, ... Control flow of FS system call (Cont`) read System call handling layer /* fs/read_write.c */ sys_read() - f->f_op->read sock_read() pipe_read() VFS layer block_read() /* mm/filemap.c */ generic_file_read() tty_read() Specific File layer - try to find page in cache, if (hit) OK. - inode->i_op->readpage() nfs_readpage() /* fs/buffer.c */ /* fs/ext2/inode.c */ ext2_bmap() /* fs/ufs/inode.c */ ufs_bmap() generic_readpage() dos_readpage() Specific FS layer coda_readpage() /* driver/block/ll_rw_blk.c */ ll_rw_block() /* driver/block/hd.c */ hd_request 205 Device Driver layer Device Driver Implementation in Linux data structure blkdevs, chrdevs for devsw blk_dev_struct for block driver only file_operations /* fs/devices.c */ lseek read, write, readdir poll, ioctl, mmap, open, flush, release fsync, fasync ….. struct device_struct { name; fops; } chrdevs[], blkdevs[]; /* include/linux/blkdev.h */ struct blk_dev_struct { request_fn; queue; request; ... } blk_dev[]; 206 Driver Implementation in Linux (Cont`) buffer_head b_dev b_blocknr b_state b_count b_size ... b_next b_data data structure (cont`) chrdevs[] name fops file_operations blkdev request rq_status rq_dev cmd … sem bh tail next request_fn current_request 207 request rq_status rq_dev cmd … sem bh tail next request Driver Implementation in Linux (Cont`) Example of structure of driver: IDE disks hd_init() hd_open() hd_interrupt() hd_release() hd_out() driver/block/hd.c hd_request() check_status() hd_ioctl() NULL, block_read, block_write NULL, NULL, hd_ioctl, NULL, hd_open, NULL hd_release, block_fsync struct file_operations hd_ops 208 Driver Implementation in Linux (Cont`) major number Major 0 1 2 3 4 5 6 7 8 9 ……… 23 …. 
/* include/linux/major.h */ Character devices Block devices mem RAM disk floppy (fd*) IDE hard disk (hd* ) terminal terminal & AUX Parallel Interface virtual console (vcs*) SCSI hard disk (sd*) SCSI tapes (st*) Mitsumi CD-ROM (mcd*) 209 Driver Implementation in Linux (Cont`) initialization of disk driver register_blkdev() init_module init process /* driver/block/hd.c */ hd_init() /* include/linux/major.h */ - register_blkdev(HD_MAJOR, “hd”, &hd_fops); - blk_dev[HD_MAJOR]. request_fn = hd_request /* fs/devices.c */ register_blkdev() - blkdevs[major].name = device name - blkdevs[major].fops = fops 210 Driver Implementation in Linux (Cont`) disk driver open /* fs/open.c */ sys_open() - get_unused_fd() - fd_install(fd, f) /* fs/open.c */ filp_open() /* fs/namei.c */ open_namei() - struct file initialize - f->f_op->open() /* driver/block/hd.c */ /* fs/device.c */ hd_open() blkdev_open() pipe_open() chrdev_open() socket_open() nfs_open() 211 - filp->f_op = get_blkfops(MAJOR (inode->i_rdev)); /* filp->f_op = blkdevs[major].fops */ - filp->f_op->open; /* hd_open */ Driver Implementation in Linux (Cont`) disk driver read /* fs/read_write.c */ sys_read() - f->f_op->read /* mm/filemap.c */ nfs_read() pipe_read() generic_file_read() tty_read() /* fs/block_dev.c */ block_read() - getblk(); /* buffer header */ /* driver/block/ll_rw_blk.c */ ll_rw_block() make_request() - request structure initialize add_request() - call blk_dev[major].request_fn /* driver/block/hd.c */ hd_request() 212 - hd_out() Driver Implementation in Linux (Cont`) queue and requests (similar to message queue) requests are sorted by sector number inb, outb /* include/linux/blkdev.h */ struct blk_dev_struct { request_fn; queue; request; ... } blk_dev[]; bread block_read struct request { rq_status rq_dev cmd /* R/W */ error sector, nr_sector buffer, bh sem next ... 
}

[Figure: requests (req) built by ll_rw_block / make_request from the buffer cache wait in the queue; request_fn (hd_request) hands them to the block device driver, which does the I/O.]

Driver Implementation in Linux (Cont`)
various disks and partitions
  gendisk_head -> gendisk -> gendisk -> gendisk
  gendisk: major = 8, name = "sd", minor_shift, max_p, part, ..., real_devices, next
  gendisk: major = 3, name = "ide0", minor_shift, max_p, part, ..., real_devices, next
  part -> hd_struct[]: start_sect, nr_sects, ...

Driver Implementation in Linux (Cont`)
tty driver
  register_chrdev()   (callers: init_module, init process)
  /* driver/char/tty_io.c */ tty_fops:
      tty_lseek, tty_read, tty_write, NULL, tty_poll, tty_ioctl,
      NULL, tty_open, NULL, tty_release, NULL, tty_fasync
  /* driver/char/tty_io.c */ tty_init()
      - register_chrdev(TTY_MAJOR, "tty", &tty_fops);   /* TTY_MAJOR: include/linux/major.h */
  /* fs/devices.c */ register_chrdev()
      - chrdevs[major].name = device name
      - chrdevs[major].fops = fops

Driver Implementation in Linux (Cont`)
Example of network driver: 3c509
  different from the disk and tty drivers: it does not interface directly with the VFS
  /* driver/net/3c509.c */
  el3_init(), el3_open(), el3_start_xmit(), el3_out(), el3_stop(), el3_interrupt(), el3_release()
  IP layer functions shown in the figure: ip_output(), ip_rcv()

Driver Implementation in Linux (Cont`)
Example of network driver: 3c509
  /* include/linux/netdevice.h */
  struct device {
      name
      mem_end, mem_start
      base addr                 /* port number */
      ...
      init, destructor
      ...
      device_addr
      qdisc                     /* sk_buff */
      ...
      open, stop
      hard_start_xmit, hard_header
      ...
      irq
  }
  init_module() in 3c509        /* driver/net/3c509.c */
      /* register_netdev() */
      init port, irq, ...
      make dev structure
      dev->init = el3_init
      dev->open = el3_open
      dev->hard_start_xmit = el3_start_xmit
      ...
  el3_open()
      ...
      request_irq(dev->irq, el3_interrupt, ...)

Task Scheduling
LINUX scheduling
  - the clock tick is 10 msec; the time quantum is 10 clock ticks
  - supports REAL-TIME tasks
variables for scheduling in the task structure   /* include/linux/sched.h */
  p_policy: task type
      - SCHED_FIFO, SCHED_RR, SCHED_OTHER
  p_priority
      - set to DEF_PRIORITY (20)   /* include/linux/sched.h */
      - can be changed using sys_nice() or sys_setpriority()
  p_counter
      - decreased at each clock tick
      - reset to counter = priority when the counter of every task is zero
  need_resched: re-scheduling is needed on return from a syscall or an interrupt
  rt_priority
      - set using the sched_setscheduler(pid, policy, sched_param) system call
      - used for real-time tasks (static priority)

Task Scheduling (Cont`)
schedule() function   /* kernel/sched.c */
  entered via need_resched or sleep_on
  - schedule real-time tasks first (rt_priority)
  - select the task with the highest value of counter + priority (using the goodness function)
      gives an advantage to a task that last ran on this_cpu
      gives a slight advantage to a task that shares the mm object of the current task
  - if (p_counter == 0) for all tasks, set p_counter = p_priority
  - context switch: switch_to(current, next)   /* arch/i386/kernel/process.c */

Task Scheduling (Cont`)
Example of scheduling 3 tasks

  millisecond |     T1      |     T2      |     T3
              | p_pri p_cnt | p_pri p_cnt | p_pri p_cnt
       0      |  20    20   |  20    20   |  20    20
      10      |  20    10   |  20    20   |  20    20
      20      |  20    10   |  20    10   |  20    20
      30      |  20    10   |  20    10   |  20    10
      40      |  20     0   |  20    10   |  20    10
      ...     |  20     0   |  20     0   |  20    10
      ...     |  20    20   |  20    20   |  20    20

Signal
a mechanism to inform a process of an asynchronous event
  types of signal: SIGKILL, SIGINT, SIGBUS, SIGUSR1, ...
  action: abort, exit, ignore, stop, user level catch function

  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  void sig_handler(int signo)
  {
      signal(SIGUSR1, sig_handler);              /* reinstall */
      printf("received signal %d\n", signo);     /* handle the signal */
      ...
  }

  int main(void)
  {
      signal(SIGUSR1, sig_handler);              /* install the handler */
      ...
      for ( ; ; )
          pause();
  }

What is the difference among an interrupt, a trap, and a signal?
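The handler above reinstalls itself because, with the historical `signal()` semantics, the disposition may be reset to the default once the signal is delivered. A minimal sketch of the same idea using POSIX `sigaction()` follows; the names `on_usr1`, `got_signal`, and `install_usr1_handler` are made up for this illustration. With `sa_flags = 0` (no `SA_RESETHAND`) the handler stays installed, so no reinstall step is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_signal = 0;

/* Hypothetical handler name: only records which signal arrived. */
static void on_usr1(int signo)
{
    got_signal = signo;
}

/* Install on_usr1 for SIGUSR1. Returns 0 on success, -1 on error. */
static int install_usr1_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);   /* block no extra signals while handling */
    sa.sa_flags = 0;            /* no SA_RESETHAND: handler stays installed */
    return sigaction(SIGUSR1, &sa, NULL);
}
```

A caller would run `install_usr1_handler()` once and then loop on `pause()`, as in the slide; the handler only sets a flag because `printf()` is not async-signal-safe.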
221 Signal (Cont`) register signal handler (signal catch function ) send signal signal detection : state transition from kernel running to user running call signal handler variables for signal in task structure int sigpending : is signal received or not? struct signal_struct *sig sigset_t signal, blocked typedef struct { unsigned long sig[_NSIG_WORDS]; } sigset_t; /* asm-i386/signal.h */ struct sigaction /* asm-i386/signal.h */ struct signal_struct /* sched.h */ count action[_NSIG] siglock 222 sa_handler sa_flags sa_restorer sa_mask Signal (Cont`) register signal catch function task …. sig signal, blocked sigpending …. signal_struct count action[_NSIG] siglock sigset_t …. 63 sigaction sa_handler sa_flags sa_restorer sa_mask sigset_t …. 0 /* kernel/signal.c */ sys_signal(sig, handler) do_sigaction(sig, new_sa, old_sa) 223 63 0 Signal (Cont`) send signal task …. sig signal, blocked sigpending …. signal_struct count action[_NSIG] siglock sigset_t …. 63 sigaction sa_handler sa_flags sa_restorer sa_mask sigset_t …. 0 63 0 /* kernel/signal.c */ sys_kill(pid,sig) kill_proc_info(sig, info, pid) send_sig_info(sig, info, *t) sigaddset(t->signal, sig); t->sigpending = 1; 224 Signal (Cont`) signal handling task …. sig signal, blocked sigpending …. signal_struct count action[_NSIG] siglock sigaction sa_handler sa_flags sa_restorer sa_mask /* arch/i386/kernel/entry.S */ if (current->sigpending) do_signal(); /* arch/i386/kernel/signal.c */ do_signal(regs, oldset) signr = dequeue_signal() handle SIG_IGN or SIG_DFL sigset_t …. 63 0 handle_signal() sigset_t …. 
(bits 63 ... 0)
    - setup stack frame for signal handler

Signal (Cont`)
signal handling: state of the stack for handling a signal
[Figure: two memory stacks, before and after the signal frame is set up. Frame labels: return address, arguments; return address, arguments; return address to kernel, return address to sighandler, arguments.]

Thread
Motivation (golf course)
  - possibility of parallel processing
  - a process is too heavy
process model
[Figure: one address space per process (P), processes sharing the CPU over time.] (Source: UNIX internals)

Thread (Cont`)
thread model
[Figure: several threads inside one address space, sharing the CPU over time.] (Source: UNIX internals)
  task: a set of threads and a collection of resources (passive)
  thread: hardware context, stack, thread information (id, scheduling, ..)

Thread (Cont`)
types of threads
  - kernel thread
  - LWP (lightweight process): a kernel-supported user thread
  - user thread: C-thread, P-thread
[Figure: user threads (U) managed by a user level scheduler inside a process (or task), mapped onto LWPs (L) and kernel threads (K); the thread scheduler assigns kernel threads to CPUs.]

Thread (Cont`)
threads in Linux
  - struct thread: currently only one in the task structure
  - sys_clone(): fully shares the address context such as the page directory
  - under development
  - user level threads (P thread) can be used: /usr/include/pthread.h
      pthread_create(), pthread_join(), pthread_mutex_init()

Thread (Cont`)
Example of thread programming

  /* gcc -lpthread */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define L 9
  double x[L], y[L];
  int cpu;                              /* number of threads; global so the worker can read it */

  typedef struct {
      double volatile *p_s;             /* shared sum */
      pthread_mutex_t *p_s_lock;        /* lock protecting the sum */
      int n;                            /* start offset */
  } DATA;

  void *SMP_scalprod(void *arg)
  {
      register double localsum;
      long i;
      DATA D = *(DATA *)arg;

      localsum = 0.0;
      for (i = D.n; i < L; i += cpu)    /* every cpu-th element, starting at n */
          localsum += x[i] * y[i];
      pthread_mutex_lock(D.p_s_lock);
      *(D.p_s) += localsum;
      pthread_mutex_unlock(D.p_s_lock);
      return (NULL);
  }

  int main(int argc, char *argv[])
  {
      pthread_t *thread;
      void *retval;
      int i;
      DATA *A;
      volatile double s = 0;
      pthread_mutex_t s_lock;

      if (argc != 2) {                  /* exactly one argument is expected */
          printf("USAGE: %s <CPU number>\n", argv[0]);
          exit(1);
      }
      cpu = atoi(argv[1]);
      thread = (pthread_t *)calloc(cpu, sizeof(pthread_t));
      A = (DATA *)calloc(cpu, sizeof(DATA));

      for (i = 0; i < L; i++)
          x[i] = y[i] = i;
      pthread_mutex_init(&s_lock, NULL);

      for (i = 0; i < cpu; i++) {
          A[i].n = i;                   /* start offset */
          A[i].p_s = &s;
          A[i].p_s_lock = &s_lock;
          pthread_create(&thread[i], NULL, SMP_scalprod, &A[i]);
      }
      for (i = 0; i < cpu; i++)
          pthread_join(thread[i], &retval);

      printf("results = %f\n", s);
      return 0;
  }

Data Structure for Virtual Memory
Linux virtual memory structure for each task: global view
/* include/linux/sched.h, mm.h, include/asm-i386/page.h */
  task_struct: mm -> mm_struct
  mm_struct: map_count, pgd -> page directory (entry bits 31 ... 11 ... 0, PFN), mmap -> vm_area_struct list
  vm_area_struct (vm area: data or parts of data): vm_end, vm_start, vm_flags, ..., vm_file, vm_offset, vm_ops, vm_next
  vm_area_struct: vm_end, vm_start, vm_flags, ...
vm_file vm_offset vm_ops vm_next 233 vm_area (text) Data Structure for Virtual Memory (Cont`) struct mm_struct include/linux/sched.h struct mm_struct { struct vm_area_struct *mmap; struct vm_area_struct *mmap_avl, *mmap_cache; pgd_t *pgd; atomic_t count; int map_count; struct semaphore mmap_sem; unsigned long context; unsigned long start_code, end_code, start_data; unsigned long end_data, start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; unsigned long rss, total_vm, locked_vm, def_flags; unsigned long swap_cnt, swap_address; void *segment; } include/asm-i386/page.h typedef struct {unsigned long pgd;} pgd_t; 234 kernel env_end arg_end arg_start start_stack stack brk end_data end_code start_code bss data text Data Structure for Virtual Memory (Cont`) pgd_t task_struct mm_struct mm map_count pgd mmap 31 22 21 12 11 0 DIR PAGE offset 31 11 0 31 11 0 PFN CR3 11 PFN PFN page directory 31 page table 235 0 offset physical address Data Structure for Virtual Memory (Cont`) struct vm_area_struct need to handle segments (or parts of segment) differently: text/data, share/private include/linux/mm.h Virtual Memory Area struct vm_area_struct { struct mm_struct *vm_mm; unsigned long vm_start, vm_end; struct vm_area_struct *vm_next pgprot_t vm_page_prot; unsigned short vm_flags; short vm_avl_height; struct vm_area_struct *vm_avl_left; struct vm_area_struct *vm_avl_right; struct vm_area_struct *vm_next_share; PAGE_SHARED (COPY, READONLY, KERNEL) struct vm_operations_struct *vm_ops; unsigned long vm_offset; struct file *vm_file; unsigned long vm_pte; /* for SVR4 SM */ } 236 •open(vm_area) •close(vm_area) •do_mmap(file, addr, len, prot, flags, off) •unmap() •protect() •nopage() •wppage() •swapout() •swapin() Data Structure for Virtual Memory execve (final) : usually demand paging under Linux task_struct mm mm_struct vm_area_struct map_count pgd vm_end vm_start vm_flags ….. 
vm_file, vm_offset, vm_ops, vm_next

[Figure: execve maps a.out (ELF format) into vm areas. ELF header fields: e_ident, ..., e_phnum; program header 1, program header 2, ... with fields p_type, p_offset, p_vaddr, p_filesz, p_memsz, p_flags; followed by code, data, .... mm_struct.mmap points to one vm_area_struct per vm area (vm_end, vm_start, vm_flags, ..., vm_file, vm_offset, vm_ops, vm_next). vm_ops: open(vm_area), close(vm_area), do_mmap(file, addr, len, prot, flags, off), unmap(), protect(), nopage(), wppage(), ...]

Data Structure for Virtual Memory (Cont`)
struct vm_area_struct: AVL (Adelson-Velskii and Landis) tree
[Figure: AVL tree of vm_area_struct nodes labelled 40007000, 0804b000, 0804a000, 40087000, 40009000, 40005000, 08053000, 40008200, c0000000, 400b9000.] (Source: the LINUX KERNEL book)

Polling & Interrupt
polling mode

  #define LP_B(minor)     lp_table[(minor)].base        /* IO address */
  #define LP_S(minor)     inb_p(LP_B((minor)) + 1)      /* status port */
  #define LP_CHAR(minor)  lp_table[(minor)].chars       /* busy timeout */

  static int lp_char_polled(char lpchar, int minor)
  {
      int status = 0;
      int count = 0;
      ...
      status = LP_S(minor);
      while ((status & LP_PBUSY) && count < LP_CHAR(minor)) {
          count++;
          if (need_resched)
              schedule();
          status = LP_S(minor);
      }
      ...
      /* do timeout error handling if necessary (off-line, out of paper, ...) */
      outb_p(lpchar, LP_B(minor));
      ...
  }

Polling & Interrupt (Cont`)
interrupt mode

  lp_init()
  {
      ...
      request_irq(LP_IRQ, lp_interrupt, 0, "PRINTER");
      ...
  }

  static int lp_char(char lpchar, int minor)
  {
      ...
      if (...)
          outb_p(lpchar, LP_B(minor));
      else
          interruptible_sleep_on(&lp->lp_wait_q);
      ...
  }

  lp_interrupt(int irq, struct pt_regs *regs)
  {
      ...
      wake_up_interruptible(&lp->lp_wait_q);
      ...
  }

Polling & Interrupt (Cont`)
Interrupt handling under Linux   /* arch/i386/kernel/irq.h, irq.c */
  Interrupt_descriptor[]: entries 0, 1, 2, ..., each with status, handler, action, depth
  action -> list of irqaction: handler, flags, name, dev_id, ..., next
  irqaction: handler, flags, name, dev_id, ...,
next

Polling & Interrupt (Cont`)
default IRQ of ISA PC   /* arch/i386/kernel/irq.h, irq.c */

  IRQ  Assignment
   0   System timer
   1   Keyboard controller
   2   Second IRQ controller
   3   Serial port 2 (COM2)
   4   Serial port 1 (COM1)
   5   Line printer 2 (LPT2)
   6   Floppy-disk controller (controls two disks)
   7   Line printer 1 (LPT1)
   8   Real-time clock
   9   Redirected IRQ2
  10   Unused
  11   Unused
  12   Motherboard (PS/2) mouse port
  13   Mathematics coprocessor
  14   Hard-disk (IDE) controller 1 (controls two disks)
  15   Hard-disk (IDE) controller 2 (controls two disks)

Bottom Half Handling
What is a bottom half?
  - handles long jobs that would otherwise run during interrupt handling
  - top half: request_irq
  - bottom half: mark_bh(), init_bh() with the bh_base data structure

  bh_mask_count[32];
  struct bh_struct {
      void (*routine)();
      void *data;
  } bh_base[32];
  enum { TIMER_BH, CONSOLE_BH, ... KEYBOARD_BH, ... }

Bottom Half Handling (Cont`)
example of bottom half

  kbd_init()
  {
      ...
      request_irq(KEYBOARD_IRQ, kbd_interrupt, 0, "KBD");
      bh_base[KEYBOARD_BH].routine = kbd_bh;
      ...
  }

  kbd_interrupt(int irq, struct pt_regs *regs)
  {
      ...
      mark_bh(KEYBOARD_BH);
      ...
  }

  kbd_bh()   /* called from ret_from_syscall */
  {
      /* do KBD interrupt handling */
  }

Bottom Half Handling (Cont`)
timer handling
  to deal with jobs that must be invoked at a specific time

  struct timer_struct {
      unsigned long expires;
      void (*fn)(void);
  } timer_table[];

  init_timer(), add_timer(), del_timer()

Network in Linux
Network implementation: one of the basic demands of an operating system
  applications: ftp, telnet, rlogin, NFS, e-mail, News
  protocol: TCP/IP, OSI, IPX (developed by Novell), SNA, appletalk, X.25
  devices: Ethernet (eth0, eth1), SLIP (sl0), PLIP (plip0)

Socket interface
Socket interface   /* net/socket.c */
  virtual interface
  to support various protocol families: UNIX, INET, X25, IPX, APPLETALK, ...
  to support various socket types: Stream, Datagram, Raw, Reliable Delivered Message, ...
socket(), bind(), connect(), listen(), accept() read(), write() send(), sendto(), recv(), recvfrom() 247 Layer model layer structure of a network BSD socket INET socket TCP UDP IP PLIP SLIP parallel port serial port ETHERNET Ethernet card 248 ARP Layer model (Cont`) Encapsulation data TFTP data header TFTP message UDP header Ethernet header TFTP data header UDP message IP header UDP TFTP header header IP packet data IP header UDP header data TFTP header Ethernet trailer Ethernet frame Details of each structure can be found in “The LINUX NETWORK” and “UNIX network programming” 249 Layer model (Cont`) Details of TCP/IP protocol Ethernet frame Destination ethernet address Source ethernet address Protocol Data Checksum IP packet Length Protocol Checksum Source IP address Destination IP address Data TCP message Source TCP address Destination TCP address 250 SEQ ACK Data Important data structure important data structure VFS layer struct file_operations BSD socket layer struct net_proto_family /* include/linux/net.h */ struct socket /* include/linux/net.h */ /* include/linux/fs.h */ inet layer struct sock /* include/net/sock.h */ struct proto_ops /* include/linux/net.h */ transport layer struct tcp_opt /* include/net/sock.h */ struct proto /* include/net/sock.h */ network layer struct tcp_func /* include/net/tcp.h */ struct packet_type /* include/linux/netdevice.h */ device layer struct device /* include/net/netdevice.h */ 251 struct sk_buff /* include /linux/sk_buff.h */ Important data structure (cont`) socket data structure task …. fd[] …. /* include/linux/net.h */ file …. f_dentry …. f_pos f_op dentry d_inode struct socket { state type flags ops /* proto_ops */ sk /* struct sock */ files, inodes next, wait, …. } INET, UNIX, IPX, X25, .. 252 /* include/net/sock.h */ struct sock { ... } /* include/linux/net.h */ struct proto_ops { family dup, release, bind, connect, accept, listen, ... 
getsockops setsockops sendmsg recvmsg } /* for INET operation */ Important data structure (cont`) sock data structure /* include/net/sock.h */ struct tcp_opt { tcp_header_leng rcv_next, snd_next, /* sequence, error handling information */ …. tcp_func ... } /* include/net/tcp.h */ struct tcp_func { queue_xmit send_check …. } /* for IP operation */ /* include/net/sock.h */ struct sock { next, prev daddr, dport rcv_saddr, sport ... rmem_alloc receive_queue /* sk_buff */ wmem_alloc send_queue ... pair /* struct sock */ proto /* struct proto */ tp_pinfo dst_cache /* struct dst_entry */ ... } 253 /* include/net/sock.h */ struct proto { next, prev close, bind, retransmit connect, accept … sendmsg, recvmsg … name } /* for TCP or UDT operations */ /* include/net/dst.h */ struct dst_entry { next …. struct device *dev; struct hh_cache *hh; (*input) (*output) … } /* for device operation */ Important data structure (cont`) network device data structure /* include/net/sock.h */ struct sock { ... dst_cache ... } /* include/linux/netdevices.h */ struct hh_cache { hh_refcnt hh_type hh_output … } /* for abstract device operation */ /* include/net/dst.h */ struct dst_entry { …. *dev; *hh; (*input) (*output) … } 254 /* include/linux/netdevices.h */ struct device { name mem_end, mem_start base addr /* port number */ irq … init, destructor …. device_addr Qdisc /* sk_buff */ …. open, stop hard_start_xmit, hard_header ... } /* for actual network device operation */ Important data structure (cont`) sk_buff data structure for virtual copy struct sock /* include/linux/sk_buff.h */ struct sk_buff { next, prev struct sock *sk; …. dev /* TP layer header */ union { th, uh, icmph, …} h; /* Network layer header */ union { iph, ipv6h, arph, ..} nh /* Data Link header */ union { ethernet, raw} mac; struct dst_entry *dst; … data, head, tail, len … } sk_buff headers data ... sk_buff headers data ... struct device sk_buff headers data ... 
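One reading of "sk_buff data structure for virtual copy" is that each layer adds its header in front of the payload without moving the payload itself. The sketch below models that in user space. `toy_skb`, `toy_skb_init`, `toy_skb_put`, and `toy_skb_push` are hypothetical names that only borrow the `head`, `data`, `tail`, `len` field names listed on the slide; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a buffer with headroom. */
#define TOY_BUF_SIZE 128

struct toy_skb {
    unsigned char head[TOY_BUF_SIZE]; /* start of the allocated area */
    unsigned char *data;              /* first valid byte */
    unsigned char *tail;              /* one past the last valid byte */
    size_t len;                       /* tail - data */
};

/* Reserve 'headroom' bytes in front for headers added later. */
static void toy_skb_init(struct toy_skb *skb, size_t headroom)
{
    assert(headroom <= TOY_BUF_SIZE);
    skb->data = skb->tail = skb->head + headroom;
    skb->len = 0;
}

/* Append payload at the tail; returns where the payload was stored. */
static unsigned char *toy_skb_put(struct toy_skb *skb, const void *src, size_t n)
{
    unsigned char *p = skb->tail;

    assert((size_t)(skb->head + TOY_BUF_SIZE - skb->tail) >= n);
    memcpy(p, src, n);
    skb->tail += n;
    skb->len += n;
    return p;
}

/* Prepend a header in the headroom; the payload bytes are not moved. */
static unsigned char *toy_skb_push(struct toy_skb *skb, const void *hdr, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    memcpy(skb->data, hdr, n);
    skb->len += n;
    return skb->data;
}
```

Putting a payload and then pushing a transport, network, and link header leaves the payload at its original address while `data` moves toward `head`, which is the property that lets one buffer travel through all layers.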
255 Socket Create socket create /* include/linux/socket.h */ AF_UNIX, AF_INET, AF_IPX, ... /* net/socket.c */ sys_socket(family, type, protocol) SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ... sock_create() sock_alloc() net_families[family]->create() 256 /* include/linux/net.h */ struct socket { state type flags ops /* proto_ops */ sk /* struct sock */ files, inodes next, wait, …. } Socket Create (cont`) protocol family registration family /* include/linux/socket.h */ AF_UNIX, AF_INET, AF_IPX, ... registration /* net/ipv4/af_inet.c */ struct net_proto_family inet_family_ops = { PF_INET, inet_create } /* include/linux/net.h */ struct net_proto_family { family create() authentication encryption, encrypt_net } struct net_proto_family net_familiese[]; /* net/socket.c */ sock_register(net_proto_family *ops) { ... net_familiese[ops->family] = ops; } inet_proto_init() { … sock_register(inet_family_ops) ... } /* net/unix/af_unix.c */ /* net/ipx/af_ipx.c */ 257 Socket Create (cont`) /* include/linux/socket.h */ AF_UNIX, AF_INET, AF_IPX, ... socket create /* net/socket.c */ sys_socket(family, type, protocol) SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ... sock_create() sock_alloc() net_families[family]->create() unix_create() /* include/net/sock.h */ struct sock { ... prot net_pinfo tp_pinfo socket sk_buff …. } /* include/linux/net.h */ struct socket { state type flags ops /* proto_ops */ sk /* struct sock */ files, inodes next, wait, …. 
} inet_create() sk_alloc() switch (type) sock->ops=&inet_stream_ops or sock->ops=&inet_dgram_ops … sk->prot = &tcp_prot 258 Socket Create (cont`) socket create /* net/socket.c */ sys_socket(family, type, protocol) sock_create() get_fd() get_empty_filp() file->f_op=&socket_file_ops associate d_inode with socket structure /* net/socket.c */ struct file_operations socket_file_ops = { sock_lseek sock_read sock_write NULL /* readdir */ sock_poll sock_ioctl NULL /* mmap */ sock_no_open NULL /* flush */ sock_close NULL /* fsync */ sock_fasync } 259 Socket Create (cont`) after socket creation task …. fd[] …. file …. f_dentry …. f_pos f_op /* include/linux/net.h */ struct socket { state type flags ops /* proto_ops */ sk /* struct sock */ files, inodes next, wait, …. } dentry VFS layer d_inode INET layer /* include/net/sock.h */ struct sock { next, prev daddr, dport rcv_saddr, sport ... rmem_alloc receive_queue /* sk_buff */ wmem_alloc send_queue ... pair /* struct sock */ prot /* struct proto */ tp_pinfo dst_cache /* struct dst_entry */ ... 
} TCP layer IP layer 260 Driver layer Send Data sending data through socket compare with FS control flow…, that is a piece of pizza /* net/ipv4/af_inet.c */ struct proto_ops inet_stream_ops = { PF_INET sock_no_dup inet_release inet_bind inet_stream_connect sock_no_socketpair inet_accept inet_getname inet_poll inet_ioctl inet_listen inet_shutdown inet_getsockopt inet_setsockopt sock_no_fcntl inet_sendmsg inet_recvmsg } /* fs/read_write.c */ sys_write() f->f_op->write /* net/socket.c */ sock_write() socki_lookup(d_inode) make msg sock_sendmsg() sock->ops->sendmsg /* net/ipv4/af_inet.c */ inet_sendmsg() sk->prot->sendmsg 261 Send Data (cont`) sending data through socket /* net/ipv4/af_inet.c */ inet_sendmsg() sk->prot->sendmsg /* net/ipv4/tcp.c */ tcp_v4_sendmsg() tcp_do_sendmsg() copy data from user to sk_buff /* net/ipv4/tcp_output.c */ tcp_send_skb() tcp_transmit_skb() make tcp header sk->tp_pinfo.af_tcp.af_specific ->queue_xmit(skb) 262 /* net/ipv4/tcp_ipv4.c */ struct proto tcp_proto = { netxt, prev tcp_close tcp_v4_connect tcp_accept NULL /* retrasmit */ tcp_write_wakeup tcp_read_wakeup tcp_poll tcp_ioctl tcp_v4_init_sock tcp_v4_destroy_sock tcp_shutdown tcp_getsockopt tcp_setsockopt tcp_v4_sendmsg tcp_recvmsg … “TCP” ... 
} Send Data (cont`) sending data through socket /* net/ipv4/tcp_output.c */ tcp_transmit_skb() sk->tp_pinfo.af_tcp.af_specific ->queue_xmit(skb) /* net/ipv4/ip_output.c */ ip_queue_xmit() build IP header fragment handling call ip_route_output() /* dst_cache.output = ip_output in ip_route_output */ sk->dst_cache->output() /* net/ipv4/ip_output.c */ ip_output() ip_finish_output(skb) 263 /* net/ipv4/tcp_ipv4.c */ struct tcp_func ipv4_specific = { ip_queue_xmit tcp_v4_send_check tcp_v4_rebulid_header tcp_v4_conn_request tcp_v4_sync_recv_sock tcp_v4_get_sock sizeof(struct iphdr) ip_setsockopt ip_getsockopt v4_addr2sockaddr sizeof(struct sockaddr_in) } sk_alloc() => tcp_v4_sock_init() tcp_v4_sock_init() { … sk->tp_pinfo.af_tcp.af_specific=&ipv4_specific .. } Send Data (cont`) /* include/linux/netdevices.h */ struct hh_cache { hh_refcnt hh_type hh_output … } sending data through socket /* include/net/ip.h */ ip_finish_output() hh->hh_output(skb) /* net/core/dev.c */ dev_queue_xmit() hh->output = neigh_ops->output = dev_queue_xmit /* net/ipv4/arp.c*/ input pkt into dev->qdisc dev->hard_start_xmit() /* driver/net/3c509.c */ el3_start_xmit() make ethernet frame send frame using inb(), outb(), ... init_module() in 3c509 /* driver/net/3c509.c*/ init port, irq, … make dev structure dev->open=el3_open dev->hard_start_xmit = el3_start_xmit ... 264 struct device { name rmem_end, rmem_start mem_end, mem_start base addr irq … init, destructor …. device_addr qdisc …. open, stop hard_start_xmit, hard_header ... } Send Data (cont`) sending data through socket struct sock struct device ... qdisc ... ... send queue ... sk_buff headers data ... sk_buff headers data ... sk_buff headers data ... 
Protocol Layer Device Layer 265 Send Data (Cont`) Sending all together (TCP/IP & Ethernet) cf) compare with the control flow of FS, it’s too terrible (FS is a piece of cake) VFS BSD socket inet socket TCP /* fs/read_write.c */ sys_write() /* net/socket.c */ sock_write() /* net/ipv4/af_inet.c */ inet_sendmsg() /* net/ipv4/tcp_output.c */ tcp_send_skb() /* net/ipv4/ip_output.c */ IP Device ip_queue_xmit() /* driver/net/3c509.c */ el3_start_xmit() Linux kernel 266 Receive Data receiving data through socket /* net/ipv4/ip_input.c */ ip_local_deliver() ip_forward(), ip_defrag() skb->dst->input() /* dst.ipput = ip_local_deliver in ip_route_input() */ /* net/ipv4/ip_input.c */ ip_rcv() make sk_buff in device structure ptype->func() /* net/core/dev.c */ net_bh() /* include/linux/netdevice.h */ struct packet_type { type dev func …. } /* net/ipv4/ip_output.c */ struct packet_type ip_packet_type = { ETH_P_IP, NULL, ip_rcv, ... } mark_bh(NET_BH) /* driver/net/3c509.c */ el3_interrupt() el3_open() …. request_irq(dev->irq, el3_interrupt 267 Receive Data (cont`) receiving data through socket tcp_data_queue() /* sk_buff into sk */ wake up process tcp_data() check consistency, … tcp_data() /* net/ipv4/tcp_input.c */ tcp_rcv_state_process() call tcp_rcv_established or call tcp_rcv_state_process /* net/ipv4/tcp_ipv4.c */ tcp_v4_rcv() tcp_v4_do_rcv() ipprot->handler() /* net/ipv4/ip_input.c */ ip_local_deliver() 268 /* include/net/protocol.h */ struct inet_protocol { handler err_handler ... name } /* net/ipv4/protocol.c */ struct inet_protocol tcp_protocol { tcp_v4_rcv tcp_v4_err …. 
    "TCP"   /* name */
}

Receive Data (cont`)
receiving data through socket
  /* fs/read_write.c */     sys_read()
      f->f_op->read
  /* net/socket.c */        sock_read()
      socki_lookup(d_inode)
      make msg header
      sock_recvmsg()
      sock->ops->recvmsg
  /* net/ipv4/af_inet.c */  inet_recvmsg()
      sk->prot->recvmsg
  /* net/ipv4/tcp.c */      tcp_recvmsg()
      add_wait_queue(sk->sleep, {current, NULL})   /* sleeps until tcp_data() wakes the process */

Receive Data (cont`)
Receiving all together (TCP/IP & Ethernet)

  Layer        Source file                 Function
  VFS          fs/read_write.c             sys_read()
  BSD socket   net/socket.c                sock_read()
  inet socket  net/ipv4/af_inet.c          inet_recvmsg()
  TCP          net/ipv4/tcp.c              tcp_recvmsg()             (sleep)
  TCP          net/ipv4/tcp_input.c        tcp_rcv_state_process()   (wake up)
  IP           net/ipv4/ip_input.c         ip_rcv()
  Device       net/core/dev.c              net_bh()
  Device       driver/net/3c509.c          el3_interrupt()
  (all inside the Linux kernel)

Conclusion in Network
Add new features: new functions can be placed between the existing layers of the send path
  /* fs/read_write.c */        sys_write()
  /* net/socket.c */           sock_write()
  /* net/ipv4/af_inet.c */     inet_sendmsg()
                               secure_tcp()
  /* net/ipv4/tcp_output.c */  tcp_send_skb()
  /* net/ipv4/ip_output.c */   ip_queue_xmit()
                               compress_net()
                               virtual_ip()
  /* driver/net/3c509.c */     el3_start_xmit()
  (all inside the Linux kernel)

Conclusion of Linux
At kernel level, each abstraction is just a set of data structures.

  Abstraction    Data structure                   Header / source
  process        struct task_struct               include/linux/sched.h
                 struct user                      include/asm-i386/user.h
  memory         struct vm_area_struct            include/linux/sched.h, include/asm-i386/page.h
  file           struct file, struct inode        include/linux/fs.h, ext2_fs_i.h
  file system    struct super_block               include/linux/fs.h
  buffer         struct buffer_head               include/linux/fs.h
  device driver  struct device_struct             fs/devices.c, driver/*
  IPC                                             include/linux/ipc.h, sem.h, msg.h, shm.h
  TCP/IP                                          include/linux/tcp.h, ip.h
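The "add new features" idea from the network conclusion can be pictured with a toy layer chain. This is a user-space sketch under stated assumptions: `toy_layer`, `toy_send`, and `toy_insert_below` are hypothetical names, and the real kernel links its layers through operation tables such as `proto_ops` and `proto` rather than a single list. The point is only that when layers are connected through data structures, a new stage can be linked in without changing the layers above it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: toy_layer is not a kernel structure. */
struct toy_layer {
    const char *name;
    struct toy_layer *lower;   /* next layer down, NULL at the device */
};

/* Walk a packet down the stack, recording each layer visited in 'trace'
 * as "a>b>c". 'cap' is the size of the trace buffer. */
static void toy_send(const struct toy_layer *top, char *trace, size_t cap)
{
    const struct toy_layer *l;

    assert(cap > 0);
    trace[0] = '\0';
    for (l = top; l != NULL; l = l->lower) {
        /* +2: one byte for '>' and one for the terminating NUL */
        if (strlen(trace) + strlen(l->name) + 2 > cap)
            return;            /* trace buffer full: stop recording */
        if (trace[0] != '\0')
            strcat(trace, ">");
        strcat(trace, l->name);
    }
}

/* Link 'new_layer' directly below 'upper'; layers above are untouched. */
static void toy_insert_below(struct toy_layer *upper, struct toy_layer *new_layer)
{
    new_layer->lower = upper->lower;
    upper->lower = new_layer;
}
```

Building the chain tcp, ip, dev and then inserting a "compress" layer below ip changes the recorded path from `tcp>ip>dev` to `tcp>ip>compress>dev`, with no change to the tcp layer.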