Blue Forest http://www.lslnet.com at 10:18 on June 6, 2006
<Unix Internals> Book Guide
[code]
Applicants : gfcat (Garfield Cat) <gfcat@smth.edu.cn>
Title : <Unix Internals> Book Guide (1) BlueOcean (mail)
Pigeon Point BBS letter : the North side of the Unnamed
Source : from smthnew (bbs.net.tsinghua.edu.cn [166.111.8.238])
Date : Thu Dec 12 19:04:42 2002
The writer : BlueOcean (Blue), Unix letter :
Title : <Unix Internals> Book Guide (1)
Shuimu Tsinghua BBS letter Station : Station (Mon Apr 27 12:59:09 1998) WWW-POST
I borrowed this book from a senior classmate; it is a true bible-class work.
I came across this digest of the book on a Taiwanese board and am reposting it here.
It is the first book I have seen that discusses the technical details of this topic from an implementation point of view. COOL!
Unix, I love you!
The writer : syc@cc.ntu.edu.tw (Shiau Yong-Ching), Linux billboards :
Title : core on Unix
Wrote station : National Taiwan University (Wed Jul 9 23:23:08 1997)
Sobee!netnews.ntu!not-for-mail transfer station :
Before I leave, I dedicate this to TANet and UNIX, which kept me company through four years of university...
Unix Internals
The New Frontiers
Reading Guide
"The number of UNIX installations has grown to 10,
with more expected."
-- Ken Thompson and Dennis Ritchie
UNIX Programmer's Manual, Second Edition
June 12, 1972
There are many books about UNIX, but most only cover how to use UNIX or how to program on it;
books about the core of a UNIX system are much rarer. A few well-known ones:
* Bach, The Design of the UNIX Operating System, 1986 -- covers System V Release 2
* Leffler et al., The Design and Implementation of the 4.3BSD UNIX Operating System, 1989 -- covers 4.3BSD
* Goodheart and Cox, The Magic Garden Explained, 1994 -- covers System V Release 4.0
However, each of these books is written about one specific UNIX system and rarely gives the reader
an overall perspective. Uresh Vahalia's Unix Internals -- The New Frontiers looks at UNIX systems
from a broad point of view. The author discusses the UNIX systems of both the commercial and the
academic world, describes their algorithms in detail, and compares the strengths and weaknesses of
each system thoroughly and objectively. For system administrators, programmers, and computer
enthusiasts it is the best book for understanding the core of the various UNIXes.
The book takes UNIX System V Release 4.2 as its base and compares it with the other UNIXes,
discussing their pros and cons. It was published in 1996, so it is the most recent material on UNIX kernels.
I do not think the book is very hard to read. Experience with using UNIX and with process management
and design is necessary; I suggest reading Bach's The Design of the UNIX Operating System first and
moving on to this book once you are familiar with that one.
This guide describes the fascinating parts of the book, the parts most UNIX fans want to know about, as a treat
for those who cannot read the original or do not have the time to specialize in UNIX kernels. I make no particular
assumptions about the reader's level, so the depth of the introduction simply follows my mood; please forgive me.
Chapter 1 Introduction
The Missing UNIXes
The book describes many UNIXes, but IBM's AIX gets barely a word. If you are an AIX supporter,
the book may disappoint you, because in it AIX appears to have had no influence on the development
of UNIX kernels. The same goes for HP-UX. About all you learn is that both come from the
well-known System V family.
The book was written around 1995, but 1995 and 1996 were years of great change for UNIX: many
vendors replaced their old UNIX with new systems, almost all of them on a par with SVR4.
Leaving them out probably does no harm; perhaps they have no special techniques and are
relatively mediocre :P, or perhaps they are less open and have fewer dealings with academia...
The most important part of this chapter is UNIX history, but so much has changed since 1995/96 that some of it
is already history itself :)
The UNIX the book refers to most often is the "orthodox" UNIX System V Release 4.x.
Next come the largest non-mainstream UNIX, 4.x BSD (mostly 4.3 and 4.4), and the successor to UNIX,
Carnegie-Mellon University's Mach (the "ch" is pronounced "k", so it reads like "Mac"). Then come the
UNIXes with the largest market share, Sun's SunOS and Solaris. Digital Equipment Corporation's
(Digital, or DEC) OSF/1, now renamed Digital UNIX, also gets a lot of exposure because it
uses the Mach kernel.
Judging from the book, Sun can be called the strongest UNIX research and development vendor, and it cannot be overlooked.
History lesson
In the late 1960s Bell Telephone Laboratories, General Electric, and the
Massachusetts Institute of Technology cooperated on a multi-user operating
system, Multics. Bell Labs pulled out of the project in March 1969. What happened next is a story
we all know more or less, so only the key points are listed here:
* Ken Thompson wrote a game called Space Travel on a DEC PDP-7.
* The PDP-7 lacked a program development environment, so Ken Thompson and Dennis Ritchie wrote UNIX.
* Ken Thompson wrote the B language (derived from BCPL; B was an interpreted language).
* Dennis Ritchie evolved B into the famous C language.
* In November 1973, UNIX version 4 was rewritten in C.
The first UNIX paper, "The UNIX Time-Sharing System" by Ken Thompson and Dennis Ritchie,
was presented at the ACM Symposium on Operating Systems Principles (SOSP) in October 1973 and
published in Communications of the ACM in July of the following year.
This was UNIX's first contact with the outside world.
Why UNIX spread freely
In 1956 AT&T was investigated under the antitrust laws and signed an agreement with the federal government
not to engage in any business unrelated to telephone and telegraph service. BTL belonged to AT&T.
After UNIX was published, the academic world kept asking for UNIX and its source code, and AT&T
provided the source to academia free of charge, which is why UNIX spread so widely.
Berkeley's Computer Systems Research Group (CSRG) contributed a great deal to the development of UNIX.
The UNIX from Berkeley is called BSD UNIX. BSD contributed virtual memory, TCP/IP, the Fast
File System (FFS), reliable signals, and the socket interface.
4.4BSD replaced the original VM with the Mach VM and introduced the Log-Structured File System (LFS).
CSRG disbanded after finishing 4.4BSD. The reasons were:
* Insufficient funding
* BSD's features could already be found in commercial systems (so there was no need to do it yourself)
* The system had grown too large for a small group to maintain
A company, Berkeley Software Design, Inc. (BSDI), was set up to continue marketing 4.4BSD
commercially. They called their BSD BSD/386. BSDI declared that, after Berkeley's
rewriting, BSD/386 contained no AT&T code. But AT&T still sued Berkeley and BSDI;
BSDI's phone number, 1-800-ITS-UNIX, was the fuse. The lawsuit delayed the release of 4.4BSD.
Finally, on February 4, 1994, the two sides settled and the suit was dropped. BSDI announced a
4.4BSD source release free of AT&T code, called 4.4BSD-Lite. What followed is an Internet
legend; see the 386BSD discussion board.
UNIX System V
The Justice Department's antitrust case split AT&T into several companies, and BTL was renamed AT&T Bell Laboratories.
AT&T was then allowed to enter the computer market. AT&T's commercial UNIX releases were System III,
System V, System V Release 2 (SVR2), System V Release 3, and System V Release 4/4.2.
System V introduced many new features (compared with the old UNIX), such as a region-based virtual
memory architecture (different from BSD's), IPC, remote file sharing, shared libraries,
the STREAMS framework, and so on.
Commercial UNIX
Commercial UNIXes added many features to UNIX, such as SunOS's Network File System (NFS),
the vnode/vfs interface for supporting multiple file systems, and a new VM system (adopted by SVR4).
AIX was the first commercial UNIX to support a journaling file system. ULTRIX (DEC's old UNIX)
was one of the first UNIXes to support multiprocessors.
Mach
Mach is a microkernel operating system developed at Carnegie-Mellon University (CMU) in the 1980s.
As UNIX grew more functional, more complex, and larger, it became harder and harder to master. The
microkernel idea is to make the kernel as small as possible, keeping only the essential parts and
implementing the remaining functions in user-level programs (called servers), thereby reducing
the kernel's complexity.
Mach design goals
* Compatibility with UNIX
* Runs on both uniprocessors and multiprocessors
* Suitable for distributed computing environments
Mach 2.5 is the most common version and is the foundation of many commercial systems such as DEC OSF/1 and NextStep.
Mach 3.0 is the truly pure microkernel version.
Standards
There are as many UNIX standards as there are UNIX versions. The book carefully follows the fate of each standard,
up to the most recent news: Novell selling the UNIX trademark to X/Open, and Sun's Solaris 2.5.
In 1986 the IEEE appointed a committee to develop a standard for open operating systems, known as
POSIX (Portable Operating System Interface; the X at the end is there, to put it bluntly, because
it is a UNIX standard). -- This is what I heard; it is not in the book.
X/Open is a consortium of international computer vendors founded in 1984, with a more pragmatic purpose:
not to add yet more UNIX standards, but to consolidate the existing ones
into a common application environment. The XPG (X/Open Portability Guide) is its work.
The UNIX trademark is currently owned by X/Open.
Besides the standards, UNIX vendors also formed alliances.
UI, UNIX International, was an alliance led by AT&T and Sun. Its main products were SVR4 and OpenLook.
OSF, the Open Software Foundation, was funded mainly by IBM, DEC, and HP.
OSF contributed Motif and a standard, DCE (Distributed Computing Environment), to UNIX.
Once NT came along to stir things up, UI collapsed and AT&T got out of UNIX (to concentrate on its Plan 9 operating system?).
Sun's Solaris became the heir to SVR4, but Sun no longer insists on OpenLook and supports CDE
(Common Desktop Environment, which in plain terms is Motif).
Chapter 2 The Process and the Kernel
This chapter covers the process structure, context, and execution states; nothing special, more or less
a short review of UNIX programming.
The description of vfork is rather unusual. Some books give the impression that on a modern UNIX
vfork = fork with copy-on-write. In fact, with vfork the child and the parent share one address space (as in
multithreading), so the child can modify the parent's data.
The vfork of present-day UNIXes may not have quite the same semantics as the original one...
Chapter 3 Lightweight Processes and Threads
This chapter covers one feature of modern UNIX: threads. With the craze under Windows and OS/2,
everyone has probably heard something about them by now. Threads and lightweight processes (LWPs) are defined here.
On some systems thread = LWP, but for the sake of discussion the book treats threads as the broader term.
A thread is the smallest unit of execution and represents the state of one activity within a process. A traditional
process is a program with only one thread. A program with five threads is doing
five jobs at once, like five small independent programs that do not interfere with one another.
(If you can put it better, please do.) From a lower-level point of view, the process has five
instruction streams and five stacks. If that is still vague, the Mach definition later in this chapter is clearer.
There are three kinds of threads: kernel threads, lightweight processes, and user threads.
Kernel threads mean that the kernel itself is split into several threads of execution.
This is useful for handling asynchronous events: the kernel can create a thread for each asynchronous event.
Kernel threads are cheap to operate on and carry almost no overhead, but the kernel has to be structured to support them.
A kernel that supports kernel threads is called a multithreaded kernel.
A lightweight process is a kernel-supported user thread. In other words, for every LWP the kernel
has to use one kernel thread to handle/support it, so an LWP can be seen as an extension of a kernel thread.
Switching between LWPs is expensive because it involves a kernel context switch.
User threads live in user space and are simulated by a thread library, needing no or only minimal
kernel support. Their context switches are faster because there is no page table to change and so on,
so they are light and quick to use. For handling user interaction in window systems, threads provide an excellent solution.
User threads basically need not know that kernel threads exist, and a system can
offer several different policies through different libraries. This is the form in which ordinary applications
usually want multithreading: because the kernel is not involved, it is the most lightweight and
efficient. Having read this, are you puzzled by the advertising in which OS/2 and Windows fight over
multithreading support? I really do not know what they are fighting about...
The best-known thread libraries are Mach's C threads and POSIX pthreads.
What SunOS 4.x calls LWPs are user threads, not the LWPs of this book.
The Solaris kernel is made up of a group of kernel threads; some are used to implement LWPs and some
to perform system tasks.
A more innovative use of kernel threads is to replace interrupt
handlers. When an interrupt occurs, the kernel
creates (clones) a thread to handle it, which settles the re-entrancy problem in one stroke.
The book also covers some thread scheduling issues, but they are hard to describe in a few words and not very interesting, so I skip them.
Mach's Threads
The Mach kernel has two basic concepts: the task and the thread.
A task is a static object made up of an address space and system resources (port rights).
A task provides the environment in which threads execute.
A thread is Mach's basic unit of execution.
Mach provides the C threads library (cthreads) for working with threads.
Continuations in Mach 3.0
Mach 3.0 made a meaningful improvement to threads called continuations. The starting point is
saving stacks: in a multithreaded kernel, the more threads there are (or are abused :),
the more stack space is consumed, and this needed solving.
The following is a very common code fragment, and also the usual reason a kernel thread blocks (that is,
the moment when it needs its stack):
syscall(args) {
    ...               /* e.g. arrange a DMA transfer of a disk block */
    thread_block();   /* wait until it completes */
    process(args1);
    return;
}
If syscall blocks, its stack has to be kept. If process() does not need
anything that was on the stack, the stack need not be kept. Of course, whatever state it does need
must be stored somewhere beforehand (for example in static variables or parameters).
So the original code becomes this:
syscall() {
    ...
    thread_block(process);
    /* not reached */
}
process() {           /* final processing */
    ...
    thread_syscall_return(status);
}
Saving stacks is not the only gain. When the thread being woken up also blocked with a continuation,
the stack can be handed over to it directly, which saves cache/TLB misses.
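The control flow of a continuation can be modelled in a few lines of user-space C. This is only a sketch of the idea, not Mach's code: struct thread, thread_block_with(), thread_resume(), and finish_read() are names invented here, and the "stack" is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this only models the control flow. */
typedef void (*continuation_t)(void *state);

struct thread {
    continuation_t cont;   /* where to resume once the thread is woken */
    void *state;           /* the little state the continuation needs */
    int has_stack;         /* 1 while a kernel stack is attached */
};

/* Block with a continuation: record the resume point, give up the stack. */
void thread_block_with(struct thread *t, continuation_t cont, void *state)
{
    assert(cont != NULL);
    t->cont = cont;
    t->state = state;
    t->has_stack = 0;      /* the stack is free for some other thread */
}

/* Wake up: attach a stack again and start the continuation from its top. */
void thread_resume(struct thread *t)
{
    continuation_t cont = t->cont;
    assert(cont != NULL);
    t->has_stack = 1;
    t->cont = NULL;
    cont(t->state);
}

/* Example continuation: the "final processing" half of a system call. */
void finish_read(void *state)
{
    int *status = state;
    *status = 0;           /* pretend the I/O completed successfully */
}
```

A caller would do thread_block_with(&t, finish_read, &status) where the old code called thread_block(), and the scheduler would later call thread_resume(&t). Nothing from the blocked call frame survives except what was passed in state.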
Digital UNIX
Discusses how OSF/1 uses Mach tasks and threads to build the UNIX notion of a process.
--
Buck barks in the darkness
Source : Shuimu Tsinghua BBS activity stations bbs.net.tsinghua.edu.cn [FROM : ns.nlsde.buaa.edu.cn]
Applicants : gfcat (Garfield Cat) <gfcat@smth.edu.cn>
Title : <Unix Internals> Book Guide (2) BlueOcean (mail)
Pigeon Point BBS letter : the North side of the Unnamed
Source : from smthnew (bbs.net.tsinghua.edu.cn [166.111.8.238])
Date : Thu Dec 12 19:04:48 2002
The writer : BlueOcean (Blue), Unix letter :
Title : <Unix Internals> Book Guide (2)
Shuimu Tsinghua BBS letter Station : Station (Mon Apr 27 13:09:14 1998) WWW-POST
Chapter 4 Signals and Session Management
This chapter describes UNIX signals and sessions (and controlling terminals...) in detail.
It is not only detailed but also sorts out the differences among the various versions very clearly; ordinary books are nowhere near this clear.
The highlight is Mach's approach to signals, which shows where UNIX falls short.
Under UNIX you install a signal handler for a signal. Mach uses
IPC to deliver signals (officially called exceptions). Every Mach thread has a corresponding
exception port. Whoever wants to handle its exceptions starts another thread that listens on this
exception port. That is the per-thread exception port. There is also a per-task
exception port. If nobody listens on the thread's exception port, the exception is forwarded to the task's
exception port; and if nobody listens there either, the task is simply killed!
Who uses this? The debugger!
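The fallback order described above (thread port first, then task port, then terminate) can be written down as a small C sketch. The structures and the PORT_NULL value are made up for illustration; real Mach ports are kernel objects, not plain integers.

```c
#include <assert.h>

#define PORT_NULL 0   /* hypothetical "nobody registered a port" value */

struct task   { int exception_port; };
struct thread { int exception_port; struct task *task; };

/*
 * Return the port that should receive an exception raised by `th`:
 * the per-thread port if one is registered, otherwise the per-task
 * port, otherwise PORT_NULL (the caller would then terminate the task).
 */
int exception_target(const struct thread *th)
{
    assert(th != 0 && th->task != 0);
    if (th->exception_port != PORT_NULL)
        return th->exception_port;
    return th->task->exception_port;   /* may itself be PORT_NULL */
}
```

A debugger in this model simply arranges to be the listener on the task's exception port, which is why it does not have to be the parent of the program it debugs.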
The UNIX debugger has a shortcoming: it can only debug a child that it forked itself, so there is
no way to debug a program that is already running. The Mach debugger has no such problem, because it works
through IPC: anyone who can listen on the exception port can debug the task.
Using IPC also makes remote debugging easy. (A Mach
proxy server can route messages from a local port to a remote port, seamlessly!)
Improved UNIXes can achieve the same effect with the /proc file system.
If you have learned UNIX you surely know what a process group is, but not necessarily what a session is.
The session concept took shape in SVR4 and 4.4BSD alongside process groups, because process groups alone
cannot cleanly distinguish a login session from the different jobs within that session. The concept of a session
is fairly new; in SVR3 and 4.3BSD it was mixed together with process groups.
Chapter 5 Process Scheduling
The clock interrupt is the core of process scheduling and the heartbeat of a multitasking system!
The kernel uses it in the form of callouts.
int id = timeout(void (*callout)(), caddr_t arg, long delta);
This registers a callout function to be invoked after delta ticks.
Many parts of the kernel register with this timer mechanism, including the scheduler.
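To make the callout mechanism concrete, here is a toy version in C. It is a sketch under simplifying assumptions, not any kernel's implementation: the names timeout_sketch(), untimeout_sketch(), and clock_tick() are invented, and a small fixed table scanned on every tick stands in for the sorted lists or timing wheels a real kernel would use.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLOUTS 8

typedef void (*callout_fn)(void *arg);

struct callout {
    callout_fn fn;       /* NULL marks a free slot */
    void *arg;
    long ticks_left;
};

static struct callout callouts[MAX_CALLOUTS];

/* Register fn(arg) to run after `delta` ticks; returns a slot id or -1. */
int timeout_sketch(callout_fn fn, void *arg, long delta)
{
    assert(fn != NULL && delta > 0);
    for (int i = 0; i < MAX_CALLOUTS; i++) {
        if (callouts[i].fn == NULL) {
            callouts[i].fn = fn;
            callouts[i].arg = arg;
            callouts[i].ticks_left = delta;
            return i;
        }
    }
    return -1;           /* table full */
}

/* Cancel a pending callout by id. */
void untimeout_sketch(int id)
{
    if (id >= 0 && id < MAX_CALLOUTS)
        callouts[id].fn = NULL;
}

/* Called once per clock tick: fire and free every expired callout. */
void clock_tick(void)
{
    for (int i = 0; i < MAX_CALLOUTS; i++) {
        if (callouts[i].fn != NULL && --callouts[i].ticks_left == 0) {
            callout_fn fn = callouts[i].fn;
            void *arg = callouts[i].arg;
            callouts[i].fn = NULL;   /* one-shot: free the slot first */
            fn(arg);
        }
    }
}

/* Example callout: bump a counter. */
void count_up(void *arg) { ++*(int *)arg; }
```

A scheduler built on this would register a callout that recomputes priorities and then re-registers itself for the next interval.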
Applications on a system can be divided into three classes:
* Interactive -- interaction intensive, such as shells, editors, GUIs
* Batch -- CPU intensive, such as software builds, scientific computations
* Real-time -- time critical, such as video playback, where the application needs a fixed frame rate,
  such as 20 frames per second. A system that can only deliver an uncertain rate
  varying between 15 and 40, even if it averages 30, is hard to accept.
UNIX does well at the first two (i.e., time-sharing), and this chapter discusses in depth how the
various UNIXes handle the run queue. What deserves attention, of course, is UNIX's
real-time support.
Traditional UNIX kernels could not support real-time because the kernel was non-preemptive:
sometimes the kernel works on something for too long and starves processes. SVR4's solution is
to cut long algorithms into several segments with preemption points placed between them. At these points
the kernel can be preempted, which makes real-time support possible.
The Solaris kernel goes further and protects its data structures with the most appropriate
mechanisms (synchronization, mutex locks, semaphores), becoming a truly preemptive
kernel. Solaris really is not bad.
Traditional non-preemptive UNIX kernels relied on this property to avoid locking.
Mach's scheduling policy is interesting. When a thread calls msg_send(),
it would normally block, the message would be enqueued, and the scheduler would pick
a thread from the run queue to run. However, if someone is already waiting for the message,
Mach schedules the receiver to run directly and, instead of enqueuing the message, copies it
straight into the receiver's user-level address space. This improves IPC performance
by saving a run queue search and the IPC enqueue/dequeue time.
Another Mach specialty is the processor set concept. A task can ask the kernel to allocate it
N CPUs (a set, a group) to run on.
Digital UNIX originated from Mach, but it does not use Mach's scheduler.
At the end, the chapter introduces some proposed algorithms; the more interesting one is the three-level scheduler.
Satisfying real-time needs and arbitrary time-sharing at once is not really possible, because resources
are limited. The three-level scheduler introduces the notion of admission: a real-time process first registers
so that the kernel can reserve enough resources to run it. The kernel reserves just the appropriate resources
and guarantees that time-sharing processes do not come to a standstill.
On an overloaded network, the kernel can be so busy handling network activity that it has no time for anything else,
which leads to receive livelock. With the three-level scheduler, network processing can be
handled as a real-time task and given an adequate quota; anything beyond the quota is simply dropped.
This guarantees that you get to eat, though not that you eat your fill! [A real-time task has a fixed priority
-- the kernel fixes its resource quota and guarantees it gets that much. It is like guaranteeing each person
one loaf of bread a day, instead of having everyone scramble for a pile of bread (time-sharing), where
some days you grab several and some days none, and if one person is good at grabbing, the
others starve.]
Chapter 6 Interprocess Communications
The IPC part of this chapter is a somewhat involved review of IPC programming.
Traditional UNIX had only the pipe as an IPC mechanism. SYSV introduced System V IPC, comprising semaphores,
shared memory, and message queues. UNIX System V's STREAMS framework is also an
important IPC mechanism. The book does not cover the BSD socket interface here, apparently not counting it
as IPC for this chapter. Mach is a message-passing
kernel: most services rely on exchanging messages, so the chapter spends
a great many pages introducing it.
Highlights:
SVR4 actually implements pipes with STREAMS, so an SVR4 pipe is bidirectional:
either end can be both read and written. On other UNIXes you need two pipes to achieve the same thing.
System V semaphores have one flaw: allocation and initialization are not a single step,
which can lead to race conditions.
With STREAMS in System V, IPC message queues can almost be retired. STREAMS
is powerful (see the last chapter) and can replace all the functions of message queues.
Mach IPC is the hardest part of this chapter, and I believe the part that interests everyone most.
Message := a collection of typed data. There are two kinds: simple data, which is ordinary
information, and complex data, which may be out-of-line memory or port rights
(both explained below).
Port := a protected queue of messages. Simply put, it is a message queue id,
much like a file descriptor: one integer that stands for one message queue.
Messages are sent to a port, not to a particular task or thread.
Port right := the right to access a port; there are send rights and receive rights. The
object that holds port rights is the task, not the thread. A send right allows
a task (all of its threads) to send messages to a given port; a receive right
is analogous. Send rights to a port can be held by many tasks, but the
receive right is held by only one task, the task that owns the port. A task that holds
the receive right automatically holds a send right.
Out-of-line memory := for transferring large amounts of data, copying the message directly is inefficient. By
changing memory mappings through the virtual memory system, copy-on-write can be used
to improve this. In-line memory is copied from the sender into the kernel
and then from the kernel to the receiver. With out-of-line memory the kernel
alters some virtual memory mappings, and only when the sender or receiver
modifies the data does a copy-on-write page fault occur, so at most
one copy is made.
Complex data differs from simple data in that the kernel has to process the message
and translate it for the receiver. Simple data can be passed straight through.
Each task holds a send right to its task_self port. The task_self port is owned by the
kernel, and the task communicates with the kernel through it. A task also holds the receive right
to its task_notify port, for receiving messages sent by the kernel. A thread has a thread_self
port for sending messages to the kernel and a reply port for receiving the kernel's replies
(for example to system calls...). The port rights of a thread's ports belong to its task.
There is also an exception port for handling exceptions/signals/debugging, discussed earlier.
A message header contains a reply_port field. If the sender needs a reply, it fills
this field with a port of its own (one it holds the receive right to). When the message is passed,
the kernel creates a send right to the reply_port for the receiver.
Port Name Space
A port name is an integer, like a file descriptor, and each task has its own independent name space.
That is, if two different tasks both have a port id of, say, 100, the ids do not refer to the same port.
Port translations
Because each task's port name space is independent, the kernel must maintain translations
to remember who is who. (UNIX uses a global file table for this.)
What puzzled me most in this chapter was the port right. The kernel's port data structure
has no record of port rights, so where on earth does Mach keep the information about who holds which right? It took me a long time to work out.
It turns out to be the port translation. Inside the kernel, a port translation is the port
right. I would even go so far as to say that
the Mach kernel has no separate concept of a port right: as long as the port can be found in the
port translations, you can send a message to it.
Each port translation entry represents one connection.
An entry holds <task, local_name, kport, type>, meaning that within task the
port id local_name denotes a right to kport (a pointer to the kernel's port data
structure). Whether it is a send or a receive right is decided by type.
The local_name is the so-called port name.
msg_send() uses <task, local_name, *, *> to find the kport, and then uses
<port.owner_task, *, kport, *> to find the local_name in the task that owns the port.
To delete a port from a task, the kernel finds the kport, removes every <*, *, kport, *> entry, and
notifies the tasks concerned.
When a task ends, all kports matching <task, *, *, *> are cleaned up.
The kernel uses two hash tables for fast access to translations: TP_table (key = <task, port>) and
TL_table (key = <task, local_name>).
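The translation entries can be sketched in C as a plain array. This is an illustration only: task ids and kport values are invented small integers, lookup_by_name() and translate_send() are hypothetical helpers, and linear scans replace the two hash tables.

```c
#include <assert.h>
#include <stddef.h>

enum right_type { RIGHT_SEND, RIGHT_RECEIVE };

/* One translation entry: <task, local_name, kport, type>. */
struct translation {
    int task;             /* task id (hypothetical small integer) */
    int local_name;       /* port name as seen inside that task */
    int kport;            /* stand-in for a pointer to the kernel port */
    enum right_type type;
};

/* <task, local_name> -> entry, the question the TL_table answers. */
const struct translation *
lookup_by_name(const struct translation *tab, size_t n, int task, int local_name)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].task == task && tab[i].local_name == local_name)
            return &tab[i];
    return NULL;
}

/*
 * Translate a send: find the kernel port behind the sender's name, then the
 * name under which the task holding the receive right knows that port.
 * Returns that name, or -1 if the sender holds no right or nobody receives.
 */
int translate_send(const struct translation *tab, size_t n,
                   int sender_task, int local_name, int *receiver_task)
{
    const struct translation *src = lookup_by_name(tab, n, sender_task, local_name);
    assert(receiver_task != NULL);
    if (src == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].kport == src->kport && tab[i].type == RIGHT_RECEIVE) {
            *receiver_task = tab[i].task;
            return tab[i].local_name;
        }
    }
    return -1;
}
```

With entries {1, 100, 500, SEND}, {2, 100, 600, SEND}, and {3, 7, 500, RECEIVE}, name 100 means a different kernel port in task 1 than in task 2, and a send from task 1 on name 100 arrives at task 3 under its local name 7.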
Port Rights Transfer
The usual way to transfer a right is to embed a port in the message header; the reply port
is exactly that. Such a transfer hands a send right for the port
to somebody else. More complicated cases use complex messages, in which the message
tells the kernel to give somebody else's port (a send right) to a third party. (The chapter does not go into the details.)
This is what the Mach name server does. Besides the ports mentioned earlier, every Mach
task also has a bootstrap port, through which the task can send messages to the Mach
name server. The name server provides a mechanism for reaching other servers. The procedure is as follows:
1. The server registers with the name server (through the bootstrap port).
2. The client asks the name server how to communicate with the server.
3. The name server transfers a send right for the server to the client.
4. The client now has a port for communicating with the server and uses it to talk to the server.
Mach's ideal is to move much of the UNIX kernel into user-level servers. The name server
is the key that opens up this mechanism.
Port Interpolation
Port interpolation lets someone steal (or plant) the send/receive rights of a given task.
Mach provides task_extract_send(), task_insert_send(), task_extract_receive(), and
task_insert_receive() to do this. It is how an IPC debugger intercepts messages.
Netmsgserver
Something similar to port interpolation is remote Mach IPC. In fact,
to put it bluntly, Mach remote IPC is achieved simply by having netmsgserver act as a proxy.
Mach can do this for two reasons. First, to call msg_send() a task only needs to know its own
local port name; it has no idea where the port actually leads. The
netmsgserver registers the server's name and port with the name server, and the name server just tells the task
which port to send to. Second, senders are anonymous: the receiver cannot learn from the message
who the sender was. So netmsgserver can provide a fully transparent service.
Mach 3.0 made some improvements to IPC, addressing problems encountered
with Mach 2.5 IPC.
Chapter 7 Synchronization and Multiprocessors
This chapter covers kernel synchronization algorithms and data structures, especially the issues
that arise on multiprocessors. The biggest multiprocessor problem in virtual memory is cache/TLB
synchronization; this chapter does not cover it and leaves it to the virtual memory chapters.
The content of Chapter 7 is rather fragmentary and does not lend itself to a summary.
--
Buck barks in the darkness
Applicants : gfcat (Garfield Cat) <gfcat@smth.edu.cn>
Title : <Unix Internals> Book Guide (3) BlueOcean (mail)
Pigeon Point BBS letter : the North side of the Unnamed
Source : from smthnew (bbs.net.tsinghua.edu.cn [166.111.8.238])
Date : Thu Dec 12 19:04:49 2002
The writer : BlueOcean (Blue), Unix letter :
Title : <Unix Internals> Book Guide (3)
Shuimu Tsinghua BBS letter Station : Station (Mon Apr 27 13:20:24 1998) WWW-POST
Chapter 8 File System Interface and Framework
The first half is a review of the file-system-related system calls. The second half discusses the modern
UNIX VFS (vnode) framework, that is, how the kernel deals with file systems. The next chapter
discusses the on-disk layout of file systems.
s5fs is the file system of early UNIX (up to and including System V). It is simple and works well enough, but its biggest
drawback is that file names are limited to 14 characters. 4.2BSD designed the Fast File System (FFS),
offering better performance and functionality. FFS was widely welcomed and was finally adopted
by SVR4. FFS is also known as UFS (Unix File System).
Early UNIX kernels could support only one type of file system. When support for several was needed, each
vendor had its own solution: AT&T used the file system switch, DEC used gnode/gfs, and Sun's
vnode/vfs was the most widely accepted and won in the end, becoming part
of SVR4.
The inode is a concept unique to UNIX. The UNIX kernel uses the inode as its interface to a file.
An inode records the information the kernel needs to access the file,
plus the file's attributes and so on. Most of the contents of this data structure
come from the on-disk inode. A UNIX file system's on-disk image does not identify files by name;
it uses an integer (the i-number) for each file. A directory is just a special kind of file
that records file names and their corresponding i-numbers. Given an i-number, the kernel can find the inode,
which is where the file's storage locations and attributes are kept. The inode on disk is called the on-disk
inode; once read into kernel memory for use, it is called the in-core inode.
vnode/vfs turns the in-core inode concept into a virtual object class.
The vnode provides an abstract interface: all of the kernel's file operations
go through the virtual functions defined by the vnode. Once a vnode is initialized,
it holds pointers to the file-system-dependent part, so whatever the format of the
underlying file system, the kernel ends up calling the right functions. The VFS architecture lets
UNIX support multiple file systems, and through the vnode the kernel gives file system
developers a standard interface. In July, InfoMac had an article on Rhapsody
saying that its VFS can do file compression, decompression, and online virus checking,
which does seem rather odd...
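The "virtual function" idea maps directly onto a table of function pointers in C. The sketch below is a minimal illustration with made-up names (struct vnodeops with a single vop_getsize operation, and two imaginary file systems, fsa and fsb); a real vnode has many more operations and fields.

```c
#include <assert.h>
#include <stddef.h>

struct vnode;

/* The "virtual functions": one table per file system type. */
struct vnodeops {
    int (*vop_getsize)(struct vnode *vp);
};

struct vnode {
    const struct vnodeops *v_op;  /* file-system-dependent operations */
    void *v_data;                 /* file-system-dependent private data */
};

/* Two made-up file systems with different private data. */
struct fsa_node { int size_bytes; };
struct fsb_node { int nblocks; int block_size; };

static int fsa_getsize(struct vnode *vp)
{
    struct fsa_node *n = vp->v_data;
    return n->size_bytes;
}

static int fsb_getsize(struct vnode *vp)
{
    struct fsb_node *n = vp->v_data;
    return n->nblocks * n->block_size;
}

const struct vnodeops fsa_ops = { fsa_getsize };
const struct vnodeops fsb_ops = { fsb_getsize };

/* File-system-independent code: it never looks behind v_data. */
int vn_getsize(struct vnode *vp)
{
    assert(vp != NULL && vp->v_op != NULL);
    return vp->v_op->vop_getsize(vp);
}
```

The caller of vn_getsize() cannot tell which file system it is talking to, which is the whole point: adding a new file system means supplying another ops table, not touching the generic code.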
A particular UNIX implementation is not quite the same as the definition.
Definition:
struct vnode {
public_data1, 2, 3 ...
caddr_t v_data;
} vnode;
vnode.v_data = (caddr_t *)&ffs_data;
A real implementation puts the interface and the file-system-dependent data together:
struct ffs_node {
public_data1, 2, 3 ...   /* same order as struct vnode */
caddr_t v_data;
private_data1,2,3.... /* ffs private data */
} vnode;
vnode.v_data = (caddr_t *)&
Understand?不懂就去看书吧!书上是用图示法的,见图知意.
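There is one place where vnode/vfs differs from the original Unix and can
lead to race conditions. I suspect this is the source of many security
holes: file name lookup works differently. In the old days, when there was
only one file system (see Bach's book), Unix passed a whole path to namei()
and got back the inode. The vnode approach has to chop the path into pieces
and look them up one component at a time. This also lowers performance
considerably. (File name lookup is one of the bottlenecks of the Unix file
system, since every lookup touches the inode of every directory along the
path.) 4.4BSD and OSF/1 both propose countermeasures.
(Read the book yourself; it is too fiddly to cover here.)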
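The component-by-component lookup can be sketched roughly as follows. This is a toy in-memory tree, not kernel code; struct node, lookup_one and lookup_path are hypothetical names. The step counter shows that every directory on the path costs one lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 4

/* Toy directory tree node; child[] is NULL-terminated when not full. */
struct node {
    const char *name;
    struct node *child[MAX_ENTRIES];
};

/* One directory lookup: find the entry called name[0..len) inside dir. */
static struct node *lookup_one(struct node *dir, const char *name, size_t len)
{
    for (int i = 0; i < MAX_ENTRIES && dir->child[i] != NULL; i++) {
        struct node *c = dir->child[i];
        if (strlen(c->name) == len && strncmp(c->name, name, len) == 0)
            return c;
    }
    return NULL;
}

/* Walk path one component at a time; *steps counts directory lookups.
 * Returns NULL as soon as a component is missing. */
struct node *lookup_path(struct node *root, const char *path, int *steps)
{
    assert(root != NULL && path != NULL && steps != NULL);
    struct node *cur = root;
    *steps = 0;
    while (cur != NULL) {
        while (*path == '/')          /* skip separators */
            path++;
        if (*path == '\0')
            break;                    /* whole path consumed */
        size_t len = strcspn(path, "/");
        cur = lookup_one(cur, path, len);
        (*steps)++;
        path += len;
    }
    return cur;
}
```

Between two steps of such a loop another process could, in principle, rename or replace a directory, which is the kind of window the race condition remark above is about.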
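With vnode/vfs in place, more creativity is encouraged and more file systems
can be proposed. Among the more interesting ones are NFS, specfs, fifofs,
and the /proc fs. I will mention the last three here in passing; the book
only gets to them in later chapters.
What does specfs do? In the s5fs days, some special inodes did not represent
real files but peripheral devices. Now suppose our ffs has a file /dev/tty1.
With the vnode/vfs scheme just described, the kernel would end up calling
ffs functions, and those functions have no idea how to drive the tty driver.
The solution is to design a specfs. Whenever the kernel finds that an inode
is a device file, it does not call the vnode's open but calls specvp() and
passes the vnode to it. specvp fills the appropriate virtual functions and
pointers into the vnode so that the desired operations are obtained. That is
why specfs exists. Chapter 16 has more detail, and also discusses how to
handle different device files whose major/minor device numbers are the same,
that is, which refer to the same device.
fifofs works like specfs. It turns an ffs vnode into a named pipe with fifo
behavior.
The /proc file system presents the in-memory state of processes as a file
system for convenient access. This lets a debugger control a process more
easily and offer more features, and it can also debug a process that is
already partway through execution. (As mentioned earlier, this was not
feasible under old Unix.)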
Chapter 9 File System Implementations
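This chapter mainly discusses on-disk file system layout and some other
file systems. Earlier chapters did not say how vnodes are organized in the
kernel, so this chapter fills that in too.
First comes s5fs. It is really quicker to read this yourself; just follow
the figures.
The notable part is how the file system handles free blocks. s5fs manages
free blocks with a free block list. Does this list take up a lot of space?
No, because the list lives in the free blocks themselves, that is, it
borrows space that is not in use. The size of the free block list is
proportional to the remaining space. When no space is left, no free block
list is needed either; the space the list used to occupy is released and
used as free blocks.
Actually the above was not quite precise. The free block list has two parts.
A small portion is inside the super block, and the rest is stored in free
data blocks. When the super block's free list is used up, it is refilled
from the list in the free data blocks. So the data block space occupied by
the free block list is released before the data blocks run out.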
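A small simulation may help show why the list costs no extra space. The sketch below follows the description above but is not taken from the book: the names (superblock, freeblk, s5_alloc, s5_free), the batch size of 4, and the use of block number 0 as an end-of-list marker are assumptions for the demo.

```c
#include <assert.h>
#include <string.h>

#define NICFREE 4    /* free-block numbers per batch (tiny, for the demo) */
#define NBLOCKS 16

/* A free block that is part of the chain stores a batch inside itself. */
struct freeblk {
    int nfree;
    int free[NICFREE];
};

struct superblock {
    int nfree;           /* valid entries in free[] */
    int free[NICFREE];   /* free[0] names the block holding the next batch */
};

static struct freeblk disk[NBLOCKS];   /* simulated data blocks */

/* Release block bno. If the in-core batch is full, spill it into bno. */
void s5_free(struct superblock *sb, int bno)
{
    assert(bno > 0 && bno < NBLOCKS);
    if (sb->nfree == NICFREE) {
        disk[bno].nfree = sb->nfree;
        memcpy(disk[bno].free, sb->free, sizeof sb->free);
        sb->nfree = 0;
    }
    sb->free[sb->nfree++] = bno;
}

/* Allocate a block; returns 0 when the file system is full. */
int s5_alloc(struct superblock *sb)
{
    if (sb->nfree == 0)
        return 0;
    int bno = sb->free[--sb->nfree];
    if (sb->nfree == 0 && bno != 0) {
        /* bno held the next batch: reload it, then hand out bno itself. */
        sb->nfree = disk[bno].nfree;
        memcpy(sb->free, disk[bno].free, sizeof sb->free);
    }
    return bno;
}
```

Starting from a super block that holds only the end marker (nfree = 1, free[0] = 0), freeing nine blocks and allocating until 0 comes back returns all nine, including the two that were carrying list batches.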
Berkeley Fast File System
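s5fs has many limitations, and performance is one problem. For example,
ls -l dir runs back and forth between the inodes and the directory file
(only the inode holds the file attributes). s5fs puts the inode table at
the very front of the disk space and the data blocks everywhere else, which
wastes a lot of seek time.
s5fs does not handle fragmentation well either. With the free block list,
contiguous allocation is good only when the file system has just been
created; after some use fragmentation becomes severe.
Block size is a problem too. A large block size improves efficiency but
wastes space; a small one lowers efficiency but uses space better.
The single super block of s5fs is also dangerous: if the super block is
destroyed, the whole file system is gone.
Another shortcoming of s5fs is the 14-character limit on file names.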
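FFS appeared in order to solve these problems. How? s5fs treats the disk
like a tape and ignores its physical structure. FFS faces reality and takes
the physical structure into account in its design, that is, the disk's
heads, tracks, cylinders, and so on. A hard disk is made up of several
platters. A cylinder is the column formed by the same track on the
different platters.
FFS divides a disk into several cylinder groups. A cylinder group consists
of adjacent cylinders. Put simply, FFS splits one disk into several small
disks (cylinder groups) and puts an s5fs on each cylinder group (disk).
Doing so keeps the data of a cylinder group within the range of a few
cylinders and reduces seek time.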
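Fragmentation is handled with a bitmap. FFS uses very large data blocks to
reduce the impact of fragmentation.
For small files (and the tail of a large file, the part that does not fill
a block), however, a data block is cut into several small pieces so that it
can hold several small files and save the space that would otherwise be
wasted. Data stored in a fragmented block must be the very end of a file,
and within that block the data must be stored contiguously. It is not
allowed, for example, for the tail of file A to live in pieces 1/4 and 3/4
of a block while a smaller file B lives in piece 2/4 of the same block. In
that situation A's block has to be moved to a new block so that A's last
block is stored contiguously.
To avoid unnecessary moves as a file slowly grows, FFS allows fragments
only for direct blocks. In other words, once a file exceeds a certain size
it no longer uses fragment blocks.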
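Besides the structural changes, allocation policy is also important. FFS
allocates disk space like this:
* Files (inodes) in the same directory are all placed in the same cylinder
group. (localizing policy)
* Each new directory is placed in a different cylinder group from its
parent. (distributing policy)
* A file's data blocks are placed in the same cylinder group as its inode.
(localizing policy)
* When a file exceeds 48 KB, and again for every additional 1 MB after
that, it has to change lanes, er, cylinder groups, so that a single file
does not fill up one cylinder group.
* Data block allocation takes the disk's interleave factor into account.
(Don't follow? Then you must never have used a floppy speed-up program on
MS-DOS :) Never mind, the book has a diagram.)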
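The 48 KB / 1 MB rule in the list above is easy to express as arithmetic. The sketch below only counts how often a file would have changed cylinder groups by a given offset; the round-robin choice of the next group in cg_for_offset is an assumption made for the example, not how the book says FFS picks the group.

```c
#include <assert.h>

#define KB 1024L
#define MB (1024L * KB)

/* How many times a file has changed cylinder groups by the time it has
 * grown to the given byte offset: once at 48 KB, then once per further MB. */
long cg_switches(long offset)
{
    assert(offset >= 0);
    if (offset < 48 * KB)
        return 0;
    return 1 + (offset - 48 * KB) / MB;
}

/* Hypothetical placement rule: rotate through the groups, starting from
 * the group that holds the inode. */
int cg_for_offset(int inode_cg, int ncg, long offset)
{
    assert(ncg > 0 && inode_cg >= 0 && inode_cg < ncg);
    return (int)((inode_cg + cg_switches(offset)) % ncg);
}
```

For instance, with 4 cylinder groups and the inode in group 3, the block at offset 48 KB would land in group 0 under this rotation.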
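The s5fs super block is split into two parts. In FFS each cylinder group
records its own space usage, while the super block for the whole disk
records only global data such as the size and location of the cylinders and
the size and location of the blocks. Every cylinder group has a backup copy
of the super block. FFS spreads these copies out so that no single head,
track, cylinder, or platter holds all of them. (In other words, not all the
eggs are in one basket.)
Even though today's SCSI disks do not really distinguish head, cylinder,
and sector, and the figures the manufacturer supplies merely multiply out
to the disk's capacity, experiments show that the FFS approach still gives
good performance.
FFS also has many cache improvements that raise its performance, but you
will have to read the book for those :)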
Temporary File System
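A temporary file system improves efficiency wherever files are needed only
temporarily. The old approach was to set aside a chunk of RAM as a RAM
disk, which wastes memory. BSD uses the memory file system (mfs), in which
an I/O server uses its own address space to provide the storage for the
temporary file system; its biggest drawback is that context switches hurt
performance. Sun's approach is the best: tmpfs integrates the vnode/vfs
interface with the VM interface and lets the whole virtual memory provide
the space for tmpfs. Access works exactly as with an ordinary file system,
which makes it the ideal method. Another approach is to tune the file
system's write-back time, in an attempt to delay writing any file system
data to disk and so get the effect of a temp file system (hoping the file
is deleted before it is ever written).
Unix uses temporary files in many places (cc, for example), so the
development of temporary file systems is very valuable.
Chapter 9 ends by mentioning the buffer cache. It is a key topic in Bach's
book, but current Unix no longer uses the buffer cache and instead
integrates the file system with virtual memory, which uses memory more
effectively.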
Chapter 10 Distributed File Systems
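This chapter introduces NFS, RFS, AFS, and DFS.
NFS is actually fairly simple. It just takes the kernel's writes back to
disk (and reads) and sends them out as network packets instead. Its
interface to the kernel is hidden beneath vnode/vfs.
The NFS performance bottleneck is that NFS requires that no write() be
cached; each must be written back immediately. File attributes also have to
be queried one file at a time, so ls -l generates a lot of traffic. Of
course there are now ways to improve this.
NFS version 3 was published in 1995 and corrected most of the problems of
NFS v2. (The security problems lie in RPC. RPC has always provided security
mechanisms; they are just rarely implemented...) Since NFS v3 appeared
quite late, a Unix that supports it can be considered very fashionable :)
The chapter mentions two dedicated NFS servers. The one that interests me
more is the Auspex NS5000. The other, from IBM, focuses on fault tolerance.
The Auspex NS5000 has several CPUs, each doing a single job. They are
divided into two groups: one handles requests from the network, the other
handles disk file system I/O. One additional CPU runs a modified SunOS for
administration, and all the other CPUs run a system called the functional
multiprocessing kernel (FMK), talking to each other by message passing. The
benefit of FMK is that it provides only the functions needed to be an NFS
server and not the Unix semantics/environment, which saves the system a lot
of overhead.
RFS is not worth mentioning...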
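AFS, the Andrew File System, was developed at CMU; Transarc Corporation was
later founded to continue its development. AFS later evolved into the
Distributed File System of OSF DCE. The book covers both AFS and DFS in
detail. They are too complex to summarize, and the book is clearer. The
conclusion is that AFS is the weaker one; DFS makes many improvements and
has more features. What is special about DFS is its improvement to caching
in a distributed file system. A DFS server gives its client a
(read/write/status/lock..) token that allows it to perform
(read/write/status/lock..) operations without synchronizing with the server
and without updating its cache. In other words, that client holds the right
to use the data (the token). The client keeps the token until it is
revoked. Tokens can be revoked at any time, which means someone wants to
change the data and synchronization is needed.
The token idea is very clever. The traditional approach has to synchronize
all the time, whereas the token approach recognizes that most of the time
no race condition arises, so not every operation needs to synchronize.
The text also points out, though, that DFS is very complex and hard to
implement.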
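The token idea can be illustrated with a deliberately tiny model. This is not the DFS protocol: real tokens cover more kinds of access and finer granularity. Everything here (file_tokens, token_grant, token_covers, one token per client per file) is an assumption made to show the grant/revoke pattern only.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CLIENTS 4
enum { TOKEN_NONE = 0, TOKEN_READ = 1, TOKEN_WRITE = 2 };

/* Server-side record of who holds which token for one file. */
struct file_tokens {
    int held[MAX_CLIENTS];   /* token each client currently holds */
    int revocations;         /* how often the server had to call a client back */
};

/* Grant `want` to `client`, revoking conflicting tokens first. */
void token_grant(struct file_tokens *ft, int client, int want)
{
    assert(client >= 0 && client < MAX_CLIENTS);
    for (int c = 0; c < MAX_CLIENTS; c++) {
        if (c == client || ft->held[c] == TOKEN_NONE)
            continue;
        /* Readers can coexist; anything involving a writer conflicts. */
        if (want == TOKEN_WRITE || ft->held[c] == TOKEN_WRITE) {
            ft->held[c] = TOKEN_NONE;
            ft->revocations++;
        }
    }
    ft->held[client] = want;
}

/* An operation needs no server round trip while the token covers it. */
bool token_covers(const struct file_tokens *ft, int client, int op)
{
    return op != TOKEN_NONE && ft->held[client] >= op;
}
```

Two readers cost no revocations at all; synchronization work only appears when a writer shows up, which is the point made above.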
--
Buck barks in the darkness
From: gfcat (Garfield Cat) <gfcat@smth.edu.cn>
Subject: Good Books Worth Sharing - "Unix Internels" (4) BlueOcean (forwarded)
Courier: PKU Weiming BBS mail pigeon
Source: from smthnew (bbs.net.tsinghua.edu.cn [166.111.8.238])
Date: Thu Dec 12 19:05:05 2002
Posted by: BlueOcean (Blue), Board: Unix
Subject: Good Books Worth Sharing - "Unix Internels" (4)
Posting site: BBS Shuimu Tsinghua (Tue Apr 28 01:07:10 1998)
Chapter 11 Advanced File Systems
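It starts with the interleave issue; anyone who does not understand
interleave can read about it here. A key benchmark is also mentioned here:
the efficiency of a Unix file system lies in the speed of writing data.
Because caching in Unix systems is already very good, most read() calls are
satisfied from the cache, but write() has to put the data on disk and so
becomes the bottleneck. Improving the speed of write() improves overall
efficiency.
Next comes a discussion of how the kernel should write data to disk so that
a crash causes the least loss and fsck can recover as much as possible.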
File System Clustering (Sun-FFS)
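File access is usually sequential, even though it is carried out through
several read/write system calls. If the kernel, like C stdio, collects
these data blocks and then writes them to disk together, efficiency
improves. That is file system clustering. SunOS was the first to propose
it, and SVR4 and 4.4BSD later adopted it as well. File system clustering
improved FFS efficiency considerably, so that FFS can still compete with
newly proposed file systems.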
The Journaling Approach
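Here logging = journaling, meaning keeping a record. A journaling file
system or logging file system (jfs or lfs) basically appends (records)
every modification to the file system (including changes to file
attributes, file contents, and file size) to a single file (disk area) in
append-only fashion. The main goal is that after a crash only the last
appended part can be in trouble, so fsck is extremely fast (just discard
that last piece of the log). Put that way it sounds too simple, though; in
practice an lfs is more complicated and has some race conditions to deal
with.
A file consists of two kinds of data. One is the data itself, and the other
is called meta-data, which means the file's permissions, access time,
modify time, and so on.
We can classify an lfs by the following properties:
What is logged? Either both data and metadata, or only meta-data. Meta-data
logging can further choose between recording every change to file
attributes and recording only the changes that affect the structure of the
file system. (If time-stamps, ownership, permissions, and the like fail to
get updated at a crash, the integrity of the file system is basically not
affected.)
How is it implemented? One way is a pure LFS (log-structured file system);
the other uses the lfs idea to assist FFS and improve crash handling
(log-enhanced file system). A pure LFS needs full data logging, while an
assisting lfs usually does only meta-data logging.
There are also two crash recovery methods: the redo-only log and the
undo-redo log. After a crash, a redo-only log finishes carrying out
whatever remains in the log. An undo-redo log can choose to redo or undo.
The difference is in the handling after the crash. Redo-only saves disk
space but runs into more synchronization problems during crash recovery.
Undo-redo has more advantages: where there is a synchronization problem,
the file can be restored. Since it must be able to undo, it records the
file's original data and naturally uses more space.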
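The difference between the two recovery styles can be shown on a toy "disk" of integer cells. This is only a sketch under simplifying assumptions: the record layout (logrec), the committed flag, and both function names are invented for the example and do not come from LFS or Episode.

```c
#include <assert.h>
#include <stdbool.h>

#define NCELLS 8

/* One log record: cell idx changed from old_val to new_val. */
struct logrec {
    int  idx;
    int  old_val;    /* only an undo-redo log has to store this */
    int  new_val;
    bool committed;  /* did the owning operation finish before the crash? */
};

/* Redo-only recovery: replay the effects of completed operations. */
void recover_redo_only(int *cells, const struct logrec *log, int n)
{
    for (int i = 0; i < n; i++) {
        assert(log[i].idx >= 0 && log[i].idx < NCELLS);
        if (log[i].committed)
            cells[log[i].idx] = log[i].new_val;
    }
}

/* Undo-redo recovery: replay completed work, then roll back incomplete
 * work in reverse order using the saved old values. */
void recover_undo_redo(int *cells, const struct logrec *log, int n)
{
    recover_redo_only(cells, log, n);
    for (int i = n - 1; i >= 0; i--)
        if (!log[i].committed)
            cells[log[i].idx] = log[i].old_val;
}
```

If an unfinished change has already reached the disk at crash time, only the undo-redo variant can put the old value back, at the price of storing old_val in every record.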
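The chapter discusses two lfs: 4.4BSD LFS and the Episode File System (by
Transarc, derived from AFS). Episode is the local file system adopted by
the OSF DCE standard and is the foundation of DFS. 4.4BSD LFS is based on
research from the Sprite operating system. LFS uses a redo-only log.
Episode uses a redo-undo log.
Still not sure what an append-only log is? Never mind, it becomes clear
after looking at 4.4BSD LFS. Figure 11-2 shows it at a glance. LFS
basically treats the disk as a tape in which each block is 0.5 MB. A block
is one contiguous region on disk, but (logically) adjacent blocks are not
adjacent on disk; they are linked as a list. In code:
struct log_segment { /* a block is called a log segment in LFS terms */
    int next; /* address of the next log_segment */
    char data[512 * 1024 - sizeof(int)]; /* 0.5 MB minus the link field */
};
LFS dynamically allocates new log segments on disk step by step to hold the
log. The segments are strung together as a linked list.
This is how the log kept by the kernel gets recorded on disk. The tail of
this segment chain holds the newest data of the whole file system, but the
old data is not overwritten either.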
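LFS has one more mechanism, similar to garbage collection and even more
like defragmentation. It uses a cleaner process to clean up old log. This
process regularly reads out old log (one segment at a time). It then
compares the contents of the segment with the actual data. If everything in
the segment is useless (already superseded by newer data), the cleaner
process can free the segment. What if the segment still holds a little live
data? Simple! Write it again by appending it to the newest log (feels
rather like Norton Utilities' Speed Disk :).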
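The cleaner's two cases (segment entirely dead, segment with a few live blocks) can be sketched as below. The structures are toy stand-ins: segment sizes, the live[] flags, and the names lfs_append and lfs_clean are assumptions, and a real cleaner would work out liveness by consulting the file system's own metadata instead of a ready-made flag.

```c
#include <assert.h>
#include <stdbool.h>

#define SEG_BLOCKS 4
#define NSEGS      4

struct segment {
    bool in_use;
    int  nblocks;
    int  block_id[SEG_BLOCKS];   /* which logical block each slot holds */
    bool live[SEG_BLOCKS];       /* false once a newer copy exists */
};

struct lfs {
    struct segment seg[NSEGS];
    int tail;                    /* segment currently being appended to */
};

/* Append one logical block to the tail segment. Returns false if full
 * (a real LFS would start a new segment instead). */
bool lfs_append(struct lfs *fs, int block_id)
{
    struct segment *t = &fs->seg[fs->tail];
    if (t->nblocks == SEG_BLOCKS)
        return false;
    t->block_id[t->nblocks] = block_id;
    t->live[t->nblocks] = true;
    t->nblocks++;
    t->in_use = true;
    return true;
}

/* Clean segment s: re-append its live blocks to the tail, then free it.
 * Returns the number of blocks copied, or -1 if the tail filled up. */
int lfs_clean(struct lfs *fs, int s)
{
    assert(s >= 0 && s < NSEGS && s != fs->tail);
    struct segment *old = &fs->seg[s];
    int copied = 0;
    for (int i = 0; i < old->nblocks; i++) {
        if (!old->live[i])
            continue;            /* superseded data: simply dropped */
        if (!lfs_append(fs, old->block_id[i]))
            return -1;
        old->live[i] = false;
        copied++;
    }
    old->in_use = false;
    old->nblocks = 0;
    return copied;
}
```

A segment with no live blocks is freed with zero copies; every live block costs one extra write, which is why a busy cleaner shows up in I/O benchmarks.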
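LFS still keeps the directory and inode structure. The difference is that
inodes used to sit at a fixed place at the front of the disk blocks,
whereas now they are scattered all over, stored inside the log segments. So
how is this file system read? LFS also has an inode map that records the
location of every inode. The inode map is treated like ordinary data and is
also written to the log periodically. The kernel uses this inode map as the
starting point for reading the file system.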
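What everyone cares about most is probably efficiency. First, LFS consumes
much more memory to operate, which is sometimes a drawback. Compared with
FFS it wins in most situations, except that it loses slightly under heavy
multitasking. Against Sun-FFS (which has file system clustering and some
minor improvements) LFS is about even. BSD LFS is stronger than Sun-FFS at
meta-data handling (create, remove, mkdir, rmdir...). But in I/O-intensive
tests such as read/write, Sun-FFS is faster, especially when the LFS
cleaner is running. Clearly clustering pays off quite a bit.
What you should keep in mind here is that LFS beats FFS at metadata
handling because its writes are more efficient (sequential write). As
mentioned earlier, the bottleneck of a file system is write...
Sun-FFS and BSD LFS get about the same average score on benchmarks that
simulate real workloads. Where LFS has the edge over FFS is probably its
fast recovery!
This section cites several test reports and benchmarks. If you are
interested, see the bibliography to learn how people evaluate the
performance of a file system.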
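This section mentions another interesting product, the Write-Anywhere File
Layout (WAFL) system, used in the FAServer line of NFS products from a
company called Network Appliance Corporation. WAFL combines a
log-structured file system, NVRAM (non-volatile RAM, the sort of thing PC
CMOS is), and a RAID-4 disk array to deliver fast NFS response time. One
WAFL feature that interests many system administrators is the snapshot. A
snapshot is simply a backup, that is, the state of the system at a certain
moment. Snapshots should not be hard to build on a log-structured file
system, since all data is written in append mode; modifying what the
cleaner process does should be enough. Why a snapshot is better than a
traditional backup should be obvious: a traditional backup takes a while,
and if the system is modified during that time there is no telling whether
those modifications made it into the backup. A snapshot, like a photograph,
captures an instantaneous backup of the system. Users can also use
snapshots to get at old file contents or to undelete.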
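The comparison between LFS and Sun-FFS explains why meta-data logging (the
log-enhanced file system) exists: it takes the strengths of each to offset
the weaknesses. The chapter also discusses meta-data logging systems.
Meta-data logging is more popular in commercial systems and has become the
market mainstream, because it can be built on top of FFS without large
changes and can be combined with FFS improvements (such as clustering),
each complementing the other.
The other lfs discussed in the chapter is Transarc's Episode File System.
Episode uses redo-undo metadata logging, and its file systems can span
several disks. Episode also offers a snapshot-like feature called cloning.
Cloning uses copy-on-write and copies only the anodes (Episode's inodes).
On the security side Episode also provides POSIX-style access-control lists
(ACLs), giving finer control over file attributes than traditional Unix.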
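If you could intercept a user's accesses to certain directories and quietly
swap things around, the system would surely get more interesting. For
example, notify the user when mail arrives (no need to keep polling), print
the current time when the user reads a certain file, or open a connection
to <service:hostname> when the user opens /tcp/<service>/<hostname>, so
that shell and awk programs with no network awareness can easily handle
data on the network.
The chapter discusses two such systems. The first is watchdogs, a student
research project that is not well integrated. Only the 4.4BSD Portal file
system counts as a complete solution.
Writing a new file system is complicated, and most people may only want to
add a small feature on top of a file system, on-line compression for
example. 4.4BSD and SunSoft both proposed a modular stackable file system
mechanism for this purpose. The SunSoft version was still only a draft when
the book was written. The BSD one had already gone into the 4.4BSD source
code.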
Chapter 12 Kernel Memory Allocation
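From this chapter on the discussion turns to memory management. Chapter 12
discusses how the kernel manages the memory used by its own data
structures, such as inode allocation. Chapters 13, 14, and 15 discuss the
kernel's virtual memory management. Memory the kernel uses for its own data
structures cannot be given to the paging system, so the balance between the
two is important.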
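The chapter presents several memory allocators. Among them, the method used
by the C library's malloc is worth sharing. malloc allocates memory with
so-called power-of-two free lists, managing memory in different
power-of-two sizes (32, 64, 128, ... 1024 bytes). The allocator reserves
the first few bytes of each block as a header. When the block is not in use
(free), the header points to the next free block, and the free blocks form
a list. When the block is in use, the header points to the list it belongs
to (say, the list of size 32), so that free() knows how to return the
memory. If you have the K&R book, you can check whether the example in the
book does it this way.
Power-of-two allocation has a fatal flaw: the usable space is only
sizeof(block)-sizeof(header), that is, slightly less than
(32, 64, 128, ... 1024). If an application asks for, say, 64 bytes, a
64-block cannot hold it and a 128-block has to be allocated, which is
wasteful. Think back: when you write programs, don't you like malloc(128),
malloc(512), malloc(1024)? Doesn't it feel as if that should be better for
performance? After reading this description you may not think so any more.
I suppose this is also one of the things people look at when judging the
memory management of different C compilers. If you often fetch source code
and install it yourself, you can see why many authors ignore the system
library and supply their own malloc.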
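The waste is easy to see with a little arithmetic. The following sketch only computes which list a request would be served from; it is not a real allocator, and the header size of one pointer is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define HEADER_SIZE sizeof(void *)   /* assumed: the header is one pointer */
#define MIN_BLOCK   32
#define MAX_BLOCK   1024

/* Smallest power-of-two block that fits `request` bytes plus the header,
 * or 0 if the request is too large for these lists. */
size_t pow2_block_for(size_t request)
{
    size_t need = request + HEADER_SIZE;
    for (size_t blk = MIN_BLOCK; blk <= MAX_BLOCK; blk *= 2)
        if (blk >= need)
            return blk;
    return 0;
}

/* Bytes of the chosen block that the caller did not ask for
 * (header plus rounding). */
size_t pow2_waste(size_t request)
{
    size_t blk = pow2_block_for(request);
    assert(blk != 0);
    return blk - request;
}
```

With these assumptions a request for 64 bytes is served from the 128-byte list, so half of the block is overhead, while a request for 56 bytes still fits the 64-byte list.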
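The book describes an improvement on power-of-two called the
McKusick-Karels allocator, adopted by 4.4BSD and Digital UNIX. The
McKusick-Karels method cuts a contiguous stretch of memory into pieces of
one fixed size, say 32 bytes. The header of an in-use block then no longer
needs to point back to its list, because the kernel can tell from the
block's address which list it belongs to.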
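Next comes the buddy system, an allocation method rather different from
power-of-two. Its advantage is that adjacent space can be merged after
free() into larger usable space (power-of-two does this poorly). This
property is called coalescing. The buddy system can also exchange memory
with the paging system easily, so the memory occupied by the kernel can be
adjusted dynamically. Its performance is not very good, though, because
every time memory is released the allocator greedily merges all the
adjacent memory, wasting a lot of time.
SVR4 uses a modified buddy algorithm, lazy buddy, to allocate kernel
objects.
The buddy system, like power-of-two, allocates memory in units of 2^n.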
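Eager coalescing can be sketched in a few lines. The buddy of a block is found by flipping one bit of its offset; on free, the allocator keeps merging upward while the buddy is also free. The pool layout below (one bitmask per size level, four levels) and the names buddy_of, buddy_pool and buddy_free are assumptions for the demo, not the SVR4 code.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the buddy of the block at `offset`; size must be a power of
 * two and offset a multiple of size. */
size_t buddy_of(size_t offset, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    assert(offset % size == 0);
    return offset ^ size;
}

#define LEVELS 4   /* block sizes: 1, 2, 4, 8 units */

struct buddy_pool {
    unsigned free_mask[LEVELS];   /* bit i set: block i at this level is free */
};

/* Free block `index` at `level`, merging with free buddies as far as
 * possible. Returns the level at which the merged block ends up. */
int buddy_free(struct buddy_pool *p, int level, unsigned index)
{
    assert(level >= 0 && level < LEVELS);
    while (level < LEVELS - 1) {
        unsigned buddy = index ^ 1u;
        if (!(p->free_mask[level] & (1u << buddy)))
            break;                            /* buddy busy: stop merging */
        p->free_mask[level] &= ~(1u << buddy);
        index >>= 1;                          /* merged block, one level up */
        level++;
    }
    p->free_mask[level] |= 1u << index;
    return level;
}
```

The while loop is the "greedy" part criticized above: a single free can climb through every level. A lazy variant would postpone this merging.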
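Mach and OSF/1 use another method, the zone allocator. It no longer
allocates in units of 2^n but is object-oriented. The allocator gets a
piece of memory from the paging system and cuts it into n pieces according
to the object size. For example, the port data structure is 104 bytes, so
Mach divides the memory it got (say 1 KB) into 1024/104 pieces, that is,
9 ports. This clearly improves memory utilization. The memory used for one
kind of object is called a zone, for example the zone of ports, the zone of
inodes, and so on. Different objects use different zones even if they are
the same size.
The zone allocator uses a background garbage collection routine to reclaim
memory.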
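The most impressive thing in this chapter is the slab allocator of
Solaris 2.4. The slab allocator goes in much the same direction as the zone
allocator, using object size as the allocation unit, but it goes a step
further and analyzes how the memory is used. Take an inode. First we get a
piece of memory, malloc(sizeof(inode)), then initialize the inode, then use
it normally, and when we are done we return the memory with free(). The
slab allocator notices that the contents of the memory after free() are
much the same as right after initialization; for example, the inode's
reference count must have dropped to zero. Many kernel data structures are
freed only once they are back in the same state as at initialization.
Another example: a mutex lock is unlocked when initialized and unlocked
when freed.
The slab allocator exploits this property by initializing all the objects
in a slab (a zone, in Mach terms) ahead of time, which saves a good deal of
initialization time.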
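Another issue the slab allocator pays attention to is CPU cache
utilization. The usual cache algorithm is
cache location = address % cache_size
Memory handed out by an ordinary power-of-two allocator is aligned, and
most programs habitually put the most frequently used fields at the very
front of a structure. The two effects together make these fields evict each
other from the cache, so that only part of a 512 KB cache may be doing
useful work. Worse, if main memory is interleaved, as on the SPARCcenter
2000, which uses two buses with the lower 256 bytes on the first bus and
the upper 256 bytes on the second, all the data may concentrate on the
first bus and cause an imbalance.
Slab's solution works like this. After getting a block from the paging
system (say 1 KB), slab places its own bookkeeping data at the very end of
the block; suppose it takes y bytes. Suppose the object being allocated is
an inode of 104 bytes, the same size as in the Mach example above. This
memory then provides (1024-y)/104 inodes, with some remainder, that is,
some leftover memory. Slab makes good use of this leftover by splitting it
in two, one part at the very front of the memory and one at the very end.
The part at the front is called the coloring area. Slab tries to use a
coloring area of a different size on each page it allocates, which
effectively spreads out where the data maps in the cache and raises the
cache hit rate.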
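Here is the slab arithmetic as a small C sketch. The numbers plugged in afterwards (32 bytes of bookkeeping, 8-byte alignment, a 64 KB cache) are assumptions chosen for the example, and stepping the color by one alignment unit per slab is just one plausible way to vary the coloring area.

```c
#include <assert.h>
#include <stddef.h>

struct slab_layout {
    size_t nobjs;   /* objects that fit in one slab */
    size_t slack;   /* leftover bytes available for coloring */
};

/* Layout of a slab of slab_size bytes with `meta` bytes of bookkeeping at
 * the end and objects of obj_size bytes. */
struct slab_layout slab_compute_layout(size_t slab_size, size_t meta,
                                       size_t obj_size)
{
    assert(obj_size > 0 && meta < slab_size);
    struct slab_layout l;
    l.nobjs = (slab_size - meta) / obj_size;
    l.slack = (slab_size - meta) % obj_size;
    return l;
}

/* Size of the coloring area for the n-th slab: step through the slack in
 * multiples of `align`, wrapping around (an assumed stepping rule). */
size_t slab_color(size_t n, size_t slack, size_t align)
{
    assert(align > 0);
    size_t ncolors = slack / align + 1;
    return (n % ncolors) * align;
}

/* The simple cache placement rule quoted in the text. */
size_t cache_location(size_t address, size_t cache_size)
{
    assert(cache_size > 0);
    return address % cache_size;
}
```

With a 1 KB slab, 32 bytes of bookkeeping and 104-byte objects, 9 objects fit and 56 bytes are left over. Two slabs whose base addresses differ by exactly the cache size would normally put their first objects on the same cache location; with colors 0 and 8 they no longer collide.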
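Allocator footprint refers to the "footprint" an allocator leaves in the
CPU cache / TLB (translation lookaside buffer) when, while allocating
memory, it brings itself and the data it references into the cache/TLB.
What the allocator leaves in the cache/TLB is basically useless and keeps
genuinely useful data from staying in the cache. The buddy algorithm has to
reference a lot of data to allocate memory, producing a large footprint and
more cache misses. The footprints of McKusick-Karels and the zone allocator
are both small, because allocating memory just takes the first element off
a free list. So a good allocator should use simple computation to allocate
objects. Slab follows the same principle: both allocation and release are
just a line or two of computation, so its footprint is small too.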
Chapter 13 Virtual Memory
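This chapter gives a general treatment of virtual memory, introducing
paging, segmentation, swapping, virtual memory, and so on, much as
operating systems textbooks do. It then looks at the MMUs of several
popular CPUs as case studies. The MIPS R3000 is unusual: the CPU does not
manage the TLB automatically but provides a set of TLB registers for the
kernel to play with itself.
Modern Unix systems all use paging to provide virtual memory. Usually,
though, the CPU's paging mechanism is not complete. Besides maintaining the
page tables the CPU needs, the kernel has to maintain a corresponding table
of its own to meet its needs.
The chapter ends with the 4.3BSD virtual memory system. 4.3BSD uses the
cmap[] data structure to help manage paging. cmap was designed for the
VAX-11 architecture and supports neither shared memory, nor shared
libraries, nor memory mapped files, nor copy-on-write, and the list goes
on; today it can be retired. Even so, the 4.3BSD architecture laid a good
foundation for later development.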
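BSD handles swap space quite conservatively. It requires every page in main
memory to have a piece of swap space reserved before it is allocated, so
the size of the swap space limits how many programs can run. This also
guarantees that a program runs out of memory only at fork or exec, and
never ends up halfway through execution, about to be swapped out, with no
swap space to be found. In other words, if your computer has 64 MB of
memory but only 16 MB of swap, such a system is willing to let you use only
16 MB. This is why some system administration books advise that swap space
not be smaller than main memory.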
Chapter 14 The SVR4 VM Architecture
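The SVR4 VM architecture derives from the VM (Virtual Memory) technology
introduced in SunOS 4.0. SunOS developed VM to provide memory sharing,
shared libraries, and memory-mapped files. And because SunOS runs on M68K,
i386, and SPARC, the VM architecture is very portable.
Later, Sun and AT&T together designed the SVR4 virtual memory system on the
basis of VM. It replaced the regions architecture used up to SVR3, which is
described in Bach's book.
Memory-mapped files use virtual memory techniques to map the contents of a
file into a program's address space, so that the program can access the
file directly as if it were memory. The kernel provides the mmap() system
call as the interface to this mechanism.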
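For readers who have not used it, here is a short user-level example of mmap() as specified by POSIX. The helper name count_byte_mapped is made up; the calls themselves (open, fstat, mmap, munmap, close) are the standard ones. It counts how often a byte occurs in a file by reading the mapping like an ordinary array.

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte c in the file at path by mapping the file and
 * reading it as memory. Returns -1 on error. */
long count_byte_mapped(const char *path, char c)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {           /* zero-length mappings are rejected */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long n = 0;
    for (size_t i = 0; i < len; i++) /* plain memory reads, no read() calls */
        if (p[i] == c)
            n++;

    munmap(p, len);
    return n;
}
```

The loop never calls read(); pages are brought in by page faults as the array is touched, which is exactly the mechanism the chapter describes.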
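The design concept of SVR4 VM overturns the traditional view of memory. In
VM, physical memory becomes merely a cache for the virtual address space.
How so? VM can give different meanings to parts of a program's address
space: one range represents text and points to the text section of the
executable on disk, one represents data and points to the swap area,
another points to some file as the space of a memory mapped file, and so
on. VM's job is to load this backing data into the "cache", main memory, on
page faults (in units of pages) so that the CPU can access it. Data in main
memory is always temporary; it all has a backing storage device where it is
kept for the long term. (Long term here means while the process is blocked,
sleeping, or swapped out.)
Saying above that the data section points to the swap area was made up for
ease of explanation. What really happens is that anonymous pages take in
these homeless children. The data section initially points to the file, but
a flag is set, and as soon as a page is modified it becomes an anonymous
page, which automatically uses the swap area as its backing store.
When ranges of an address space map to different things, different code
should handle them. SVR4 calls such a range a segment, and the code that
handles a given kind of segment is a segment driver.
This section also covers the interaction between vnodes and the paging
system.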
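Solaris 2.x improved on SVR4 by introducing virtual swap space, which
extends swap space to swap area + physical memory (so swap can be smaller
than physical memory?!) and allows swap areas to be reassigned dynamically.
Previously, once a piece of swap was assigned to a page, it stayed wedded
to that particular page of that process for the whole lifetime of the
process and never changed. The drawback was that a swap disk/file could not
be removed dynamically. To make this possible, Solaris designed swapfs to
manage swap space. From then on anonymous pages are backed by swapfs
instead of bypassing the file system and going straight to swap.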
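When VM was ported from SunOS to SVR4, its performance was very
disappointing compared with the regions architecture. Analysis showed that
the SVR4 fault rate was too high, so there was room for improvement.
Because VM is so lazy, everything is done only after a page fault, and page
faults are very expensive. The optimizations therefore aim at reducing page
faults that are obviously going to happen. For example, programs generally
do some work between fork and exec, so initializing the page tables fully
is a good thing. At exec, the new page tables are initialized as well, so
that execution does not fault right away.
exec also checks whether the new program has text pages already in main
memory and, if so, includes them on the spot.
The last improvement is to copy-on-write. After fork, the kernel examines
the parent's anonymous pages that are in main memory and copies them all up
front. As mentioned earlier, data that becomes an anonymous page has been
modified, and a page that is in main memory rather than swapped out was
probably modified recently. The reason for copying these pages in advance
is that recently modified data is likely to be modified by the child as
well. This happens most often in shells. A shell often forks itself many
times, and after each fork it modifies variables in the same pattern.
Finally the chapter gives a measurement: the number of page faults improved
markedly, to the point of being better than under SVR3.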
Chapter 15 More Memory Management Topics
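This chapter covers Mach's virtual memory management. Although it is a
different design with different terminology, it has much in common with the
SVR4 VM architecture. Mach's design is clearer and easier to follow; if
Chapter 14 is hard going, read this chapter first. The 4.4BSD VM
architecture is based on Mach, although system management in 4.4BSD is
closer to SVR4.
The other focus of the chapter is TLB consistency, a problem that arises on
multiprocessors. If the kernel changes the data for some page, how does it
let the other processors know and correct their TLB contents? Basically
this is a very troublesome problem, especially when the CPU offers little
support. Mach's method is the simplest and the most general (it needs no
CPU support, only an inter-processor lock), but it wastes a lot of time
synchronizing and is not efficient. TLB handling is probably the most
troublesome part of multiprocessor support; handled badly, a whole pile of
processors waste their time synchronizing.
Blind synchronization is wasteful. A smarter approach is to analyze when
the TLB actually gets modified. If the change is in the kernel's address
space, the kernel can avoid TLB modifications through careful design, so
the only occasions left are the user processes the kernel cannot control.
And not every one of those cases requires updating the other processors'
TLBs immediately. A lot of unnecessary trouble can be saved this way.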
Chapter 16 Device Drivers and I/O
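Introduces topics related to device drivers and I/O, along with the
cooperation between device drivers and the file system, dynamic loading and
unloading, and so on. Basically it comes down to which interfaces a device
driver must provide to the kernel and which kernel functions and variables
it may use. These vary with the Unix version...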
Chapter 17 STREAMS
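The STREAMS architecture was originally meant to solve the problems of too
much duplicated code among character devices and of buffering, but STREAMS
turned out so powerful that terminal drivers, pipes, and network drivers
are all built with it. STREAMS is supported by most UNIX vendors, has
become a widely accepted standard, and is a popular framework for writing
network drivers.
STREAMS is modular. Users can push processing modules one after another
onto a stack, and data flows down through the layers of modules to reach
the driver, and likewise back up. The terminal driver can then concentrate
on the details of talking to the terminal, and everything to do with the
rest of the Unix system and the user interface can be left to the modules
above. The STREAMS architecture specifies in detail how modules communicate
and how they are expected to "behave".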
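The stacking idea can be mimicked in plain C with function pointers. To be clear, this is a toy and not the STREAMS API: real STREAMS uses queues, message blocks, and put/service procedures. The names stream, stream_push, stream_write, mod_upper, mod_strip_cr and toy_driver are all invented here, and only the downstream direction is shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_MODULES 4

/* A module processes a message in place on its way downstream. */
typedef void (*put_fn)(char *msg);

struct stream {
    put_fn module[MAX_MODULES];        /* last pushed is nearest the user */
    int    nmods;
    void (*driver)(const char *msg);   /* bottom of the stack */
};

/* Push a module on top of the stack. Returns -1 if the stack is full. */
int stream_push(struct stream *s, put_fn m)
{
    if (s->nmods == MAX_MODULES)
        return -1;
    s->module[s->nmods++] = m;
    return 0;
}

/* Write: the message passes through the modules top-down, then reaches
 * the driver. */
void stream_write(struct stream *s, char *msg)
{
    assert(s->driver != NULL);
    for (int i = s->nmods - 1; i >= 0; i--)
        s->module[i](msg);
    s->driver(msg);
}

/* Two toy modules and a toy driver for the example. */
void mod_upper(char *msg)
{
    for (; *msg; msg++)
        *msg = (char)toupper((unsigned char)*msg);
}

void mod_strip_cr(char *msg)
{
    char *out = msg;
    for (; *msg; msg++)
        if (*msg != '\r')
            *out++ = *msg;
    *out = '\0';
}

static char last_written[64];   /* what the toy driver "sent to the device" */

void toy_driver(const char *msg)
{
    strncpy(last_written, msg, sizeof last_written - 1);
    last_written[sizeof last_written - 1] = '\0';
}
```

The driver knows nothing about case conversion or carriage returns; those concerns live entirely in the pushed modules, which is the division of labor described above.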
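System V also defines a Transport Provider Interface and a Transport Layer
Interface (TPI/TLI), which play a role similar to BSD's socket interface
and provide a standard interface for high-level programming.
Although STREAMS/TLI seemed to have plenty of potential when the book was
written, the socket interface was simply too dominant, and winsock added to
its momentum. STREAMS clearly never took off in networking, but it has done
very well in other areas.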
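Postscript
That wraps up the interesting parts of this book. I hope my selection gives
people who know a little Unix room to go deeper and move up a level. I left
out many basic Unix kernel concepts and hope that does not leave anyone
lost. The introduction to Unix, on the other hand, seems to have come out
too detailed; that is because I found that many Chinese-language books get
this part wrong....
The books mentioned in the preface (this one included) are all well worth
reading for anyone who wants to understand Unix. Many details only become
clear after reading the whole text carefully, and a summary like this
cannot convey them completely.
I have tried to mention what the book says about 4.4BSD, Mach, and
SVR4/Solaris. I hope it helps in understanding Linux/FreeBSD/Solaris as
well as the future GNU Hurd and Apple's Rhapsody. Also, have you noticed,
as I have, that Sun is really no pushover and clearly knows a thing or two?
One shortcoming of the book is that the proofreading is not quite complete.
The text refers to Section 0 several times, but there is no Section 0. A
small blemish, I suppose!
The book does not cover TCP/IP networking; for that see Stevens, TCP/IP
Illustrated, Volumes 1, 2, and 3.
The book does not discuss shared libraries or the structure of executables
in depth, which is also a pity...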
--
Shiau Yong-Ching
--
Buck barks in the darkness
[code][/code] |
| |