Boykin, J. “Operating Systems.” The Electrical Engineering Handbook, Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.

96 Operating Systems

Joseph Boykin
Clarion Advanced Storage

96.1 Introduction
96.2 Types of Operating Systems
96.3 Distributed Computing Systems
96.4 Fault-Tolerant Systems
96.5 Parallel Processing
96.6 Real-Time Systems
96.7 Operating System Structure
96.8 Industry Standards
96.9 Conclusions

96.1 Introduction

An operating system is just another program running on a computer. It is unlike any other program, however. An operating system's primary function is the management of all hardware and software resources. It manages processors, memory, I/O devices, and networks. It enforces policies such as protecting one program from another and ensuring that users have fair and equal access to system resources. It is privileged in that it is the only program that can perform specialized hardware operations. The operating system is the primary program upon which all other programs rely.

To understand modern operating systems we must begin with some history [Boykin and LoVerso, 1990]. The modern digital computer is only about 40 years old. The first machines were giant monoliths housed in special rooms, and access to them was carefully controlled. To program one of these systems, the user scheduled access time well in advance, for in those days the user had sole access to the machine: the program such a user ran was the only program running on it.

It did not take long to recognize the need for better control over computer resources. This began in the mid-1950s with the dawn of batch processing and early operating systems that did little more than load programs and manage I/O devices. The 1960s brought more general-purpose systems, and new operating systems that provided time-sharing and real-time computing were developed. This was the time when the foundation for all modern operating systems was laid.

Today's operating systems are sophisticated pieces of software. They may contain millions of lines of code and provide such services as distributed file access, security, fault tolerance, and real-time scheduling. In this chapter we examine many of these features of modern operating systems and their use to the practicing engineer.

96.2 Types of Operating Systems

Different operating systems (OS) provide a wide range of functionality. Some are designed as single-user systems and some for multiple users. The operating system, with appropriate hardware support, can protect one executing program from malicious or inadvertent attempts by another to modify or examine its memory. When connected to a storage device such as a disk drive, the OS implements a file system to permit the storage of files. The file system often includes security features to protect against file access by unauthorized users. The system may be connected to other computers via a network and thus provide access to remote system resources.

Operating systems are often categorized by the major functionality they provide: distributed computing, fault tolerance, parallel processing, real-time operation, and security. While no operating system incorporates all of these capabilities, many have characteristics from each category.

An operating system does not need to contain every modern feature to be useful. For example, MS-DOS is a single-user system with few of the features now common in other systems.
Indeed, this system is little more than a program loader reminiscent of operating systems from the early 1960s. Unlike those vintage systems, however, numerous applications run under MS-DOS. It is the abundance of programs that solve problems from word processing to spreadsheets to graphics that has made MS-DOS popular. The simplicity of such systems is exactly what makes them attractive to the average person.

Systems capable of supporting multiple users are termed time-sharing systems; the system is shared among all users, with each user having the view that he or she has all system resources available. Multiuser operating systems provide protection for both the file system and the contents of main memory. The operating system must also mediate access to peripheral devices; for example, only one user may have access to a tape drive at a time.

Fault-tolerant systems rely on both hardware and software to ensure that the failure of any single hardware component, or even multiple components, does not cause the system to cease operation. Building such a system requires that each critical hardware component be replicated at least once. The operating system must be able to dynamically determine which resources are available and, if a resource fails, move a running program to an operational unit.

Security has become more important in recent years. Secure systems prevent theft of data and unauthorized access to data. Within the United States, levels of security are defined by a government-produced document known as the Orange Book. This document defines seven levels of security, denoted from lowest to highest as D, C1, C2, B1, B2, B3, and A1. Many operating systems provide no security and are labeled D. Most time-sharing systems are secure enough to be classified at the C1 level. The C2 and B1 levels are similar, and this is where most secure operating systems are currently classified. During the 1990s, B2 and B3 systems will become readily available from vendors. The A1 level is extremely difficult to achieve, although several such systems are under development.

In the next several sections we expand upon the topics of distributed computing, fault-tolerant systems, parallel processing, and real-time systems.

96.3 Distributed Computing Systems

The ability to connect multiple computers through a communications network has existed for many years. Initially, computer-to-computer communication consisted of a small number of systems performing bulk file transfers. The 1980s brought the invention of high-speed local area networks, or LANs. A LAN allows hundreds of machines to be connected together. New capabilities began to emerge, such as virtual terminals that allowed a user to log on to a computer without being physically connected to that system. Networks were used to provide remote access to printers, disks, and other peripherals. The drawback to these systems was the software; it was not sophisticated enough to provide a totally integrated environment. Only small, well-defined interactions among machines were permitted.

Distributed systems provide the view that all resources from every computer on the network are available to the user. What is more, access to resources on a remote computer is viewed in the same way as access to resources on the local computer. For example, a file system that implements a directory hierarchy, such as UNIX, may have some directories on a local disk while one or more directories are on a remote system. Figure 96.1 illustrates how much of the directory hierarchy would be on the local system, while user directories (shaded in the figure) could be on a remote system.

FIGURE 96.1 UNIX file system hierarchy in a distributed environment.
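This location transparency means application code need not distinguish local from remote files. As a minimal sketch (the path, file name, and buffer size here are hypothetical), a POSIX-style program reads a file identically whether its directory resides on a local disk or on a remotely mounted file system:

    /* Location transparency: the same code works whether /home/boykin is
     * a local disk or a directory mounted from a remote file server.
     * The path is hypothetical. */
    #include <fcntl.h>
    #include <unistd.h>

    int read_profile(char *buf, int len)
    {
        int fd = open("/home/boykin/profile", O_RDONLY);  /* local or remote */
        if (fd < 0)
            return -1;                /* file inaccessible */
        len = read(fd, buf, len);     /* the read proceeds identically */
        close(fd);
        return len;
    }

The distributed file system, not the application, decides where the bytes actually live.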
There are many advantages to distributed systems. Advantages over centralized systems include [Tanenbaum, 1992]:

• Economics: Microprocessors offer a better price/performance ratio than mainframes.
• Speed: A distributed system may have more total computing power than a mainframe.
• Reliability: If one machine crashes, the system as a whole can still survive.
• Incremental growth: Computing power can be added in small increments.

Advantages over nonnetworked personal computers include [Tanenbaum, 1992]:

• Data sharing: Many users may access a common database.
• Device sharing: Many users may share expensive peripherals such as color printers.
• Communication: Human-to-human communication, for example by electronic mail, becomes easier.
• Flexibility: The workload can be spread over the available machines in the most cost-effective way.

While there are many advantages to distributed systems, there are also several disadvantages. The primary difficulty is that the software for implementing distributed systems is large and complex; small personal computers cannot effectively run modern distributed applications. Software development tools for this environment are not well advanced, so application developers have a difficult time working in it.

An additional problem is network speed. Most office networks are currently based on IEEE standard 802.3 [IEEE, 1985], commonly (although erroneously) called Ethernet, which operates at 10 Mb/s (ten million bits per second). With this limited bandwidth, it is easy to saturate the network. Higher-speed networks do exist, such as FDDI (fiber distributed data interface, an optical fiber ring with a data rate of 100 Mb/s) and ATM (asynchronous transfer mode, a packet-oriented transfer mode moving data in fixed-size packets called cells; ATM has no fixed speed, but typical implementations currently run at 155 Mb/s, with some running at 2 Gb/s). They are not yet in common use, however.

While distributed computing has many advantages, we must also understand that without appropriate safeguards our data may not be secure. Security is a difficult problem in a distributed environment: whom do you trust when there are potentially thousands of users with access to your local system? A network is subject to security attack by a number of mechanisms. It is possible to monitor all packets going across the network; hence, unencrypted data are easily obtained by an unauthorized user. A malicious user may mount a denial-of-service attack by flooding the network with packets, making all systems inaccessible to legitimate users.

Finally, we must deal with the problem of scale. Connecting a few dozen or even a few hundred computers may not strain current software. However, global networks of computers are now being installed. Scaling our current software to work with tens of thousands of computers running across large geographic boundaries, over many different types of networks, is a challenge that has not yet been met.

96.4 Fault-Tolerant Systems

Most computers simply stop running when they break. We take this as a given. There are many environments, however, where it is not acceptable for the computer to stop working; the space shuttle is a good example. There are other environments where you would simply prefer that the system continue to operate: a business using a computer for order entry can continue to operate if the computer breaks, but the cost and inconvenience may be high. Fault-tolerant systems are composed of specially designed hardware and software that are capable of continuous operation.

Building a fault-tolerant system requires both hardware and software modifications. Consider a small example that illustrates the type of changes that must be made. Remember, the goal of such a system is continuous operation; that means we can never purposely shut the computer off. How then do we repair the system if we cannot shut it off? First, the hardware must be capable of having circuit boards plugged and unplugged while the system is running; this is not possible on most computers. Second, removing a board must be detected by the hardware and reported to the operating system. The operating system, the manager of resources, must then discontinue use of that resource. Each component of the computer system, both hardware and software, must be specially built to handle failures.

It should also be obvious that a fault-tolerant system must have redundant hardware. If, for example, a disk controller should fail, there must be another controller communicating with the disks that can take over.

One problem in implementing a fault-tolerant system is knowing when something has failed. If a circuit board totally ceases operation, we can determine the failure by its lack of response to commands. In another failure mode, the failing component appears to work but operates incorrectly. A common approach to detecting this problem is a voting mechanism: with three hardware replicas, the system can detect that any one has failed by its producing output inconsistent with the other two. In that case, the output of the two components in agreement is used.
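A minimal sketch of such a two-out-of-three voter in software (the long-valued results and the -1 error sentinel are illustrative assumptions; production systems typically vote in hardware and latch the faulty unit out of service):

    /* Two-out-of-three majority voter: returns the value produced by at
     * least two replicas. The -1 sentinel for triple disagreement is for
     * illustration only; a real voter signals the fault out of band. */
    long vote(long a, long b, long c)
    {
        if (a == b || a == c)
            return a;      /* a agrees with at least one other replica */
        if (b == c)
            return b;      /* a is the odd one out; b and c agree */
        return -1;         /* all three disagree: unrecoverable here */
    }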
The operating system must also be capable of restarting a program from a known point when a component on which the program was running has failed. The system can use checkpoints for this purpose. When an application program reaches a known state, such as the completion of a transaction, it stores the current state of the program and of all I/O operations; this stored state is known as a checkpoint. Should a component on which the program is running fail, the operating system can restart the program from the most recent checkpoint.
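At its simplest, a checkpoint is a snapshot of the application's state written to stable storage at a transaction boundary. A minimal sketch, assuming a flat state structure and a hypothetical checkpoint file (a real checkpoint must also capture pending I/O, as noted above):

    /* Write a checkpoint after each completed transaction; on restart,
     * recover the most recent one. The state structure and file name
     * are hypothetical. */
    #include <fcntl.h>
    #include <unistd.h>

    struct state { long last_transaction; char data[4096]; };

    int checkpoint(const struct state *s)
    {
        int fd, ok;

        /* A production version would write a temporary file and rename()
         * it, so a crash mid-checkpoint cannot destroy the previous one. */
        fd = open("/var/app.ckpt", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        ok = (write(fd, s, sizeof(*s)) == sizeof(*s));
        close(fd);
        return ok ? 0 : -1;
    }

    int restore(struct state *s)
    {
        int fd, ok;

        fd = open("/var/app.ckpt", O_RDONLY);
        if (fd < 0)
            return -1;               /* no checkpoint: start from scratch */
        ok = (read(fd, s, sizeof(*s)) == sizeof(*s));
        close(fd);
        return ok ? 0 : -1;
    }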
While the advantage of fault-tolerant systems is obvious, they come at a price. Redundant hardware is expensive, and software capable of recovering from faults runs more slowly. As with many other systems, the price may be more than offset by the advantage of continuous computing.

96.5 Parallel Processing

No matter how fast computers become, it seems they are never fast enough. Manufacturers make faster computers by decreasing the amount of time it takes to perform each operation. An alternative is to build a computer that performs several operations simultaneously. A parallel computer, also called a multiprocessor, is one that contains more than one CPU (central processing unit, the hardware component that performs all arithmetic and logical operations). The advantage of a parallel computer is that it can run more than one program simultaneously. In a general-purpose time-sharing environment, parallel computers can greatly enhance overall system throughput, since each program shares a CPU with fewer other programs. This approach is similar to having several computers connected on a network but has the advantage that all resources are more easily shared.

Taking full advantage of a parallel computer requires changes to the operating system [Boykin and Langerman, 1990] and to application programs. Most programs are easily divided into pieces that can each run at the same time. If each of these pieces is a separate thread of control, they can run simultaneously on a parallel computer. By so dividing the application, the program may run in less time than it would on a single-processor (uniprocessor) computer.

Within the application program, each thread runs as if it were the only thread of control. It may call functions, manipulate memory, perform I/O operations, etc. If the threads do not interact with each other, then, to the application programmer, there is little change other than determining how to subdivide the program. However, it would be unusual for threads not to interact, and it is this interaction that makes parallel programming more complex.

In principle, the solution is rather simple. Whenever a thread will manipulate memory or perform an I/O operation, it must ensure that it is the only thread modifying that memory location or doing I/O to that file until it has completed the operation. To do so, the programmer uses a lock, a mechanism that allows only a single thread at a time to execute a given code segment. Consider an application with several threads of control, each of which performs an action and writes the result to the same file. Within each thread we might have code that looks as follows:

    void thread(void)
    {
        dowork();                      /* application-specific computation */
        writeresult();
    }

    void writeresult(void)
    {
        lock();                        /* enter the critical section */
        write(logfid, result, 512);    /* only one thread writes at a time */
        unlock();                      /* leave the critical section */
    }

In this example the writeresult function calls the lock function before it writes the result and calls unlock afterward. Other threads simultaneously calling writeresult will wait at the call to lock until the thread that currently holds the lock calls the unlock function.

While this approach is simple in principle, it is more difficult in practice. It takes experience to determine how a program may be divided. Even with appropriate experience, a multithreaded application is harder to debug. With several threads of control operating simultaneously, finding a mistake is not simply a matter of stepping through the program line by line; most often, it is the interaction between threads that is the problem. Multithreading a program may not be a trivial matter. As with most types of programming, however, experience makes the process easier. The benefit is significantly enhanced performance.
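On systems that support the POSIX threads interface, the lock and unlock primitives above map onto a mutex. A self-contained sketch of the same pattern (the thread count, log file name, and the dowork placeholder are arbitrary assumptions):

    /* Several threads writing results to one file, serialized by a mutex.
     * Compile with the threads library, e.g. cc file.c -lpthread on most
     * UNIX systems. */
    #include <pthread.h>
    #include <fcntl.h>
    #include <unistd.h>

    static pthread_mutex_t loglock = PTHREAD_MUTEX_INITIALIZER;
    static int logfid;
    static char result[512];

    static void writeresult(void)
    {
        pthread_mutex_lock(&loglock);    /* lock(): enter critical section */
        write(logfid, result, sizeof(result));
        pthread_mutex_unlock(&loglock);  /* unlock(): leave critical section */
    }

    static void *thread(void *arg)
    {
        (void)arg;                       /* unused */
        /* dowork() would go here: application-specific computation */
        writeresult();
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        int i;

        logfid = open("results.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        for (i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, thread, NULL);
        for (i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        close(logfid);
        return 0;
    }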
96.6 Real-Time Systems

Real-time systems are those that guarantee a response within a predetermined amount of time. We use real-time systems when, for example, computers control an assembly line or run a flight simulator. In such an environment we define an action that must occur and a deadline by which we wish that action to take place. On an assembly line, an event may occur, such as a part arriving at a station, requiring an action, such as painting that part. The deadline we impose will be based on the speed of the assembly line: obviously, we must paint the part before it passes to the next station. This is called a hard real-time system because the system must meet a strict deadline.

Another class of system is termed soft real-time. These are environments in which response time is important but the consequences of missing a deadline are not as serious as, for example, on an assembly line. Airline reservation systems are in this category. Rapid response to an event, such as an agent attempting to book a ticket, is important and must be considered when the system performs other activities.

One way of distinguishing hard and soft real-time systems is by examining the value of a response over time. For example, if a computer were controlling a nuclear reactor and the reactor began to overheat, the command to open the cooling valves would have extremely high value until a deadline: the moment the reactor explodes. After that deadline, there is no value in opening the valves (see Fig. 96.2).

FIGURE 96.2 Relative value of a response over time in a critical situation.

Relatively few events require that type of responsiveness. Most events have a deadline, but there continues to be value in responding to the event even past the deadline. In our airline reservation example, the airline may wish to respond to a customer request within, say, 10 seconds. However, if the response comes in 11 seconds, there is still value in the response; the value is merely lessened because the customer has become upset. As time increases, the customer becomes more and more upset and the value of responding decreases. We illustrate this in Fig. 96.3.

FIGURE 96.3 Relative value of a response over time in a noncritical situation.
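The two value curves can be made concrete with a pair of illustrative functions. These are not taken from the chapter; the exact shapes are assumptions chosen only to be consistent with the character of Figs. 96.2 and 96.3:

    /* Illustrative value-versus-time curves for responding to an event. */

    /* Hard real-time (Fig. 96.2): full value until the deadline, none after. */
    double hard_value(double t, double deadline)
    {
        return (t <= deadline) ? 1.0 : 0.0;
    }

    /* Soft real-time (Fig. 96.3): full value until the deadline, then a
     * gradual decay as the late response becomes less and less useful. */
    double soft_value(double t, double deadline)
    {
        return (t <= deadline) ? 1.0 : 1.0 / (1.0 + (t - deadline));
    }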
96.7 Operating System Structure

Operating systems are large, complex pieces of software. They must handle asynchronous events such as interrupts from I/O devices, control hardware memory management units (MMUs) to implement virtual memory, support multiple simultaneous users, implement complex network protocols, and much more. As with any software of this magnitude, an operating system is logically divided into smaller pieces. The structure of a typical modern operating system is depicted in Fig. 96.4.

FIGURE 96.4 The structure of a modern operating system.

From the user's standpoint, the operating system is a collection of system calls: the programmer's interface, sometimes termed an application program interface, or API. System calls provide the mechanism for an application program to obtain services from the system. System calls exist to perform file operations such as create, open, close, read, and write. For terminals, system calls perform such functions as changing the baud rate and the number of parity bits. Network connections may be established, and network protocol options, such as the size of network buffers, controlled, through system calls as well.

While every operating system provides a system call interface, there is little uniformity in the appearance of that interface. Some systems provide an interface that appears as a simple function call. For example, to open a file under the UNIX operating system, we use the following system call:

    open("/home/boykin/crc-press/oschapter", O_RDONLY);

Other operating systems require the user to fill in complex data structures for various operations. For example, the following fragment illustrates how a message is prepared for sending with the Mach operating system's interprocess communication (IPC) facility [Boykin et al., 1993]:

    msg_header_t header;

    header.msg_simple      = TRUE;            /* no out-of-line data        */
    header.msg_size        = sizeof(header);  /* total message size         */
    header.msg_type        = MSG_TYPE_NORMAL;
    header.msg_local_port  = PORT_NULL;       /* no reply port              */
    header.msg_remote_port = remote_port;     /* destination of the message */
    header.msg_id          = 100;             /* application-chosen id      */

    /* The completed header is then handed to the kernel's send primitive. */

Regardless of the interface format, a programmer should become familiar with the parameters, options, and return codes of each system call in order to use the system proficiently.
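On UNIX systems, for instance, most system calls report failure through a -1 return value and the errno variable, and robust code checks both. A small sketch, reusing the path from the open example above (the buffer size is an arbitrary assumption):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[512];
        int fd;
        ssize_t n;

        fd = open("/home/boykin/crc-press/oschapter", O_RDONLY);
        if (fd < 0) {
            /* open() returns -1 on failure and sets errno to say why */
            fprintf(stderr, "open failed: %s\n", strerror(errno));
            return 1;
        }
        n = read(fd, buf, sizeof(buf));
        if (n < 0)
            fprintf(stderr, "read failed: %s\n", strerror(errno));
        else
            printf("read %ld bytes\n", (long)n);
        close(fd);
        return 0;
    }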
Beneath the programming interface lies the heart of the operating system, which we can divide into two major sections. The first directly implements the system calls; this includes the file system, terminal handling, and the like. The second provides the basic capabilities upon which the rest of the system is built; interprocess communication, memory management, and process management are all examples of these basic capabilities.

The lowest level of the operating system interfaces directly with the computer hardware. For each physical device, such as a disk, tape, or serial line, a device driver must exist to communicate with the hardware. Device drivers accept requests to read or write data or to determine the status of the device. They may perform polled I/O or be interrupt driven, although polled I/O is usually done only on small personal computers. Writing a device driver requires a thorough knowledge of the hardware as well as of the interface to the operating system.

In addition to I/O devices, the system must also manipulate such hardware as counters, timers, and memory management units. Timers are used to satisfy user requests such as terminating an operation after a specified length of time. MMUs provide the ability to protect memory. Each time a program is run, the operating system programs the MMU with the physical memory addresses the program may access; any attempt to access other memory is blocked by the MMU. An MMU is also required to implement virtual memory, which allows a program to use more memory than is physically present on the machine. The operating system implements virtual memory by using an external device, typically a disk, to store portions of the program that are not currently in use. When a program attempts to access memory temporarily stored on disk, the MMU traps to the operating system (a trap is a hardware signal received by the operating system, very similar to an interrupt from an I/O device), which reads the memory in from disk and restarts the program.

In recent years the structure depicted here has been changing. A new concept, the micro-kernel, has begun to emerge. The idea behind a micro-kernel is to dramatically reduce the size of the operating system by placing most OS subsystems in the application layer. A micro-kernel is not a usable system by itself; a number of programs run on top of it to provide such services as a file system and network protocols. In the micro-kernel architecture shown in Fig. 96.5, notice that subsystems traditionally within the operating system are now at the same level as application programs. An application program wishing to, for example, open a file makes its request to the file system program rather than to the micro-kernel. The file system may in turn call upon other OS subsystems or upon the micro-kernel to perform an operation.

FIGURE 96.5 Micro-kernel structure.

From the user's standpoint, there is no programming difference between a micro-kernel structure and the traditional structure. The micro-kernel approach has two advantages. The first is that programming and debugging at the application layer is inherently simpler than programming at the OS layer. The benefit here is to the OS designers and implementors, who can write and debug OS code faster and more easily than before; this in turn benefits the user through a more reliable operating system. The second advantage stems from the ability to run several different OS environments on top of the same micro-kernel, so that one computer acts as though it were running several operating systems. For example, if both MS-DOS and UNIX coexisted on the same micro-kernel, the user could choose to run an MS-DOS spreadsheet or word processor and communicate using UNIX network commands. The user gains increased flexibility.

96.8 Industry Standards

As computer technologies come into widespread use, users begin to desire standardization. Standardization allows a user to know that a program written to a standard will work without concern for which vendor supplies the programming environment; porting software from one system to another, often an expensive proposition, becomes a trivial task. Operating systems are no exception to this general rule, and several standards, both industry standards and de facto standards, apply.

Perhaps the most notable OS standard is POSIX, standard number 1003 [IEEE, 1990], sponsored by the IEEE Computer Society's Technical Committee on Operating Systems. POSIX is a family of standards based on the UNIX operating system that includes the system call interface, user-level commands, real-time extensions, and networking extensions. The POSIX system call interface, 1003.1, was adopted verbatim by the U.S. government as a Federal Information Processing Standard, FIPS 151. Many vendors conform to POSIX; thus, a program that conforms to this standard can be ported to many system platforms without change.
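One practical consequence is that a portable program can ask the system which POSIX version it supports; both the _POSIX_VERSION macro and sysconf(_SC_VERSION) are defined by 1003.1 itself. A minimal sketch of such a portability check:

    /* Report the POSIX (1003.1) version this system claims to support. */
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef _POSIX_VERSION
        printf("compile-time POSIX version: %ld\n", (long)_POSIX_VERSION);
    #else
        printf("this system does not declare POSIX conformance\n");
    #endif
        printf("run-time POSIX version:     %ld\n", sysconf(_SC_VERSION));
        return 0;
    }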
An example of a de facto standard is the X/Open Portability Guide (XPG) [X/Open, 1989]. X/Open is not a standards-setting body but a joint initiative by members of the business community to adopt and adapt existing standards into a consistent environment. The X/Open system interface and headers are based on POSIX 1003.1 but include extensions to the POSIX-defined interfaces as well as additional interfaces.

The importance of such standards is evidenced by the strong support of organizations such as the Open Software Foundation. OSF's OSF/1 operating system conforms to various POSIX standards; where not superseded by POSIX, it also conforms to XPG and to AT&T's System V Interface Definition (SVID) [AT&T, 1985]. Conforming to these standards is considered critical to the success of OSF/1.

Some might consider an operating system such as MS-DOS to be a de facto standard. While MS-DOS is in common use, however, it is proprietary software subject to change without notice. Defining a standard implies an open system on which vendors and users agree.

96.9 Conclusions

I have been hearing for the past 15 years about the demise of the operating system. It has been said over and over that the role of the OS will go away. So far, the only change has been to expand the role the operating system plays. One must remember that the operating system is not the user interface it presents or the applications that run on it. It is, as it always has been, the manager of all resources on a computer system. While the interface to computers has changed, and the uses to which we apply computer technology have changed, there will always be a need for an operating system.

Without question, the OS will change as well. We have already seen micro-kernel architectures begin to emerge from the research labs into commercial operating systems. Distributed computing will become more widespread and force additional changes to the operating system. Regardless of the changes that come, it will always be the operating system on which all other programs rely.

Defining Terms

Distributed computing: An environment in which multiple computers are networked together and the resources from more than one computer are available to a user. Those resources are accessed in a manner identical to accessing resources on a local computer system.

Fault-tolerant systems: Computer systems whose hardware and software are capable of continuous operation even in the event that hardware components fail.

File system: The logical organization of files on a storage device, typically a disk drive. The file system may support a hierarchical structure with directories and subdirectories (sometimes called folders).

Interprocess communication: The transfer of information between two cooperating programs. Communication may take the form of a signal (the arrival of an event) or the transfer of data.

Parallel processing: A parallel computer is one that contains more than one CPU. Parallel processing is the division of a program into multiple threads of control, each capable of running simultaneously. On a parallel computer, multiple threads can run at the same time, resulting in better performance than on a uniprocessor system.

Process: A single executable program. A process is the context in which an operating system places a running program. It contains the program itself as well as allocated memory, open files, network connections, etc.

Real-time computing: Support for environments in which the response to an event must occur within a predetermined amount of time. Real-time systems may be categorized as hard or soft real-time.

Related Topics

90.3 Programming Methodology • 95.2 Classifications

References

AT&T, System V Interface Definition, Issue 1, Indianapolis, Ind.: AT&T Customer Information Center, Spring 1985.

J. Boykin, D. Kirschen, A. Langerman, and S. LoVerso, Programming Under Mach, Reading, Mass.: Addison-Wesley, 1993.

J. Boykin and A. Langerman, “Mach/4.3BSD: Parallelization without reimplementation,” Computing Systems Journal, vol. 3, no. 1, 1990.

J. Boykin and S. LoVerso, “Recent developments in operating systems,” Computer, vol. 23, no. 5, 1990.

H.M. Deitel, Operating Systems, 2nd ed., Reading, Mass.: Addison-Wesley, 1990.

IEEE, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications, American National Standard ANSI/IEEE Std. 802.3, 1985.

IEEE, Information Technology—Portable Operating System Interface (POSIX) Part 1: System Application Program Interface (API) [C Language], New York: IEEE, 1990.

A. Silberschatz, J.L. Peterson, and P.B. Galvin, Operating System Concepts, 3rd ed., Reading, Mass.: Addison-Wesley, 1991.
A.S. Tanenbaum, Modern Operating Systems, Englewood Cliffs, N.J.: Prentice-Hall, 1992.

X/Open Company Ltd., X/Open Portability Guide, Englewood Cliffs, N.J.: Prentice-Hall, 1989.

Further Information

Many textbooks describe operating system concepts; the three cited in the references [Deitel, 1990; Silberschatz et al., 1991; Tanenbaum, 1992] are excellent. The IEEE Computer Society publishes a number of tutorials on operating system related topics such as fault tolerance, real-time systems, local area networks, and distributed processing. Readers should contact the Computer Society Press office at 10662 Los Vaqueros Circle, Los Alamitos, Calif. 90720. Phone: 714-821-8380.

For those interested in learning more about the implementation of specific operating systems, M.J. Bach, The Design of the UNIX Operating System, Prentice-Hall, 1986, describes the implementation of AT&T System V. The 4.3BSD operating system is described in Leffler et al., The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, 1990.