(Extensibility vs. Virtualization) Give the differences and similarities between the goals of extensibility (a la SPIN/Exokernel/L3) and the goals of virtualization (a la Xen/VMware ESX server).
Similarities
Give the differences and similarities between the mechanisms in Exokernel for extensibility and those in Xen for virtualization.
Give two plausible reasons that could make system calls more expensive than simple procedure calls.
Consider a byte-addressable 32-bit architecture. The virtual address space is 2^32 bytes. The TLB supports entries to be tagged with an address space id. We are building a microkernel-based design of an operating system for this architecture, in which every subsystem will be implemented as a process above the kernel. Suggest what design decisions you will take to make sure that the performance will be as good as a monolithic OS.
Why is a shadow page table needed?
(full virtualization) How many shadow page tables are there?
There is one S-PT per guest OS currently running
(full virtualization) How big is the shadow page table?
It is proportional to the number of processes currently running in that guest OS
(full virtualization) The guest OS is servicing a page fault. It finds a free PPN and decides to map the faulting process's VPN to this PPN. List the steps taken from here on by the guest and the hypervisor that ensure the process will not fault on this VPN again.
Guest OS decides to context switch from the currently running process to another. List the steps taken by the guest and the hypervisor until the new process is running on the processor.
The guest OS is running multiple processes within it. The guest OS itself appears as a “process” so far as the hypervisor is concerned. How is each of the processes inside the guest OS guaranteed memory isolation/integrity/independence from one another?
Upon process creation, guest-OS assigns a distinct PT data structure for the newly created process (+2)
As part of creating a memory footprint for the process, the guest-OS creates VPN to PPN mappings in the PT by executing privileged operations (+2)
These privileged operations result in traps to the hypervisor which emulates them on behalf of the guest-OS (both populating the PT data structure for the newly created process in the guest-OS as well as the S-PT data structure in the hypervisor). (+2)
The distinct PT data structure for EACH process within the guest-OS thus gives the isolation/integrity/independence guarantees for the processes from one another.
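The trap-and-emulate path above can be sketched as follows. This is an illustrative toy, not hypervisor code: the data-structure names and sizes are made up, and real page-table entries carry protection bits as well.

```c
#include <assert.h>

/* Sketch: one guest page table (PT) per process maintained by the guest
 * OS, and one shadow page table (S-PT) maintained by the hypervisor.
 * All sizes here are made up for illustration. */
#define NPROCS 4
#define NVPNS  8

static int guest_pt[NPROCS][NVPNS];   /* guest OS's view: VPN -> PPN      */
static int shadow_pt[NPROCS][NVPNS];  /* hypervisor: VPN -> MPN (machine) */

/* Hypervisor's private map from guest-physical PPN to machine MPN. */
static int ppn_to_mpn[NVPNS] = { 10, 11, 12, 13, 14, 15, 16, 17 };

/* The guest's privileged PT update traps to the hypervisor, which
 * emulates it: it populates both the guest's PT entry and the S-PT
 * entry. Returns the MPN installed in the S-PT, for illustration. */
static int emulate_pt_update(int pid, int vpn, int ppn)
{
    guest_pt[pid][vpn]  = ppn;              /* what the guest asked for */
    shadow_pt[pid][vpn] = ppn_to_mpn[ppn];  /* what the MMU will use    */
    return shadow_pt[pid][vpn];
}
```

Since each process has its own row in both tables, no privileged guest operation can create a mapping into another process's frames, which is the isolation argument above.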
An application process starts up in a guest OS running on top of Xen. List the steps taken by the guest OS and Xen to set up the page table for this process. What is the mechanism by which the guest OS schedules this process to run on the processor? You can assume any reasonable hypervisor calls without worrying about the exact syntax. State your assumptions.
One of the techniques for efficient use of memory which is a critical resource for performance is to dynamically adjust the memory allocated to a VM by taking away some or all of the “idle memory” (unused memory) from a VM. This is referred to in the literature as “taxing” the VM proportional to the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away ALL the idle memory from a VM) not a good idea?
Any sudden increase in the working set size of the VM will result in poor performance of that VM. This might violate the service agreement for that VM.
VM1 and VM2 are hosted on top of VMware ESX server. VM1 has a page fault at VPN = 100. The page is being brought into MPN = 4000. The contents hash for this page matches another MPN = 8000, which belongs to VM2 at VPN = 200. Give the steps taken by VMware ESX server including the data structures used to possibly optimize the memory usage by using VM oblivious page sharing.
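A hedged sketch of the content-hash data structure involved in such VM-oblivious sharing follows. Names and sizes are illustrative; the real ESX server also performs a full byte-for-byte comparison before sharing (hashes can collide) and marks the shared frame copy-on-write in both VMs.

```c
#include <assert.h>

/* Illustrative hash table keyed by page-content hash. Each entry
 * remembers one machine page (MPN) with that content. */
#define TABLE_SIZE 64

struct frame {
    unsigned long hash;  /* content hash of the page     */
    int mpn;             /* machine page holding it      */
    int refcount;        /* how many VM pages map to it  */
    int valid;
};

static struct frame hint_table[TABLE_SIZE];

/* Returns the MPN the faulting VM should map (copy-on-write): either an
 * existing frame with matching content, or the caller's own MPN, which
 * is then recorded for future matches. On a match, the caller's MPN
 * can be reclaimed as free memory. */
static int share_or_record(unsigned long hash, int mpn)
{
    struct frame *f = &hint_table[hash % TABLE_SIZE];
    if (f->valid && f->hash == hash) {
        f->refcount++;   /* one more VM page now shares this frame */
        return f->mpn;
    }
    f->hash = hash;
    f->mpn = mpn;
    f->refcount = 1;
    f->valid = 1;
    return mpn;
}
```

In the scenario above, VM2's page at MPN = 8000 would already be recorded under its hash, so VM1's fault at VPN = 100 maps (copy-on-write) to MPN = 8000 and MPN = 4000 is freed.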
It is impossible to support the sequential consistency memory model in a multiprocessor that has caches associated with each processor. (An answer without any justification gets zero credit.)
False. (+1)
What is needed is a mechanism to get exclusive access to a memory location before writing to it in the local cache, and to ensure that the memory location is purged from the peer caches.
For example, in an SMP, a hardware solution would be for the writing processor to acquire the bus and indicate to its peers that it is writing this memory location so that the peers can take appropriate actions in their local caches (invalidate/update if that memory location exists in the cache)
(Anderson’s queue lock)
Imagine a 100-processor cache coherent multiprocessor. The wordsize of the processor is 32 bits, which is the unit of memory access by the processor’s instruction set.
The cache blocksize is 16 bytes. We would like to implement Anderson's queue-based lock algorithm. Recall that in this algorithm, each processor marks its unique spin position in a "flags" array (associated with each lock) so that it can locally spin on that variable without disrupting the work of other processors until its turn comes to acquire the lock.
How much space is needed for each flags array associated with a lock?
To ensure there is no false sharing, we need to allocate each spin variable in a distinct cache line. Space for one spin variable = cache blocksize = 16 bytes
(+2)
We will assume that the maximum number of threads in an application cannot exceed the number of processors. Thus in the worst case we need 100 distinct spin locations in the flags array.
(no points deducted for not stating this assumption)
So total space needed in the flags array for each lock variable = 100*16 = 1600 bytes
(+2)
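The sizing argument can be checked with a short sketch. The padding layout is one standard way to keep each spin variable in its own cache line; the field names are illustrative.

```c
#include <assert.h>

/* One spin slot per processor, padded to a full 16-byte cache line so
 * that no two processors' spin variables share a line (no false
 * sharing). */
#define NPROCS     100
#define CACHE_LINE 16

struct spin_slot {
    unsigned char flag;                /* has_lock / must_wait         */
    unsigned char pad[CACHE_LINE - 1]; /* pad slot to one cache line   */
};

/* The "flags" array associated with one lock. */
static struct spin_slot flags[NPROCS];
```

sizeof(flags) comes out to 100 * 16 = 1600 bytes, matching the calculation above.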
(MCS lock)
Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.
What could happen given the above state?
T1 could think there is no one else waiting for the lock and set L to nil, resulting in a livelock (“new” process will be spinning forever waiting for the lock).
Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.
What should happen? (assume no other lock requestors appear on the scene)
T1 should recognize that there is a new request forming for the lock, and wait for its next pointer to point to new, so that it can signal it upon releasing the lock.
The MCS algorithm assumes the presence of an atomic instruction:
compare-and-swap(a, b, c), whose semantics is as follows:
If a == b, then set a = c and return 1;
Else do nothing and return 0;
How does MCS lock algorithm use this instruction to make sure that the right thing happens?
In MCS at lock release the releasing processor T1 does the following:
if (compare-and-swap(L, T1, nil) == 0) {
    /* spin awaiting the "new" request to set T1's next pointer */
    while (T1->next == nil);
} (+3)
T1->next->got_it = 1; /* signal the next process that it has the lock */ (+1)

(barrier algorithms)
Imagine a message-passing multiprocessor with point to point communication links between any two processors. Between tournament and dissemination algorithms, which would you expect to do better for barrier synchronization? Why?
Dissemination would do better.
At each round, the dissemination barrier has O(N) communications, but they can all go on in parallel. The number of rounds of communication in dissemination is ceil(log2(N)).
While all the communication can go on in parallel in the tournament barrier as well, that algorithm traverses the tree twice (arrival + wakeup), resulting in a total of 2 * ceil(log2(N)) rounds of communication.
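A small sketch of the dissemination pattern assumed above: in round k, processor i signals processor (i + 2^k) mod N, the standard pairing for this barrier.

```c
#include <assert.h>

/* In round k (0-based), processor i sends its "arrived" message to
 * processor (i + 2^k) mod n. */
static int dissemination_partner(int i, int round, int n)
{
    return (i + (1 << round)) % n;
}

/* Number of rounds until everyone has heard (directly or transitively)
 * from everyone else: ceil(log2(n)). */
static int dissemination_rounds(int n)
{
    int rounds = 0, reach = 1;
    while (reach < n) {
        reach *= 2;
        rounds++;
    }
    return rounds;
}
```

For N = 100 processors this gives ceil(log2(100)) = 7 rounds, versus 2 * 7 = 14 for the tournament barrier.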
Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. In a simple implementation of this paradigm, it is understandable that there has to be a copy from the client’s address space into the kernel address space, and from the kernel address space to the server’s address space before executing the server procedure. What are the reasons for the two additional copies (one in the client’s address space, and one in the server’s address space)?
The kernel has no knowledge of the semantics of the call. Therefore, before going to the kernel, on the client-side we have to serialize the arguments of the call into a contiguous sequence of bytes in the client’s address space before the kernel copies it into its kernel buffers.
Similarly, on the server side, we have to populate the contiguous sequence of bytes received from the kernel into the server’s address space into the actual arguments of the call as expected by the server procedure.
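A minimal sketch of these two stub-side copies for a hypothetical call add(x, y). All names are made up; a real stub compiler would generate this marshaling code from the interface description.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Scratch buffer standing in for the contiguous byte region the kernel
 * copies between address spaces. */
static char scratch[16];

/* Client-side stub: flatten the typed arguments into a contiguous byte
 * sequence (the extra copy in the client's address space). Returns the
 * number of bytes marshaled. */
static size_t marshal_add(char *buf, int x, int y)
{
    memcpy(buf, &x, sizeof x);
    memcpy(buf + sizeof x, &y, sizeof y);
    return 2 * sizeof(int);
}

/* Server-side stub: unpack the opaque bytes received from the kernel
 * back into typed arguments (the extra copy in the server's address
 * space), then run the server procedure. */
static int server_add(const char *buf)
{
    int x, y;
    memcpy(&x, buf, sizeof x);
    memcpy(&y, buf + sizeof x, sizeof y);
    return x + y;
}
```

The kernel never interprets the buffer; it only copies bytes, which is exactly why the stubs on each side must do the serialization and deserialization.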
(multiprocessor scheduling)
Recall that the "limited minimum intervening" thread scheduling policy uses affinity information for "k" processors, a subset of the processors that this thread ran on during its lifetime, for the scheduler to make scheduling decisions. Sketch the data structure for the TCB (assume k = 4) in this scheduling regime (you need to show only the fields that are to be used for scheduling purposes).
struct affine_type {
int processor_id; /* processor number */
int affinity; /* number of intervening threads */
}; (+2)
struct TCB_type {
/* ….other unrelated info */
affine_type affinity[4]; /* top 4 affinity info for this thread */
};

Parallel System Case Studies
6. (6 mins, 10 points) (Tornado)
Tornado suggests using “existence guarantees” with reference counts instead of hierarchical locking to avoid serialization.
Using page fault service as a concrete example, discuss how this helps with reducing the pitfalls of serialization of page fault service in a multi-threaded process.
Imagine two threads T1 and T2 of the same process executing in the multiprocessor, sharing the same representation of the "process" object. Both of them experience page faults SIMULTANEOUSLY. Assume T1's page fault is contained in Region 2 and T2's page fault is contained in Region 1. The page fault service will only update the respective region objects. Therefore, the "process" object need not be locked. But to ensure that some other subsystem (say a load balancer) does not remove the "process" object, the page fault service can increment the ref-count on the "process" object on the forward path of the above figure, and decrement the ref-count on the reverse path (once the service is complete), thus avoiding serialization of the page fault service for T1 and T2.
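The ref-count discipline above can be sketched as follows. This is a single-threaded illustration with made-up names; in Tornado the count manipulations would of course be atomic operations.

```c
#include <assert.h>

/* A shared "process" object pinned by reference counts rather than a
 * lock: any service holding a reference is guaranteed the object will
 * not disappear underneath it. */
struct proc_obj {
    int refcount;  /* number of in-flight services holding the object */
    int live;      /* 0 once the object has been torn down            */
};

static struct proc_obj P = { 0, 1 };

/* Forward path of the page-fault service: pin the object. */
static int proc_get(struct proc_obj *p)
{
    if (!p->live)
        return 0;       /* object already destroyed */
    p->refcount++;      /* existence guaranteed while count > 0 */
    return 1;
}

/* Reverse path, once the fault is serviced: unpin. Returns the new
 * count, for illustration. */
static int proc_put(struct proc_obj *p)
{
    return --p->refcount;
}

/* Another subsystem (e.g. a load balancer) may destroy the object only
 * when no service holds a reference. */
static int proc_try_destroy(struct proc_obj *p)
{
    if (p->refcount > 0)
        return 0;       /* must wait; object is in use */
    p->live = 0;
    return 1;
}
```

Two faulting threads each do proc_get/proc_put around their region updates and never serialize on a "process" lock; the destroyer simply cannot proceed while either holds a reference.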
Using the “process” object as a concrete example, identify situations where existence guarantees may not be sufficient and you may have to lock an object.
The “process” object is the equivalent of a context block. Every representation of this object is shared by multiple threads that are to be scheduled on the cores of a single processor or multiple processors (depending on the details of the architecture). The “process” data structure has information pertaining to the currently executing threads. If a thread is context switched out, then the scheduler subsystem would need to save the volatile state of the thread, and re-adjust the ready-queue, etc. All of these actions have to be performed atomically which would require locking the process object.
(full points if a convincing reason is given; need not be elaborate as above;
-2 “atomically” changing multiple fields of the process object not mentioned)
A process is currently executing on the processor. The process makes a
system call. List the steps involved in getting this system call serviced
by the operating system that this process belongs to. You should be precise
in mentioning the data structures involved in Exokernel that makes this
service possible.
System Call traps to Exokernel. (+2)
Exokernel identifies the library OS responsible for handling this system
call using Time Slice Vector (+2)
Exokernel uses the PE (Processor Environment) data structure associated with
this library OS to get the “system call context”, which is the entry point
registered with Exokernel by the library OS for system calls. (+2)
Exokernel “upcalls” the library OS using this entry point. (+2)
The library OS services the system call (+1)
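The dispatch path above can be sketched as follows. All names and sizes are hypothetical; the real Exokernel PE holds several registered entry points (exceptions, interrupts, upcalls), of which only the system-call entry is shown.

```c
#include <assert.h>

/* Illustrative Exokernel-style dispatch: the trap handler consults the
 * time slice vector to find which library OS owns the current slot,
 * then upcalls the system-call entry point recorded in that library
 * OS's Processor Environment (PE). */
#define NSLOTS 4

typedef int (*entry_fn)(int syscall_no);

struct pe {
    entry_fn syscall_entry;  /* entry point registered by the libOS */
};

/* A stand-in library-OS handler for the sketch. */
static int demo_entry(int syscall_no)
{
    return syscall_no + 1000;  /* pretend to service the call */
}

static int time_slice_vector[NSLOTS] = { 0, 1, 0, 1 }; /* slot -> libOS id */
static struct pe pes[2] = { { demo_entry }, { demo_entry } };

/* Steps 1-4 of the answer above, in miniature: trap arrives, libOS is
 * identified from the time slice vector, and its PE entry point is
 * upcalled. */
static int handle_syscall_trap(int cur_slot, int syscall_no)
{
    int libos = time_slice_vector[cur_slot];
    return pes[libos].syscall_entry(syscall_no);
}
```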