Class 9
CS 170
Nov 2, 2020

On the board
------------

1. Last time
2. Deadlock
3. Other progress issues
4. Performance issues and tradeoffs
5. Programmability issues

---------------------------------------------------------------------------

1. Last time
    
    practice with concurrent programming

    today's theme: problems brought on by locking (really, the shared
    memory programming model)

2. Deadlock 

    --see video recording or example section below: simple example based on two locks

    --see video recording or example section below: more complex example

	    --M calls N 
	    --N waits
	    --but let's say condition can only become true if N is invoked
	    through M
	    --wait() releases the lock inside N, but M's lock remains held;
	    that is, no one is going to be able to enter M and hence N.

    --can also get deadlocks with condition variables

    --lesson: dangerous to hold locks (M's mutex in the case of the
    complex example) when crossing abstraction barriers

    --deadlocks without mutexes:

        real issue is resources and how/when they are required/acquired

        (a) [draw bridge example]

	    --bridge only allows traffic in one direction 

	    --Each section of a bridge can be viewed as a resource. 

	    --If a deadlock occurs, it can be resolved if one car
	    backs up (preempt resources and rollback). 

	    --Several cars may have to be backed up if a deadlock occurs. 

	    --Starvation is possible. 

	(b) another example:
		
	    --one thread/process grabs disk and then tries to grab
	    scanner

	    --another thread/process grabs scanner and then tries to
	    grab disk

    --when does deadlock happen? under four conditions. all of them must
    hold for deadlock to happen:

	1. mutual exclusion
	2. hold-and-wait
	3. no preemption
	4. circular wait


    --what can we do about deadlock?

        (a) ignore it: worry about it when it happens. the so-called
        "ostrich solution"

        (b) detect and recover: not great

	    --could imagine attaching debugger

		--not really viable for production software, but
		works well in development

	    --threads package can keep track of resource-allocation graph

	    --see one of the recommended texts:

		--For each lock acquired, order with other locks held 
		
		--If cycle occurs, abort with error 
	    
		--Detects potential deadlocks even if they do not occur 
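
    A minimal sketch of the lock-order check described above, assuming
    locks are identified by name and that the threads package calls a
    hook on every acquisition. LockOrderGraph and on_acquire are
    hypothetical names, not taken from any real threads package. The
    sketch records an edge from each lock already held to the lock being
    acquired, and refuses an acquisition that would close a cycle.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical lock-order tracker: records "h was held when next was
// acquired" edges and rejects any acquisition that would close a cycle.
class LockOrderGraph {
public:
    // Call when a thread that currently holds `held` acquires `next`.
    // Throws if the new edges would create a cycle (a potential deadlock).
    void on_acquire(const std::vector<std::string>& held,
                    const std::string& next) {
        for (const auto& h : held) {
            // If h is already reachable from next, adding h -> next
            // would complete a cycle in the ordering.
            if (reachable(next, h)) {
                throw std::logic_error("lock order violation: " + h +
                                       " -> " + next);
            }
        }
        for (const auto& h : held) edges_[h].insert(next);
    }

private:
    // Depth-first search: is `to` reachable from `from` along recorded edges?
    bool reachable(const std::string& from, const std::string& to) const {
        std::set<std::string> seen;
        std::vector<std::string> stack{from};
        while (!stack.empty()) {
            std::string cur = stack.back();
            stack.pop_back();
            if (cur == to) return true;
            if (!seen.insert(cur).second) continue;  // already visited
            auto it = edges_.find(cur);
            if (it == edges_.end()) continue;
            for (const auto& n : it->second) stack.push_back(n);
        }
        return false;
    }

    std::map<std::string, std::set<std::string>> edges_;
};
```

    With the two-lock example, recording "A then B" and later seeing
    "B then A" raises the error on the second pattern, even if the two
    threads never overlap in that particular run.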


        (c) avoid algorithmically

            [not covering]

	    --banker's algorithm (see Tanenbaum text for a description)

		--very elegant but impractical

		--if you're using banker's algorithm, the gameboard
		looks like this:

		    ResourceMgr::Request(ResourceID resc,
					 RequestorID thrd) {
			acquire(&mutex);
			assert(system in a safe state);
			while (state that would result from giving 
			       resource to thread is not safe) {
			    wait(&cv, &mutex);	
			}
			update state by giving resource to thread
			assert(system in a safe state);
			release(&mutex);
		    }

		    Now we need to determine if a state is safe....

		    To do so, see book

	    --disadvantage to banker's algorithm:

		--requires every single resource request to go
		through a single broker

		--requires every thread to state its maximum
		resource needs up front. unfortunately, if threads
		are conservative and claim they need huge quantities
		of resources, the algorithm will reduce concurrency
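
    The safety test that the Request() pseudocode above leaves to the
    book can be sketched as follows. This is the usual textbook
    formulation, written under assumptions: the names is_safe, Vec,
    allocated, and max_need are illustrative, and each thread's maximum
    need is taken to be declared up front. A state is treated as safe if
    the threads can be ordered so that each one, in turn, can obtain its
    remaining need, finish, and return what it holds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Returns true if some completion order exists for all threads.
// available[r]     : free units of resource type r
// allocated[t][r]  : units of r currently held by thread t
// max_need[t][r]   : maximum units of r that thread t declared up front
bool is_safe(Vec available,
             const std::vector<Vec>& allocated,
             const std::vector<Vec>& max_need) {
    const std::size_t nthreads = allocated.size();
    assert(max_need.size() == nthreads);
    std::vector<bool> finished(nthreads, false);

    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t t = 0; t < nthreads; ++t) {
            if (finished[t]) continue;
            bool can_finish = true;
            for (std::size_t r = 0; r < available.size(); ++r) {
                if (max_need[t][r] - allocated[t][r] > available[r]) {
                    can_finish = false;
                    break;
                }
            }
            if (can_finish) {
                // Thread t could run to completion and release everything.
                for (std::size_t r = 0; r < available.size(); ++r)
                    available[r] += allocated[t][r];
                finished[t] = true;
                progress = true;
            }
        }
    }
    for (bool f : finished)
        if (!f) return false;
    return true;
}
```

    Request() would call a check like this on the state that would
    result from granting the resource, and wait if the answer is false.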

        (d) negate one of the four conditions using careful coding:

	    --can sort of negate 1
		--put a queue in front of resources, like the printer
		--virtualize memory

	    --not much hope of negating 2

	    --can sort of negate 3:
		--consider physical memory: virtualized with VM, can
		take physical page away and give to another process! 

	    --what about negating #4?

		--in practice, this is what people do

		--idea: partial order on locks

		    --Establishing an order on all locks and making
		    sure that every thread acquires its locks in
		    that order

		--why this works:

		    --can view deadlock as a cycle in the resource
		    acquisition graph

		    --partial order implies no cycles and hence no
		    deadlock

		--three bummers:

		    1. hard to represent CVs inside this framework.
		    works best only for locks.

		    2. compiler can't check at compile time that
		    partial order is being adhered to because
		    calling pattern is impossible to determine
		    without running the program (thanks to function
		    pointers and the halting problem)

		    3. Picking and obeying the order on *all* locks
		    requires that modules make public their locking
		    behavior, and requires them to know about other
		    modules' locking.  This can be painful and
		    error-prone. 

			--see Linux's filemap.c example below; this is
			complexity that arises from the need for a locking
			order
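
    One common way to impose an order on locks of the same kind is to
    order them by address. The sketch below assumes a hypothetical
    Account type and transfer() function (neither is from these notes);
    the point is only that every caller takes the lower-addressed mutex
    first, so no circular wait can form between two transfers.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical shared object protected by its own mutex.
struct Account {
    std::mutex mu;
    long balance = 0;
};

// Moves `amount` between accounts, always locking the account with the
// lower address first so that all threads agree on the acquisition order.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;  // same object: nothing to do, avoid double lock

    Account* first = &from;
    Account* second = &to;
    // std::less gives a well-defined total order on pointers.
    if (std::less<Account*>{}(second, first)) std::swap(first, second);

    std::lock_guard<std::mutex> g1(first->mu);
    std::lock_guard<std::mutex> g2(second->mu);
    from.balance -= amount;
    to.balance += amount;
}
```

    A thread running transfer(a, b, ...) and another running
    transfer(b, a, ...) both lock the same mutex first, which is the
    property the partial-order argument needs. Note that this only works
    because transfer() knows about both locks, which is the modularity
    cost described in bummer 3.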

	(e) Static and dynamic detection tools

	    --See, for example, these citations, citations
	    therein, and papers that cite them:

		Engler, D. and K. Ashcraft. RacerX: effective,
		static detection of race conditions and deadlocks.
		Proc. ACM Symposium on Operating Systems Principles
		(SOSP), October, 2003, pp237-252.
		http://portal.acm.org/citation.cfm?id=945468

		Savage, S., M. Burrows, G. Nelson, P. Sobalvarro,
		and T. Anderson. Eraser: a dynamic data race
		detector for multithreaded programs. ACM
		Transactions on Computer Systems (TOCS), Volume 15,
		No 4., Nov., 1997, pp391-411.
		http://portal.acm.org/citation.cfm?id=265927

		a long literature on this stuff

	    --Disadvantage to dynamic checking: slows program down

	    --Disadvantage to static checking: many false alarms
	    (tool says "there is deadlock", but in fact there is
	    none) or else missed problems

	    --Note that these tools get better every year. Valgrind,
	    for example, includes Helgrind, which reports data races
	    and inconsistent lock ordering (potential deadlocks)


3. Other progress issues

    Deadlock was one kind of progress (or liveness) issue. Here are two
    others...

    Starvation

	--thread waiting indefinitely (if low priority and/or if
	resource is contended)

    Priority inversion

	--T1, T2, T3: (highest, middle, lowest priority)

	--T1 wants to get lock, T2 runnable, T3 runnable and holding lock

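	--System will preempt T3 and run highest-priority runnable thread, namely T2

	--Result: T3 cannot run to release the lock, so T1 (highest
	priority) is effectively stuck behind T2 (middle priority)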

	--Solutions:

	    --Temporarily bump T3 to highest priority of any thread that is
	    ever waiting on the lock

	    --Disable interrupts, so no preemption (T3 finishes)
		... not great because OS sometimes needs control
		(not for scheduling, under this assumption, but for
		handling memory [page faults], etc.)

	    --Don't handle it; structure app so only adjacent priority
	    processes/threads share locks

	--Happens in real life. For a real-life example, see:
	http://research.microsoft.com/en-us/um/people/mbj/Mars_Pathfinder/Mars_Pathfinder.html
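
    The first solution above (temporarily bumping the lock holder's
    priority) can be illustrated with a toy model of the scheduler's
    choice. This is a sketch under assumptions, not a real scheduler:
    the Thread struct and pick_next() are hypothetical, there is a
    single contended lock, and priorities are non-negative with larger
    numbers meaning higher priority.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy thread record for a system with one contended lock.
struct Thread {
    std::string name;
    int priority;     // larger number = higher priority (assumed >= 0)
    bool holds_lock;  // currently holds the contended lock
    bool wants_lock;  // blocked waiting for the contended lock
};

// Returns the name of the runnable thread with the highest effective
// priority. With `inheritance` on, the lock holder runs at the priority
// of the highest-priority thread waiting on the lock.
std::string pick_next(const std::vector<Thread>& threads, bool inheritance) {
    int top_waiter = -1;
    for (const auto& t : threads)
        if (t.wants_lock && t.priority > top_waiter) top_waiter = t.priority;

    const Thread* best = nullptr;
    int best_prio = -1;
    for (const auto& t : threads) {
        if (t.wants_lock) continue;  // blocked on the lock, not runnable
        int effective = t.priority;
        if (inheritance && t.holds_lock && top_waiter > effective)
            effective = top_waiter;  // temporary bump
        if (effective > best_prio) {
            best_prio = effective;
            best = &t;
        }
    }
    return best ? best->name : "";
}
```

    In the T1/T2/T3 scenario, the model picks T2 without the bump and T3
    with it, so T3 gets to finish its critical section and release the
    lock that T1 is waiting for.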


4. Performance issues and tradeoffs

    (a) Implementation of spinlocks/mutexes can be expensive. Reasons:

        --mutex costs: 
            --the raw instructions required to execute "mutex_acquire"
            --going to sleep and waking up implies context switch, which
            brings a resource cost

    (b) Coarse locks limit available parallelism  ....

        [(still, you should design this way at first!!!)]

	    the fundamental issue with coarse-grained locking is that
	    only one CPU can execute anywhere in the part of your code
	    protected by a lock. If your critical section code is called
	    a lot, this may reduce the performance of an expensive
	    multiprocessor to that of a single CPU.

	    if this happens inside the kernel, it means that
	    applications will inherit the performance problems from the
	    kernel


    (c) ... but fine-grained locking leads to complexity and hence bugs
    (like deadlock)

        --> although finer-grained locking can often lead to better
        performance, it also leads to increased complexity and hence
        risk of bugs (including deadlock).
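
    To make the coarse-versus-fine tradeoff concrete, here is a small
    sketch of a counter table split into independently locked stripes.
    StripedCounters and its methods are hypothetical names chosen for
    illustration. With one stripe it behaves like a single coarse lock;
    with more stripes, operations on keys in different stripes can
    proceed in parallel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical counter table with kStripes independently locked parts.
// kStripes == 1 is the coarse-grained design: one lock for everything.
template <std::size_t kStripes>
class StripedCounters {
public:
    void increment(const std::string& key) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.mu);
        ++s.counts[key];
    }

    long get(const std::string& key) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> guard(s.mu);
        auto it = s.counts.find(key);
        return it == s.counts.end() ? 0 : it->second;
    }

private:
    struct Stripe {
        std::mutex mu;
        std::unordered_map<std::string, long> counts;
    };

    // Each key always maps to the same stripe, so one lock covers it.
    Stripe& stripe_for(const std::string& key) {
        return stripes_[std::hash<std::string>{}(key) % kStripes];
    }

    std::array<Stripe, kStripes> stripes_;
};
```

    The added complexity shows up as soon as an operation needs two
    keys at once (say, moving a count from one key to another): it may
    have to hold two stripe locks, and then a lock order is needed to
    avoid deadlock. The single-stripe version never has that problem.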

5. Programmability issues

    Loss of modularity

	--examples above: avoiding deadlock requires understanding
	how programs call each other.

	--also, need to know, when calling a library, whether it's
	thread-safe: printf, malloc, etc. If not, surround call with
	mutex. (Can always surround calls with mutexes conservatively.)

        --basically locks bubble out of the interface
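
    A minimal sketch of "surround the call with a mutex", assuming a
    made-up library (the legacy namespace below) whose routine updates
    shared state without synchronization. SafeLegacy is likewise a
    hypothetical wrapper name.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for a library that is NOT thread-safe: it updates shared
// state with no synchronization. (Hypothetical, for illustration only.)
namespace legacy {
long g_total = 0;
void add_to_total(long x) { g_total += x; }
}  // namespace legacy

// Conservative wrapper: every call into the library goes through one mutex.
class SafeLegacy {
public:
    void add_to_total(long x) {
        std::lock_guard<std::mutex> guard(mu_);
        legacy::add_to_total(x);
    }

    long total() {
        std::lock_guard<std::mutex> guard(mu_);
        return legacy::g_total;
    }

private:
    std::mutex mu_;
};
```

    The wrapper protects the library only if every caller in the
    program goes through the same SafeLegacy object. Any module that
    calls legacy::add_to_total() directly defeats it, which is one way
    locks bubble out of the interface.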

    What's the fundamental problem? The fundamental problem is that the
    shared memory programming model is hard to use correctly (although
    mutexes help a great deal). 

=================================================
Practice problems and example of ordering on locks

1. Simple deadlock example

     T1:
     acquire(mutexA);
     acquire(mutexB);

     // do some stuff

     release(mutexB);
     release(mutexA);

     T2:
     acquire(mutexB);
     acquire(mutexA);

     // do some stuff

     release(mutexA);
     release(mutexB);
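
     One possible repair, sketched with the C++ standard library rather
     than the acquire()/release() calls used in these notes: std::lock
     acquires several mutexes using a deadlock-avoidance algorithm, so
     the two threads may name the mutexes in different orders. The
     shared_counter variable is a hypothetical stand-in for "do some
     stuff".

```cpp
#include <cassert>
#include <mutex>

std::mutex mutexA;
std::mutex mutexB;
int shared_counter = 0;  // hypothetical shared state

// Body of T1: names A first.
void t1_body() {
    std::lock(mutexA, mutexB);  // acquires both without risk of deadlock
    // adopt_lock: the guards take over the already-held mutexes and
    // release them when they go out of scope.
    std::lock_guard<std::mutex> ga(mutexA, std::adopt_lock);
    std::lock_guard<std::mutex> gb(mutexB, std::adopt_lock);
    ++shared_counter;  // do some stuff
}

// Body of T2: names B first, which deadlocked in the version above.
void t2_body() {
    std::lock(mutexB, mutexA);
    std::lock_guard<std::mutex> gb(mutexB, std::adopt_lock);
    std::lock_guard<std::mutex> ga(mutexA, std::adopt_lock);
    ++shared_counter;  // do some stuff
}
```

     The alternative fix is to keep plain acquire() calls and make both
     threads take mutexA before mutexB.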

=======================================================================

2. More subtle deadlock example

 Let M be a monitor (shared object with methods protected by mutex)
 Let N be another monitor

     class M {
         private:
             Mutex mutex_m;

             // instance of monitor N
             N another_monitor;

             // Assumption: no other objects in the system hold a pointer
             // to our "another_monitor"

         public:
             M();
             ~M();
             void methodA();
             void methodB();
     };

     class N {
         private:
             Mutex mutex_n;
             Cond cond_n;
             int navailable;

         public:
             N();
             ~N();
             void* alloc(int nwanted);
             void free(void*);
     };

     void*
     N::alloc(int nwanted) {
         acquire(&mutex_n);
         while (navailable < nwanted) {
             wait(&cond_n, &mutex_n);
         }

         // peel off the memory

         navailable -= nwanted;
         release(&mutex_n);
         // return a pointer to the peeled-off memory (details elided)
     }

     void
     N::free(void* returning_mem) {

         acquire(&mutex_n);

         // put the memory back

         navailable += returning_mem;  // pseudocode: add back the size of the returned chunk

         broadcast(&cond_n, &mutex_n);

         release(&mutex_n);
     }

     void
     M::methodA() {

         acquire(&mutex_m);

         void* new_mem = another_monitor.alloc(nbytes);  // nbytes: amount needed

         // do a bunch of stuff using this nice
         // chunk of memory n allocated for us

         release(&mutex_m);
     }

     void
     M::methodB() {

         acquire(&mutex_m);

         // do a bunch of stuff

         another_monitor.free(some_pointer);

         release(&mutex_m);

     }

     QUESTION: What’s the problem?  

=======================================================================

3. Locking brings a performance vs. complexity trade-off 

    /*
     *  linux/mm/filemap.c
     *
     * Copyright (C) 1994-1999  Linus Torvalds
     */

    /*
     * This file handles the generic file mmap semantics used by
     * most "normal" filesystems (but you don't /have/ to use this:
     * the NFS filesystem used to do this differently, for example)
     */
    #include <linux/export.h>
    #include <linux/compiler.h>
    #include <linux/dax.h>
    #include <linux/fs.h>
    #include <linux/sched/signal.h>
    #include <linux/uaccess.h>
    #include <linux/capability.h>
    #include <linux/kernel_stat.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/swap.h>
    #include <linux/mman.h>
    #include <linux/pagemap.h>
    #include <linux/file.h>
    #include <linux/uio.h>
    #include <linux/hash.h>
    #include <linux/writeback.h>
    #include <linux/backing-dev.h>
    #include <linux/pagevec.h>
    #include <linux/blkdev.h>
    #include <linux/security.h>
    #include <linux/cpuset.h>
    #include <linux/hugetlb.h>
    #include <linux/memcontrol.h>
    #include <linux/cleancache.h>
    #include <linux/shmem_fs.h>
    #include <linux/rmap.h>
    #include <linux/delayacct.h>
    #include <linux/psi.h>
    #include "internal.h"

    #define CREATE_TRACE_POINTS
    #include <trace/events/filemap.h>

    /*
     * FIXME: remove all knowledge of the buffer layer from the core VM
     */
    #include <linux/buffer_head.h> /* for try_to_free_buffers */

    #include <asm/mman.h>

    /*
     * Shared mappings implemented 30.11.1994. It's not fully working yet,
     * though.
     *
     * Shared mappings now work. 15.8.1995  Bruno.
     *
     * finished 'unifying' the page and buffer cache and SMP-threaded the
     * page-cache, 21.05.1999, Ingo Molnar <mingo@redhat.com>
     *
     * SMP-threaded pagemap-LRU 1999, Andrea Arcangeli <andrea@suse.de>
     */

    /*
     * Lock ordering:
     *
     *  ->i_mmap_rwsem      (truncate_pagecache)
     *    ->private_lock        (__free_pte->__set_page_dirty_buffers)
     *      ->swap_lock     (exclusive_swap_page, others)
     *        ->i_pages lock
     *
     *  ->i_mutex
     *    ->i_mmap_rwsem        (truncate->unmap_mapping_range)
     *
     *  ->mmap_sem
     *    ->i_mmap_rwsem
     *      ->page_table_lock or pte_lock   (various, mainly in memory.c)
     *        ->i_pages lock    (arch-dependent flush_dcache_mmap_lock)
     *
     *  ->mmap_sem
     *    ->lock_page       (access_process_vm)
     *
     *  ->i_mutex           (generic_perform_write)
     *    ->mmap_sem        (fault_in_pages_readable->do_page_fault)
     *
     *  bdi->wb.list_lock
     *    sb_lock           (fs/fs-writeback.c)
     *    ->i_pages lock        (__sync_single_inode)
     *
     *  ->i_mmap_rwsem
     *    ->anon_vma.lock       (vma_adjust)
     *
     *  ->anon_vma.lock
     *    ->page_table_lock or pte_lock (anon_vma_prepare and various)
     *
     *  ->page_table_lock or pte_lock
     *    ->swap_lock       (try_to_unmap_one)
     *    ->private_lock        (try_to_unmap_one)
     *    ->i_pages lock        (try_to_unmap_one)
     *    ->zone_lru_lock(zone) (follow_page->mark_page_accessed)
     *    ->zone_lru_lock(zone) (check_pte_range->isolate_lru_page)
     *    ->private_lock        (page_remove_rmap->set_page_dirty)
     *    ->i_pages lock        (page_remove_rmap->set_page_dirty)
     *    bdi.wb->list_lock     (page_remove_rmap->set_page_dirty)
     *    ->inode->i_lock       (page_remove_rmap->set_page_dirty)
     *    ->memcg->move_lock    (page_remove_rmap->lock_page_memcg)
     *    bdi.wb->list_lock     (zap_pte_range->set_page_dirty)
     *    ->inode->i_lock       (zap_pte_range->set_page_dirty)
     *    ->private_lock        (zap_pte_range->__set_page_dirty_buffers)
     *
     * ->i_mmap_rwsem
     *   ->tasklist_lock            (memory_failure, collect_procs_ao)
     */

    static void page_cache_delete(struct address_space *mapping,
                       struct page *page, void *shadow)
    {
        XA_STATE(xas, &mapping->i_pages, page->index);
        unsigned int nr = 1;

        ....

    [the point is: fine-grained locking leads to complexity.]