Wednesday 21 December 2016

Radix Tree Test Suite Regression Test 1

According to Wikipedia, Regression testing is a type of software testing that verifies that software previously developed and tested still performs correctly even after it was changed or interfaced with other software. Changes may include software enhancements, patches, configuration changes, etc.

The regression test 1 is used to test a special case which causes a deadlock.

static RADIX_TREE(mt_tree, GFP_KERNEL);
This declares a radix tree object with name mt_tree and and initializes it with flags indicated by the mask GFP_KERNEL. These flags control how memory allocations are to be performed.

static pthread_mutex_t mt_lock = PTHREAD_MUTEX_INITIALIZER;
This declares a pthread_mutex_t lock object and initializes it.

struct page {
        pthread_mutex_t lock;
        struct rcu_head rcu;
        int count;
        unsigned long index;
};
This defines a page structure associated with a lock, an rcu_head object (Updates and reads to radix tree data structures are done via the RCU synchronization mechanism.), a count (is a reference count or number of users of the page) and an index (the number of the page within its file).

Now let's come to the main flow of the case tested by regression test 1.

nr_threads = 2;
pthread_barrier_init(&worker_barrier, NULL, nr_threads);

A barrier is a synchronization mechanism that lets you "corral" several cooperating threads (e.g., in a matrix computation), forcing them to wait at a specific point until all have finished before any one thread can continue. Unlike the pthread_join() function, where you'd wait for the threads to terminate, in the case of a barrier you're waiting for the threads to rendezvous at a certain point. When the specified number of threads arrive at the barrier, we unblock all of them so they can continue to run.
We use two threads working on the function regression1_fn.

if (pthread_barrier_wait(&worker_barrier) == 
                          PTHREAD_BARRIER_SERIAL_THREAD)

When all threads meet at this point, PTHREAD_BARRIER_SERIAL_THREAD is returned for one thread and 0 for all other threads.

The updater thread.
p = page_alloc();
pthread_mutex_lock(&mt_lock);
radix_tree_insert(&mt_tree, 0, p);
pthread_mutex_unlock(&mt_lock);

This piece of code allocates a page object p and inserts it in the radix tree mt_tree at index/key 0.

static inline int radix_tree_insert(struct radix_tree_root *root,
                        unsigned long index, void *entry)
{
        return __radix_tree_insert(root, index, 0, entry);
}

int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
                        unsigned order, void *item)
{
        struct radix_tree_node *node;
        void **slot;
        int error;

        BUG_ON(radix_tree_is_internal_node(item));

        error = __radix_tree_create(root, index, order, &node, &slot);
        if (error)
                return error;

        error = insert_entries(node, slot, item, order, false);
        if (error < 0) 
                return error;

        if (node) {
                unsigned offset = get_slot_offset(node, slot);
                BUG_ON(tag_get(node, 0, offset));
                BUG_ON(tag_get(node, 1, offset));
                BUG_ON(tag_get(node, 2, offset));
        } else {
                BUG_ON(root_tags_get(root));
        }

        return 0;
}
Essentially a node is created and the slot on that node which corresponds to the key 0, now points to the page p.
Similarly a new page is inserted at the key 1.
Now we delete the page inserted at key 1 and finally the page at key 0 is deleted.

Now consider the workflow of the other thread (the reader).

for (j = 0; j < 100000000; j++) {
        struct page *pages[10];

        find_get_pages(0, 10, pages);
}
static unsigned find_get_pages(unsigned long start,
                            unsigned int nr_pages, struct page **pages)
{
        unsigned int i;
        unsigned int ret;
        unsigned int nr_found;

        rcu_read_lock();
restart:
        nr_found = radix_tree_gang_lookup_slot(&mt_tree,
                                (void ***)pages, NULL, start, nr_pages);
        ret = 0;
        for (i = 0; i < nr_found; i++) {
                struct page *page;
repeat:
                page = radix_tree_deref_slot((void **)pages[i]);
                if (unlikely(!page))
                        continue;

                if (radix_tree_exception(page)) {
                        if (radix_tree_deref_retry(page)) {
                                /*
                                 * Transient condition which can only trigger
                                 * when entry at index 0 moves out of or back
                                 * to root: none yet gotten, safe to restart.
                                 */
                                assert((start | i) == 0);
                                goto restart;
                        }
                        /*
                         * No exceptional entries are inserted in this test.
                         */
                        assert(0);
                }

                pthread_mutex_lock(&page->lock);
                if (!page->count) {
                        pthread_mutex_unlock(&page->lock);
                        goto repeat;
                }
                /* don't actually update page refcount */
                pthread_mutex_unlock(&page->lock);

                /* Has the page moved? */
                if (unlikely(page != *((void **)pages[i]))) {
                        goto repeat;
                }

                pages[ret] = page;
                ret++;
        }
        rcu_read_unlock();
        return ret;
}

It wants to read a maximum of 10 pages starting from index/key 0, into the pages array. The call to radix_tree_gang_lookup_slot() returns either 0 or 1 or 2 depending upon how many pages have been inserted by the updater thread so far and returns their corresponding slots in the pages array. These slots must be de-referenced to get the required pages. We also need to check if the page ref count has become 0 which probably means the page has been deleted, so we want to retry. We also want to retry if the page has moved.

Now where could the deadlock have arisen and how is it checked?
Consider the sequence of events:
  1. Both the pages are inserted (at index 0 and at index 1).
  2. The reader acquires the slots to both of them. 
  3. The page at index 1 is deleted and the page at index 0 is moved to the root of the tree. The place where the index 0 item used to be, is queued up for deletion after the readers finish.
  4. Since the page at index 0 had moved, its de-referencing is tried again.
  5. The updater thread now deletes the page at index 0. It is removed from the direct slot, it remains in the rcu-delayed indirect node.
  6. The reader looks at the index 0 slot, and finds that the page has 0 ref count. So it retries and keeps retrying as the page is not freed because the reader hasn't finished yet and the ref count doesn't change and remains 0. The readers is thus in an infinite loop.
To avoid this deadlock, when the index 0 item is deleted, we have the following code which tags the slot[0] of the root node with RADIX_TREE_INDIRECT_PTR.

if (root->height == 0)
        *((unsigned long *)&to_free->slots[0]) |=
    RADIX_TREE_INDIRECT_PTR;

This causes the check if (radix_tree_exception(page)) to evaluate to true, subsequently the reader is forced to retry the lookup (goes to restart).

No comments:

Post a Comment