Let's Share!: December 2016

Thursday 29 December 2016

Radix Tree Test Suite Regression Test 2

Here's the original patch that describes is detail the bug that caused radix_tree_gang_lookup_tag_slot() to hang up indefinitely, and the fix for the same.

The regression test 2 runs to completion immediately if successful, otherwise it hangs up indefinitely.

#define RADIX_TREE_MAP_SHIFT    (CONFIG_BASE_SMALL ? 4 : 6)
#define RADIX_TREE_MAP_SIZE     (1UL << RADIX_TREE_MAP_SHIFT)
...
int max_slots = RADIX_TREE_MAP_SIZE;
...
for (i = 0; i <= max_slots - 1; i++) {
        p = page_alloc();
        radix_tree_insert(&mt_tree, i, p); 
}
radix_tree_tag_set(&mt_tree, max_slots - 1, PAGECACHE_TAG_DIRTY);

RADIX_TREE_MAP_SIZE denotes the number of slots in each node of a radix tree object (typically 64). First we're going to insert RADIX_TREE_MAP_SIZE number of pages in the radix tree mt_tree, with indices/keys 0 to max_slots - 1. Then we set the item/page at index max_slots - 1 to have the tag PAGECACHE_TAG_DIRTY. For this bug it does not matter which item has the tag, it can be any item.

start = 0;
end = max_slots - 2;
radix_tree_range_tag_if_tagged(&mt_tree, start, end, 1,
                        PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);

This is to update all items with indices in range 0 to max_slots - 2 and currently have the tag PAGECACHE_TAG_DIRTY to have the new tag PAGECACHE_TAG_TOWRITE, which are none. But root is nevertheless tagged with PAGECACHE_TAG_TOWRITE, as the following diff shows.

p = page_alloc();
radix_tree_insert(&mt_tree, max_slots, p);

Now we insert a new page at key max_slots. This results in creation of a new child node which succeeds the tag status of the root tag. Therefore the tag of this new node has PAGECACHE_TAG_TOWRITE but there is no slot with PAGECACHE_TAG_TOWRITE tag in this new node.

radix_tree_tag_clear(&mt_tree, max_slots - 1, PAGECACHE_TAG_DIRTY);

Next we update the item at index max_slots - 1, which currently has tag PAGECACHE_TAG_DIRTY, to clear this tag.

for (i = max_slots - 1; i >= 0; i--)
        radix_tree_delete(&mt_tree, i);

Now we delete all items in key range 0 to max_slots - 1. Only the item with index RADIX_TREE_MAP_SIZE exists in the tree. The root still has the tag PAGECACHE_TAG_TOWRITE.

// NOTE: start should not be 0 because radix_tree_gang_lookup_tag_slot
//       can return.
start = 1;
end = max_slots - 2;
radix_tree_gang_lookup_tag_slot(&mt_tree, (void ***)pages, start, end,
        PAGECACHE_TAG_TOWRITE);

This calls __lookup_tag, but since the first slot of the tree node is null and the tag corresponding to it has PAGECACHE_TAG_TOWRITE, it keeps trying to get the items but it cannot do so ever. This bug was fixed to change radix_tree_tag_if_tagged so that it doesn't tag the root tag if it doesn't set any tags within the specified range.

Here's how the call to tag_tagged_items() which replaces radix_tree_range_tag_if_tagged(), essentially looks like:

radix_tree_for_each_tagged(slot, root, &iter, start, iftag) {
        if (iter.index > end)
                break;
        radix_tree_iter_tag_set(root, &iter, thentag);
        tagged++;
        if ((tagged % batch) != 0)
                continue;
        slot = radix_tree_iter_resume(slot, &iter);
        if (lock) {
                pthread_mutex_unlock(lock);
                rcu_barrier();
                pthread_mutex_lock(lock);
        }
}

Here we specifically look for slots marked with iftag using radix_tree_for_each_tagged() and only set them with the thentag, if found.

Wednesday 28 December 2016

Iteration tests in Radix tree test suite

According to Wikipedia, unit testing is a software testing method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine whether they are fit for use.

Iteration test in Radix tree test suite is a unit test for a bug found by the syzkaller tester.

Iteration tests can be run for radix tree with zero order or multi-order items. Essentially these launch five threads to test parallel iteration over tagged or untagged entries, adding, removing and tagging entries, for an arbitrary period of time.

Let's see what each thread is doing in more detail.

/*
 * Iterate over the tagged entries, doing a radix_tree_iter_retry() as we find
 * things that have been removed and randomly resetting our iteration to the
 * next chunk with radix_tree_iter_resume().  Both radix_tree_iter_retry() and
 * radix_tree_iter_resume() cause radix_tree_next_slot() to be called with a
 * NULL 'slot' variable.
 */
static void *tagged_iteration_fn(void *arg)
{
 struct radix_tree_iter iter;
 void **slot;

 rcu_register_thread();

 while (!test_complete) {
  rcu_read_lock();
  radix_tree_for_each_tagged(slot, &tree, &iter, 0, TAG) {
   void *entry = radix_tree_deref_slot(slot);
   if (unlikely(!entry))
    continue;

   if (radix_tree_deref_retry(entry)) {
    slot = radix_tree_iter_retry(&iter);
    continue;
   }

   if (rand_r(&seeds[0]) % 50 == 0) {
    slot = radix_tree_iter_resume(slot, &iter);
    rcu_read_unlock();
    rcu_barrier();
    rcu_read_lock();
   }
  }
  rcu_read_unlock();
 }

 rcu_unregister_thread();

 return NULL;
}

Iterating the tree can be done through radix_tree_iter object. This radix tree iterator works in terms of "chunks" of slots. A chunk is a sub-interval of slots contained within one radix tree leaf node. Thus we need to track the 'index' of the current slot and 'next_index' (one beyond the last index for this chunk) to track the size of the chunk. tags is a bitmask corresponding to each slot in the chunk for a particular tag. *node denotes the node containing this slot and shift is that node's property that holds the bits remaining for each slot in that node.

struct radix_tree_iter {
        unsigned long   index;
        unsigned long   next_index;
        unsigned long   tags;
        struct radix_tree_node *node;
#ifdef CONFIG_RADIX_TREE_MULTIORDER
        unsigned int    shift;
#endif
};

#define radix_tree_for_each_tagged(slot, root, iter, start, tag)        \
        for (slot = radix_tree_iter_init(iter, start) ;                 \
             slot || (slot = radix_tree_next_chunk(root, iter,          \
                              RADIX_TREE_ITER_TAGGED | tag)) ;          \
             slot = radix_tree_next_slot(slot, iter,                    \
                                RADIX_TREE_ITER_TAGGED | tag))

This traverses the tree starting from the slot with key start, chunk by chunk and slot by slot within a chunk. The RADIX_TREE_ITER_TAGGED mask in the tag tells the radix_tree_next_chunk() function that we are interested in the lookup of tagged slots and the tag argument is the tag we are looking for.

Once we look-up a slot we need to de-reference it to get the entry/item that is stored in the slot. Before proceeding we need to check if another thread has moved the item stored in the slot to another location (using radix_tree_deref_retry()), in which case we need to retry the look-up.

radix_tree_iter_retry(&iter) works by updating the iter->next_index to iter->index, iter->tags to 0 and slot to NULL, so that the subsequent call to radix_tree_next_slot() returns NULL and the subsequent call to radix_tree_next_chunk() returns the first slot of the chunk associated with iter, which was the slot for which we needed to repeat the look-up.

Now we can simply continue the iteration in this manner.

void **radix_tree_iter_resume(void **slot, struct radix_tree_iter *iter)
{
        struct radix_tree_node *node;

        slot++;
        iter->index = __radix_tree_iter_add(iter, 1);
        node = rcu_dereference_raw(*slot);
        skip_siblings(&node, slot, iter);
        iter->next_index = iter->index;
        iter->tags = 0; 
        return NULL;
}

Once in every fifty times, we are done with the iteration and we want to resume it at a later point of time. So we update the iter and slot so that when we have the read lock again, we can resume iteration from the desired slot, with that slot indicating the start of our next chunk.

untagged_iteration_fn() does the same thing, with the only difference being that now we are not iterating over entries with a particular tag but all slots which are non-empty, via :

#define radix_tree_for_each_slot(slot, root, iter, start)               \
        for (slot = radix_tree_iter_init(iter, start) ;                 \
             slot || (slot = radix_tree_next_chunk(root, iter, 0)) ;    \
             slot = radix_tree_next_slot(slot, iter, 0))

/* relentlessly fill the tree with tagged entries */
static void *add_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                unsigned long pgoff;
                int order;

                for (pgoff = 0; pgoff < MAX_IDX; pgoff++) {
                        pthread_mutex_lock(&tree_lock);
                        for (order = max_order; order >= 0; order--) {
                                if (item_insert_order(&tree, pgoff, order)
                                                == 0) {
                                        item_tag_set(&tree, pgoff, TAG);
                                        break;
                                }
                        }
                        pthread_mutex_unlock(&tree_lock);
                }
        }

        rcu_unregister_thread();

        return NULL;
}

This keeps inserting entries with indices/keys 0-MAX_IDX and tagging them with the given TAG. For multi order case we pass in the order as the maximum possible and keep reducing it until we get a successful insertion. item_insert_order() returns 0 for a successful insertion and non-zero for error cases such as not being able to create a required new node, or not being able to extend the tree in terms of its shift, due to failed memory allocation (ENOMEM), or if an item already exists at that index (EEXIST).

/*
 * Randomly remove entries to help induce radix_tree_iter_retry() calls in the
 * two iteration functions.
 */
static void *remove_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                int pgoff;

                pgoff = rand_r(&seeds[2]) % MAX_IDX;

                pthread_mutex_lock(&tree_lock);
                item_delete(&tree, pgoff);
                pthread_mutex_unlock(&tree_lock);
        }

        rcu_unregister_thread();

        return NULL;
}

remove_entries_fn() works by randomly selecting a key to delete and item_delete() works by freeing the item at passed index and updating the tags in the tree, or returns NULL if the item was not present.

static void *tag_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                tag_tagged_items(&tree, &tree_lock, 0, MAX_IDX, 10, TAG,
                                        NEW_TAG);
        }
        rcu_unregister_thread();
        return NULL;
}

We want to update items between indices 0 to MAX_IDX which are tagged with TAG, to NEW_TAG in batches of 10.

To conclude, when all these functions are happening in parallel, we want to test if there are races between them.

Wednesday 21 December 2016

Radix Tree Test Suite Regression Test 1

According to Wikipedia, Regression testing is a type of software testing that verifies that software previously developed and tested still performs correctly even after it was changed or interfaced with other software. Changes may include software enhancements, patches, configuration changes, etc.

The regression test 1 is used to test a special case which causes a deadlock.

static RADIX_TREE(mt_tree, GFP_KERNEL);

This declares a radix tree object with name mt_tree and and initializes it with flags indicated by the mask GFP_KERNEL. These flags control how memory allocations are to be performed.

static pthread_mutex_t mt_lock = PTHREAD_MUTEX_INITIALIZER;

This declares a pthread_mutex_t lock object and initializes it.

struct page {
        pthread_mutex_t lock;
        struct rcu_head rcu;
        int count;
        unsigned long index;
};

This defines a page structure associated with a lock, an rcu_head object (Updates and reads to radix tree data structures are done via the RCU synchronization mechanism.), a count (is a reference count or number of users of the page) and an index (the number of the page within its file).

Now let's come to the main flow of the case tested by regression test 1.

nr_threads = 2;
pthread_barrier_init(&worker_barrier, NULL, nr_threads);

A barrier is a synchronization mechanism that lets you "corral" several cooperating threads (e.g., in a matrix computation), forcing them to wait at a specific point until all have finished before any one thread can continue. Unlike the pthread_join() function, where you'd wait for the threads to terminate, in the case of a barrier you're waiting for the threads to rendezvous at a certain point. When the specified number of threads arrive at the barrier, we unblock all of them so they can continue to run.
We use two threads working on the function regression1_fn.

if (pthread_barrier_wait(&worker_barrier) == 
                          PTHREAD_BARRIER_SERIAL_THREAD)

When all threads meet at this point, PTHREAD_BARRIER_SERIAL_THREAD is returned for one thread and 0 for all other threads.

The updater thread.

p = page_alloc();
pthread_mutex_lock(&mt_lock);
radix_tree_insert(&mt_tree, 0, p);
pthread_mutex_unlock(&mt_lock);

This piece of code allocates a page object p and inserts it in the radix tree mt_tree at index/key 0.

static inline int radix_tree_insert(struct radix_tree_root *root,
                        unsigned long index, void *entry)
{
        return __radix_tree_insert(root, index, 0, entry);
}

int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
                        unsigned order, void *item)
{
        struct radix_tree_node *node;
        void **slot;
        int error;

        BUG_ON(radix_tree_is_internal_node(item));

        error = __radix_tree_create(root, index, order, &node, &slot);
        if (error)
                return error;

        error = insert_entries(node, slot, item, order, false);
        if (error < 0) 
                return error;

        if (node) {
                unsigned offset = get_slot_offset(node, slot);
                BUG_ON(tag_get(node, 0, offset));
                BUG_ON(tag_get(node, 1, offset));
                BUG_ON(tag_get(node, 2, offset));
        } else {
                BUG_ON(root_tags_get(root));
        }

        return 0;
}

Essentially a node is created and the slot on that node which corresponds to the key 0, now points to the page p.
Similarly a new page is inserted at the key 1.
Now we delete the page inserted at key 1 and finally the page at key 0 is deleted.

Now consider the workflow of the other thread (the reader).

for (j = 0; j < 100000000; j++) {
        struct page *pages[10];

        find_get_pages(0, 10, pages);
}

static unsigned find_get_pages(unsigned long start,
                            unsigned int nr_pages, struct page **pages)
{
        unsigned int i;
        unsigned int ret;
        unsigned int nr_found;

        rcu_read_lock();
restart:
        nr_found = radix_tree_gang_lookup_slot(&mt_tree,
                                (void ***)pages, NULL, start, nr_pages);
        ret = 0;
        for (i = 0; i < nr_found; i++) {
                struct page *page;
repeat:
                page = radix_tree_deref_slot((void **)pages[i]);
                if (unlikely(!page))
                        continue;

                if (radix_tree_exception(page)) {
                        if (radix_tree_deref_retry(page)) {
                                /*
                                 * Transient condition which can only trigger
                                 * when entry at index 0 moves out of or back
                                 * to root: none yet gotten, safe to restart.
                                 */
                                assert((start | i) == 0);
                                goto restart;
                        }
                        /*
                         * No exceptional entries are inserted in this test.
                         */
                        assert(0);
                }

                pthread_mutex_lock(&page->lock);
                if (!page->count) {
                        pthread_mutex_unlock(&page->lock);
                        goto repeat;
                }
                /* don't actually update page refcount */
                pthread_mutex_unlock(&page->lock);

                /* Has the page moved? */
                if (unlikely(page != *((void **)pages[i]))) {
                        goto repeat;
                }

                pages[ret] = page;
                ret++;
        }
        rcu_read_unlock();
        return ret;
}

It wants to read a maximum of 10 pages starting from index/key 0, into the pages array. The call to radix_tree_gang_lookup_slot() returns either 0 or 1 or 2 depending upon how many pages have been inserted by the updater thread so far and returns their corresponding slots in the pages array. These slots must be de-referenced to get the required pages. We also need to check if the page ref count has become 0 which probably means the page has been deleted, so we want to retry. We also want to retry if the page has moved.

Now where could the deadlock have arisen and how is it checked?
Consider the sequence of events:

Both the pages are inserted (at index 0 and at index 1).
The reader acquires the slots to both of them.
The page at index 1 is deleted and the page at index 0 is moved to the root of the tree. The place where the index 0 item used to be, is queued up for deletion after the readers finish.
Since the page at index 0 had moved, its de-referencing is tried again.
The updater thread now deletes the page at index 0. It is removed from the direct slot, it remains in the rcu-delayed indirect node.
The reader looks at the index 0 slot, and finds that the page has 0 ref count. So it retries and keeps retrying as the page is not freed because the reader hasn't finished yet and the ref count doesn't change and remains 0. The readers is thus in an infinite loop.

To avoid this deadlock, when the index 0 item is deleted, we have the following code which tags the slot[0] of the root node with RADIX_TREE_INDIRECT_PTR.

if (root->height == 0)
        *((unsigned long *)&to_free->slots[0]) |=
    RADIX_TREE_INDIRECT_PTR;

This causes the check if (radix_tree_exception(page)) to evaluate to true, subsequently the reader is forced to retry the lookup (goes to restart).

Tuesday 6 December 2016

Building and running Radix Tree Test Suite

Radix Trees are data structures in the linux kernel. The Radix tree test suite can be found in the tools/testing/radix-tree directory.

How to build the tests:

Navigate to the tools/testing/radix-tree directory from the repository directory.

cd tools/testing/radix-tree

Make sure you have the libraries pthread and urcu.

sudo apt-get install libpthread-stubs0-dev
sudo apt-get install liburcu-dev

Try running make. If you still get a long list of errors like undefined reference to 'rcu_register_thread_memb', then I suggest editing the Makefile to put $(LDFLAGS) at the end of the line rather than in the middle of the line.

make

Running the tests:

```
./main
```
This is the default (short) run.
```
./main -l
```
This is the long run which runs the big_gang_check() and copy_tag_check() for thousands of iterations compared to only 3 in the default run.
```
./main -s <custom seed>
```
This lets the user declare a custom seed (for random functions) in the program. By default, the seed is initialised as:
```
unsigned int seed = time(NULL);
```