Sunday 3 December 2017

Open Source Summit NA 2017

The Linux foundation holds a lot of events where companies can reach the maintainers and developers across all the important open source software projects in enterprise, networking, embedded, IoT and cloud infrastructure.

Open Source Summit is the premier open source technical conference in North America, gathering 2,000+ developers, operators and community leadership professionals to collaborate, share information and learn about the latest in open technologies, including Linux, containers, cloud computing and more. This time they also organised a Diversity Summit.


I was really lucky and excited to be a part of the OSS NA 2017. I had sent a proposal earlier in May for delivering a talk based on my Outreachy internship with Linux Kernel from December 2016 - March 2017, which was accepted. The best part was connecting with my mentor Matthew Wilcox as well as a fellow intern who was also my co-speaker for the talk, Sandhya Bankar, though we all missed having Rik Van Riel, also my mentor for the Outreachy internship, and Julia Lawall, who manages and mentors the outreachy internships and has been really encouraging throughout.



With my mentor Matthew Wilcox

With Greg kroah-hartman
Coming to the conference, I had never experienced such a diverse collection of talks, people and organisations as in OSS NA 17. Even the backdrop of Los Angeles, its amazing food and places and wonderful networking opportunities truly made the experience enriching and magical.

With Sandhya Bankar at the partner reception at The Rooftop at The Standard.
At the evening event at Paramount Studios
At the Walt Disney Concert Hall with Jaminy Prabha
I couldn't understand the talks from CloudOpen and ContainerCon tracks much, but going around the sponsor showcase, I got a lot of background on what these very popular and upcoming technologies are about and got inspired to explore them when I go back. I had some very interesting discussions there. I attended many LinuxCon and Diversity track talks as well as keynotes which I found most interesting and took a lot home from them.

At the sponsor showcase with this amazing web representation of all the technologies there.
The opening keynote by Jim Zemlin really set up the excitement about open source and the next 4 days.



Next I really liked knowing about the CHAOSS Project and found it relevant to my research area at college. CHAOSS is a new Linux Foundation project aimed at producing integrated, open source software for analyzing software development, together with defining implementation-agnostic metrics for measuring community activity, contributions, and health.

Another highlight of the first day was the Women in Open Source Lunch sponsored by Intel. The kind of positivity, support, ideas and inspiration in that room was really one of its kind.

Being one of the few students at the summit, I found the talk, "Increasing student participation in open source" very interesting. Similarly the career fair was fun, getting to talk to engineers in popular companies and the job profiles that they are seeking. 

All the keynotes over the rest of the conference were really interesting, but the afternoon talks sometimes required too much attention to really get them and I was jet-lagged.

I particularly enjoyed the keynote by Tanmay Bakshi as well as meeting him and his family later. He talked about how he’s using cognitive and cloud computing to change the world, through his open-source initiatives, for instance, “The Cognitive Story”, meant to augment and amplify human capabilities; and “AskTanmay”, the world’s first Web-Based NLQA System, built using IBM Watson’s Cognitive Capabilities. It was inspiring to learn how passionate he is about technology at such a young age.


A highlight from the third day was my mentor's talk on Replacing the Radix Tree which was a follow-up thread to my outreachy internship and inspired me to contribute to the new XArray API and test suite.

I am grateful to all from the Linux community and all those who support the outreachy internships, and Marina Zhurakhinskaya and Sarah Sharp. I'm also really grateful to The Linux foundation and Outreachy programme for sponsoring my trip.

Friday 20 January 2017

Performance benchmarks in radix tree test suite

Currently we have tests to check performance of tagged or normal iteration, in terms of speed or time of iteration. Let's see how these work.

static void benchmark_size(unsigned long size, unsigned long step, int order)
{
        RADIX_TREE(tree, GFP_KERNEL);
        long long normal, tagged;
        unsigned long index;

        for (index = 0 ; index < size ; index += step) {
                item_insert_order(&tree, index, order);
                radix_tree_tag_set(&tree, index, 0); 
        }

        tagged = benchmark_iter(&tree, true);
        normal = benchmark_iter(&tree, false);

        printf("Size %ld, step %6ld, order %d tagged %10lld ns, normal %10lld ns\n",
                size, step, order, tagged, normal);

        item_kill_tree(&tree);
        rcu_barrier();
}

We start with an arbitrary size of the radix tree say 210 or 220. A step of 128 means that starting from index 0 till the max index i.e. size, every 128th key index will be used to insert a tagged item. Order denotes the number of indices covered by a particular key. So a given key covers 2order indices around it. So if the order is non zero, the step size passed in will be 2order times it.

So we insert an item at every index which is an integral multiple of 'step', between 0 to size, and tag it with TAG 0, then execute tagged followed by normal iteration.

static long long benchmark_iter(struct radix_tree_root *root, bool tagged)
{
        volatile unsigned long sink = 0;
        struct radix_tree_iter iter;
        struct timespec start, finish;
        long long nsec;
        int l, loops = 1;
        void **slot;

#ifdef BENCHMARK
again:
#endif
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (l = 0; l < loops; l++) {
                if (tagged) {
                        radix_tree_for_each_tagged(slot, root, &iter, 0, 0)
                                sink ^= (unsigned long)slot;
                } else {
                        radix_tree_for_each_slot(slot, root, &iter, 0)
                                sink ^= (unsigned long)slot;
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &finish);

        nsec = (finish.tv_sec - start.tv_sec) * NSEC_PER_SEC +
               (finish.tv_nsec - start.tv_nsec);

#ifdef BENCHMARK
        if (loops == 1 && nsec * 5 < NSEC_PER_SEC) {
                loops = NSEC_PER_SEC / nsec / 4 + 1;
                goto again;
        }
#endif

        nsec /= loops;
        return nsec;
}

By default these tests are executed with RADIX_TREE_MAP_SHIFT value of 3, but if we want to get performance comparable to in kernel performance, we can compile using BENCHMARK=1 which sets RADIX_TREE_MAP_SHIFT to 6.

We track the time elapsed for the iteration in nanoseconds in variable nsec. If RADIX_TREE_MAP_SHIFT is 6, i.e. benchmark is 1, we may get nsec too small (less than 0.2th fraction of NSEC_PER_SEC), in which case we want to measure time elapsed for multiple iterations and take the average. The number of iterations is decided by how small nsec initially is. 

Sunday 1 January 2017

Radix Tree Test Suite Regression Test 3

We start by creating two void pointers ptr0 with value 0x4 and ptr with value 0x8.

void *ptr0 = (void *)4ul;
void *ptr = (void *)8ul;

Next we insert the entry ptr0 (0x4) into the tree and tag it with TAG 0.

radix_tree_insert(&root, 0, ptr0);
radix_tree_tag_set(&root, 0, 0);

Now we begin the tagged iteration. We get the first tagged slot 0x4 at index 0, then insert a new item 0x8 at index 1, so as to trigger radix_tree_deref_retry, as slot at index 0 is moved after the insertion.

radix_tree_for_each_tagged(slot, &root, &iter, 0, 0) {
        printv(2, "tagged %ld %p\n", iter.index, *slot);
        if (first) {
                radix_tree_insert(&root, 1, ptr);
                radix_tree_tag_set(&root, 1, 0);
                first = false;
        }
        if (radix_tree_deref_retry(*slot)) {
                printv(2, "retry at %ld\n", iter.index);
                slot = radix_tree_iter_retry(&iter);
                continue;
        }
}

radix_tree_iter_retry(&iter) works by updating the iter->next_index to iter->index, iter->tags to 0 and slot to NULL, so that the subsequent call to radix_tree_next_slot() returns NULL and the subsequent call to radix_tree_next_chunk() returns the first slot of the chunk associated with iter, which was the slot for which we needed to repeat the look-up.

static __always_inline void **
radix_tree_next_slot(void **slot, struct radix_tree_iter *iter, unsigned flags)
{
        if (flags & RADIX_TREE_ITER_TAGGED) {
                iter->tags >>= 1;
                if (unlikely(!iter->tags))
                        return NULL;
...

However, if care is not taken we can get a segfault if we try to de-reference a NULL slot. This is checked with the if (unlikely(!iter->tags)).

Similar is the case with iteration for all non-empty slots via the radix_tree_for_each_slot(). We start with the tree containing only one item (0x4) at index 0. After looking it up we insert another item 0x8 at index 1, so as to trigger radix_tree_deref_retry, as slot at index 0 is moved after the insertion.

radix_tree_for_each_slot(slot, &root, &iter, 0) {
        printv(2, "slot %ld %p\n", iter.index, *slot);
        if (first) {
                radix_tree_insert(&root, 1, ptr);
                first = false;
        }
        if (radix_tree_deref_retry(*slot)) {
                printv(2, "retry at %ld\n", iter.index);
                slot = radix_tree_iter_retry(&iter);
                continue;
        }
}

In this case segfault is avoided by proceeding with de-referencing the slot only if radix_tree_chunk_size(iter) returns a positive value.

static __always_inline long
radix_tree_chunk_size(struct radix_tree_iter *iter)
{
        return (iter->next_index - iter->index) >> iter_shift(iter);
}

radix_tree_iter_retry(&iter) works by updating the iter->next_index to iter->index so count is returned as 0 (for 0 order slots).

Similar is the case for iteration over all contiguous slots, until the first empty slot, via the radix_tree_for_each_contig(). We start with the tree containing only one item (0x4) at index 0. After looking it up we insert another item 0x8 at index 1, so as to trigger radix_tree_deref_retry, as slot at index 0 is moved after the insertion.

radix_tree_for_each_contig(slot, &root, &iter, 0) {
        printv(2, "contig %ld %p\n", iter.index, *slot);
        if (first) {
                radix_tree_insert(&root, 1, ptr);
                first = false;
        }
        if (radix_tree_deref_retry(*slot)) {
                printv(2, "retry at %ld\n", iter.index);
                slot = radix_tree_iter_retry(&iter);
                continue;
        }
}

In this case segfault is avoided by the following type of check in radix_tree_next_slot() and radix_tree_next_chunk().

if (flags & RADIX_TREE_ITER_CONTIG) {
        /* forbid switching to the next chunk */
        iter->next_index = 0;
        break;
}
...
return NULL;

The next three tests check the same behavior of radix_tree_next_slot() to prevent a segfault, the difference here being that the slot is made NULL by radix_tree_iter_resume().

Thursday 29 December 2016

Radix Tree Test Suite Regression Test 2

Here's the original patch that describes is detail the bug that caused radix_tree_gang_lookup_tag_slot() to hang up indefinitely, and the fix for the same.

The regression test 2 runs to completion immediately if successful, otherwise it hangs up indefinitely.

#define RADIX_TREE_MAP_SHIFT    (CONFIG_BASE_SMALL ? 4 : 6)
#define RADIX_TREE_MAP_SIZE     (1UL << RADIX_TREE_MAP_SHIFT)
...
int max_slots = RADIX_TREE_MAP_SIZE;
...
for (i = 0; i <= max_slots - 1; i++) {
        p = page_alloc();
        radix_tree_insert(&mt_tree, i, p); 
}
radix_tree_tag_set(&mt_tree, max_slots - 1, PAGECACHE_TAG_DIRTY);

RADIX_TREE_MAP_SIZE denotes the number of slots in each node of a radix tree object (typically 64). First we're going to insert  RADIX_TREE_MAP_SIZE number of pages in the radix tree mt_tree, with indices/keys 0 to max_slots - 1. Then we set the item/page at index max_slots - 1 to have the tag PAGECACHE_TAG_DIRTY. For this bug it does not matter which item has the tag, it can be any item.

start = 0;
end = max_slots - 2;
radix_tree_range_tag_if_tagged(&mt_tree, start, end, 1,
                        PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);

This is to update all items with indices in range 0 to max_slots - 2 and currently have the tag PAGECACHE_TAG_DIRTY to have the new tag PAGECACHE_TAG_TOWRITE, which are none. But root is nevertheless tagged with  PAGECACHE_TAG_TOWRITE, as the following diff shows.















p = page_alloc();
radix_tree_insert(&mt_tree, max_slots, p); 

Now we insert a new page at key max_slots. This results in creation of a new child node which succeeds the tag status of the root tag. Therefore the tag of this new node has PAGECACHE_TAG_TOWRITE but there is no slot with PAGECACHE_TAG_TOWRITE tag in this new node.

radix_tree_tag_clear(&mt_tree, max_slots - 1, PAGECACHE_TAG_DIRTY);

Next we update the item at index max_slots - 1, which currently has tag PAGECACHE_TAG_DIRTY, to clear this tag.

for (i = max_slots - 1; i >= 0; i--)
        radix_tree_delete(&mt_tree, i);

Now we delete all items in key range 0 to max_slots - 1. Only the item with index RADIX_TREE_MAP_SIZE exists in the tree. The root still has the tag PAGECACHE_TAG_TOWRITE.

// NOTE: start should not be 0 because radix_tree_gang_lookup_tag_slot
//       can return.
start = 1;
end = max_slots - 2;
radix_tree_gang_lookup_tag_slot(&mt_tree, (void ***)pages, start, end,
        PAGECACHE_TAG_TOWRITE);

This calls __lookup_tag, but since the first slot of the tree node is null and the tag corresponding to it has PAGECACHE_TAG_TOWRITE, it keeps trying to get the items but it cannot do so ever. This bug was fixed to change radix_tree_tag_if_tagged so that it doesn't tag the root tag if it doesn't set any tags within the specified range.

Here's how the call to tag_tagged_items() which replaces radix_tree_range_tag_if_tagged(), essentially looks like:

radix_tree_for_each_tagged(slot, root, &iter, start, iftag) {
        if (iter.index > end)
                break;
        radix_tree_iter_tag_set(root, &iter, thentag);
        tagged++;
        if ((tagged % batch) != 0)
                continue;
        slot = radix_tree_iter_resume(slot, &iter);
        if (lock) {
                pthread_mutex_unlock(lock);
                rcu_barrier();
                pthread_mutex_lock(lock);
        }
}

Here we specifically look for slots marked with iftag using radix_tree_for_each_tagged() and only set them with the thentag, if found.

Wednesday 28 December 2016

Iteration tests in Radix tree test suite

According to Wikipedia, unit testing is a software testing method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine whether they are fit for use.

Iteration test in Radix tree test suite is a unit test for a bug found by the syzkaller tester.

Iteration tests can be run for radix tree with zero order or multi-order items. Essentially these launch five threads to test parallel iteration over tagged or untagged entries, adding, removing and tagging entries, for an arbitrary period of time.

Let's see what each thread is doing in more detail.

/*
 * Iterate over the tagged entries, doing a radix_tree_iter_retry() as we find
 * things that have been removed and randomly resetting our iteration to the
 * next chunk with radix_tree_iter_resume().  Both radix_tree_iter_retry() and
 * radix_tree_iter_resume() cause radix_tree_next_slot() to be called with a
 * NULL 'slot' variable.
 */
static void *tagged_iteration_fn(void *arg)
{
 struct radix_tree_iter iter;
 void **slot;

 rcu_register_thread();

 while (!test_complete) {
  rcu_read_lock();
  radix_tree_for_each_tagged(slot, &tree, &iter, 0, TAG) {
   void *entry = radix_tree_deref_slot(slot);
   if (unlikely(!entry))
    continue;

   if (radix_tree_deref_retry(entry)) {
    slot = radix_tree_iter_retry(&iter);
    continue;
   }

   if (rand_r(&seeds[0]) % 50 == 0) {
    slot = radix_tree_iter_resume(slot, &iter);
    rcu_read_unlock();
    rcu_barrier();
    rcu_read_lock();
   }
  }
  rcu_read_unlock();
 }

 rcu_unregister_thread();

 return NULL;
}

Iterating the tree can be done through radix_tree_iter object. This radix tree iterator works in terms of "chunks" of slots. A chunk is a sub-interval of slots contained within one radix tree leaf node. Thus we need to track the 'index' of the current slot and 'next_index' (one beyond the last index for this chunk) to track the size  of the chunk. tags is a bitmask corresponding to each slot in the chunk for a particular tag. *node denotes the node containing this slot and shift is that node's property that holds the bits remaining for each slot in that node.

struct radix_tree_iter {
        unsigned long   index;
        unsigned long   next_index;
        unsigned long   tags;
        struct radix_tree_node *node;
#ifdef CONFIG_RADIX_TREE_MULTIORDER
        unsigned int    shift;
#endif
};

#define radix_tree_for_each_tagged(slot, root, iter, start, tag)        \
        for (slot = radix_tree_iter_init(iter, start) ;                 \
             slot || (slot = radix_tree_next_chunk(root, iter,          \
                              RADIX_TREE_ITER_TAGGED | tag)) ;          \
             slot = radix_tree_next_slot(slot, iter,                    \
                                RADIX_TREE_ITER_TAGGED | tag))

This traverses the tree starting from the slot with key start, chunk by chunk and slot by slot within a chunk. The RADIX_TREE_ITER_TAGGED mask in the tag  tells the radix_tree_next_chunk() function that we are interested in the lookup of tagged slots and the tag argument is the tag we are looking for.

Once we look-up a slot we need to de-reference it to get the entry/item that is stored in the slot. Before proceeding  we need to check if another thread has moved the item stored in the slot to another location (using radix_tree_deref_retry()), in which case we need to retry the look-up.

radix_tree_iter_retry(&iter) works by updating the iter->next_index to iter->index, iter->tags to 0 and slot to NULL, so that the subsequent call to radix_tree_next_slot() returns NULL and the subsequent call to radix_tree_next_chunk() returns the first slot of the chunk associated with iter, which was the slot for which we needed to repeat the look-up.

Now we can simply continue the iteration in this manner.

void **radix_tree_iter_resume(void **slot, struct radix_tree_iter *iter)
{
        struct radix_tree_node *node;

        slot++;
        iter->index = __radix_tree_iter_add(iter, 1);
        node = rcu_dereference_raw(*slot);
        skip_siblings(&node, slot, iter);
        iter->next_index = iter->index;
        iter->tags = 0; 
        return NULL;
}

Once in every fifty times, we are done with the iteration and we want to resume it at a later point of time. So we update the iter and slot so that when we have the read lock again, we can resume iteration from the desired slot, with that slot indicating the start of our next chunk.

untagged_iteration_fn() does the same thing, with the only difference being that now we are not iterating over entries with a particular tag but all slots which are non-empty, via :

#define radix_tree_for_each_slot(slot, root, iter, start)               \
        for (slot = radix_tree_iter_init(iter, start) ;                 \
             slot || (slot = radix_tree_next_chunk(root, iter, 0)) ;    \
             slot = radix_tree_next_slot(slot, iter, 0))

/* relentlessly fill the tree with tagged entries */
static void *add_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                unsigned long pgoff;
                int order;

                for (pgoff = 0; pgoff < MAX_IDX; pgoff++) {
                        pthread_mutex_lock(&tree_lock);
                        for (order = max_order; order >= 0; order--) {
                                if (item_insert_order(&tree, pgoff, order)
                                                == 0) {
                                        item_tag_set(&tree, pgoff, TAG);
                                        break;
                                }
                        }
                        pthread_mutex_unlock(&tree_lock);
                }
        }

        rcu_unregister_thread();

        return NULL;
}

This keeps inserting entries with indices/keys 0-MAX_IDX and tagging them with the given TAG. For multi order case we pass in the order as the maximum possible and keep reducing it until we get a successful insertion. item_insert_order() returns 0 for a successful insertion and non-zero for error cases such as not being able to create a required new node, or not being able to extend the tree in terms of its shift, due to failed memory allocation (ENOMEM), or if an item already exists at that index (EEXIST).

/*
 * Randomly remove entries to help induce radix_tree_iter_retry() calls in the
 * two iteration functions.
 */
static void *remove_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                int pgoff;

                pgoff = rand_r(&seeds[2]) % MAX_IDX;

                pthread_mutex_lock(&tree_lock);
                item_delete(&tree, pgoff);
                pthread_mutex_unlock(&tree_lock);
        }

        rcu_unregister_thread();

        return NULL;
}

remove_entries_fn() works by randomly selecting a key to delete and item_delete() works by freeing the item at passed index and updating the tags in the tree, or returns NULL if the item was not present.

static void *tag_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                tag_tagged_items(&tree, &tree_lock, 0, MAX_IDX, 10, TAG,
                                        NEW_TAG);
        }
        rcu_unregister_thread();
        return NULL;
}

We want to update items between indices 0 to MAX_IDX which are tagged with TAG, to NEW_TAG in batches of 10.

To conclude, when all these functions are happening in parallel, we want to test if there are races between them.

Wednesday 21 December 2016

Radix Tree Test Suite Regression Test 1

According to Wikipedia, Regression testing is a type of software testing that verifies that software previously developed and tested still performs correctly even after it was changed or interfaced with other software. Changes may include software enhancements, patches, configuration changes, etc.

The regression test 1 is used to test a special case which causes a deadlock.

static RADIX_TREE(mt_tree, GFP_KERNEL);
This declares a radix tree object with name mt_tree and and initializes it with flags indicated by the mask GFP_KERNEL. These flags control how memory allocations are to be performed.

static pthread_mutex_t mt_lock = PTHREAD_MUTEX_INITIALIZER;
This declares a pthread_mutex_t lock object and initializes it.

struct page {
        pthread_mutex_t lock;
        struct rcu_head rcu;
        int count;
        unsigned long index;
};
This defines a page structure associated with a lock, an rcu_head object (Updates and reads to radix tree data structures are done via the RCU synchronization mechanism.), a count (is a reference count or number of users of the page) and an index (the number of the page within its file).

Now let's come to the main flow of the case tested by regression test 1.

nr_threads = 2;
pthread_barrier_init(&worker_barrier, NULL, nr_threads);

A barrier is a synchronization mechanism that lets you "corral" several cooperating threads (e.g., in a matrix computation), forcing them to wait at a specific point until all have finished before any one thread can continue. Unlike the pthread_join() function, where you'd wait for the threads to terminate, in the case of a barrier you're waiting for the threads to rendezvous at a certain point. When the specified number of threads arrive at the barrier, we unblock all of them so they can continue to run.
We use two threads working on the function regression1_fn.

if (pthread_barrier_wait(&worker_barrier) == 
                          PTHREAD_BARRIER_SERIAL_THREAD)

When all threads meet at this point, PTHREAD_BARRIER_SERIAL_THREAD is returned for one thread and 0 for all other threads.

The updater thread.
p = page_alloc();
pthread_mutex_lock(&mt_lock);
radix_tree_insert(&mt_tree, 0, p);
pthread_mutex_unlock(&mt_lock);

This piece of code allocates a page object p and inserts it in the radix tree mt_tree at index/key 0.

static inline int radix_tree_insert(struct radix_tree_root *root,
                        unsigned long index, void *entry)
{
        return __radix_tree_insert(root, index, 0, entry);
}

int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
                        unsigned order, void *item)
{
        struct radix_tree_node *node;
        void **slot;
        int error;

        BUG_ON(radix_tree_is_internal_node(item));

        error = __radix_tree_create(root, index, order, &node, &slot);
        if (error)
                return error;

        error = insert_entries(node, slot, item, order, false);
        if (error < 0) 
                return error;

        if (node) {
                unsigned offset = get_slot_offset(node, slot);
                BUG_ON(tag_get(node, 0, offset));
                BUG_ON(tag_get(node, 1, offset));
                BUG_ON(tag_get(node, 2, offset));
        } else {
                BUG_ON(root_tags_get(root));
        }

        return 0;
}
Essentially a node is created and the slot on that node which corresponds to the key 0, now points to the page p.
Similarly a new page is inserted at the key 1.
Now we delete the page inserted at key 1 and finally the page at key 0 is deleted.

Now consider the workflow of the other thread (the reader).

for (j = 0; j < 100000000; j++) {
        struct page *pages[10];

        find_get_pages(0, 10, pages);
}
static unsigned find_get_pages(unsigned long start,
                            unsigned int nr_pages, struct page **pages)
{
        unsigned int i;
        unsigned int ret;
        unsigned int nr_found;

        rcu_read_lock();
restart:
        nr_found = radix_tree_gang_lookup_slot(&mt_tree,
                                (void ***)pages, NULL, start, nr_pages);
        ret = 0;
        for (i = 0; i < nr_found; i++) {
                struct page *page;
repeat:
                page = radix_tree_deref_slot((void **)pages[i]);
                if (unlikely(!page))
                        continue;

                if (radix_tree_exception(page)) {
                        if (radix_tree_deref_retry(page)) {
                                /*
                                 * Transient condition which can only trigger
                                 * when entry at index 0 moves out of or back
                                 * to root: none yet gotten, safe to restart.
                                 */
                                assert((start | i) == 0);
                                goto restart;
                        }
                        /*
                         * No exceptional entries are inserted in this test.
                         */
                        assert(0);
                }

                pthread_mutex_lock(&page->lock);
                if (!page->count) {
                        pthread_mutex_unlock(&page->lock);
                        goto repeat;
                }
                /* don't actually update page refcount */
                pthread_mutex_unlock(&page->lock);

                /* Has the page moved? */
                if (unlikely(page != *((void **)pages[i]))) {
                        goto repeat;
                }

                pages[ret] = page;
                ret++;
        }
        rcu_read_unlock();
        return ret;
}

It wants to read a maximum of 10 pages starting from index/key 0, into the pages array. The call to radix_tree_gang_lookup_slot() returns either 0 or 1 or 2 depending upon how many pages have been inserted by the updater thread so far and returns their corresponding slots in the pages array. These slots must be de-referenced to get the required pages. We also need to check if the page ref count has become 0 which probably means the page has been deleted, so we want to retry. We also want to retry if the page has moved.

Now where could the deadlock have arisen and how is it checked?
Consider the sequence of events:
  1. Both the pages are inserted (at index 0 and at index 1).
  2. The reader acquires the slots to both of them. 
  3. The page at index 1 is deleted and the page at index 0 is moved to the root of the tree. The place where the index 0 item used to be, is queued up for deletion after the readers finish.
  4. Since the page at index 0 had moved, its de-referencing is tried again.
  5. The updater thread now deletes the page at index 0. It is removed from the direct slot, it remains in the rcu-delayed indirect node.
  6. The reader looks at the index 0 slot, and finds that the page has 0 ref count. So it retries and keeps retrying as the page is not freed because the reader hasn't finished yet and the ref count doesn't change and remains 0. The readers is thus in an infinite loop.
To avoid this deadlock, when the index 0 item is deleted, we have the following code which tags the slot[0] of the root node with RADIX_TREE_INDIRECT_PTR.

if (root->height == 0)
        *((unsigned long *)&to_free->slots[0]) |=
    RADIX_TREE_INDIRECT_PTR;

This causes the check if (radix_tree_exception(page)) to evaluate to true, subsequently the reader is forced to retry the lookup (goes to restart).

Tuesday 6 December 2016

Building and running Radix Tree Test Suite

Radix Trees are data structures in the linux kernel. The Radix tree test suite can be found in the tools/testing/radix-tree directory.

How to build the tests:

  1. Navigate to the tools/testing/radix-tree directory from the repository directory.
  2. cd tools/testing/radix-tree
  3. Make sure you have the libraries pthread and urcu.
  4. sudo apt-get install libpthread-stubs0-dev
    sudo apt-get install liburcu-dev
  5. Try running make. If you still get a long list of errors like undefined reference to 'rcu_register_thread_memb', then I suggest editing the Makefile to put $(LDFLAGS) at the end of the line rather than in the middle of the line.
  6. make
Running the tests:

  1. ./main
    
    This is the default (short) run.
  2. ./main -l
    This is the long run which runs the big_gang_check() and copy_tag_check() for thousands of iterations compared to only 3 in the default run.
  3. ./main -s <custom seed>
    This lets the user declare a custom seed (for random functions) in the program. By default, the seed is initialised as:
    unsigned int seed = time(NULL);