Thursday 29 December 2016

Radix Tree Test Suite Regression Test 2

Here's the original patch that describes in detail the bug that caused radix_tree_gang_lookup_tag_slot() to hang indefinitely, along with its fix.

Regression test 2 runs to completion immediately if successful; otherwise it hangs indefinitely.

#define RADIX_TREE_MAP_SHIFT    (CONFIG_BASE_SMALL ? 4 : 6)
#define RADIX_TREE_MAP_SIZE     (1UL << RADIX_TREE_MAP_SHIFT)
...
int max_slots = RADIX_TREE_MAP_SIZE;
...
for (i = 0; i <= max_slots - 1; i++) {
        p = page_alloc();
        radix_tree_insert(&mt_tree, i, p); 
}
radix_tree_tag_set(&mt_tree, max_slots - 1, PAGECACHE_TAG_DIRTY);

RADIX_TREE_MAP_SIZE denotes the number of slots in each node of a radix tree object (typically 64). First we insert RADIX_TREE_MAP_SIZE pages into the radix tree mt_tree, with indices/keys 0 to max_slots - 1. Then we tag the item/page at index max_slots - 1 with PAGECACHE_TAG_DIRTY. For this bug it does not matter which item has the tag; it can be any item.

start = 0;
end = max_slots - 2;
radix_tree_range_tag_if_tagged(&mt_tree, start, end, 1,
                        PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);

This call is meant to add the tag PAGECACHE_TAG_TOWRITE to every item in the index range 0 to max_slots - 2 that currently has the tag PAGECACHE_TAG_DIRTY. No item in that range is tagged dirty, so no item gets the new tag. But the root is nevertheless tagged with PAGECACHE_TAG_TOWRITE, as the following diff shows.


p = page_alloc();
radix_tree_insert(&mt_tree, max_slots, p); 

Now we insert a new page at key max_slots. This creates a new child node, which inherits the tag status of the root. The new node is therefore tagged with PAGECACHE_TAG_TOWRITE, although none of its slots has the PAGECACHE_TAG_TOWRITE tag.

radix_tree_tag_clear(&mt_tree, max_slots - 1, PAGECACHE_TAG_DIRTY);

Next we update the item at index max_slots - 1, which currently has tag PAGECACHE_TAG_DIRTY, to clear this tag.

for (i = max_slots - 1; i >= 0; i--)
        radix_tree_delete(&mt_tree, i);

Now we delete all items in key range 0 to max_slots - 1. Only the item at index max_slots (that is, RADIX_TREE_MAP_SIZE) remains in the tree. The root still has the tag PAGECACHE_TAG_TOWRITE.

// NOTE: start should not be 0 because radix_tree_gang_lookup_tag_slot
//       can return.
start = 1;
end = max_slots - 2;
radix_tree_gang_lookup_tag_slot(&mt_tree, (void ***)pages, start, end,
        PAGECACHE_TAG_TOWRITE);

This calls __lookup_tag(). Since the first slot of the tree node is NULL while the tag bit corresponding to it says PAGECACHE_TAG_TOWRITE, the lookup keeps trying to fetch items and never succeeds. The bug was fixed by changing radix_tree_range_tag_if_tagged() so that it doesn't set the root tag if it doesn't set any tags within the specified range.
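The inconsistency between the root tag and the slot tags can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a single 64-slot node, plain bitmaps for the tags, and a boolean for the root tag. The names toy_tree, toy_range_tag_if_tagged() and toy_root_tag_consistent() are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLOTS 64

/* Toy single-node "tree" (hypothetical, not the kernel's layout):
 * one bit per slot for each tag, plus a root summary bit for TOWRITE. */
struct toy_tree {
	unsigned long long dirty;	/* per-slot DIRTY bits */
	unsigned long long towrite;	/* per-slot TOWRITE bits */
	bool root_towrite;		/* root-level TOWRITE tag */
};

/* Copy DIRTY to TOWRITE for slots in [start, end] and return how many
 * slots were tagged. With buggy == true the root tag is set even when
 * nothing was tagged, which mirrors the behaviour described above. */
static unsigned toy_range_tag_if_tagged(struct toy_tree *t, unsigned start,
					unsigned end, bool buggy)
{
	unsigned tagged = 0;

	for (unsigned i = start; i <= end && i < TOY_SLOTS; i++) {
		if (t->dirty & (1ULL << i)) {
			t->towrite |= 1ULL << i;
			tagged++;
		}
	}
	if (buggy || tagged > 0)
		t->root_towrite = true;
	return tagged;
}

/* The invariant a tagged lookup relies on: a set root tag implies that
 * at least one slot actually carries the tag. */
static bool toy_root_tag_consistent(const struct toy_tree *t)
{
	return !t->root_towrite || t->towrite != 0;
}
```

With only slot 63 dirty and the range 0 to 62, the buggy variant leaves the root tagged while no slot is, which is the state that makes a tagged lookup spin; the fixed variant leaves the root untagged.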

Here's what the core of tag_tagged_items(), which replaces radix_tree_range_tag_if_tagged(), essentially looks like:

radix_tree_for_each_tagged(slot, root, &iter, start, iftag) {
        if (iter.index > end)
                break;
        radix_tree_iter_tag_set(root, &iter, thentag);
        tagged++;
        if ((tagged % batch) != 0)
                continue;
        slot = radix_tree_iter_resume(slot, &iter);
        if (lock) {
                pthread_mutex_unlock(lock);
                rcu_barrier();
                pthread_mutex_lock(lock);
        }
}

Here we specifically look for slots marked with iftag using radix_tree_for_each_tagged(), and set thentag only on the slots that are found.
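The same "visit only tagged slots, pause every batch items" pattern can be sketched on a plain bitmap. This is a toy under stated assumptions, not the test suite's code: toy_tags and toy_tag_tagged_items() are hypothetical names, the tree is reduced to 64 slots, and a counter stands in for the point where the real loop drops and retakes the lock.

```c
#include <assert.h>

/* Toy model (hypothetical names): each tag is a 64-bit bitmap. */
struct toy_tags {
	unsigned long long iftag;
	unsigned long long thentag;
};

/* Visit only the slots in [start, end] that carry iftag, set thentag on
 * each, and count in *pauses how often the loop reaches a batch boundary.
 * 'batch' must be non-zero. Returns the number of slots tagged. */
static unsigned toy_tag_tagged_items(struct toy_tags *t, unsigned start,
				     unsigned end, unsigned batch,
				     unsigned *pauses)
{
	unsigned tagged = 0;

	*pauses = 0;
	for (unsigned i = start; i <= end && i < 64; i++) {
		if (!(t->iftag & (1ULL << i)))
			continue;	/* untagged slots are never touched */
		t->thentag |= 1ULL << i;
		tagged++;
		if (tagged % batch == 0)
			(*pauses)++;	/* the lock would be released here */
	}
	return tagged;
}
```

Because nothing outside the loop body sets a tag, a range with no iftag slots leaves thentag completely untouched, which is the property the fix depends on.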

Wednesday 28 December 2016

Iteration tests in Radix tree test suite

According to Wikipedia, unit testing is a software testing method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine whether they are fit for use.

The iteration test in the radix tree test suite is a unit test for a bug found by the syzkaller tester.

Iteration tests can be run on a radix tree with zero-order or multi-order items. Essentially they launch five threads that iterate in parallel over tagged or untagged entries while adding, removing and tagging entries, for an arbitrary period of time.

Let's see what each thread is doing in more detail.

/*
 * Iterate over the tagged entries, doing a radix_tree_iter_retry() as we find
 * things that have been removed and randomly resetting our iteration to the
 * next chunk with radix_tree_iter_resume().  Both radix_tree_iter_retry() and
 * radix_tree_iter_resume() cause radix_tree_next_slot() to be called with a
 * NULL 'slot' variable.
 */
static void *tagged_iteration_fn(void *arg)
{
        struct radix_tree_iter iter;
        void **slot;

        rcu_register_thread();

        while (!test_complete) {
                rcu_read_lock();
                radix_tree_for_each_tagged(slot, &tree, &iter, 0, TAG) {
                        void *entry = radix_tree_deref_slot(slot);
                        if (unlikely(!entry))
                                continue;

                        if (radix_tree_deref_retry(entry)) {
                                slot = radix_tree_iter_retry(&iter);
                                continue;
                        }

                        if (rand_r(&seeds[0]) % 50 == 0) {
                                slot = radix_tree_iter_resume(slot, &iter);
                                rcu_read_unlock();
                                rcu_barrier();
                                rcu_read_lock();
                        }
                }
                rcu_read_unlock();
        }

        rcu_unregister_thread();

        return NULL;
}

Iterating over the tree is done through a radix_tree_iter object. The radix tree iterator works in terms of "chunks" of slots. A chunk is a sub-interval of slots contained within one radix tree leaf node. We therefore track the index of the current slot and next_index (one beyond the last index of this chunk), which together give the size of the chunk. tags is a bitmask with one bit per slot in the chunk for a particular tag. node points to the node containing this slot, and shift is that node's shift, i.e. the number of index bits remaining for each slot in that node.

struct radix_tree_iter {
        unsigned long   index;
        unsigned long   next_index;
        unsigned long   tags;
        struct radix_tree_node *node;
#ifdef CONFIG_RADIX_TREE_MULTIORDER
        unsigned int    shift;
#endif
};

#define radix_tree_for_each_tagged(slot, root, iter, start, tag)        \
        for (slot = radix_tree_iter_init(iter, start) ;                 \
             slot || (slot = radix_tree_next_chunk(root, iter,          \
                              RADIX_TREE_ITER_TAGGED | tag)) ;          \
             slot = radix_tree_next_slot(slot, iter,                    \
                                RADIX_TREE_ITER_TAGGED | tag))

This traverses the tree starting from the slot with key start, chunk by chunk and slot by slot within a chunk. The RADIX_TREE_ITER_TAGGED flag, ORed into the flags argument, tells radix_tree_next_chunk() that we are only interested in tagged slots, and tag is the tag we are looking for.
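The chunk-by-chunk, slot-by-slot walk can be sketched with a toy iterator over a small bitmap. This is a simplified model under stated assumptions, not the kernel implementation: chunks are 8 slots instead of RADIX_TREE_MAP_SIZE, the "tree" is a flat array of tag bytes, and toy_iter, toy_next_chunk(), toy_next_slot() and toy_collect_tagged() are hypothetical names. The index, next_index and tags fields play the same roles as in struct radix_tree_iter.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CHUNK 8		/* slots per chunk in this toy */
#define TOY_NCHUNKS 4

struct toy_iter {
	unsigned long index;		/* index of the current slot */
	unsigned long next_index;	/* one beyond the current chunk */
	unsigned long tags;		/* tag bits from 'index' on, bit 0 = current */
};

/* Find the next chunk with a tagged slot at or after iter->next_index. */
static bool toy_next_chunk(const unsigned char tagmap[TOY_NCHUNKS],
			   struct toy_iter *iter)
{
	unsigned long idx = iter->next_index;

	while (idx < TOY_CHUNK * TOY_NCHUNKS) {
		unsigned long bits =
			(unsigned long)tagmap[idx / TOY_CHUNK] >> (idx % TOY_CHUNK);

		if (bits) {
			while (!(bits & 1)) {	/* skip to the first tagged slot */
				bits >>= 1;
				idx++;
			}
			iter->index = idx;
			iter->tags = bits;
			iter->next_index = (idx / TOY_CHUNK + 1) * TOY_CHUNK;
			return true;
		}
		idx = (idx / TOY_CHUNK + 1) * TOY_CHUNK;	/* try the next chunk */
	}
	return false;
}

/* Advance to the next tagged slot inside the current chunk, if any. */
static bool toy_next_slot(struct toy_iter *iter)
{
	iter->tags >>= 1;
	if (!iter->tags)
		return false;	/* chunk exhausted, caller fetches a new chunk */
	iter->index++;
	while (!(iter->tags & 1)) {
		iter->tags >>= 1;
		iter->index++;
	}
	return true;
}

/* Collect up to 'max' tagged indices, starting the walk at 'start'. */
static unsigned toy_collect_tagged(const unsigned char tagmap[TOY_NCHUNKS],
				   unsigned long start, unsigned long *out,
				   unsigned max)
{
	struct toy_iter iter = { .index = start, .next_index = start, .tags = 0 };
	bool have = false;
	unsigned n = 0;

	while (n < max) {
		if (!have && !toy_next_chunk(tagmap, &iter))
			break;
		out[n++] = iter.index;
		have = toy_next_slot(&iter);
	}
	return n;
}
```

The loop in toy_collect_tagged() has the same shape as the macro: keep using the current chunk while toy_next_slot() finds another tagged slot, and fall back to toy_next_chunk() once it reports that the chunk is exhausted.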

Once we look up a slot, we need to dereference it to get the entry/item stored in it. Before proceeding, we need to check whether another thread has moved the item stored in the slot to another location (using radix_tree_deref_retry()), in which case we need to retry the lookup.

radix_tree_iter_retry(&iter) works by updating the iter->next_index to iter->index, iter->tags to 0 and slot to NULL, so that the subsequent call to radix_tree_next_slot() returns NULL and the subsequent call to radix_tree_next_chunk() returns the first slot of the chunk associated with iter, which was the slot for which we needed to repeat the look-up.

Now we can simply continue the iteration in this manner.

void **radix_tree_iter_resume(void **slot, struct radix_tree_iter *iter)
{
        struct radix_tree_node *node;

        slot++;
        iter->index = __radix_tree_iter_add(iter, 1);
        node = rcu_dereference_raw(*slot);
        skip_siblings(&node, slot, iter);
        iter->next_index = iter->index;
        iter->tags = 0; 
        return NULL;
}

Roughly once in every fifty iterations, we pause the iteration and want to resume it at a later point in time. So we update iter and slot so that, once we hold the read lock again, iteration resumes from the desired slot, with that slot marking the start of our next chunk.

untagged_iteration_fn() does the same thing; the only difference is that we iterate not over entries with a particular tag but over all non-empty slots, via:

#define radix_tree_for_each_slot(slot, root, iter, start)               \
        for (slot = radix_tree_iter_init(iter, start) ;                 \
             slot || (slot = radix_tree_next_chunk(root, iter, 0)) ;    \
             slot = radix_tree_next_slot(slot, iter, 0))

/* relentlessly fill the tree with tagged entries */
static void *add_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                unsigned long pgoff;
                int order;

                for (pgoff = 0; pgoff < MAX_IDX; pgoff++) {
                        pthread_mutex_lock(&tree_lock);
                        for (order = max_order; order >= 0; order--) {
                                if (item_insert_order(&tree, pgoff, order)
                                                == 0) {
                                        item_tag_set(&tree, pgoff, TAG);
                                        break;
                                }
                        }
                        pthread_mutex_unlock(&tree_lock);
                }
        }

        rcu_unregister_thread();

        return NULL;
}

This keeps inserting entries with indices/keys 0 to MAX_IDX - 1 and tagging them with the given TAG. For the multi-order case, we start with the maximum possible order and keep reducing it until an insertion succeeds. item_insert_order() returns 0 for a successful insertion and non-zero for error cases, such as a failed memory allocation (ENOMEM) when creating a required new node or extending the tree in terms of its shift, or an item already existing at that index (EEXIST).
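The "largest order first, then fall back" loop can be sketched on a toy index space. This is an illustration under assumptions, not the test suite's item_insert_order(): here an order-k entry simply covers the 2^k aligned indices around the given index, and toy_index_space, toy_insert_order() and toy_insert_largest() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_SIZE 16

/* Toy stand-in for a tree: occupied[i] is true when index i is covered. */
struct toy_index_space {
	bool occupied[TOY_SIZE];
};

/* Claim the 2^order aligned indices containing 'index'.
 * Returns 0 on success, -EEXIST if any of them is already covered. */
static int toy_insert_order(struct toy_index_space *s, unsigned long index,
			    unsigned order)
{
	unsigned long span = 1UL << order;
	unsigned long first = index & ~(span - 1);	/* align down */

	if (first + span > TOY_SIZE)
		return -EINVAL;
	for (unsigned long i = first; i < first + span; i++)
		if (s->occupied[i])
			return -EEXIST;
	for (unsigned long i = first; i < first + span; i++)
		s->occupied[i] = true;
	return 0;
}

/* Try max_order first and fall back to smaller orders.
 * Returns the order that succeeded, or -1 if even order 0 failed. */
static int toy_insert_largest(struct toy_index_space *s, unsigned long index,
			      int max_order)
{
	for (int order = max_order; order >= 0; order--)
		if (toy_insert_order(s, index, (unsigned)order) == 0)
			return order;
	return -1;
}
```

When a neighbouring index is already occupied, the large orders fail with -EEXIST and the loop ends up inserting a smaller entry, which is why the test thread always makes progress on an index that is still free.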

/*
 * Randomly remove entries to help induce radix_tree_iter_retry() calls in the
 * two iteration functions.
 */
static void *remove_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                int pgoff;

                pgoff = rand_r(&seeds[2]) % MAX_IDX;

                pthread_mutex_lock(&tree_lock);
                item_delete(&tree, pgoff);
                pthread_mutex_unlock(&tree_lock);
        }

        rcu_unregister_thread();

        return NULL;
}

remove_entries_fn() works by randomly selecting a key to delete. item_delete() frees the item at the given index and updates the tags in the tree; if no item was present at that index (radix_tree_delete() returns NULL), there is nothing to free.

static void *tag_entries_fn(void *arg)
{
        rcu_register_thread();

        while (!test_complete) {
                tag_tagged_items(&tree, &tree_lock, 0, MAX_IDX, 10, TAG,
                                        NEW_TAG);
        }
        rcu_unregister_thread();
        return NULL;
}

We want to set NEW_TAG on the items between indices 0 and MAX_IDX that are tagged with TAG, in batches of 10.

To conclude, all these functions run in parallel, and the test checks whether there are races between them.

Wednesday 21 December 2016

Radix Tree Test Suite Regression Test 1

According to Wikipedia, Regression testing is a type of software testing that verifies that software previously developed and tested still performs correctly even after it was changed or interfaced with other software. Changes may include software enhancements, patches, configuration changes, etc.

The regression test 1 is used to test a special case which causes a deadlock.

static RADIX_TREE(mt_tree, GFP_KERNEL);
This declares a radix tree object named mt_tree and initializes it with the flags indicated by the mask GFP_KERNEL. These flags control how memory allocations are to be performed.

static pthread_mutex_t mt_lock = PTHREAD_MUTEX_INITIALIZER;
This declares a pthread_mutex_t lock object and initializes it.

struct page {
        pthread_mutex_t lock;
        struct rcu_head rcu;
        int count;
        unsigned long index;
};
This defines a page structure with a lock, an rcu_head object (updates and reads of radix tree data structures are done via the RCU synchronization mechanism), a count (a reference count, i.e. the number of users of the page) and an index (the number of the page within its file).

Now let's come to the main flow of the case tested by regression test 1.

nr_threads = 2;
pthread_barrier_init(&worker_barrier, NULL, nr_threads);

A barrier is a synchronization mechanism that lets you "corral" several cooperating threads (e.g., in a matrix computation), forcing them to wait at a specific point until all have finished before any one thread can continue. Unlike the pthread_join() function, where you'd wait for the threads to terminate, in the case of a barrier you're waiting for the threads to rendezvous at a certain point. When the specified number of threads arrive at the barrier, we unblock all of them so they can continue to run.
We use two threads working on the function regression1_fn.

if (pthread_barrier_wait(&worker_barrier) == 
                          PTHREAD_BARRIER_SERIAL_THREAD)

When all threads meet at this point, PTHREAD_BARRIER_SERIAL_THREAD is returned for one thread and 0 for all other threads.
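The barrier behaviour can be tried out in isolation. The following is a minimal sketch using only the POSIX barrier calls already mentioned; demo_barrier, demo_worker() and run_barrier_demo() are hypothetical names and not part of the test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define NR_THREADS 2

static pthread_barrier_t demo_barrier;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int serial_count;	/* threads that saw PTHREAD_BARRIER_SERIAL_THREAD */

static void *demo_worker(void *arg)
{
	(void)arg;

	/* Blocks until NR_THREADS threads have arrived at the barrier. */
	int rc = pthread_barrier_wait(&demo_barrier);

	if (rc == PTHREAD_BARRIER_SERIAL_THREAD) {
		pthread_mutex_lock(&count_lock);
		serial_count++;
		pthread_mutex_unlock(&count_lock);
	}
	return NULL;
}

/* Start NR_THREADS workers and return how many of them received
 * PTHREAD_BARRIER_SERIAL_THREAD, or -1 on a setup failure. */
static int run_barrier_demo(void)
{
	pthread_t threads[NR_THREADS];

	serial_count = 0;
	if (pthread_barrier_init(&demo_barrier, NULL, NR_THREADS) != 0)
		return -1;
	for (int i = 0; i < NR_THREADS; i++)
		if (pthread_create(&threads[i], NULL, demo_worker, NULL) != 0)
			return -1;
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	pthread_barrier_destroy(&demo_barrier);
	return serial_count;
}
```

Exactly one thread is singled out per barrier cycle, which is how the regression test picks one of its two threads to be the updater while the other becomes the reader.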

The updater thread.
p = page_alloc();
pthread_mutex_lock(&mt_lock);
radix_tree_insert(&mt_tree, 0, p);
pthread_mutex_unlock(&mt_lock);

This piece of code allocates a page object p and inserts it in the radix tree mt_tree at index/key 0.

static inline int radix_tree_insert(struct radix_tree_root *root,
                        unsigned long index, void *entry)
{
        return __radix_tree_insert(root, index, 0, entry);
}

int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
                        unsigned order, void *item)
{
        struct radix_tree_node *node;
        void **slot;
        int error;

        BUG_ON(radix_tree_is_internal_node(item));

        error = __radix_tree_create(root, index, order, &node, &slot);
        if (error)
                return error;

        error = insert_entries(node, slot, item, order, false);
        if (error < 0) 
                return error;

        if (node) {
                unsigned offset = get_slot_offset(node, slot);
                BUG_ON(tag_get(node, 0, offset));
                BUG_ON(tag_get(node, 1, offset));
                BUG_ON(tag_get(node, 2, offset));
        } else {
                BUG_ON(root_tags_get(root));
        }

        return 0;
}
Essentially a node is created and the slot on that node which corresponds to the key 0, now points to the page p.
Similarly a new page is inserted at the key 1.
Now we delete the page inserted at key 1 and finally the page at key 0 is deleted.

Now consider the workflow of the other thread (the reader).

for (j = 0; j < 100000000; j++) {
        struct page *pages[10];

        find_get_pages(0, 10, pages);
}
static unsigned find_get_pages(unsigned long start,
                            unsigned int nr_pages, struct page **pages)
{
        unsigned int i;
        unsigned int ret;
        unsigned int nr_found;

        rcu_read_lock();
restart:
        nr_found = radix_tree_gang_lookup_slot(&mt_tree,
                                (void ***)pages, NULL, start, nr_pages);
        ret = 0;
        for (i = 0; i < nr_found; i++) {
                struct page *page;
repeat:
                page = radix_tree_deref_slot((void **)pages[i]);
                if (unlikely(!page))
                        continue;

                if (radix_tree_exception(page)) {
                        if (radix_tree_deref_retry(page)) {
                                /*
                                 * Transient condition which can only trigger
                                 * when entry at index 0 moves out of or back
                                 * to root: none yet gotten, safe to restart.
                                 */
                                assert((start | i) == 0);
                                goto restart;
                        }
                        /*
                         * No exceptional entries are inserted in this test.
                         */
                        assert(0);
                }

                pthread_mutex_lock(&page->lock);
                if (!page->count) {
                        pthread_mutex_unlock(&page->lock);
                        goto repeat;
                }
                /* don't actually update page refcount */
                pthread_mutex_unlock(&page->lock);

                /* Has the page moved? */
                if (unlikely(page != *((void **)pages[i]))) {
                        goto repeat;
                }

                pages[ret] = page;
                ret++;
        }
        rcu_read_unlock();
        return ret;
}

The reader wants to read a maximum of 10 pages, starting from index/key 0, into the pages array. The call to radix_tree_gang_lookup_slot() returns 0, 1 or 2 depending upon how many pages have been inserted by the updater thread so far, and returns their corresponding slots in the pages array. These slots must be dereferenced to get the required pages. We also need to check whether the page ref count has become 0, which probably means the page has been deleted, so we want to retry. We also want to retry if the page has moved.

Now where could the deadlock have arisen and how is it checked?
Consider the sequence of events:
  1. Both the pages are inserted (at index 0 and at index 1).
  2. The reader acquires the slots to both of them. 
  3. The page at index 1 is deleted and the page at index 0 is moved to the root of the tree. The place where the index 0 item used to be is queued up for deletion after the readers finish.
  4. Since the page at index 0 has moved, its dereferencing is tried again.
  5. The updater thread now deletes the page at index 0. It is removed from the direct slot, but it remains in the RCU-delayed indirect node.
  6. The reader looks at the index 0 slot and finds that the page has a ref count of 0. So it retries, and keeps retrying: the page is not freed because the reader hasn't finished yet, so the ref count never changes and remains 0. The reader is thus stuck in an infinite loop.
To avoid this deadlock, when the index 0 item is deleted, we have the following code which tags the slot[0] of the root node with RADIX_TREE_INDIRECT_PTR.

if (root->height == 0)
        *((unsigned long *)&to_free->slots[0]) |=
    RADIX_TREE_INDIRECT_PTR;

This causes the check if (radix_tree_exception(page)) to evaluate to true, so the reader is forced to retry the lookup (it goes to restart).
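The trick relies on item pointers being aligned, so their low bit is normally zero and can be used as a marker. Here is a minimal sketch of that idea; TOY_INDIRECT_PTR is a stand-in assumed to occupy the lowest bit, and toy_mark_indirect() and toy_must_retry() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INDIRECT_PTR 1UL	/* stand-in marker bit, assumed to be bit 0 */

/* Mark a slot value as "not a plain item" by setting its low bit.
 * This only works for pointers that are at least 2-byte aligned. */
static void *toy_mark_indirect(void *ptr)
{
	return (void *)((uintptr_t)ptr | TOY_INDIRECT_PTR);
}

/* A reader checks the marker bit before treating the value as an item. */
static bool toy_must_retry(const void *slot_value)
{
	return ((uintptr_t)slot_value & TOY_INDIRECT_PTR) != 0;
}
```

A reader that still holds a pointer to the stale slot now sees a marked value instead of a page with a zero ref count, so it restarts the lookup rather than spinning on the dead page.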

Tuesday 6 December 2016

Building and running Radix Tree Test Suite

Radix trees are data structures in the Linux kernel. The radix tree test suite can be found in the tools/testing/radix-tree directory.

How to build the tests:

  1. Navigate to the tools/testing/radix-tree directory from the repository directory:
    cd tools/testing/radix-tree
  2. Make sure you have the libraries pthread and urcu:
    sudo apt-get install libpthread-stubs0-dev
    sudo apt-get install liburcu-dev
  3. Try running make. If you still get a long list of errors like undefined reference to 'rcu_register_thread_memb', then I suggest editing the Makefile to put $(LDFLAGS) at the end of the line rather than in the middle of the line.
    make

Running the tests:

  1. ./main
    This is the default (short) run.
  2. ./main -l
    This is the long run which runs the big_gang_check() and copy_tag_check() for thousands of iterations compared to only 3 in the default run.
  3. ./main -s <custom seed>
    This lets the user declare a custom seed (for random functions) in the program. By default, the seed is initialised as:
    unsigned int seed = time(NULL);
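A fixed seed is useful because it makes a randomized run repeatable. The sketch below shows the effect with rand_r(), which the iteration test threads also use; toy_fill_random() is a hypothetical helper, not part of the test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

/* Fill out[0..n-1] with pseudo-random values derived from 'seed'.
 * The seed is passed by value, so each call starts from the same state. */
static void toy_fill_random(unsigned int seed, int *out, int n)
{
	for (int i = 0; i < n; i++)
		out[i] = rand_r(&seed);
}
```

Two calls with the same seed produce the same sequence, so a failing run can be replayed by passing the seed it used.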
    

Saturday 22 October 2016

The immortal

What if I told you I am more than 100 centuries old, would you believe me? Even if I managed to convince you, could you ever be sure? What would such a man be like if he were to exist for that long a period of time?

Here's my story.

I have preached with the Gods. I have sailed with explorers. I have theorized with the historians. I have suffered with the world during outbreaks. I have witnessed myriad historically significant episodes of terror and peace, blooms of culture and creativity, and giant leaps in Science & Technology. I have seen all you'd ever want in a lifetime. And yet I have lived through more than a hundred lifetimes.

I was born during the stone age, when man lived in caves, used stone tools and lit his own fire with them. The terrain was mountainous and the climate was cold. Even though the earth's landscape has changed tremendously since then, I was fortunate enough to re-visit the Cave of Altamira in Spain in 2009 and confirm that it indeed was my childhood home. Some glimpses from my first lifetime are permanently etched on the canvas of my mind. Like the Great Hall of Polychromes of Altamira. Those paintings are special. I remember father teaching me and my sister the art of using adoha – modern-day charcoal, to pigment the walls to depict the beautiful variety of fauna that lived around us. I'll tell you something off the textbooks and archaeological studies. Those paintings were actually used to hunt for wild boars and bison, which used to flock to the area, attracted to those paintings. Stone age man was no less in intellect than you all, and yet the smart historians of the 1880s argued that we lacked any capability of artistic expression. Yes, I'm also the famous historian from the 1880s who led the movement to change the perception of prehistoric man. But I'll come to that later.

So stone age was a beautiful time until I found out that I didn't really belong with my tribe. They began to think I'm a different kind of powerful, wild creature and not one of them. I didn't scar. My wounds healed at the speed of light (of course not literally. Who knew about light as a phenomenon let alone know how to count! Hint: I'm a famous scientist associated with discovery of quantum properties of light. But more on that later.) and I stopped aging when my body was in its prime. I went to live alone by myself for a while trying to figure out if I was really that invincible as my tribe thought.

Well I kind of was. I was not afraid of anything. Until one day I passed out due to dehydration. I woke up in the arms of a beautiful woman with a child-like face. I almost exclaimed with joy, “Dua!”. That was my biological sister's name. I thought I had died and was meeting her in an alternate world. But then I saw how different she looked except for the face. She wore cloth and metal ornaments instead of twigs and spoke a proper, more sophisticated language. The truth was stark. All my family, my tribe, my home had gone and I remained.

I was now a bronze age man.

It was a glorious period. We used to worship the river that made it possible for us to practice agriculture and on which our civilization thrived. The modern ideas of culture, language, writing, mineral exploration, trade, military, religion, mathematics, medicine, art and architecture all root back to this era. I myself was involved in the design of the Step Pyramid of Djoser. Yes I'm none other than Imhotep. After centuries of experience, I had also learned the art of poetry and philosophy. The reason historians believe that I self-constructed my tomb which is hidden from the beginning till date, despite efforts to find it, is that there is no tomb. The truth is, I simply vanished one day and no one knew where I went.

I traveled for a long time. I had all the time in the world. Again, I was not afraid of anything. One day, in the middle of nowhere, I saw a man, his skin pale as snow, his hair locks of gold. I wondered what sane man would be out here and thought he may be just like me, afraid of nothing. Before I could figure it out, he shot at me at lightning speed and tried to bite at my neck but spat it out as if it were poison. And there. Gone in the blink of an eye. It was many decades later that I heard about the mythical creature called a vampire. Apparently it isn't a myth. He was not just an ordinary, hungry man, who would bite at another man. But one with a thirst for blood. For I saw him again, some centuries back. This time he introduced himself as 'The Ripper'. I see my 'Ripper' friend from time to time now, in the busiest subways as well as the loneliest colonies during my travels.

Spooked? Then what if I told you I was Jesus Christ, and my beliefs were based on the teachings of Gautam Buddha? That I survived the crucifixion by "blocking the pain", a technique I had learned in India? That most of the Bible's contents are myths to make people believe in my teachings? Would your faith be shattered?

I consider myself lucky to have been in Europe during the Renaissance period. New ideas in art, music, culture, radical thinking, politics, science, literature and mathematics flourished. I became particularly interested in theorizing with famous historians of this era. Ironically, later I became a historian who theorized about the Renaissance era itself. The origin of this period was the Black death or Plague outbreak in Florence which resulted in a shift in the world view of people, causing them to dwell more on their lives on Earth, rather than on spirituality and the afterlife. I say so because I myself was hit by the Plague. It was horrific but I survived.

Come the 20th century. What a bloom for Science. Now now, do you not see the resemblance? I was Max Planck in the 1900s. And yes, I discovered the light quanta. I also witnessed the two World Wars closely and highly condemn them. Since this was one of the most famous and special roles of my life, I began to fake that I was aging, with make-up, so I could continue to pursue the Science that I loved.

In my 300 lifetimes on earth, I have received more love than hate. Yet the modern world presents a bleak possibility that it might become the opposite soon. The way we are exploiting nature, I wonder how long it is before resources run out and we begin to kill each other for our own living.

And here I am today. I still care. I am still a mortal. And I am still alive.

Thursday 6 October 2016

Poster: Our desire to imagine the unreal...

Recently I took an introductory course to Humanities. This is the result of a beginner exercise to get our creative juices flowing. Basically it describes a peculiar human habit.