Login or Sign up

Speeding up data insertion in CouchDB

Posted by: littlemog on Dec. 1, 2009

So last Sunday I received another batch of records to be inserted/updated to the Couch, and it looked simple enough. 40 minutes later, after writing a script for doing the job, I left to shower and came back horrified by the fact that I was doing 2 records per minute.

Basically the job was to dump in user record data, but I also had to perform a check and:

  1. If user didn't exist before, then create new document in CouchDB
  2. If user existed, update with new fields

With around 17000 new records to check/create/update, 2 per minute translated to around 141 hours yikes!. That wasn't going to do, and part of my self-check turned up my big mistake when I first jumped into CouchDB - UUIDs aren't necessary a good idea for most stuff. In my case, email addresses of the users would serve as a better volley instead of UUIDs, because my previous Javascript view did:

function(doc) { if(doc.doc_type=='Subscriber')
{ 
    if(doc.email_address)
        {emit(doc.email_address, doc);}
}

One of the perks of CouchDB is that unless new data was being created, views aren't rebuilt and that was a speed factor I wasn't using. So I decided to abandon ship and rebuild a new CouchDB database using Couchdbkit using email addresses as ids instead of the auto generated UUIDs:

for i in result_list:
    doc_holder = {}
    doc_id = i.email_address
    doc_holder['_id'] = doc_id
    for k in i:     
        doc_key = str(k[0])
        doc_value = str(i.__getitem__(k[0]))
        doc_holder[doc_key] = doc_value

That took me pretty quick, 10 minutes or so since I wrote a view to get all the records and after that the view isn't rebuilt and the whole process goes swiftly. Without doing anything further, I tried the job again.

Time check: 5 records per minute. Better, twice as fast but that'll still take me 56 hours - what?!

I looked at the data given to me and realized that there were duplicates, and my earlier solution was to let the script do a check again and a duplicate would flag up when CouchDB said that the user now exists (due to earlier insertion).

I thought it might be a better idea to just cache the records retrieved from CouchDB and do a matching against that, and whenever a new record was inserted, add the email address to the list:

result_list = Subscriber.view(title_query)
    results = []
    for i in result_list.all():
        results.append(str(i.email_address))
        results.sort()

We're now up to 8 records or so on the next test. Not much.

I then realized as that the checking was iterating through the entire master list for each new record, which wasn't efficient (linearly) and one of the main drags. I could have done a .startswith() to get the index of the email addresses in the master list that matched, and bring down the time but I had just learnt about bisection and decided to go that way with a bit of recursion(notice the master list is sorted above):

email_index = email_add[0]
result_length = len(results_dump) + 1
result_index = result_length / 2

if email_index > results_dump[result_index][0]:
    results_dump = results_dump[(result_index-1):(result_length+1)]
    final = retrieve_subscriber_id(email_add,results_dump)
    return final

elif email_index < results_dump[result_index][0]:
    results_dump = results_dump[0:(result_index+1)]
    final = retrieve_subscriber_id(email_add,results_dump)
    return final

elif email_index == results_dump[result_index][0]:
    for i in results_dump:
        if (email_add.lower() == i.lower()):
            print "WE GOT A MATCH!"
            return True 
    return False

Noobish attempt and that got the whole process to about 12-14 records per second, or 720 per minute. For 17000 records, it took me a whole 23 minutes versus the original days. Cool!

_Feel free to improve me by commenting on the noobish code above.

Comments on This Post:

Please Login (or Sign Up) to leave a comment