So last Sunday I received another batch of records to be inserted/updated to the Couch, and it looked simple enough. 40 minutes later, after writing a script for doing the job, I left to shower and came back horrified by the fact that I was doing 2 records per minute.
Basically the job was to dump in user record data, but I also had to perform a check and:
With around 17000 new records to check/create/update, 2 per minute translated to around 141 hours yikes!. That wasn't going to do, and part of my self-check turned up my big mistake when I first jumped into CouchDB - UUIDs aren't necessary a good idea for most stuff. In my case, email addresses of the users would serve as a better volley instead of UUIDs, because my previous Javascript view did:
function(doc) { if(doc.doc_type=='Subscriber')
{
if(doc.email_address)
{emit(doc.email_address, doc);}
}
One of the perks of CouchDB is that unless new data was being created, views aren't rebuilt and that was a speed factor I wasn't using. So I decided to abandon ship and rebuild a new CouchDB database using Couchdbkit using email addresses as ids instead of the auto generated UUIDs:
for i in result_list:
doc_holder = {}
doc_id = i.email_address
doc_holder['_id'] = doc_id
for k in i:
doc_key = str(k[0])
doc_value = str(i.__getitem__(k[0]))
doc_holder[doc_key] = doc_value
That took me pretty quick, 10 minutes or so since I wrote a view to get all the records and after that the view isn't rebuilt and the whole process goes swiftly. Without doing anything further, I tried the job again.
Time check: 5 records per minute. Better, twice as fast but that'll still take me 56 hours - what?!
I looked at the data given to me and realized that there were duplicates, and my earlier solution was to let the script do a check again and a duplicate would flag up when CouchDB said that the user now exists (due to earlier insertion).
I thought it might be a better idea to just cache the records retrieved from CouchDB and do a matching against that, and whenever a new record was inserted, add the email address to the list:
result_list = Subscriber.view(title_query)
results = []
for i in result_list.all():
results.append(str(i.email_address))
results.sort()
We're now up to 8 records or so on the next test. Not much.
I then realized as that the checking was iterating through the entire master list for each new record, which wasn't efficient (linearly) and one of the main drags. I could have done a .startswith() to get the index of the email addresses in the master list that matched, and bring down the time but I had just learnt about bisection and decided to go that way with a bit of recursion(notice the master list is sorted above):
email_index = email_add[0]
result_length = len(results_dump) + 1
result_index = result_length / 2
if email_index > results_dump[result_index][0]:
results_dump = results_dump[(result_index-1):(result_length+1)]
final = retrieve_subscriber_id(email_add,results_dump)
return final
elif email_index < results_dump[result_index][0]:
results_dump = results_dump[0:(result_index+1)]
final = retrieve_subscriber_id(email_add,results_dump)
return final
elif email_index == results_dump[result_index][0]:
for i in results_dump:
if (email_add.lower() == i.lower()):
print "WE GOT A MATCH!"
return True
return False
Noobish attempt and that got the whole process to about 12-14 records per second, or 720 per minute. For 17000 records, it took me a whole 23 minutes versus the original days. Cool!
_Feel free to improve me by commenting on the noobish code above.