Login or Sign up

Dumping data to CouchDB

Posted by: littlemog on Nov. 9, 2009

So I have this database at work of people who subscribed to our mailing lists + some other dump from our master database, and I really didn't want to deal with SQL queries. Besides, CouchDB's document schema deal works better with modeling these people records as 'business profile' like entities... ok and I just wanted to relax and all... so, here's what I did...

Firstly, as a noob developer, I decided that I'll always start by looking for a nice library to do the heavy lifting. But it was a hard route because I took the first step with couchdb-python, which unfortunately have bugs related to httplib that's yet to be ironed out when I took it on. So code code, and then fail. But here's where a window opens when a door closes - Couchdbkit. Lovely APIs that make me smile at night.

So armed with the latest trunk of CouchDB (0.11.bxxx at this point), Couchdbkit and Python (+SQLAlchemy and some other goodies - like I said, I really hate SQL), I wrote the following:

# Sleep_on_couch
# -
# To convert current data to JSON and dump into a CouchDB database


# API for Python: couchdb-python
from couchdbkit import *
from couchdbkit.loaders import FileSystemDocsLoader
from uuid import uuid4
import datetime
import time
import csv

# sqlite to couch dump code start

from sqlalchemy import *
from sqlalchemy.orm import mapper
import os
import unicodedata

def sqlite_to_couch_dump():
  
  # create engine from current db file
  engine = create_engine('sqlite:///db/raw2.db', echo=False)

  # table reflection
  metadata = MetaData(bind=engine, reflect=True)
  subscribers_table = metadata.tables['fbm_enews']
  
  print "Getting ready to retrieve users from Central DB(SQLite)..."
  dummy = raw_input("Press [Enter] to begin ...")
  emails_list = []
  stmt = select([subscribers_table.c.Email])
  for email_add in stmt.execute():
	emails_list.append(email_add)
  #big_n = 0
  for email in emails_list:
      email = email[0].encode()
      stmt = subscribers_table.select(whereclause=text('fbm_enews."Email"="%s"' % str(email)))
      for row in stmt.execute().fetchall():
	#big_n += 1
	subscriber = []
	for col in row:
	    subscriber.append(str(col))
	print subscriber
	first_name = subscriber[1]
	last_name = subscriber[2]
	email_add = subscriber[3]
	cws = subscriber[4]
	cwm = subscriber[5]
	cio = subscriber[6]
	mis = subscriber[7]
	mps = subscriber[8]
	mgc = subscriber[9]
	mfg = subscriber[10]
	mcc = subscriber[11]
	user_create(first_name,last_name,email_add,cws,cwm,cio,mis,mps,mgc,mfg,mcc)

# sqlite to couch dump code ended

At this point, being a noob developer, I'll like to just apologise if the above is not Pythonic or Zen - my aim these days is just to get things working. (Feel free to advise me by leaving comments!) The aim here in this function is to extract the data from a SQLite table (or easily any other SQL db table) and push the records to another function user_create() that'll recreate them as CouchDB documents. (It's flawed at this point because I actually ended up remodelling the data later on to fit CouchDB's document philosophy but you get the idea.)

Next, for user_create():

def user_create(first_name,last_name,email_add,cws,cwm,cio,mis,mps,mgc,mfg,mcc):
  #server_connect()
  server = Server()
  try:
    # Try access existing db
    db = server.get_or_create_db('central_db_ver1_1_1')
    print 'Connected to Central DB version 1.11 on Couch'
    # associate with db
    contain(db, Subscriber)
  except ResourceNotFound:
    print "Database not found, passing by then..."
    pass
    #db = server.create('central_db_ver1_1') # create db if not yet created
  finally:
    print 'Database connection/checking process ended.'
  # Create dummy users
  print "Subscriber data to be added: %s %s %s %s %s %s %s %s %s %s %s" % (first_name,last_name,email_add,cws,cwm,cio,mis,mps,mgc,mfg,mcc)
  subscriber = Subscriber()
  subscriber.first_name = first_name
  subscriber.last_name = last_name
  subscriber.email_address = email_add
  subscriber.enews_CWS = cws
  subscriber.enews_CWM = cwm
  subscriber.enews_CIO = cio
  subscriber.enews_MIS = mis
  subscriber.enews_MPS = mps
  subscriber.enews_MGC = mgc
  subscriber.enews_MFG = mfg
  subscriber.enews_MCC = mcc
  subscriber.date_added = datetime.datetime.now()
  subscriber.remarks = ""
  print subscriber
  db = server.get_or_create_db('central_db_ver1_1_1')
  subscriber.set_db(db)
  subscriber.store()
  #print "derivative from class Subscriber done."
  return

Couchdbkit has a Document class that I extended from to model the 'user' in my database as Subscriber():

# Create Subscriber class (document class)
class Subscriber(Document):
      first_name = StringProperty()
      last_name = StringProperty()
      email_address = StringProperty()
      enews_CWS = StringProperty()
      enews_CWM = StringProperty()
      enews_CIO = StringProperty()
      enews_MIS = StringProperty()
      enews_MPS = StringProperty()
      enews_MGC = StringProperty()
      enews_MFG = StringProperty()
      enews_MCC = StringProperty()
      remarks = StringProperty()
      date_unsubscribed = StringProperty()
      date_added = DateTimeProperty()
      date_updated = StringProperty()

While CouchDB is schema-less, it helps to enforce some form of data integrity checking at this point and Couchdbkit comes thankfully with these schema for fields. And that's it... took me 40 minutes to do a dump of about 36000 records+ with this to CouchDB (where it used to take me 9 hours to dump from CSV to SQLite) on my slowball netbook while I took a shower.

I then wrapped up with some other helper functions for map/reduce views creation as well as dumping query results to CSV. And we're just starting!

Comments on This Post:

Please Login (or Sign Up) to leave a comment