So I have this database at work of people who subscribed to our mailing lists + some other dump from our master database, and I really didn't want to deal with SQL queries. Besides, CouchDB's document schema deal works better with modeling these people records as 'business profile' like entities... ok and I just wanted to relax and all... so, here's what I did...
Firstly, as a noob developer, I decided that I'll always start by looking for a nice library to do the heavy lifting. But it was a hard route because I took the first step with couchdb-python, which unfortunately have bugs related to httplib that's yet to be ironed out when I took it on. So code code, and then fail. But here's where a window opens when a door closes - Couchdbkit. Lovely APIs that make me smile at night.
So armed with the latest trunk of CouchDB (0.11.bxxx at this point), Couchdbkit and Python (+SQLAlchemy and some other goodies - like I said, I really hate SQL), I wrote the following:
# Sleep_on_couch
# -
# To convert current data to JSON and dump into a CouchDB database
# API for Python: couchdb-python
from couchdbkit import *
from couchdbkit.loaders import FileSystemDocsLoader
from uuid import uuid4
import datetime
import time
import csv
# sqlite to couch dump code start
from sqlalchemy import *
from sqlalchemy.orm import mapper
import os
import unicodedata
def sqlite_to_couch_dump():
# create engine from current db file
engine = create_engine('sqlite:///db/raw2.db', echo=False)
# table reflection
metadata = MetaData(bind=engine, reflect=True)
subscribers_table = metadata.tables['fbm_enews']
print "Getting ready to retrieve users from Central DB(SQLite)..."
dummy = raw_input("Press [Enter] to begin ...")
emails_list = []
stmt = select([subscribers_table.c.Email])
for email_add in stmt.execute():
emails_list.append(email_add)
#big_n = 0
for email in emails_list:
email = email[0].encode()
stmt = subscribers_table.select(whereclause=text('fbm_enews."Email"="%s"' % str(email)))
for row in stmt.execute().fetchall():
#big_n += 1
subscriber = []
for col in row:
subscriber.append(str(col))
print subscriber
first_name = subscriber[1]
last_name = subscriber[2]
email_add = subscriber[3]
cws = subscriber[4]
cwm = subscriber[5]
cio = subscriber[6]
mis = subscriber[7]
mps = subscriber[8]
mgc = subscriber[9]
mfg = subscriber[10]
mcc = subscriber[11]
user_create(first_name,last_name,email_add,cws,cwm,cio,mis,mps,mgc,mfg,mcc)
# sqlite to couch dump code ended
At this point, being a noob developer, I'll like to just apologise if the above is not Pythonic or Zen - my aim these days is just to get things working. (Feel free to advise me by leaving comments!) The aim here in this function is to extract the data from a SQLite table (or easily any other SQL db table) and push the records to another function user_create() that'll recreate them as CouchDB documents. (It's flawed at this point because I actually ended up remodelling the data later on to fit CouchDB's document philosophy but you get the idea.)
Next, for user_create():
def user_create(first_name,last_name,email_add,cws,cwm,cio,mis,mps,mgc,mfg,mcc):
#server_connect()
server = Server()
try:
# Try access existing db
db = server.get_or_create_db('central_db_ver1_1_1')
print 'Connected to Central DB version 1.11 on Couch'
# associate with db
contain(db, Subscriber)
except ResourceNotFound:
print "Database not found, passing by then..."
pass
#db = server.create('central_db_ver1_1') # create db if not yet created
finally:
print 'Database connection/checking process ended.'
# Create dummy users
print "Subscriber data to be added: %s %s %s %s %s %s %s %s %s %s %s" % (first_name,last_name,email_add,cws,cwm,cio,mis,mps,mgc,mfg,mcc)
subscriber = Subscriber()
subscriber.first_name = first_name
subscriber.last_name = last_name
subscriber.email_address = email_add
subscriber.enews_CWS = cws
subscriber.enews_CWM = cwm
subscriber.enews_CIO = cio
subscriber.enews_MIS = mis
subscriber.enews_MPS = mps
subscriber.enews_MGC = mgc
subscriber.enews_MFG = mfg
subscriber.enews_MCC = mcc
subscriber.date_added = datetime.datetime.now()
subscriber.remarks = ""
print subscriber
db = server.get_or_create_db('central_db_ver1_1_1')
subscriber.set_db(db)
subscriber.store()
#print "derivative from class Subscriber done."
return
Couchdbkit has a Document class that I extended from to model the 'user' in my database as Subscriber():
# Create Subscriber class (document class)
class Subscriber(Document):
first_name = StringProperty()
last_name = StringProperty()
email_address = StringProperty()
enews_CWS = StringProperty()
enews_CWM = StringProperty()
enews_CIO = StringProperty()
enews_MIS = StringProperty()
enews_MPS = StringProperty()
enews_MGC = StringProperty()
enews_MFG = StringProperty()
enews_MCC = StringProperty()
remarks = StringProperty()
date_unsubscribed = StringProperty()
date_added = DateTimeProperty()
date_updated = StringProperty()
While CouchDB is schema-less, it helps to enforce some form of data integrity checking at this point and Couchdbkit comes thankfully with these schema for fields. And that's it... took me 40 minutes to do a dump of about 36000 records+ with this to CouchDB (where it used to take me 9 hours to dump from CSV to SQLite) on my slowball netbook while I took a shower.
I then wrapped up with some other helper functions for map/reduce views creation as well as dumping query results to CSV. And we're just starting!
I have a question about how can I put into couchdb database a .csv or .sql data file. Because I'm trying test the tpc-h test on nosql database.
My target is to try to execute the tpc-h test (if it possibile and in the limit of possibility) because I should compare the sql like schema versus Nosql; I understand that the comparison is not exact but I would know and demostrate(in my thesis) when is better nosql schema and why there are an interests from this type of database.
For create this comparison I need to find a way, after create a couchdb document, for load a big data files saved into csv data file. Can you help me with your script?
To be more precise (hopefully) I summarize the steps:
create a couchddb database as like a document import .csv/.sql data file into a document (so to populate a database) is possible? How? try model and to execute the tpc-h queries