Feeding a Netezza Warehouse

Posted by Frank Fillmore on March 9, 2011 under InfoSphere, Netezza, Q-Replication.

By now you’ve probably heard that IBM has acquired Netezza.  So how do I get my data into the darn thing?

We’ve worked a lot with IBM change data capture technologies: SQL Replication, Q Replication, and InfoSphere Change Data Capture (ICDC).  We also work with InfoSphere Warehouse, IBM’s internally developed data warehousing platform.  Usually we use one of the former to feed the latter.  As changes occur in a transactional database, the deltas are shipped in near real time to the target data warehouse.  There might be some transformation, cleansing, or other manipulation (that’s the “T” in “ETL”) for which we would use DataStage, but a (surprisingly large) number of customers just copy data to the reporting and analysis platform with little massaging.

Enter Netezza.  The usual “drip-feed” methods listed above (i.e. as soon as a row changes in the source, send it to and update the target) don’t play as well here.  Netezza is optimized for bulk data loading; one-row-at-a-time INSERT/UPDATE/DELETE propagation by a replication technology isn’t efficient.  The solution is to create intermediate mini-bulk delimited files that can be ingested, say, every five minutes.
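
To make this concrete, here is a minimal sketch in Python of the mini-bulk pattern.  Everything specific in it is an assumption for illustration: the staging directory, the SALESDW database and ORDERS_STG table, and the exact nzload flags, which you should verify against your Netezza client documentation.

    import csv
    import subprocess
    import time
    from pathlib import Path

    BATCH_DIR = Path("/stage/netezza")   # staging area for mini-bulk files (assumption)
    INTERVAL_SECS = 300                  # ship a file every five minutes

    def write_batch(rows, batch_no):
        # Write one pipe-delimited mini-bulk file from the accumulated deltas.
        path = BATCH_DIR / "deltas_{:08d}.txt".format(batch_no)
        with path.open("w", newline="") as f:
            csv.writer(f, delimiter="|").writerows(rows)
        return path

    def load_batch(path):
        # Hand the file to nzload, Netezza's bulk loader (flag names assumed).
        subprocess.run(
            ["nzload", "-db", "SALESDW", "-t", "ORDERS_STG",
             "-df", str(path), "-delim", "|"],
            check=True,
        )

    def run(change_feed):
        # change_feed: any iterator of row tuples coming out of your CDC product.
        batch, batch_no = [], 0
        deadline = time.time() + INTERVAL_SECS
        for row in change_feed:
            batch.append(row)
            if time.time() >= deadline and batch:
                load_batch(write_batch(batch, batch_no))
                batch, batch_no = [], batch_no + 1
                deadline = time.time() + INTERVAL_SECS

The point is simply that the replication feed is decoupled from the load: Netezza sees one bulk load per interval instead of a stream of single-row DML.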

There are a couple of ways to create these files.  One is to use InfoSphere Data Event Publisher (EP).  Think of EP as Q Replication without the Apply component that posts the deltas to a target database.  EP can publish XML or delimited files.  Voilà!  Our Netezza problem is solved.  If you need complex business rules applied to transform the data, use ICDC or Q Replication to feed DataStage and have it create the delimited files.
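
On the apply side, once EP (or DataStage) has written a delimited file, one way to ingest it is an INSERT from a Netezza external table.  The sketch below runs that statement through pyodbc; the DSN name, target table, file path, and external-table options are all placeholders to adapt.

    import pyodbc  # assumes a Netezza ODBC driver and a DSN named "NZSQL" are configured

    def apply_delta_file(path, table="ORDERS_STG", dsn="NZSQL"):
        # Ingest one published delimited file via a transient external table.
        sql = ("INSERT INTO {t} SELECT * FROM EXTERNAL '{p}' "
               "USING (DELIMITER '|' MAXERRORS 10)").format(t=table, p=path)
        conn = pyodbc.connect("DSN={}".format(dsn), autocommit=True)
        try:
            conn.execute(sql)
        finally:
            conn.close()

    apply_delta_file("/stage/netezza/deltas_00000001.txt")

This is exactly the kind of step a scheduler, or the file-watching side of an ETL job, would run every few minutes against the newest published file.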

3 Comments so far

  1. David Tolleson March 17, 2011 9:17 am

    Hey, Frank,

    Speaking of InfoSphere Data Event Publisher, I get a lot of questions about when people need to buy it and when they don’t. I created a blog post to answer this on ibm.com’s developerWorks:

    https://www.ibm.com/developerworks/mydeveloperworks/blogs/a82063fa-aecf-4716-81a4-8d33425a8735/entry/do_i_need_to_buy_data_event_publisher_for_db214?lang=en

    thanks,
    david

  2. Tom Boesch July 7, 2011 6:20 am

    We have moved close to 1 billion rows of data using Q Replication within a 24-hour period. People need to understand commit counts and unit-of-work sizes to accommodate large amounts of data.
    We replicate a 28 billion row database with Q Replication, have converted most unlogged loads to logged loads with a 500-row commit count, and can handle the workload with Q Replication from DB2 to Oracle.
    We also use Event Publishing, but not as described above.
    Since we started capturing statistics on 1/1/2010 we have replicated over 9 billion rows through the Q Replication process. We do use an ETL tool to populate the target, and on special occasions we use ETL tools to copy very large volumes, but Q Replication can move lots of data very fast.

    Just my 2 cents.

  3. Glenn Steffler February 23, 2012 12:31 pm

    IBM released InfoSphere Data Replication for Netezza (10.1.2).

    This product has a native apply for Netezza that uses high-performance batch techniques and requires no programming or customization such as ETL.

    https://www.ibm.com/developerworks/mydeveloperworks/blogs/a9b542e4-7c66-4cf3-8f7b-8a37a4fdef0c/entry/data_replication_s_enhanced_apply_for_netezza2?lang=en
