InfoSphere Change Data Delivery combines legacy products from two different IBM acquisitions: DataMirror Transformation Server (now InfoSphere Change Data Capture) and Ascential DataStage (now InfoSphere DataStage). A recent customer engagement highlights the benefits, and a few challenges, of this hybrid replication/ETL (extract, transform, load) combination.
The B2B eCommerce business problem: a large, privately held regional retailer wanted to gather data from each of their retail outlets into a single web presence. This “bricks and clicks” approach would enable just-in-time delivery of their products from either a warehouse or the nearest store to the businesses that install the products for end-customers. Each store has a small System i to handle local transactions, and there are hundreds of stores. Standard replication from each of the stores to a single website repository would have been unwieldy at best; coupled with the astronomical licensing charges for replication software at each of the stores, direct replication was a technical and financial non-starter.
The IBM Change Data Delivery (CDD) offering makes local replication “free”. InfoSphere Change Data Capture (ICDC) software can be installed on each of the (in this case) System i store transaction servers at no cost. Data is replicated from journaled application tables/physical files to Consistent Change Data (CCD) tables on the same server. Then DataStage reaches up periodically from a mid-tier server to gather the data from the CCD tables and populate the web application database – in this case MySQL. Once the data is loaded into MySQL from a CCD table, the CCD table is pruned.
We employ ICDC Live Audit replication, so an application table INSERT, followed immediately by an UPDATE to the same row, followed by a DELETE of that row, records a distinct row in the CCD table for each operation: one for the INSERT, two for the UPDATE (its before and after images), and one for the DELETE. Each row is identified by a timestamp and an “entity type” (e.g. INSERT, UPDATE) describing the activity. DataStage reads the Live Audit records and makes the appropriate change to the target MySQL database tables.
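To make the examples that follow easier to read, here is a minimal sketch of what one of these CCD tables might look like. The application columns (MFRCODE, LINECODE) and audit columns (AUD_TS, AUD_ENTTYP) match the SELECT shown later; the exact data types, and any additional journal control columns ICDC generates, are assumptions.

-- Hypothetical sketch of a Live Audit CCD table (ICDC generates the real DDL)
CREATE TABLE CCDLIB.CDECFXREF (
    MFRCODE     CHAR(3)    NOT NULL,   -- application column
    LINECODE    CHAR(3)    NOT NULL,   -- application column
    AUD_TS      TIMESTAMP  NOT NULL,   -- Live Audit journal timestamp
    AUD_ENTTYP  CHAR(2)    NOT NULL    -- Live Audit entity type (see the list below)
);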
There are 17 application tables of interest: 10 on the warehouse System i and 7 on each of the store System i servers. So there are 17 DataStage jobs. Each DataStage job has four Stages or nodes:
- the first ODBC stage that uses an SQL SELECT to read from a particular System i CCD table
- a Transform stage to place each of the Live Audit row types (i.e. INSERT, UPDATE, DELETE) on one of three links to a second ODBC stage
- the second ODBC stage with the appropriate MySQL Data Manipulation Language (DML) statement (i.e. INSERT, UPDATE, or DELETE) for each link (sketched just after this list)
- a third and final ODBC stage branching from the Transform stage to prune the CCD table
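As a rough illustration of what the second ODBC stage issues on each link, the DML might look like the following. The MySQL table and column names are assumptions for illustration; the ? markers are bound from the link columns, and the UPDATE’s WHERE clause uses the before-image key values discussed below.

-- Hypothetical DML for the three links (MySQL table/column names assumed)
INSERT INTO cfxref (mfrcode, linecode) VALUES (?, ?);                             -- INSERT link
UPDATE cfxref SET mfrcode = ?, linecode = ? WHERE mfrcode = ? AND linecode = ?;   -- UPDATE link
DELETE FROM cfxref WHERE mfrcode = ? AND linecode = ?;                            -- DELETE link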
All of the DataStage jobs have exactly the same structure. They had to be exceedingly efficient because we execute them so frequently (roughly one per second).
The entity types in the ICDC Live Audit CCD tables are as follows:
- RR – an INSERT record caused by the refresh of a table in a subscription
- PT – a regular application INSERT
- PX – an extra INSERT (found only on the System i)
- UB – before image of an UPDATE
- UP – after image of an UPDATE
- DL – a DELETE
There are also two Live Audit control records, RS (which signals a full refresh) and CR (which is equivalent to a TRUNCATE), both of which I was able to ignore.
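To illustrate with made-up values: an UPDATE that changes a row’s MFRCODE from AAA to BBB lands in the CCD table as a UB/UP pair sharing the same timestamp, which is what makes the self-join in the SELECT below possible:

MFRCODE  LINECODE  AUD_TS                      AUD_ENTTYP
AAA      001       2011-06-01-10.15.30.123456  UB
BBB      001       2011-06-01-10.15.30.123456  UP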
The “secret sauce” is the SQL SELECT statement used to gather the data from each CCD table on the System i servers. There are several reasons for this.
- As much work as possible needed to be done on the System i servers, which are polled individually by DataStage every five minutes. DataStage, on the other hand, had to run approximately one job per second.
- The System i transaction processing application allows updates to Primary Key fields. As such, I needed to know both the before and after key values for an UPDATE, yet I didn’t want to code logic in a Transform stage to compare consecutive UB and UP records.
- I needed to ensure that all of the rows were gathered and sent in sequence, so that an INSERT followed immediately by an UPDATE of the same row arrived in MySQL in the proper order.
- Finally, I needed a way to ensure the pruning process did not purge rows from the CCD tables that arrived after the data was gathered by DataStage in the initial SQL SELECT.
So here’s an example:
-- Section 1: INSERTs and DELETEs, taken as-is from the CCD table
SELECT CDECFXREF.MFRCODE AS MFRCODE_A,
       CDECFXREF.LINECODE AS LINECODE_A,
       CAST(NULL AS CHAR(3)) AS MFRCODE_B,
       CAST(NULL AS CHAR(3)) AS LINECODE_B,
       CDECFXREF.AUD_TS AS AUD_TS,
       CAST(NULL AS CHAR(26)) AS MAX_TS,
       CDECFXREF.AUD_ENTTYP AS AUD_ENTTYP
FROM
       CCDLIB.CDECFXREF AS CDECFXREF
WHERE
       CDECFXREF.AUD_ENTTYP IN ( 'RR', 'PT', 'PX', 'DL' )
UNION ALL
-- Section 2: self-join pairing each UPDATE's before (UB) and after (UP)
-- images on their identical timestamps
SELECT A.MFRCODE AS MFRCODE_A,
       A.LINECODE AS LINECODE_A,
       B.MFRCODE AS MFRCODE_B,
       B.LINECODE AS LINECODE_B,
       A.AUD_TS AS AUD_TS,
       CAST(NULL AS CHAR(26)) AS MAX_TS,
       A.AUD_ENTTYP AS AUD_ENTTYP
FROM
       CCDLIB.CDECFXREF AS B,
       CCDLIB.CDECFXREF AS A
WHERE
       B.AUD_TS = A.AUD_TS
       AND B.AUD_ENTTYP = 'UB'
       AND A.AUD_ENTTYP = 'UP'
UNION ALL
-- Section 3: a single 'TS' control record carrying the maximum timestamp
-- in the CCD table at the time of the SELECT
SELECT CAST(NULL AS CHAR(3)) AS MFRCODE_A,
       CAST(NULL AS CHAR(3)) AS LINECODE_A,
       CAST(NULL AS CHAR(3)) AS MFRCODE_B,
       CAST(NULL AS CHAR(3)) AS LINECODE_B,
       TIMESTAMP('9999-12-31-23.59.59.999999') AS AUD_TS,
       T.MAX_TS AS MAX_TS,
       CHAR('TS') AS AUD_ENTTYP
FROM
       ( SELECT CHAR(MAX(AUD_TS)) AS MAX_TS FROM CCDLIB.CDECFXREF ) AS T
ORDER BY 5 ASC
There are three major parts here:
- The first section SELECTs the rows from the CCD table that are either INSERTs or DELETEs. The before-image key values are NULLs. Then come, as in the other two sections, the Live Audit timestamp, another NULL (more on that below), and the Live Audit entity type.
- The second section uses a self-join to get both the after-image values for an UPDATE and the before-image Primary Key values in the same row. Note that the same table is named twice in the FROM clause but given different aliases: B for before and A for after. The table can be joined to itself this way because the Live Audit timestamps of the before and after images are identical to the microsecond.
- The final section delivers the maximum timestamp found in the CCD table at the time of the SELECT, using a nested table expression (the parenthetical SELECT statement in the FROM clause). I create a dummy Live Audit timestamp of the maximum value DB2 allows (December 31, 9999) so that this row collates last.
The last piece is basically a nod to my earliest days as a COBOL programmer: it serves as a “control record”. The answer set is ordered by the Live Audit timestamp (including my 9999-12-31 dummy timestamp) so that an INSERT arrives at the target MySQL database before an UPDATE of the same row. Once all of the Live Audit records are successfully passed to MySQL, the last DataStage node (the third ODBC stage) is invoked to prune the System i CCD table of all rows with a timestamp less than or equal to the maximum Live Audit timestamp gathered at the time of the SELECT.
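The prune itself is a single statement. Here is a minimal sketch of what the third ODBC stage might issue, with the MAX_TS value from the control record bound to the parameter marker (the exact statement in the job is an assumption):

-- Hypothetical prune statement; MAX_TS from the 'TS' control record is bound to ?
DELETE FROM CCDLIB.CDECFXREF
WHERE AUD_TS <= TIMESTAMP(?)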
Many thanks to Pete Stroback and Ernie Ostic of IBM and the folks who monitor and respond on the www.dsxchange.com website for their help in putting this solution together.