How to Handle the “V” for Volume in Big Data on DB2 for z/OS

Posted by Frank Fillmore on March 14, 2013 under Big Data, DB2 for z/OS, IDAA, Netezza. Tags: , .

One of the challenges of Big Data is, well, its bigness.  There are two approaches you should consider – one fairly new, the other not yet delivered – for your data volume issues on DB2 for z/OS.

The first technique exploits the IBM DB2 Analytics Accelerator (IDAA).  IDAA is a Netezza (now PureData System for Analytics) appliance lashed together with DB2 for z/OS using a high-speed connection.  The same data typically resides in both DB2 and Netezza.  The DB2 optimizer determines if a particular dynamic query can be addressed at a lower cost by offloading that query to Netezza.

Until November 2012, all of the data needed to be present on both Netezza and DB2.  IDAA version 3 removes that restriction.  The attached presentation describes offloading older data to Netezza and eliminating it from DB2.  In a common use case the most current data (say, the last 30 days) remains in DB2, while older data (days 31 through the preceding 2 years) resides *only* in Netezza.  This is called the High Performance Storage Saver (HPSS).  IDAA HPSS

Based on the cost of mainframe DASD storage and other factors, it might be less expensive to store stale data in a “hot” back-end like the Netezza part of an IDAA than to continue to store that data natively in DB2 for z/OS.  From an API standpoint, you run your queries as if all of the data resides locally in DB2 for z/OS.  That way you can change which data reside in each of the IDAA components without impacting applications or changing SQL.
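One common way to steer dynamic queries to the accelerator is the CURRENT QUERY ACCELERATION special register (available with the IDAA-enabling DB2 maintenance); the query text itself stays the same wherever the rows live.  A minimal sketch, with a hypothetical table and predicate:

    -- Allow DB2 to offload eligible dynamic queries to the accelerator.
    SET CURRENT QUERY ACCELERATION = ENABLE;

    -- The query is written as if everything were local to DB2 for z/OS;
    -- with HPSS, some or all of these rows may exist only on the Netezza side.
    SELECT claim_id, SUM(paid_amt) AS total_paid
      FROM claims.claim_history               -- hypothetical table
     WHERE loss_date BETWEEN '2011-01-01' AND '2012-12-31'
     GROUP BY claim_id;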

The second technique exploits temporal data functionality delivered in DB2 10 for z/OS.  Temporal tables enable you to see values for attributes of a particular business object (account, part, employee, customer, etc.) at a specific point-in-time (PIT).  As SQL UPDATEs are processed against a transaction table, deltas are recorded in a history table associated with the transaction table.  That way you can issue a query like:

SELECT coverage_amt FROM policy FOR SYSTEM_TIME AS OF '2010-12-01' WHERE id = 111

and see the coverage_amt at a point-in-time regardless of any intervening changes.  DB2 11 for z/OS – not yet Generally Available (GA), but in beta with Early Support Program (ESP) customers – builds on this technology by extending it to “archive” data.  The use case is the same as the one described above: for processing efficiency you want the last 30 days of transaction data to reside in one DB2 table and older data (days 31 to the preceding 2 years) to be stored in another.  Think insurance claims: most activity regarding a claim occurs within the first 30 days of a loss.  But for business and compliance reasons, you want to retain up to 2 years of data on spinning disks; even older data might be kept on offline storage.  Using this technique, all of the data resides natively in two different DB2 for z/OS tables (current and archive).  The benefit this time is segregating commonly accessed data for processing efficiency.  As was the case with IDAA/HPSS, the location of the data is transparent to the SQL.  You write the query for the data you want and DB2 determines whether it resides in the current data table, archive table, or both.
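To make the mechanics concrete, here is a minimal sketch of the DB2 10 system-period temporal setup, followed by the analogous DB2 11 current/archive pairing as it has been described for the ESP code.  All table and column names are hypothetical, and the DB2 11 statements in particular are an assumption that could change before GA:

    -- DB2 10: base table with a SYSTEM_TIME period...
    CREATE TABLE policy
      (id           INTEGER       NOT NULL,
       coverage_amt DECIMAL(11,2) NOT NULL,
       sys_start    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
       sys_end      TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
       trans_id     TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
       PERIOD SYSTEM_TIME (sys_start, sys_end));

    -- ...plus a history table with the same columns (not generated).
    CREATE TABLE policy_hist
      (id           INTEGER       NOT NULL,
       coverage_amt DECIMAL(11,2) NOT NULL,
       sys_start    TIMESTAMP(12) NOT NULL,
       sys_end      TIMESTAMP(12) NOT NULL,
       trans_id     TIMESTAMP(12));

    -- From here on, UPDATEs and DELETEs write before-images to policy_hist,
    -- and FOR SYSTEM_TIME queries read both tables transparently.
    ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_hist;

    -- DB2 11 (ESP, syntax subject to change): pair an existing current
    -- table (here a hypothetical CLAIM table) with an archive table;
    -- rows removed from the current table can be captured in the archive
    -- and queried transparently.
    CREATE TABLE claim_arch LIKE claim;
    ALTER TABLE claim ENABLE ARCHIVE USE claim_arch;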

In summary:

                             IDAA/HPSS                    DB2 11 for z/OS
    Available                Today                        At GA (not yet available)
    Data segregated by       DB2 and IDAA                 Current and Archive tables
    Data movement            Bulk load or replication     Via DELETE or REORG DISCARD
    Administrative overhead  Minimal                      None
    Static SQL               No                           Yes
    Value prop               Cost savings                 Processing efficiency

What Your IBM Competitive Sales Specialist Wants You To Know (Recording and Presentation Material)

Posted by Frank Fillmore on March 7, 2013 under DB2 Education, DB2 for Linux Unix Windows, DB2 Migrations, IBM DB2 Services, IBM Information Management Software Sales, IBM Mid Market Customers, InfoSphere, Optim, Oracle. Tags: .

On February 28, 2013 Kim May of The Fillmore Group and Bill Kincaid of IBM collaborated on an excellent DB2 for LUW presentation detailing:

  • administrative tool functionality
  • new features
  • new product bundles that deliver better value at a lower cost

Many thanks to all of those who attended.  If you or a colleague weren’t able to join us, the recording of the webinar can be found on ChannelDB2.  The presentation materials are here:

Mid Atlantic VUG Presentation Bill Kincaid 2.28.13

Kim May named IBM Champion for 2013

Posted by Frank Fillmore on February 27, 2013 under Baltimore Washington DB2 Users Group, DB2 Education, IBM Champion, InfoSphere. Tags: .

The Fillmore Group’s Kim May has been named an IBM Champion for the 5th straight year.  Click here for Kim’s developerWorks profile.  Kim has been especially active in the IBM Information Management (IM) Regional Users Group (RUG) community.

Kim has recently collaborated with IBM Education to deliver the first public Big Data classes on BigInsights and InfoSphere Streams.  These will be appearing on TFG’s 2H13 training schedule.

Hearty, well-earned congratulations to Kim!

Parallel asntdiff Dramatically Improves Performance for Q Rep on z/OS

Posted by Frank Fillmore on February 11, 2013 under DB2 for z/OS, InfoSphere, Q-Replication. Tags: , , .

The asntdiff utility is used to compare the contents of two tables, primarily in the context of SQL and Q Replication.  The problem has been that very large tables can take a very long time to compare.  Moreover, in a dynamic replication environment, the data you are comparing might be changing rapidly.  There have been attempts at mitigating this, primarily by using the asntdiff -f parameter to supply an input file whose SQL queries subset the data on a key value (e.g. account number) or a transaction timestamp, so that a smaller range of rows is compared.  The bottom line is that many customers abandoned using asntdiff against production tables with millions or billions of rows.
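For reference, an -f input file pairs a source query and a target query over the same key range.  The sketch below is illustrative only: the keyword names are quoted from memory and the exact syntax varies by release, so treat it as an assumption to check against the asntdiff documentation.  The essential point is the paired, ORDER BY-ed subsetting queries:

    * Hypothetical input file for asntdiff -f: compare one day of activity only.
    SOURCE_SELECT="SELECT acct_no, balance, last_upd FROM PROD.ACCOUNT
                   WHERE last_upd >= '2013-02-10-00.00.00.000000'
                   ORDER BY acct_no"
    TARGET_SELECT="SELECT acct_no, balance, last_upd FROM COPY.ACCOUNT
                   WHERE last_upd >= '2013-02-10-00.00.00.000000'
                   ORDER BY acct_no"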

Parallel asntdiff provides orders-of-magnitude improvements in performance.  The key differences between the processing of the legacy asntdiff and the parallel version are as follows:

  • Parallel processing: partitioning, difference detection, and reporting all in parallel
  • Compare multiple partitions in parallel (source and target table) and retrieve rows from each partition also in parallel from source and target
  • Minimize traffic (only send data needed)
  • Exploit computational capabilities of systems (checksum calculations, etc.)
  • Use a Stored Procedure (ASNTDSP) which runs at source and target systems
  • Avoid storing intermediate results

The results are dramatic.  A European bank killed a legacy asntdiff compare utility run after 17 hours; it might have run for 30.  Parallel asntdiff completed in 5 minutes.

There are some restrictions and advisories:

  1. The source and target tables must have the same code page and collation sequence.
  2. For optimal performance, increase the system resource limit for application threads and set NUMTHREADS to 21.  The minimum requirement is 6 threads.
  3. So that asntdiff can determine the NUMBLOCKS parameter automatically, grant the SELECT privilege on the following catalog tables: SYSIBM.SYSTABLES, SYSIBM.SYSKEYS and SYSIBM.SYSINDEXES (see the sketch below).
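A minimal sketch of the grants from item 3 (the authorization ID ASNTDIFF is hypothetical):

    -- Let the ID running asntdiff read the catalog tables it needs
    -- in order to compute NUMBLOCKS automatically.
    GRANT SELECT ON SYSIBM.SYSTABLES  TO ASNTDIFF;
    GRANT SELECT ON SYSIBM.SYSKEYS    TO ASNTDIFF;
    GRANT SELECT ON SYSIBM.SYSINDEXES TO ASNTDIFF;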

There are others, but the topic is well worth exploring to speed up asntdiff – or to be able to use it at all.

Thanks to Jayanti Mahapatra and Xiao Li of IBM’s Silicon Valley Lab for some of the material presented here.

“Updating Netezza with Data from Q Replication”

Posted by Frank Fillmore on January 2, 2013 under DB2 for Linux Unix Windows, DB2 for z/OS, IDAA, InfoSphere, Netezza, Q-Replication.

In cooperation with IBM, The Fillmore Group performed a proof-of-concept for a large, well-known midwestern retailer in 2011 to demonstrate the use of the IBM InfoSphere Data Replication (IIDR) Event Publisher (EP) feature to feed OLTP data into a Netezza appliance.  There is now a native changed data capture Apply component for Netezza built on technology IBM acquired from DataMirror.  This provides heterogeneous near-real-time OLTP replication to a Netezza appliance.  However – and this is a significant “however” – your shop might already be using Q Replication.  If you are replicating from a DB2 OLTP database, does it make sense to introduce a different replication technology just to keep your Netezza appliance current?

The answer is found in the developerWorks post “Updating Netezza with Data from Q Replication“.  Built on the template developed by The Fillmore Group, IBM has published a Java consumer that provides mini-batches of comma-delimited data files created by IIDR Event Publisher for ingestion into Netezza using native Netezza utilities.  Note that this approach works for both the new PureData System for Analytics and the IBM DB2 Analytics Accelerator (IDAA), both of which are built on Netezza technology.

Kudos to IBMer Cecile Madsen of Silicon Valley Lab for her work on this technique.

DB2 Gold Consultant Perspectives on Large Table Design in DB2 10 for z/OS

Posted by Frank Fillmore on December 30, 2012 under DB2 for z/OS, DB2 Gold Consultants. Tags: .

I have just begun a new engagement working with a large, integrated financial services firm on a project migrating IMS and VSAM data to DB2 10 for z/OS.  There had been a lot of discussion regarding whether to break a multi-billion row set of historical data into separate physical tables using schemas (e.g. SCHEMA1.MY_TAB, SCHEMA2.MY_TAB, …) with each table containing a portion of the data *or* to use range partitioning.  I reached out to my DB2 Gold Consultant colleagues for their opinions and received four detailed replies.
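For context, a minimal sketch of the range-partitioning alternative in DB2 10 for z/OS follows; the table, columns, and partition boundaries are hypothetical:

    -- One logical table; DB2 limits the partitions scanned based on the predicate.
    CREATE TABLE HIST.CLAIM_DETAIL
      (CLAIM_ID  BIGINT        NOT NULL,
       LOSS_DATE DATE          NOT NULL,
       PAID_AMT  DECIMAL(11,2) NOT NULL)
      PARTITION BY RANGE (LOSS_DATE)
        (PARTITION 1 ENDING AT ('2010-12-31'),
         PARTITION 2 ENDING AT ('2011-12-31'),
         PARTITION 3 ENDING AT ('2012-12-31'));

The replies: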

Martin Hubel, DB2 Gold Consultant:
Use range partitioning.  Almost all of the objections and disadvantages to range partitioning have been removed starting in DB2 V4, and continue to be removed in later versions of DB2 up to and including V10.

The use of separate tables is cumbersome, complex, and much harder to maintain.  It is typically people with long memories and short experience who recommend this.

Joel Goldstein, DB2 Gold Consultant:
My first question is – have they re-designed for DB2, or are they trying to port the data to DB2?  If their IMS systems were decently designed, simply running the application under DB2 will cost at least 2x the CPU based on my experiences, and maybe more.

From a performance perspective, (excluding how the applications must actually access and process the data), range partitioning should give the best performance – but the info provided doesn’t give us any idea (i.e. under IMS is most of their access using HIDAM or HDAM?).

Physically separating the tables into subsets by schema and using an application algorithm to determine which tables to access might be a good option, if this can point to ONE table to access.  If multiple tables, performance dies.

An updateable view with many UNIONs for DML that spans subsets is a performance disaster for an OLTP application.

Jürgen Glag, DB2 Gold Consultant:
First design rule: keep the design as simple (that is, as elegant) as possible.  I would never start with the alternative of separate physical tables.

What benefits does the client expect from the solution with separate tables over the partitioned table?  Why does he think that his homegrown solution, which has not been tested for robustness, will be better than a product feature (e.g. DB2 for z/OS range partitioning)?

William Miller, DB2 Gold Consultant:
In my experience, separating tables into subschemas is NOT the best solution.  I saw this used at the Commerzbank.  While it was a good solution for the problem they had back in the early days when DB2 had limits on the size of the table, it is not a good solution now.  There is a lot of unnecessary overhead involved.  When I query this table with this date, what underlying table do I need to access?  And what happens when I join two tables, where do I find the correct data?  These are all questions that have to be answered by fallible human coding.

Today’s technology allows for truly massive numbers of rows in a single table.  Most often we can divide data into ‘active’ and ‘historical’: the active part contains open contracts, the historical part contains closed contracts.  Due to government regulation you can’t throw away data that is no longer active.  But large amounts of historical data make queries slower (more index levels, etc.).

A truly better solution is the v10 functionality of temporal tables.  I used this technique at the Deutsche Bundesbank.  Basically, you have two identical tables that can be accessed by SQL.  Data that is no longer current automagically wanders from the active table to the historical table.  SQL can access all data, or a range of values.  There is no need for application logic to move active data to a history table.  The active data and the historical data can be indexed differently.  So you have a lot of flexibility here.

The problem with this solution is that it is not very adaptable to current applications.  Every company has its own way of historizing data.  And nobody is interested in throwing away a chunk of application logic and retesting everything anew.  It is impossible to get a company to rewrite a corner of the application that just works.  But from what I read, you are in a way starting from scratch.  You can design temporal data into your solution.

IBM IM Sellers: Partner to Win with The Fillmore Group in 4Q12

Posted by Frank Fillmore on November 28, 2012 under DB2 for Linux Unix Windows, DB2 Gold Consultants, DB2 Migrations, IBM Champion, IBM DB2 Services, IBM Information Management Software Sales, IBM Mid Market Customers, Information on Demand Conference, Oracle. Tags: , .

The fourth quarter of 2012 will close in just over one month.  What will you close between now and then?  If you attended the Break Free Forum at the IBM Information on Demand (IOD) conference in Las Vegas last month, you heard The Fillmore Group’s founder and principal, Frank Fillmore, give real-world perspectives on successful Oracle to DB2 migrations – including a project at a worldwide Top 50 integrated financial services firm.

Listen to a brief ten-minute podcast discussion between Frank Fillmore and Kim May, TFG Vice President of Business Development, that provides an overview of the steps involved in a successful migration – from initial validation to production cutover.

Oracle Takeout – Partner with TFG

If you have prospects that can benefit from these perspectives, contact Kim at kim.may@thefillmoregroup.com or (001) 443-956-0288.

IBM PureData Systems @ #ibmiod

Posted by Frank Fillmore on October 22, 2012 under DB2 for Linux Unix Windows, IBM Smart Analytics System, Information on Demand Conference, Netezza. Tags: .

Announced on October 8, 2012, IBM PureData Systems were heavily featured in the Information on Demand conference opening keynote this morning.  There are three members of the PureData family:

  • PureData System for Transactions (PDST)
  • PureData System for Operational Analytics (PDSOA)
  • PureData System for Analytics (PDSA)

I’m using the abbreviations for convenience and to save keystrokes.

PDST is built on IBM’s pureScale technology; pureScale is the shared-disk cluster technology for DB2 running on Power and Intel processors.  IBM modeled pureScale on the DB2 for z/OS Parallel Sysplex data sharing and fault tolerance that mainframers have been using for years.  The relatively slow adoption of pureScale has been a puzzlement.  High-availability database solutions are being adopted by government and industry where the results of even a brief outage can range from bad for business to catastrophic.  One inhibitor to adoption has been removed: the InfiniBand requirement for cluster communications.  The PureData System for Transactions uses a simple Ethernet connection to lash together the nodes in the cluster.

PDSOA is the next generation of the IBM Smart Analytics System (ISAS).  ISAS was arguably IBM’s first data “appliance” – a preconfigured, balanced processor/memory/storage stack bundled with InfoSphere Warehouse.  InfoSphere Warehouse employs the shared-nothing MPP functionality formerly known as the Database Partitioning Feature (DPF).  One prospective use of PDSOA would be near-real-time fraud detection.

PDSA is Netezza running under the covers.  Using innovative function-to-data Field-Programmable Gate Array (FPGA) integrated circuits, Netezza achieves orders-of-magnitude greater query performance for deep analytics like market-basket analysis.

The goal of the entire PureSystems portfolio – which includes offerings for WebSphere and custom applications – is to accelerate time-to-value and reduce or eliminate the repetitive, mundane tasks involved in infrastructure deployment.  IBM says that their PureData Systems go from door-stop to “load ready” in 4 hours.  Enterprises that provision database servers regularly commit weeks to these tasks.

One telling anecdote: when the PureSystems plan was presented to then IBM CEO Sam Palmisano, Sam’s reply was that the team was “building the AS/400”, a rock-solid platform known for its integration (hence the “i” in iSeries and System i), simplicity, and ease of maintenance and use.  Indeed, two of the PureData servers are manufactured in IBM’s Rochester, MN lab, home of the AS/400.  PureSystems are nicknamed “AS/500”.

There’s lots more detail to come on the PureData variants.  Think of this as a cheat sheet for telling the siblings apart.

“Introducing Query Capture and Replay” Webinar Collateral

Posted by Frank Fillmore on October 17, 2012 under DB2 for Linux Unix Windows, InfoSphere, Netezza, Optim, Oracle, SQL Tuning. Tags: , , .

As promised, the replay of the “Introducing [InfoSphere Optim] Query Capture and Replay” (QCR) webinar delivered on October 16, 2012 is on ChannelDB2.  The presentation materials for the webinar are here: Query Capture and Replay Webinar 2012-10-16.

For assistance implementing QCR: InfoSphere Optim Query Capture and Replay

InfoSphere Optim Query Capture and Replay – What’s the value?

Posted by Frank Fillmore on August 8, 2012 under DB2 for Linux Unix Windows, DB2 for z/OS, InfoSphere, Optim, Oracle. Tags: , .

Database recovery logs have been around for a long time.  They support rolling back transactions that abend, preserving the logical unit of work (LUW).  They also provide roll-forward recovery so that if a database crashes, a restored full backup can be supplemented by all of the transactions, in chronological sequence, that have occurred since the last backup was taken.

For many years now, clever engineers have been using this record of transactions for other purposes.  The most prominent is changed data capture: the technique of reading the database recovery log and replaying the transactions on another server to, say, feed a data warehouse or load-balance between two transaction servers.

IBM has packaged its Guardium technology to provide a new way to capture in-flight SQL statements: InfoSphere Optim Query Capture and Replay (IOQCR).  Not the most lyrical of names, but the idea is sound.  Rather than exploit the database recovery log, IOQCR embeds a Guardium S-TAP in a database server to gather the SQL as it occurs.  This SQL can then be replayed in another environment.  The use case is to take SQL running in a production system and rerun it against Performance, QA, or Test/Dev database servers.

In May I participated in the IOQCR beta and installed it in a small DB2 9 for Windows server.  The purpose was to evaluate basic functionality and usability.  The IBM development team graciously assisted me with the installation and configuration.  It took a few tries – remember, this was an early beta – but I was able to successfully use all of the features.  SQL statements executing on a database server were captured and replayed flawlessly on another.

Rather than drive transaction load through scripting or third-party products, you can use IOQCR to validate database configuration changes or even the viability of running an SQL workload developed for one database server (we’re looking at you, Oracle) on a different vendor platform entirely.  This is a nifty way to test DB2 9 support of Oracle APIs using your own production PL/SQL code – especially dynamic SQL generated on the fly or by an IDE.  DB2, Informix, Netezza, Oracle, Teradata, SQL Server, Sybase, and MySQL are all supported in v1.1, announced on July 10th.

There’s a demonstration on developerWorks.  The Announcement Letter is also useful.