Greetings after a little while.
For those of you who have been working with IBM’s data virtualization technology – as I have – since the beginning (we miss you DataJoiner, Federation Server, et al), there’s yet another new name: Watson Query a service of Cloud Pak for Data (CP4D).
DataJoiner began as a research project (code name: “Garlic”) led by then-IBMer Nelson Mattos. The goal was to provide heterogeneous data access via a Db2 for LUW “gateway”. Enterprises in the 1980s began deploying lots of discrete data repositories (primarily Oracle, Sybase, and MS SQL Server along with Db2) to serve line-of-business and departmental applications (e.g. sales, accounting, finance, manufacturing, logistics). The ability to access data from multiple disparate servers to provide a comprehensive view of business processes was a significant advance. DataJoiner was not limited to RDBMS repositories, but could also retrieve data from MS Access, MS Excel, CSV and Text files… pretty much any data on persistent storage with a definable structure.
For a while DataJoiner also provided increased performance and reduced expense benefits as well. IBM sold DataJoiner as a front-end to Oracle because (at least in some cases) a query using the cost-based Db2 optimizer embedded in DataJoiner produced a better access path than the, at that time, rules-based Oracle optimizer. Expenses were reduced due to the per-seat Oracle license charges; DataJoiner was just “one seat” which could support hundreds of concurrent users. Over time Oracle has modified both their optimizer and licensing model.
For those of you who geek-out on the roots of the technology that we apply to common data administration challenges in our day jobs – as I do – you’ll find a seminal paper on the IEEE website: DataJoiner: a practical approach to multi-database access
So why the history lesson?
- At its roots Watson Query continues to use much of the original federation “plumbing”: Wrappers, Servers, Nicknames, etc.
- Watson Query no longer relies on point-to-point connections to the back-end data repositories, but exploits a “computational mesh” which provides increased performance and resiliency. This approach leverages third-party data repository vendor advances in parallelism, caching, and compression. All of this is of enormous importance when enterprise data is distributed not just across multiple in-house geographically disparate locations, but perhaps across multiple cloud vendors as well.
- BUT (there’s usually a but), due to engineering and design within CP4D some of the capabilities in IBM’s legacy federation stack have not yet been incorporated into Watson Query. Two are of particular interest – which is why I have posted two Ideas (the successor to IBM Request for Enhancement – RFE):
- Formerly for data repositories with Indexes that metadata would automatically be captured for a Remote Table in, say, Oracle and propagated back to the Db2 SYSCAT.INDEXES catalog view for use by the Db2 optimizer to help build the best access path for the Remote Table represented by the Nickname. Right now that’s not being done. Hence Idea ASQL-I-16 “Propagation of Indexes for Virtualized Tables to SYSCAT.INDEXES”.
- Federated queries have always used Db2’s SQL dialect. There is a PASSTHRU capability that enables users to employ the native SQL dialect of the back-end data repository. For certain queries this can produce significant performance benefits. See Idea ASQL-I-17 “Federated PASSTHRU Capability in Watson Query (WQ)”.
Please review both of these Ideas and vote!
https://ibm-data-and-ai.ideas.ibm.com/ideas/ASQL-I-16
https://ibm-data-and-ai.ideas.ibm.com/ideas/ASQL-I-17
Thanks in advance.
#IBM #WatsonQuery #DataVirtualization #Federation