SQL to Aid #Db2 for z/OS EBCDIC to Db2 for LUW Unicode Conversion #IBMAnalytics #IBMz

Posted by Frank Fillmore on January 2, 2019 under DB2 for i, DB2 for Linux Unix Windows, DB2 for VSE&VM, DB2 for z/OS, DB2 Gold Consultants, IBM Champion. Tags: , , , .

Dr. Sheldon Cooper of “The Big Bang Theory” regularly records a YouTube video called “Fun with Flags” so I’m going to call this blog post “Fun with SQL”.  Sheldon may be a future Nobel Prize winner, but I’ve been named an IBM Champion for Analytics and System z for 2019. 🙂

The problem statement: a large US government agency is contemplating migrating a packaged application from System z to Linux.  The current and future database repositories are Db2 for z/OS and Db2 for LUW respectively.  As a System z Champion I’ve presented on the continued relevancy and efficacy of the IBM mainframe, but that’s not the focus of this post.  One of the challenges of porting data is the conversion of character encoding from EBCDIC (Extended Binary Coded Decimal Interchange Code) which is the legacy encoding protocol used on IBM mainframes and System i to Unicode.  Db2 for z/OS supports Unicode, but I’m not going to address anything other than EBCDIC encoding at the source and Unicode at the target (Sorry, ASCII).  In brief, EBCDIC represents an 8 bit binary encoding scheme that can represent 256 different characters.  For example: the letter “F” (one of my faves) is represented as 1100 0110 in binary and C6 in hexadecimal. For national language support amid the variety of Single Byte Character Sets (SBCS) used around the world – think the German umlaut (i.e. ä ö ü) – the 256 byte universe of EBCDIC wasn’t nearly enough.  The Unicode solution is to use more than 1 byte to represent some characters.  For example: the “¢” cent sign takes one byte to represent in EBCDIC, but 2 bytes in Unicode.

select length(ebcdic_str('¢')) as EBCDIC, length(unicode_str('¢')) as Unicode from sysibm.sysdummy1;
EBCDIC    UNICODE

     1          2

This problem has manifested itself as an error when moving a CHAR(5) column of data from Db2 for z/OS to Db2 for LUW.  If, say, the pesky “¢” is present in the string and all 5 characters are significant (i.e. no trailing blanks), the string won’t fit in the same column width on the target platform.  Most customers find this out the hard way when the Db2 for LUW LOAD command or IMPORT utility posts errors.  There are two possible solutions:

  • data cleanup (Do we really need the pesky ‘¢’ in a Comments column?)
  • lengthen the target columns to accommodate Unicode expansion

Both can be time-consuming and expensive and will depend on a variety of factors beyond the scope of this blog post.  The purpose here today is to identify the extent of the problem before you actually try to port the data.  I’ve constructed this SQL statement to do just that.

select 'SELECT', rtrim(name)||', HEX('||rtrim(name)||') AS HEX, LENGTH(EBCDIC_STR('||rtrim(name)||')) AS L_EBCDIC, LENGTH(UNICODE_STR('||rtrim(name)||')) AS L_UNICODE FROM', rtrim(tbcreator)||'.'||rtrim(tbname), 'WHERE LENGTH(EBCDIC_STR('||rtrim(name)||')) < LENGTH(UNICODE_STR('||rtrim(name)||')) AND LENGTH(UNICODE_STR('||rtrim(name)||')) >', '( SELECT LENGTH FROM SYSIBM.SYSCOLUMNS WHERE TBCREATOR = '''||rtrim(tbcreator)||''' AND TBNAME = '''||rtrim(tbname)||''' AND NAME = '''||rtrim(name)||''' );' from sysibm.syscolumns where tbname like 'ICM%' and encoding_scheme <> 'E' and coltype in ('CHAR', 'VARCHAR');

So what are we doing here?  The SQL statement above scans the Db2 for z/OS catalog and generates SQL statements like the one below that compares the length of the data in a column represented in EBCDIC with the length of that same data represented in Unicode.  It also compares the length of the Unicode representation of the data with the maximum column length.

SELECT DEFAULTVALUE, HEX(DEFAULTVALUE) AS HEX, LENGTH(EBCDIC_STR(DEFAULTVALUE)) AS L_EBCDIC, LENGTH(UNICODE_STR(DEFAULTVALUE)) AS L_UNICODE FROM XXXX.ICMSTCOMPATTRS WHERE LENGTH(EBCDIC_STR(DEFAULTVALUE)) < LENGTH(UNICODE_STR(DEFAULTVALUE)) AND LENGTH(UNICODE_STR(DEFAULTVALUE)) > ( SELECT LENGTH FROM SYSIBM.SYSCOLUMNS WHERE TBCREATOR = 'XXXX' AND TBNAME = 'ICMSTCOMPATTRS' AND NAME = 'DEFAULTVALUE' );

The columns that are evaluated are chosen by you in the last line of the first SQL statement that generates the list of candidates.  This is where knowledge of the data is crucial.  If, for example, you have a VARCHAR column of a maximum 500 bytes and there’s typically no more than 50 bytes of data, you won’t exceed the maximum length of the column even with a slew of special characters and rampant expansion.  As such you might only be interested in smaller, fixed length columns.  I only chose columns with tables that had a certain name pattern (ICM%) and column types of CHAR and VARCHAR.  My IBM Gold Consultant confrères will point out that the only test that matters is wheter the maximum column length will be exceeded.  The first predicate just determines if any expansion will occur.  But remember: I’m channeling Dr. Sheldon Cooper and err on the side of completeness.  Be sure to test this first in a non-Prod environment before turning it loose on Production.  Also run it during times of light subsystem utilization.  The first, master query runs quickly against the SYSIBM.SYSCOLUMNS table.  The resulting scans will produce a lot of I/O.

Let me know if this has been helpful.  And to all a happy, healthy, and prosperous 2019!

Leave a Comment

Name

Email

Website

Comments

More