Ott-03-0035 Unicode and C Business Functions
What is Unicode?
The Unicode Standard is the universal character encoding standard used for representation of text for computer processing. Unicode provides a unique
number for every character, regardless of the platform, the program, or the language. The Unicode Standard assigns each character a unique numeric value
and name. However, the Unicode Standard does not define glyph images, that is, the visual representations of the characters.
Unicode has several encoding forms, including UTF-8 and UTF-16. The form relevant here uses two bytes (one 16-bit code unit) per character, which allows for
only about 64 thousand (65,536) characters. To reach characters outside that range, Unicode has a mechanism called "surrogates", which uses a pair of
two-byte code units to describe a single character. Surrogate pairs can describe an additional one million (1,048,576) characters. Currently there are about
40 thousand characters in the surrogate area.
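The surrogate mechanism described above can be sketched in a few lines of C. This is a minimal illustration of the standard UTF-16 arithmetic only; the function names are hypothetical and are not part of any EnterpriseOne API.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the 16-bit code unit is a high (leading) surrogate. */
static int demo_is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Returns 1 if the 16-bit code unit is a low (trailing) surrogate. */
static int demo_is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combines a surrogate pair into the code point it encodes
 * (range U+10000 to U+10FFFF). Each half contributes 10 bits. */
static uint32_t demo_combine_surrogates(uint16_t high, uint16_t low)
{
    assert(demo_is_high_surrogate(high) && demo_is_low_surrogate(low));
    return 0x10000u
         + ((uint32_t)(high - 0xD800u) << 10)
         + (uint32_t)(low - 0xDC00u);
}
```

For instance, the pair 0xD834 0xDD1E combines to U+1D11E. Code that assumes UCS-2 would instead see two separate 16-bit "characters".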
By default, PeopleSoft EnterpriseOne assumes UCS-2 encoding and treats each half of a surrogate pair as a separate character. Note that 0x00 is a valid byte
within a character: for example, the letter 'A' is stored as 0x00 0x41. This means that byte-oriented string functions such as strlen() and strcpy() will not
work with Unicode data, because they treat the first 0x00 byte as the end of the string.
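A minimal sketch of why strlen() fails on this data. DEMO_JCHAR and the helper names are stand-ins invented for this illustration, not the real JCHAR type or any jdeStrxxx function; the byte layout follows the 0x00 0x41 example given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t DEMO_JCHAR; /* stand-in for JCHAR: one 16-bit code unit */

/* "AB" as raw bytes in the order the text describes: 0x00 0x41, 0x00 0x42. */
static const unsigned char demo_ab_bytes[] = { 0x00, 0x41, 0x00, 0x42, 0x00, 0x00 };

/* The same text as 16-bit code units, terminated by a 16-bit zero. */
static const DEMO_JCHAR demo_ab_units[] = { 0x0041, 0x0042, 0x0000 };

/* Byte-oriented length: strlen() stops at the first 0x00 byte. */
static size_t demo_bytewise_len(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}

/* Code-unit length: counts 16-bit units until a 16-bit zero. */
static size_t demo_unit_len(const DEMO_JCHAR *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

Here demo_bytewise_len(demo_ab_bytes) reports a length of 0 because the very first byte is 0x00, while demo_unit_len(demo_ab_units) reports the expected 2.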
Changes
Because each Unicode character requires two bytes, the char data type is inadequate in C business functions.
To avoid confusion, char will no longer be available as a type. Only JCHAR (Unicode character) and ZCHAR (non-Unicode character) are available.
Byte (byte storage) is an existing data type that must be used wherever the char data type is currently used to store non-character data.
file:///C|/Notes/pdf/computer-erp-jde-solution-ott-03-0035-Unicode_and_C_Business_Functions.htm (1 of 11)07/02/2007 10:47:40 AM
HowTo
It is recommended to convert to Unicode and use Unicode (double-byte) in preference to 8-bit characters wherever possible.
New Macros
Two new macros must be used to define string and character literals:
Bare literals such as 'x' and "xxx" are no longer available except inside _J or _Z. Note that the arguments to _J and _Z must be literals, because these are
macros, not conversion functions.
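To illustrate why the arguments must be literals, here is a sketch of how literal-prefixing macros of this kind typically work. DEMO_J and DEMO_Z are hypothetical stand-ins, not the actual definitions of _J and _Z, and wchar_t is used only because it is the portable wide type in standard C (its width varies by platform, whereas the text describes JCHAR as two bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-ins for _J and _Z. Token pasting attaches the L
 * prefix to the literal at compile time; nothing is converted at run time. */
#define DEMO_J(literal) L##literal
#define DEMO_Z(literal) literal

static const wchar_t demo_wide[]   = DEMO_J("AB"); /* expands to L"AB" */
static const char    demo_narrow[] = DEMO_Z("AB"); /* stays "AB" */
static const wchar_t demo_wide_ch  = DEMO_J('A');  /* expands to L'A' */

/* Number of elements (including the terminator) in the wide literal. */
static size_t demo_wide_elements(void)
{
    return sizeof(demo_wide) / sizeof(demo_wide[0]);
}

/* DEMO_J(szName) with a variable would paste into the single identifier
 * LszName and fail to compile, which is why only literals are accepted. */
```

Both arrays hold three elements ('A', 'B', and a terminator), but each element of demo_wide is a wide character rather than a single byte.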
The format specifier %ls must be used in format strings instead of %s to indicate a Unicode string. For example, sprintf (szString, "Hello %s\n", szName);
will need to be changed to jdeSprintf (szString, _J ("Hello %ls\n"), szName);
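The same distinction exists in standard C, where the wide-character formatting functions use %ls for a wide string argument. The sketch below uses swprintf() and wchar_t as an analogy only; it does not show jdeSprintf() itself, and demo_format_greeting is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

static wchar_t demo_greeting[32];

/* Formats "Hello <name>\n" into out. out_chars is a count of wide
 * characters, not bytes. Returns the number of wide characters written
 * (excluding the terminator), or a negative value on error. */
static int demo_format_greeting(wchar_t *out, size_t out_chars, const wchar_t *name)
{
    assert(out != NULL && name != NULL);
    return swprintf(out, out_chars, L"Hello %ls\n", name);
}
```

Calling demo_format_greeting(demo_greeting, 32, L"World") writes the 12 wide characters of "Hello World\n". Note that the buffer size is passed as a character count, in line with the guidance on lengths later in this document.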
There will be two versions of all string functions, one for Unicode; one for non-Unicode.
Naming standards:
It is recommended that developers use the Unicode versions. The exception is when interfacing with non-Unicode APIs where the data needs to be
manipulated. Convert strings to Unicode at the earliest convenience and use them throughout. Use of traditional string functions such as strcpy, strlen, and
printf will no longer be allowed.
Replacement functions
Former        New
strcpy()      jdeStrcpy()
strlen()      jdeStrlen()
strstr()      jdeStrstr()
sprintf()     jdeSprintf()
strncpy()     jdeStrncpy()
...           ...
One note about jdeStrcpy(). This function name is already in use, therefore the slimer will change existing jdeStrcpy() to jdeStrncpyTerminate(). Going
forward, developers need to use jdeStrncpyTerminate() where they previously used jdeStrcpy().
Because MATH_NUMERIC data structures' string member is not Unicode, there are new functions to access the member instead of directly going to
elements of the underlying data structure:
● jdeMathGetCurrencyCodeUNI (MATH_NUMERIC *pMn, JCHAR *szCurrencyCode); This gets the Currency Code for the numeric pMn into the
Unicode string szCurrencyCode.
● jdeMathSetCurrencyCodeUNI (MATH_NUMERIC *pMn, JCHAR *szCurrencyCode); This takes your Currency Code in Unicode and sets the
MATH_NUMERIC pMn appropriately.
The old function jdeMathGetRawString() will return a Unicode string, however it is not safe to use and will be obsolete in the future. Either use JCHAR *
jdeMathGetRawStringEx (MATH_NUMERIC* Value, JCHAR* Str); where you have pre-allocated Str; or use FormatMathNumeric();
Conversion Functions
The caller must allocate both buffers. The fourth parameter is a pointer to the code page to convert from or to. When NULL is passed, the Western European
code page will be used. This is what should be used unless some special conversion is intended.
The character converted is returned through the first argument pointer and the function return value. If the first argument pointer is NULL, the character will
be returned only through the function call.
To simplify the use of system functions, such as fopen(), a number of wrapper functions have been created. For example:
Similarly, wrapper functions have been created for non-Unicode strings as well. For example:
jdeMemset() is a new memset function that sets memory character by character, rather than byte by byte. jdeMemset() takes a void pointer, a JCHAR, and the
number of bytes to set. Example: use jdeMemset (buf, _J (' '), sizeof (buf)); to set the Unicode string buf so that each character is 0x0020.
New flat file functions have been created to allow PeopleSoft EnterpriseOne to produce and consume encoded text flat files. For these APIs to work, setup
needs to be done using P93081- Work With Flat File Encoding. Available encoding names are stored in UDC H95/FE.
● jdeFwriteConvert(LPBHVRCOM lpBhvrCom, JCHAR *buf, jde_n_char size, size_t count, FILE *stream )
● jdeFreadConvert(LPBHVRCOM lpBhvrCom, JCHAR *buf, jde_n_char size, size_t count, FILE *stream)
● jdeFprintfConvert(LPBHVRCOM lpBhvrCom, FILE *stream, const JCHAR *format, /* [pointer,] */...)
● jdeFscanfConvert(LPBHVRCOM lpBhvrCom, FILE *stream, const JCHAR *format, /* [pointer,] */...)
● jdeFputsConvert(LPBHVRCOM lpBhvrCom, const JCHAR *buf, FILE *stream)
● jdeFgetsConvert(LPBHVRCOM lpBhvrCom, JCHAR *buf, jde_n_char n, FILE *stream)
● jdeFputcConvert(LPBHVRCOM lpBhvrCom, int c, FILE *stream)
● jdeFgetcConvert(LPBHVRCOM lpBhvrCom, FILE *stream)
● jdeGetEncodingNameV1(LPBHVRCOM lpBhvrCom, JCHAR *enc);
fprintf Examples:
a) Character data in file will be encoded the same as it’s encoded in memory:
FILE *fp;
fp = fopen( "c:/testBSFNZ.txt", "w+");
b) Data will be written to file in Western European code page. (jdeFprintf does a conversion from UCS2 to default Western European code page)
FILE *fp;
fp = jdeFopen(_J( "c:/testBSFNZ.txt"), _J("w+"));
jdeFprintf(fp, _J("%ls%d\n"), _J("Line "), 1);
jdeFclose(fp);
c) Data encoded in the file will be based on the encoding configured using P93081:
FILE *fp;
fp = jdeFopen(_J( "c:/testBSFNZ.txt"), _J("w+"));
jdeFprintfConvert(lpBhvrCom, fp, _J("%ls%d\n"), _J("Line "), 1);
jdeFclose(fp);
Slimer
The slimer is an application that will convert on average 90% of the C code from pre-Unicode to Unicode. The remaining 10%, plus all future changes will
need to be performed by a programmer.
Because each Unicode character takes two bytes, you must pay special attention, when programming C business functions, to whether an API expects the
number of characters or the number of bytes.
In general, all APIs that use a string variable and its size should use character length, not byte length.
Functions that operate on a byte array (not necessarily a string), such as jdeAlloc, should use byte lengths. If the array is actually a string, it is valid to
use jdeStrlen(), but the byte length required by jdeAlloc must then be computed as (jdeStrlen() + 1) * sizeof (JCHAR), where the + 1 leaves room for the
terminating character. This is critical when doing memory allocations. jdeAlloc allocates a byte array, not necessarily a string, and so uses a byte count,
not a string length:
b = jdeAlloc(0, strlen(a) + 1, 0); will need to be changed to b = jdeAlloc(0, (jdeStrlen(a) + 1) * sizeof (JCHAR), 0);
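The allocation rule can be sketched with the standard allocator. This is an illustration under stated assumptions: DEMO_JCHAR stands in for JCHAR, malloc() stands in for jdeAlloc(), and the demo_ helpers are hypothetical rather than EnterpriseOne functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint16_t DEMO_JCHAR; /* stand-in for JCHAR */

static const DEMO_JCHAR demo_text[] = { 'J', 'D', 'E', 0 };

/* Length in characters, excluding the terminator. */
static size_t demo_jstrlen(const DEMO_JCHAR *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Bytes required to hold s plus its terminator. */
static size_t demo_bytes_needed(const DEMO_JCHAR *s)
{
    return (demo_jstrlen(s) + 1) * sizeof(DEMO_JCHAR);
}

/* Duplicates s into freshly allocated memory; the caller frees it.
 * Returns NULL if the allocation fails. */
static DEMO_JCHAR *demo_jstrdup(const DEMO_JCHAR *s)
{
    size_t bytes = demo_bytes_needed(s);
    DEMO_JCHAR *copy = malloc(bytes);
    if (copy != NULL) {
        memcpy(copy, s, bytes);
    }
    return copy;
}

/* Returns 1 if a duplicate of s is byte-for-byte identical to s. */
static int demo_dup_matches(const DEMO_JCHAR *s)
{
    DEMO_JCHAR *copy = demo_jstrdup(s);
    int same;
    if (copy == NULL) {
        return 0;
    }
    same = memcmp(copy, s, demo_bytes_needed(s)) == 0;
    free(copy);
    return same;
}
```

For the three-character string in demo_text, demo_bytes_needed() yields 8 bytes. Allocating only demo_jstrlen() + 1 = 4 bytes would hold half of the data.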
On the other hand, all the jdeStrxxx functions explicitly handle strings, so character lengths are used, and the sizeof() operator, which returns a byte count,
becomes a problem. Example:
● When using jdeStrncpy(), the third parameter is the number of characters, not the number of bytes.
● DIM() is a macro that gives the number of characters of an array, Unicode or otherwise.
● Given JCHAR a[10]; DIM(a) will return 10, while sizeof(a) will return 20.
● strncpy (a, b, sizeof (a)); needs to become jdeStrncpy (a, b, DIM (a));
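The difference between the two counts can be sketched as follows. DEMO_DIM is an assumed definition of an element-count macro, not the actual DIM() macro, and demo_copy_chars is a hypothetical bounded copy that always terminates its destination; it is not jdeStrncpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t DEMO_JCHAR; /* stand-in for JCHAR */

/* Assumed definition: number of elements in a true array (not a pointer). */
#define DEMO_DIM(a) (sizeof(a) / sizeof((a)[0]))

static DEMO_JCHAR demo_a[10];

/* Twelve characters plus terminator: longer than demo_a can hold. */
static const DEMO_JCHAR demo_long[] = {
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 0
};

/* Copies at most dst_chars - 1 characters and always terminates dst.
 * dst_chars is a character count, not a byte count. Returns the number
 * of characters copied. */
static size_t demo_copy_chars(DEMO_JCHAR *dst, const DEMO_JCHAR *src, size_t dst_chars)
{
    size_t i = 0;
    assert(dst != NULL && src != NULL && dst_chars > 0);
    while (i + 1 < dst_chars && src[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    dst[i] = 0;
    return i;
}

/* Copies using the element count, which is the correct capacity. */
static size_t demo_copy_with_dim(void)
{
    return demo_copy_chars(demo_a, demo_long, DEMO_DIM(demo_a));
}
```

With DEMO_DIM(demo_a) the copy is limited to nine characters plus the terminator. Passing sizeof(demo_a) instead would claim room for 20 characters and let the copy run past the end of the array.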
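Another area where this can cause problems is array subscripts. If code currently has
char a[10];
a[sizeof(a) - 1] = '\0'; /* a[9]='\0'; */
it needs to become the following, because sizeof(a) - 1 would evaluate to 19 for the JCHAR array:
JCHAR a[10];
a[DIM(a) - 1] = _J('\0'); /* a[9]=_J('\0'); */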
The POSIX function memset() changes memory byte by byte. For example, if buf is 10 bytes long, memset(buf, ' ', sizeof (buf)); will set the 10 bytes pointed
to by buf to the value 0x20 on non-AS/400 machines (on an AS/400, the value will be the EBCDIC value of a space).
This still holds true even if ' ' is a Unicode ' ' and has the hex value of 0x0020. This is because memset() truncates the second parameter to a single byte.
If you have code that sets a character array to all spaces using memset(buf, ' ', sizeof(buf)); this would get slimed to memset(buf, _J(' '), sizeof(buf));.
However, what actually happens is that every byte of buf would get set to 0x20, which means the character buf[0] would be 0x2020, which is the Dagger
character (†) in Unicode.
The basic issue is that we need a Unicode character set function, one that sets memory character by character rather than byte by byte. To solve this, a new
function, jdeMemset(), takes a void pointer, a JCHAR, and the number of bytes to set. Use jdeMemset (buf, _J (' '), sizeof (buf)); to set the Unicode string
buf so that each character is 0x0020. Fortunately, memset (buf, 0, sizeof (buf)); works as it always has. Note that the third argument for jdeMemset() is a
byte count, not a character count.
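The dagger effect is easy to reproduce. In this sketch DEMO_JCHAR stands in for JCHAR and demo_jmemset is a hypothetical character-wise fill that, like the description of jdeMemset() above, takes a byte count; it is not the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t DEMO_JCHAR; /* stand-in for JCHAR */

static DEMO_JCHAR demo_buf[10];

/* Character-wise fill. byte_count is in bytes, so it is divided by the
 * character size to get the number of characters to set. */
static void demo_jmemset(void *dst, DEMO_JCHAR ch, size_t byte_count)
{
    DEMO_JCHAR *p = dst;
    size_t chars = byte_count / sizeof(DEMO_JCHAR);
    size_t i;
    assert(dst != NULL);
    for (i = 0; i < chars; i++) {
        p[i] = ch;
    }
}

/* Fills with memset(): every BYTE becomes 0x20. Returns the first character. */
static DEMO_JCHAR demo_fill_bytewise(void)
{
    memset(demo_buf, 0x0020, sizeof(demo_buf));
    return demo_buf[0];
}

/* Fills with the character-wise function. Returns the last character. */
static DEMO_JCHAR demo_fill_charwise(void)
{
    demo_jmemset(demo_buf, 0x0020, sizeof(demo_buf));
    return demo_buf[9];
}
```

After the byte-wise fill each character reads 0x2020 (the dagger) on any byte order, because both bytes are 0x20. After the character-wise fill each character is 0x0020.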
Pointer Arithmetic
Code that currently casts a void* with (char*) to do pointer arithmetic in a byte array will need to be modified. The slimer will change the (char*) cast
to a (JCHAR*) cast, which means any pointer arithmetic will then step two bytes at a time. For example (lpVoid is a void*, and points to a structure, not
a string):
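A minimal sketch of this effect, using stand-in types rather than the EnterpriseOne headers. DEMO_JCHAR is an assumed two-byte stand-in for JCHAR, unsigned char plays the role of a byte-sized pointer type, and the demo_ function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t DEMO_JCHAR; /* stand-in for JCHAR */

/* 16 bytes of storage standing in for an arbitrary structure. */
static DEMO_JCHAR demo_storage[8];

/* Bytes advanced when stepping n through a byte-sized pointer. */
static ptrdiff_t demo_step_as_bytes(void *base, size_t n)
{
    unsigned char *start = base;
    assert(base != NULL);
    return (start + n) - start;
}

/* Bytes advanced when stepping n through a two-byte character pointer,
 * which is what the code does after a (char*) cast becomes (JCHAR*). */
static ptrdiff_t demo_step_as_jchar(void *base, size_t n)
{
    unsigned char *start = base;
    unsigned char *end = (unsigned char *)((DEMO_JCHAR *)base + n);
    assert(base != NULL);
    return end - start;
}
```

An offset of 4 that used to land 4 bytes into the structure now lands 8 bytes in. Where the data is not a string, the cast needs to be to a byte type so the offsets keep their original meaning.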
There can also be issues when using memory functions such as memmove, which are all byte array functions, not string functions. For example, given the
slimed code
JCHAR* source;
memmove(destination, source, 6);
only the first three characters are moved, because memmove takes the number of bytes and in this example the source is a string. In Unicode, the 6
characters will take up 12 bytes.
Again, memxxx functions are byte (integer) oriented and not designed to handle character data. If the source is always a string, conversion to the appropriate
jdeStrxxx function is recommended.
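The memmove case can be checked directly. As before, DEMO_JCHAR is an assumed stand-in for JCHAR and the demo_ helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t DEMO_JCHAR; /* stand-in for JCHAR */

static const DEMO_JCHAR demo_src[] = { 'A', 'B', 'C', 'D', 'E', 'F', 0 };
static DEMO_JCHAR demo_dst[7];

/* Length in characters, excluding the terminator. */
static size_t demo_jstrlen(const DEMO_JCHAR *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Clears the destination, moves byte_count BYTES from the source, and
 * returns how many characters actually arrived. */
static size_t demo_move(size_t byte_count)
{
    assert(byte_count <= sizeof(demo_dst));
    memset(demo_dst, 0, sizeof(demo_dst));
    memmove(demo_dst, demo_src, byte_count);
    return demo_jstrlen(demo_dst);
}
```

demo_move(6) delivers only "ABC", while demo_move(6 * sizeof(DEMO_JCHAR)) delivers all six characters.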
Byte ordering
When sending data across the network to a different platform, the byte order of character data must be taken into account. Unicode characters are unsigned
shorts, so the byte order now matters.
Cache Keys
String cache keys use the number of characters for the size so use DIM instead of sizeof().
When the key is a single character, hard code the nSize = 1 because DIM only works with a character array.
To be able to read Unicode strings do the following: Go to the “Tools” menu, select the “Options” entry, click on the “Debug” tab. Make sure the “Display
Unicode strings” checkbox is checked.
Resources
Unicode Guidelines
Using CodeChangeCom
For additional information on Unicode, refer to the Unicode web site at http://www.unicode.org