Sybase Technical Library - Product Manuals Home

Unilib Reference Manual

Chapter 3: String Operations and Character Attributes

Transform Operations: unictfrm.h

Many character transform operations can be defined on the Unicode character set. The prototypical transform is that of changing the case of a character.

The basic Unicode transform operations are typically defined as functions that take a single unichar and return a single unichar. They are defined again on a unistring, operating directly on the string in place. In some instances, as with case, an operation may also be defined which simply reports whether a character has a transformable property of the defined type. These operations follow an interface template:

unichar unictfrm_XXX( unichar c );

void unictfrm_StrXXX( unistring s );

int unictfrm_IsXXX( unichar c );

where unichar c = a character of the type unichar

and s = a pointer to a string of characters of the type unichar

This section defines a series of individual transform types. Some examples of transforms include those shown in Table 3-9 below.

Table 3-9: Transform types

Transforms                                     Type of Script

Uppercase ↔ Lowercase                          Latin, Cyrillic, Greek scripts
Katakana ↔ Hiragana ↔ Romaji                   Japanese writing system
Hankaku ↔ Zenkaku                              Japanese writing system
Precomposed ↔ Composed character sequence      Latin and other scripts
Composed with accent ↔ Baseform                Latin, and so on, for normalization
Hangul ↔ Jamo sequence                         Korean script
Numeric character ↔ Numeric value              Numerous scripts
Horizontal form ↔ Vertical form                Numerous scripts
Compatibility form ↔ Preferred Unicode character   Japanese and Chinese

Note: If a particular transform has no meaning for the character in question (for example, unictfrm_ToLower( (unichar)('$') )), the interface always returns the same character, rather than an error.

The string transform interfaces are defined as void functions and assume UNINULL-terminated strings. Passing them a non-terminated string results in undefined (and generally unsatisfactory) behavior.
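The interface template and the two notes above can be illustrated with a minimal sketch. The typedefs, the DEMO_UNINULL macro, and the demo_ functions are stand-ins invented for this example (assumptions, not the Unilib declarations), and the transform handles ASCII letters only.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations: assumptions for this sketch, not the Unilib headers. */
typedef unsigned short unichar;      /* assumed 16-bit code unit */
typedef unichar *unistring;          /* assumed UNINULL-terminated string */
#define DEMO_UNINULL ((unichar)0)

/* Character form: returns the transformed character, or c itself when the
   transform has no meaning for c (for example '$'). ASCII only. */
unichar demo_ToUpper(unichar c)
{
    return (c >= 'a' && c <= 'z') ? (unichar)(c - ('a' - 'A')) : c;
}

/* Predicate form: reports whether c is an (ASCII) uppercase letter. */
int demo_IsUpper(unichar c)
{
    return (c >= 'A' && c <= 'Z') ? 1 : 0;
}

/* String form: rewrites the terminated string in place. It stops only at
   the terminator, so an unterminated buffer would be overrun. */
void demo_StrToUpper(unistring s)
{
    assert(s != NULL);
    for (; *s != DEMO_UNINULL; ++s)
        *s = demo_ToUpper(*s);
}
```

Because the string form writes through its argument, the caller must pass a writable, terminated buffer.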

The basic Unicode transform operations are fast, language-independent transforms that also have the property of length-invariance: a transformed string always has the same length as the original.

More complex and complete interfaces for non-length-invariant transforms will be defined in future versions of the Library.

Language-Independent Transforms with Default Case Folding

The Unicode transforms for upper casing and lower casing provide only the simplest default case folding (for speed and fall-back behavior). There is no provision here for one-to-many or many-to-one mappings in the case transform.

Characters with Case Pairs

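For many characters that have case-pairs (for example, A/a), the following statements are true, the first when c is the uppercase member of the pair and the second when c is the lowercase member:

c ≡ ( unictfrm_ToUpper( unictfrm_ToLower( c ) ) )
c ≡ ( unictfrm_ToLower( unictfrm_ToUpper( c ) ) )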

However, exceptions exist, since more than one lowercase character may share a single uppercase form or vice versa. Therefore, changing case should be considered data destructive, as it may not round-trip correctly.
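A small sketch of why the round trip can fail. The demo_ functions are hypothetical stand-ins that hard-code two simple one-to-one mappings from the Unicode character data (U+0131 and U+017F); whether the Unilib tables contain these particular mappings is not stated in this manual.

```c
typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

/* Demo upper-casing: ASCII letters plus two characters whose uppercase
   form is shared with an ordinary ASCII letter. */
unichar demo_ToUpper(unichar c)
{
    if (c >= 'a' && c <= 'z')
        return (unichar)(c - ('a' - 'A'));
    switch (c) {
    case 0x0131: return 0x0049;   /* LATIN SMALL LETTER DOTLESS I -> I */
    case 0x017F: return 0x0053;   /* LATIN SMALL LETTER LONG S    -> S */
    default:     return c;
    }
}

/* Demo lower-casing: ASCII letters only. */
unichar demo_ToLower(unichar c)
{
    if (c >= 'A' && c <= 'Z')
        return (unichar)(c + ('a' - 'A'));
    return c;
}

/* Returns 1 if a lowercase character survives upper-casing followed by
   lower-casing, 0 if the round trip produced a different character. */
int demo_LowerRoundTrips(unichar c)
{
    return demo_ToLower(demo_ToUpper(c)) == c;
}
```

In this sketch, dotless i comes back as a plain i and long s comes back as a plain s, so the original character is lost.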

Case Transforms for Strings

The unictfrm_StrToUpper and unictfrm_StrToLower interfaces change the values in the UNINULL-terminated string provided by the client. Therefore, calling them with a string constant would be incorrect.

Also, if the original value is to be retained, the client should first make a copy of the string in question and pass that copy to the unictfrm_StrToUpper or unictfrm_StrToLower interfaces. The string interfaces avoid the extra function call and are, therefore, somewhat faster than calling unictfrm_ToUpper repeatedly on the content of a string.

Instances in which case changes involve one-to-many mappings are ignored in these interfaces. The most notorious example is the German eszett (ß, U+00DF), whose correct uppercase is SS, and which requires lengthening the string for case conversion.
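The copy-then-transform pattern described above might look like the following sketch. demo_ToUpper, demo_StrToUpper, and demo_UpperCopy are hypothetical stand-ins covering ASCII and Latin-1 letters only. Leaving U+00DF unchanged is this sketch's reading of "ignored" and is an assumption about the library's behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

/* Demo upper-casing for ASCII and Latin-1 letters. U+00DF is returned
   unchanged because its uppercase form "SS" would lengthen the string. */
static unichar demo_ToUpper(unichar c)
{
    if ((c >= 0x0061 && c <= 0x007A) ||
        (c >= 0x00E0 && c <= 0x00FE && c != 0x00F7))
        return (unichar)(c - 0x0020);
    return c;
}

/* In-place, length-invariant string transform. */
static void demo_StrToUpper(unichar *s)
{
    for (; *s != 0; ++s)
        *s = demo_ToUpper(*s);
}

/* Copies src (terminator included) into dest, then upper-cases the copy so
   that src is preserved. Returns 0 on success, -1 if dest is too small. */
int demo_UpperCopy(unichar *dest, size_t destlen, const unichar *src)
{
    size_t len = 0;
    assert(dest != NULL && src != NULL);
    while (src[len] != 0)
        ++len;
    if (destlen < len + 1)
        return -1;
    memcpy(dest, src, (len + 1) * sizeof(unichar));
    demo_StrToUpper(dest);
    return 0;
}
```

The copy has the same length as the source, which is what length-invariance requires.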

unictfrm_ToUpper

Function

Transforms a character to upper case.

Syntax

unichar unictfrm_ToUpper( unichar c )

Parameters

unichar c - Character to transform

Comments

None

Example

#include <unictfrm.h>
extern unichar c;
{
unichar localC;
localC = unictfrm_ToUpper( c );
}

Returns

Transformed unichar value

unictfrm_ToLower

Function

Transforms a character to lower case.

Syntax

unichar unictfrm_ToLower( unichar c )

Parameters

unichar c - Character to transform

Comments

None

Example

#include <unictfrm.h>
extern unichar c;
{
unichar localC;
localC = unictfrm_ToLower( c );
}

Returns

Transformed unichar value

unictfrm_StrToUpper

Function

Transforms a string to upper case.

Syntax

void unictfrm_StrToUpper( unistring s )

Parameters

unistring s - unistring to transform

Comments

None

Example

#include <unictfrm.h>
extern unistring s;
{
unictfrm_StrToUpper( s );
}

Returns

None

unictfrm_StrToLower

Function

Transforms a string to lower case.

Syntax

void unictfrm_StrToLower( unistring s )

Parameters

unistring s - unistring to transform

Comments

None

Example

#include <unictfrm.h>
extern unistring s;
{
unictfrm_StrToLower( s );
}

Returns

None

unictfrm_IsUpper

Function

Report whether a character is upper case.

Syntax

int unictfrm_IsUpper( unichar c )

Parameters

unichar c - character to report on

Comments

None

Example

#include <unictfrm.h>
extern unichar c;
{
if (unictfrm_IsUpper( c ))
   {
/* character is uppercase */
   }
}

Returns

1 = TRUE (if character is upper case)

0 = FALSE (otherwise)

unictfrm_IsLower

Function

Report whether a character is lower case.

Syntax

int unictfrm_IsLower( unichar c )

Parameters

unichar c - character to report on

Comments

None

Example

#include <unictfrm.h>
extern unichar c;
{
if (unictfrm_IsLower( c ))
   {
/* character is lowercase */
   }
}

Returns

1 = TRUE (if character is lower case)

0 = FALSE (otherwise)

ASCII Folding

The following transforms fold upper- and lowercase zenkaku ASCII into the ASCII range. This is useful for parsing identifiers in Japanese systems.

Note: This transform is exact, but not reversible because it folds different characters together.
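As an illustration of the fold and of why it cannot be reversed, here is a minimal sketch. It assumes the fold covers the whole full-width ASCII block U+FF01..U+FF5E, which sits at a fixed offset of 0xFEE0 from U+0021..U+007E. The exact range handled by the library is not stated here, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

/* Folds a full-width (zenkaku) ASCII character to its ASCII counterpart.
   Any other character is returned unchanged. */
unichar demo_FoldASCII(unichar c)
{
    if (c >= 0xFF01 && c <= 0xFF5E)
        return (unichar)(c - 0xFEE0);
    return c;
}

/* In-place string form over a terminated string. */
void demo_StrFoldASCII(unichar *s)
{
    assert(s != NULL);
    for (; *s != 0; ++s)
        *s = demo_FoldASCII(*s);
}
```

Both U+0041 and U+FF21 come out as U+0041, so the folded text no longer records which form the input used.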

unictfrm_FoldASCII

Function

Folds a zenkaku ASCII character into regular ASCII.

Syntax

unichar unictfrm_FoldASCII ( unichar c )

Parameters

unichar c - Character to transform

Comments

None

Example

#include <unictfrm.h>
extern unichar c;
{
unichar localC;
localC = unictfrm_FoldASCII( c );
}

Returns

Transformed unichar value

unictfrm_StrFoldASCII

Function

Folds a zenkaku ASCII string into regular ASCII.

Syntax

void unictfrm_StrFoldASCII ( unistring s )

Parameters

unistring s - unistring to transform

Comments

None

Example

#include <unictfrm.h>
extern unistring s;
{
unictfrm_StrFoldASCII( s );
}

Returns

None

Half-Width/Full-Width Compatibility Area Folding

The following transforms fold all Compatibility Area half- and full-width characters into their standard Unicode values without lower casing.

This interface treats half-width katakana folding as follows.

Katakana folding occurs on a character-by-character basis without context-sensitive analysis and replacement. This results in somewhat aberrant katakana encodings for voiced and semi-voiced Japanese syllables converted from half-width katakana, as shown below.

U+FF76 (half-width ka) + U+FF9E (half-width voiced mark) → U+30AB (ka) + U+309B (voiced mark)

A more appropriate outcome would be to convert to:

U+30AC (Katakana ga)

To do so, however, requires context analysis and changes string length.
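The difference between the two outcomes can be sketched as follows. Both functions are hypothetical and handle only the single ka/ga example from the text. demo_ComposeVoiced is not described as part of the library; it only illustrates the context-sensitive step that shortens the string.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

/* Per-character fold, as in the text: no context, length unchanged. */
void demo_StrFoldCZone(unichar *s)
{
    assert(s != NULL);
    for (; *s != 0; ++s) {
        if (*s == 0xFF76)
            *s = 0x30AB;          /* half-width ka          -> ka          */
        else if (*s == 0xFF9E)
            *s = 0x309B;          /* half-width voiced mark -> voiced mark */
    }
}

/* Context-sensitive step: replaces ka + voiced mark with ga (U+30AC).
   The string gets shorter; the new length is returned. */
size_t demo_ComposeVoiced(unichar *s)
{
    size_t r = 0, w = 0;
    assert(s != NULL);
    while (s[r] != 0) {
        if (s[r] == 0x30AB && s[r + 1] == 0x309B) {
            s[w++] = 0x30AC;
            r += 2;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = 0;
    return w;
}
```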

unictfrm_FoldCZone

Function

Normalizes any compatibility area zenkaku and hankaku character to its standard form.

Syntax

unichar unictfrm_FoldCZone ( unichar c )

Parameters

unichar c - Character to transform

Comments

None

Example

#include <unictfrm.h>
extern unichar c;
{
unichar localC;
localC = unictfrm_FoldCZone( c );
}

Returns

Transformed unichar value

unictfrm_StrFoldCZone

Function

Normalizes any compatibility area zenkaku and hankaku string to its standard form.

Syntax

void unictfrm_StrFoldCZone ( unistring s )

Parameters

unistring s - unistring to transform

Comments

None

Example

#include <unictfrm.h>
extern unistring s;
{
unictfrm_StrFoldCZone( s );
}

Returns

None

Numeric Evaluation of Unicode Characters

Three interfaces provide numeric evaluations of Unicode characters. The first two deal only with integers, and return an error value for fractional Unicode characters. The third returns a float and also handles fractional values for Unicode characters.

unictfrm_ToIntValue

Function

This function evaluates the integer value of a Unicode character. It handles all folding properly, evaluating full-width ASCII values without requiring prefolding.

Syntax

int unictfrm_ToIntValue ( unichar c )

Parameters

unichar c - Character to evaluate

Comments

None

Example

if (unictype_IsNumeric( c ))
{
    n = unictfrm_ToIntValue( c );
    if ( n < 0 )
    {
        /* error processing for fractions here */
    }
}

or

if (unictype_IsDecimalDigit( c ))
    n = unictfrm_ToIntValue( c );
    /* guaranteed to succeed for decimal digits */

Returns

Table 3-10: Return codes for unictfrm_ToIntValue

Return value                          Input character

Its integer value, for example, 9     Numeric Unicode character, for example,
                                      U+0039 "9" or U+2089 SUBSCRIPT DIGIT NINE
-1                                    Non-numeric Unicode characters
-2                                    Numeric with fractional value
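The return conventions in Table 3-10 can be mimicked with a small stand-in. demo_ToIntValue is hypothetical and recognizes only four code point ranges chosen for this example; the real interface presumably covers far more of the Unicode numeric repertoire.

```c
typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

#define DEMO_NOT_NUMERIC (-1)
#define DEMO_FRACTION    (-2)

/* Returns the integer value of a digit character, DEMO_FRACTION for the
   Latin-1 fraction characters, DEMO_NOT_NUMERIC for everything else. */
int demo_ToIntValue(unichar c)
{
    if (c >= 0x0030 && c <= 0x0039) return c - 0x0030;   /* ASCII digits      */
    if (c >= 0xFF10 && c <= 0xFF19) return c - 0xFF10;   /* full-width digits */
    if (c >= 0x2080 && c <= 0x2089) return c - 0x2080;   /* subscript digits  */
    if (c >= 0x00BC && c <= 0x00BE) return DEMO_FRACTION; /* 1/4, 1/2, 3/4    */
    return DEMO_NOT_NUMERIC;
}
```

A caller can treat any negative result as "no integer value available" and then distinguish the two cases if it needs to.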

unictfrm_ToHexValue

Function

Evaluates the integer value of a Unicode character used as a hex digit. Only valid hex digits (0-9, A-F) are evaluated.

Syntax

int unictfrm_ToHexValue ( unichar c )

Parameters

unichar c - Character to evaluate

Comments

None

Example

if (unictype_IsHexDigit ( c ) )
n = unictfrm_ToHexValue ( c ) ;

Returns

Table 3-11: Return codes for unictfrm_ToHexValue

Return value                          Input character

Its integer value, for example, 13    Hexadecimal numeric Unicode character,
                                      for example, U+0044 "D"
-1                                    Non-numeric and non-digit Unicode characters

unictfrm_ToFloatValue

Function

The main use of this transform is to evaluate the float values of Unicode fraction characters. This function handles all folding properly, evaluating full-width ASCII without having to be prefolded.

Because it is slower than unictfrm_ToIntValue, it should only be used when dealing with fractions.

Syntax

float unictfrm_ToFloatValue ( unichar c )

Parameters

unichar c - Character to evaluate

Comments

None

Example

if ( unictype_IsNumeric (c) )
ff = unictfrm_ToFloatValue (c);

Returns

Value of a numeric Unicode character as a float, as shown in the example below:

Table 3-12: Return codes for unictfrm_ToFloatValue

Return value    Input character

0.75            U+00BE "3/4"
9.0             U+0039 "9"
-1.0            Non-numeric Unicode character
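A float-returning counterpart, again as a hypothetical stand-in that covers only ASCII digits, full-width digits, and the three Latin-1 fraction characters. The values chosen are exactly representable as floats, so the comparisons in a caller stay exact.

```c
typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

/* Returns the numeric value of c as a float, or -1.0f if c is not one of
   the numeric characters this sketch knows about. */
float demo_ToFloatValue(unichar c)
{
    if (c >= 0x0030 && c <= 0x0039) return (float)(c - 0x0030);
    if (c >= 0xFF10 && c <= 0xFF19) return (float)(c - 0xFF10);
    switch (c) {
    case 0x00BC: return 0.25f;    /* VULGAR FRACTION ONE QUARTER    */
    case 0x00BD: return 0.5f;     /* VULGAR FRACTION ONE HALF       */
    case 0x00BE: return 0.75f;    /* VULGAR FRACTION THREE QUARTERS */
    default:     return -1.0f;
    }
}
```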

Basic String Formatting and Parsing of Integers

Two interfaces are provided to allow for formatting and parsing of simple positive integers. These are convenience routines for programming, and are not meant to substitute for country-specific locale-based formatting of numbers. Effectively, these constitute Unicode-based equivalents to C library interfaces atoi and itoa, upon which they are loosely based.

unictfrm_StrToInt

Function

Convert a simple numeric Unicode string (for example, 1234 or 007AFFFF) into an unsigned 32-bit integer.

Syntax

UNICTFRM_RET unictfrm_StrToInt( UInt32 *dest, unistring src, int radix )

Parameters

UInt32 *dest - pointer to client-supplied variable to fill in

unistring src - unistring to convert

int radix - radix to use in the numeric conversion

Comments

The radix may be any value between 2 and 16 inclusive.

This routine does not parse C-style numeric constant conventions. A trailing l, L, u, or U will cause an error. An initial 0 is not taken as indicating an octal string. An initial 0x will cause an error.

Example

#include <unictfrm.h>
extern unistring numStr;
{
UInt32 value;
UNICTFRM_RET rc;
    rc = unictfrm_StrToInt( &value, numStr, 16 );
    if ( rc != UNICTFRM_OK )
    {
        /* handle error */
    }
}

Returns

Table 3-13: Return codes for unictfrm_StrToInt

Return value          Meaning

UNICTFRM_OK           String was successfully converted to a number.
UNICTFRM_BadRadix     Radix outside range 2..16 was provided.
UNICTFRM_BadInput     Non-parseable string was provided, or string represented
                      a number larger than could be represented by an unsigned
                      32-bit int.
UNICTFRM_BadDigit     String contained an invalid digit for the specified radix.
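The parsing rules and return codes above can be sketched as follows. The DEMO_ enumeration and demo_StrToInt are hypothetical stand-ins. Treating an empty string and any non-digit character (such as x or a trailing L) as bad input is this sketch's assumption, since the text does not say which error code those cases produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */
typedef enum { DEMO_OK, DEMO_BadRadix, DEMO_BadInput, DEMO_BadDigit } DEMO_RET;

/* Value of an ASCII digit or hex letter, or -1 if c is neither. */
static int demo_DigitValue(unichar c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

DEMO_RET demo_StrToInt(uint32_t *dest, const unichar *src, int radix)
{
    uint32_t value = 0;
    assert(dest != NULL && src != NULL);
    if (radix < 2 || radix > 16)
        return DEMO_BadRadix;
    if (*src == 0)
        return DEMO_BadInput;             /* assumption: empty is non-parseable */
    for (; *src != 0; ++src) {
        int d = demo_DigitValue(*src);
        if (d < 0)
            return DEMO_BadInput;         /* e.g. 'x', 'L', or a sign */
        if (d >= radix)
            return DEMO_BadDigit;         /* digit not valid in this radix */
        /* Reject before multiplying if value * radix + d would overflow. */
        if (value > (UINT32_MAX - (uint32_t)d) / (uint32_t)radix)
            return DEMO_BadInput;
        value = value * (uint32_t)radix + (uint32_t)d;
    }
    *dest = value;
    return DEMO_OK;
}
```

The destination is written only on success, so a caller can keep a default value in it when an error is returned.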

unictfrm_IntToStr

Function

Format an unsigned 32-bit integer as a simple numeric Unicode string.

Syntax

UNICTFRM_RET unictfrm_IntToStr( unistring dest, 
int destlen, UInt32 src, int radix, int numDigits )

Parameters

unistring dest - client-supplied buffer

int destlen - length of client-supplied buffer in unichars

UInt32 src - 32-bit integer to convert

int radix - radix to use in the numeric conversion

int numDigits - number of digits to use in formatting the string

Comments

The radix may be any value between 2 and 16 inclusive. If numDigits is set to 0, the number of digits will depend on the size of the number, and the formatted string will not be zero-filled at the left. If numDigits is set to a positive value from 1 to 32, inclusive, the formatted string will always have that number of digits, and will be zero-filled at the left if necessary. This is especially useful for formatting hexadecimal strings.

Example

#include <unictfrm.h>
extern UInt32 number;
{
unichar buf[40];
UNICTFRM_RET rc;
    rc = unictfrm_IntToStr( buf, 40, number, 16, 8 );
    if ( rc != UNICTFRM_OK )
    {
        /* handle error */
    }
}

Returns

Table 3-14: Return codes for unictfrm_IntToStr

Return value               Meaning

UNICTFRM_OK                Number was successfully formatted as a string.
UNICTFRM_BadRadix          Radix outside range 2..16 was provided.
UNICTFRM_BadDigit          The parameter numDigits is outside the range 0..32.
UNICTFRM_BufferOverflow    The client buffer was too short to contain the
                           formatted string.
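A sketch of the formatting rules, using hypothetical DEMO_ names. Two points are assumptions of this sketch rather than documented behavior: hexadecimal digits are emitted in upper case, and numDigits is treated as a minimum width when the number needs more digits than requested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */
typedef enum { DEMO_OK, DEMO_BadRadix, DEMO_BadDigit, DEMO_BufferOverflow } DEMO_RET;

DEMO_RET demo_IntToStr(unichar *dest, int destlen, uint32_t src,
                       int radix, int numDigits)
{
    static const char digits[] = "0123456789ABCDEF";
    unichar tmp[32];              /* radix 2 needs at most 32 digits */
    int n = 0, width, i;

    assert(dest != NULL);
    if (radix < 2 || radix > 16)
        return DEMO_BadRadix;
    if (numDigits < 0 || numDigits > 32)
        return DEMO_BadDigit;

    /* Produce the digits least significant first. */
    do {
        tmp[n++] = (unichar)digits[src % (uint32_t)radix];
        src /= (uint32_t)radix;
    } while (src != 0);

    width = (numDigits > n) ? numDigits : n;   /* assumption: minimum width */
    if (destlen < width + 1)                   /* room for the terminator */
        return DEMO_BufferOverflow;

    for (i = 0; i < width - n; ++i)
        dest[i] = '0';                         /* zero-fill at the left */
    while (n > 0)
        dest[i++] = tmp[--n];
    dest[i] = 0;
    return DEMO_OK;
}
```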

Soundex

An implementation of the traditional soundex algorithm is provided to make it easier to support Unicode-based software implementations which require conversion of an ASCII-based soundex.

This implementation of soundex operates directly on a Unicode string, but only interprets the Latin-1 portion of Unicode, for compatibility with existing soundex. More sophisticated algorithms are required for general sound-matching against Unicode textual data.

unictfrm_Soundex

Function

Provide the soundex key for a Unicode string.

Syntax

UNICTFRM_RET unictfrm_Soundex( char *outbuf, 
int outbuflen, unistring src, int mode )

Parameters

char *outbuf - client-supplied char buffer for soundex key

int outbuflen - length of client-supplied buffer in bytes

unistring src - Null-terminated Unicode string to compute soundex key on

int mode - mode for the soundex key generation

Comments

The soundex key is a four-character string, null-terminated, of the form A123, S014, T001, etc., where the first character is a letter A..Z, and the next three characters are digits 0..6. In order to hold this key, the client-supplied buffer should be 5 bytes or longer. The default value for a soundex key is Z000, which will be returned for a null string or a string containing no alphabetic letters.

The mode values are interpreted as follows:

Returns

Table 3-15: Return codes for unictfrm_Soundex

Return value               Meaning

UNICTFRM_OK                Soundex key was successfully generated.
UNICTFRM_BufferOverflow    The client buffer was too short.
UNICTFRM_BadMode           Unrecognized mode value was passed in.
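The key format described in the Comments can be illustrated with a sketch of the traditional soundex algorithm. demo_Soundex is a hypothetical stand-in: it has no mode parameter, looks only at ASCII letters rather than all of Latin-1, and returns 0 or -1 instead of the library's return codes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned short unichar;   /* stand-in for the Unilib type (assumption) */

/* Writes a 4-character, null-terminated soundex key into out.
   Returns 0 on success, -1 if the buffer is shorter than 5 bytes. */
int demo_Soundex(char *out, int outlen, const unichar *src)
{
    /* Soundex digit for each letter A..Z; '0' marks vowels, H, W and Y. */
    static const char codes[] = "01230120022455012623010202";
    char prev = '0';
    int n = 0;

    assert(out != NULL && src != NULL);
    if (outlen < 5)
        return -1;
    memcpy(out, "Z000", 5);                /* default key */

    for (; *src != 0 && n < 4; ++src) {
        unichar c = *src;
        char code;
        if (c >= 'a' && c <= 'z')
            c = (unichar)(c - ('a' - 'A'));
        if (c < 'A' || c > 'Z')
            continue;                      /* ignore non-letters */
        code = codes[c - 'A'];
        if (n == 0) {                      /* first letter is kept as is */
            out[n++] = (char)c;
            prev = code;
            continue;
        }
        if (c == 'H' || c == 'W')
            continue;                      /* H and W do not separate codes */
        if (code != '0' && code != prev)
            out[n++] = code;
        prev = code;                       /* a vowel resets the previous code */
    }
    return 0;
}
```

Names that sound alike, such as Robert and Rupert, map to the same key, and a string with no letters yields the default Z000.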

