Unilib Reference Manual
Chapter 3: String Operations and Character Attributes
Transform Operations: unictfrm.h
Many character transform operations can be defined on the Unicode character set. The prototypical transform is that of changing the case of a character.
The basic Unicode transform operations are typically defined on a single unichar and return a single unichar. Each is defined again on a unistring, where it operates directly on the string in place. In some instances, as with case, a further operation is defined that simply reports whether a character has a transformable property of the defined type. These operations follow an interface template:
unichar unictfrm_XXX( unichar c );
void unictfrm_StrXXX( unistring s );
int unictfrm_IsXXX( unichar c );
where unichar c = a character of the type unichar
and s = a pointer to a string of characters of the type unichar
This section defines a series of individual transform types. Some examples of transforms include those shown in Table 3-9 below.
Transforms | Type of Script |
|---|---|
Uppercase <-> Lowercase | Latin, Cyrillic, Greek scripts |
Katakana <-> Hiragana <-> Romaji | Japanese writing system |
Hankaku <-> Zenkaku | Japanese writing system |
Precomposed <-> Composed character sequence | Latin and other scripts |
Composed with accent <-> Baseform | Latin, and so on, for normalization |
Hangul <-> Jamo sequence | Korean script |
Numeric character <-> Numeric value | Numerous scripts |
Horizontal form <-> Vertical form | Numerous scripts |
Compatibility form <-> Preferred Unicode character | Japanese and Chinese |
Note: If a particular transform has no meaning for the character in question (for example, unictfrm_ToLower( (unichar)('$') )), the interface always returns the same character, rather than an error.
The string transform interfaces are defined as void functions and assume UNINULL-terminated strings. Passing them a non-terminated string results in undefined (and generally unsatisfactory) behavior.
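The interface template can be illustrated with a minimal sketch. The typedefs for unichar and unistring, the UNINULL definition, and every sketch_ function below are assumptions made so that the example stands alone; they are not the Library's implementation, and the case mapping covers only A..Z.

```c
#include <assert.h>

/* Assumed stand-ins; the real types come from the Unilib headers. */
typedef unsigned short unichar;
typedef unichar *unistring;
#define UNINULL ((unichar)0)

/* Character form: maps A..Z to a..z and returns anything else unchanged. */
unichar sketch_ToLower(unichar c)
{
    if (c >= 0x0041 && c <= 0x005A)
        return (unichar)(c + 0x0020);
    return c; /* no lowercase form: same character back, not an error */
}

/* Predicate form: 1 if the character has the property, 0 otherwise. */
int sketch_IsUpper(unichar c)
{
    return (c >= 0x0041 && c <= 0x005A) ? 1 : 0;
}

/* String form: rewrites the buffer in place and stops at UNINULL. */
void sketch_StrToLower(unistring s)
{
    for (; *s != UNINULL; s++)
        *s = sketch_ToLower(*s);
}

/* Usage: the buffer must be writable and UNINULL-terminated. */
void sketch_demo(void)
{
    unichar buf[] = { 0x0041, 0x0062, 0x0024, UNINULL }; /* "Ab$" */

    sketch_StrToLower(buf);
    assert(buf[0] == 0x0061 && buf[1] == 0x0062 && buf[2] == 0x0024);
    assert(sketch_IsUpper(0x0041) == 1 && sketch_IsUpper(0x0024) == 0);
}
```

Because the string form only stops at UNINULL, the loop shows why a non-terminated buffer leads to undefined behavior: nothing else bounds the walk.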
The basic Unicode transform operations are fast, language-independent transforms that also have the property of length-invariance. As such, they do the following:
More complex and complete interfaces will be defined for such non-length-invariant transforms in future versions of the Library.
The Unicode transforms for upper casing and lower casing provide only the simplest default case folding (for speed and fall-back behavior). There is no provision here for one-to-many or many-to-one mappings in the case transform.
For many characters that have case pairs (for example, A/a), the following statements are true, the first when c is uppercase and the second when c is lowercase:
c == unictfrm_ToUpper( unictfrm_ToLower( c ) )
c == unictfrm_ToLower( unictfrm_ToUpper( c ) )
However, exceptions exist, since more than one lowercase character may share a single uppercase form or vice versa. Therefore, changing case should be considered data destructive, as it may not round-trip correctly.
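A small sketch shows how such a round trip can fail. The mapping functions below are illustrative stand-ins (ASCII plus two well-known special cases, dotless i and long s), not the Library's tables, and the unichar typedef is an assumption.

```c
#include <assert.h>

typedef unsigned short unichar; /* assumed stand-in for the Unilib type */

/* Illustrative one-to-one upper casing: ASCII plus two special cases. */
unichar sketch_ToUpper(unichar c)
{
    if (c >= 0x0061 && c <= 0x007A) return (unichar)(c - 0x0020);
    if (c == 0x0131) return 0x0049; /* dotless i -> I */
    if (c == 0x017F) return 0x0053; /* long s    -> S */
    return c;
}

/* Illustrative one-to-one lower casing: ASCII only. */
unichar sketch_ToLower(unichar c)
{
    if (c >= 0x0041 && c <= 0x005A) return (unichar)(c + 0x0020);
    return c;
}

/* Returns 1 if c comes back unchanged after upper casing then lower casing. */
int sketch_SurvivesUpperThenLower(unichar c)
{
    return sketch_ToLower(sketch_ToUpper(c)) == c;
}

void sketch_demo(void)
{
    assert(sketch_SurvivesUpperThenLower(0x0061) == 1); /* a -> A -> a */
    assert(sketch_SurvivesUpperThenLower(0x0131) == 0); /* dotless i -> I -> i */
    assert(sketch_SurvivesUpperThenLower(0x017F) == 0); /* long s -> S -> s */
}
```

Both special cases share their uppercase form with an ordinary ASCII letter, so the original character cannot be recovered from the uppercase result.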
The unictfrm_StrToUpper and unictfrm_StrToLower interfaces change the values in the UNINULL-terminated string provided by the client. Therefore, calling them with a string constant would be incorrect.
Also, if the original value is to be retained, the client should first make a copy of the string in question and pass that copy to the unictfrm_StrToUpper or unictfrm_StrToLower interfaces. The string interfaces avoid the extra function call and are, therefore, somewhat faster than calling unictfrm_ToUpper repeatedly on the content of a string.
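The copy-then-transform pattern might look like the sketch below. The typedefs, the ASCII-only sketch_StrToUpper stand-in, and the sketch_StrCopy helper are all hypothetical names introduced for illustration; the Library may offer its own string-copy interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed stand-ins; the real types come from the Unilib headers. */
typedef unsigned short unichar;
typedef unichar *unistring;
#define UNINULL ((unichar)0)

/* Stand-in for an in-place upper casing transform, ASCII only. */
void sketch_StrToUpper(unistring s)
{
    for (; *s != UNINULL; s++)
        if (*s >= 0x0061 && *s <= 0x007A)
            *s = (unichar)(*s - 0x0020);
}

/* Hypothetical helper: heap copy including the terminator, or NULL. */
unistring sketch_StrCopy(const unichar *s)
{
    size_t n = 0;
    unistring copy;

    while (s[n] != UNINULL)
        n++;
    copy = (unistring)malloc((n + 1) * sizeof(unichar));
    if (copy != NULL)
        memcpy(copy, s, (n + 1) * sizeof(unichar));
    return copy;
}

void sketch_demo(void)
{
    /* A constant string: it must never be passed to the transform itself. */
    const unichar original[] = { 0x0061, 0x0062, 0x0063, UNINULL }; /* "abc" */
    unistring work = sketch_StrCopy(original);

    if (work == NULL)
        return; /* allocation failed; nothing to transform */
    sketch_StrToUpper(work);
    assert(original[0] == 0x0061); /* original retained */
    assert(work[0] == 0x0041);     /* copy transformed  */
    free(work);
}
```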
Case changes that involve one-to-many mappings are ignored in these interfaces. The most notorious example is the German eszett (ß, U+00DF), whose correct uppercase is SS and which therefore requires lengthening the string for case conversion.
Transforms a character to upper case.
unichar unictfrm_ToUpper( unichar c )
unichar c - Character to transform
None
#include <unictfrm.h>
extern unichar c;
{unichar localC;
localC = unictfrm_ToUpper( c );
}
Transformed unichar value
Transforms a character to lower case.
unichar unictfrm_ToLower( unichar c )
unichar c - Character to transform
None
#include <unictfrm.h>
extern unichar c;
{unichar localC;
localC = unictfrm_ToLower( c );
}
Transformed unichar value
Transforms a string to upper case.
void unictfrm_StrToUpper( unistring s )
unistring s - unistring to transform
None
#include <unictfrm.h>
extern unistring s;
{unictfrm_StrToUpper( s );
}
None
Transforms a string to lower case.
void unictfrm_StrToLower ( unistring s )
unistring s - unistring to transform
None
#include <unictfrm.h>
extern unistring s;
{unictfrm_StrToLower( s );
}
None
Reports whether a character is upper case.
int unictfrm_IsUpper ( unichar c )
unichar c - character to report on
None
#include <unictfrm.h>
extern unichar c;
{if (unictfrm_IsUpper( c ))
{/* character is uppercase */
}
}
1 = TRUE (if character is upper case)
0 = FALSE (otherwise)
Reports whether a character is lower case.
int unictfrm_IsLower ( unichar c )
unichar c - character to report on
None
#include <unictfrm.h>
extern unichar c;
{if (unictfrm_IsLower( c ))
{/* character is lowercase */
}
}
1 = TRUE (if character is lower case)
0 = FALSE (otherwise)
The following transforms fold upper- and lowercase zenkaku ASCII into the ASCII range. This is useful for parsing identifiers in Japanese systems.
Note: This transform is exact, but not reversible because it folds different characters together.
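The fold can be pictured as a fixed offset between two code ranges. The sketch below assumes the whole fullwidth block U+FF01..U+FF5E is folded onto U+0021..U+007E (an offset of 0xFEE0); the Library may restrict the fold to letters. The typedefs and sketch_ names are assumptions for illustration.

```c
#include <assert.h>

/* Assumed stand-ins; the real types come from the Unilib headers. */
typedef unsigned short unichar;
typedef unichar *unistring;
#define UNINULL ((unichar)0)

/* Folds a fullwidth (zenkaku) ASCII variant onto its ASCII code point. */
unichar sketch_FoldASCII(unichar c)
{
    if (c >= 0xFF01 && c <= 0xFF5E)
        return (unichar)(c - 0xFEE0);
    return c; /* everything else is returned unchanged */
}

/* String form: folds in place; the length never changes. */
void sketch_StrFoldASCII(unistring s)
{
    for (; *s != UNINULL; s++)
        *s = sketch_FoldASCII(*s);
}

void sketch_demo(void)
{
    /* Fullwidth A and plain A fold to the same value, so the fold
       cannot be undone. */
    assert(sketch_FoldASCII(0xFF21) == 0x0041);
    assert(sketch_FoldASCII(0x0041) == 0x0041);
    assert(sketch_FoldASCII(0xFF41) == 0x0061); /* fullwidth a */
}
```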
Folds a zenkaku ASCII character into regular ASCII.
unichar unictfrm_FoldASCII ( unichar c )
unichar c - Character to transform
None
#include <unictfrm.h>
extern unichar c;
{unichar localC;
localC = unictfrm_FoldASCII( c );
}
Transformed unichar value
Folds a zenkaku ASCII string into regular ASCII.
void unictfrm_StrFoldASCII ( unistring s )
unistring s - unistring to transform
None
#include <unictfrm.h>
extern unistring s;
{unictfrm_StrFoldASCII( s );
}
None
The following transforms fold all Compatibility Area half- and full-width characters into their standard Unicode values without lower casing. This includes:
This interface treats half-width katakana folding as follows.
Katakana folding occurs on a character-by-character basis without context-sensitive analysis and replacement. This results in somewhat aberrant katakana encodings for voiced and semi-voiced Japanese syllables converted from half-width katakana, as shown below.
U+FF76 (half-width ka) + U+FF9E (half-width voiced sound mark)
A more appropriate outcome would be to convert to:
U+30AC (Katakana ga)
To do so, however, requires context analysis and changes string length.
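For comparison, a context-sensitive fold could be sketched as follows. This is not something the interfaces in this section do; it is a hypothetical routine, limited to the half-width ka row (U+FF76..U+FF7A), that looks one character ahead for the half-width voiced sound mark and shortens the string when it composes. The typedefs and the function name are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins; the real types come from the Unilib headers. */
typedef unsigned short unichar;
typedef unichar *unistring;
#define UNINULL ((unichar)0)

/* Folds half-width ka, ki, ku, ke, ko in place, composing a following
   U+FF9E into the voiced syllable. Returns the new string length. */
size_t sketch_FoldKaRowWithContext(unistring s)
{
    size_t r = 0; /* read index  */
    size_t w = 0; /* write index */

    while (s[r] != UNINULL) {
        unichar c = s[r++];

        if (c >= 0xFF76 && c <= 0xFF7A) {
            /* Full-width ka, ki, ku, ke, ko sit two code points apart. */
            c = (unichar)(0x30AB + 2 * (c - 0xFF76));
            if (s[r] == 0xFF9E) { /* voiced mark follows: ka -> ga, etc. */
                c++;
                r++;
            }
        }
        s[w++] = c;
    }
    s[w] = UNINULL;
    return w;
}

void sketch_demo(void)
{
    unichar ga[] = { 0xFF76, 0xFF9E, UNINULL };

    assert(sketch_FoldKaRowWithContext(ga) == 1); /* two characters became one */
    assert(ga[0] == 0x30AC && ga[1] == UNINULL);  /* Katakana ga */
}
```

The change in length is exactly what rules this approach out for the length-invariant interfaces described here.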
Normalizes any compatibility area zenkaku and hankaku character to its standard form.
unichar unictfrm_FoldCZone ( unichar c )
unichar c - Character to transform
None
#include <unictfrm.h>
extern unichar c;
{unichar localC;
localC = unictfrm_FoldCZone( c );
}
Transformed unichar value
Normalizes any compatibility area zenkaku and hankaku string to its standard form.
void unictfrm_StrFoldCZone ( unistring s )
unistring s - unistring to transform
None
#include <unictfrm.h>
extern unistring s;
{unictfrm_StrFoldCZone( s );
}
None
Three interfaces provide numeric evaluations of Unicode characters. The first two deal only with integers, and return an error value for fractional Unicode characters. The third returns a float and also handles fractional values for Unicode characters.
This function evaluates the integer value of a Unicode character. It handles all folding properly, evaluating full-width ASCII values without requiring prefolding.
int unictfrm_ToIntValue ( unichar c )
unichar c - Character to evaluate
None
Return value | Input character |
|---|---|
Its integer value, for example, 9 | Numeric Unicode character, for example, U+0039 "9" |
-1 | Non-numeric Unicode characters |
-2 | Numeric with fractional value |
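The three return classes can be sketched with a stand-in that knows only ASCII digits, full-width digits, and the three Latin-1 fractions. The real interface covers many more scripts; the unichar typedef and the sketch_ name are assumptions.

```c
#include <assert.h>

typedef unsigned short unichar; /* assumed stand-in for the Unilib type */

/* Returns 0..9 for a digit, -2 for a fraction character, -1 otherwise. */
int sketch_ToIntValue(unichar c)
{
    if (c >= 0x0030 && c <= 0x0039)
        return c - 0x0030; /* ASCII digits */
    if (c >= 0xFF10 && c <= 0xFF19)
        return c - 0xFF10; /* full-width digits, no prefolding needed */
    if (c >= 0x00BC && c <= 0x00BE)
        return -2;         /* 1/4, 1/2, 3/4: numeric but fractional */
    return -1;             /* not numeric */
}

void sketch_demo(void)
{
    assert(sketch_ToIntValue(0x0039) == 9);  /* "9" */
    assert(sketch_ToIntValue(0xFF19) == 9);  /* full-width 9 */
    assert(sketch_ToIntValue(0x00BE) == -2); /* 3/4 */
    assert(sketch_ToIntValue(0x0041) == -1); /* "A" */
}
```

A caller would typically test for a negative result before using the value, and fall back to the float interface when it sees -2.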
Evaluates the integer value of a Unicode character used as a hex digit. Only valid hex digits (0-9, A-F) are evaluated.
int unictfrm_ToHexValue ( unichar c )
unichar c - Character to evaluate
None
if (unictype_IsHexDigit ( c ) )
n = unictfrm_ToHexValue ( c ) ;
Return value | Input character |
|---|---|
Its integer value, for example, 13 | Hexadecimal numeric Unicode character, for example, U+0044 "D" |
-1 | Non-numeric and non-digit Unicode characters |
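As an alternative to testing with unictype_IsHexDigit first, a caller can check for the -1 return. The sketch below uses a stand-in that accepts only 0-9 and uppercase A-F, as the description states; whether the Library also accepts lowercase or full-width hex digits is not assumed here. The helper sketch_HexPairToByte is hypothetical.

```c
#include <assert.h>

typedef unsigned short unichar; /* assumed stand-in for the Unilib type */

/* Returns 0..15 for 0-9 and A-F, and -1 for anything else. */
int sketch_ToHexValue(unichar c)
{
    if (c >= 0x0030 && c <= 0x0039)
        return c - 0x0030;
    if (c >= 0x0041 && c <= 0x0046)
        return c - 0x0041 + 10;
    return -1;
}

/* Hypothetical helper: combines two hex digits into 0..255, or -1. */
int sketch_HexPairToByte(unichar hi, unichar lo)
{
    int h = sketch_ToHexValue(hi);
    int l = sketch_ToHexValue(lo);

    if (h < 0 || l < 0)
        return -1; /* at least one character was not a hex digit */
    return h * 16 + l;
}

void sketch_demo(void)
{
    assert(sketch_ToHexValue(0x0044) == 13);             /* "D" */
    assert(sketch_HexPairToByte(0x0037, 0x0041) == 0x7A); /* "7A" */
    assert(sketch_HexPairToByte(0x0047, 0x0030) == -1);   /* "G0" */
}
```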
The main use of this transform is to evaluate the float values of Unicode fraction characters. This function handles all folding properly, evaluating full-width ASCII without having to be prefolded.
Because it is slower than unictfrm_ToIntValue, it should only be used when dealing with fractions.
float unictfrm_ToFloatValue ( unichar c )
unichar c - Character to evaluate
None
if ( unictype_IsNumeric (c) )
ff = unictfrm_ToFloatValue (c);
Value of a numeric Unicode character as a float, as shown in the example below:
Return value | Input character |
|---|---|
0.75 | U+00BE "3/4" |
9.0 | U+0039 "9" |
-1.0 | Non-numeric Unicode character |
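The return table can be reproduced with a stand-in that handles digits and the three Latin-1 fraction characters. The unichar typedef and the sketch_ name are assumptions, and the real interface covers far more characters.

```c
#include <assert.h>

typedef unsigned short unichar; /* assumed stand-in for the Unilib type */

/* Returns the numeric value as a float, or -1.0 for non-numeric input. */
float sketch_ToFloatValue(unichar c)
{
    if (c >= 0x0030 && c <= 0x0039)
        return (float)(c - 0x0030); /* ASCII digits */
    if (c >= 0xFF10 && c <= 0xFF19)
        return (float)(c - 0xFF10); /* full-width digits */
    switch (c) {
    case 0x00BC: return 0.25f;      /* 1/4 */
    case 0x00BD: return 0.5f;       /* 1/2 */
    case 0x00BE: return 0.75f;      /* 3/4 */
    default:     return -1.0f;      /* not numeric */
    }
}

void sketch_demo(void)
{
    /* These values are exactly representable, so == is safe here. */
    assert(sketch_ToFloatValue(0x00BE) == 0.75f);
    assert(sketch_ToFloatValue(0x0039) == 9.0f);
    assert(sketch_ToFloatValue(0x0041) == -1.0f);
}
```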
Two interfaces are provided to allow for formatting and parsing of simple positive integers. These are convenience routines for programming, and are not meant to substitute for country-specific locale-based formatting of numbers. Effectively, these constitute Unicode-based equivalents to C library interfaces atoi and itoa, upon which they are loosely based.
Converts a simple numeric Unicode string (for example, 1234 or 007AFFFF) into an unsigned 32-bit integer.
UNICTFRM_RET unictfrm_StrToInt( UInt32 *dest, unistring src, int radix )
UInt32 *dest - pointer to client-supplied variable to fill in
unistring src - unistring to convert
int radix - radix to use in the numeric conversion
The radix may be any value between 2 and 16 inclusive.
This routine does not parse C-style numeric constant conventions. A trailing l, L, u, or U will cause an error. An initial 0 is not taken as indicating an octal string. An initial 0x will cause an error.
#include <unictfrm.h>
extern unistring numStr;
{UInt32 value;
UNICTFRM_RET rc;
rc = unictfrm_StrToInt( &value, numStr, 16 );
if ( rc != UNICTFRM_OK )
{/* handle error */
}
}
Return value | Meaning |
|---|---|
UNICTFRM_OK | String was successfully converted to a number. |
UNICTFRM_BadRadix | Radix outside range 2..16 was provided. |
UNICTFRM_BadInput | Non-parseable string was provided, or string represented a number larger than could be represented by an unsigned 32-bit int. |
UNICTFRM_BadDigit | String contained an invalid digit for the specified radix. |
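One way such a conversion could be structured is sketched below. The typedefs and the numeric values of the UNICTFRM_RET codes are assumptions, as is the split between UNICTFRM_BadInput (characters that are never digits, empty strings, overflow) and UNICTFRM_BadDigit (a digit too large for the radix). Only uppercase A-F are accepted in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins; the real definitions come from the Unilib headers. */
typedef unsigned short unichar;
typedef uint32_t UInt32;
#define UNINULL ((unichar)0)
typedef enum { /* names from this section; numeric values are assumptions */
    UNICTFRM_OK, UNICTFRM_BadRadix, UNICTFRM_BadInput,
    UNICTFRM_BadDigit, UNICTFRM_BufferOverflow
} UNICTFRM_RET;

/* Returns 0..15 for 0-9 and A-F, and -1 for any other character. */
static int sketch_DigitValue(unichar c)
{
    if (c >= 0x0030 && c <= 0x0039) return c - 0x0030;
    if (c >= 0x0041 && c <= 0x0046) return c - 0x0041 + 10;
    return -1;
}

UNICTFRM_RET sketch_StrToInt(UInt32 *dest, const unichar *src, int radix)
{
    UInt32 value = 0;

    if (radix < 2 || radix > 16)
        return UNICTFRM_BadRadix;
    if (dest == NULL || src == NULL || *src == UNINULL)
        return UNICTFRM_BadInput; /* nothing to parse */
    for (; *src != UNINULL; src++) {
        int d = sketch_DigitValue(*src);

        if (d < 0)
            return UNICTFRM_BadInput; /* e.g. the x in 0x1F, or a trailing u */
        if (d >= radix)
            return UNICTFRM_BadDigit; /* e.g. 9 in radix 8 */
        /* Reject value * radix + d > 0xFFFFFFFF before it can wrap. */
        if (value > (UINT32_MAX - (UInt32)d) / (UInt32)radix)
            return UNICTFRM_BadInput;
        value = value * (UInt32)radix + (UInt32)d;
    }
    *dest = value;
    return UNICTFRM_OK;
}

void sketch_demo(void)
{
    /* Character constants assume an ASCII execution character set. */
    const unichar hex[] = { '0', '0', '7', 'A', 'F', 'F', 'F', 'F', UNINULL };
    UInt32 value = 0;

    assert(sketch_StrToInt(&value, hex, 16) == UNICTFRM_OK);
    assert(value == 0x007AFFFFu);
}
```

The destination is written only on success, so a caller can keep a default in place when an error code comes back.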
Formats an unsigned 32-bit integer as a simple numeric Unicode string.
UNICTFRM_RET unictfrm_IntToStr( unistring dest,
int destlen, UInt32 src, int radix, int numDigits )
unistring dest - client-supplied buffer
int destlen - length of client-supplied buffer in unichars
UInt32 src - 32-bit integer to convert
int radix - radix to use in the numeric conversion
int numDigits - number of digits to use in formatting the string
The radix may be any value between 2 and 16 inclusive. If numDigits is set to 0, the number of digits will depend on the size of the number, and the formatted string will not be zero-filled at the left. If numDigits is set to a positive value from 1 to 32, inclusive, the formatted string will always have that number of digits, and will be zero-filled at the left if necessary. This is especially useful for formatting hexadecimal strings.
#include <unictfrm.h>
extern UInt32 number;
{unichar buf[40];
UNICTFRM_RET rc;
rc = unictfrm_IntToStr( buf, 40, number, 16, 8 );
if ( rc != UNICTFRM_OK )
{/* handle error */
}
}
Return value | Meaning |
|---|---|
UNICTFRM_OK | Number was successfully formatted as a string. |
UNICTFRM_BadRadix | Radix outside range 2..16 was provided. |
UNICTFRM_BadDigit | The parameter numDigits is outside the range 0..32. |
UNICTFRM_BufferOverflow | The client buffer was too short to contain the formatted string. |
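The zero-fill and buffer checks can be sketched as follows. The typedefs and the numeric values of the UNICTFRM_RET codes are assumptions. This sketch also assumes that a number needing more digits than numDigits is written in full rather than truncated, which the description above does not settle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins; the real definitions come from the Unilib headers. */
typedef unsigned short unichar;
typedef unichar *unistring;
typedef uint32_t UInt32;
#define UNINULL ((unichar)0)
typedef enum { /* names from this section; numeric values are assumptions */
    UNICTFRM_OK, UNICTFRM_BadRadix, UNICTFRM_BadInput,
    UNICTFRM_BadDigit, UNICTFRM_BufferOverflow
} UNICTFRM_RET;

UNICTFRM_RET sketch_IntToStr(unistring dest, int destlen, UInt32 src,
                             int radix, int numDigits)
{
    static const char digits[] = "0123456789ABCDEF";
    unichar tmp[32]; /* 32 binary digits is the longest possible result */
    int n = 0;       /* significant digits, stored least significant first */
    int width, i;

    if (radix < 2 || radix > 16)
        return UNICTFRM_BadRadix;
    if (numDigits < 0 || numDigits > 32)
        return UNICTFRM_BadDigit;
    do {
        tmp[n++] = (unichar)digits[src % (UInt32)radix];
        src /= (UInt32)radix;
    } while (src != 0);
    width = (numDigits > n) ? numDigits : n; /* assumption: never truncate */
    if (dest == NULL || destlen < width + 1) /* +1 for the terminator */
        return UNICTFRM_BufferOverflow;
    for (i = 0; i < width - n; i++)
        dest[i] = (unichar)'0';              /* zero-fill at the left */
    for (; n > 0; i++)
        dest[i] = tmp[--n];                  /* most significant digit first */
    dest[i] = UNINULL;
    return UNICTFRM_OK;
}

void sketch_demo(void)
{
    unichar buf[40];

    /* 0x7AFFFF in radix 16 with 8 digits gives 007AFFFF. */
    assert(sketch_IntToStr(buf, 40, 0x7AFFFFu, 16, 8) == UNICTFRM_OK);
    assert(buf[0] == '0' && buf[1] == '0' && buf[2] == '7' && buf[3] == 'A');
    assert(buf[7] == 'F' && buf[8] == UNINULL);
}
```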
An implementation of the traditional soundex algorithm is provided to make it easier to support Unicode-based software implementations which require conversion of an ASCII-based soundex.
This implementation of soundex operates directly on a Unicode string, but only interprets the Latin-1 portion of Unicode, for compatibility with existing soundex. More sophisticated algorithms are required for general sound-matching against Unicode textual data.
Provides the soundex key for a Unicode string.
UNICTFRM_RET unictfrm_Soundex ( char *outbuf,
int outbuflen, unistring src, int mode )
char *outbuf - client-supplied char buffer for soundex key
int outbuflen - length of client-supplied buffer in bytes
unistring src - Null-terminated Unicode string to compute soundex key on
int mode - mode for the soundex key generation
The soundex key is a four-character string, null-terminated, of the form A123, S014, T001, etc., where the first character is a letter A..Z, and the next three characters are digits 0..6. In order to hold this key, the client-supplied buffer should be 5 bytes or longer. The default value for a soundex key is Z000, which will be returned for a null string or a string containing no alphabetic letters.
The mode values are interpreted as follows:
Return value | Meaning |
|---|---|
UNICTFRM_OK | Soundex key was successfully generated. |
UNICTFRM_BufferOverflow | The client buffer was too short. |
UNICTFRM_BadMode | Unrecognized mode value was passed in. |
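For orientation, the classic soundex coding can be sketched as below. This is a generic implementation of the traditional American rules, restricted to the letters A..Z, and it may differ from the Library's key generation in detail. It omits the mode parameter, and the typedefs and the numeric values of the UNICTFRM_RET codes are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-ins; the real definitions come from the Unilib headers. */
typedef unsigned short unichar;
#define UNINULL ((unichar)0)
typedef enum { /* names from this section; numeric values are assumptions */
    UNICTFRM_OK, UNICTFRM_BufferOverflow, UNICTFRM_BadMode
} UNICTFRM_RET;

/* Soundex digit for an uppercase letter; 0 marks vowels, H, W, and Y. */
static int sketch_SoundexCode(unichar c)
{
    /*                         ABCDEFGHIJKLMNOPQRSTUVWXYZ */
    static const char codes[] = "01230120022455012623010202";
    return codes[c - 0x0041] - '0';
}

UNICTFRM_RET sketch_Soundex(char *outbuf, int outbuflen, const unichar *src)
{
    char key[5] = "Z000"; /* default key, already zero-padded */
    int n = 0;            /* key characters written so far */
    int last = 0;         /* code of the previous coded letter */

    if (outbuf == NULL || outbuflen < 5)
        return UNICTFRM_BufferOverflow;
    for (; src != NULL && *src != UNINULL && n < 4; src++) {
        unichar c = *src;
        int d;

        if (c >= 0x0061 && c <= 0x007A)
            c = (unichar)(c - 0x0020); /* fold a..z to A..Z */
        if (c < 0x0041 || c > 0x005A)
            continue;                  /* ignore everything but A..Z */
        d = sketch_SoundexCode(c);
        if (n == 0) {
            key[n++] = (char)c;        /* first letter is kept as is */
            last = d;
        } else if (c == 0x0048 || c == 0x0057) {
            continue;                  /* H and W do not separate codes */
        } else if (d == 0) {
            last = 0;                  /* a vowel separates repeated codes */
        } else {
            if (d != last)
                key[n++] = (char)('0' + d);
            last = d;
        }
    }
    memcpy(outbuf, key, 5);
    return UNICTFRM_OK;
}

void sketch_demo(void)
{
    /* Character constants assume an ASCII execution character set. */
    const unichar name[] = { 'R', 'o', 'b', 'e', 'r', 't', UNINULL };
    char key[5];

    assert(sketch_Soundex(key, 5, name) == UNICTFRM_OK);
    assert(strcmp(key, "R163") == 0);
}
```

Starting from the default key means an empty string, or one without letters, falls through to Z000 with no special case.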