
Unilib Reference Manual


Chapter 4

Unicode Compression

Overview

This chapter describes the following topics and associated header files:

Compression: unicmprs.h

This section describes interfaces in the library that allow rapid compression and expansion of Unicode strings.

The compression of such strings into a span-encoded "crunched" form of Unicode works particularly well for data encoded mostly in ISO Latin-1 (such as Windows ANSI and ISO 8859-1) or ASCII proper. Clients can expect a nearly two-to-one compression for such data, with relatively minor computing overhead for the compression and expansion.

Compression and Expansion Strategies

Compression

Unicode is structured into script blocks, with ISO 8859-1 located entirely within a single block of 256 values. This assures that much Unicode data will consist of long contiguous sequences of characters that share the same most significant byte (MSB) value in their full 16-bit encoding. This compression mechanism takes advantage of that fact by run-encoding spans of characters that share the same MSB. The value of the MSB and the number of characters in the span are stored in the data stream. All of the characters in that span are then compressed to their least significant byte (LSB).

Expansion

Expansion is accomplished by identifying each span and reassembling the LSB of each into the full Unicode value by repacking the MSB and LSB into the 16-bit value.

Handling Non-Standard Encoding Sequences

The sentinel value 0xFFFF marks the start of an alternate encoding sequence, that is, a span of compressed Unicode LSB values. This value was chosen because 0xFFFF is not a legal Unicode character and never occurs in a well-formed Unicode string.

The encoding for the span is stored as a single 16-bit value whose MSB is taken as the MSB for all the characters in the compressed span and whose LSB is taken as an unsigned char value indicating the number of characters in the span.

Meaningful values for the MSB and LSB of the span-encoding word are defined as follows.

Table 4-1: Meaningful MSB and LSB Values

Bytes    Integer Constant
MSB      0x00 - 0xFF
LSB      0x06 - 0xFF

These values follow from two constraints. The shortest span for which this scheme saves space is 6 consecutive Unicode characters with the same MSB, and the character count must fit in a single unsigned byte, whose largest value is 0xFF.

Note: Any run of fewer than 6 consecutive Unicode characters with the same MSB is left uncompressed, because applying the algorithm to it would save no storage.
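The layout of the span-encoding word can be illustrated with a few helper functions. This is only a sketch: the type alias unichar16 and the span_word_* names are hypothetical stand-ins chosen for this example and are not declared in unicmprs.h.

```c
#include <stdint.h>

typedef uint16_t unichar16;  /* hypothetical stand-in for the library's unichar */

#define SPAN_MIN_LEN 6       /* shortest span worth compressing (Table 4-1) */

/* Build the span-encoding word: shared MSB in the high byte, count in the low byte. */
static unichar16 span_word_make(uint8_t msb, uint8_t count)
{
    return (unichar16)(((unsigned)msb << 8) | count);
}

/* Extract the MSB shared by every character in the span. */
static uint8_t span_word_msb(unichar16 word)
{
    return (uint8_t)(word >> 8);
}

/* Extract the number of characters in the span. */
static uint8_t span_word_count(unichar16 word)
{
    return (uint8_t)(word & 0xFF);
}

/* A span word is meaningful only if its count lies in 0x06 - 0xFF. */
static int span_word_is_valid(unichar16 word)
{
    return span_word_count(word) >= SPAN_MIN_LEN;
}
```

For example, a span of 7 characters in the Latin-1 block (MSB 0x00) is described by the word 0x0007.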

Spans that contain an odd number of characters are padded with an uninterpreted 0x00 byte. This guarantees that all compression spans are 16-bit aligned, so that any uncompressed Unicode characters that follow are also 16-bit aligned. The resulting compressed string is UNINULL-terminated, so any 16-bit string operation that depends on UNINULL termination is guaranteed to terminate. However, such operations should not be used on compressed Unicode strings, because an LSB-compressed span can contain the byte sequence 0x00 0x00, which would be interpreted as a premature string terminator.
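The span layout described above is enough to sketch how expansion might work. The following is an illustrative reimplementation under stated assumptions, not the library's code: unichar16 and span_expand are hypothetical names, the first LSB of each pair is assumed to sit in the high byte of its 16-bit word, and the input length is passed explicitly because packed LSB data may contain 0x0000 words.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar16;   /* hypothetical stand-in for unichar */

#define SPAN_SENTINEL 0xFFFFu

/*
 * Expand srclen 16-bit words of span-compressed data into dest.
 * Returns the number of unichars written, or -1 if the input is
 * malformed or dest is too small.
 */
static long span_expand(unichar16 *dest, size_t destlen,
                        const unichar16 *src, size_t srclen)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        if (src[in] != SPAN_SENTINEL) {
            /* Ordinary uncompressed character: copy it through. */
            if (out >= destlen) return -1;
            dest[out++] = src[in++];
            continue;
        }
        /* Sentinel: the next word holds the MSB (high byte) and count (low byte). */
        if (in + 1 >= srclen) return -1;
        unichar16 msb = (unichar16)(src[in + 1] & 0xFF00u);
        size_t count = src[in + 1] & 0x00FFu;
        size_t words = (count + 1) / 2;      /* odd counts carry one pad byte */
        in += 2;
        if (in + words > srclen || out + count > destlen) return -1;
        for (size_t i = 0; i < count; i++) {
            unichar16 packed = src[in + i / 2];
            /* Even index: high byte of the word; odd index: low byte. */
            unichar16 lsb = (i % 2 == 0) ? (unichar16)(packed >> 8)
                                         : (unichar16)(packed & 0x00FFu);
            dest[out++] = (unichar16)(msb | lsb);
        }
        in += words;
    }
    return (long)out;
}
```

The pad byte of an odd-length span is skipped rather than interpreted, which matches the description of it as uninterpreted.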

Clients with compressed Unicode strings should keep track of their length (returned by the compression interface) and move them with memcpy or other such operations that do not depend on a UNINULL termination value.
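A length-driven copy might look like the following sketch. The names unichar16 and compressed_dup are hypothetical; the only assumption is that the caller kept the length, in unichars, that the compression interface returned.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint16_t unichar16;   /* hypothetical stand-in for unichar */

/*
 * Duplicate a compressed string using its tracked length instead of
 * scanning for UNINULL. Returns NULL if allocation fails; the caller
 * frees the copy.
 */
static unichar16 *compressed_dup(const unichar16 *src, size_t len_in_unichars)
{
    unichar16 *copy = malloc(len_in_unichars * sizeof *copy);
    if (copy != NULL) {
        memcpy(copy, src, len_in_unichars * sizeof *copy);
    }
    return copy;
}
```

A UNINULL-driven copy would stop early on a span such as 0xFFFF 0x0106 0x0000 0x4142 0x4344, whose first packed word is 0x0000.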

Table 4-2 compares uncompressed Unicode to compressed Unicode.

Table 4-2: Uncompressed vs. compressed Unicode

Uncompressed

U+XXXX U+0055 U+006E U+0069 U+0063 U+006F U+0064 U+0065 U+XXXX U+0000
       ^
       Start of a sequence of 6 or more Unicode characters with MSB 0x00.

Compressed

U+XXXX 0xFFFF 0x0007 0x556E 0x6963 0x6F64 0x6500 U+XXXX U+0000
       ^      ^      ^                        ^  ^
       1      2      3                        4  5

1  Span start
2  MSB/N
3  Compressed LSB sequence
4  Pad byte
5  Resume Unicode
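The transformation in Table 4-2 can be reproduced with a short sketch of the compression side. This is an illustrative reimplementation of the scheme as described in this section, not the library's code: unichar16 and span_compress are hypothetical names, the source length excludes the terminator, and the sketch appends the UNINULL itself.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar16;   /* hypothetical stand-in for unichar */

#define SPAN_SENTINEL 0xFFFFu
#define SPAN_MIN_LEN  6       /* shorter runs are copied uncompressed */
#define SPAN_MAX_LEN  255     /* count must fit in one unsigned byte */

/*
 * Span-compress srclen unichars (terminator excluded) into dest.
 * Returns the number of 16-bit words written, including the UNINULL,
 * or -1 if dest is too small or the input contains 0xFFFF.
 */
static long span_compress(unichar16 *dest, size_t destlen,
                          const unichar16 *src, size_t srclen)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        if (src[in] == SPAN_SENTINEL) return -1;  /* may already be compressed */

        /* Measure the run of characters that share this MSB. */
        unichar16 msb = (unichar16)(src[in] & 0xFF00u);
        size_t run = 1;
        while (in + run < srclen && run < SPAN_MAX_LEN &&
               src[in + run] != SPAN_SENTINEL &&
               (src[in + run] & 0xFF00u) == msb) {
            run++;
        }

        if (run < SPAN_MIN_LEN) {
            /* Too short to save space: copy the run through unchanged. */
            if (out + run > destlen) return -1;
            for (size_t i = 0; i < run; i++) dest[out++] = src[in++];
            continue;
        }

        /* Emit sentinel, MSB/count word, then LSBs packed two per word. */
        size_t words = (run + 1) / 2;
        if (out + 2 + words > destlen) return -1;
        dest[out++] = SPAN_SENTINEL;
        dest[out++] = (unichar16)(msb | run);
        for (size_t i = 0; i < run; i += 2) {
            unichar16 hi = (unichar16)(src[in + i] & 0x00FFu);
            /* An odd-length span is padded with an uninterpreted 0x00. */
            unichar16 lo = (i + 1 < run)
                         ? (unichar16)(src[in + i + 1] & 0x00FFu) : 0;
            dest[out++] = (unichar16)((hi << 8) | lo);
        }
        in += run;
    }
    if (out >= destlen) return -1;
    dest[out++] = 0x0000;     /* UNINULL terminator */
    return (long)out;
}
```

Given U+0416 followed by the seven characters of "Unicode" and another U+0416, the sketch emits 0x0416 0xFFFF 0x0007 0x556E 0x6963 0x6F64 0x6500 0x0416 0x0000, which is the compressed row of Table 4-2 with U+XXXX taken as U+0416.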

The compression for a 7-character sequence is minimal, but the longer the contiguous span, the greater the compression (Table 4-3).

Table 4-3: Compression for a 7- to 256-character sequence

Contiguous Span (unichars)   Uncompressed (bytes)   Compressed (bytes)   Compressed Size (% of uncompressed)
7                            14                     12                   86%
16                           32                     20                   62%
256                          512                    260                  51%
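The byte counts in Table 4-3 follow from the span layout: 2 bytes for the sentinel, 2 bytes for the span-encoding word, one byte per character, and one pad byte when the count is odd. The sketch below reproduces that arithmetic; the function names are hypothetical and exist only for this example.

```c
#include <stddef.h>

/*
 * Bytes occupied by one compressed span of n characters:
 * 2 (sentinel) + 2 (MSB/count word) + n LSBs + 1 pad byte if n is odd.
 */
static size_t span_compressed_bytes(size_t n)
{
    return 4 + n + (n % 2);
}

/* Compressed size as a fraction of the uncompressed size (2 bytes per character). */
static double span_size_ratio(size_t n)
{
    return (double)span_compressed_bytes(n) / (double)(2 * n);
}
```

The ratio approaches one half as the span grows, which is the nearly two-to-one compression described for Latin-1 and ASCII data.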

Note: Clients should never try to dip into a compressed Unicode string and directly access characters.

The compressed byte values are stored in 16-bit values, which are interpreted as unsigned shorts. This means that the actual order of the bytes depends on the native endianness of a machine's int storage. As a result, any byte swapping issues for compressed Unicode strings are the same as those for uncompressed Unicode strings.

The same block swapping algorithms that handle arrays of datatypes built on shorts or ints on a particular platform can be applied equally to uncompressed or compressed Unicode strings.
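A block swap of this kind might be sketched as follows. The names unichar16 and unichar_swap_bytes are hypothetical; the length is passed explicitly so that the routine works on compressed strings, which may contain 0x0000 words before the terminator.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar16;   /* hypothetical stand-in for unichar */

/*
 * Swap the two bytes of every 16-bit word in place. The same routine
 * serves compressed and uncompressed strings because both are stored
 * as arrays of 16-bit values.
 */
static void unichar_swap_bytes(unichar16 *buf, size_t len_in_unichars)
{
    for (size_t i = 0; i < len_in_unichars; i++) {
        buf[i] = (unichar16)((buf[i] << 8) | (buf[i] >> 8));
    }
}
```

Applying the swap twice restores the original buffer.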

Compression and Expansion Interfaces

The following pages describe the compression and expansion interfaces for Unicode text strings.

unicmprs_strCompress

Function

Compresses an entire Unicode string into the client's destination buffer and returns the length of the compressed string in unichars.

Syntax

int unicmprs_strCompress ( unistring dest, int destlen, const unistring source, unistring *errorPoint )

Parameters

unistring dest - Pointer to client destination buffer

int destlen - Length of client destination buffer in unichars

const unistring source - Source unistring to compress

unistring *errorPoint - Output parameter that will point to error position in the string if an error occurs

Comments

Example

#include <unicmprs.h>
#define BUFLEN (1024) /* for example */
extern unistring s;
{
    int rc;
    unistring errString;
    unichar destBuf[BUFLEN];
    rc = unicmprs_strCompress( destBuf, BUFLEN, s, &errString );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-4: Return codes for unicmprs_strCompress

Return: Positive value (number of characters in the output string, including UNINULL)
    Meaning: Compression successful
    Error Condition: N/A

Return: 1
    Meaning: Input was zero-length unistring
    Result: Output is zero-length null-terminated string
    Error Condition: N/A

Return: -1
    Meaning: Conversion failed
    Error Condition: String may already be compressed (embedded 0xFFFF detected)

Return: -2
    Meaning: Compression exceeded destlen of provided buffer
    Result: Sets errorPoint to the point in the input string where the overflow started
    Error Condition: Client buffer is too short to contain compressed string

See Also

unicmprs_strExpand

unicmprs_strCompressInPlace

Function

The semantics of this interface are identical to unicmprs_strCompress, except that the source string is compressed in place.

Syntax

int unicmprs_strCompressInPlace ( unistring source, unistring *errorPoint )

Parameters

unistring source - Source unistring to compress

unistring *errorPoint - Output parameter that will point to error position in the string should an error occur

Comments

The source pointer must be a UNINULL-terminated unistring. The source must not be a const unistring since the contents will be overwritten during compression.

Client buffer overruns cannot occur since a compressed string is never longer than the source string.

Note: Preconditions are untested. Failure to meet these input preconditions may result in undefined behavior.

Example

#include <unicmprs.h>
extern unistring s;
{
    int rc;
    unistring errString;
    rc = unicmprs_strCompressInPlace( s, &errString );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-5: Return codes for unicmprs_strCompressInPlace

Return: Positive value (number of characters in the output string, including UNINULL)
    Meaning: Compression successful
    Error Conditions: N/A

Return: 1
    Meaning: Input was zero-length unistring
    Result: Output is zero-length null-terminated string
    Error Conditions: N/A

Return: -1
    Meaning: Conversion failed
    Error Conditions: String may already be compressed (embedded 0xFFFF detected)

See Also

unicmprs_strCompress

unicmprs_strExpand

Function

Expands an entire Unicode-compressed string into the client's destination buffer in the form of unichars.

Syntax

int unicmprs_strExpand ( unistring dest, int destlen, const unistring source, unistring *errorPoint )

Parameters

unistring dest - Pointer to client destination buffer

int destlen - Length of client destination buffer in unichars

const unistring source - Source unistring to expand

unistring *errorPoint - Output parameter that will point to error position in the string if an error occurs

Comments

Clients must provide valid destination and source pointers and a meaningful destination length greater than 1.

Note: Preconditions are untested. Failure to meet these input preconditions may result in undefined behavior.

Example

#include <unicmprs.h>
#define BUFLEN (1024) /* for example */
extern unistring s;
{
    int rc;
    unistring errString;
    unichar destBuf[BUFLEN];
    rc = unicmprs_strExpand( destBuf, BUFLEN, s, &errString );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-6: Return codes for unicmprs_strExpand

Return: Positive value (number of characters in the output string, including UNINULL)
    Meaning: Successful expansion
    Result: String will be UNINULL-terminated. Length calculated including the UNINULL

Return: 1
    Meaning: Input was zero-length unistring
    Result: Output is zero-length null-terminated string
    Error Conditions: N/A

Return: -1
    Meaning: Conversion failed
    Result:
      • errorPoint is set to the position in the input string where the failure occurred
      • Destination buffer may be arbitrarily truncated, but it is always UNINULL-terminated
    Error Conditions:
      • 0xFFFF in string, but following value indicates a span length to be decompressed that is longer than the actual span
      • Expansion resulted in illegal characters, for example, U+FF01 U+FFFF

Return: -2
    Meaning: Conversion failed
    Result: Sets errorPoint to the point in the input string where the overflow started
    Error Conditions: Client buffer too short to contain expanded string

See Also

unicmprs_strCompress

unicmprs_strLength

Function

Reports the length of a compressed Unicode string without actually expanding it into a buffer.

Note: This interface should not be used to determine whether or not a string is compressed.

Syntax

int unicmprs_strLength ( const unistring s )

Parameters

const unistring s - unistring for which the length must be calculated

Comments

The length reported for a compressed Unicode string is equivalent to unistrlen + 1 on the expanded string (it includes the terminating UNINULL). Therefore, if buffer size is an issue, the value reported by unicmprs_strLength is the correct number of unichars to allocate. Use it when a client simply wants to know the expanded size, or needs the exact size before allocating a buffer for expansion.

If handed an uncompressed Unicode string, this interface reports a value equal to unistrlen + 1 on that same string.
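One way such a length could be computed without expanding is to walk the string and skip over each span's packed data. The sketch below is an illustration under stated assumptions, not the library's implementation: unichar16 and span_expanded_length are hypothetical names, and the input is assumed to be well formed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar16;   /* hypothetical stand-in for unichar */

#define SPAN_SENTINEL 0xFFFFu

/*
 * Return the expanded length in unichars, plus one for the UNINULL.
 * Packed span data is skipped as a block, so 0x0000 words inside a
 * span are never mistaken for the terminator.
 */
static long span_expanded_length(const unichar16 *s)
{
    size_t in = 0;
    long length = 0;

    for (;;) {
        if (s[in] == 0x0000) {
            return length + 1;            /* count the terminating UNINULL */
        }
        if (s[in] != SPAN_SENTINEL) {
            length += 1;                  /* ordinary uncompressed character */
            in += 1;
            continue;
        }
        /* Sentinel: the low byte of the next word is the character count. */
        size_t count = s[in + 1] & 0x00FFu;
        length += (long)count;
        in += 2 + (count + 1) / 2;        /* skip sentinel, span word, packed LSBs */
    }
}
```

On an uncompressed string the loop never meets a sentinel, so the result equals the character count plus one, consistent with the behavior described above.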

Example

#include <unicmprs.h>
extern unistring s;
{
    int rc;
    rc = unicmprs_strLength( s );
    /* rc is the uncompressed length of s in unichars + 1 */
}

Returns

Returns the length of the expanded string in unichars, plus one for the terminal UNINULL.

unicmprs_strIsCompressed

Function

Scans for the sentinel value 0xFFFF to verify whether one or more spans have been compressed.

Syntax

int unicmprs_strIsCompressed ( const unistring s )

Parameters

const unistring s - unistring to check for compression

Comments

Example

#include <unicmprs.h>
extern unistring s;
{
    if ( unicmprs_strIsCompressed( s ) )
    {
        /* 0xFFFF was detected as signal of compression */
    }
}

Returns

1 = TRUE if compressed

0 = FALSE if uncompressed
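The scan can be sketched in a few lines. The names unichar16 and span_is_compressed are hypothetical and the code is an illustration, not the library's implementation. Scanning up to the first UNINULL is safe here because the sentinel of the first span always precedes any packed LSB data that could contain a 0x0000 word.

```c
#include <stdint.h>

typedef uint16_t unichar16;   /* hypothetical stand-in for unichar */

#define SPAN_SENTINEL 0xFFFFu

/* Return 1 if a 0xFFFF sentinel occurs before the first UNINULL, else 0. */
static int span_is_compressed(const unichar16 *s)
{
    for (; *s != 0x0000; s++) {
        if (*s == SPAN_SENTINEL) {
            return 1;
        }
    }
    return 0;
}
```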

Reuters Compression Scheme for Unicode: unircsu.h

This section describes the Sybase implementation of the Reuters Compression Scheme for Unicode (RCSU) algorithm. This algorithm allows rapid compression of Unicode null-terminated strings or arbitrary buffers of Unicode character values by using a sliding window. These interfaces also allow the strings and buffers to be expanded back into the canonical Unicode form.

The Reuters Compression Scheme for Unicode uses a heuristic mechanism to move the sliding window dynamically to produce the best compression results. As such, it can generally compress Unicode data better than the unicmprs interfaces. In the worst case, however, when no compression is possible, the unircsu algorithm emits a sentinel tag byte to indicate that fact, so the data may actually expand by one byte.

The unicmprs interface is best suited to implementation of compression in a string class, where the underlying type of the string data store is maintained as the same unsigned 16-bit data type.

The unircsu interface is best suited to compressing Unicode data for streaming interfaces, such as passing Unicode data through a bandwidth-limited communications protocol.

The Reuters interfaces return the codes listed in Table 4-7.

Table 4-7: Return codes for unircsu.h

Return                  Value   Meaning
UNIRCSU_BufferOverrun   -1      Client buffer was too short
UNIRCSU_IllegalChar     -2      Illegal Unicode value encountered (0xFFFE or 0xFFFF)
UNIRCSU_IllegalData     -3      Unterminated code or otherwise invalid byte encountered in RCSU data stream
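A client might centralize handling of these codes in a small helper. In the sketch below the three constants are redeclared with the values from Table 4-7 only so that the example compiles on its own; in real code they would come from unircsu.h. The helper name rcsu_describe is hypothetical.

```c
/* Values copied from Table 4-7; redeclared here only for a standalone sketch. */
enum {
    UNIRCSU_BufferOverrun = -1,
    UNIRCSU_IllegalChar   = -2,
    UNIRCSU_IllegalData   = -3
};

/* Map a unircsu return code to a short description (hypothetical helper). */
static const char *rcsu_describe(int rc)
{
    if (rc > 0) {
        return "success";   /* positive values are lengths */
    }
    switch (rc) {
    case UNIRCSU_BufferOverrun:
        return "client buffer was too short";
    case UNIRCSU_IllegalChar:
        return "illegal Unicode value encountered (0xFFFE or 0xFFFF)";
    case UNIRCSU_IllegalData:
        return "unterminated code or invalid byte in RCSU data stream";
    default:
        return "unknown return code";
    }
}
```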

unircsu_strCompress

Function

Compresses an entire Unicode UNINULL-terminated string into the client destination buffer.

Syntax

int unircsu_strCompress ( UChar8 *dest, int destlen, const unistring source, unistring *errorPoint )

Parameters

UChar8 *dest - Pointer to client destination buffer

int destlen - Length of client destination buffer in bytes

const unistring source - Source unistring to compress

unistring *errorPoint - Output parameter that will point to the error position in the string if an error occurs

Comments

On successful compression, the interface returns the length in bytes of the converted string placed into dest. This string is terminated with one or two NULL bytes, depending on which mode the compression was in when the end of the data was reached. The returned length includes any terminating NULLs.

Note: These preconditions are not tested. Failure to meet these input preconditions will result in undefined behavior.

Example

#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern unistring s;
{
    int rc;
    unistring errString;
    UChar8 destBuf[BUFLEN];
    rc = unircsu_strCompress( destBuf, BUFLEN, s, &errString );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-8: Return codes for unircsu_strCompress

Return: Positive value
    Result: Length of the converted string (in bytes), including any terminating NULL bytes

Return: 1
    Result: An input of a zero-length unistring will result in the output of a zero-length null-terminated string, allowing fail-safe error recovery mechanisms
    Meaning: Signifies the number of characters in the output string, including the single NULL

Return: Negative value
    Result: errorPoint is set to the position in the input string where the failure occurred and the contents of the destination buffer remain undefined
    Meaning: Failure to convert

Return: UNIRCSU_BufferOverrun
    Result: Sets errorPoint to the point in the input string where the overflow starts
    Meaning: Client buffer is too short to contain the compressed string

Return: UNIRCSU_IllegalChar
    Meaning: Input data contains illegal non-characters (0xFFFE or 0xFFFF)

See Also

unircsu_strExpand

unircsu_strLength

Function

Reports the length of a Unicode string compressed with the RCSU algorithm without expanding it into a buffer.

Syntax

int unircsu_strLength ( const UChar8 *s )

Parameters

const UChar8 *s - Compressed string to measure

Comments

This interface should be used only with RCSU data that have been compressed with unircsu_strCompress, because it assumes that the original data were terminated with a UNINULL.

Example

#include <unircsu.h>
extern UChar8* s;
{
    int rc;
    rc = unircsu_strLength( s );
    /* rc is the uncompressed length of s in unichars + 1 */
}

Returns

Table 4-9: Return codes for unircsu_strLength

Return           Result
Positive value   Length of the converted string in unichars, including the terminating UNINULL

The length of the converted string is equivalent to unistrlen + 1 on the expanded string, including the terminating UNINULL. Therefore, if buffer size is an issue, the value reported by unircsu_strLength is correct for allocation. This value can be used if a client simply wants to know what the expanded size would be or needs to calculate the size before allocating a buffer for expansion.

unircsu_strExpand

Function

Expands an entire RCSU compressed string into the client's destination buffer.

Syntax

int unircsu_strExpand ( unistring dest, int destlen, const UChar8 *source, const UChar8 **errorPoint )

Parameters

unistring dest - unistring destination

int destlen - Length of client destination buffer in unichars

const UChar8 *source - Source string to expand

const UChar8 **errorPoint - Output parameter that will point to the error position in the string if an error occurs

Comments

This interface is the companion to unircsu_strCompress: it expects to terminate when it finds the compressed analog of the terminal UNINULL of the original unistring. It should not be used to expand RCSU data that was compressed with unircsu_dataCompress, since that data may contain arbitrarily embedded UNINULL values. Clients must provide valid destination and source pointers and a meaningful destlen greater than 1.

Note: These preconditions are not tested in the interest of interface speed. Failure to meet these input preconditions may result in undefined behavior.

Example

#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern UChar8* s;
{
    int rc;
    const UChar8* errString; /* matches the const UChar8 **errorPoint parameter */
    unichar destBuf[BUFLEN];
    rc = unircsu_strExpand( destBuf, BUFLEN, s, &errString );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-10: Return codes for unircsu_strExpand

Return: Positive value
    Result: Length of the converted string in unichars, including the terminating UNINULL.
    Meaning: Successful expansion. This string will be UNINULL-terminated and the length is calculated, including the UNINULL. errorPoint is undefined.

Return: 1
    Result: Zero-length null-terminated unistring.
    Meaning: Equals the number of characters in the output string, including the UNINULL.

Return: Negative value
    Result: errorPoint is set to the position in the input string where the failure occurred; destination buffer may be arbitrarily truncated, but is always UNINULL-terminated.
    Meaning: Failure to convert.

Return: UNIRCSU_BufferOverrun
    Result: Sets errorPoint to the point in the input string where the failure occurred.
    Meaning: Client buffer is too short to contain the expanded string.

Return: UNIRCSU_IllegalChar
    Meaning: Output data contains illegal non-characters (0xFFFE or 0xFFFF).

See Also

unircsu_strCompress

unircsu_dataCompress

Function

Compresses an entire arbitrary data buffer of Unicode characters into the client's destination buffer, using the general RCSU encoding algorithm.

Syntax

int unircsu_dataCompress ( UChar8 *dest, int destlen, const unichar *source, int sourcelen, const unichar **errorPoint )

Parameters

UChar8 *dest - Pointer to client destination buffer

int destlen - Length of client destination buffer in bytes

const unichar *source - Source unistring to compress

int sourcelen - Length of source in unichars

const unichar **errorPoint - Output parameter that will point to the error position in the data if an error occurs

Comments

The semantics of this interface are the same as those of unircsu_strCompress, except that the source is a pointer to a buffer holding an arbitrary collection of Unicode characters. The sourcelen parameter defines the length of the data in unichars. No assumptions are made about null-termination, and any embedded UNINULLs are compressed along with the rest of the data.

Example

#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern unichar* data;
extern int datalen;
{
    int rc;
    const unichar* errPoint; /* matches the const unichar **errorPoint parameter */
    UChar8 destBuf[BUFLEN];
    rc = unircsu_dataCompress( destBuf, BUFLEN, data, datalen, &errPoint );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-11: Return codes for unircsu_dataCompress

Return: Positive value
    Result: Length of the converted data in bytes

Return: UNIRCSU_BufferOverrun
    Meaning: Client buffer is too short to contain the compressed data

Return: UNIRCSU_IllegalChar
    Meaning: Input data contains illegal non-characters (0xFFFE or 0xFFFF)

unircsu_dataExpand

Function

Expands an arbitrary data buffer of RCSU compressed data into the client's destination buffer, using the general RCSU encoding algorithm.

Syntax

int unircsu_dataExpand ( unichar *dest, int destlen, const UChar8 *source, int sourcelen, const UChar8 **errorPoint )

Parameters

unichar *dest - Pointer to client destination buffer

int destlen - Length of client destination buffer in unichars

const UChar8 *source - Pointer to data buffer to expand

int sourcelen - Length of source data buffer

const UChar8 **errorPoint - Output parameter that will point to the error position in the buffer if an error occurs

Comments

Example

#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern UChar8* data;
extern int datalen;
{
    int rc;
    const UChar8* errPoint; /* matches the const UChar8 **errorPoint parameter */
    unichar destBuf[BUFLEN];
    rc = unircsu_dataExpand( destBuf, BUFLEN, data, datalen, &errPoint );
    if ( rc < 0 )
    {
        /* Do error processing */
    }
}

Returns

Table 4-12: Return codes for unircsu_dataExpand

Return: Positive value
    Result: Length of the converted data in unichars

Return: UNIRCSU_BufferOverrun
    Meaning: Client buffer is too short to contain the expanded data

Return: UNIRCSU_IllegalChar
    Meaning: Output data contains illegal non-characters (0xFFFE or 0xFFFF)

Return: UNIRCSU_IllegalData
    Meaning: Input data contains illegal values (for example, only the first byte of a required two-byte sequence)

