Unilib Reference Manual
Chapter 4: Unicode Compression
This chapter describes the following topics and associated header files:
This section describes interfaces in the library that allow rapid compression and expansion of Unicode strings.
The compression of such strings into a span-encoded "crunched" form of Unicode works particularly well for data encoded mostly in ISO Latin-1 (such as Windows ANSI and ISO 8859-1) or ASCII proper. Clients can expect a nearly two-to-one compression for such data, with relatively minor computing overhead for the compression and expansion.
Unicode is structured into script blocks, with ISO 8859-1 located entirely within a single block of 256 values. This assures that much Unicode data will consist of long contiguous sequences of characters that share the same most significant byte (MSB) value in their full 16-bit encoding. This compression mechanism takes advantage of that fact by run-encoding spans of characters that share the same MSB. The value of the MSB and the number of characters in the span are stored in the data stream. All of the characters in that span are then compressed to their least significant byte (LSB).
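The run detection described above can be sketched in a few lines of C. This is an illustration only, not the library's code: the helper name same_msb_run and the use of uint16_t in place of the library's unichar type are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count how many consecutive 16-bit characters,
   starting at s[0], share the most significant byte of s[0].
   Stops at the 0x0000 terminator or after `max` characters. */
static size_t same_msb_run(const uint16_t *s, size_t max)
{
    size_t n = 0;
    uint8_t msb;

    if (max == 0 || s[0] == 0x0000)
        return 0;

    msb = (uint8_t)(s[0] >> 8);
    while (n < max && s[n] != 0x0000 && (uint8_t)(s[n] >> 8) == msb)
        n++;
    return n;
}
```

A run that is long enough becomes a candidate span; shorter runs are left as ordinary 16-bit characters.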
Expansion is accomplished by identifying each span and reassembling the LSB of each into the full Unicode value by repacking the MSB and LSB into the 16-bit value.
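As a rough sketch of the repacking step, assuming the LSBs of a span have already been pulled out into a byte array, each character can be rebuilt with a shift and an OR. The function name repack_span is hypothetical and uint16_t stands in for the library's unichar type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rebuild `count` full 16-bit characters.
   `lsb` holds one low byte per character; `msb` is the high byte
   shared by every character in the span. */
static void repack_span(uint16_t *dest, uint8_t msb,
                        const uint8_t *lsb, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        dest[i] = (uint16_t)(((uint16_t)msb << 8) | lsb[i]);
}
```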
The sentinel value 0xFFFF is used to indicate that an alternate encoding sequence has been encountered, in other words, a span of compressed Unicode LSB values. This value was chosen because it is defined as an illegal Unicode character value and is never encountered in a well-formed Unicode string.
The encoding for the span is stored as a single 16-bit value whose MSB is taken as the MSB for all the characters in the compressed span and whose LSB is taken as an unsigned char value indicating the number of characters in the span.
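The layout of the span-encoding word can be illustrated with three small helpers. The names below are invented for this sketch and are not part of the unicmprs interface.

```c
#include <stdint.h>

/* Hypothetical helpers for the 16-bit span-encoding word:
   high byte = MSB shared by the span, low byte = character count. */
static uint16_t span_word_make(uint8_t msb, uint8_t count)
{
    return (uint16_t)(((uint16_t)msb << 8) | count);
}

static uint8_t span_word_msb(uint16_t word)
{
    return (uint8_t)(word >> 8);
}

static uint8_t span_word_count(uint16_t word)
{
    return (uint8_t)(word & 0xFF);
}
```

For example, a span of 7 characters that all have an MSB of 0x00 would be described by the word 0x0007.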
Meaningful values for the MSB and LSB of the span-encoding word are defined as follows.
Bytes | Integer Constant |
|---|---|
MSB | 0x00 - 0xFF |
LSB | 0x06 - 0xFF |
The lower bound of 6 is the minimum span for which this scheme provides compression: 6 Unicode characters in a row with the same MSB. The upper bound is the largest count that fits in a single unsigned byte (0xFF).
Note: Any run of fewer than 6 Unicode characters in a row with the same MSB is not compressed, since no storage savings would result from applying the algorithm in such a case.
Spans that contain an odd number of characters are padded out with an uninterpreted 0x00 byte. This guarantees that all compression spans are 16-bit aligned, so that any uncompressed Unicode characters that follow are also 16-bit aligned. The resulting compressed string is UNINULL-terminated, so any 16-bit string operation that depends on UNINULL termination is guaranteed to stop at or before the end of the compressed string. However, such operations should not be used on compressed Unicode strings, because the compression mechanism can produce 0x00 0x00 sequences inside an LSB-compressed span, and these would be interpreted as a premature string termination.
Clients with compressed Unicode strings should keep track of their length (returned by the compression interface) and move them with memcpy or other such operations that do not depend on a UNINULL termination value.
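One way to follow this advice is to keep the length next to the data and copy by explicit size. The sketch below is only an illustration: the crunched_buf structure, its fixed capacity, and the crunched_copy helper are hypothetical, and uint16_t stands in for unichar.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder that pairs compressed data with its length in
   16-bit units, as returned by the compression interface. */
typedef struct {
    uint16_t data[64];
    int      len;       /* includes the terminating UNINULL */
} crunched_buf;

/* Copy by explicit length rather than by scanning for a terminator.
   Returns 0 on success, -1 if the data does not fit. */
static int crunched_copy(crunched_buf *dst, const uint16_t *src, int len)
{
    if (len < 0 || (size_t)len > sizeof dst->data / sizeof dst->data[0])
        return -1;

    memcpy(dst->data, src, (size_t)len * sizeof src[0]);
    dst->len = len;
    return 0;
}
```

A span of six U+0100 characters, for instance, packs into three 0x0000 words, so a terminator-based copy would stop after the span-encoding word and lose the rest of the string.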
Table 4-2 compares uncompressed Unicode to compressed Unicode.
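Uncompressed:

U+XXXX U+0055 U+006E U+0069 U+0063 U+006F U+0064 U+0065 U+XXXX U+0000

U+0055 starts a run of 7 Unicode characters whose MSB is 0x00.

Compressed:

U+XXXX 0xFFFF 0x0007 0x556E 0x6963 0x6F64 0x6500 U+XXXX U+0000

Value | Role |
|---|---|
0xFFFF | Span sentinel |
0x0007 | MSB/N: MSB 0x00, count 7 |
0x556E 0x6963 0x6F64 | Compressed LSB values |
0x6500 | Last LSB (0x65) followed by the 0x00 pad byte |
U+XXXX | Uncompressed Unicode resumes |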
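To make the layout concrete, here is a minimal compressor sketch that reproduces the compressed sequence shown for the word "Unicode". It is not the library's implementation: the name sketch_compress, the constants, and the choice of return codes are assumptions, and uint16_t stands in for unichar. The first character of each pair is packed into the high byte of the 16-bit word, which is what the example sequence implies.

```c
#include <stddef.h>
#include <stdint.h>

#define SPAN_SENTINEL 0xFFFFu
#define SPAN_MIN      6u     /* shortest run worth compressing */
#define SPAN_MAX      255u   /* largest count that fits in one byte */

/* Sketch only. Returns the number of 16-bit units written, including
   the 0x0000 terminator; -1 if the source contains 0xFFFF; -2 if
   dest is too small (codes are illustrative assumptions). */
static int sketch_compress(uint16_t *dest, int destlen, const uint16_t *src)
{
    size_t out = 0, cap, run, words, i;
    uint8_t msb, hi, lo;

    if (destlen < 1)
        return -2;
    cap = (size_t)destlen;

    while (*src != 0x0000) {
        if (*src == SPAN_SENTINEL)
            return -1;

        /* Measure the run of characters sharing this MSB. */
        msb = (uint8_t)(*src >> 8);
        run = 0;
        while (run < SPAN_MAX && src[run] != 0x0000 &&
               src[run] != SPAN_SENTINEL &&
               (uint8_t)(src[run] >> 8) == msb)
            run++;

        if (run < SPAN_MIN) {
            /* Too short to save space: copy one character unchanged,
               keeping one unit in reserve for the terminator. */
            if (out + 2 > cap)
                return -2;
            dest[out++] = *src++;
            continue;
        }

        /* Sentinel + span word + packed LSBs (+ terminator reserve). */
        words = (run + 1) / 2;
        if (out + 2 + words + 1 > cap)
            return -2;
        dest[out++] = SPAN_SENTINEL;
        dest[out++] = (uint16_t)(((uint16_t)msb << 8) | run);
        for (i = 0; i < run; i += 2) {
            hi = (uint8_t)(src[i] & 0xFF);
            lo = (i + 1 < run) ? (uint8_t)(src[i + 1] & 0xFF) : 0x00; /* pad */
            dest[out++] = (uint16_t)(((uint16_t)hi << 8) | lo);
        }
        src += run;
    }

    dest[out++] = 0x0000;
    return (int)out;
}
```

The sketch reports where no error position, unlike the documented interfaces, which set errorPoint; that detail is omitted to keep the example short.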
The compression for a 7-character sequence is minimal, but the longer the contiguous span, the greater the compression (Table 4-3).
Contiguous Span (unichars) | Uncompressed (bytes) | Compressed (bytes) | Compressed Size (% of Uncompressed) |
|---|---|---|---|
7 | 14 | 12 | 86% |
16 | 32 | 20 | 62% |
256 | 512 | 260 | 51% |
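The byte counts in the table follow from simple arithmetic: two bytes for the sentinel, two for the span-encoding word, one byte per character, and one pad byte when the count is odd. The helper names in this sketch are invented for illustration.

```c
#include <stddef.h>

/* Bytes needed for n characters stored as plain 16-bit values. */
static size_t span_uncompressed_bytes(size_t n)
{
    return 2 * n;
}

/* Bytes needed for the same n characters stored as one span:
   sentinel (2) + span word (2) + n LSBs + pad byte if n is odd. */
static size_t span_compressed_bytes(size_t n)
{
    return 2 + 2 + n + (n & 1);
}
```

The same arithmetic shows why 6 is the break-even point: a run of 5 characters costs 10 bytes either way, while a run of 6 drops from 12 bytes to 10.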
Note: Clients should never try to dip into a compressed Unicode string and directly access characters.
The compressed byte values are stored in 16-bit values, which are interpreted as unsigned shorts. This means that the actual order of the bytes depends on the native endianness of a machine's int storage. As a result, any byte swapping issues for compressed Unicode strings are the same as those for uncompressed Unicode strings.
The same block swapping algorithms that handle arrays of datatypes built on shorts or ints on a particular platform can be applied equally to uncompressed or compressed Unicode strings.
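A block swap of this kind might look like the following sketch, which treats the string, compressed or not, as a plain array of 16-bit values. The function name swap16_array is hypothetical, and the caller is assumed to know the length in 16-bit units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swap the two bytes of every 16-bit unit in
   place. The length must be supplied by the caller, because a
   compressed string cannot be measured by scanning for 0x0000. */
static void swap16_array(uint16_t *buf, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        buf[i] = (uint16_t)((buf[i] << 8) | (buf[i] >> 8));
}
```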
The following pages describe the compression and expansion interfaces for Unicode text strings.
Compresses an entire Unicode string into the client's destination buffer and returns the length of the converted string in unichars.
int unicmprs_strCompress ( unistring dest, int destlen, const unistring source, unistring *errorPoint )
unistring dest - Pointer to client destination buffer
int destlen - Length of client destination buffer in unichars
const unistring source - Source unistring to compress
unistring *errorPoint - Output parameter that will point to error position in the string if an error occurs
Note: Preconditions are untested. Failure to meet these input preconditions may result in undefined behavior.
#include <unicmprs.h>
#define BUFLEN (1024) /* for example */
extern unistring s;
{int rc;
unistring errString;
unichar destBuf[BUFLEN];
rc = unicmprs_strCompress( destBuf, BUFLEN, s, &errString );
if ( rc < 0 )
{/* Do error processing */
}
}
Return | Meaning | Result | Error Condition |
|---|---|---|---|
Positive value: number of characters in the output string, including the UNINULL | Compression successful | N/A | |
1 | Input was zero-length unistring | Output is zero-length null-terminated string | N/A |
-1 | Conversion failed | | String may already be compressed (embedded 0xFFFF detected) |
-2 | Compression exceeded destlen of provided buffer | Sets errorPoint to the point in the input string where the overflow started | Client buffer is too short to contain compressed string |
See also: unicmprs_strExpand
The semantics of this interface are identical to unicmprs_strCompress, except that the source string is compressed in place.
int unicmprs_strCompressInPlace ( unistring source, unistring *errorPoint )
unistring source - Source unistring to compress
unistring *errorPoint - Output parameter that will point to error position in the string should an error occur
The source pointer must be a UNINULL-terminated unistring. The source must not be a const unistring since the contents will be overwritten during compression.
Client buffer overruns cannot occur since a compressed string is never longer than the source string.
Note: Preconditions are untested. Failure to meet these input preconditions may result in undefined behavior.
#include <unicmprs.h>
extern unistring s;
{int rc;
unistring errString;
rc = unicmprs_strCompressInPlace( s, &errString );
if ( rc < 0 )
{/* Do error processing */
}
}
Return | Meaning | Result | Error Conditions |
|---|---|---|---|
Positive value: number of characters in the output string, including the UNINULL | Compression successful | N/A | |
1 | Input was zero-length unistring | Output is zero-length null-terminated string | N/A |
-1 | Conversion failed | | String may already be compressed (embedded 0xFFFF detected) |
See also: unicmprs_strCompress
Expands an entire Unicode-compressed string into the client's destination buffer in the form of unichars.
int unicmprs_strExpand ( unistring dest, int destlen, const unistring source, unistring *errorPoint )
unistring dest - Pointer to client destination buffer
int destlen - Length of client destination buffer in unichars
const unistring source - Source unistring to expand
unistring *errorPoint - Output parameter that will point to error position in the string if an error occurs
Clients must provide valid destination and source pointers and a meaningful destination length greater than 1.
Note: Preconditions are untested. Failure to meet these input preconditions may result in undefined behavior.
#include <unicmprs.h>
#define BUFLEN (1024) /* for example */
extern unistring s;
{int rc;
unistring errString;
unichar destBuf[BUFLEN];
rc = unicmprs_strExpand( destBuf, BUFLEN, s, &errString );
if ( rc < 0 )
{/* Do error processing */
}
}
Return | Meaning | Result | Error Conditions |
|---|---|---|---|
Positive value: number of characters in the output string, including the UNINULL | Successful expansion | String will be UNINULL-terminated. Length calculated including the UNINULL | |
1 | Input was zero-length unistring | Output is zero-length null-terminated string | N/A |
-1 | Conversion failed | | |
-2 | Conversion failed | Sets errorPoint to the point in the input string where the overflow started | Client buffer too short to contain expanded string |
See also: unicmprs_strCompress
Reports the length of a compressed Unicode string without actually expanding it into a buffer.
Note: This interface should not be used to determine whether or not a string is compressed.
int unicmprs_strLength ( const unistring s )
const unistring s - unistring for which the length must be calculated
The length reported for a compressed Unicode string is equivalent to unistrlen + 1 on the expanded string (it includes the terminating UNINULL). The value reported by unicmprs_strLength is therefore the correct size, in unichars, for allocating an expansion buffer. Use it when a client simply wants to know what the expanded size would be, or needs the exact size before allocating a buffer for expansion.
If handed an uncompressed Unicode string, this interface reports a value equal to unistrlen + 1 on that same string.
#include <unicmprs.h>
extern unistring s;
{int rc;
rc = unicmprs_strLength( s );
/* rc is the uncompressed length of s in unichars + 1 */
}
Returns the length of the expanded string in unichars, plus one for the terminal UNINULL.
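The following sketch shows one way such a length could be computed from the encoding described earlier in this chapter, by skipping over each span instead of expanding it. It is an illustration under that assumption, not the library's code; sketch_expanded_length is a made-up name and uint16_t stands in for unichar.

```c
#include <stdint.h>

/* Sketch only: walk a compressed (or uncompressed) string and count
   the characters it would expand to, plus one for the UNINULL. */
static int sketch_expanded_length(const uint16_t *s)
{
    int n = 0;
    int count;

    while (*s != 0x0000) {
        if (*s == 0xFFFFu) {
            /* Sentinel: the next word carries the count in its LSB. */
            count = s[1] & 0xFF;
            n += count;
            /* Skip sentinel, span word, and the packed LSB words. */
            s += 2 + (count + 1) / 2;
        } else {
            n++;
            s++;
        }
    }
    return n + 1;
}
```

Because packed spans are skipped as a unit, zero-valued words inside a span are never mistaken for the terminator.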
Scans for the sentinel value 0xFFFF to verify whether one or more spans have been compressed.
int unicmprs_strIsCompressed ( const unistring s )
const unistring s - unistring to check for compression
#include <unicmprs.h>
extern unistring s;
{if ( unicmprs_strIsCompressed( s ) )
{/* 0xFFFF was detected as signal of compression */
}
}
1 = TRUE if compressed
0 = FALSE if uncompressed
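A scan of this kind can be sketched as a simple loop. The function name below is hypothetical and uint16_t stands in for unichar. Stopping at the first 0x0000 is safe for this purpose because everything before the first sentinel is still uncompressed.

```c
#include <stdint.h>

/* Sketch only: return 1 if the sentinel 0xFFFF appears before the
   first 0x0000 unit, 0 otherwise. */
static int sketch_is_compressed(const uint16_t *s)
{
    for (; *s != 0x0000; s++) {
        if (*s == 0xFFFFu)
            return 1;
    }
    return 0;
}
```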
This section describes the Sybase implementation of the Reuters Compression Scheme for Unicode (RCSU) algorithm. This algorithm allows rapid compression of Unicode null-terminated strings or arbitrary buffers of Unicode character values by using a sliding window. These interfaces also allow the strings and buffers to be expanded back into the canonical Unicode form.
The Reuters Compression Scheme for Unicode uses a heuristic mechanism to dynamically move the sliding window to produce the best compression results. As such, it can generally compress Unicode data better than the unicmprs interfaces. In the worst case, however, when no compression is possible, the unircsu algorithm emits a sentinel tag byte to indicate that fact, so the data may actually expand by one byte.
The unicmprs interface is best suited to implementation of compression in a string class, where the underlying type of the string data store is maintained as the same unsigned 16-bit data type.
The unircsu interface is best suited for compression of Unicode data for streaming interfaces as with passing Unicode data through a bandwidth-limited communications protocol.
The Reuters interfaces return the codes listed in Table 4-7.
Return | Meaning | Value |
|---|---|---|
UNIRCSU_BufferOverrun | Client buffer was too short | -1 |
UNIRCSU_IllegalChar | Illegal Unicode value encountered (0xFFFE or 0xFFFF) | -2 |
UNIRCSU_IllegalData | Unterminated code or otherwise invalid byte encountered in RCSU data stream | -3 |
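Client code might translate these codes into diagnostic text along the following lines. The enumerators below are local stand-ins that mirror the values in the table; they are not the header's own definitions, and the function name is invented for this sketch.

```c
/* Local stand-ins mirroring the documented return values.
   A real client would use the names from <unircsu.h> instead. */
enum {
    SKETCH_BUFFER_OVERRUN = -1,
    SKETCH_ILLEGAL_CHAR   = -2,
    SKETCH_ILLEGAL_DATA   = -3
};

/* Map a return code to a short description for logging. */
static const char *sketch_rcsu_describe(int rc)
{
    switch (rc) {
    case SKETCH_BUFFER_OVERRUN:
        return "client buffer was too short";
    case SKETCH_ILLEGAL_CHAR:
        return "illegal Unicode value (0xFFFE or 0xFFFF)";
    case SKETCH_ILLEGAL_DATA:
        return "invalid byte in RCSU data stream";
    default:
        return rc > 0 ? "success" : "undocumented return value";
    }
}
```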
Compresses an entire Unicode UNINULL-terminated string into the client destination buffer.
int unircsu_strCompress ( UChar8 *dest, int destlen, const unistring source, unistring *errorPoint )
UChar8 *dest - Pointer to client destination buffer
int destlen - Length of client destination buffer in unichars
const unistring source - Source unistring to compress
unistring *errorPoint - Output parameter that will point to the error position in the string if an error occurs
On successful compression, the interface returns the length of the converted string (in bytes) placed into dest. This string will be terminated with one or two NULL bytes, depending on which mode the compression was in when the end of data was reached. The returned length includes any terminating NULL bytes.
Note: These preconditions are not tested. Failure to meet these input preconditions will result in undefined behavior.
#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern unistring s;
{int rc;
unistring errString;
UChar8 destBuf[BUFLEN];
rc = unircsu_strCompress( destBuf, BUFLEN, s, &errString );
if ( rc < 0 )
{/* Do error processing */
}
}
Return | Result | Meaning |
|---|---|---|
Positive value | Length of the converted string (in bytes), including any terminating NULL bytes | |
1 | An input of a zero-length unistring will result in the output of a zero-length null-terminated string, allowing fail-safe error recovery mechanisms | Signifies the number of characters in the output string, including the single NULL |
Negative value | errorPoint is set to the position in the input string where the failure occurred and the contents of the destination buffer remain undefined | Failure to convert |
UNIRCSU_BufferOverrun | Sets errorPoint to the point in the input string where the overflow starts | Client buffer is too short to contain the compressed string |
UNIRCSU_IllegalChar | | Input data contains illegal non-characters (0xFFFE or 0xFFFF) |
See also: unircsu_strExpand
Reports the length of a Unicode string compressed with the RCSU algorithm without expanding it into a buffer.
int unircsu_strLength ( const UChar8 *s )
const UChar8 *s - Compressed string to measure
This interface should be used only with RCSU data that have been compressed with unircsu_strCompress, because it assumes that the original data were terminated with a UNINULL.
#include <unircsu.h>
extern UChar8* s;
{int rc;
rc = unircsu_strLength( s );
/* rc is the uncompressed length of s in unichars + 1 */
}
Return | Result |
|---|---|
Positive value | Length of the converted string in unichars, including the terminating UNINULL |
The length of the converted string is equivalent to unistrlen + 1 on the expanded string, including the terminating UNINULL. Therefore, if buffer size is an issue, the value reported by unircsu_strLength is correct for allocation. This value can be used if a client simply wants to know what the expanded size would be or needs to calculate the size before allocating a buffer for expansion.
Expands an entire RCSU compressed string into the client's destination buffer.
int unircsu_strExpand ( unistring dest, int destlen, const UChar8 *source, const UChar8 **errorPoint )
unistring dest - unistring destination
int destlen - Length of client destination buffer in unichars
const UChar8 *source - Source string to expand
const UChar8 **errorPoint - Output parameter that will point to the error position in the string if an error occurs
This interface is the companion to unircsu_strCompress: it expects to terminate when it finds the compressed analog of the terminal UNINULL of the original unistring. It should not be used for expanding RCSU data that was compressed with unircsu_dataCompress, since that data may contain arbitrarily embedded UNINULL values. Clients must provide valid destination and source pointers, and a meaningful destlen greater than 1.
Note: These preconditions are not tested in the interest of interface speed. Failure to meet these input preconditions may result in undefined behavior.
#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern UChar8* s;
{int rc;
const UChar8* errString;
unichar destBuf[BUFLEN];
rc = unircsu_strExpand( destBuf, BUFLEN, s, &errString );
if ( rc < 0 )
{/* Do error processing */
}
}
Returns
Return | Result | Meaning |
|---|---|---|
Positive value | Length of the converted string in unichars, including the terminating UNINULL. | Successful expansion. This string will be UNINULL-terminated and the length is calculated, including the UNINULL. |
1 | Zero-length null-terminated unistring. | Equals the number of characters in the output string, including the UNINULL. |
Negative value | errorPoint is set to the position in the input string where the failure occurred; destination buffer may be arbitrarily truncated, but is always UNINULL-terminated. | Failure to convert. |
UNIRCSU_BufferOverrun | Sets errorPoint to the point in the input string where the failure occurred. | Client buffer is too short to contain the expanded string. |
UNIRCSU_IllegalChar | | Output data contains illegal non-characters (0xFFFE or 0xFFFF). |
See also: unircsu_strCompress
Compresses an entire arbitrary data buffer of Unicode characters into the client's destination buffer, using the general RCSU encoding algorithm.
int unircsu_dataCompress ( UChar8 *dest, int destlen, const unichar *source, int sourcelen, const unichar **errorPoint )
UChar8 *dest - Pointer to client destination buffer
int destlen - Length of client destination buffer in unichars
const unichar *source - Source unistring to compress
int sourcelen - Length of source in unichars
const unichar **errorPoint - Output parameter that will point to the error position in the data if an error occurs
The semantics of this interface are as for unircsu_strCompress, but the source is a pointer to a buffer with an arbitrary collection of Unicode characters. The sourcelen parameter defines the length of the data in unichars. No assumptions are made about null-termination, and any embedded UNINULLs are compressed along with the rest of the data.
#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern unichar* data;
extern int datalen;
{int rc;
const unichar* errPoint;
UChar8 destBuf[BUFLEN];
rc = unircsu_dataCompress( destBuf, BUFLEN, data, datalen, &errPoint );
if ( rc < 0 )
{/* Do error processing */
}
}
Return | Result | Meaning |
|---|---|---|
Positive value | Length of the converted data in bytes | |
UNIRCSU_BufferOverrun | Client buffer is too short to contain the compressed data | |
UNIRCSU_IllegalChar | Input data contains illegal non-characters (0xFFFE or 0xFFFF) | |
Expands an arbitrary data buffer of RCSU compressed data into the client's destination buffer, using the general RCSU encoding algorithm.
int unircsu_dataExpand ( unichar *dest, int destlen, const UChar8 *source, int sourcelen, const UChar8 **errorPoint )
unichar *dest - Pointer to client destination buffer
int destlen - Length of client destination buffer in unichars
const UChar8 *source - Pointer to data buffer to expand
int sourcelen - Length of source data buffer
const UChar8 **errorPoint - Output parameter that will point to the error position in the buffer if an error occurs
#include <unircsu.h>
#define BUFLEN (1024) /* for example */
extern UChar8* data;
extern int datalen;
{int rc;
const UChar8* errPoint;
unichar destBuf[BUFLEN];
rc = unircsu_dataExpand( destBuf, BUFLEN, data, datalen, &errPoint );
if ( rc < 0 )
{/* Do error processing */
}
}
Return | Result | Meaning |
|---|---|---|
Positive value | Length of the converted data in unichars | |
UNIRCSU_BufferOverrun | Client buffer is too short to contain the expanded data | |
UNIRCSU_IllegalChar | Output data contains illegal non-characters (0xFFFE or 0xFFFF) | |
UNIRCSU_IllegalData | Input data contains illegal values (for example, only the first byte of a required two-byte sequence) | |