UnicodeEncode filter

This page applies to Harlequin v13.1r0 and later; both Harlequin Core and Harlequin MultiRIP.

The UnicodeEncode filter has two required parameters, /From and /To, and two optional parameters, /ByteOrderMark and /Substitute.

From	string or name (required)	The name of the encoding of the source data written to the filter. Encoding names can take any form that ICU accepts (they are not case sensitive, and spaces and hyphens are ignored). The encoding name can also be `(Unicode)`, in which case the filter automatically detects the particular Unicode encoding used.
To	string or name (required)	The name of the encoding of the data output by the filter. As with `/From`, the name can take any form that ICU accepts (it is not case sensitive, and spaces and hyphens are ignored).
ByteOrderMark	Boolean (optional)	Forces the addition (`true`) or removal (`false`) of a byte order mark (BOM) if the `/To` encoding is a Unicode encoding. If not specified, a BOM is present at the start of the output only if the input encoding is a Unicode encoding and the source data starts with a BOM. If the flag is `false`, any BOM at the start is removed from the output. For non-Unicode output encodings, we recommend that you set the flag to `false`, because BOMs are not usually representable in non-Unicode encodings. BOMs appearing after the start of the input are never removed from the output; they are always re-encoded (that is, replaced with the substitute character, if one is specified, when the output encoding cannot represent the BOM). The filter accepts buffers with incomplete Unicode character encodings.
Substitute	string (optional)	A string containing a single character in the output encoding. This character is used in the output stream for any input character that is invalidly encoded or cannot be represented by the output encoding. If the string is present but empty, any input characters that are invalidly encoded or cannot be represented in the output encoding are dropped from the output. If this entry is not specified and an input character is invalidly encoded or not representable in the output encoding, the filter raises an error.
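
The three `/Substitute` cases described in the table can be compared with a short sketch. The procedure name `tryencode`, the sample bytes, and the choice of `/UTF-8` and `/US-ASCII` as encodings are assumptions made for illustration; whether a given converter is available depends on the RIP. The sketch catches the error with `stopped` because the table does not say which error the filter raises.

```postscript
% Illustrative sketch only. /tryencode, /sample and the encodings used
% are hypothetical; they are not part of the filter itself.
/sample <63 61 66 C3 A9> def          % "caf" followed by U+00E9 in UTF-8

/tryencode {                          % paramdict tryencode -
  mark exch                           % mark paramdict
  {
    (%stdout) (w) file exch           % mark target paramdict
    /UnicodeEncode filter             % mark filterfile
    dup sample writestring            % feed the sample through the filter
    closefile                         % flush remaining output to the target
  } stopped
  { (conversion raised an error) print } if
  cleartomark                         % discard anything left by an error
  (\n) print flush
} def

% U+00E9 is not representable in US-ASCII, so by the table above:
<< /From /UTF-8 /To /US-ASCII /Substitute (?) >> tryencode  % U+00E9 replaced by "?"
<< /From /UTF-8 /To /US-ASCII /Substitute ()  >> tryencode  % U+00E9 dropped
<< /From /UTF-8 /To /US-ASCII                 >> tryencode  % error expected
```

Wrapping the write and the `closefile` in the same `stopped` context means the error is caught whether the filter reports it when the data is written or when the filter is closed.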

ICU accepts either IBM-defined encoding names or aliases. Most of these are direct mappings of other encoding names. For example, if you have the full converter set installed, you could use:

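% Convert the first line of the file from any Unicode encoding to Shift-JIS,
% substituting "x" for unencodable characters, leaving it on the operand stack.
1024 string dup                 % target string, plus a copy to leave on the stack
<<
  /From /Unicode                % auto-detect the Unicode encoding of the input
  /To (Shift-JIS)
  /ByteOrderMark false          % remove any BOM at the start of the output
  /Substitute (x)
>> /UnicodeEncode filter dup    % encoding filter that writes into the target string
(filename) (r) file 1024 string readline pop   % read the first line, discard the status
writestring flushfile           % encode the line and flush it into the target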

To convert from UTF-16 to UTF-8 (both Unicode forms), and ensure that there is a Byte Order Mark, you might do:

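1024 string dup                 % target string, plus a copy to leave on the stack
<<
  /From /UTF-16
  /To /UTF-8
  /ByteOrderMark true           % ensure the output starts with a BOM
>> /UnicodeEncode filter dup    % encoding filter that writes into the target string
(filename) (r) file 1024 string readline pop   % read the first line, discard the status
writestring flushfile           % encode the line and flush it into the target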
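
The examples above convert a single line into a string. As a minimal sketch of converting a whole file instead, the filter can be layered on an output file and fed in fixed-size chunks. The file names `(in.txt)` and `(out.txt)` and the names `src`, `out`, `enc`, and `buf` are hypothetical, and the sketch assumes the RIP allows the job to open these files.

```postscript
% Illustrative sketch only: (in.txt) and (out.txt) are hypothetical names.
/src (in.txt)  (r) file def
/out (out.txt) (w) file def
/enc out <<
  /From (Unicode)          % auto-detect the Unicode encoding of the input
  /To /UTF-8
  /ByteOrderMark false     % remove any BOM at the start of the output
>> /UnicodeEncode filter def
/buf 1024 string def

{
  src buf readstring       % -> substring bool (false at end of file)
  exch enc exch writestring
  not { exit } if
} loop

enc closefile              % flush the filter's remaining output to out
out closefile
src closefile
```

Reading in fixed 1024-byte chunks can split a multi-byte character across two writes; this relies on the filter accepting buffers with incomplete Unicode character encodings, as noted in the `/ByteOrderMark` description. The underlying output file is closed explicitly because closing a filter does not necessarily close its target.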

Some limitations follow:

The Harlequin RIP SDK does not necessarily include the Unicode conversion base functions that this filter uses.
The converters installed depend on the RIP; OEMs can include new converters of their own by creating a new ICU converter package, and including it in the ICU data bundle.
Global Graphics has never documented the procedure to add new converters.
The procedure to add new converters differs between ICU 3.0 (in 7.x RIPs) and ICU 3.4 (in some recent RIPs).