A.4.11 String Encoding
{AI05-0137-2} Facilities for encoding, decoding, and converting
strings in various character encoding schemes are provided by packages
Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions,
Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and
Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
{AI05-0137-2} The encoding library packages have the following
declarations:
{AI05-0137-2}
package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);

   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

   subtype UTF_String is String;
   subtype UTF_8_String is String;
   subtype UTF_16_Wide_String is Wide_String;

   Encoding_Error : exception;

   BOM_8    : constant UTF_8_String :=
                Character'Val(16#EF#) &
                Character'Val(16#BB#) &
                Character'Val(16#BF#);
   BOM_16BE : constant UTF_String :=
                Character'Val(16#FE#) &
                Character'Val(16#FF#);
   BOM_16LE : constant UTF_String :=
                Character'Val(16#FF#) &
                Character'Val(16#FE#);
   BOM_16   : constant UTF_16_Wide_String :=
                (1 => Wide_Character'Val(16#FEFF#));

   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
      return Encoding_Scheme;
end Ada.Strings.UTF_Encoding;
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);

   -- Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Convert (Item          : UTF_8_String;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
   function Convert (Item          : UTF_16_Wide_String;
                     Output_BOM    : Boolean := False) return UTF_8_String;
end Ada.Strings.UTF_Encoding.Conversions;
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Strings is
   pragma Pure (Strings);

   -- Encoding / decoding between String and various encoding schemes
   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;
   function Encode (Item          : String;
                    Output_BOM    : Boolean := False) return UTF_8_String;
   function Encode (Item          : String;
                    Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return String;
   function Decode (Item : UTF_8_String) return String;
   function Decode (Item : UTF_16_Wide_String) return String;
end Ada.Strings.UTF_Encoding.Strings;
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Wide_Strings is
   pragma Pure (Wide_Strings);

   -- Encoding / decoding between Wide_String and various encoding schemes
   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;
   function Encode (Item          : Wide_String;
                    Output_BOM    : Boolean := False) return UTF_8_String;
   function Encode (Item          : Wide_String;
                    Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_String;
   function Decode (Item : UTF_8_String) return Wide_String;
   function Decode (Item : UTF_16_Wide_String) return Wide_String;
end Ada.Strings.UTF_Encoding.Wide_Strings;
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
   pragma Pure (Wide_Wide_Strings);

   -- Encoding / decoding between Wide_Wide_String and various encoding schemes
   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;
   function Encode (Item          : Wide_Wide_String;
                    Output_BOM    : Boolean := False) return UTF_8_String;
   function Encode (Item          : Wide_Wide_String;
                    Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
   function Decode (Item : UTF_8_String) return Wide_Wide_String;
   function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
{AI05-0137-2} {AI05-0262-1} The type Encoding_Scheme defines encoding
schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D
of ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme
defined by Annex C of ISO/IEC 10646 in 8-bit, big-endian order; and
UTF_16LE corresponds to the UTF-16 encoding scheme in 8-bit, little-endian
order.
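As an illustration of how the three schemes differ, the following minimal sketch encodes a single character (U+20AC, EURO SIGN) in each scheme and prints the resulting 8-bit elements in hexadecimal. The procedure name Show_Schemes and the helper Put_Hex are hypothetical names chosen for this example; only the declarations of this subclause are assumed from the library.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;                   use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Schemes is
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- U+20AC, given by position to avoid depending on the source encoding
   Euro : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#20AC#));

   Hex : constant String := "0123456789ABCDEF";

   -- Hypothetical helper: print each element of S as two hex digits
   procedure Put_Hex (S : UTF_String) is
   begin
      for C of S loop
         Ada.Text_IO.Put (Hex (Character'Pos (C) / 16 + 1));
         Ada.Text_IO.Put (Hex (Character'Pos (C) mod 16 + 1));
         Ada.Text_IO.Put (' ');
      end loop;
      Ada.Text_IO.New_Line;
   end Put_Hex;
begin
   -- Expected elements: UTF_8 => E2 82 AC, UTF_16BE => 20 AC,
   -- UTF_16LE => AC 20
   for Scheme in Encoding_Scheme loop
      Ada.Text_IO.Put (Encoding_Scheme'Image (Scheme) & ": ");
      Put_Hex (WWS.Encode (Euro, Scheme));
   end loop;
end Show_Schemes;
```

The two UTF-16 results contain the same 16-bit code unit; only the order of its two 8-bit halves differs.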
{AI05-0137-2} The subtype UTF_String is used to represent a String
of 8-bit values containing a sequence of values encoded in one of three
ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used
to represent a String of 8-bit values containing a sequence of values
encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent
a Wide_String of 16-bit values containing a sequence of values encoded
in UTF-16.
{AI05-0137-2} {AI05-0262-1} The BOM_8, BOM_16BE, BOM_16LE, and BOM_16
constants correspond to values used at the start of a string to indicate
the encoding.
{AI05-0262-1} {AI05-0269-1} Each of the Encode functions takes a String,
Wide_String, or Wide_Wide_String Item parameter that is assumed to be an
array of unencoded characters. Each of the Convert functions takes a
UTF_String, UTF_8_String, or UTF_16_Wide_String Item parameter that is
assumed to contain characters whose position values correspond to a valid
encoding sequence according to the encoding scheme required by the
function or specified by its Input_Scheme parameter.
{AI05-0137-2} {AI05-0262-1} {AI05-0269-1} Each of the Convert and Encode
functions returns a UTF_String, UTF_8_String, or UTF_16_Wide_String value
whose characters have position values that correspond to the encoding of
the Item parameter according to the encoding scheme required by the
function or specified by its Output_Scheme parameter. For UTF_8, no
overlong encoding is returned. A BOM is included at the start of the
returned string if the Output_BOM parameter is set to True. The lower
bound of the returned string is 1.
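The effect of Output_BOM and the lower-bound rule can be observed with a short sketch such as the following. It encodes the single Latin-1 character at position 16#E9# in UTF-8, once without and once with a BOM. The procedure name Show_Encode and the renaming Enc are hypothetical names used only for this example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Encode is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   -- One unencoded Latin-1 character, position 16#E9#
   E_Acute : constant String := (1 => Character'Val (16#E9#));

   -- The expected type selects the UTF_8_String overload of Encode
   Plain    : constant UTF_8_String := Enc.Encode (E_Acute);
   With_BOM : constant UTF_8_String :=
     Enc.Encode (E_Acute, Output_BOM => True);
begin
   -- Two elements (16#C3# 16#A9#) without a BOM
   Ada.Text_IO.Put_Line ("Plain length:" & Integer'Image (Plain'Length));

   -- Three BOM elements plus the two encoded elements
   Ada.Text_IO.Put_Line ("BOM length:  " & Integer'Image (With_BOM'Length));
   Ada.Text_IO.Put_Line ("First index: " & Integer'Image (With_BOM'First));
   Ada.Text_IO.Put_Line
     ("Starts with BOM_8: " & Boolean'Image (With_BOM (1 .. 3) = BOM_8));
end Show_Encode;
```

Because two Encode overloads differ only in their result type, the declared subtype of the constant is what selects the UTF-8 version here.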
{AI05-0137-2} {AI05-0262-1} Each of the Decode functions takes a
UTF_String, UTF_8_String, or UTF_16_Wide_String Item parameter that is
assumed to contain characters whose position values correspond to a valid
encoding sequence according to the encoding scheme required by the
function or specified by its Input_Scheme parameter, and returns the
corresponding String, Wide_String, or Wide_Wide_String value. The lower
bound of the returned string is 1.
{AI05-0137-2} {AI05-0262-1} For each of the Convert and Decode functions,
an initial BOM in the input that matches the expected encoding scheme is
ignored, and a different initial BOM causes Encoding_Error to be
propagated.
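The BOM rule above might be exercised as in the following sketch, which decodes the same three ASCII characters twice: once preceded by a matching UTF-8 BOM and once preceded by a UTF-16BE BOM while still declaring the input as UTF-8. The procedure name Show_BOM_Handling and the renaming Enc are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_BOM_Handling is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;
begin
   -- The UTF-8 BOM matches the declared scheme, so it is skipped
   -- and only "abc" is returned
   Ada.Text_IO.Put_Line (Enc.Decode (BOM_8 & "abc", UTF_8));

   -- The UTF-16BE BOM does not match UTF_8; Encoding_Error is expected
   begin
      Ada.Text_IO.Put_Line (Enc.Decode (BOM_16BE & "abc", UTF_8));
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error: BOM does not match UTF_8");
   end;
end Show_BOM_Handling;
```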
{AI05-0137-2} The exception Encoding_Error is also propagated
in the following situations:
By a Decode function when a UTF-encoded string contains an invalid
  encoding sequence.
By a Decode function when the expected encoding is UTF-16BE or UTF-16LE
  and the input string has an odd length.
{AI05-0262-1} By a Decode function yielding a String when the decoding
  of a sequence results in a code point whose value exceeds 16#FF#.
By a Decode function yielding a Wide_String when the decoding of a
  sequence results in a code point whose value exceeds 16#FFFF#.
{AI05-0262-1} By an Encode function taking a Wide_String as input when
  an invalid character appears in the input. In particular, the characters
  whose position is in the range 16#D800# .. 16#DFFF# are invalid because
  they conflict with UTF-16 surrogate encodings, and the characters whose
  position is 16#FFFE# or 16#FFFF# are also invalid because they conflict
  with BOM codes.
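One of the situations listed above, a decoded code point that does not fit in the result type, can be sketched as follows. The input is the UTF-8 encoding of U+20AC, which is representable in a Wide_String but not in a String. The procedure name Show_Encoding_Error is hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Encoding_Error is
   -- UTF-8 encoding of U+20AC, a code point above 16#FF#
   Euro_UTF_8 : constant UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);
begin
   -- 16#20AC# does not exceed 16#FFFF#, so this decoding succeeds
   declare
      W : constant Wide_String :=
        Ada.Strings.UTF_Encoding.Wide_Strings.Decode (Euro_UTF_8);
   begin
      Ada.Text_IO.Put_Line ("Wide_String length:" & Integer'Image (W'Length));
   end;

   -- 16#20AC# exceeds 16#FF#, so Encoding_Error is expected here
   declare
      S : constant String :=
        Ada.Strings.UTF_Encoding.Strings.Decode (Euro_UTF_8);
   begin
      Ada.Text_IO.Put_Line ("Not expected:" & Integer'Image (S'Length));
   end;
exception
   when Encoding_Error =>
      Ada.Text_IO.Put_Line ("Encoding_Error: code point exceeds 16#FF#");
end Show_Encoding_Error;
```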
{AI05-0137-2}
function Encoding (Item    : UTF_String;
                   Default : Encoding_Scheme := UTF_8)
   return Encoding_Scheme;
{AI05-0137-2} {AI05-0269-1} Inspects a UTF_String value to determine
whether it starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so,
returns the scheme corresponding to the BOM; otherwise, returns the value
of Default.
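A minimal sketch of calling Encoding follows: one input starts with the UTF-16LE BOM, the other has no BOM and therefore yields the Default. The procedure name Show_Encoding and the sample data are assumptions made for this example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Show_Encoding is
   -- "Hi" in UTF-16LE, preceded by the UTF-16LE BOM
   With_BOM : constant UTF_String :=
     BOM_16LE & 'H' & Character'Val (0) & 'i' & Character'Val (0);

   -- Plain ASCII with no BOM; the scheme cannot be detected
   No_BOM : constant UTF_String := "Hi";
begin
   -- BOM found: UTF_16LE
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (With_BOM)));

   -- No BOM, default Default: UTF_8
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (No_BOM)));

   -- No BOM, explicit Default: UTF_16BE
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (No_BOM, Default => UTF_16BE)));
end Show_Encoding;
```

Encoding only looks for a BOM; it does not otherwise analyze the content, so the result for BOM-less input is whatever the caller passes as Default.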
{AI05-0137-2}
function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme) encoded in one of these three
schemes as specified by Output_Scheme.
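For instance, converting between two 8-bit schemes might look like the following sketch, which takes the UTF-8 encoding of U+00E9 and converts it to UTF-16BE. The procedure name Show_Convert is hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Show_Convert is
   -- U+00E9 encoded in UTF-8 (two elements)
   E_UTF_8 : constant UTF_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   E_UTF_16BE : constant UTF_String :=
     Ada.Strings.UTF_Encoding.Conversions.Convert
       (Item          => E_UTF_8,
        Input_Scheme  => UTF_8,
        Output_Scheme => UTF_16BE);
begin
   -- One 16-bit code unit, most significant half first: 16#00# 16#E9#
   Ada.Text_IO.Put_Line
     ("Length:" & Integer'Image (E_UTF_16BE'Length));
   Ada.Text_IO.Put_Line
     ("Elements:"
      & Integer'Image (Character'Pos (E_UTF_16BE (1)))
      & Integer'Image (Character'Pos (E_UTF_16BE (2))));
end Show_Convert;
```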
{AI05-0137-2}
function Convert (Item         : UTF_String;
                  Input_Scheme : Encoding_Scheme;
                  Output_BOM   : Boolean := False)
   return UTF_16_Wide_String;
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme) encoded in UTF-16.
{AI05-0137-2}
function Convert (Item       : UTF_8_String;
                  Output_BOM : Boolean := False)
   return UTF_16_Wide_String;
Returns the value of Item (originally encoded in UTF-8) encoded in UTF-16.
{AI05-0137-2}
function Convert (Item          : UTF_16_Wide_String;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8,
UTF-16LE, or UTF-16BE as specified by Output_Scheme.
{AI05-0137-2}
function Convert (Item       : UTF_16_Wide_String;
                  Output_BOM : Boolean := False) return UTF_8_String;
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8.
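A sketch of this conversion with a character outside the 16-bit range follows. The input is the UTF-16 surrogate pair for U+1F600 (16#D83D# 16#DE00#), and the expected UTF-8 form is the four elements 16#F0# 16#9F# 16#98# 16#80#. The procedure name Show_Surrogates is hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Show_Surrogates is
   -- UTF-16 surrogate pair that encodes U+1F600
   Pair : constant UTF_16_Wide_String :=
     Wide_Character'Val (16#D83D#) & Wide_Character'Val (16#DE00#);

   -- The Wide_String argument selects the UTF-16 to UTF-8 overload
   As_UTF_8 : constant UTF_8_String :=
     Ada.Strings.UTF_Encoding.Conversions.Convert (Pair);

   Expected : constant UTF_8_String :=
     Character'Val (16#F0#) & Character'Val (16#9F#) &
     Character'Val (16#98#) & Character'Val (16#80#);
begin
   Ada.Text_IO.Put_Line ("Length:" & Integer'Image (As_UTF_8'Length));
   Ada.Text_IO.Put_Line
     ("Matches expected: " & Boolean'Image (As_UTF_8 = Expected));
end Show_Surrogates;
```

The pair of 16-bit code units is converted as a single code point, so the result is one four-element UTF-8 sequence rather than two three-element ones.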
{AI05-0137-2}
function Encode (Item          : String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False) return UTF_String;
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Output_Scheme.
{AI05-0137-2}
function Encode (Item       : String;
                 Output_BOM : Boolean := False) return UTF_8_String;
Returns the value of Item encoded in UTF-8.
{AI05-0137-2}
function Encode (Item       : String;
                 Output_BOM : Boolean := False) return UTF_16_Wide_String;
Returns the value of Item encoded in UTF-16.
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return String;
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Input_Scheme.
{AI05-0137-2}
function Decode (Item : UTF_8_String) return String;
Returns the result of decoding Item, which is encoded in UTF-8.
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String) return String;
Returns the result of decoding Item, which is encoded in UTF-16.
{AI05-0137-2}
function Encode (Item          : Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False) return UTF_String;
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Output_Scheme.
{AI05-0137-2}
function Encode (Item       : Wide_String;
                 Output_BOM : Boolean := False) return UTF_8_String;
Returns the value of Item encoded in UTF-8.
{AI05-0137-2}
function Encode (Item       : Wide_String;
                 Output_BOM : Boolean := False) return UTF_16_Wide_String;
Returns the value of Item encoded in UTF-16.
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Input_Scheme.
{AI05-0137-2}
function Decode (Item : UTF_8_String) return Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8.
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String) return Wide_String;
Returns the result of decoding Item, which is encoded in UTF-16.
{AI05-0137-2}
function Encode (Item          : Wide_Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False) return UTF_String;
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Output_Scheme.
{AI05-0137-2}
function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean := False) return UTF_8_String;
Returns the value of Item encoded in UTF-8.
{AI05-0137-2}
function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean := False) return UTF_16_Wide_String;
Returns the value of Item encoded in UTF-16.
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Input_Scheme.
{AI05-0137-2}
function Decode (Item : UTF_8_String) return Wide_Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8.
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
Returns the result of decoding Item, which is encoded in UTF-16.
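The Encode and Decode functions for Wide_Wide_String are inverses for valid input, which the following sketch checks for a two-character string that includes a code point above 16#FFFF#. The procedure name Show_Round_Trip and the renaming WWS are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;                   use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Round_Trip is
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- 'A' (U+0041) followed by U+1F600, which needs a surrogate pair
   -- in UTF-16 and four elements in UTF-8
   Original : constant Wide_Wide_String :=
     Wide_Wide_Character'Val (16#41#) & Wide_Wide_Character'Val (16#1F600#);

   -- The declared subtype selects the overload by result type
   As_UTF_8  : constant UTF_8_String       := WWS.Encode (Original);
   As_UTF_16 : constant UTF_16_Wide_String := WWS.Encode (Original);
begin
   -- 1 + 4 elements in UTF-8, 1 + 2 code units in UTF-16
   Ada.Text_IO.Put_Line
     ("UTF-8 elements: " & Integer'Image (As_UTF_8'Length));
   Ada.Text_IO.Put_Line
     ("UTF-16 elements:" & Integer'Image (As_UTF_16'Length));

   -- Decoding either form yields the original string again
   Ada.Text_IO.Put_Line
     ("UTF-8 round trip:  "
      & Boolean'Image (WWS.Decode (As_UTF_8) = Original));
   Ada.Text_IO.Put_Line
     ("UTF-16 round trip: "
      & Boolean'Image (WWS.Decode (As_UTF_16) = Original));
end Show_Round_Trip;
```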
Implementation Advice
{AI05-0137-2} If an implementation supports other encoding schemes,
another similar child of Ada.Strings should be defined.
Implementation Advice: If an implementation supports other string
encoding schemes, a child of Ada.Strings similar to UTF_Encoding should
be defined.
19  {AI05-0137-2} A BOM (Byte-Order Mark, code position 16#FEFF#)
can be included in a file or other entity to indicate the encoding; it
is skipped when decoding. Typically, only the first line of a file or
other entity contains a BOM. When decoding, the Encoding function can
be called on the first line to determine the encoding; this encoding
will then be used in subsequent calls to Decode to convert all of the
lines to an internal format.
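The usage described in this note could be sketched as follows, with two in-memory strings standing in for the lines of a file. The procedure name Show_Lines and the sample lines are assumptions made for illustration; reading real files is omitted.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Lines is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   -- Stand-ins for lines read from a file; only the first carries a BOM
   First_Line  : constant UTF_String := BOM_8 & "first";
   Second_Line : constant UTF_String := "second";

   -- Determine the scheme once, from the first line
   Scheme : constant Encoding_Scheme := Encoding (First_Line);
begin
   -- The matching BOM on the first line is skipped by Decode
   Ada.Text_IO.Put_Line (Enc.Decode (First_Line, Scheme));

   -- Later lines have no BOM and are decoded with the same scheme
   Ada.Text_IO.Put_Line (Enc.Decode (Second_Line, Scheme));
end Show_Lines;
```

Passing the detected scheme explicitly to every Decode call keeps the treatment of BOM-less lines consistent with the first line.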
Extensions to Ada 2005
{AI05-0137-2} The packages Strings.UTF_Encoding,
Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings,
Strings.UTF_Encoding.Wide_Strings, and
Strings.UTF_Encoding.Wide_Wide_Strings are new.
Ada 2005 and 2012 Editions sponsored in part by Ada-Europe