Module BatCharEncodings


module BatCharEncodings: sig .. end
Character encodings.

When computers were first created, little thought was given to writing text in languages other than English. Over the years, various notations (or "encodings") emerged, adapted to different languages: Latin-1 for most Western European languages, EUC-KR for Korean, etc. Universal conventions were later introduced, intended to be sufficient to represent all human languages, with several possible encodings such as UTF-8, UTF-16 and UTF-32.

This module deals with conversions between the possible encodings of inputs and outputs.
Author(s): Yamagata Yoriyuki (Camomile module), David Teller




exception Malformed_code
An exception raised when a character is meaningless in the encoding in which it appears. This is usually the sign that the encoding used was incorrect.
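
To illustrate the kind of situation this exception describes, here is a minimal standalone sketch that checks a byte string against ASCII using only the OCaml standard library. The exception Malformed and the function check_ascii are hypothetical names local to the example, not part of this module.

```ocaml
(* Hypothetical stand-in for Malformed_code, local to this sketch. *)
exception Malformed

(* Return [s] unchanged if every byte is valid ASCII (0x00-0x7F);
   raise [Malformed] on the first byte outside that range. *)
let check_ascii (s : string) : string =
  String.iter (fun c -> if Char.code c > 0x7F then raise Malformed) s;
  s

let () =
  (* Pure ASCII passes through untouched. *)
  assert (check_ascii "plain text" = "plain text");
  (* "caf\xe9" is "café" in Latin-1; byte 0xE9 has no meaning in ASCII. *)
  assert (try ignore (check_ascii "caf\xe9"); false with Malformed -> true)
```

Reading Latin-1 data as if it were ASCII is a typical case of choosing the wrong encoding, which is what the exception usually signals.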
type encoding = [ `ansi_x3_110_1983
| `ansi_x3_4_1968
| `ansi_x3_4_1986
| `arabic
| `arabic7
| `armscii_8
| `ascii
| `asmo_449
| `asmo_708
| `big5
| `big5_cp950
| `big5_hkscs
| `big5hkscs
| `bs_4730
| `bs_viewdata
| `ca
| `charset_1026
| `charset_1047
| `charset_437
| `charset_500
| `charset_500v1
| `charset_850
| `charset_851
| `charset_852
| `charset_855
| `charset_856
| `charset_857
| `charset_860
| `charset_861
| `charset_862
| `charset_863
| `charset_865
| `charset_866
| `charset_866nav
| `charset_869
| `charset_904
| `cn
| `cp037
| `cp038
| `cp10007
| `cp1004
| `cp1026
| `cp1047
| `cp1124
| `cp1125
| `cp1129
| `cp1132
| `cp1133
| `cp1160
| `cp1161
| `cp1162
| `cp1163
| `cp1164
| `cp1250
| `cp1251
| `cp1252
| `cp1253
| `cp1254
| `cp1255
| `cp1256
| `cp1257
| `cp1258
| `cp273
| `cp274
| `cp275
| `cp278
| `cp280
| `cp281
| `cp284
| `cp285
| `cp290
| `cp297
| `cp367
| `cp420
| `cp423
| `cp424
| `cp437
| `cp500
| `cp737
| `cp775
| `cp819
| `cp850
| `cp851
| `cp852
| `cp855
| `cp856
| `cp857
| `cp860
| `cp861
| `cp862
| `cp863
| `cp864
| `cp865
| `cp866
| `cp866nav
| `cp868
| `cp869
| `cp870
| `cp871
| `cp874
| `cp875
| `cp880
| `cp891
| `cp903
| `cp904
| `cp905
| `cp918
| `cp922
| `cp932
| `cp949
| `cp_ar
| `cp_gr
| `cp_hu
| `cp_is
| `csa7_1
| `csa7_2
| `csa_t500_1983
| `csa_z243_4_1985_1
| `csa_z243_4_1985_2
| `csa_z243_4_1985_gr
| `csn_369103
| `cuba
| `cwi
| `cwi_2
| `cyrillic
| `de
| `dec
| `dec_mcs
| `din_66003
| `dk
| `ds2089
| `ds_2089
| `e13b
| `ebcdic_at_de
| `ebcdic_at_de_a
| `ebcdic_be
| `ebcdic_br
| `ebcdic_ca_fr
| `ebcdic_cp_ar1
| `ebcdic_cp_ar2
| `ebcdic_cp_be
| `ebcdic_cp_ca
| `ebcdic_cp_ch
| `ebcdic_cp_dk
| `ebcdic_cp_es
| `ebcdic_cp_fi
| `ebcdic_cp_fr
| `ebcdic_cp_gb
| `ebcdic_cp_gr
| `ebcdic_cp_he
| `ebcdic_cp_is
| `ebcdic_cp_it
| `ebcdic_cp_nl
| `ebcdic_cp_no
| `ebcdic_cp_roece
| `ebcdic_cp_se
| `ebcdic_cp_tr
| `ebcdic_cp_us
| `ebcdic_cp_wt
| `ebcdic_cp_yu
| `ebcdic_cyrillic
| `ebcdic_dk_no
| `ebcdic_dk_no_a
| `ebcdic_es
| `ebcdic_es_a
| `ebcdic_es_s
| `ebcdic_fi_se
| `ebcdic_fi_se_a
| `ebcdic_fr
| `ebcdic_greek
| `ebcdic_int
| `ebcdic_int1
| `ebcdic_is_friss
| `ebcdic_it
| `ebcdic_jp_e
| `ebcdic_jp_kana
| `ebcdic_pt
| `ebcdic_uk
| `ebcdic_us
| `ecma_114
| `ecma_118
| `ecma_cyrillic
| `elot_928
| `es
| `es2
| `euc_jisx0213
| `euc_jp
| `euc_kr
| `euc_tw
| `fi
| `fr
| `friss
| `gb
| `gb18030
| `gb2312
| `gb_1988_80
| `gbk
| `georgian_academy
| `georgian_ps
| `gost_19768_74
| `greek
| `greek7
| `greek7_old
| `greek8
| `greek_ccitt
| `hebrew
| `hp_roman8
| `hu
| `ibm037
| `ibm038
| `ibm1004
| `ibm1026
| `ibm1047
| `ibm1124
| `ibm1129
| `ibm1132
| `ibm1133
| `ibm1160
| `ibm1161
| `ibm1162
| `ibm1163
| `ibm1164
| `ibm256
| `ibm273
| `ibm274
| `ibm275
| `ibm277
| `ibm278
| `ibm280
| `ibm281
| `ibm284
| `ibm285
| `ibm290
| `ibm297
| `ibm367
| `ibm420
| `ibm423
| `ibm424
| `ibm437
| `ibm500
| `ibm819
| `ibm848
| `ibm850
| `ibm851
| `ibm852
| `ibm855
| `ibm856
| `ibm857
| `ibm860
| `ibm861
| `ibm862
| `ibm863
| `ibm864
| `ibm865
| `ibm866
| `ibm866nav
| `ibm868
| `ibm869
| `ibm870
| `ibm871
| `ibm874
| `ibm875
| `ibm880
| `ibm891
| `ibm903
| `ibm904
| `ibm905
| `ibm918
| `ibm922
| `iec_p27_1
| `inis
| `inis_8
| `inis_cyrillic
| `invariant
| `irv
| `isiri_3342
| `iso646_ca
| `iso646_ca2
| `iso646_cn
| `iso646_cu
| `iso646_de
| `iso646_dk
| `iso646_es
| `iso646_es2
| `iso646_fi
| `iso646_fr
| `iso646_fr1
| `iso646_gb
| `iso646_hu
| `iso646_it
| `iso646_jp
| `iso646_jp_ocr_b
| `iso646_kr
| `iso646_no
| `iso646_no2
| `iso646_pt
| `iso646_pt2
| `iso646_se
| `iso646_se2
| `iso646_us
| `iso646_yu
| `iso6937
| `iso_10367_box
| `iso_2033_1983
| `iso_5427
| `iso_5427_1981
| `iso_5427_ext
| `iso_5428
| `iso_5428_1980
| `iso_646_basic
| `iso_646_basic_1983
| `iso_646_irv
| `iso_646_irv_1983
| `iso_646_irv_1991
| `iso_6937
| `iso_6937_1992
| `iso_6937_2_1983
| `iso_6937_2_25
| `iso_6937_2_add
| `iso_8859_1
| `iso_8859_10
| `iso_8859_10_1992
| `iso_8859_11
| `iso_8859_13
| `iso_8859_14
| `iso_8859_15
| `iso_8859_16
| `iso_8859_1_1987
| `iso_8859_2
| `iso_8859_2_1987
| `iso_8859_3
| `iso_8859_3_1988
| `iso_8859_4
| `iso_8859_4_1988
| `iso_8859_5
| `iso_8859_5_1988
| `iso_8859_6
| `iso_8859_6_1987
| `iso_8859_7
| `iso_8859_7_1987
| `iso_8859_8
| `iso_8859_8_1988
| `iso_8859_9
| `iso_8859_9_1989
| `iso_8859_supp
| `iso_9036
| `iso_ir_10
| `iso_ir_100
| `iso_ir_101
| `iso_ir_102
| `iso_ir_103
| `iso_ir_109
| `iso_ir_11
| `iso_ir_110
| `iso_ir_111
| `iso_ir_121
| `iso_ir_122
| `iso_ir_123
| `iso_ir_126
| `iso_ir_127
| `iso_ir_128
| `iso_ir_138
| `iso_ir_139
| `iso_ir_14
| `iso_ir_141
| `iso_ir_142
| `iso_ir_143
| `iso_ir_144
| `iso_ir_146
| `iso_ir_147
| `iso_ir_148
| `iso_ir_15
| `iso_ir_150
| `iso_ir_151
| `iso_ir_152
| `iso_ir_153
| `iso_ir_154
| `iso_ir_155
| `iso_ir_156
| `iso_ir_157
| `iso_ir_158
| `iso_ir_16
| `iso_ir_166
| `iso_ir_17
| `iso_ir_170
| `iso_ir_179
| `iso_ir_18
| `iso_ir_19
| `iso_ir_197
| `iso_ir_2
| `iso_ir_209
| `iso_ir_21
| `iso_ir_226
| `iso_ir_25
| `iso_ir_27
| `iso_ir_37
| `iso_ir_4
| `iso_ir_47
| `iso_ir_49
| `iso_ir_50
| `iso_ir_51
| `iso_ir_54
| `iso_ir_55
| `iso_ir_57
| `iso_ir_6
| `iso_ir_60
| `iso_ir_61
| `iso_ir_69
| `iso_ir_70
| `iso_ir_84
| `iso_ir_85
| `iso_ir_86
| `iso_ir_88
| `iso_ir_89
| `iso_ir_8_1
| `iso_ir_90
| `iso_ir_92
| `iso_ir_98
| `iso_ir_99
| `iso_ir_9_1
| `it
| `jis_c6220_1969_ro
| `jis_c6229_1984_b
| `jis_x0201
| `johab
| `jp
| `jp_ocr_b
| `js
| `jus_i_b1_002
| `jus_i_b1_003_mac
| `jus_i_b1_003_serb
| `koi8_r
| `koi8_t
| `koi8_u
| `koi_7
| `koi_8
| `ksc5636
| `l1
| `l10
| `l2
| `l3
| `l4
| `l5
| `l6
| `l7
| `l8
| `lap
| `latin1
| `latin10
| `latin1_2_5
| `latin2
| `latin3
| `latin4
| `latin5
| `latin6
| `latin7
| `latin8
| `latin_greek
| `latin_greek_1
| `latin_lap
| `mac
| `mac_cyrillic
| `mac_is
| `mac_sami
| `mac_uk
| `macedonian
| `macintosh
| `ms_ansi
| `ms_arab
| `ms_cyrl
| `ms_ee
| `ms_greek
| `ms_hebr
| `ms_turk
| `msz_7795_3
| `named of string
| `naplps
| `nats_dano
| `nats_sefi
| `nc_nc00_10
| `nc_nc00_10_81
| `next
| `nextstep
| `nf_z_62_010
| `nf_z_62_010_1973
| `no
| `no2
| `ns_4551_1
| `ns_4551_2
| `os2latin1
| `pt
| `pt2
| `r8
| `ref_encoding
| `roman8
| `ruscii
| `sami
| `se
| `se2
| `sen_850200_b
| `sen_850200_c
| `serbian
| `shift_jis
| `shift_jisx0213
| `sjis
| `ss636127
| `st_sev_358_88
| `t_101_g2
| `t_61
| `t_61_7bit
| `t_61_8bit
| `tcvn
| `tcvn5712_1
| `tcvn5712_1_1993
| `tcvn_5712
| `tis620
| `tis620_0
| `tis620_2529_1
| `tis620_2533_0
| `tis_620
| `ucs4
| `uk
| `us
| `us_ascii
| `utf16
| `utf16be
| `utf16le
| `utf32
| `utf32be
| `utf32le
| `utf8
| `videotex_suppl
| `viscii
| `win_sami_2
| `winbaltrim
| `windows_sami2
| `ws2
| `x0201
| `yu ]
The list of known encodings.
val name_of_encoding : encoding -> string
Return the name of the encoding.

Example: Char_encodings.name_of_encoding `ibm860 = "IBM860"


Type-safe encodings


Types

type ('a, [< encoding ]) t 
The type of items of type 'a, encoded using the encoding given by the second type parameter (named 'b in the signatures below).

This type, along with its constructor/destructor, is provided as a convenience to represent data encoded within a given encoding.

For instance, an input encoded as ASCII may be represented as a (input, [`ascii]) t.

val as_encoded : 'a -> ([< encoding ] as 'b) -> ('a, 'b) t
as_encoded x enc returns an element of type t used to mark that x is encoded with encoding enc.

Example: let stdin_lat = Char_encodings.as_encoded stdin `latin1

val encoded_as : ('a, [< encoding ]) t -> 'a
encoded_as t returns the x such that t = as_encoded x enc.

Example: encoded_as stdin_lat = stdin

val encoding_of_t : ('a, [< encoding ] as 'b) t -> 'b
Return the encoding of a t.

Example: encoding_of_t stdin_lat = `latin1
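
The tagging idea behind as_encoded, encoded_as and encoding_of_t can be modelled with the standard library alone. The sketch below is an assumption about one possible representation (a plain pair), not this module's actual implementation; the module name Tagged and the function latin1_length are hypothetical.

```ocaml
(* A standalone model of the tagging functions; not the library's code. *)
module Tagged : sig
  type ('a, 'enc) t
  val as_encoded : 'a -> 'enc -> ('a, 'enc) t
  val encoded_as : ('a, 'enc) t -> 'a
  val encoding_of_t : ('a, 'enc) t -> 'enc
end = struct
  (* Assumed representation: the value paired with its encoding tag. *)
  type ('a, 'enc) t = 'a * 'enc
  let as_encoded x enc = (x, enc)
  let encoded_as (x, _) = x
  let encoding_of_t (_, enc) = enc
end

(* Accepts only data tagged as Latin-1; other tags fail to type-check. *)
let latin1_length (s : (string, [ `latin1 ]) Tagged.t) : int =
  String.length (Tagged.encoded_as s)

let () =
  let s = Tagged.as_encoded "caf\xe9" `latin1 in
  assert (Tagged.encoded_as s = "caf\xe9");
  assert (Tagged.encoding_of_t s = `latin1);
  assert (latin1_length s = 4)
```

Because the encoding appears in the type, passing a value tagged `utf8 to latin1_length is rejected at compile time rather than producing garbled text at run time.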

Transcoders


val transcode_in : (BatIO.input, [< encoding ]) t ->
([< encoding ] as 'a) ->
(BatIO.input, 'a) t
Convert the contents of an input between encodings.

transcode_in inp enc produces a new input, whose contents are the same as those of inp. However, the encoding of the result is specified by enc.

Note: the resulting input may raise Malformed_code if the encoding specified as 'a was incorrect.

Example: let stdin_ebc = transcode_in stdin_lat `ebcdic_us
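
What "same contents, different encoding" means can be shown for one fixed pair of encodings. The sketch below converts Latin-1 to UTF-8 on plain strings with the standard library (OCaml 4.06 or later for Buffer.add_utf_8_uchar). It is a simplified model, not how transcode_in is implemented, and latin1_to_utf8 is a hypothetical name.

```ocaml
(* Re-encode a Latin-1 byte string as UTF-8. Every Latin-1 byte value
   equals the Unicode code point of the character it denotes. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

let () =
  (* 0xE9 is "é" in Latin-1; in UTF-8 it becomes the two bytes C3 A9. *)
  assert (latin1_to_utf8 "caf\xe9" = "caf\xc3\xa9")
```

The text is unchanged by the conversion; only the bytes that represent it differ.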

val transcode_out : (unit BatIO.output, [< encoding ]) t ->
([< encoding ] as 'a) ->
(unit BatIO.output, 'a) t
Convert the contents of an output between encodings.

transcode_out out enc produces a new output. Anything written to this output should be written with encoding enc and is translated to the encoding of out before being written to out.

Example: let out_utf8 = Char_encodings.transcode_out (Char_encodings.as_encoded stdout `utf8) `latin1 in BatIO.nwrite (Char_encodings.encoded_as out_utf8) "«áé» in unicode"

This code writes the UTF-8 version of the given Latin-1 string to stdout.
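
The direction of the conversion on the output side can be sketched as a wrapper around a write function: the caller writes Latin-1, and the wrapper forwards UTF-8 to the underlying destination. This is a standard-library model under stated assumptions, not the module's implementation; latin1_writer and write_utf8 are hypothetical names.

```ocaml
(* Wrap a UTF-8 sink so that callers can write Latin-1 to it.
   [write_utf8] is a hypothetical stand-in for the underlying output. *)
let latin1_writer (write_utf8 : string -> unit) : string -> unit =
  fun latin1 ->
    let buf = Buffer.create (String.length latin1) in
    String.iter
      (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
      latin1;
    write_utf8 (Buffer.contents buf)

let () =
  (* Use an in-memory buffer as the UTF-8 destination. *)
  let sink = Buffer.create 16 in
  let write = latin1_writer (Buffer.add_string sink) in
  (* Bytes 0xAB and 0xBB are the guillemets in Latin-1. *)
  write "\xabok\xbb";
  assert (Buffer.contents sink = "\xc2\xabok\xc2\xbb")
```

As with transcode_out, the encoding named by the caller describes what is written in, while the wrapped destination keeps its own encoding.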