[ocaml-i18n] Native UTF-8 strings

Richard Jones rich at annexia.org
Wed Nov 26 01:52:46 PST 2003


On Wed, Nov 26, 2003 at 11:18:18AM +0900, Yamagata Yoriyuki wrote:
> We need a way to interact with the environment, like converting input
> string to Unicode (as pointed out by Rich), getting locale etc.  It
> seems one of hardest part of all,

I think the real problem is that you have no way to know what encoding
your streams are using!

In very practical terms, is that file on disk ISO-8859-1, UTF-8 or
SJIS?
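
To make the ambiguity concrete, here is a minimal OCaml sketch. The helper name latin1_to_utf8 is made up for illustration, and Buffer.add_utf_8_uchar and String.is_valid_utf_8 assume a recent standard library (OCaml 4.14 or later). It shows the same two bytes being perfectly legal under two different encodings, so the bytes alone cannot tell you which one was meant:

```ocaml
(* Hypothetical helper: treat each byte as an ISO-8859-1 character and
   re-encode it as UTF-8. Latin-1 byte values equal their Unicode code points. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (2 * String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

let () =
  let bytes_on_disk = "\xC3\xA9" in
  (* Read as UTF-8, these two bytes are well formed: the single character U+00E9. *)
  assert (String.is_valid_utf_8 bytes_on_disk);
  (* Read as ISO-8859-1 they are equally legal: the characters U+00C3 and U+00A9. *)
  assert (latin1_to_utf8 bytes_on_disk = "\xC3\x83\xC2\xA9");
  (* A lone 0xE9 is fine as ISO-8859-1 but is not valid UTF-8. *)
  assert (not (String.is_valid_utf_8 "\xE9"))
```

Validity checks can rule an encoding out, as in the last assertion, but they can never confirm which encoding the writer intended.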

Perl solves this by allowing you to specify the encoding when opening
a file, e.g.:

open FH, "<:utf8", "file";

where the "<:utf8" argument defines a translation "layer" between the
file and the program.

I don't think this is a very elegant solution.

Anyway, leaving such issues aside, I think these steps would help:

* Have ML files be UTF-8 encoded by default. The current choice,
ISO-8859-1, is arbitrary, and essentially derived from the fact that
OCaml was developed in Europe. So if you're going to choose something
arbitrary, either strict US-ASCII or UTF-8 seems like a much more
sensible choice.

* Add \U escape sequences for string literals.

* Store strings internally either as wide chars or as UTF-8, and
change the String module accordingly to work on characters.

* For byte strings, have a separate bytestring type. Using type
'string' when you really want bytestrings has always seemed to me to
be a hack.
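
The difference between a byte-oriented and a character-oriented String module can be sketched as follows. This is only an illustration under stated assumptions: utf8_length is a hypothetical name, not an existing library function, and String.get_utf_8_uchar and Uchar.utf_decode_length assume OCaml 4.14 or later. It counts characters in a UTF-8 string, whereas String.length counts bytes:

```ocaml
(* Hypothetical helper: count the Unicode characters in a UTF-8 encoded
   string. Each malformed sequence is counted as one character. *)
let utf8_length (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      (* Decode one character starting at byte offset i. *)
      let d = String.get_utf_8_uchar s i in
      (* Advance by the number of bytes that character occupied. *)
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  (* "naive" with a diaeresis on the i, which takes two bytes in UTF-8. *)
  let s = "na\xC3\xAFve" in
  assert (String.length s = 6);  (* bytes *)
  assert (utf8_length s = 5)     (* characters *)
```

A String module that "works on characters" would have to make this kind of decoding the default behaviour of length and indexing, which is why the internal representation matters.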

Rich.

-- 
Richard Jones. http://www.annexia.org/ http://freshmeat.net/users/rwmj
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
C2LIB is a library of basic Perl/STL-like types for C. Vectors, hashes,
trees, string funcs, pool allocator: http://www.annexia.org/freeware/c2lib/


