From yoriyuki at mbg.ocn.ne.jp Tue Nov 25 18:18:18 2003
From: yoriyuki at mbg.ocn.ne.jp (Yamagata Yoriyuki)
Date: Wed, 26 Nov 2003 11:18:18 +0900 (JST)
Subject: [ocaml-i18n] Native UTF-8 strings
In-Reply-To:
References: <20031124.222026.85419707.yoriyuki@mbg.ocn.ne.jp> <20031124134216.GA27961@redhat.com>
Message-ID: <20031126.111818.28780337.yoriyuki@mbg.ocn.ne.jp>

From: Jun.Furuse at inria.fr
Subject: Re: [ocaml-i18n] Native UTF-8 strings
Date: Mon, 24 Nov 2003 16:05:29 +0100

>
> * string manipulation functions for each encoding like UTF-8,
>   and how to cleanly provide them to users. (Basically I agree with
>   Yoriyuki's idea of using the open directive.)
>
> * how to write i18n string/char constants in Caml programs, and
>   to print them on screen.
>
> * and how to provide non-european identifiers and module names
>
> Are there anything missing? (or is my classement totally nonsense?)

We need a way to interact with the environment, such as converting
input strings to Unicode (as pointed out by Rich), getting the locale,
etc. This seems to be one of the hardest parts of all, since 1) there
is no portable way to do it, and 2) there is a tension between tight
integration with the environment and portability across different
OSes.

Even the interpretation of a specific encoding can differ across
platforms (as in the case of SJIS). On a desktop, a user would expect
an OCaml program to interpret SJIS codes in the same way as other,
non-OCaml programs. For a server program, one would expect SJIS to be
decoded in a platform-independent way. Locale names, collation, etc.
have similar (and worse) problems.

But maybe we can factor out the OS-dependent part with a functor,
which would let developers supply their favorite i18n functions to
the system and override the default with the open directive.
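
A minimal sketch of the functor idea above. Every name in it (ENV,
Portable_env, Make_i18n, My_env, Custom) is hypothetical and only
illustrates the shape such a design might take; it is not an existing
library.

```ocaml
(* Hypothetical signature for the OS-dependent part. *)
module type ENV = sig
  val locale_name : unit -> string
  (* Decode bytes in the named encoding into Unicode code points. *)
  val decode : encoding:string -> string -> int list
end

(* Platform-independent default: only knows US-ASCII and ISO-8859-1. *)
module Portable_env : ENV = struct
  let locale_name () = "C"
  let decode ~encoding s =
    match String.uppercase_ascii encoding with
    | "US-ASCII" | "ISO-8859-1" ->
      List.init (String.length s) (fun i -> Char.code s.[i])
    | other -> invalid_arg ("Portable_env.decode: unsupported " ^ other)
end

(* The i18n layer is written once, as a functor over the environment. *)
module Make_i18n (E : ENV) = struct
  let locale_name = E.locale_name
  let length_in_chars ~encoding s = List.length (E.decode ~encoding s)
end

(* Default instance that a library could ship. *)
module I18n = Make_i18n (Portable_env)

(* A developer's own environment, reusing the default decoder. *)
module My_env : ENV = struct
  include Portable_env
  let locale_name () = "ja_JP.SJIS"
end

(* Packaging the custom instance under the same module name ... *)
module Custom = struct
  module I18n = Make_i18n (My_env)
end

(* ... lets "open" shadow the default from this point on. *)
open Custom

let () = print_endline (I18n.locale_name ())
```

Code written against I18n does not change when the environment is
swapped; only the open line decides which instance is in scope.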
--
Yamagata Yoriyuki

From yoriyuki at mbg.ocn.ne.jp Tue Nov 25 17:17:44 2003
From: yoriyuki at mbg.ocn.ne.jp (Yamagata Yoriyuki)
Date: Wed, 26 Nov 2003 10:17:44 +0900 (JST)
Subject: [ocaml-i18n] Native UTF-8 strings
In-Reply-To:
References: <20031124.222026.85419707.yoriyuki@mbg.ocn.ne.jp> <20031124134216.GA27961@redhat.com>
Message-ID: <20031126.101744.18575319.yoriyuki@mbg.ocn.ne.jp>

It looks like most people have moved to Orcaware, so let's continue
the discussion.

This message contains a Japanese character. If this causes a serious
problem for someone, please let me know.

From: Jun.Furuse at inria.fr
Subject: Re: [ocaml-i18n] Native UTF-8 strings
Date: Mon, 24 Nov 2003 16:05:29 +0100

> Asian identifiers inside Caml language? Funny, but we would carefully
> determine how to treat Caml's case sensitiveness in such Asian
> languages without capital/non-capital differences of letters. (In
> perl, there is no such a problem, I think.) If we go to this direction,
> we should start from Caml programming in Greece and Russian,
> which seem to be easier :-).

Mame proposed making the prefix "?" (geki) equivalent to
capitalization. See the 2003/05/08 entry of

http://dame.dyndns.org/memo.cgi (in Japanese)

But seriously, we should follow the Unicode character properties
here. The Unicode standard defines all ideographs as "neutral" (no
case). All neutral characters in OCaml identifiers could be treated
as lowercase. If someone wants an ideograph module name, we can use
some convention like prefixing the ideographs with "M". The same goes
for variants.
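
A rough sketch of the "neutral counts as lowercase" rule above. The
types and function names are hypothetical, and the hard-coded ranges
(ASCII, basic Greek, basic Cyrillic) are a stand-in for a real lookup
of Unicode character properties; everything outside them, including
the CJK ideographs, falls through to Neutral.

```ocaml
type casing = Upper | Lower | Neutral

(* Assumption: only a few ranges are classified; a real implementation
   would consult the Unicode character database instead. *)
let casing_of_code_point (cp : int) : casing =
  if (cp >= 0x41 && cp <= 0x5A)                      (* A..Z *)
  || (cp >= 0x391 && cp <= 0x3A9 && cp <> 0x3A2)     (* Greek capitals *)
  || (cp >= 0x410 && cp <= 0x42F)                    (* Cyrillic capitals *)
  then Upper
  else if (cp >= 0x61 && cp <= 0x7A)                 (* a..z *)
  || (cp >= 0x3B1 && cp <= 0x3C9)                    (* Greek small letters *)
  || (cp >= 0x430 && cp <= 0x44F)                    (* Cyrillic small letters *)
  then Lower
  else Neutral

type ident_class = Constructor_like | Value_like

(* Classify an identifier by the code point of its first character:
   neutral characters are treated like lowercase ones. *)
let classify_first (cp : int) : ident_class =
  match casing_of_code_point cp with
  | Upper -> Constructor_like
  | Lower | Neutral -> Value_like
```

Under this rule an identifier starting with an ideograph is a value
name, while the same ideographs prefixed with "M" start with an
uppercase letter and so can name a module or a variant.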
--
Yamagata Yoriyuki

From rich at annexia.org Wed Nov 26 01:52:46 2003
From: rich at annexia.org (Richard Jones)
Date: Wed, 26 Nov 2003 09:52:46 +0000
Subject: [ocaml-i18n] Native UTF-8 strings
In-Reply-To: <20031126.111818.28780337.yoriyuki@mbg.ocn.ne.jp>
References: <20031124.222026.85419707.yoriyuki@mbg.ocn.ne.jp> <20031124134216.GA27961@redhat.com> <20031126.111818.28780337.yoriyuki@mbg.ocn.ne.jp>
Message-ID: <20031126095246.GB12571@redhat.com>

On Wed, Nov 26, 2003 at 11:18:18AM +0900, Yamagata Yoriyuki wrote:
> We need a way to interact with the environment, like converting input
> string to Unicode (as pointed out by Rich), getting locale etc. It
> seems one of hardest part of all,

I think the real problem is that you have no way to know what encoding
your streams are using! In very practical terms, is that file on disk
ISO-8859-1, UTF-8 or SJIS?

Perl solves this by allowing you to specify the encoding when opening
a file. eg:

  open FH, "<:utf8", "file";

where the "<:utf8" argument defines a translation "layer" between the
file and the program. I don't think this is a very elegant solution.

Anyway, leaving such issues aside, I think these steps would help:

* Have ML files be UTF-8 encoded by default. The current choice,
  ISO-8859-1, is arbitrary, and essentially derived from the fact that
  OCaml was developed in Europe. So if you're going to choose
  something arbitrary, either strict US-ASCII or UTF-8 seem like much
  more sensible choices.

* Add \U escape sequences for string literals.

* Store strings internally either as wide chars or as UTF-8, and
  change the String module accordingly to work on characters.

* For byte strings, have a separate bytestring type. Using type
  'string' when you really want bytestrings always seemed to me to be
  a hack.

Rich.

--
Richard Jones. http://www.annexia.org/ http://freshmeat.net/users/rwmj
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
C2LIB is a library of basic Perl/STL-like types for C.
Vectors, hashes, trees, string funcs, pool allocator:
http://www.annexia.org/freeware/c2lib/

From yoriyuki at mbg.ocn.ne.jp Thu Nov 27 05:54:28 2003
From: yoriyuki at mbg.ocn.ne.jp (Yamagata Yoriyuki)
Date: Thu, 27 Nov 2003 22:54:28 +0900 (JST)
Subject: [ocaml-i18n] Native UTF-8 strings
In-Reply-To: <20031126095246.GB12571@redhat.com>
References: <20031126.111818.28780337.yoriyuki@mbg.ocn.ne.jp> <20031126095246.GB12571@redhat.com>
Message-ID: <20031127.225427.45748803.yoriyuki@mbg.ocn.ne.jp>

From: Richard Jones
Subject: Re: [ocaml-i18n] Native UTF-8 strings
Date: Wed, 26 Nov 2003 09:52:46 +0000

> I think the real problem is that you have no way to know what encoding
> your streams are using!

On POSIX systems, the default encoding is given by the LC_CTYPE
locale category. (Actually, this is defined in ISO C.) I don't know
the situation on Windows and OS X, but I suppose similar mechanisms
exist. For the locale mechanism, see Kubota's introduction

http://www.de.debian.org/doc/manuals/intro-i18n/ch-locale.en.html

(The whole document is a very good introduction to i18n.)

--
Yamagata Yoriyuki
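
A minimal sketch of how an OCaml program might guess the default
encoding from the LC_CTYPE mechanism described above, using only the
standard library. The function names are hypothetical. It reads the
environment variables directly (LC_ALL, then LC_CTYPE, then LANG,
following the POSIX precedence) and takes the codeset part of a
locale name such as "ja_JP.eucJP". A C program would normally call
setlocale and nl_langinfo(CODESET) instead, so this is only an
approximation.

```ocaml
(* Extract the codeset from a locale name of the form
   language_TERRITORY.codeset@modifier. Returns None when the name
   carries no codeset, e.g. "C" or "en_US". *)
let codeset_of_locale (name : string) : string option =
  match String.index_opt name '.' with
  | None -> None
  | Some dot ->
    let rest = String.sub name (dot + 1) (String.length name - dot - 1) in
    let codeset =
      match String.index_opt rest '@' with
      | None -> rest
      | Some at -> String.sub rest 0 at
    in
    if codeset = "" then None else Some codeset

(* Locale name in effect for LC_CTYPE. Empty values count as unset.
   The getenv parameter exists only so the lookup can be tested. *)
let ctype_locale ?(getenv = Sys.getenv_opt) () : string option =
  let lookup var =
    match getenv var with
    | None | Some "" -> None
    | some -> some
  in
  match lookup "LC_ALL" with
  | Some _ as r -> r
  | None ->
    match lookup "LC_CTYPE" with
    | Some _ as r -> r
    | None -> lookup "LANG"

(* Default codeset according to the environment, if it names one. *)
let default_codeset ?getenv () : string option =
  match ctype_locale ?getenv () with
  | None -> None
  | Some name -> codeset_of_locale name
```

A caller would fall back to some fixed encoding when default_codeset
returns None, for example when the locale is "C".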