[Ocaml-pxp-users] lex, ulex, wlex, UTF-8

Sat Dec 17 09:41:56 PST 2005

On Saturday, 17.12.2005, at 16:06 +0000, Richard Jones wrote:
> Can someone tell me what lex, ulex and wlex are?  What is the
> difference between them?  Which one should I be using?
> 
> On a related note, I want to have PXP just use UTF-8 everywhere.  I
> want it to assume that all the strings I give it are UTF-8, I want it
> to write UTF-8 documents, and I want it to parse documents into UTF-8
> strings in memory.  I've managed (I think) to get it to write UTF-8
> documents (#write ... `Enc_utf8) and parse documents into UTF-8 (set
> config.encoding to `Enc_utf8).  However I can't do the first thing -
> get it to assume all strings I pass to it are UTF-8 encoded.  It
> attempts to convert my strings from ISO-8859-1 to UTF-8, which isn't
> useful because all the strings are already UTF-8.  How do I do this?

Yes, this is possible, and you are right that this has to do with the
different lexers. If you want to parse UTF-8 efficiently, the lexer must
know the encoding, because the lexer tokenizes the character stream, and
XML allows tokens to contain non-ASCII characters (e.g. it is perfectly
legal to use Japanese words as XML tag names).

(ocaml)lex, ulex, and wlex are three different lexer generators.
Historically, PXP only used ocamllex, but ocamllex can only process
8-bit characters. Nevertheless, we managed to analyze UTF-8 with the
help of a rule transformer (written by Claudio Sacerdoti Coen) that
accepts Unicode rules as input and converts them to byte-based rules
that understand UTF-8. This analyzer is still available as the library
pxp-lex-utf8. It is fast, but the table size of the lexer is large.
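
To illustrate what "byte-based rules that understand UTF-8" means, here
is a minimal sketch of the code point to byte sequence mapping that any
such transformation has to respect. This is not the actual rule
transformer; the function names utf8_encode and hex_of_bytes are made up
for this example.

```ocaml
(* Illustration only: encode one Unicode code point as UTF-8 bytes.
   A Unicode rule matching this code point corresponds to a byte-based
   rule matching exactly this byte sequence. *)
let utf8_encode (cp : int) : string =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_encode: not a Unicode scalar value"
  else if cp < 0x80 then
    add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b

(* Render a byte string as space-separated hex, for display. *)
let hex_of_bytes (s : string) : string =
  String.concat " "
    (List.init (String.length s)
       (fun i -> Printf.sprintf "%02X" (Char.code s.[i])))

(* HIRAGANA LETTER A (U+3042) becomes three bytes. *)
let () = print_endline (hex_of_bytes (utf8_encode 0x3042))
(* prints "E3 81 82" *)
```

A rule covering a whole Unicode range expands into many such byte
sequences, which is one plausible reason for the large tables.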

wlex and ulex are two improvements to this situation (both written by
Alain Frisch). wlex is a patched ocamllex. It works by first classifying
the input Unicode characters and then using the character class (which
is again an 8-bit entity) in the rules. PXP can be built with wlex
support; the library is called pxp-wlex-utf8. Because wlex must be
distributed as a patch (due to the O'Caml license), it was never really
liked by users. The wlex lexer is small in size, but a bit slower than
the ocamllex-based UTF-8 lexer. It is likely that wlex will be
discontinued at some point.
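
The two-stage idea (classify first, then match on classes) can be
sketched roughly as follows. This is not wlex code: the class numbers,
the function names, and the character ranges are assumptions chosen for
illustration, and the ranges are a much simplified stand-in for the real
XML name rules.

```ocaml
(* Stage 1: map a code point to a small class number that fits in
   8 bits. Ranges are a simplified, hypothetical subset. *)
let class_other = 0
let class_name_start = 1
let class_name_char = 2

let classify (cp : int) : int =
  if (cp >= 0x41 && cp <= 0x5A)          (* A-Z *)
     || (cp >= 0x61 && cp <= 0x7A)       (* a-z *)
     || cp = 0x5F || cp = 0x3A           (* '_' and ':' *)
     || (cp >= 0x3041 && cp <= 0x3094)   (* some hiragana *)
     || (cp >= 0x4E00 && cp <= 0x9FA5)   (* some CJK ideographs *)
  then class_name_start
  else if (cp >= 0x30 && cp <= 0x39)     (* 0-9 *)
          || cp = 0x2D || cp = 0x2E      (* '-' and '.' *)
  then class_name_char
  else class_other

(* Stage 2: the "rule" only ever sees class numbers, never the raw
   code points, so it can stay within an 8-bit alphabet. *)
let is_name (cps : int list) : bool =
  match List.map classify cps with
  | [] -> false
  | first :: rest ->
    first = class_name_start
    && List.for_all
         (fun c -> c = class_name_start || c = class_name_char)
         rest
```

The classification step costs an extra lookup per character, which fits
with the lexer being small but somewhat slower.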

ulex is the "state of the art" Unicode lexer for O'Caml. ulex is simply
a camlp4 module, so lexing rules can be included directly in O'Caml
source code (very elegant). ulex can process Unicode characters
directly, so the 8-bit limitation is gone. PXP can be built with ulex
support; the library is called pxp-ulex-utf8. It is a bit slower than
wlex, but still acceptable. The generated code is a bit larger than that
of wlex, but still only 1/4 of the size of the ocamllex-based lexer.

So if you need speed, use pxp-lex-utf8; if code size matters, use
pxp-ulex-utf8.

To enable UTF-8 parsing, you need to link one of the UTF-8 lexers into
the executable, and set the internal representation to UTF-8. To do the
latter, set the "encoding" field of the config record to `Enc_utf8. (PXP
still reads files in other encodings, but converts them to UTF-8
instead of ISO-8859-1.)

There is one known problem with UTF-8 parsing: some editors prepend the
three bytes 0xEF 0xBB 0xBF to UTF-8 files (a so-called byte order mark,
which is meaningless for UTF-8, although a Unicode document allows it).
PXP rejects that. I have not found time to fix this yet.
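
Until this is fixed, one possible workaround is to remove the three
bytes yourself before the data reaches the parser. A minimal sketch,
assuming the document is already available as a string; the names
utf8_bom and strip_utf8_bom are invented here and are not part of PXP:

```ocaml
(* The UTF-8 byte order mark: 0xEF 0xBB 0xBF. *)
let utf8_bom = "\xEF\xBB\xBF"

(* Return [s] without a leading UTF-8 byte order mark, if present.
   Strings that are shorter than the mark, or that do not start with
   it, are returned unchanged. *)
let strip_utf8_bom (s : string) : string =
  let n = String.length utf8_bom in
  if String.length s >= n && String.sub s 0 n = utf8_bom
  then String.sub s n (String.length s - n)
  else s
```

The result can then be passed on as the document text. Only a mark at
the very start is removed; the same bytes elsewhere in the data are left
alone.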

Hope this information helps.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd at gerd-stolpmann.de          http://www.gerd-stolpmann.de
Telefon: 06151/153855                  Telefax: 06151/997714
------------------------------------------------------------