[Ocaml-pxp-users] PXP: Saving memory

Tue Apr 18 13:20:19 PDT 2006

On Tuesday, 18.04.2006, at 11:17 +0100, Richard Jones wrote:
> We have a problematic XML document (a daily report) which runs to
> something like 600-700 MB in size.  It is proving hard to parse this
> document with PXP, because our machine runs out of memory and starts
> thrashing.  In future this document will only grow in size.
> 
> We're looking for options to reduce the amount of memory used.  One
> option would be to somehow parse the document incrementally, but I
> don't think this is possible with PXP.

Yes, this is possible! Of course, you do not get a tree as a result,
but only a (well-formed) token stream.

For a simple example, see examples/pullparser/pull.ml. This example
processes the token stream with a stream parser.

It is also possible to mix the stream and tree-based approaches. See
Pxp_document.liquefy and Pxp_document.solidify. (Note, however, that
there is a serious bug in liquefy.) For example, if your document
consists of an open-ended sequence of parts, you can separate the parts
using the stream approach, and call solidify to create a tree from the
tokens each part consists of.
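
To illustrate the idea of separating the parts with a stream and
building a small tree per part, here is a minimal sketch. It does not
use PXP at all: the event and tree types, build_element, iter_rows and
events_of_list are hypothetical stand-ins (PXP's own event type carries
more information, and solidify would take the place of build_element).
It only shows the shape of the loop.

```ocaml
(* Hypothetical, simplified event and tree types; these are NOT PXP's. *)
type event =
  | Start_tag of string
  | End_tag of string
  | Char_data of string

type tree =
  | Element of string * tree list
  | Text of string

(* Build one tree from the events that follow an already consumed start
   tag [name], stopping at the matching end tag. *)
let rec build_element next name =
  let rec children acc =
    match next () with
    | Some (Start_tag n) -> children (build_element next n :: acc)
    | Some (Char_data s) -> children (Text s :: acc)
    | Some (End_tag n) when n = name -> List.rev acc
    | Some (End_tag n) -> failwith ("unexpected end tag: " ^ n)
    | None -> failwith "unexpected end of stream"
  in
  Element (name, children [])

(* Pull events one at a time. Whenever a <row> starts, build only that
   row as a tree and hand it to [f]; the row can be collected afterwards,
   so memory use is bounded by the largest row, not by the document. *)
let iter_rows f next =
  let rec loop () =
    match next () with
    | None -> ()
    | Some (Start_tag "row") -> f (build_element next "row"); loop ()
    | Some _ -> loop ()
  in
  loop ()

(* Hypothetical event source for the demo: replays a list of events. *)
let events_of_list l =
  let rest = ref l in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl -> rest := tl; Some e

let () =
  let next =
    events_of_list
      [ Start_tag "report";
        Start_tag "row"; Char_data "a"; End_tag "row";
        Start_tag "row"; Char_data "b"; End_tag "row";
        End_tag "report" ]
  in
  let count = ref 0 in
  iter_rows (fun _ -> incr count) next;
  Printf.printf "%d rows\n" !count
```

With a real parser, next would be the function that pulls the next
token from the parser instead of from a list.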

> Another option would be to use "pools".  However documentation is very
> thin on how to use these.  It seems that the parsing function we are
> using, parse_wfdocument_entity, doesn't allow pools to be passed, and
> that's assuming we even knew how to create pools in the first place,
> which isn't very obvious.

Pools are configured in the config record. However, don't expect
miracles. Pools try to increase string sharing, i.e. equal strings are
represented by the same memory block. The problem is that the pool
algorithm cannot tell for which strings sharing pays off and for which
it does not (a string that occurs only once gains nothing, and the
management costs are then much higher than the benefits).
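
As a rough illustration of what string sharing means, here is a minimal
sketch of a pool built on a hash table. This is not PXP's pool
implementation; make_pool and intern are hypothetical names, and the
table size is arbitrary. It shows both the benefit (equal strings end
up as one block) and the cost (every string, even one that occurs only
once, gets a table entry).

```ocaml
(* A hypothetical string pool: NOT PXP's implementation, only the idea. *)
let make_pool () : string -> string =
  let table : (string, string) Hashtbl.t = Hashtbl.create 1024 in
  fun s ->
    match Hashtbl.find_opt table s with
    | Some shared -> shared            (* reuse the block seen earlier *)
    | None -> Hashtbl.add table s s; s (* first occurrence: pure overhead *)

let () =
  let intern = make_pool () in
  (* Two separately allocated strings with equal contents. *)
  let a = intern (String.concat "" ["r"; "ow"]) in
  let b = intern (String.concat "" ["ro"; "w"]) in
  (* After interning, both names refer to the same memory block. *)
  assert (a == b)
```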

If you have a DTD, pools have a negative effect: every reference to a
declared element or attribute already points to the same memory block,
so you only have the costs and no benefits.

> The document isn't very complicated - it's just a simple list of
> <row>'s.
> 
> Can someone give me suggestions?

Well, then I would go with streams.

> Rich.
> 
> PS. One thing we found when parsing this, is that #find_all_elements
> isn't tail recursive, meaning that it causes a crash on even fairly
> modest documents.

If you have a fix, I'll include it in the distribution.
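
For reference, one common way to avoid deep recursion in such a search
is an explicit work list with an accumulator. The sketch below uses a
hypothetical record type instead of PXP's node classes, and it is not
the library's code; it only shows the technique on a plain tree.

```ocaml
(* Hypothetical tree type; PXP's nodes are objects, not records. *)
type node = { name : string; children : node list }

(* Collect every descendant of [root] named [wanted], in document order.
   The pending nodes live in an explicit list, and [go] is tail
   recursive, so the call stack does not grow with the document. *)
let find_all_elements wanted root =
  let rec go acc = function
    | [] -> List.rev acc
    | n :: rest ->
      let acc = if n.name = wanted then n :: acc else acc in
      (* Same as [n.children @ rest], but built with tail-recursive calls. *)
      go acc (List.rev_append (List.rev n.children) rest)
  in
  go [] root.children

let () =
  let leaf name = { name; children = [] } in
  let root =
    { name = "report"; children = List.init 100_000 (fun _ -> leaf "row") }
  in
  Printf.printf "%d\n" (List.length (find_all_elements "row" root))
```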

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd at gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------