From gdw at wave.co.nz Thu May 14 00:28:14 2009
From: gdw at wave.co.nz (Glyn Webster)
Date: Thu, 14 May 2009 19:28:14 +1200
Subject: [Ocaml-pxp-users] How do I resolve PUBLIC external entities from inside DTD files?
Message-ID: <200905141928.14573.gdw@wave.co.nz>

I'm trying to work out how to use PXP, so I've written a toy program to read XHTML
files. It resolves PUBLIC ids to files in the current directory, where I've placed
W3C's XHTML DTD files. It can open the DTD file from inside an XHTML file, but it
does not open PUBLIC entity files from within the DTD. I've attached the program
and the error that comes up. Please, could anyone tell me what I'm doing wrong?

************************************************************
glyn at ela:~/Ocamldtd/Test$ ls
pxptest.ml  xhtml1-transitional-2.dtd  xhtml-lat1.ent     xhtml-symbol.ent
test.html   xhtml1-transitional.dtd    xhtml-special.ent

glyn at ela:~/Ocamldtd/Test$ ocamlfind ocamlc -package pxp -linkpkg pxptest.ml && ./a.out
In entity [toplevel] = SYSTEM "file://localhost/home/glyn/Ocamldtd/Test/test.html", at line 4, position 64:
In entity [dtd] = PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd", at line 29, position 0:
ERROR: Unable to open the external entity: HTMLlat1 = PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "xhtml-lat1.ent"

************************************************************
pxptest.ml

let catalog =
  new Pxp_reader.lookup_public_id_as_file
    [ ("-//W3C//DTD XHTML 1.0 Transitional", "xhtml1-transitional.dtd");
      ("-//W3C//ENTITIES Latin 1 for XHTML//EN", "xhtml-lat1.ent");
      ("-//W3C//ENTITIES Symbols 1 for XHTML//EN", "xhtml-symbol.ent");
      ("-//W3C//ENTITIES Special 1 for XHTML//EN", "xhtml-special.ent") ]
;;

try
  let document =
    Pxp_tree_parser.parse_document_entity
      Pxp_types.default_config
      (Pxp_types.from_file ~alt:[catalog] "test.html")
      Pxp_tree_parser.default_spec
  in
  document # root # write (`Out_channel stdout) `Enc_utf8
with exn ->
  print_endline (Pxp_types.string_of_exn exn)
;;

************************************************************
test.html

<!DOCTYPE html PUBLIC
    "-//W3C//DTD XHTML 1.0 Transitional"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...etc...

--Glyn

From gerd at gerd-stolpmann.de Thu May 14 07:14:44 2009
From: gerd at gerd-stolpmann.de (Gerd Stolpmann)
Date: Thu, 14 May 2009 16:14:44 +0200
Subject: [Ocaml-pxp-users] How do I resolve PUBLIC external entities from inside DTD files?
In-Reply-To: <200905141928.14573.gdw@wave.co.nz>
References: <200905141928.14573.gdw@wave.co.nz>
Message-ID: <1242310484.30228.13.camel@flake.lan.gerd-stolpmann.de>

This is a bad interaction between the file resolver (inside
Pxp_types.from_file) and the catalog resolver. What happens is this:
because HTMLlat1 also has a file name attached to the PUBLIC name, the
file resolver tries to open the entity by file name. However, the
information about which directory the file should be opened relative to
is lost, because it is an "inner" PUBLIC entity.

If it is an option for you, you can remove this file name attachment
from PUBLIC, as in

<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "">

(note the empty string).

Otherwise, I'd recommend reversing the resolution order: first try the
catalog, then the file resolution, something like

let file_resolver = new Pxp_reader.resolve_as_file ...
let resolver = new Pxp_reader.combine [ catalog; file_resolver ]
let file_url = Pxp_reader.make_file_url "test.html"
let source =
  Pxp_types.ExtID (Pxp_types.System (Neturl.string_of_url file_url), resolver)

(untested, however).

The SVN version of PXP also contains a tentative fix: the problematic
case is avoided, because the file resolver is no longer used when the
directory information is lost.

BTW, the catalog must read

let catalog =
  new Pxp_reader.lookup_public_id_as_file
    [ ("-//W3C//DTD XHTML 1.0 Transitional", "xhtml1-transitional.dtd");
      ("-//W3C//ENTITIES Latin 1 for XHTML//EN", "xhtml-lat1.ent");
      ("-//W3C//ENTITIES Symbols for XHTML//EN", "xhtml-symbol.ent");
      ("-//W3C//ENTITIES Special for XHTML//EN", "xhtml-special.ent") ]
;;

Hope this helps,
Gerd

--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd at gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

From gdw at wave.co.nz Thu May 14 19:46:53 2009
From: gdw at wave.co.nz (Glyn Webster)
Date: Fri, 15 May 2009 14:46:53 +1200
Subject: [Ocaml-pxp-users] How do I resolve PUBLIC external entities from inside DTD files?
Message-ID: <200905151446.53730.gdw@wave.co.nz>

On Fri, 15 May 2009 2:14:44 am you wrote:
> This is a bad interaction between the file resolver (inside
> Pxp_types.from_file) and the catalog resolver. What happens is this:
> because HTMLlat1 also has a file name attached to the PUBLIC name, the
> file resolver tries to open the entity by file name. However, the
> information about which directory the file should be opened relative to
> is lost, because it is an "inner" PUBLIC entity.

Thank you. Your "Pxp_reader.combine" solution worked for me. (Editing the DTD
worked too, but I want to avoid that.)

Was the way I was trying to do things originally sensible? Or is what you have
shown me here how I should have done things from the start?

I think I understand the first part of what was happening (file resolution is
applicable to PUBLIC ids that provide filenames, and the "~alt" resolver is only
used if the default resolver is inapplicable), but I still don't understand
why the directory information gets lost.

--Glyn

From gerd at gerd-stolpmann.de Mon May 18 06:54:15 2009
From: gerd at gerd-stolpmann.de (Gerd Stolpmann)
Date: Mon, 18 May 2009 15:54:15 +0200
Subject: [Ocaml-pxp-users] How do I resolve PUBLIC external entities from inside DTD files?
In-Reply-To: <200905151438.52006.glyn@wave.co.nz>
References: <200905141928.14573.gdw@wave.co.nz>
    <1242310484.30228.13.camel@flake.lan.gerd-stolpmann.de>
    <200905151438.52006.glyn@wave.co.nz>
Message-ID: <1242654855.7546.21.camel@flake.lan.gerd-stolpmann.de>

Well, name resolution is a complex thing. It is easy to make errors,
both for the developer and the user. Also, there is no concise
definition from the W3C of how things ought to work. I'd say what you
tried originally is reasonable, and should work. Because of this, I've
changed the PXP code.

For resolving SYSTEM identifiers one has to track the base URI, i.e. the
directory in "URL speak". For example, if you have:

http://somewhere/file1.xml:

http://somewhere/dir/file2.xml:

Then file3 is located in dir (as with an HTML hyperlink). If there is a
PUBLIC name in between, however, there is no longer a base URI. This is
what I mean by "directory information gets lost".

In XML things are even harder, because an entity can have both a SYSTEM
and a PUBLIC name. However, when you look up an entity, only one of them
is picked. So it is easy to get errors like yours: on one level both
identifiers would work and the parser picks the unexpected one, and then
the next name resolution fails.

Generally, I'd recommend avoiding PUBLIC identifiers and using only
SYSTEM ones. You can easily map http:// URIs to real files with the
class rewrite_system_id. That way, each document has a base URI, and you
never run into this kind of problem.

Gerd
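
The failure discussed in this thread can be reproduced in a small standalone
model. The sketch below is not PXP code and does not call the PXP API; every
type and function in it (ext_id, outcome, file_resolver, catalog_resolver,
try_in_order) is a hypothetical stand-in. It only mimics the behaviour
described above, under these assumptions: the file resolver claims any
identifier that carries a local system name, the catalog resolver returns
entities without a base URI, and the first applicable resolver decides.

```ocaml
(* Toy model only: none of these names belong to PXP. *)

type ext_id =
  | System of string
  | Public of string * string   (* public id, system id ("" if absent) *)

type outcome =
  | Opened of string * string option  (* location, base URI for nested lookups *)
  | Not_applicable                    (* resolver does not handle this id *)
  | Failed of string                  (* resolver claimed the id but could not open it *)

let has_prefix p s =
  String.length s >= String.length p
  && String.sub s 0 (String.length p) = p

(* Everything up to and including the last '/', i.e. the "directory". *)
let dirname uri =
  match String.rindex_opt uri '/' with
  | Some i -> String.sub uri 0 (i + 1)
  | None -> ""

let system_part = function System s -> s | Public (_, s) -> s

(* Claims every id with a non-http system name. A relative name needs a base. *)
let file_resolver base id =
  let s = system_part id in
  if s = "" || has_prefix "http://" s then Not_applicable
  else if has_prefix "file://" s then Opened (s, Some s)
  else
    match base with
    | Some b -> let loc = dirname b ^ s in Opened (loc, Some loc)
    | None -> Failed s

(* Maps public ids to files. The opened entity has no base URI. *)
let catalog_resolver table _base id =
  match id with
  | Public (p, _) ->
      (match List.assoc_opt p table with
       | Some file -> Opened (file, None)
       | None -> Not_applicable)
  | System _ -> Not_applicable

(* The first resolver that is applicable decides, even if it fails. *)
let rec try_in_order resolvers base id =
  match resolvers with
  | [] -> Failed (system_part id)
  | r :: rest ->
      (match r base id with
       | Not_applicable -> try_in_order rest base id
       | result -> result)

let catalog =
  [ "-//W3C//DTD XHTML 1.0 Transitional", "xhtml1-transitional.dtd";
    "-//W3C//ENTITIES Latin 1 for XHTML//EN", "xhtml-lat1.ent" ]

let dtd_id =
  Public ("-//W3C//DTD XHTML 1.0 Transitional",
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")

let lat1_id =
  Public ("-//W3C//ENTITIES Latin 1 for XHTML//EN", "xhtml-lat1.ent")

(* Open the DTD from the document, then HTMLlat1 from inside the DTD. *)
let resolve_chain resolver =
  let doc_base = Some "file://localhost/home/glyn/Ocamldtd/Test/test.html" in
  match resolver doc_base dtd_id with
  | Opened (_, dtd_base) -> resolver dtd_base lat1_id
  | other -> other

let file_first = try_in_order [ file_resolver; catalog_resolver catalog ]
let catalog_first = try_in_order [ catalog_resolver catalog; file_resolver ]

let () =
  let show = function
    | Opened (loc, _) -> "opened " ^ loc
    | Not_applicable -> "not applicable"
    | Failed s -> "unable to open " ^ s
  in
  print_endline ("file first:    " ^ show (resolve_chain file_first));
  print_endline ("catalog first: " ^ show (resolve_chain catalog_first))
```

In this model the file-first order reports "unable to open xhtml-lat1.ent",
because the DTD was opened through the catalog and so left no base URI for
the relative name. The catalog-first order reports "opened xhtml-lat1.ent".
Passing an empty system name for HTMLlat1 also succeeds with the file-first
order, which corresponds to the DTD edit suggested earlier in the thread.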