Commit 7a10fa6c authored by gerd

Updated docs.


git-svn-id: https://godirepo.camlcity.org/svn/lib-pxp/trunk@731 dbe99aee-44db-0310-b2b3-d33182c8eb97
parent f8ef0ad3
......@@ -15,7 +15,8 @@ OBJ = pxp_lexing.cmo pxp_type_anchor.cmo \
pxp_dtd_parser.cmo \
pxp_yacc.cmo pxp_marshal.cmo pxp_codewriter.cmo
DOC = pxp_document.mli intro_trees.txt
DOC = pxp_document.mli pxp_dtd.mli \
intro_trees.txt intro_extensions.txt intro_namespaces.txt
XOBJ = $(OBJ:.cmo=.cmx)
......
This text explains the custom node extensions that can be attached
to XML trees. This feature can be ignored by users that do not need
it. The text is effectively a commentary on the class type
{!Pxp_document.extension}.
{1 Node extensions}
Every node in a tree has a so-called extension. By default, the
extension is practically empty and only present for formal uniformity.
However, one can also define custom extension classes, and effectively
add new methods to the node classes.
The type {!Pxp_document.extension} is:
{[
class type [ 'node ] extension =
object ('self)
method clone : 'self
method node : 'node
method set_node : 'node -> unit
end
]}
Every node has such an extension object, as the following picture shows.
Of course, the idea is to equip the extension object with additional
methods and not only [clone], [node], and [set_node] - which are simply
the bare minimum.
{picture ../pic/extension_general.gif
Node objects and extension objects}
The picture shows how the nodes and extensions are linked
together. Every node has a reference to its extension, and every extension has
a reference to its node. The methods [extension] and
[node] follow these references; a typical phrase is
{[ self # node # attribute "xy" ]}
to get the value of an attribute from a method defined in the extension object;
or
{[
self # node # iter
(fun n -> n # extension # my_method ...)
]}
to iterate over the subnodes and to call [my_method] of the
corresponding extension objects.
Note that extension objects do not have references to subnodes
(or "subextensions") themselves; in order to get one of the children of an
extension you must first go to the node object, then get the child node, and
finally reach the extension that is logically the child of the extension you
started with.
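The navigation described above can be sketched with a small stand-alone model. The classes [toy_node] and [toy_ext] and the helpers [make] and [child_extensions] are hypothetical stand-ins, not part of PXP; they only mimic the mutual references between a node and its extension.

```ocaml
(* Toy model of the node/extension links. Not PXP code. *)
class toy_ext (label : string) =
  object
    val mutable node = (None : toy_node option)
    method label = label
    (* Follow the reference from the extension back to its node. *)
    method node : toy_node =
      match node with
      | Some n -> n
      | None -> failwith "extension is not attached to a node"
    method set_node (n : toy_node) = node <- Some n
  end
and toy_node (ext : toy_ext) =
  object
    val mutable children = ([] : toy_node list)
    (* Follow the reference from the node to its extension. *)
    method extension = ext
    method children = children
    method append (c : toy_node) = children <- children @ [ c ]
  end

(* Create a node and link it with a fresh extension in both directions. *)
let make label =
  let ext = new toy_ext label in
  let n = new toy_node ext in
  ext#set_node n;
  n

(* Extensions have no children of their own: go to the node first,
   then to the child nodes, then to their extensions. *)
let child_extensions (e : toy_ext) : toy_ext list =
  List.map (fun c -> c#extension) e#node#children
```

The detour through [e#node#children] is the point of the sketch: the extension of a child is only reachable via the node objects.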
In other programming languages, it is possible to extend the node
objects directly. OCaml's subtyping rules make this practically
impossible. The type of the extension object appears as type parameter
in the class type of the nodes. Note that this means that the type
of the extension objects has to be the same for all nodes in a tree.
It is not possible, for example, to use a different type for elements than for
data nodes.
{2 How to define an extension class}
At minimum, you must define the methods [clone], [node], and
[set_node] such that your class is compatible with the type
{!Pxp_document.extension}. The method [set_node] is called during the
initialization of the node, or after a node has been cloned; the node
object invokes [set_node] on the extension object to tell it that this
node is now the object the extension is linked to. The extension must
return the node object passed as argument of [set_node] when the
[node] method is called.
The [clone] method must return a copy of the extension object; at
least the object itself must be duplicated, but if required, the copy
should deeply duplicate all objects and values that are referred to by
the extension, too. Whether this is required depends on the
application; [clone] is invoked by the node object when one of its
cloning methods is called.
A good starting point for an extension class:
{[
class custom_extension =
object (self)
val mutable node = (None : custom_extension node option)
method clone = {< >}
method node =
match node with
None ->
assert false
| Some n -> n
method set_node n =
node <- Some n
end
]}
This class is compatible with {!Pxp_document.extension}. The purpose
of defining such a class is, of course, adding further methods; and
you can do it without restriction.
Often, you want more than only a single extension class. In this case,
it is strictly required that all your classes (that will be used in
the same tree) have the same type of extensions (with respect to the
interface; i.e. it does not matter if your classes differ in the
defined private methods and instance variables, but public methods
count). It is simple to implement:
{[
class virtual custom_extension =
object (self)
val mutable node = (None : custom_extension node option)
method clone = ... (* see above *)
method node = ... (* see above *)
method set_node n = ... (* see above *)
method virtual my_method1 : ...
method virtual my_method2 : ...
... (* etc. *)
end
class custom_extension_kind_A =
object (self)
inherit custom_extension
method my_method1 = ...
method my_method2 = ...
end
class custom_extension_kind_B =
object (self)
inherit custom_extension
method my_method1 = ...
method my_method2 = ...
end
]}
If a class does not need a method (e.g. because it does not make
sense, or it would violate some important condition), it is possible
to define the method and to always raise an exception when the method
is invoked (e.g. [assert false]).
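A stand-alone sketch of this pattern follows. The method names [render] and [word_count] and the classes [base_ext], [paragraph_ext], and [rule_ext] are invented for illustration, and the methods [clone], [node], and [set_node] are left out to keep the example short, so these classes are not usable as PXP extensions as written.

```ocaml
(* Common interface: every kind of extension exposes the same public methods. *)
class virtual base_ext =
  object
    method virtual render : string
    method virtual word_count : int
  end

class paragraph_ext (text : string) =
  object
    inherit base_ext
    method render = "<p>" ^ text ^ "</p>"
    (* Count the non-empty, space-separated words. *)
    method word_count =
      List.length
        (List.filter (fun w -> w <> "") (String.split_on_char ' ' text))
  end

class rule_ext =
  object
    inherit base_ext
    method render = "<hr/>"
    (* The method must exist to keep the types equal, but it makes no
       sense here, so it raises an exception. *)
    method word_count = invalid_arg "rule_ext#word_count: not applicable"
  end

(* Both kinds have the same object type and can be stored together. *)
let all_exts : base_ext list = [ new paragraph_ext "hello world"; new rule_ext ]
```

Because both subclasses have exactly the same public methods, their object types coincide, which is the property the node classes rely on.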
{2 How to bind extension classes to element types}
Once you have defined your extension classes, you can bind them to
element types. The simplest case is that you have only one class and
that this class is always to be used. The parsing functions in the
module {!Pxp_tree_parser} take a [spec] argument for the document
model specification which can be customized (of type
{!Pxp_document.spec}). If your single class has the name [c], this
argument should be
{[
let spec =
Pxp_document.make_spec_from_alist
~data_exemplar: (new Pxp_document.data_impl c)
~default_element_exemplar: (new Pxp_document.element_impl c)
~element_alist: []
()
]}
This means that data nodes will be created from the exemplar passed by
[~data_exemplar] and that all element nodes will be made from the
exemplar specified by [~default_element_exemplar]. In
[~element_alist], you can specify that different exemplars are to be used
for different element types; but this is an optional feature. If you
do not need it, pass the empty list.
Remember that an exemplar is a (node, extension) pair that serves as
pattern when new nodes (and the corresponding extension objects) are
added to the document tree. In this case, the exemplar contains [c] as
extension, and when nodes are created, the exemplar is cloned, and
cloning also makes a copy of [c] such that all nodes of the document
tree will have a copy of [c] as extension.
The [~element_alist] argument can bind specific element types to
specific exemplars; as exemplars may be instances of different classes
it is effectively possible to bind element types to classes. For
example, if the element type "p" is implemented by class [c_p], and
"q" is realized by [c_q], you can pass the following value:
{[
let spec =
Pxp_document.make_spec_from_alist
~data_exemplar: (new Pxp_document.data_impl c)
~default_element_exemplar: (new Pxp_document.element_impl c)
~element_alist:
[ "p", new Pxp_document.element_impl c_p;
"q", new Pxp_document.element_impl c_q;
]
()
]}
The extension object [c] is still used for all data nodes and
for all other element types.
This text explains how PXP deals with the optional namespace
declarations in XML text.
{1 Namespaces}
PXP supports namespaces (but they have to be explicitly enabled).
In order to simplify the handling
of namespace-aware documents PXP applies a transformation to the document
which is called "prefix normalization". This transformation ensures that every
namespace prefix uniquely identifies a namespace throughout the whole document.
A namespace is identified by a namespace URI (e.g. something like
"http://company.org/namespaces/project1" - note that this URI is simply
processed as string, and never looked up by an HTTP access). For
brevity of formulation, one has to define a so-called namespace prefix
for such a URI. For example:
{[ <x:q xmlns:x="http://company.org/namespaces/project1">...</x:q> ]}
The "xmlns:x" attribute is special, and declares that for this
subtree the prefix "x" is to be used as replacement for the long
URI. Here, "x:q" denotes the element "q" in the namespace that the
prefix "x" stands for.
The problem is now that the URI defines the namespace, and not the
prefix. In another subtree you may want to use the prefix "y" for the
same namespace. This has always made it difficult to deal with namespaces
in XML-processing software.
PXP, however, performs prefix normalization before it returns the
tree. This means that all prefixes are changed to a norm prefix for
the namespace. This can be the first prefix used for the namespace,
or a prefix declared with a PXP extension, or a programmatically
declared binding of the norm prefix to the namespace.
In order to use the PXP implementation of namespaces, one has to
set [enable_namespace_processing] in the parser configuration, and
to use namespace-aware node implementations. If you don't use extended
node trees, this means to use {!Pxp_tree_parser.default_namespace_spec}
instead of {!Pxp_tree_parser.default_spec}. A good starting point
to enable all that:
{[
let config = Pxp_types.default_namespace_config
let source = ...
let spec = Pxp_tree_parser.default_namespace_spec
let doc = Pxp_tree_parser.parse_document_entity config source spec
let root = doc#root
]}
The namespace-aware implementations of the [node]
class type define additional namespace methods like
[namespace_uri]. (Although you also could direct the parser to create
non-namespace-aware nodes,
this does not make much sense, as you do not get these special access
methods then.)
The method [namespace_scope] allows one to get more information about what
happened during prefix normalization. In particular, it is possible to
find out the original prefix in the XML text (which is also called
{b display prefix}), before it was mapped to the normalized prefix.
The [namespace_scope] method returns a
{!Pxp_dtd.namespace_scope} object with additional lookup methods.
{2 Example for prefix normalization}
In the following XML snippet the prefix "h" is declared as a shorthand
for the XHTML namespace:
{[
<h:html xmlns:h="http://www.w3.org/1999/xhtml">
<h:head>
<h:title>Virtual Library</h:title>
</h:head>
<h:body>
<h:p>Moved to <h:a href="http://vlib.org/">vlib.org</h:a>.</h:p>
</h:body>
</h:html>
]}
In this example, normalization changes nothing, because the prefix
"h" has the same meaning throughout the whole document.
The XML standard, however, gives the author of the document the
freedom to change the meaning of the prefix at any time. For example,
here the prefix "x" is changed in the inner node:
{[
<x:address xmlns:x="http://addresses.org">
<x:name xmlns:x="http://names.org">
Gerd Stolpmann
</x:name>
</x:address>
]}
After normalization, the prefixes would look as follows:
{[
<x:address xmlns:x="http://addresses.org">
<x1:name xmlns:x1="http://names.org">
Gerd Stolpmann
</x1:name>
</x:address>
]}
In order to avoid overridden prefixes, the prefix in the inner node
was changed to "x1".
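The renaming in this example can be imitated by a few lines of stand-alone code. This is only a sketch of the idea, under the assumption that the first prefix seen for a URI becomes its norm prefix and that a clash is resolved by appending a number; the function [normalize] is hypothetical, and PXP's actual algorithm may differ in its details.

```ocaml
(* [decls] lists (display prefix, URI) pairs in document order.
   The result maps each URI to the norm prefix chosen for it. *)
let normalize (decls : (string * string) list) : (string * string) list =
  (* Is prefix [p] already used as a norm prefix? *)
  let taken table p = List.exists (fun (_, q) -> q = p) table in
  (* Find the first unused prefix of the form base1, base2, ... *)
  let rec fresh table base i =
    let cand = base ^ string_of_int i in
    if taken table cand then fresh table base (i + 1) else cand
  in
  List.fold_left
    (fun table (prefix, uri) ->
       if List.mem_assoc uri table then table (* URI already has a norm prefix *)
       else
         let norm =
           if taken table prefix then fresh table prefix 1 else prefix
         in
         table @ [ (uri, norm) ])
    [] decls
```

Applied to the two declarations of the address example, the sketch keeps "x" for the first URI and picks "x1" for the second.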
The idea of prefix normalization is to simplify how programs can match
against element and attribute names. It is possible to configure the
normalizer so that certain prefixes are used for certain URIs.
In this example, we could direct the normalizer to use the prefixes
"addr" and "nm" instead of the quite arbitrary strings "x" and "x1":
{[
dtd # namespace_manager # add_namespace "addr" "http://addresses.org";
dtd # namespace_manager # add_namespace "nm" "http://names.org";
]}
For this to work you need access to the [dtd] object before the parser
actually starts its work. The parsing functions in {!Pxp_tree_parser}
have the special hook [transform_dtd] that is called at the right
moment, and allows the program to enter such special configurations
into the DTD object. The resulting program could then look like:
{[
let config = Pxp_types.default_namespace_config
let source = ...
let spec = Pxp_tree_parser.default_namespace_spec
let transform_dtd dtd =
dtd # namespace_manager # add_namespace "addr" "http://addresses.org";
dtd # namespace_manager # add_namespace "nm" "http://names.org";
dtd
let doc =
Pxp_tree_parser.parse_document_entity ~transform_dtd config source spec
let root = doc#root
]}
Alternatively, it is also possible to put special processing instructions
into the DTD:
{[
<?pxp:dtd namespace prefix="addr" uri="http://addresses.org"?>
<?pxp:dtd namespace prefix="nm" uri="http://names.org"?>
]}
The advantage of configuring specific normprefixes is that one can now
use them directly in programs, e.g. for matching:
{[
match node#node_type with
| T_element "addr:address" -> ...
| T_element "nm:name" -> ...
]}
{2 Finding out more about namespaces}
There are two additional objects that are relevant. First, there is a
namespace manager for the whole tree. This object collects all namespace
URIs that occur in the XML text, and decides which normprefixes
are associated with them: {!Pxp_dtd.namespace_manager}.
Second, there is the namespace scope. An XML tree may have a lot of such
objects. A new scope object is created whenever new namespaces are
introduced, i.e. when there are "xmlns" declarations. The scope object
has a pointer to the scope object for the surrounding XML text. Scope
objects are documented here: {!Pxp_dtd.namespace_scope}.
Some examples (when [n] is a node):
{ul
{- To find out which normprefix is used for a namespace URI, use
{[ n # namespace_manager # get_normprefix uri ]} }
{- To find out the reverse, i.e. which URI is represented by a certain
normprefix, use
{[ n # namespace_manager # get_primary_uri prefix ]} }
{- To find out which namespace URI is meant by a display prefix, i.e.
the prefix as it occurs literally in the XML text:
{[ n # namespace_scope # uri_of_display_prefix prefix ]} }
}
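The chain of scope objects can be pictured with a small stand-alone model. The record type [scope] and the function [lookup_display_prefix] are hypothetical names used only for this sketch; they are not the {!Pxp_dtd.namespace_scope} interface.

```ocaml
(* A scope holds the declarations made at one element and a pointer
   to the scope of the surrounding XML text. *)
type scope = {
  parent : scope option;
  declared : (string * string) list; (* display prefix, namespace URI *)
}

(* Look up a display prefix, innermost declaration first. *)
let rec lookup_display_prefix (s : scope) (prefix : string) : string option =
  match List.assoc_opt prefix s.declared with
  | Some uri -> Some uri
  | None ->
    (match s.parent with
     | Some p -> lookup_display_prefix p prefix
     | None -> None)
```

With an outer scope declaring "x" for the address namespace and an inner scope redeclaring "x" for the name namespace, a lookup from the inner scope finds the inner declaration, and a lookup from the outer scope finds the outer one.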
......@@ -438,71 +438,173 @@ For getting the built-in classes without any modification, just use
{!Pxp_tree_parser.default_spec}. For the variant with enabled namespaces,
prefer {!Pxp_tree_parser.default_namespace_spec}.
XXX: Look at extended nodes for examples of non-standard specs
{2 Extended nodes}
XXX
Every node in a tree has a so-called extension. By default, the
extension is practically empty and only present for formal uniformity.
However, one can also define custom extension classes, and effectively
add new methods to the node classes.
[There is text and pictures in the SGML version.]
Node extensions are explained in detail here: {!Intro_extensions}
{2 Namespaces}
PXP supports namespaces (but they have to be explicitly enabled).
In order to simplify the handling
of namespace-aware documents PXP applies a transformation to the document
which is called "prefix normalization". This transformation ensures that every
namespace prefix uniquely identifies a namespace throughout the whole document.
A namespace is identified by a namespace URI (e.g. something like
"http://company.org/namespaces/project1" - note that this URI is simply
processed as string, and never looked up by an HTTP access). For
brevity of formulation, one has to define a so-called namespace prefix
for such a URI. For example:
{[ <x:q xmlns:x="http://company.org/namespaces/project1">...</q> ]}
The "xmlns:x" attribute is special, and declares that for this
subtree the prefix "x" is to be used as replacement for the long
URI. Here, "x:q" denotes that the element "q" in this namespace "x"
is meant.
The problem is now that the URI defines the namespace, and not the
prefix. In another subtree you may want to use the prefix "y" for the
same namespace. This has always made it difficult to deal with namespaces
in XML-processing software.
PXP, however, performs prefix normalization before it returns the
tree. This means that all prefixes are changed to a norm prefix for
the namespace. This can be the first prefix used for the namespace,
or a prefix declared with a PXP extension, or a programmatically
declared binding of the norm prefix to the namespace.
In order to use the PXP implementation of namespaces, one has to
set [enable_namespace_processing] in the parser configuration, and
to use namespace-aware node implementations. If you don't use extended
node trees, this means to use {!Pxp_tree_parser.default_namespace_spec}
instead of {!Pxp_tree_parser.default_spec}. A good starting point
to enable all that:
As an option, PXP processes namespace declarations in XML text.
See this separate introduction for details: {!Intro_namespaces}.
{2 Details of the mapping from XML text to the tree representation}
If an element declaration does not allow the element to
contain character data, the following rules apply.
If the element must be empty, i.e. it is declared with the
keyword [EMPTY], the element instance must be effectively
empty (it must not even contain whitespace characters). The parser guarantees
that a declared [EMPTY] element never contains a data
node, even if the data node represents the empty string.
If the element declaration only permits other elements to occur
within that element but not character data, it is still possible to insert
whitespace characters between the subelements. The parser ignores these
characters, too, and does not create data nodes for them.
{b Example.} Consider the following element types:
{[
let config = Pxp_types.default_namespace_config
let source = ...
let spec = Pxp_tree_parser.default_namespace_spec
let doc = Pxp_tree_parser.parse_document_entity config source spec
let root = doc#root
<!ELEMENT x ( #PCDATA | z )* >
<!ELEMENT y ( z )* >
<!ELEMENT z EMPTY>
]}
The namespace-aware implementations of the [node]
class type define additional namespace methods like
[namespace_uri]. (Although you also could direct the parser to create
non-namespace-aware nodes,
this does not make much sense, as you do not get these special access
methods then.)
The method [namespace_scope] allows one to get more information what
happened during prefix normalization. In particular, it is possible to
find out the original prefix in the XML text (which is also called
{b display prefix}), before it was mapped to the normalized prefix.
The [namespace_scope] method returns a
{!Pxp_dtd.namespace_scope} object with additional lookup methods.
Only [x] may contain character data, the keyword
[#PCDATA] indicates this. The other types are character-free.
The XML term
{[
<x><z/> <z/></x>
]}
will be internally represented by an element node for [x]
with three subnodes: the first [z] element, a data node
containing the space character, and the second [z] element.
In contrast to this, the term
{[
<y><z/> <z/></y>
]}
is represented by an element node for [y] with only
{b two} subnodes, the two [z] elements. There
is no data node for the space character because spaces are ignored in the
character-free element [y].
{b Parser option:}
By setting the parser option [drop_ignorable_whitespace] to
[false], the behaviour of the parser is changed such that
even ignorable whitespace characters are represented by data nodes.
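The effect of these rules on the list of subnodes can be sketched in isolation. The type [item] and the function [visible_children] are made up for this illustration, and the flag [allows_pcdata] stands in for what the parser derives from the element declaration; the sketch does not validate anything.

```ocaml
type item = Element of string | Data of string

(* True if the string consists only of XML whitespace characters. *)
let is_whitespace (s : string) : bool =
  let ok = ref true in
  String.iter
    (fun c -> if not (c = ' ' || c = '\t' || c = '\n' || c = '\r') then ok := false)
    s;
  !ok

(* Whitespace-only data is dropped only in character-free elements,
   and only while the option is switched on. *)
let visible_children ~allows_pcdata ~drop_ignorable_whitespace items =
  if allows_pcdata || not drop_ignorable_whitespace then items
  else
    List.filter
      (function Data s -> not (is_whitespace s) | Element _ -> true)
      items
```

For the content of the two example terms, the sketch keeps three items for [x] and two for [y], and three for [y] again once the option is set to [false].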
{3 The representation of character data}
The XML specification allows all Unicode characters in XML
texts. This parser can be configured such that UTF-8 is used to represent the
characters internally; however, the default character encoding is
ISO-8859-1. (Currently, no other encodings are possible for the internal string
representation; the type {!Pxp_types.rep_encoding} enumerates
the possible encodings. Principally, the parser could use any encoding that is
ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and
ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal
encodings (or other multibyte encodings which are not ASCII-compatible) unless
major parts of the parser are rewritten - unlikely...)
The internal encoding may be different from the external encoding (specified
in the XML declaration [<?xml ... encoding="..."?>]); in
this case the strings are automatically converted to the internal encoding.
If the internal encoding is ISO-8859-1, it is possible that there are
characters that cannot be represented. In this case, the parser ignores such
characters and prints a warning (to the [collect_warning]
object that must be passed when the parser is called).
The XML specification allows lines to be separated by single LF
characters, by CR LF character sequences, or by single CR
characters. Internally, these separators are always converted to single LF
characters.
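A minimal stand-alone sketch of this conversion, written only with the standard library; [normalize_newlines] is a hypothetical helper and not the routine the parser uses.

```ocaml
(* Convert CR LF pairs and single CR characters to single LF characters. *)
let normalize_newlines (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
       Buffer.add_char b '\n';
       (* Skip the LF of a CR LF pair so that the pair yields one LF. *)
       if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | c -> Buffer.add_char b c);
    incr i
  done;
  Buffer.contents b
```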
The parser guarantees that there are never two adjacent data
nodes; if necessary, data material that would otherwise be represented by
several nodes is collapsed into one node. Note that you can still create node
trees with adjacent data nodes; however, the parser does not return such trees.
Note that CDATA sections are not represented specially; such
sections are added to the current data material that is being collected for the
next data node.
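The collapsing of adjacent data material can be illustrated on a plain list. The type [item] and the function [merge_data] are assumptions made for this sketch; the parser itself collects the material while reading and does not post-process a list.

```ocaml
type item = Element of string | Data of string

(* Join runs of adjacent data items into a single data item. *)
let rec merge_data (items : item list) : item list =
  match items with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | x :: rest -> x :: merge_data rest
  | [] -> []
```

Text coming from a CDATA section would simply be one more [Data] item in the input of this sketch and would be joined with its neighbours.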
{3 The representation of entities within documents}
{b Entities are not represented within
documents!} If the parser finds an entity reference in the document
content, the reference is immediately expanded, and the parser reads the
expansion text instead of the reference.
{3 The representation of attributes}
As attribute
values are composed of Unicode characters, too, the same problems with the
character encoding arise as for character material. Attribute values are
converted to the internal encoding, too; and if there are characters that
cannot be represented, these are dropped, and a warning is printed.
Attribute values are normalized before they are returned by
methods like [attribute]. First, any remaining entity
references are expanded; if necessary, expansion is performed recursively.
Second, newline characters (any of LF, CR LF, or CR characters) are converted
to single space characters. Note that especially the latter action is
prescribed by the XML standard (but [&#10;] is not converted
such that it is still possible to include line feeds into attributes).
{3 The representation of processing instructions}
Processing instructions are parsed to some extent: The first word of the
PI is called the target, and it is stored separately from the rest of the PI:
{[
<?target rest?>
]}
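The split into target and rest can be sketched as follows. [split_pi] is a hypothetical helper operating on the text between [<?] and [?>]; it is not how PXP parses processing instructions, and it assumes that the whitespace between target and rest is not part of the rest.

```ocaml
(* Split the content of a PI into its first word (the target) and the rest. *)
let split_pi (content : string) : string * string =
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let n = String.length content in
  (* Index of the first whitespace character, or [n] if there is none. *)
  let rec end_of_word i =
    if i < n && not (is_space content.[i]) then end_of_word (i + 1) else i
  in
  (* Index of the first non-whitespace character at or after [i]. *)
  let rec skip_space i =
    if i < n && is_space content.[i] then skip_space (i + 1) else i
  in
  let t = end_of_word 0 in
  let r = skip_space t in
  (String.sub content 0 t, String.sub content r (n - r))
```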
The exact location where a PI occurs is not represented (by default). The
parser puts the PI into the object that represents the embracing construct (an
element, a DTD, or the whole document); that means you can find out which PIs
occur in a certain element, in the DTD, or in the whole document, but you
cannot look up the exact position within the construct.
{b Parser option:}
If you require the exact location of PIs, it is possible to
create extra nodes for them. This mode is controlled by the option
[enable_pinstr_nodes]. The additional nodes have the node type
[T_pinstr target], and are created
from special exemplars contained in the [spec] (see
{!Pxp_document.spec}).
{3 The representation of comments}
Normally, comments are not represented; they are dropped by
default.
{b Parser option:}
However, if you require comments in the document tree, it is possible to create
[T_comment] nodes for them. This mode can be specified by the
option [enable_comment_nodes]. Comment nodes are created from
special exemplars contained in the [spec] (see
{!Pxp_document.spec}). You can access the contents of comments through the
method [comment].
{3 The attributes [xml:lang] and [xml:space] }
These attributes are not supported specially; they are handled
like any other attribute.
Note that the utility function
{!Pxp_document.strip_whitespace} respects [xml:space]
......@@ -75,7 +75,7 @@ type data_node_classification =
(** The [extension] is, as the name says, the extensible part of the
nodes. See XXX LINK for an introduction into extensions.
nodes. See {!Intro_extensions} for an introduction to extensions.
*)
class type [ 'node ] extension =
object ('self)
......
This diff is collapsed.