Mailing List Archive: Zope: XML

Status report

 

 


faassen at vet

Oct 30, 2001, 5:06 PM

Post #1 of 7 (1037 views)
Status report

Hi there,

I've just finished merging two things into the ParsedXML mainline,
so it's time for a status update:

- the DOM proxy speedups

This aims to speed up access to the DOM from within the Zope security framework.
I haven't done a lot of measuring, but the measuring that I did showed
humanly noticeable increases in speed, so that's good. I'd like to see
some more tests with rendering or accessing larger documents, however.

- test cleanups

I've cleaned up the unit tests to be more compliant with the Zope
unit test guidelines. Basically I've ripped out some magic and made
each test set runnable by itself (before you'd need to use the
magic domtester.py framework). The testing guidelines are here:

http://dev.zope.org/CVS/ZopeTestingGuidelines

I've also just now discovered a bug that at least appears on my machine
when I try to run setup.py of the included Expat with python 1.5;
because <assert.h> was not included by pyexpat.c anymore I got an
ImportError when trying to import pyexpat on my machine. We should have
the bugfix (#include <assert.h> somewhere on top :) in CVS soon as well,
I hope. It probably hadn't been compiling straight (at least on my
setup) for about 3 months, though it's been working fine with Python
1.5.

Another tidbit: I've done some intensive reading of the DOM recommendation
again to determine whether some namespace related test failures are in fact
testing the wrong thing. I suspect that they are, and once I get some
independent feedback on this I'll alter the tests.

Concerning the tests, I've also started the process to donate the DOM tests
to the pyxml group, so that they can be used to check other DOMs as
well (such as 4DOM).

What I'm currently working on (not yet merged with the trunk in CVS)
is unique element ids. This will enable a much more stable way to
access nodes through URLs than before; a simple insert or append into the
DOM tree now won't change the entire node URL so easily anymore. This
is still work in progress but I'd like to know what people think. Currently
URLs to nodes look like this:

http://foobar.com/path/to/doc/0/5/3

where the 0/5/3 part means: go to the 0th node of the document, take its
5th child node, then take that node's 3rd child node.

This is very unstable to changes in the DOM structure. If I insert a node,
I might make node 5 node 6, breaking the URL right away.
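
To make the breakage concrete, here is a minimal sketch of index-based
resolution. It assumes Python's standard xml.dom.minidom rather than
ParsedXML itself, and resolve_index_path is a hypothetical helper name,
not part of any existing API:

```python
from xml.dom.minidom import parseString


def resolve_index_path(document, path):
    """Follow a path such as "0/1" by child index, starting at the document."""
    node = document
    for step in path.split("/"):
        node = node.childNodes[int(step)]
    return node


doc = parseString("<doc><a/><b><c/></b></doc>")
before = resolve_index_path(doc, "0/1")

# Insert a new element in front of <a>; every later sibling shifts by one.
root = doc.documentElement
root.insertBefore(doc.createElement("new"), root.firstChild)
after = resolve_index_path(doc, "0/1")

print(before.tagName, after.tagName)
```

The same path "0/1" points at <b> before the insert and at <a> afterwards,
which is exactly the instability described above.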

What I'm playing with is a way to add unique ids to element nodes, so
that at least inserting an element won't break everything anymore.
Perhaps it is a good idea to introduce unique ids to other nodes
as well, instead of only to elements.. for some reason I didn't
do that but I forget now why not.

Having such a unique id per node does cost a bit of extra memory per
node to store the ids.

So, your feedback and opinions, please.

Regards,

Martijn


rik.hoekstra at inghist

Oct 31, 2001, 4:17 AM

Post #2 of 7 (1033 views)
Re: Status report [In reply to]

<snip>

> What I'm currently working on (not yet merged with the trunk in CVS)
> is unique element ids. This will enable a much more stable way to
> access nodes through URLs than before; a simple insert or append into the
> DOM tree now won't change the entire node URL so easily anymore. This
> is still work in progress but I'd like to know what people think. Currently
> URLs to nodes look like this:
>
> http://foobar.com/path/to/doc/0/5/3
>
> where the 0/5/3 part means: go to the 0th node of the document, take its
> 5th child node, take its 3rd childnode.
>
> This is very unstable to changes in the DOM structure. If I insert a node,
> I might make node 5 node 6, breaking the URL right away.


I see your point, but this is not undesirable behaviour by definition. It is if
you want your URL to be persistent, but not if you actually want to resolve
http://foobar.com/path/to/doc/0/5/3 to "the 0th node of the document, take its
5th child node, take its 3rd childnode".

>
> What I'm playing with is a way to add unique ids to element nodes, so
> that at least inserting an element won't break everything anymore.


taking that http://foobar.com/path/to/doc/0/5/3 would still resolve in the way
expressed above, so that it will still be possible to retrieve an element of
which we just know that it is the first child node of a given DOM structure


I take it that the id would be something random, and not meaningful? A url like
http://foobar.com/path/to/doc/0/5/3, or worse something like
http://foobar.com/path/to/doc/9198274/2394837/9192877 does not sound very
attractive to me.
And even then: at what point is a node with a given identity 919287576 still the
same, and at what point does a change in the underlying XML document (and the
DOM) mean that the node is no longer the same? After a change in its content?
After a change in its element name (even if its content and its place in the
DOM tree stay the same)? Etc.


> Perhaps it is a good idea to introduce unique ids to other nodes
> as well, instead of only to elements.. for some reason I didn't
> do that but I forget now why not.
>

> Having such a unique id per node does cost a bit of extra memory per
> node to store the ids.
>
> So, your feedback and opinions, please.
>


my 0.02 EUR (euro that is, in case it gets scrambled along the way)

Rik


faassen at vet

Oct 31, 2001, 8:51 AM

Post #3 of 7 (1020 views)
Re: Status report [In reply to]

Rik Hoekstra wrote:
> <snip>
>
> >What I'm currently working on (not yet merged with the trunk in CVS)
> >is unique element ids. This will enable a much more stable way to
> >access nodes through URLs than before; a simple insert or append into the
> >DOM tree now won't change the entire node URL so easily anymore. This
> >is still work in progress but I'd like to know what people think. Currently
> >URLs to nodes look like this:
> >
> >http://foobar.com/path/to/doc/0/5/3
> >
> >where the 0/5/3 part means: go to the 0th node of the document, take its
> >5th child node, take its 3rd childnode.
> >
> >This is very unstable to changes in the DOM structure. If I insert a node,
> >I might make node 5 node 6, breaking the URL right away.
>
> I see your point, but this is not undesirable behaviour per definition. It
> is if your want your url to be persistent, but not if you actually want to
> resolve http://foobar.com/path/to/doc/0/5/3 to "the 0th node of the
> document, take its 5th child node, take its 3rd childnode"

That's true; as long as the document doesn't change this is actually
*more* stable in some circumstances, for instance when there's a reparse.
I'd like a way to get stable references into a document somehow for all
kinds of purposes, such as annotation and 'hey, we saw this node before'.

> >What I'm playing with is a way to add unique ids to element nodes, so
> >that at least inserting an element won't break everything anymore.
>
> taking that http://foobar.com/path/to/doc/0/5/3 would still resolve in the
> way expressed above, so that it will still be possible to retrieve an
> element of which we just know that it is the first child node of a given DOM
> structure

You're correct. Perhaps I need another way to arrive at semi-stable
references to nodes, possibly based on XPointer (unfortunately those
are just XPath expressions, and they may be as unstable as anything).
The requirement then would be to be stable through minor edits both
through reparse and DOM manipulation. Perhaps this isn't easily
possible, though. :)

> I take it that the id would be something random, and not meaningful? A url
> like http://foobar.com/path/to/doc/0/5/3, or worse something like
> http://foobar.com/path/to/doc/9198274/2394837/9192877 does not sound very
> attractive to me.

In the test implementation I'm simply using a document global counter that
gives each element a unique id. So the first element encountered during
document construction will be e0, the second e1, then e2, etc.
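
A rough sketch of that counter scheme, assuming Python's standard
xml.dom.minidom instead of ParsedXML. Here the ids are handed out in a
separate pass after parsing (the test implementation assigns them during
document construction), and assign_element_ids is a hypothetical name:

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def assign_element_ids(document):
    """Map "e0", "e1", ... to elements, numbered in document order."""
    ids = {}
    counter = 0
    stack = [document.documentElement]
    while stack:
        node = stack.pop()
        ids["e%d" % counter] = node
        counter += 1
        children = [c for c in node.childNodes
                    if c.nodeType == Node.ELEMENT_NODE]
        # Push children reversed so they are popped in document order.
        stack.extend(reversed(children))
    return ids


doc = parseString("<doc><a><b/></a><c/></doc>")
ids = assign_element_ids(doc)
order = [ids[key].tagName for key in ("e0", "e1", "e2", "e3")]

# Inserting an element shifts child indexes, but the id-to-node mapping
# still points at the same elements.
root = doc.documentElement
root.insertBefore(doc.createElement("new"), root.firstChild)
print(order, ids["e3"].tagName)
```

A newly inserted element would simply take the next counter value; this
sketch leaves that part out.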

> And even then: at what point is a node with a given identity 919287576 still
> the same, and at what point will it be decided that a change in the
> underlying XML document (and the DOM) that a node will no longer remain the
> same? After a change in its content? After a change in its element name
> (even if its content is still the same, just as its place in the DOM tree)?
> etc

Yup, this is a tricky problem. Good points. :)

> >Perhaps it is a good idea to introduce unique ids to other nodes
> >as well, instead of only to elements.. for some reason I didn't
> >do that but I forget now why not.
>
> >Having such a unique id per node does cost a bit of extra memory per
> >node to store the ids.
> >
> >So, your feedback and opinions, please.
>
> my 0.02 EUR (euro that is, in case it get scrambled along the way)

I just see EUR here. :)

I suppose the only way to get a semi-stable link into a document is to
use some heuristics to construct an XPath expression. Of course the
*certain* way is to use actual id attributes embedded in the document, but
I'd really like to leave the document alone if at all possible.

So, what kind of heuristics do we need? Let's restrict the problem to
references to element nodes for now; the problem is hard enough and
that would tackle most of the requirements people have in my opinion.

Element name seems a good idea to start with. Then the name of the parent
element is probably a good idea, perhaps a few levels up. This remains
stable under a fair number of document edits and changes.

The attributes used and their values are also helpful, I think, and again
remain relatively stable.

Next we can move on to text node contents. If this is non-whitespace
content we could extract a bit of the text, say the first n characters,
to get even more of a match. This isn't always possible; perhaps the
first element child node can also give us more of a contextual match,
though I think this is less stable under changes. Sibling nodes are
something else to consider.

Even if we have a heuristic which works well, we have some other
difficulties:

* our XPointer/XPath expression probably becomes horribly
long and complicated

* it is relatively slow to construct and resolve these things
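
As an illustration of how quickly such expressions grow, here is a
sketch that builds an XPath-style string from the element name, a few
ancestor names, the attributes, and a text prefix. It assumes Python's
standard xml.dom.minidom; heuristic_xpath and its parameters are
hypothetical, and the sketch only constructs the expression (minidom
cannot evaluate XPath, and attribute values containing double quotes
are not escaped):

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def string_value(node):
    """Concatenate all descendant text, like the XPath string-value."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(string_value(child) for child in node.childNodes)


def heuristic_xpath(element, levels=2, text_chars=10):
    # Element name plus up to `levels` ancestor element names.
    names = [element.tagName]
    parent = element.parentNode
    while (parent is not None and parent.nodeType == Node.ELEMENT_NODE
           and len(names) <= levels):
        names.insert(0, parent.tagName)
        parent = parent.parentNode
    # One predicate per attribute, sorted for a reproducible result.
    predicates = ['@%s="%s"' % (name, value)
                  for name, value in sorted(element.attributes.items())]
    # First few characters of the whitespace-normalized text, if any.
    text = " ".join(string_value(element).split())
    if text:
        predicates.append('starts-with(normalize-space(.), "%s")'
                          % text[:text_chars])
    return "//" + "/".join(names) + "".join("[%s]" % p for p in predicates)


doc = parseString('<doc><section id="intro"><para class="lead">'
                  'Hello world, this is text</para></section></doc>')
expression = heuristic_xpath(doc.getElementsByTagName("para")[0])
print(expression)
```

Even for this tiny document the result is already a fairly long
expression, and it still is not guaranteed to match a unique node.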

A completely different approach that I experimented a bit with is
some kind of bookmark ability. This approach would allow one to use
an API to 'bookmark' a reference to a node. You get some kind of number
back, but you just treat it as a token. Put the number back in, and
you'll get a reference to a node again. Internally you'd have a
dictionary mapping bookmark tokens to nodes.

There are a couple of problems with this, though. Once a node is bookmarked
it won't be garbage collected even if it is no longer in the tree, as it's
always referenced from the dictionary. The other problem is that this
isn't stable over reparses. One could of course store the bookmarks as
URLs to nodes (either using the simple 0/1/2 approach or using the
complicated heuristic approach) just before any reparse, and try to
reestablish the bookmark dictionary afterwards. Even the same tokens
would be preserved.
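
A minimal sketch of the bookmark idea, including saving the bookmarks as
simple child-index paths before a reparse and restoring them afterwards.
It assumes Python's standard xml.dom.minidom; the Bookmarks class and
its method names are hypothetical, not an existing ParsedXML API:

```python
import itertools
from xml.dom.minidom import parseString


def index_path(node):
    """Return the child indexes leading from the document to the node."""
    steps = []
    while node.parentNode is not None:
        steps.insert(0, node.parentNode.childNodes.index(node))
        node = node.parentNode
    return steps


class Bookmarks:
    def __init__(self):
        self._tokens = itertools.count()
        self._nodes = {}

    def bookmark(self, node):
        """Remember a node and hand back an opaque token for it."""
        token = next(self._tokens)
        self._nodes[token] = node
        return token

    def resolve(self, token):
        return self._nodes[token]

    def save_paths(self):
        """Turn every bookmark into an index path, e.g. before a reparse."""
        return {token: index_path(node)
                for token, node in self._nodes.items()}

    def restore(self, document, paths):
        """Rebuild the dictionary against a new document, keeping tokens."""
        self._nodes = {}
        for token, steps in paths.items():
            node = document
            for step in steps:
                node = node.childNodes[step]
            self._nodes[token] = node


source = "<doc><a/><b><c/></b></doc>"
doc = parseString(source)
marks = Bookmarks()
token = marks.bookmark(doc.getElementsByTagName("c")[0])

paths = marks.save_paths()
new_doc = parseString(source)      # stands in for a reparse
marks.restore(new_doc, paths)
print(token, paths[token], marks.resolve(token).tagName)
```

The token survives the reparse only because the document did not change
in between; an edited source would need the heuristic approach instead.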

The garbage collection problem seems like the hardest to crack, though
perhaps we can arrive at a fairly simple solution. I tried using a
weak dictionary but that didn't seem to want to work with
Zope's ExtensionClass mechanism. I could use some form of manual
refcount approach, but how? Another way could be to regularly purge
the dictionary of any nodes that aren't connected to the tree.
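
The purge option could look roughly like this. It is a sketch assuming
Python's standard xml.dom.minidom and a plain dictionary of tokens to
nodes; is_attached and purge_detached are hypothetical names:

```python
from xml.dom.minidom import parseString


def is_attached(node, document):
    """True if walking up the parent chain reaches the document."""
    while node is not None:
        if node is document:
            return True
        node = node.parentNode
    return False


def purge_detached(bookmarks, document):
    """Drop every bookmark whose node is no longer part of the tree."""
    stale = [token for token, node in bookmarks.items()
             if not is_attached(node, document)]
    for token in stale:
        del bookmarks[token]


doc = parseString("<doc><a/><b><c/></b></doc>")
root = doc.documentElement
a, b = root.childNodes
bookmarks = {0: a, 1: b, 2: b.firstChild}

root.removeChild(b)                # detaches <b> and, with it, <c>
purge_detached(bookmarks, doc)
print(sorted(bookmarks))
```

Each purge walks the parent chain of every bookmarked node, so the cost
grows with the number of bookmarks and the depth of the tree.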

Hmm... I need to think more about this, and more feedback and ideas
here would certainly help!

Regards,

Martijn


djay at avaya

Oct 31, 2001, 3:39 PM

Post #4 of 7 (1031 views)
RE: Status report [In reply to]

Why not use ID attributes if they are defined in the DTD? That way, if
someone wants a URL that stays stable every time the document is parsed,
then it's their responsibility to put in the ids.
Just use your other schemes for things that don't have ID attributes.
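
A sketch of how the two could be combined: numeric path steps are child
indexes, anything else is looked up as an id. It assumes Python's
standard xml.dom.minidom and, as a simplification, treats any attribute
literally named "id" as an ID instead of consulting the DTD; all
function names are hypothetical:

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def find_by_id(node, value, attr_name="id"):
    """Depth-first search for an element whose id attribute equals value."""
    if (node.nodeType == Node.ELEMENT_NODE
            and node.getAttribute(attr_name) == value):
        return node
    for child in node.childNodes:
        found = find_by_id(child, value, attr_name)
        if found is not None:
            return found
    return None


def resolve(document, path):
    """Resolve a path whose steps are either child indexes or ids."""
    node = document
    for step in path.split("/"):
        if step.isdigit():
            node = node.childNodes[int(step)]
        else:
            node = find_by_id(document, step)
            if node is None:
                raise KeyError(step)
    return node


doc = parseString('<doc><sec id="intro"><p/></sec></doc>')
by_id = resolve(doc, "intro/0")
by_index = resolve(doc, "0/0/0")
print(by_id.tagName, by_id is by_index)
```

The "intro/0" form keeps working if another section is inserted before
the first one, while "0/0/0" does not.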



faassen at vet

Oct 31, 2001, 3:47 PM

Post #5 of 7 (1025 views)
Re: Status report [In reply to]

Jay, Dylan wrote:
> Why not use ID attributes if they are defined in the DTD? That way if
> someone wants a stable url for every time they parse it then its their
> responsibility to put in the ids.

I agree that when the ID attribute *can* be used it should be used; that's
what it is there for, of course.

> Just use your other schemes for things that don't have ID attributes.

Perhaps we should design some kind of scheme where these different types of
URLs into ParsedXML's DOM can all happily live together: IDs, child node
#s, heuristics, bookmarks, etc. Anyone have any ideas?

Regards,

Martijn


dethe at burningtiger

Nov 1, 2001, 11:12 AM

Post #6 of 7 (1094 views)
Re: Status report [In reply to]

I think adding a UUID to nodes is a small price to pay for persistence.
There is an mxUID type available from eGenix:

http://www.lemburg.com/files/python/

There's also a Uuid.py in the 4SuiteServer package:

http://www.4suite.org

UUIDs (Universally Unique IDentifiers) or GUIDs (Globally Unique
IDentifiers) are generally used as pointers to persistent data. They
are used in various persistence systems, DCE, and DCOM. There's a
fuller discussion of the issues and how to generate the beasts at:

http://www1.ics.uci.edu/pub/ietf/webdav/uuid-guid/draft-leach-uuids-guids-01.txt
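
As a rough sketch of tagging nodes with UUIDs, the following uses the
uuid module from Python's standard library as a stand-in for mxUID or
4Suite's Uuid.py, together with xml.dom.minidom. The tag_with_uuids
name is hypothetical, and the ids are kept in a side table so the
document itself stays untouched:

```python
import uuid
from xml.dom.minidom import parseString


def tag_with_uuids(document):
    """Return a dictionary mapping a fresh random UUID to each element."""
    return {str(uuid.uuid4()): element
            for element in document.getElementsByTagName("*")}


doc = parseString("<doc><a><b/></a><c/></doc>")
node_ids = tag_with_uuids(doc)
for key, element in node_ids.items():
    print(key, element.tagName)
```

Like the bookmark dictionary, a side table of this kind has to be
stored alongside the document to survive a reparse.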

--Dethe



--

Dethe Elza (delza [at] burningtiger)
Chief Mad Scientist
Burning Tiger Technologies (http://burningtiger.com)
Living Code Weblog (http://livingcode.ca)


kra at monkey

Nov 1, 2001, 6:21 PM

Post #7 of 7 (1049 views)
Re: Status report [In reply to]

"'Martijn Faassen'" <faassen [at] vet> writes:

> Perhaps we should design a form of scheme where these different types of
> URLs into ParsedXML's DOM can all happily live together. IDs, child node
> #s, heuristics, bookmarks, etc. Anyone have any ideas?

http://www.zope.org/Wikis/DevSite/Projects/ParsedXML/URIPublishing

--
Karl Anderson kra [at] monkey http://www.monkey.org/~kra/
