Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

Minimally intrusive XML editing using Python

 

 

Python python RSS feed   Index | Next | Previous | View Threaded


thomas at thomas-lotze

Nov 18, 2009, 4:55 AM

Post #1 of 10 (380 views)
Permalink
Minimally intrusive XML editing using Python

I wonder what Python XML library is best for writing a program that makes
small modifications to an XML file in a minimally intrusive way. By that I
mean that information the program doesn't recognize is kept, as are
comments and whitespace, the order of attributes and even whitespace
around attributes. In short, I want to be able to change an XML file while
producing minimal textual diffs.

Most libraries don't allow controlling the order of and the whitespace
around attributes, so what's generally left to do is store snippets of
original text along with the model objects and re-use that for writing the
edited XML if the model wasn't modified by the program. Does a library
exist that helps with this? Does any XML library at all allow structured
access to the text representation of a tag with its attributes?

Thank you very much.

--
Thomas


--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Nov 18, 2009, 5:15 AM

Post #2 of 10 (362 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

Thomas Lotze, 18.11.2009 13:55:
> I wonder what Python XML library is best for writing a program that makes
> small modifications to an XML file in a minimally intrusive way. By that I
> mean that information the program doesn't recognize is kept, as are
> comments and whitespace, the order of attributes and even whitespace
> around attributes. In short, I want to be able to change an XML file while
> producing minimal textual diffs.

Take a look at canonical XML (C14N). In short, that's the only way to get a
predictable XML serialisation that can be used for textual diffs. It's
supported by lxml.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


clp2 at rebertia

Nov 18, 2009, 5:23 AM

Post #3 of 10 (362 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

On Wed, Nov 18, 2009 at 4:55 AM, Thomas Lotze <thomas [at] thomas-lotze> wrote:
> I wonder what Python XML library is best for writing a program that makes
> small modifications to an XML file in a minimally intrusive way. By that I
> mean that information the program doesn't recognize is kept, as are
> comments and whitespace, the order of attributes and even whitespace
> around attributes. In short, I want to be able to change an XML file while
> producing minimal textual diffs.
>
> Most libraries don't allow controlling the order of and the whitespace
> around attributes, so what's generally left to do is store snippets of
> original text along with the model objects and re-use that for writing the
> edited XML if the model wasn't modified by the program. Does a library
> exist that helps with this? Does any XML library at all allow structured
> access to the text representation of a tag with its attributes?

Have you considered using an XML-specific diff tool such as:
* One off this list:
http://www.manageability.org/blog/stuff/open-source-xml-diff-in-java
* xmldiff (it's in Python even): http://www.logilab.org/859
* diffxml: http://diffxml.sourceforge.net/

[Note: I haven't actually used any of these.]

Cheers,
Chris
--
http://blog.rebertia.com
--
http://mail.python.org/mailman/listinfo/python-list


thomas at thomas-lotze

Nov 18, 2009, 5:27 AM

Post #4 of 10 (356 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

Stefan Behnel wrote:

> Take a look at canonical XML (C14N). In short, that's the only way to get a
> predictable XML serialisation that can be used for textual diffs. It's
> supported by lxml.

Thank you for the pointer. IIUC, c14n is about changing an XML document so
that its textual representation is reproducible. While this representation
would certainly solve my problem if I were to deal with input that's
already in c14n form, it doesn't help me handling arbitrarily formatted
XML in a minimally intrusive way.

IOW, I don't want the XML document to obey the rules of a process, but
instead I want a process that respects the textual form my input happens
to have.

--
Thomas


--
http://mail.python.org/mailman/listinfo/python-list


thomas at thomas-lotze

Nov 18, 2009, 5:30 AM

Post #5 of 10 (359 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

Chris Rebert wrote:

> Have you considered using an XML-specific diff tool such as:

I'm afraid I'll have to fall back to using such a thing if I don't find a
solution to what I actually want to do.

I do realize that XML isn't primarily about its textual representation, so
I guess I shouldn't be surprised if what I'm looking for doesn't exist.
Still, it would be nice if it did...

--
Thomas


--
http://mail.python.org/mailman/listinfo/python-list


anthra.norell at bluewin

Nov 18, 2009, 10:17 AM

Post #6 of 10 (346 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

Thomas Lotze wrote:
> Chris Rebert wrote:
>
>
>> Have you considered using an XML-specific diff tool such as:
>>
>
> I'm afraid I'll have to fall back to using such a thing if I don't find a
> solution to what I actually want to do.
>
> I do realize that XML isn't primarily about its textual representation, so
> I guess I shouldn't be surprised if what I'm looking for doesn't exist.
> Still, it would be nice if it did...
>
>
Thomas,
I just might have what you are looking for. But I want to be sure I
understand your problem. So, please show a small sample and explain how
you want it modified.
Frederic
--
http://mail.python.org/mailman/listinfo/python-list


davea at ieee

Nov 18, 2009, 11:16 AM

Post #7 of 10 (346 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

Thomas Lotze wrote:
> Chris Rebert wrote:
>
>
>> Have you considered using an XML-specific diff tool such as:
>>
>
> I'm afraid I'll have to fall back to using such a thing if I don't find a
> solution to what I actually want to do.
>
> I do realize that XML isn't primarily about its textual representation, so
> I guess I shouldn't be surprised if what I'm looking for doesn't exist.
> Still, it would be nice if it did...
>
>
What's your real problem, or use case? Are you just concerned with
diffing, or are others likely to read the xml, and want it formatted the
way it already is? And how general do you need this tool to be? For
example, if the only thing you're doing is modifying existing attributes
or existing tags, the "minimal change" would be pretty unambiguous. But
if you're adding tags, or adding content on what was an empty element,
then the requirement gets fuzzy And finding an existing library for
something "fuzzy" is unlikely.

Sample input, change list, and desired output would be very useful.

DaveA

--
http://mail.python.org/mailman/listinfo/python-list


nobody at nowhere

Nov 18, 2009, 11:17 AM

Post #8 of 10 (348 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

On Wed, 18 Nov 2009 13:55:52 +0100, Thomas Lotze wrote:

> I wonder what Python XML library is best for writing a program that makes
> small modifications to an XML file in a minimally intrusive way. By that I
> mean that information the program doesn't recognize is kept, as are
> comments and whitespace, the order of attributes and even whitespace
> around attributes. In short, I want to be able to change an XML file while
> producing minimal textual diffs.
>
> Most libraries don't allow controlling the order of and the whitespace
> around attributes, so what's generally left to do is store snippets of
> original text along with the model objects and re-use that for writing the
> edited XML if the model wasn't modified by the program. Does a library
> exist that helps with this? Does any XML library at all allow structured
> access to the text representation of a tag with its attributes?

Expat parsers have a CurrentByteIndex field, while SAX parsers have
locators. You can use this to identify the portions of the input which
need to be processed, and just copy everything else. One downside is that
these only report either the beginning (Expat) or end (SAX) of the tag;
you'll have to deduce the other side yourself.

OTOH, "diff" is probably the wrong tool for the job.

--
http://mail.python.org/mailman/listinfo/python-list


thomas at thomas-lotze

Nov 23, 2009, 8:45 AM

Post #9 of 10 (299 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

Please consider this a reply to any unanswered messages I received in
response to my original post.

Dave Angel wrote:

> What's your real problem, or use case? Are you just concerned with
> diffing, or are others likely to read the xml, and want it formatted the
> way it already is?

I'd like to put the XML under revision control along with other stuff.
Other people should be able to make sense of the diffs and I'd rather not
require them to configure their tools to use some XML differ.

> And how general do you need this tool to be? For
> example, if the only thing you're doing is modifying existing attributes
> or existing tags, the "minimal change" would be pretty unambiguous. But
> if you're adding tags, or adding content on what was an empty element,
> then the requirement gets fuzzy And finding an existing library for
> something "fuzzy" is unlikely.

Sure. I guess it's something like an 80/20 problem: Changing attributes in
a way that keeps the rest of the XML intact will go a long way and as
we're talking about XML that is supposed to be looked at by humans, I
would base any further requirements on the assumption that it's
pretty-printed in some way so that removing an element, for example, can
be defined by touching as few lines as possible, and adding one can be
restricted to adding a line in the appropriate place. If more complex
stuff isn't as well-defined, that would be entirely OK with me.

> Sample input, change list, and desired output would be very useful.

I'd like to be able to reliably produce a diff like this using a program
that lets me change the value in some useful way, which might be dragging
a point across a map with the mouse in this example:

--- foo.gpx 2009-05-30 19:45:45.000000000 +0200
+++ bar.gpx 2009-11-23 17:41:36.000000000 +0100
@@ -11,7 +11,7 @@
<speed>0.792244</speed>
<fix>2d</fix>
</trkpt>
-<trkpt lat="50.605995000" lon="10.709680000">
+<trkpt lat="50.605985000" lon="10.709680000">
<ele>508.300000</ele>
<time>2009-05-30T16:37:10Z</time>
<course>15.150000</course>


--
Thomas


--
http://mail.python.org/mailman/listinfo/python-list


nobody at nowhere

Nov 23, 2009, 11:49 AM

Post #10 of 10 (297 views)
Permalink
Re: Minimally intrusive XML editing using Python [In reply to]

On Mon, 23 Nov 2009 17:45:24 +0100, Thomas Lotze wrote:

>> What's your real problem, or use case? Are you just concerned with
>> diffing, or are others likely to read the xml, and want it formatted the
>> way it already is?
>
> I'd like to put the XML under revision control along with other stuff.
> Other people should be able to make sense of the diffs and I'd rather not
> require them to configure their tools to use some XML differ.

In which case, the data format isn't "XML", but a subset of it (and
probably an under-defined subset at that, unless you invest a lot of
effort in defining it).

That defeats one of the advantages of using a standardised "container"
format such as XML, i.e. being able to take advantage of existing tools
and libraries (which will be written to the offical standard, not to your
private "standard").

One option is to require the files to be in a canonical form. Depending
upon your revision control system, you may be able to configure it to
either check that updated files are in this form, or even to convert them
automatically.

If your existing files aren't in such a form, there will be a one-time
penalty (i.e. a huge diff) when converting the files.


--
http://mail.python.org/mailman/listinfo/python-list

Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.