Mailing List Archive: Python: Python

xml input sanitizing method in standard lib?


afri at afri

Mar 9, 2009, 8:32 AM

Post #1 of 7
xml input sanitizing method in standard lib?

Hi,

Is there a method in the Python standard library for sanitizing
strings that will be inserted into XML documents (i.e. removing form feeds
and whatever else is not allowed)? I've searched the docs and Google and
found only the 4Suite project. I cannot rely on anything outside the
standard lib, so I'm wondering whether I've just overlooked something in
the docs for this (imo) important task...

Petr

--
http://mail.python.org/mailman/listinfo/python-list


gagsl-py2 at yahoo

Mar 9, 2009, 8:52 AM

Post #2 of 7
Re: xml input sanitizing method in standard lib? [In reply to]

On Mon, 09 Mar 2009 13:32:48 -0200, Petr Muller <afri [at] afri> wrote:

> Is there some method provided in python standard library to sanitize
> strings used as input to xml documents? (=remove form-feeds and whatever
> else). I've searched docs and google, found only 4Suite project. I
> cannot rely on something not in standard lib, so I'm wondering if I've
> just overlooked something in the docs to this (imo) important task...

What do you mean by "sanitize strings used as input to xml documents"?
Do you have a string that you're going to parse as an XML document? Either
it is valid XML or it is not. I would not "sanitize" it; you risk changing
the data inside.

--
Gabriel Genellina



afri at afri

Mar 9, 2009, 10:30 AM

Post #3 of 7
Re: xml input sanitizing method in standard lib? [In reply to]

Hi,

> > Is there some method provided in python standard library to sanitize
> > strings used as input to xml documents? (=remove form-feeds and whatever
> > else). I've searched docs and google, found only 4Suite project. I
> > cannot rely on something not in standard lib, so I'm wondering if I've
> > just overlooked something in the docs to this (imo) important task...

>
> What do you mean by "sanitize strings used as input to xml
> documents"?
> Do you have a string that you're going to parse as an XML document? Either
> it is valid XML, or not. I would not "sanitize" it, you risk changing the
> data inside.

Thanks for the response, and sorry that I wasn't clear the first time. I
have a heap of data (logs) from which I build an XML document using
xml.dom.minidom. This data may contain characters that are invalid in
XML; the form feed character (\x0c) is one example.

I don't know what else is illegal in XML, so I was looking for an existing
method that prepares strings for insertion into an XML doc before I start
studying the XML spec and writing such a function on my own.
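
A minimal sketch of the situation described above, written for Python 3 (the
thread itself predates it). The element name "entry" and the helper
is_well_formed are made up for illustration. It suggests that minidom writes
a form feed into the output unchanged, and that the result can no longer be
parsed back:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Build a tiny document whose text contains a form feed (\x0c).
doc = minidom.Document()
entry = doc.createElement("entry")  # hypothetical element name
entry.appendChild(doc.createTextNode("line one\x0cline two"))
doc.appendChild(entry)

# minidom escapes markup characters such as & and <, but it does not
# check for characters that XML 1.0 forbids.
serialized = doc.toxml()


def is_well_formed(xml_text):
    """Return True if xml_text can be parsed back as XML (hypothetical helper)."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


print("\x0c" in serialized)        # the form feed survived serialization
print(is_well_formed(serialized))  # and the parser rejects the result
```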

Petr



tjreedy at udel

Mar 9, 2009, 1:52 PM

Post #4 of 7
Re: xml input sanitizing method in standard lib? [In reply to]

Petr Muller wrote:
> Hi,

> Thanks for response and sorry for I wasn't clear first time. I have a
> heap of data (logs), from which I build a XML document using
> xml.dom.minidom. In this data, some xml invalid characters may occur -
> form feed (\x0c) character is one example.

Is this a hypothetical problem or do you actually have such chars?
If so, are they random hiccups from sick loggers, storage or
transmission errors, or informative markers intentionally inserted?
When you find them, do you want to silently ignore, ignore but raise a
flag (metalog the log error), or act on them as part of the
parsing/structuring process?

Silently deleting chars is easy with str.translate.
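
One possible sketch of that, assuming Python 3 (where str.translate takes a
dict keyed by code point) and assuming that only the C0 control characters
need to go. The names DELETE_TABLE and strip_control_chars are illustrative.
Tab, line feed and carriage return are kept because XML allows them:

```python
# Map every C0 control character (below U+0020) to None, which tells
# str.translate to delete it. Tab, LF and CR are legal in XML and are kept.
DELETE_TABLE = {
    cp: None for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)
}


def strip_control_chars(text):
    """Silently drop C0 control characters from text."""
    return text.translate(DELETE_TABLE)


print(repr(strip_control_chars("page 1\x0cpage 2\n")))  # 'page 1page 2\n'
```

This covers only the control characters; the complete set of characters
excluded by the XML spec is somewhat larger.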

> I don't know what else is illegal in xml, so I've searched if there's
> some method how to prepare strings for insertion to a xml doc before I
> start research on a xml spec and write such function on my own.



gagsl-py2 at yahoo

Mar 9, 2009, 2:40 PM

Post #5 of 7
Re: xml input sanitizing method in standard lib? [In reply to]

On Mon, 09 Mar 2009 15:30:31 -0200, Petr Muller <afri [at] afri> wrote:

> Thanks for response and sorry for I wasn't clear first time. I have a
> heap of data (logs), from which I build a XML document using
> xml.dom.minidom. In this data, some xml invalid characters may occur -
> form feed (\x0c) character is one example.
>
> I don't know what else is illegal in xml, so I've searched if there's
> some method how to prepare strings for insertion to a xml doc before I
> start research on a xml spec and write such function on my own.

You don't have to; Python already comes with xml support. Using
ElementTree to build the document is usually easier and faster:
http://effbot.org/zone/element-index.htm

--
Gabriel Genellina



stefan_ml at behnel

Mar 10, 2009, 12:40 AM

Post #6 of 7
Re: xml input sanitizing method in standard lib? [In reply to]

Gabriel Genellina wrote:
> On Mon, 09 Mar 2009 15:30:31 -0200, Petr Muller <afri [at] afri> wrote:
>
>> Thanks for response and sorry for I wasn't clear first time. I have a
>> heap of data (logs), from which I build a XML document using
>> xml.dom.minidom. In this data, some xml invalid characters may occur -
>> form feed (\x0c) character is one example.
>>
>> I don't know what else is illegal in xml, so I've searched if there's
>> some method how to prepare strings for insertion to a xml doc before I
>> start research on a xml spec and write such function on my own.
>
> You don't have to; Python already comes with xml support. Using
> ElementTree to build the document is usually easier and faster:
> http://effbot.org/zone/element-index.htm

While I usually second that, this isn't the problem here. This thread is
about characters that are not allowed in XML. The set of allowed
characters is defined here:

http://www.w3.org/TR/xml/#charsets
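
As a rough sketch for Python 3, the Char production from that section of
the XML 1.0 spec (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD,
#x10000-#x10FFFF) can be turned into a regular expression that matches
everything outside it. The function names here are made up for illustration.
The first function reports offending characters so they can be logged, the
second removes or replaces them:

```python
import re

# Matches any character that is NOT in the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return a list of (index, character) pairs for disallowed characters."""
    return [(m.start(), m.group()) for m in _ILLEGAL_XML_CHARS.finditer(text)]


def remove_illegal_xml_chars(text, replacement=""):
    """Return text with every disallowed character replaced (default: removed)."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


print(find_illegal_xml_chars("ok\x0c"))                  # [(2, '\x0c')]
print(repr(remove_illegal_xml_chars("a\x0cb\x00c")))     # 'abc'
```

Note that this only deals with characters that may not appear at all;
markup characters such as & and < are legal and are escaped by the XML
library when the document is written.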

And, as Terry Reedy pointed out, the "unicode.translate" method should get
you where you want to go. Just define a dict that maps each character you
want to remove to whatever character you want to use instead (or to None
to drop it) and pass that into .translate().
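
A sketch of such a dict, assuming Python 3, where the method is
str.translate and the dict is keyed by code point. The name
build_translate_table is hypothetical. The table lists the code points
excluded by the XML 1.0 character set: the C0 controls other than tab, LF
and CR, the surrogate range, and U+FFFE/U+FFFF:

```python
def build_translate_table(replacement=None):
    """Map every code point that XML 1.0 forbids to `replacement`.

    With replacement=None the characters are deleted; otherwise they are
    replaced by the given string.
    """
    illegal = [cp for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)]
    illegal += list(range(0xD800, 0xE000))  # surrogate code points
    illegal += [0xFFFE, 0xFFFF]
    return dict.fromkeys(illegal, replacement)


DELETE = build_translate_table()      # drop the characters
MARK = build_translate_table("?")     # or make the substitution visible

print(repr("a\x0cb".translate(DELETE)))  # 'ab'
print(repr("a\x0cb".translate(MARK)))    # 'a?b'
```

Replacing instead of deleting keeps a visible trace in the output that
something was removed from the original log line.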

Stefan


gagsl-py2 at yahoo

Mar 10, 2009, 5:36 AM

Post #7 of 7
Re: xml input sanitizing method in standard lib? [In reply to]

On Tue, 10 Mar 2009 05:40:10 -0200, Stefan Behnel <stefan_ml [at] behnel>
wrote:
> Gabriel Genellina wrote:
>> On Mon, 09 Mar 2009 15:30:31 -0200, Petr Muller <afri [at] afri> wrote:
>>
>>> I don't know what else is illegal in xml, so I've searched if there's
>>> some method how to prepare strings for insertion to a xml doc before I
>>> start research on a xml spec and write such function on my own.
>>
>> You don't have to; Python already comes with xml support. Using
>> ElementTree to build the document is usually easier and faster:
>> http://effbot.org/zone/element-index.htm
>
> While I usually second that, this isn't the problem here. This thread is
> about unallowed characters in XML. The set of allowed characters is
> defined
> here:
>
> http://www.w3.org/TR/xml/#charsets

Ouch. Sorry, I was under the false impression that a) control characters
were allowed, and b) ElementTree would escape them. Both are wrong, and I
stand corrected.
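
A small check of point b), sketched for Python 3's xml.etree.ElementTree
(the element name "entry" is made up). As far as this sketch shows,
ElementTree escapes markup characters on output but passes a form feed
through untouched, and its own parser then refuses the result:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("entry")  # hypothetical element name
elem.text = "before\x0cafter & <more>"

# Serialization escapes & and <, but leaves the form feed as it is.
data = ET.tostring(elem)
print(data)

# Parsing the serialized bytes back fails because \x0c is not a legal
# XML 1.0 character.
try:
    ET.fromstring(data)
    round_trips = True
except ET.ParseError:
    round_trips = False

print(round_trips)
```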

--
Gabriel Genellina
