
Mailing List Archive: Python: Python

finding the byte offset of an element in an XML file (tell() and seek()?)



btemperton at gmail

Jun 14, 2012, 12:27 PM

Post #1 of 2 (88 views)

Hi there,

I am working with mass spectroscopy data in the mzXML format that looks like this:
<mzXML>
<msRun>
<scan num="1">...</scan>
<scan num="2">...</scan>
<scan num="3">...</scan>
<scan num="4">...</scan>
.....
</msRun>
<index>
<offset id="1">160409990</offset>
<offset id="2">160442725</offset>
<offset id="3">160474927</offset>
<offset id="4">160497386</offset>
....
</index>
</mzXML>

Here each offset element contains the byte offset of the scan element that shares its id. I am trying to write a Python script to remove scan elements and their respective offsets, but I can't figure out how to recalculate the byte offset for each remaining element once the elements have been removed.

My plan was to write the file out, then read it back in again, search through the file for a particular string (e.g. '<scan num="1">'), and use the tell() method to return the current byte location in the file. However, I'm not sure how I would implement this.

Any ideas?

Many thanks,

Ben
--
http://mail.python.org/mailman/listinfo/python-list


emile at fenx

Jun 14, 2012, 1:28 PM

Post #2 of 2 (88 views)
Re: finding the byte offset of an element in an XML file (tell() and seek()?)

On 6/14/2012 12:27 PM Ben Temperton said...
> Hi there,
>
> I am working with mass spectroscopy data in the mzXML format that looks like this:
> <mzXML>
> <msRun>
> <scan num="1">...</scan>
> <scan num="2">...</scan>
> <scan num="3">...</scan>
> <scan num="4">...</scan>
> .....
> </msRun>
> <index>
> <offset id="1">160409990</offset>
> <offset id="2">160442725</offset>
> <offset id="3">160474927</offset>
> <offset id="4">160497386</offset>
> ....
> </index>
> </mzXML>
>
> Where the offset element contains the byte offset

byte offset to what in what?

> of the scan element that shares the id. I am trying to write a python script to remove scan elements and their respective offset, but I can't figure out how I re-calculate the byte offset for each remaining element once the elements have been removed.

Removing the reference will make the location inaccessible, so you
may not need to recalculate the remaining offsets. I'm assuming these
are offsets into some other file and that the length of the data stored
at each location is known or knowable. But if you don't point to a
location, you won't try to read the data at that offset.

If you're trying to minimize the size on disk, you'll probably want to
write a new file and leave the original alone. Initialize a buffer,
then for each valid offset id, append the content and bump the offset by
the content length (or whatever is appropriate).
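
One way to read the suggestion above, as a rough sketch rather than a
tested mzXML writer: copy the scans you want to keep into a new buffer
and note the buffer position just before each one is written. The names
rebuild() and render_index(), the (scan_id, raw_bytes) input shape, and
the two-line header are all assumptions made for illustration. A real
mzXML file may carry other elements that depend on byte positions, and
this sketch ignores them.

```python
import io


def rebuild(scans, keep_ids, header=b"<mzXML>\n<msRun>\n"):
    """Copy the kept scans into a new buffer and record where each starts.

    scans    -- iterable of (scan_id, raw_bytes) pairs (hypothetical shape)
    keep_ids -- set of scan ids to retain
    Returns (new_bytes, offsets), where offsets maps scan_id -> byte offset.
    """
    out = io.BytesIO()
    out.write(header)
    offsets = {}
    for scan_id, raw in scans:
        if scan_id not in keep_ids:
            continue  # dropped scans never reach the output
        offsets[scan_id] = out.tell()  # position where this scan begins
        out.write(raw)
        out.write(b"\n")
    out.write(b"</msRun>\n")
    return out.getvalue(), offsets


def render_index(offsets):
    """Build the <index> block; it follows the scans, so writing it
    does not move any of the recorded offsets."""
    lines = [b"<index>"]
    for scan_id, off in offsets.items():
        lines.append(b'<offset id="%d">%d</offset>' % (scan_id, off))
    lines.append(b"</index>")
    return b"\n".join(lines) + b"\n"
```

Because the buffer holds bytes, out.tell() is a true byte offset. The
same loop works with a file opened in "wb" mode in place of io.BytesIO
if the data is too large to hold in memory.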

Emile


>
> My plan was to write the file out, then read it back in again and search through the file for a particular string (e.g. '<scan num="1">') and then use the tell() method to return the current byte location in the file. However, I'm not sure how I would implement this.
>
> Any ideas?
>
> Many thanks,
>
> Ben
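
The plan quoted above (write the file, read it back, find each
'<scan num="N">' string) can be sketched without tell() at all: if the
file is read in binary mode, the start position of a regular-expression
match is already a byte offset. This is only a sketch. The function
names are made up for illustration, and the pattern assumes that num is
the first attribute of each scan tag, as in the sample.

```python
import re

# Bytes pattern: matches the start of a scan tag and captures its number.
# Assumes num is the first attribute, as in the sample in this thread.
SCAN_START = re.compile(rb'<scan\s+num="(\d+)"')


def scan_offsets(data):
    """Map scan number -> byte offset of its opening '<' in data (bytes)."""
    return {int(m.group(1)): m.start() for m in SCAN_START.finditer(data)}


def scan_offsets_from_file(path):
    """Read the whole file in binary mode and locate every scan tag."""
    with open(path, "rb") as fh:
        return scan_offsets(fh.read())
```

Opening the file in text mode would not work here, because positions
in decoded text are character counts and differ from byte counts as
soon as the file contains a multi-byte character. For files too large
to read in one go, the same pattern can be run over an mmap object
instead of the result of fh.read().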


