
Mailing List Archive: Python: Python
What's the best way to write this regular expression?
 



johnjsal at gmail

Mar 6, 2012, 2:43 PM



I sort of have to work with what the website gives me (as you'll see below), but today I encountered an exception to my RE. Let me just give all the specific information first. The point of my script is to go to the specified URL and extract song information from it.

This is my RE:

song_pattern = re.compile(r'([0-9]{1,2}:[0-9]{2} [ap]\.m\.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>', re.DOTALL)

This is how the website is formatted:

4:25 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li ><div class="cmPlaylistTime">

4:21 p.m.
</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a><br /></div></li><li ><div class="cmPlaylistTime">

4:19 p.m.
</div><div class="cmPlaylistImage"><img src="http://media.cmgdigital.com/shared/amg/pic200/drp100/p109/p10901ruw7x_r85x85.jpg?998f84231a014ed68123ddb508af9480570dc122" alt="Moe Bandy" class="cmDarkBoxShadow cmPhotoBorderWhite"/></div><div class="cmPlaylistContent"><strong><a href="/lsp/t15101/">It&#39; A Cheating Situation</a></strong><br /><a href="/lsp/a5307/">Moe Bandy</a><br /><span class="sprite iconVoteUp">Votes&nbsp;&nbsp;(1) </span></div></li><li ><div class="cmPlaylistTime">

4:15 p.m.
</div><div class="cmPlaylistImage"><img src="http://media.cmgdigital.com/shared/amg/pic200/drp700/p744/p74493d85qy_r85x85.jpg?998f84231a014ed68123ddb508af9480570dc122" alt="Reba McEntire" class="cmDarkBoxShadow cmPhotoBorderWhite"/></div><div class="cmPlaylistContent"><strong><a href="/lsp/t14437/">Somebody Should Leave</a></strong><br /><a href="/lsp/a396/">REBA McENTIRE</a> & <a href="/lsp/a5765/">LINDA DAVIS</a><br /></div></li><li ><div class="cmPlaylistTime">

There's something of a pattern, although it's not always perfect. The time is listed first, followed by the song information in <a> tags. However, in this particular case, you can see that for the 4:25 p.m. entry, "AP TX SOC CPAS TRF" is extracted as the song title, and then the RE skips ahead into the next entry to find the next <a> tag, which is actually the title of the next song in the list rather than the artist as normal. (Of course, I have no idea what AP TX SOC CPAS TRF is anyway. Usually the website doesn't list commercials or anomalies like that.)

So my first question is basic: am I even extracting the information properly? It works almost all the time, but because the website is such a mess, I pretty much have to rely on the tags being in the proper places (as they were NOT in this case!).
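One possible alternative to relying on tag positions in a regular expression is to walk the markup with an HTML parser and group links by entry. The following is only a minimal sketch using the standard library's html.parser on Python 3. The class name PlaylistParser, the dictionary layout, and the opening <ul><li><div class="cmPlaylistTime"> wrapper in the sample are assumptions made for illustration (the excerpt above begins mid-entry), and the real page may need more handling than this.

```python
from html.parser import HTMLParser


class PlaylistParser(HTMLParser):
    """Hypothetical helper: collect the time and link texts of each entry."""

    def __init__(self):
        super().__init__()
        self.entries = []     # each item: {'time': str, 'links': [str, ...]}
        self._target = None   # 'time', 'link', or None

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('class') == 'cmPlaylistTime':
            # A new time div starts a new entry.
            self.entries.append({'time': '', 'links': []})
            self._target = 'time'
        elif tag == 'a' and self.entries:
            # Every link belongs to the most recent entry.
            self.entries[-1]['links'].append('')
            self._target = 'link'

    def handle_endtag(self, tag):
        if tag in ('div', 'a'):
            self._target = None

    def handle_data(self, data):
        if self._target == 'time':
            self.entries[-1]['time'] += data
        elif self._target == 'link':
            self.entries[-1]['links'][-1] += data


# Two entries adapted from the excerpt; the leading wrapper is assumed.
html = (
    '<ul><li><div class="cmPlaylistTime">\n\n4:25 p.m.\n</div>'
    '<div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">'
    'AP TX SOC CPAS TRF</a></strong><br /><br /></div></li>'
    '<li ><div class="cmPlaylistTime">\n\n4:21 p.m.\n</div>'
    '<div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">'
    'No One Else On Earth</a></strong><br />'
    '<a href="/lsp/a1924/">Wynonna</a><br /></div></li></ul>'
)

parser = PlaylistParser()
parser.feed(html)
parser.close()

# Keep only entries that have both a title link and an artist link.
songs = [(e['time'].strip(), e['links'][0], e['links'][1])
         for e in parser.entries if len(e['links']) >= 2]
print(songs)  # [('4:21 p.m.', 'No One Else On Earth', 'Wynonna')]
```

Because links are attached to the entry they appear in, the one-link 4:25 entry cannot borrow a link from the 4:21 entry; it is simply filtered out.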

The second question is, to fix the above problem, would it be sufficient to rewrite my RE so that it has to find all of the specified information, i.e. a time followed by two <a> entries, BEFORE it moves on to finding the next time? I think that would have caused it to skip the 4:25 entry above, and only extract entries that have a time followed by two <a> entries (song and artist).

If this is possible, how do I rewrite it so that it has to match all the conditions without skipping over the next time entry in order to do so?
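One way this might be done, sketched here under assumptions rather than offered as the definitive answer, is to replace each free-running .*? with a gap that refuses to cross the start of the next entry. The sketch assumes the class name cmPlaylistTime reliably marks that start, as it does in the excerpt above; the names loose, strict, and GAP are illustrative. It also writes the time as [ap]\.m\. since [a|p] additionally accepts a literal | and an unescaped . matches any character.

```python
import re

# Two entries adapted from the excerpt: 4:25 has one link, 4:21 has two.
html = (
    '4:25 p.m.\n'
    '</div><div class="cmPlaylistContent"><strong><a href="/lsp/t24435/">'
    'AP TX SOC CPAS TRF</a></strong><br /><br /></div></li><li >'
    '<div class="cmPlaylistTime">\n\n'
    '4:21 p.m.\n'
    '</div><div class="cmPlaylistContent"><strong><a href="/lsp/t7672/">'
    'No One Else On Earth</a></strong><br /><a href="/lsp/a1924/">Wynonna</a>'
    '<br /></div></li><li ><div class="cmPlaylistTime">\n'
)

# Unbounded gaps: '.*?' may run past the next time stamp to find a second link.
loose = re.compile(
    r'([0-9]{1,2}:[0-9]{2} [ap]\.m\.).*?<a.*?>(.*?)</a>.*?<a.*?>(.*?)</a>',
    re.DOTALL)

# Bounded gap: consume characters lazily, but never step onto the marker
# that opens the next entry (assumed to be the class name cmPlaylistTime).
GAP = r'(?:(?!cmPlaylistTime).)*?'
strict = re.compile(
    r'([0-9]{1,2}:[0-9]{2} [ap]\.m\.)'
    + GAP + r'<a[^>]*>([^<]*)</a>'
    + GAP + r'<a[^>]*>([^<]*)</a>',
    re.DOTALL)

print(loose.findall(html))
# [('4:25 p.m.', 'AP TX SOC CPAS TRF', 'No One Else On Earth')]
print(strict.findall(html))
# [('4:21 p.m.', 'No One Else On Earth', 'Wynonna')]
```

With the bounded gap, the 4:25 entry fails to match because no second link exists before the next cmPlaylistTime marker, so the search resumes and picks up the 4:21 entry intact.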

Thanks.

