Gossamer Forum

Extracting text from web page

Hi!

I'm writing a Perl subroutine or global to extract a text sample from a web page. It would supplement the "Description" meta tag and similar info.

Specifically, how would I extract a digest of a web page (the first 200 characters, broken on a word boundary and stripped of HTML tags) using LINKS SQL 2.12?

I need this to run through all valid links with a "200" (good page) status and insert the result into the links database in a custom field, let's say "page_extract".
The same goes for the Title and Description meta tags and email, but there may already be code examples for those ...

Thanks!

Re: [webslicer] Extracting text from web page
Your question is kinda advanced for me... but maybe this will help you somewhat?



To strip HTML from a .html file, try this from a Unix shell:

sed -e :a -e 's/<[^>]*>//g;/</N;//ba' "$YOUR_HTMLFILE_HERE" | grep -v "&nbsp"

Note that the grep drops every line containing &nbsp, along with any other text on that line.
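
Building on the same sed idea, here is a rough shell sketch of the 200-character digest asked about in the first post. It is only a sketch under some assumptions: the function name page_extract is made up (borrowed from the custom field name), tags are replaced with a space instead of being deleted so neighbouring words don't run together, only &nbsp; is decoded, and the contents of script and style elements are not skipped. Inserting the result into the Links SQL database is not shown.

```shell
#!/bin/sh
# page_extract: hypothetical helper, not part of Links SQL.
# Reads HTML on stdin and prints a plain-text digest of at most $1
# characters (default 200), cut at the last space that fits.
page_extract() {
    limit=${1:-200}
    # Replace each tag with a space; the N/branch loop handles tags
    # that span several lines.
    sed -e :a -e 's/<[^>]*>/ /g;/</N;//ba' |
    awk -v n="$limit" '
        # Turn &nbsp; into a plain space and join all lines into one string.
        { gsub(/&nbsp;/, " "); text = text " " $0 }
        END {
            # Collapse runs of whitespace and trim both ends.
            gsub(/[ \t\r]+/, " ", text)
            sub(/^ /, "", text)
            sub(/ $/, "", text)
            if (length(text) > n) {
                # Keep one extra character so a space right after the
                # limit still counts as a word boundary.
                text = substr(text, 1, n + 1)
                # Drop the last space and any partial word after it.
                sub(/ [^ ]*$/, "", text)
                # Hard cut if a single word is longer than the limit.
                text = substr(text, 1, n)
            }
            print text
        }'
}
```

Something like `page_extract 200 < page.html` (page.html being whatever file you saved the page to) should then print the digest on one line.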