Gossamer Forum: General: Perl Programming: Extracting text from web page

Extracting text from web page

Apr 27, 2003, 12:34 PM

webslicer

User (236 posts)

Post #1 of 2

Hi!

I'm working on a Perl subroutine or global to extract a text sample from a web page. It would supplement the "Description" meta tag and similar info.

Specifically, how would I extract a digest of a web page, meaning the first 200 characters (broken on a word boundary) with the HTML stripped, using Links SQL 2.12?

I need this to run through all valid links (those returning a "200" good-page status) and insert the result into the links database in a custom field, let's say "page_extract".
The same goes for the Title and Description meta tags, and email, but there may already be code examples for those ...

Thanks!

Apr 30, 2003, 2:37 PM

mtorres

Novice (16 posts)

Post #2 of 2

Re: [webslicer] Extracting text from web page

Your question is kinda advanced for me, but maybe this will help you somewhat.

To strip HTML from an .html file, try this from a Unix shell:

sed -e :a -e 's/<[^>]*>//g;/</N;//ba' "$YOUR_HTMLFILE_HERE" | grep -v "&nbsp"
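
Building on that one-liner, here is a rough sketch of how the 200-character, word-boundary digest from the original question might be produced in the shell. The function names strip_html and extract_digest are made up for illustration, and a few things are assumptions rather than anything from the thread: &nbsp; is turned into a space instead of dropping the whole line, other entities are left untouched, and the page is already valid enough that tags open and close normally. It does not touch Links SQL at all; storing the result in a field like "page_extract" would still need to be done separately.

```shell
#!/bin/sh

# strip_html (hypothetical name): read HTML on stdin, write text on stdout.
# Uses the same sed loop as above, so tags that span several lines are
# joined and removed. Only &nbsp; is decoded; other entities pass through.
strip_html() {
    sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e 's/&nbsp;/ /g'
}

# extract_digest (hypothetical name): read HTML on stdin and print at most
# MAX characters of plain text (default 200), cut on a word boundary.
extract_digest() {
    max=${1:-200}
    strip_html |
        tr '\t\r\n' '   ' |   # turn tabs, CRs and newlines into spaces
        tr -s ' ' |           # squeeze runs of spaces into one
        awk -v max="$max" '{
            sub(/^ +/, ""); sub(/ +$/, "")      # trim both ends
            if (length($0) <= max) { print; exit }
            s = substr($0, 1, max + 1)          # one extra char to detect a clean cut
            sub(/ [^ ]*$/, "", s)               # drop the trailing partial word
            if (length(s) > max) s = substr(s, 1, max)  # single word longer than max
            print s
            exit
        }'
}
```

Usage would be something like `extract_digest 200 < page.html`, or piping a fetched page into it, where page.html is just a placeholder file name. One limitation carried over from the sed approach: adjacent tags with no whitespace between them (for example `</p><p>`) leave the surrounding words joined together.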
