Gossamer Forum
Home : General : Perl Programming :

Looking for script to pull news from other sites

I have an agreement with another web site to use some of their content. I am looking for a script that would find out what new links are available on a page and then pull them over to my site so that I can put headers and footers on them. It would run daily, and I would like to keep an archive of the pulled-in news items.

Any ideas anyone?

Charles
Re: Looking for script to pull news from other sites
It depends on the format of these files and how you would normally download them. Are they available on the web? Are they plain text or HTML? I am working on some small projects with the LWP modules for retrieving and processing page content, and would be glad to help; I just need a little more info.
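For HTML pages, a regex pass over the fetched page is often enough to pull the links out. A rough sketch of that idea, with an inline sample standing in for the page (in practice you would fetch the real page with LWP::Simple's get(), as noted in the comment; the URL is just a placeholder):

```perl
#!/usr/bin/perl -w
use strict;

# In real use, fetch the page first, e.g.:
#   use LWP::Simple;
#   my $html = get('http://www.contentsite.com/index.html');
# Here an inline sample stands in for the fetched page.
my $html = <<'HTML';
<html><body>
<a href="newsItem1.htm">First story</a>
<a href="newsItem2.htm">Second story</a>
</body></html>
HTML

# Crude but workable: grab each href target and its link text.
my @links;
while ($html =~ m{<a\s+href="([^"]+)">([^<]*)</a>}gi) {
    push @links, [$1, $2];
}

foreach my $link (@links) {
    print "$link->[0]\t$link->[1]\n";
}
```

A regex like this will miss links split across lines or with extra attributes, but it covers the simple pages most news indexes use.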



------------------
Fred Hirsch
Web Consultant & Programmer
Re: Looking for script to pull news from other sites
I heard about a script called Web Pluck, but I wasn't too sure how it worked. I also have an agreement to "grab" news from another site, but I haven't worked out how to get a script to do it.

The pages I'd be grabbing from would be HTML files.

Simon.
Re: Looking for script to pull news from other sites
I just wanted to say that I'm working on one, and will post it when ready.

Read my other post to see the problems I've run into.

The script will read news headlines from a news site and display them in your own layout!
(You can even deny certain headlines...)
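Denying headlines can be as simple as a list of exclusion patterns checked before a headline is written into the layout. A hypothetical sketch of that step (the headline list and deny patterns below are made up for illustration, not taken from the actual script):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical headlines, as if just parsed from the news site.
my @headlines = (
    'Server prices fall again',
    'ADVERTISEMENT: buy widgets',
    'Perl 5.6 roadmap announced',
);

# Deny list: any headline matching one of these patterns is skipped.
my @deny = (qr/advertisement/i, qr/sponsored/i);

my @kept;
HEADLINE: foreach my $h (@headlines) {
    foreach my $pat (@deny) {
        next HEADLINE if $h =~ $pat;
    }
    push @kept, $h;
}

# Drop the survivors into your own layout.
foreach my $h (@kept) {
    print "<li>$h</li>\n";
}
```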
Re: Looking for script to pull news from other sites
Great, I am glad that you are working on one. This is roughly how I would envision it working. There may be a way to simplify the process, but you get the general idea. I'll read through your posts to see what you have been talking about.

Charles



At X time during X day start Content Crawler

Crawler looks in map configuration file
http://www.mysite.com/cgi/crawler/map.dat

Crawler sees first entry in map file as:
http://www.contentsite.com/index.html

Crawler goes out and fetches entry in map.dat

Crawler parses http://www.contentsite.com/index.html
for <a href> tags

If found place data in temp.dat

Data Contains: <a href="newsItem1.htm"> Name of News Item

Continue until crawler reaches </HTML> Tag

Read another entry in map.dat file

Do until EOF
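The crawl loop above can be sketched as follows. The file names and the temp.dat record format come from the outline; the map entries and the fetched page are faked with inline data here so the flow is visible (a real crawler would read map.dat from disk and fetch each entry with LWP):

```perl
#!/usr/bin/perl -w
use strict;

# Stand-ins for map.dat and for the pages the crawler would fetch.
my @map_entries = ('http://www.contentsite.com/index.html');
my %fake_fetch = (
    'http://www.contentsite.com/index.html' =>
        '<html><a href="newsItem1.htm">Name of News Item</a></html>',
);

my @temp;                                   # what the outline calls temp.dat
foreach my $url (@map_entries) {            # Do until EOF of map.dat
    my $page = $fake_fetch{$url};           # crawler goes out and fetches
    while ($page =~ m{<a\s+href="([^"]+)">([^<]*)}gi) {
        push @temp, qq{<a href="$1"> $2};   # if found, place data in temp.dat
    }
}
print "$_\n" for @temp;
```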


Content Crawler Processing

Compare temp.dat with news.dat

If entry in temp.dat is not in news.dat then store in good.dat (Remove Dups)

Read another entry in temp.dat

Do until temp.dat EOF
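The duplicate-removal step maps naturally onto a hash: load news.dat into a hash, then keep only the temp.dat entries that are not already keys. A sketch with inline lists standing in for the two files:

```perl
#!/usr/bin/perl -w
use strict;

# Inline stand-ins for temp.dat (fresh crawl) and news.dat (already seen).
my @temp = ('newsItem1.htm', 'newsItem2.htm', 'newsItem3.htm');
my @news = ('newsItem1.htm');

# Hash of everything already in news.dat, for quick lookups.
my %seen = map { $_ => 1 } @news;

# good.dat gets only the entries not seen before (removes dups).
my @good = grep { !$seen{$_} } @temp;

print "$_\n" for @good;
```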


Content Crawler Formatting

Read good.dat

Suck in .htm file

Format with predefined headers and footers

Save in predefined location

Write to News_Index file as a new entry

Repeat as necessary


Include Archiving Routines Here
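The formatting pass is mostly string surgery: pull the body out of the fetched .htm file and wrap it in your own header and footer before saving. A minimal sketch of that step (the header/footer text and the body extraction are placeholders; a real run would read the file from disk and write the result to the predefined location):

```perl
#!/usr/bin/perl -w
use strict;

# Stand-in for a fetched news item; a real run would read the .htm file.
my $item = '<html><body><p>Story text here.</p></body></html>';

# Pull out just the body content (crude, but enough for simple pages).
my ($body) = $item =~ m{<body>(.*)</body>}is;

# Predefined header and footer (placeholders for your site's layout).
my $header = "<html><body><!-- my site header -->\n";
my $footer = "\n<!-- my site footer --></body></html>";

my $formatted = $header . $body . $footer;
print $formatted, "\n";

# Saving to the predefined location and appending an entry to the
# News_Index file would be ordinary open()/print() calls on real files.
```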
Re: Looking for script to pull news from other sites
I am writing a spidering program that does many of the functions you describe, but it is written from the viewpoint of image processing, so it does not do any parsing of HTML yet. However, it is very adept at spidering sites, and allows inclusion of certain file and tag types as well as filtering of specific file name matches. It could probably be converted to handle your task without much trouble.

It is, however, extremely beta, and has some possible memory and configuration problems I am working on. It only works from a prompt and does not have any sort of web or GUI interface. I am working on those aspects as I clean up the spidering logic of the code.

If this program interests you, let me know and I can email you a copy. Otherwise, wait for the final version, which will have a web-based and/or Windows-based interface.
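The inclusion and file-name filtering described above can be sketched as a pair of pattern lists applied to each candidate URL before it is queued for spidering. The extensions and patterns below are illustrative only, not the program's actual configuration:

```perl
#!/usr/bin/perl -w
use strict;

# Candidate links found on a page (illustrative file names).
my @candidates = ('index.html', 'story1.htm', 'logo.gif', 'banner_ad.htm');

# Include only these file types; exclude these name matches.
my @include = (qr/\.html?$/i);
my @exclude = (qr/banner/i);

my @queue;
URL: foreach my $url (@candidates) {
    # Must match at least one inclusion pattern...
    my $ok = 0;
    foreach my $pat (@include) { $ok = 1 if $url =~ $pat; }
    next URL unless $ok;
    # ...and no exclusion pattern.
    foreach my $pat (@exclude) { next URL if $url =~ $pat; }
    push @queue, $url;
}
print "$_\n" for @queue;
```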



------------------
Fred Hirsch
Web Consultant & Programmer