Gossamer Forum
Home : General : Perl Programming :

Looking for script to pull news from other sites

I have an agreement with another web site to use some of their content. I am looking for a script that would find out what new links are available on a page and then pull them over to my site so that I can put headers and footers on them. It would run daily, and I would like to keep an archive of the pulled-in news items.

Any ideas anyone?

Charles
Re: Looking for script to pull news from other sites
It depends on the format of these files and how you would normally download them. Are they available on the web? Are they plain text or HTML? I am working on some small projects with the LWP modules for retrieving and processing page content, and would be glad to help; I just need a little more info.
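For HTML pages, a regex pass over the fetched page is often enough to pull the links out. A rough sketch of that idea, with an inline sample standing in for the page (in practice you would fetch the real page with LWP::Simple's get(), as noted in the comment; the URL is just a placeholder):

```perl
#!/usr/bin/perl -w
use strict;

# In real use, fetch the page first, e.g.:
#   use LWP::Simple;
#   my $html = get('http://www.contentsite.com/index.html');
# Here an inline sample stands in for the fetched page.
my $html = <<'HTML';
<html><body>
<a href="newsItem1.htm">First story</a>
<a href="newsItem2.htm">Second story</a>
</body></html>
HTML

# Crude but workable: grab each href target and its link text.
my @links;
while ($html =~ m{<a\s+href="([^"]+)">([^<]*)</a>}gi) {
    push @links, [$1, $2];
}

foreach my $link (@links) {
    print "$link->[0]\t$link->[1]\n";
}
```

A regex like this will miss links split across lines or with extra attributes, but it covers the simple pages most news indexes use.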



------------------
Fred Hirsch
Web Consultant & Programmer
Re: Looking for script to pull news from other sites
I heard about a script called Web Pluck, but I wasn't too sure how it worked. I also have an agreement to "grab" news from another site, but I haven't worked out how to get a script to do it.

The pages I'd be grabbing from would be HTML files.

Simon.
Re: Looking for script to pull news from other sites
I just wanted to say that I'm working on one, and will post it when ready.

Read my other post to see the problems I've run into.

The script will read news headlines from a news site and display them in your own layout!
(You can even deny certain headlines...)
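Denying headlines can be as simple as a list of exclusion patterns checked before a headline is written into the layout. A hypothetical sketch of that step (the headline list and deny patterns below are made up for illustration, not taken from the actual script):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical headlines, as if just parsed from the news site.
my @headlines = (
    'Server prices fall again',
    'ADVERTISEMENT: buy widgets',
    'Perl 5.6 roadmap announced',
);

# Deny list: any headline matching one of these patterns is skipped.
my @deny = (qr/advertisement/i, qr/sponsored/i);

my @kept;
HEADLINE: foreach my $h (@headlines) {
    foreach my $pat (@deny) {
        next HEADLINE if $h =~ $pat;
    }
    push @kept, $h;
}

# Drop the survivors into your own layout.
foreach my $h (@kept) {
    print "<li>$h</li>\n";
}
```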
Re: Looking for script to pull news from other sites
Great, I am glad that you are working on one. This is roughly how I would envision it working. There may be a way to simplify the process, but you get the general idea. I'll read through your posts to see what you have been talking about.

Charles



At X time during X day start Content Crawler

Crawler looks in map configuration file
http://www.mysite.com/cgi/crawler/map.dat

Crawler sees first entry in map file as:
http://www.contentsite.com/index.html

Crawler goes out and fetches entry in map.dat

Crawler parses http://www.contentsite.com/index.html
for <a href> tags

If found place data in temp.dat

Data Contains: <a href="newsItem1.htm"> Name of News Item

Continue until crawler reaches </HTML> Tag

Read another entry in map.dat file

Do until EOF
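The crawl loop above can be sketched as follows. The file names and the temp.dat record format come from the outline; the map entries and the fetched page are faked with inline data here so the flow is visible (a real crawler would read map.dat from disk and fetch each entry with LWP):

```perl
#!/usr/bin/perl -w
use strict;

# Stand-ins for map.dat and for the pages the crawler would fetch.
my @map_entries = ('http://www.contentsite.com/index.html');
my %fake_fetch = (
    'http://www.contentsite.com/index.html' =>
        '<html><a href="newsItem1.htm">Name of News Item</a></html>',
);

my @temp;                                   # what the outline calls temp.dat
foreach my $url (@map_entries) {            # Do until EOF of map.dat
    my $page = $fake_fetch{$url};           # crawler goes out and fetches
    while ($page =~ m{<a\s+href="([^"]+)">([^<]*)}gi) {
        push @temp, qq{<a href="$1"> $2};   # if found, place data in temp.dat
    }
}
print "$_\n" for @temp;
```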


Content Crawler Processing

Compare temp.dat with news.dat

If entry in temp.dat is not in news.dat then store in good.dat (Remove Dups)

Read another entry in temp.dat

Do until temp.dat EOF
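The duplicate-removal step maps naturally onto a hash: load news.dat into a hash, then keep only the temp.dat entries that are not already keys. A sketch with inline lists standing in for the two files:

```perl
#!/usr/bin/perl -w
use strict;

# Inline stand-ins for temp.dat (fresh crawl) and news.dat (already seen).
my @temp = ('newsItem1.htm', 'newsItem2.htm', 'newsItem3.htm');
my @news = ('newsItem1.htm');

# Hash of everything already in news.dat, for quick lookups.
my %seen = map { $_ => 1 } @news;

# good.dat gets only the entries not seen before (removes dups).
my @good = grep { !$seen{$_} } @temp;

print "$_\n" for @good;
```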


Content Crawler Formatting

Read good.dat

Suck in .htm file

Format with predefined headers and footers

Save in predefined location

Write to News_Index file as a new entry

Repeat as necessary


Include Archiving Routines Here
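The formatting pass is mostly string surgery: pull the body out of the fetched .htm file and wrap it in your own header and footer before saving. A minimal sketch of that step (the header/footer text and the body extraction are placeholders; a real run would read the file from disk and write the result to the predefined location):

```perl
#!/usr/bin/perl -w
use strict;

# Stand-in for a fetched news item; a real run would read the .htm file.
my $item = '<html><body><p>Story text here.</p></body></html>';

# Pull out just the body content (crude, but enough for simple pages).
my ($body) = $item =~ m{<body>(.*)</body>}is;

# Predefined header and footer (placeholders for your site's layout).
my $header = "<html><body><!-- my site header -->\n";
my $footer = "\n<!-- my site footer --></body></html>";

my $formatted = $header . $body . $footer;
print $formatted, "\n";

# Saving to the predefined location and appending an entry to the
# News_Index file would be ordinary open()/print() calls on real files.
```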
Re: Looking for script to pull news from other sites
I am writing a spidering program that does many of the functions you describe, but it is written from the viewpoint of image processing, so it does not do any parsing of HTML yet. However, it is very adept at spidering sites, and allows inclusion of certain file and tag types as well as filtering of specific file name matches. It could probably be converted to handle your task without much trouble.

It is, however, extremely beta, and has some possible memory and configuration problems I am working on. It only works from a prompt and does not have any sort of web or GUI interface. I am working on those aspects as I clean up the spidering logic of the code.

If this program interests you, let me know and I can email you a copy. Otherwise, wait for the final version, which will have a web-based and/or Windows-based interface.
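The inclusion and file-name filtering described above can be sketched as a pair of pattern lists applied to each candidate URL before it is queued for spidering. The extensions and patterns below are illustrative only, not the program's actual configuration:

```perl
#!/usr/bin/perl -w
use strict;

# Candidate links found on a page (illustrative file names).
my @candidates = ('index.html', 'story1.htm', 'logo.gif', 'banner_ad.htm');

# Include only these file types; exclude these name matches.
my @include = (qr/\.html?$/i);
my @exclude = (qr/banner/i);

my @queue;
URL: foreach my $url (@candidates) {
    # Must match at least one inclusion pattern...
    my $ok = 0;
    foreach my $pat (@include) { $ok = 1 if $url =~ $pat; }
    next URL unless $ok;
    # ...and no exclusion pattern.
    foreach my $pat (@exclude) { next URL if $url =~ $pat; }
    push @queue, $url;
}
print "$_\n" for @queue;
```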



------------------
Fred Hirsch
Web Consultant & Programmer