
Mailing List Archive: Wikipedia: Mediawiki

Help with crawling Special:AllPages for small proprietary wiki


christopher.desmarais at sjrb

Jun 20, 2008, 3:42 PM

Post #1 of 3
Help with crawling Special:AllPages for small proprietary wiki

We have a small proprietary wiki, and we would like to be able to search
the entire wiki content daily with SharePoint.

It looks like the easiest way to do that would be to start a crawl at
the Special:AllPages page. However, SharePoint immediately stops any such
crawl because the page has:

<meta name="robots" content="noindex,nofollow" />

We looked for, but can't seem to find, any configuration option that we
set to include those tags. There is no robots.txt file in the root
directory, and we haven't set anything in LocalSettings.php or
DefaultSettings.php to prevent robots from following the page (e.g.
DefaultSettings.php has $wgNamespaceRobotPolicies = array(); and
LocalSettings.php has no robot directives at all).

1) Is this a default setting for the special pages?
2) If it isn't, where can we look for things we might have set that we
can turn off?
3) If it is, is there anything we can turn on to stop that tag from
being put in the page?

If we can't prevent those tags from being inserted, has anyone managed
to use the Special:Export feature with SharePoint? Are there any articles
that might help us solve this problem?

In theory I could write a .NET application to read the anchor tags out
of the Special:AllPages page, then create an .aspx page without the
noindex,nofollow settings that links to the same pages, and crawl that
instead. But surely there's an easier way.
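
The workaround described above (read the anchors out of Special:AllPages and republish them on a page that carries no robots meta tag) can be sketched roughly as follows. This is a minimal illustration in Python rather than .NET; the function name build_link_page and the base URL http://wiki.example/ are hypothetical and not part of the original post. It collects every anchor on the page, so navigation and sidebar links would come along too unless they are filtered out.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_link_page(allpages_html, base_url):
    """Return a bare HTML page listing the links found in allpages_html.

    The output has no robots meta tag, so a crawler is free to follow
    the links. base_url is used to resolve relative hrefs.
    """
    collector = AnchorCollector()
    collector.feed(allpages_html)

    # Resolve relative links and drop duplicates, preserving order.
    links = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        if url not in links:
            links.append(url)

    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in links
    )
    return (
        "<html><head><title>Wiki page list</title></head><body><ul>\n"
        + items
        + "\n</ul></body></html>"
    )
```

The HTML of Special:AllPages would be fetched separately (and on a larger wiki the listing is paginated, so each sub-page would need the same treatment); the resulting page could then be saved wherever the crawler is pointed.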

Thanks,

Chris
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


tstarling at wikimedia

Jun 20, 2008, 3:59 PM

Post #2 of 3
Re: Help with crawling Special:AllPages for small proprietary wiki

Christopher Desmarais (Contractor) wrote:
> We have a small proprietary wiki, and we would like to be able to search
> the entire wiki content daily with SharePoint.
>
> It looks like the easiest way to do that would be to start a crawl at
> the Special:AllPages page. However, SharePoint immediately stops any such
> crawl because the page has:
>
> <meta name="robots" content="noindex,nofollow" />
>
> We looked for, but can't seem to find, any configuration option that we
> set to include those tags. There is no robots.txt file in the root
> directory, and we haven't set anything in LocalSettings.php or
> DefaultSettings.php to prevent robots from following the page (e.g.
> DefaultSettings.php has $wgNamespaceRobotPolicies = array(); and
> LocalSettings.php has no robot directives at all).
>
> 1) Is this a default setting for the special pages?

It's hard-coded for all special pages.

> 2) If it isn't, where can we look for things we might have set that we
> can turn off?
> 3) If it is, is there anything we can turn on to stop that tag from
> being put in the page?

Index: includes/specials/Allpages.php
===================================================================
--- includes/specials/Allpages.php (revision 36353)
+++ includes/specials/Allpages.php (working copy)
@@ -12,6 +12,8 @@
 function wfSpecialAllpages( $par=NULL, $specialPage ) {
 	global $wgRequest, $wgOut, $wgContLang;
 
+	$wgOut->setRobotPolicy( '' );
+
 	# GET values
 	$from = $wgRequest->getVal( 'from' );
 	$namespace = $wgRequest->getInt( 'namespace' );




jidanni at jidanni

Jun 23, 2008, 12:51 PM

Post #3 of 3
Re: Help with crawling Special:AllPages for small proprietary wiki

This is https://bugzilla.wikimedia.org/show_bug.cgi?id=8473

The obvious solution is to make $wgArticleRobotPolicies work as
advertised and not be overridden by hard-wired code.

Also, having users maintain private copies of includes/* usually lasts
only until the next upgrade, when different staff don't know about the
previous tweaks.

Yes, the user could also choose to maintain a sitemap, but that is
beside the point.
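
For reference, the sitemap alternative mentioned above amounts to publishing a list of page URLs in the sitemaps.org XML format. The sketch below is a minimal illustration only: the function name build_sitemap and the example URLs are hypothetical, and nothing in this thread establishes whether SharePoint's crawler will consume such a file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical page URLs; a real list would come from the wiki itself.
    print(build_sitemap([
        "http://wiki.example/wiki/Main_Page",
        "http://wiki.example/wiki/Help",
    ]))
```

The file would have to be regenerated whenever pages are added or renamed, which is the maintenance burden the post alludes to.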

