Mailing List Archive: Wikipedia: Wikitech

Re: commons.wikimedia.org allowing directory indexes and web robots


questpc at rambler

Jul 19, 2009, 11:20 PM

Post #1 of 4
Re: commons.wikimedia.org allowing directory indexes and web robots

* David Gerard <dgerard[at]gmail.com> [Sat, 18 Jul 2009 14:55:28 +0100]:
> 2009/7/18 Robert Rohde <rarohde[at]gmail.com>:
> > On Sat, Jul 18, 2009 at 6:20 AM, David Gerard<dgerard[at]gmail.com> wrote:
>
> >> It'd actually be better if Google properly indexed text pages whose
> >> name ends in .jpg or whatever ... but they're aware we'd like that, so
> >> it's up to them.
>
> > Which is why my personal wiki is patched to translate the ".jpg" into
> > "_jpg", etc. for all references to image description pages.
>
>
> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
> for _jpg to be the default image page name and .jpg an alias for
> backward compatibility? That'd be really helpful in all sorts of ways
> - on pretty much any website *not* running MediaWiki, something ending
> ".jpg" is going to be the image, not a text page.
>
I am not sure that the underscore is the most suitable character,
because in MediaWiki it's interchangeable with the space character. The
type of the document should be determined by its MIME type. If Google
uses the web path "extension" (which is meaningless, by the way, because
that's a virtual path) instead of the MIME type to determine whether the
page should be indexed, that's an amazing bug for Google.
Dmitriy
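
The underscore concern can be illustrated with a minimal Python sketch. It assumes a simplified model of title handling in which underscores and spaces are treated as the same character; the names normalize_title, rename_suffix and IMAGE_SUFFIXES are hypothetical and are not MediaWiki code.

```python
import re

# Hypothetical list of suffixes to rewrite; not taken from MediaWiki.
IMAGE_SUFFIXES = ("jpg", "jpeg", "png", "gif", "svg")


def normalize_title(title: str) -> str:
    """Simplified model: treat underscores and spaces as the same character."""
    return re.sub(r"[ _]+", " ", title).strip()


def rename_suffix(title: str) -> str:
    """Rewrite a trailing '.jpg' (etc.) as '_jpg', as in the patch quoted above."""
    pattern = r"\.(" + "|".join(IMAGE_SUFFIXES) + r")$"
    return re.sub(pattern, r"_\1", title, flags=re.IGNORECASE)


renamed = rename_suffix("File:Sunset.jpg")
print(renamed)                   # File:Sunset_jpg
print(normalize_title(renamed))  # File:Sunset jpg
# Under this model the rewritten name is the same page as "File:Sunset jpg".
print(normalize_title(renamed) == normalize_title("File:Sunset jpg"))  # True
```

Under these assumptions, the "_jpg" form is indistinguishable from a title that simply ends in the word "jpg", which is why a different separator might be preferable.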

_______________________________________________
Wikitech-l mailing list
Wikitech-l[at]lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dgerard at gmail

Jul 20, 2009, 2:08 AM

Post #2 of 4
Re: commons.wikimedia.org allowing directory indexes and web robots

2009/7/20 Dmitriy Sintsov <questpc[at]rambler.ru>:
> * David Gerard <dgerard[at]gmail.com> [Sat, 18 Jul 2009 14:55:28 +0100]:

>> Hmmmmmmmmmmmmmmmm. How much hacking would MediaWiki on Wikimedia need
>> for _jpg to be the default image page name and .jpg an alias for
>> backward compatibility? That'd be really helpful in all sorts of ways
>> - on pretty much any website *not* running MediaWiki, something ending
>> ".jpg" is going to be the image, not a text page.

> I am not sure that the underscore is the most suitable character,
> because in MediaWiki it's interchangeable with the space character.


Or whatever, as long as it doesn't end in .jpg.


> The
> type of the document should be determined by its MIME type. If Google
> uses the web path "extension" (which is meaningless, by the way, because
> that's a virtual path) instead of the MIME type to determine whether the
> page should be indexed, that's an amazing bug for Google.


Yes, it's an amazing bug for Google. It's also the way they do it.


- d.



smolensk at eunet

Jul 20, 2009, 2:45 AM

Post #3 of 4
Re: commons.wikimedia.org allowing directory indexes and web robots

Dmitriy Sintsov wrote:
> because in MediaWiki it's interchangeable with the space character. The
> type of the document should be determined by its MIME type. If Google
> uses the web path "extension" (which is meaningless, by the way, because
> that's a virtual path) instead of the MIME type to determine whether the
> page should be indexed, that's an amazing bug for Google.

It's a necessary evil, however, because a number of servers serve
incorrect MIME types. IIRC, Google previously didn't index our images at
all, but later added MediaWiki as an exception.



Simetrical+wikilist at gmail

Jul 20, 2009, 3:15 PM

Post #4 of 4
Re: commons.wikimedia.org allowing directory indexes and web robots

On Mon, Jul 20, 2009 at 6:20 AM, Dmitriy Sintsov<questpc[at]rambler.ru> wrote:
> I am not sure that the underscore is the most suitable character,
> because in MediaWiki it's interchangeable with the space character. The
> type of the document should be determined by its MIME type. If Google
> uses the web path "extension" (which is meaningless, by the way, because
> that's a virtual path) instead of the MIME type to determine whether the
> page should be indexed, that's an amazing bug for Google.

Maybe they don't retrieve the page in the first place, because they
don't want to waste bandwidth and processing time getting images. It
would be rather a waste to send dozens or hundreds of HEAD requests on
every Flickr page (or whatever) just to make sure that all those
things ending in a suffix universally accepted to designate images
really *are* images.
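
A minimal sketch of that kind of suffix heuristic, in Python. This is an assumption about how a crawler might skip likely images without any network request, not a description of what Google actually does; looks_like_image_url and the suffix list are hypothetical, and the example.org URLs are placeholders.

```python
import posixpath
from urllib.parse import urlsplit

# Suffixes conventionally used for images; an illustrative, incomplete list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def looks_like_image_url(url: str) -> bool:
    """Guess from the URL path alone, without sending any request."""
    path = urlsplit(url).path
    _, suffix = posixpath.splitext(path)
    return suffix.lower() in IMAGE_SUFFIXES


# A direct image link and a MediaWiki file description page (an HTML page)
# look the same to this heuristic.
print(looks_like_image_url("https://example.org/photos/cat.jpg"))                   # True
print(looks_like_image_url("https://commons.wikimedia.org/wiki/File:Example.jpg"))  # True
print(looks_like_image_url("https://example.org/wiki/Main_Page"))                   # False
```

The check is cheap because it never touches the network, which is the trade-off being described: it saves requests at the cost of misjudging text pages whose names end in an image suffix.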

On Mon, Jul 20, 2009 at 9:45 AM, Nikola Smolenski<smolensk[at]eunet.yu> wrote:
> It's a necessary evil, however, because a number of servers serve
> incorrect MIME types.

Well, that would make no difference if you actually downloaded the
content, or the first handful of bytes. It's easy to *very* reliably
distinguish binary image data from HTML if you get to look at the
first several bytes of the file.
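
As a rough illustration of checking the first several bytes, here is a Python sketch. The function name sniff_content is hypothetical; the JPEG, PNG and GIF signatures are the standard leading bytes of those formats, while the HTML check is a deliberately crude assumption (leading whitespace or a UTF-8 byte order mark, then a doctype or html tag).

```python
def sniff_content(prefix: bytes) -> str:
    """Classify content from its first bytes: 'jpeg', 'png', 'gif', 'html' or 'unknown'."""
    if prefix.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if prefix.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if prefix.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # Crude HTML check: skip a UTF-8 byte order mark and leading whitespace.
    if prefix.startswith(b"\xef\xbb\xbf"):
        prefix = prefix[3:]
    text = prefix.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


print(sniff_content(b"\xff\xd8\xff\xe0\x00\x10JFIF"))         # jpeg
print(sniff_content(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # png
print(sniff_content(b"  <!DOCTYPE html><html>"))              # html
```

This only works once some of the body has been fetched, so it addresses wrong MIME types but not the bandwidth concern raised earlier.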

Anyway, I think the "right" way to do this would be to omit the suffix
from the page name entirely, treating the format as an implementation
detail. That way you could, for instance, upload an SVG over a PNG or
a PNG over a JPEG, and have all users be automatically updated without
manually changing the references. This does get a little confusing
when you consider totally different types of media, though, like audio
or video or PDF or whatnot. If NS_FILE (NS_IMAGE) weren't hardcoded
in thirty million places both in code and templates, I might suggest
different namespaces for different media types instead of one unified
File: namespace, but that seems impractical at this point.
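
A toy Python sketch of the suffix-free idea, assuming a store keyed by page name in which the media type is recorded alongside the data rather than in the title. FileRegistry, StoredFile and their methods are hypothetical and do not correspond to anything in MediaWiki.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    media_type: str  # e.g. "image/png"; an implementation detail of the page
    data: bytes


class FileRegistry:
    """Hypothetical store keyed by suffix-free page names."""

    def __init__(self) -> None:
        self._files = {}

    def upload(self, title: str, media_type: str, data: bytes) -> None:
        """Store a new version under the same title, whatever its format."""
        self._files[title] = StoredFile(media_type, data)

    def resolve(self, title: str) -> StoredFile:
        """Return whatever file currently backs the title."""
        return self._files[title]


registry = FileRegistry()
registry.upload("File:Logo", "image/png", b"placeholder png bytes")
registry.upload("File:Logo", "image/svg+xml", b"<svg/>")  # SVG uploaded over the PNG
print(registry.resolve("File:Logo").media_type)  # image/svg+xml
```

Every reference to "File:Logo" picks up the new format without being edited, which is the benefit described above; the sketch does not attempt to address the mixed-media confusion.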

