Mailing List Archive: Wikipedia: Wikitech

Html code

 

 



send.to.khalida at gmail

Nov 27, 2011, 12:31 PM

Post #1 of 4

Hello!

In the HTML code of a Wikipedia article, how can I recognise the
*first* sentence of the article?
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


derhoermi at gmx

Nov 27, 2011, 12:47 PM

Post #2 of 4
Re: Html code

* Khalida BEN SIDI AHMED wrote:
>In the HTML code of a Wikipedia article, how can I recognise the
>*first* sentence of the article?

It's not marked up and probably differs among language versions. On the
English version, the first `p` child of the element with class
`mw-content-ltr` is a good bet, as I pointed out earlier, for identifying
the first paragraph. It would then be necessary to find the full stop at
the end of the first sentence; criteria for that include that a space or
the end of the paragraph follows and that it is not inside some nesting
construct like parentheses.
http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation discusses
some of the problems and includes pointers to some solutions.
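
The full-stop criteria above can be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not a complete sentence boundary detector: the function name `first_sentence` is hypothetical, the input is assumed to be the plain text of the first paragraph with markup already removed, and only round and square brackets are treated as nesting constructs.

```python
def first_sentence(paragraph: str) -> str:
    """Return the text up to the first full stop that looks like a sentence end.

    Heuristic only: a full stop counts when it is outside any parentheses or
    brackets and is followed by whitespace or by the end of the paragraph.
    """
    depth = 0  # current nesting level of ( ) and [ ]
    for i, ch in enumerate(paragraph):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
        elif ch == "." and depth == 0:
            following = paragraph[i + 1 : i + 2]
            if following == "" or following.isspace():
                return paragraph[: i + 1].strip()
    # No qualifying full stop found: fall back to the whole paragraph.
    return paragraph.strip()
```

For example, `first_sentence("Version 1.5 was released")` returns the whole string, because the only full stop is followed by a digit. Abbreviations such as "Dr. Smith" would still be cut too early, which is one of the problems the linked article discusses.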
--
Björn Höhrmann · mailto:bjoern [at] hoehrmann · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/



send.to.khalida at gmail

Nov 27, 2011, 1:05 PM

Post #3 of 4
Re: Html code

Thank you very much. That is exactly what I wanted to know.



z at mzmcbride

Nov 27, 2011, 1:22 PM

Post #4 of 4
Re: Html code

Bjoern Hoehrmann wrote:
> * Khalida BEN SIDI AHMED wrote:
>> In the HTML code of a Wikipedia article, how can I recognise the
>> *first* sentence of the article?
>
> It's not marked up and probably differs among language versions. On the
> English version, the first `p` child of the element with class
> `mw-content-ltr` is a good bet, as I pointed out earlier, for identifying
> the first paragraph. It would then be necessary to find the full stop at
> the end of the first sentence; criteria for that include that a space or
> the end of the paragraph follows and that it is not inside some nesting
> construct like parentheses.
> http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation discusses
> some of the problems and includes pointers to some solutions.

I've found you have to be careful with the <p> trick. Sometimes geo
coordinates in the article will mistakenly use a <p> or one will slip into
an infobox or a hatnote. There are also other edge cases like disambiguation
pages (where the first "sentence" often ends in a colon, at least on the
English Wikipedia). I'm not sure if anyone has put together a comprehensive
set of edge cases.
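
To illustrate a couple of these edge cases, here is a rough Python sketch using only the standard library's `html.parser`. It is not a comprehensive filter. It assumes the article body is an element with class `mw-content-ltr`, that coordinates are wrapped in an element with `id="coordinates"`, and that every `<p>` is explicitly closed; the class name `FirstParagraph` and the function `first_paragraph` are hypothetical. Requiring the `<p>` to be a direct child of the content element skips paragraphs nested in infoboxes or hatnotes.

```python
from html.parser import HTMLParser

CONTENT_CLASS = "mw-content-ltr"  # assumed class of the article body element
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class FirstParagraph(HTMLParser):
    """Collect the text of the first usable <p> directly under the content element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []         # open elements as (tag, set of class names)
        self.capturing = False  # True while inside a candidate <p>
        self.p_depth = 0        # stack depth at which the candidate <p> opened
        self.skip = False       # True if the candidate holds coordinates
        self.buf = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never stay open
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self.result is None:
            if self.capturing:
                if attrs.get("id") == "coordinates":
                    self.skip = True  # a coordinates paragraph, not prose
            elif tag == "p" and self.stack and CONTENT_CLASS in self.stack[-1][1]:
                self.capturing, self.skip = True, False
                self.p_depth, self.buf = len(self.stack), []
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        else:
            return
        if self.capturing and len(self.stack) <= self.p_depth:
            text = " ".join("".join(self.buf).split())
            if text and not self.skip:  # ignore empty or coordinate paragraphs
                self.result = text
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.buf.append(data)


def first_paragraph(html: str):
    """Return the first usable paragraph's text, or None if none is found."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.result
```

Disambiguation pages, where the first "sentence" ends in a colon, are not handled here and would need a separate check.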

The real answer here seems to be switching to an architecture that makes the
distinction explicit. I don't imagine that'll be happening any time soon,
though.

MZMcBride


