
Mailing List Archive: Wikipedia: Wikitech

Re: [WikiEN-l] Extracting main titles from enwiki-latest-all-titles-in-ns0.gz

 

 



dgerard at gmail

Dec 12, 2009, 7:35 AM

Post #1 of 2 (631 views)
Re: [WikiEN-l] Extracting main titles from enwiki-latest-all-titles-in-ns0.gz

This is probably not the right place; you would want wikitech-l (where
I've cc'ed this reply).


- d.



2009/12/11 Behrang Saeedzadeh <behrangsa [at] gmail>:
> Hi,
>
> I have downloaded enwiki-latest-all-titles-in-ns0.gz and I want to extract
> main titles and store them in another file. For example, some titles have
> meta information (e.g. disambiguation etc.) and I want these to be removed.
> Can I remove all the text between parentheses from the titles to achieve
> this?
>
> Also some titles start with the "!" character, and some are enclosed between
> two or three of them such as !Adiso_Amigos!. What is the purpose of "!" in
> such cases? Also, why are some titles enclosed between two double quotes such
> as "400_Years_of_Telescope"?
>
> Finally, is there a document describing all these conventions?
>
> P.S: Is this the right place to ask such questions?
>
> Cheers,
> Behrang Saeedzadeh
> -------------------------------
> http://my.opera.com/behrangsa
> http://twitter.com/behrangsa
> _______________________________________________
> WikiEN-l mailing list
> WikiEN-l [at] lists
> To unsubscribe from this mailing list, visit:
> https://lists.wikimedia.org/mailman/listinfo/wikien-l
>

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


marco at harddisk

Dec 12, 2009, 7:45 AM

Post #2 of 2 (581 views)
Re: [WikiEN-l] Extracting main titles from enwiki-latest-all-titles-in-ns0.gz

Hi,

On Sat, Dec 12, 2009 at 4:35 PM, David Gerard <dgerard [at] gmail> wrote:

>
> 2009/12/11 Behrang Saeedzadeh <behrangsa [at] gmail>:
> > Hi,
> >
> > I have downloaded enwiki-latest-all-titles-in-ns0.gz and I want to extract
> > main titles and store them in another file. For example, some titles have
> > meta information (e.g. disambiguation etc.) and I want these to be removed.
> > Can I remove all the text between parentheses from the titles to achieve
> > this?
> >
>
You have to parse it by hand.
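
As a rough illustration of what parsing by hand could look like, here is a minimal Python sketch. It rests on assumptions not confirmed in this thread: that the file holds one title per line, UTF-8 encoded, with underscores in place of spaces, and that a disambiguation qualifier is always a trailing "_(...)" suffix. The names strip_qualifier and main_titles are hypothetical. It deliberately does not remove every pair of parentheses, since parentheses elsewhere in a title may be part of the name.

```python
import gzip
import re

# Assumption: a qualifier is one trailing "_(...)" group with no nested parentheses.
_QUALIFIER = re.compile(r"_\([^()]*\)$")


def strip_qualifier(title: str) -> str:
    """Drop one trailing "_(...)" suffix and leave the rest of the title untouched."""
    stripped = _QUALIFIER.sub("", title)
    # If stripping would leave nothing, keep the original title.
    return stripped or title


def main_titles(path: str):
    """Yield de-duplicated base titles from a gzipped, one-title-per-line file."""
    seen = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            title = strip_qualifier(line.rstrip("\n"))
            if title and title not in seen:
                seen.add(title)
                yield title
```

Calling main_titles("enwiki-latest-all-titles-in-ns0.gz") and writing each yielded title to a new file would give a first approximation. Titles whose real name ends in a parenthesised phrase would be shortened too, so the result still needs checking by eye.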


> > Also some titles start with the "!" character, and some are enclosed between
> > two or three of them such as !Adiso_Amigos!. What is the purpose of "!" in
> > such cases?

It's part of the topic's name (in the case of
<http://en.wikipedia.org/wiki/%C2%A1Adios_Amigos!>, the album's title). The
inverted exclamation mark (¡) is standard Spanish punctuation.

> > Also, why are some titles enclosed between two double quotes such
> > as "400_Years_of_Telescope"?
>
Same case: the " characters are part of the topic's name (e.g.
<http://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic>).

Marco

PS: Next time, please copy and paste the titles exactly so people have a
chance to see what you mean. Both of your examples had to be corrected; the
second one was missing a "the":
<http://en.wikipedia.org/wiki/"400_Years_of_the_Telescope">


--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de
