
Mailing List Archive: Wikipedia: Wikitech

Questions about fields in wiki dumps

 

 



jeff.kubina at gmail

Nov 10, 2009, 4:52 AM

Post #1 of 10 (827 views)
Questions about fields in wiki dumps

I am working with some enwiki-{YYYYMMDD}-stub-meta-history.xml dumps and
wanted to get clarification on how certain fields of the articles can
change:

1. What action will make an article get a new pageId? Is it
only move/rename, a redirect, or a deletion and recreation, or are there
other ways this could happen? Can any of these changes be detected from the
stub-meta-history.xml files?

2. Is it possible for just one particular revision of an article to be
deleted, maybe due to a copyright violation? If so, is just the content of
the revision deleted or would this include all the data associated with it,
so that the revision would not even appear in the stub-meta-history.xml
file?

3. Are pageIds recycled? If a page is deleted, could its id number be used
for a completely new page in the future?

Thanks,
Jeff
--
Jeff Kubina
http://google.com/profiles/jeff.kubina
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Nov 10, 2009, 5:00 AM

Post #2 of 10 (799 views)
Re: Questions about fields in wiki dumps [In reply to]

2009/11/10 Jeff Kubina <jeff.kubina [at] gmail>:
> I am working with some enwiki-{YYYYMMDD}-stub-meta-history.xml dumps and
> wanted to get clarification on how certain fields of the articles can
> change:
>
> 1. What action will make an article get a new pageId? Is it
> only move/rename, a redirect, or a deletion and recreation, or are there
> other ways this could happen? Can any of these changes be detected from the
> stub-meta-history.xml files?
>
When a page is moved, it'll change its name but keep its pageid. A
redirect will be created at the old name with a new pageid.

> 2. Is it possible for just one particular revision of an article to be
> deleted, maybe due to a copyright violation? If so, is just the content of
> the revision deleted or would this include all the data associated with it,
> so that the revision would not even appear in the stub-meta-history.xml
> file?
>
Yes. In this case, any trace of the revision ever having existed is
gone from the dumps, AFAIK.

> 3. Are pageIds recycled? If a page is deleted, could its id number be used
> for a completely new page in the future?
>
No, pageids are never recycled.

Roan Kattouw (Catrope)



Simetrical+wikilist at gmail

Nov 10, 2009, 7:59 AM

Post #3 of 10 (792 views)
Re: Questions about fields in wiki dumps [In reply to]

On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina <jeff.kubina [at] gmail> wrote:
> 1. What action will make an article get a new pageId? Is it
> only move/rename, a redirect, or a deletion and recreation, or are there
> other ways this could happen? Can any of these changes be detected from the
> stub-meta-history.xml files?

Normal deletion/undeletion, moving, or similar things will not create
a new page_id. However, there are a couple of things to be aware of:

1) In the old days, deleting an article and recreating it would assign
it a new page_id. This hasn't been true for several years.

2) It's still possible to get revisions associated with a different
page_id than they were originally written for, by deleting a page,
moving another page over it, and undeleting one or more revisions.
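
The case in point 2 can in principle be spotted by comparing two dumps: a revision id that is listed under one page id in the older dump and under a different page id in the newer one has been reattached. Below is a minimal Python sketch, assuming both dumps can be read as streams and that the revision-to-page map fits in memory (doubtful for a full enwiki history without sharding). The function names are made up for illustration, and page id 1000 in the sample data is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def revision_owner(source):
    """Stream a stub dump and map each revision id to the page id listing it."""
    owner = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id = None
        rev_ids = []
        for child in elem:  # direct children of <page> only
            name = local(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "revision":
                for sub in child:  # direct children, so contributor ids are skipped
                    if local(sub.tag) == "id":
                        rev_ids.append(int(sub.text))
                        break
        for rev_id in rev_ids:
            owner[rev_id] = page_id
        elem.clear()  # drop the revisions already processed
    return owner


def moved_revisions(old_source, new_source):
    """Revisions present in both dumps whose owning page id differs."""
    old, new = revision_owner(old_source), revision_owner(new_source)
    return {r: (old[r], new[r]) for r in old.keys() & new.keys() if old[r] != new[r]}


# Sample data: revision ids from this thread, page id 1000 is hypothetical.
OLD = (b"<mediawiki><page><title>AmericanSamoa</title><id>6</id>"
       b"<revision><id>233188</id></revision>"
       b"<revision><id>15898942</id></revision></page></mediawiki>")
NEW = (b"<mediawiki><page><title>American Samoa</title><id>1000</id>"
       b"<revision><id>233188</id></revision></page></mediawiki>")
print(moved_revisions(io.BytesIO(OLD), io.BytesIO(NEW)))
```

Revisions that exist in only one of the two dumps are ignored here; those would need a separate check for deletion or suppression.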

> 2. Is it possible for just one particular revision of an article to be
> deleted, maybe due to a copyright violation? If so, is just the content of
> the revision deleted or would this include all the data associated with it,
> so that the revision would not even appear in the stub-meta-history.xml
> file?

Yes, an individual revision can be deleted. There are at least three
different ways to do this, last I checked. I would expect that the
old ways (oversight, and delete+selective undelete) would leave no
traces at all in the dump, while the new way (rev_deleted) might only
suppress certain fields. I'm not sure offhand, though.

> 3. Are pageIds recycled? If a page is deleted, could its id number be used
> for a completely new page in the future?

No. page_ids are handed out in strictly increasing order.



happy-melon at live

Nov 10, 2009, 1:15 PM

Post #4 of 10 (792 views)
Re: Questions about fields in wiki dumps [In reply to]

"Aryeh Gregor" <Simetrical+wikilist [at] gmail> wrote in message
news:7c2a12e20911100759s1ba211b0k6ef6cb076449be37 [at] mail
> On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina <jeff.kubina [at] gmail>
> wrote:
>> 2. Is it possible for just one particular revision of an article to be
>> deleted, maybe due to a copyright violation? If so, is just the content of
>> the revision deleted or would this include all the data associated with it,
>> so that the revision would not even appear in the stub-meta-history.xml
>> file?
>
> Yes, an individual revision can be deleted. There are at least three
> different ways to do this, last I checked. I would expect that the
> old ways (oversight, and delete+selective undelete) would leave no
> traces at all in the dump, while the new way (rev_deleted) might only
> suppress certain fields. I'm not sure offhand, though.

IIRC, any revision that has any of the rev_deleted bitfields set will be
excluded from dumps. Don't quote me on that....

--HM






jeff.kubina at gmail

Nov 10, 2009, 2:48 PM

Post #5 of 10 (793 views)
Re: Questions about fields in wiki dumps [In reply to]

Thanks for the help, but I'm still a bit confused about this case: in
enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of
6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an
id of 23741520 <http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa>,
with only the last edit history entry. So what happened? Is this an example of
a delete, then a restore with a new id? Why are the older revisions missing, or
does a restore only restore the latest revision?

XML from enwiki-20090714-stub-meta-history.xml for AmericanSamoa:
<page>
  <title>AmericanSamoa</title>
  <id>6</id>
  <redirect />
  <revision>
    <id>233188</id>
    <timestamp>2001-01-19T01:12:51Z</timestamp>
    <contributor>
      <ip>office.bomis.com</ip>
    </contributor>
    <comment>*</comment>
    <text id="233188" />
  </revision>
  <revision>
    <id>15898942</id>
    <timestamp>2002-02-25T15:43:11Z</timestamp>
    <contributor>
      <ip>Conversion script</ip>
    </contributor>
    <minor/>
    <comment>Automated conversion</comment>
    <text id="15898942" />
  </revision>
  <revision>
    <id>18063795</id>
    <timestamp>2005-07-03T11:14:17Z</timestamp>
    <contributor>
      <username>Docu</username>
      <id>8029</id>
    </contributor>
    <minor/>
    <comment>adding to cur_id=6 {{R from CamelCase}}</comment>
    <text id="18058393" />
  </revision>
  <revision>
    <id>133180191</id>
    <timestamp>2007-05-24T14:41:33Z</timestamp>
    <contributor>
      <username>Ngaiklin</username>
      <id>4477979</id>
    </contributor>
    <minor/>
    <comment>Robot: Automated text replacement (-\[\[(.*?[\:|\|])*?(.+?)\]\] +\g&lt;2&gt;)</comment>
    <text id="132462505" />
  </revision>
  <revision>
    <id>133452270</id>
    <timestamp>2007-05-25T17:12:06Z</timestamp>
    <contributor>
      <username>Gurch</username>
      <id>241822</id>
    </contributor>
    <minor/>
    <comment>Revert edit(s) by [[Special:Contributions/Ngaiklin|Ngaiklin]] to last version by [[Special:Contributions/Docu|Docu]]</comment>
    <text id="132732979" />
  </revision>
</page>

Thanks,
Jeff
--
Jeff Kubina
http://google.com/profiles/jeff.kubina
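
A change like the one described above (same title, different page id) can be found mechanically by building a title-to-id map for each dump and comparing them. Below is a minimal Python sketch, assuming the dumps are readable as streams and that one dictionary per dump fits in memory. The function names are illustrative and not part of any dump tooling.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def title_to_page_id(source):
    """Stream a stub dump and map each page title to its page id."""
    ids = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = None
        for child in elem:  # direct children only, so revision ids are skipped
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = int(child.text)
        if title is not None and page_id is not None:
            ids[title] = page_id
        elem.clear()  # free the revisions already passed
    return ids


def changed_page_ids(old_source, new_source):
    """Titles present in both dumps whose page id differs: title -> (old, new)."""
    old, new = title_to_page_id(old_source), title_to_page_id(new_source)
    return {t: (old[t], new[t]) for t in old.keys() & new.keys() if old[t] != new[t]}


# Sample data reduced from the two dumps discussed in this message.
JULY = (b"<mediawiki><page><title>AmericanSamoa</title><id>6</id>"
        b"<revision><id>233188</id></revision></page></mediawiki>")
SEPT = (b"<mediawiki><page><title>AmericanSamoa</title><id>23741520</id>"
        b"<revision><id>133452270</id></revision></page></mediawiki>")
print(changed_page_ids(io.BytesIO(JULY), io.BytesIO(SEPT)))
```

A title match alone does not say why the id changed; the page log still has to be consulted for that.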



rarohde at gmail

Nov 10, 2009, 3:24 PM

Post #6 of 10 (794 views)
Re: Questions about fields in wiki dumps [In reply to]

On Tue, Nov 10, 2009 at 1:15 PM, Happy-melon <happy-melon [at] live> wrote:
>
> "Aryeh Gregor" <Simetrical+wikilist [at] gmail> wrote in message
> news:7c2a12e20911100759s1ba211b0k6ef6cb076449be37 [at] mail
>> On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina <jeff.kubina [at] gmail>
>> wrote:
>>> 2. Is it possible for just one particular revision of an article to be
>>> deleted, maybe due to a copyright violation? If so, is just the content of
>>> the revision deleted or would this include all the data associated with it,
>>> so that the revision would not even appear in the stub-meta-history.xml
>>> file?
>>
>> Yes, an individual revision can be deleted.  There are at least three
>> different ways to do this, last I checked.  I would expect that the
>> old ways (oversight, and delete+selective undelete) would leave no
>> traces at all in the dump, while the new way (rev_deleted) might only
>> suppress certain fields.  I'm not sure offhand, though.
>
> IIRC, any revision that has any of the rev_deleted bitfields set will be
> excluded from dumps.  Don't quote me on that....

I'm not sure what the criteria actually are, but I recall encountering
a dump entry where the editor's name had been suppressed (missing in
the revision) but where the revision text itself was present. (I had
an analysis script choke on this, since up to that time I had assumed
every revision would have valid contributor information attached to
it.)
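
A script can avoid choking on such entries by treating the contributor as optional. Below is a minimal Python sketch, assuming namespace-free elements as in the excerpt earlier in this thread (real dumps declare a default namespace, so tag names would need qualifying). The function name and the returned labels are made up for illustration.

```python
import xml.etree.ElementTree as ET


def contributor_of(revision):
    """Return (kind, value) for a <revision> element without raising on gaps."""
    contributor = revision.find("contributor")
    if contributor is None:
        return ("missing", None)      # no <contributor> element at all
    username = contributor.findtext("username")
    if username:
        return ("user", username)     # registered account
    ip = contributor.findtext("ip")
    if ip:
        return ("ip", ip)             # anonymous edit or legacy host name
    return ("suppressed", None)       # element present but empty


rev = ET.fromstring(
    "<revision><id>18063795</id>"
    "<contributor><username>Docu</username><id>8029</id></contributor>"
    "</revision>"
)
print(contributor_of(rev))
```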

-Robert Rohde



Platonides at gmail

Nov 10, 2009, 3:30 PM

Post #7 of 10 (792 views)
Re: Questions about fields in wiki dumps [In reply to]

Happy-melon wrote:
>> Yes, an individual revision can be deleted. There are at least three
>> different ways to do this, last I checked. I would expect that the
>> old ways (oversight, and delete+selective undelete) would leave no
>> traces at all in the dump, while the new way (rev_deleted) might only
>> suppress certain fields. I'm not sure offhand, though.
>
> IIRC, any revision that has any of the rev_deleted bitfields set will be
> excluded from dumps. Don't quote me on that....
>
> --HM

They will appear with a deleted="deleted" attribute, so the content of
the suppressed fields isn't available, but that of the other fields is.
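
Checking for that attribute takes only a few lines. Below is a minimal Python sketch that lists which direct children of a revision are marked deleted="deleted". Which elements can actually carry the attribute is an assumption here, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def suppressed_fields(revision):
    """Names of direct child elements of <revision> marked deleted="deleted"."""
    return [
        child.tag.rsplit("}", 1)[-1]  # drop a namespace prefix if present
        for child in revision
        if child.get("deleted") == "deleted"
    ]


# Hypothetical revision with a suppressed contributor and visible comment.
rev = ET.fromstring(
    '<revision><id>1</id><contributor deleted="deleted" />'
    '<comment>fix typo</comment><text id="1" /></revision>'
)
print(suppressed_fields(rev))
```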




rarohde at gmail

Nov 10, 2009, 3:32 PM

Post #8 of 10 (793 views)
Re: Questions about fields in wiki dumps [In reply to]

On Tue, Nov 10, 2009 at 2:48 PM, Jeff Kubina <jeff.kubina [at] gmail> wrote:
> Thanks for the help, but I'm still a bit confused about this case: in
> enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of
> 6; as shown below. But, in enwiki-20090914-stub-meta-history.xml it has an
> id of 23741520 <http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa>,
> with only the last edit history entry. So what happened? Is this an example of
> a delete, then restore with a new id? Why are the older revisions missing or
> does a restore only restore the latest revision?

I assume the Page ID answer lies with whatever the hell Graham87 was
doing here in July:

http://en.wikipedia.org/w/index.php?title=Special:Log&page=AmericanSamoa

Also, if you use a URL GET, such as you have above, it only gives the
most recent revision. You can uncheck the "Include only the current
revision" box at Special:Export if you want to get additional
revisions from the online form.

-Robert Rohde



Platonides at gmail

Nov 10, 2009, 3:41 PM

Post #9 of 10 (792 views)
Re: Questions about fields in wiki dumps [In reply to]

Jeff Kubina wrote:
> Thanks for the help, but I'm still a bit confused about this case: in
> enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of
> 6; as shown below. But, in enwiki-20090914-stub-meta-history.xml it has an
> id of 23741520 <http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa>,
> with only the last edit history entry. So what happened? Is this an example of
> a delete, then restore with a new id? Why are the older revisions missing or
> does a restore only restore the latest revision?

See http://en.wikipedia.org/w/index.php?title=Special:Log&page=AmericanSamoa

There was quite a bit of deletion, move, and undeletion trickery to move
the first revision in the XML (the one from office.bomis.com) to the
history of American_Samoa.
http://en.wikipedia.org/w/index.php?title=American_Samoa&oldid=233188

It seems the AmericanSamoa page was recreated, with a new page id, during that.

There's another id oddity on that page: the office edit is from
January 2001 and has revid 233188, yet the diff links (wrongly) list as
its previous revision one from July 2002 with a revid of 205006.
It is listed as previous because 205006 < 233188. The older revision
has the higher revid because originally only the current versions of
articles were imported from UseModWiki (those that are tagged as from
Conversion script). Older edits like this one were imported later, after
that 205006 edit was made.
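
For anyone processing the dumps, the practical consequence is that revisions should be ordered by timestamp rather than by revid. Below is a minimal Python sketch using the two revisions mentioned above. The exact timestamp of revision 205006 is not given in this thread, so the July 2002 value below is a placeholder.

```python
from datetime import datetime


def chronological(revisions):
    """Order (rev_id, timestamp) pairs by timestamp, using rev_id to break ties."""
    return sorted(
        revisions,
        key=lambda r: (datetime.strptime(r[1], "%Y-%m-%dT%H:%M:%SZ"), r[0]),
    )


revs = [
    (233188, "2001-01-19T01:12:51Z"),  # timestamp from the dump excerpt
    (205006, "2002-07-01T00:00:00Z"),  # placeholder day and time in July 2002
]
by_id = sorted(revs)             # 205006 sorts first, which is misleading
by_time = chronological(revs)    # 233188 sorts first, matching edit history
```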




glimmer_phoenix at yahoo

Nov 11, 2009, 5:30 PM

Post #10 of 10 (753 views)
Re: Questions about fields in wiki dumps [In reply to]

--- On Wed, 11/11/09, Robert Rohde <rarohde [at] gmail> wrote:

> I'm not sure what the criteria actually are, but I recall encountering
> a dump entry where the editor's name had been suppressed (missing in
> the revision) but where the revision text itself was present. (I had
> an analysis script choke on this, since up to that time I had assumed
> every revision would have valid contributor information attached to
> it.)

Yes, actually that case forced updates to some parsers like mine, since they had not been written to expect empty fields in revisions (especially the rev_user field).

F --
