Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Wikipedia: Wikitech

Unclean UTF-8 in dump, due to stale templatelinks

 

 

Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded


lars at aronsson

Aug 16, 2006, 2:32 PM

Post #1 of 1 (172 views)
Permalink
Unclean UTF-8 in dump, due to stale templatelinks

When I download the dump
http://download.wikimedia.org/svwiki/20060808/svwiki-20060808-templatelinks.sql.gz

and uncompress it, I find violations of UTF-8 (apparently
remainders of ISO-8859-1) in these records:


0xf6 in (141524,10,'F\xf6rfattarstub'),
0xf6 in (154217,10,'Geografistub-Gr\×f6nland'),
0xe4 in (147111,10,'Japanskt_L\xe4n'),
0xe4 in (145703,10,'Motorv\xe4gar_i_Sverige'),
0xc4 in (125122,10,'RA\xc4'),
0xe5 in (146360,10,'Sk\xe5despelarstub'),
0xf6 and 0xd6 in (160822,10,'S\xf6dra_\xd6sterbotten'),
0xe4 in (145703,10,'TrafikplatsLandsv\xe4g'),

I could still import this SQL dump into mysql (4.0), but when I
open the SQL dump file in GNU Emacs (22.0.50) it doesn't go into
Unicode mode as it does for a clean UTF-8 file.

I've found no errors in some other files I've looked at.

This is a total of 9 violations in 8 records referring to 7
different pages (page ID 145703 appears twice) out of 330,000
records, so no real reason for panic. These seven page IDs are
not present in the page.sql dump, so apparently stale link records
that should have been removed from the database. If I run the
inner join:

select page_namespace, page_title, tl_namespace, tl_title
from page, templatelinks where page_id = tl_from;

the result is clean UTF-8. But the result is 2076 rows shorter
than the templatelinks table:

select count(*)
from page, templatelinks where page_id=tl_from;
328349

select count(*)
from templatelinks;
330425


--
Lars Aronsson (lars [at] aronsson)
Aronsson Datateknik - http://aronsson.se
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] wikimedia
http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.