RDF - Convert UTF-8 to ISO-8859-1

Hi Guys,

Any ideas on how to convert the existing Open Directory Database RDF file http://dmoz.org/rdf/content.rdf.u8.gz from UTF-8 to ISO-8859-1, the format they used before?

As I understand it, an ISO-8859-1 file will, after a complete import into Links SQL 2.04, correctly support the following languages: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish.

Best, Tomas
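
Editor's note: one way to check this for a particular string is to test whether every character in it has a code point below 256, which is the range ISO-8859-1 can represent. A minimal sketch, assuming Perl 5.8 or later with the core Encode module; `fits_latin1` is a made-up name and not part of Links SQL.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the UTF-8 byte string can be stored
# in ISO-8859-1 without losing any character.
sub fits_latin1 {
    my ($utf8_bytes) = @_;
    # Malformed input decodes to U+FFFD, which also fails the test below.
    my $chars = decode('UTF-8', $utf8_bytes);
    # ISO-8859-1 covers exactly the code points 0x00 to 0xFF.
    return $chars =~ /[^\x00-\xff]/ ? 0 : 1;
}

# "Malmö" (Swedish) as UTF-8 bytes, then two Chinese characters.
print fits_latin1("Malm\xc3\xb6")             ? "fits\n" : "lossy\n";
print fits_latin1("\xe4\xb8\xad\xe6\x96\x87") ? "fits\n" : "lossy\n";
```

Running something like this over category names before an import would show which sections of the directory cannot survive the conversion.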



Re: RDF - Convert UTF-8 to ISO-8859-1
We just finished an import of the World/Chinese section of the rdf directory and ended up doing this:

1. Use 'vi' to pull out the section we wanted.
2. Use the following script and the Unicode::MapUTF8 module to convert the charset; you would use 'ISO-8859-1' instead of 'Big5'.

Code:
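use strict;
use warnings;
use Unicode::MapUTF8 qw(from_utf8);

# Slurp the extracted section (this loads the whole file into memory).
open my $in, '<', '/root/chinese.rdf' or die "Cannot read input: $!";
binmode $in;
my $data = do { local $/; <$in> };
close $in;

# Convert each run of high-bit bytes (the UTF-8 sequences) to the target charset.
$data =~ s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'BIG5' })/eg;

open my $out, '>', '/root/chinese.big5.rdf' or die "Cannot write output: $!";
binmode $out;
print {$out} $data;
close $out or die "Cannot close output: $!";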
3. Run nph-import as normal on the new file.

Hope this helps. The key is the Unicode::MapUTF8 module.

Cheers,

Alex

--
Gossamer Threads Inc.

Re: RDF - Convert UTF-8 to ISO-8859-1
Alex,

Thanks a lot for these lines :-) I will give it a try.

I initially handled the Swedish characters with this simple shell script.

#!/bin/sh
sed 's/Ã¥/å/g' content.rdf.u8 >2
sed 's/Ã¤/ä/g' 2 >3
sed 's/Ã¶/ö/g' 3 >4
sed 's/Ã…/Å/g' 4 >5
sed 's/Ã„/Ä/g' 5 >6
sed 's/Ã–/Ö/g' 6 >content.rdf.u8.sv
rm -f 2 3 4 5 6

/Tomas

Re: RDF - Convert UTF-8 to ISO-8859-1
Alex,

I tried converting the complete RDF data at once as you recommended, but it takes up too much memory. Is there another way to do it that does not require as much memory?

Best, Soobe
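
Editor's note: one way to keep memory use flat is to convert the file a line at a time instead of reading it all into a single variable. A minimal sketch, assuming Perl 5.8 or later and its built-in `:encoding` I/O layers rather than the Unicode::MapUTF8 approach from Alex's script; the sub name `convert_file` and the output file name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $src to $dst one line at a time, decoding
# UTF-8 on the way in and encoding ISO-8859-1 on the way out.
# Only the current line is held in memory.
sub convert_file {
    my ($src, $dst) = @_;
    open my $in,  '<:encoding(UTF-8)',      $src or die "Cannot read $src: $!";
    open my $out, '>:encoding(ISO-8859-1)', $dst or die "Cannot write $dst: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $dst: $!";
    return;
}

# Usage (the output name is a placeholder):
# convert_file('content.rdf.u8', 'content.rdf.latin1');
```

Characters that have no ISO-8859-1 equivalent (the World/Chinese section, for example) cannot be written as-is; as far as I know the output layer warns and substitutes an escape sequence for them, so it is worth checking the result for those sections.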

Re: [Alex] RDF - Convert UTF-8 to ISO-8859-1
Hi Alex. This looks like a solution for me too. I'm having problems with foreign characters in the DMOZ import script.

Is there a list of char-types that I can convert from? There are a lot of characters that vary from category to category... I'm going to try to write a script that goes through the database and replaces the characters according to the category.

Any ideas?

TIA

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
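
Editor's note: Alex's script already imports `utf8_supported_charset` from Unicode::MapUTF8, which is presumably the place to look for that module's list. For the core Encode module (an alternative, not what the thread uses), a minimal sketch of asking which charset names are available could look like this; `charset_supported` and `list_charsets` are made-up names.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: true if the local Encode installation knows $name.
# Lookups accept aliases, e.g. 'latin1' for 'iso-8859-1'.
sub charset_supported {
    my ($name) = @_;
    return find_encoding($name) ? 1 : 0;
}

# Hypothetical helper: every encoding name Encode can load on this machine.
sub list_charsets {
    return Encode->encodings(':all');
}

# Usage:
# print "$_\n" for list_charsets();
# print "ok\n" if charset_supported('ISO-8859-1');
```

The exact list depends on which Encode sub-modules are installed, so it is best checked on the server that runs the import.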

Re: [Andy] RDF - Convert UTF-8 to ISO-8859-1
This one really helped me. I just included it in my import script but cannot say anything about memory usage...

http://www.gossamer-threads.com/...orum.cgi?post=255894

Regards

Niko

Re: [el noe] RDF - Convert UTF-8 to ISO-8859-1
Yeah, I managed to get it all working a few days ago. Ended up using Encode::UTF8 to replace the invalid characters.

Thanks anyway.

Andy (mod)

Re: [Andy] RDF - Convert UTF-8 to ISO-8859-1
Weird... it's doing stuff like this now :/

ID : 405533 Full: World/Catalí /Arts i cultura/Cinema/Pel·lí­cules/G/Gattaca Name: Gattaca

New_Full: World/Catal/Arts i cultura/Cinema/Pel·lcules/G/Gattaca Name: Gattaca

It's really weird... it just keeps replacing the non-English characters with · and stripping out letters that have already been converted.

Anyone got any ideas?

My code is:

$full =~ s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1'})/eg;
$name =~ s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1'})/eg;

TIA

Andy (mod)
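
Editor's note: a possible explanation, and it is only a guess since nothing in the thread confirms it, is that the substitution runs on strings that are no longer UTF-8 (for example because they were already converted once). The high bytes it finds are then not valid UTF-8 sequences and get mangled or dropped. A minimal sketch of a guard, assuming Perl 5.8 or later and the core Encode module rather than Unicode::MapUTF8; `to_latin1_once` is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return $bytes as ISO-8859-1.
# Valid UTF-8 is converted. Anything else is assumed (an assumption!)
# to be ISO-8859-1 already and is returned unchanged.
sub to_latin1_once {
    my ($bytes) = @_;
    my $copy  = $bytes;   # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $bytes unless defined $chars;   # not valid UTF-8: leave as is
    # Characters with no ISO-8859-1 equivalent become '?'.
    return encode('ISO-8859-1', $chars);
}

# Usage, with $full and $name as in the post above:
# $full = to_latin1_once($full);
# $name = to_latin1_once($name);
```

The check is a heuristic: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so this reduces the risk of double conversion rather than ruling it out.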