Below is some code Alex posted a while back to convert UTF-8 to ISO-8859-1. Unfortunately, when running it on large cuts of the DMOZ RDF (like Regional), I run out of memory (1.5 GB on my server). Is there an easy way to modify this code so that it isn't trying to load the entire cut into memory, convert it, then write it to a file? Maybe it could use a temp/swap file and write the data a piece at a time. Any ideas?
Code:
use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);

open (FH, "$rdf_path/Regional.rdf.u8");
read (FH, $data, -s FH);   # slurps the entire file into memory
read (FH, $data, -s FH);
close FH;
open (OUT, "> $out_path/Regional.rdf");
$data =~ s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1'})/eg;
print OUT $data;
close OUT;
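Roughly what I had in mind is something like the sketch below: read and convert one line at a time instead of slurping the whole file. Since newline is ASCII, a multi-byte UTF-8 sequence can never straddle a line boundary, so the same regex should be safe per line. (Untested against the full Regional cut; the paths are placeholders as in the original.)

```perl
use strict;
use warnings;
use Unicode::MapUTF8 qw(from_utf8);

# Placeholder paths, as in the original post.
my $rdf_path = '.';
my $out_path = '.';

open my $in,  '<', "$rdf_path/Regional.rdf.u8" or die "open input: $!";
open my $out, '>', "$out_path/Regional.rdf"    or die "open output: $!";

# Convert line by line so only one line is ever held in memory.
while (my $line = <$in>) {
    $line =~ s/([\200-\377]+)/from_utf8({ -string => $1, -charset => 'ISO-8859-1' })/eg;
    print $out $line;
}

close $in;
close $out or die "close output: $!";
```

Would that be enough, or is there a reason the conversion has to see the whole file at once?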
Thanks in advance,
Sean