Gossamer Forum

read file backwards avoiding dupes

hi,

i want to...

1. get 20 records from the bottom of a tab delimited .txt datafile

2. skip duplicates on one field

3. minimize i/o

i have some code, but i'm wondering what might be a more efficient way to do it. here, i'm reading in the whole datafile....

Code:
my (%track, @newarr, $x);
my $count = 20;
open(IN, "< $datafile") || die "Couldn't open $datafile: $!\n";
my @update = <IN>;                   # slurp the whole datafile
close(IN) || die "Can't close $datafile: $!\n";
@update = reverse(@update);          # newest records first
for ($x = 0; $x < $count; $x++) {
    last if $x > $#update;           # don't run past the end of the file
    my ($status, $time, $desc, $PM) = split /\t/, $update[$x];
    # seen this PM before? bump $count so we look one line further back
    $track{$PM}++ && $count++ && next;
    push(@newarr, $update[$x]);
}
Re: [adrockjames] read file backwards avoiding dupes
Getting the last 20 lines is easy if you have a server with "tail". The bummer is that if there are duplicates you will end up knocking records out of the array and may have fewer than 20 in the end... so I don't think there's any way around doing some jiggery pokery.

You would need to determine the number of lines in the file, then get the last 20, check for duplicates and determine the size of your final array. If fewer than 20 records remain, you'd need to go back an extra line to pick up another record.

Or, if there are duplicates, are you happy for them just to be removed, or do you need exactly 20 after removing the dupes?
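
Something along these lines is what I mean - untested and only a sketch (it shells out to wc and tail), but it shows the "go back an extra line" part:

Code:
chomp(my $total = `wc -l < $datafile`);   # number of lines in the file

my @wanted;
my $grab = 20;                            # how far back to reach with tail
while (@wanted < 20 && $grab <= $total) {
    my @tail = `tail -$grab $datafile`;
    my %seen;
    @wanted = ();
    for my $line (reverse @tail) {        # newest first
        my ($status, $time, $desc, $PM) = split /\t/, $line;
        next if $seen{$PM}++;             # dupe on the PM field
        push @wanted, $line;
    }
    $grab++;                              # still short? go back one more line
}
@wanted = reverse @wanted;                # oldest of the 20 first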
Re: [Paul] read file backwards avoiding dupes
thanks paul.

yeah, i originally was using @update = `tail -20 $datafile`;

until i realized that i did need exactly 20, and i would be forced to go back to the file once i knocked out dupes. how costly do you think going back into the file is vs. reading it all into an array the first time, given the file has 3000 lines? (i know there are a lot of variables involved in this question, just wanted a best practices type understanding).
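
fwiw, one option i've been looking at (assuming the File::ReadBackwards module from CPAN can go on the server) is reading the file from the end a block at a time, so only the lines i actually need ever get read:

Code:
use File::ReadBackwards;

my (%seen, @wanted);
my $bw = File::ReadBackwards->new($datafile)
    or die "Couldn't open $datafile: $!\n";

while (defined(my $line = $bw->readline)) {   # lines come back newest first
    my ($status, $time, $desc, $PM) = split /\t/, $line;
    next if $seen{$PM}++;                     # skip dupes on the PM field
    push @wanted, $line;
    last if @wanted == 20;                    # stop as soon as we have 20
}
@wanted = reverse @wanted;                    # back to oldest-first order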
Re: [adrockjames] read file backwards avoiding dupes
Hmmm. Seems to me that you could do a tail command to get the last 50 lines or so, assuming you never get more than 30 duplicates in the last 50 lines. Then, after processing, take the last 20 of what you have left.
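
Untested, but roughly what I have in mind (assumes 50 lines is always enough of a cushion):

Code:
my (%seen, @unique);
my @tail = `tail -50 $datafile`;     # cushion of the last 50 lines

for my $line (reverse @tail) {       # newest first
    my ($status, $time, $desc, $PM) = split /\t/, $line;
    next if $seen{$PM}++;            # already kept a newer record for this PM
    push @unique, $line;
}
@unique = reverse @unique;           # back to oldest-first order

# take the last 20 of what's left (could still be fewer if the cushion wasn't enough)
my @last20 = @unique > 20 ? @unique[-20 .. -1] : @unique;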