Finding duplicate file names and modifying them based on elements of the path

 

 


larry.martell at gmail

Jul 18, 2012, 3:20 PM

Post #1 of 18 (736 views)
Permalink
Finding duplicate file names and modifying them based on elements of the path

I have an interesting problem I'm trying to solve. I have a solution
almost working, but it's super ugly, and I know there has to be a
better, cleaner way to do it.

I have a list of path names that have this form:

/dir0/dir1/dir2/dir3/dir4/dir5/dir6/file

I need to find all the file names (basenames) in the list that are
duplicates, and for each one that is a dup, prepend dir4 to the
filename as long as the dir4/file pair is unique. If there are
multiple dir4/files in the list, then I also need to add a sequence
number based on the sorted value of dir5 (which is a date in ddMONyy
format).

For example, if my list contains:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3

Then I want to end up with:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
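
A minimal sketch of pulling the three fields that matter (basename, dir4,
dir5) out of one path, assuming every path has exactly the fixed layout
shown above. The helper name split_fields is hypothetical, and %b assumes
English month abbreviations.

```python
from datetime import datetime

def split_fields(path):
    """Return (basename, dir4, dir5 as a date) for a path of the form
    /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file (hypothetical helper)."""
    parts = path.split("/")   # parts[0] is '' because of the leading slash
    dir4, dir5, name = parts[5], parts[6], parts[8]
    # dir5 is ddMONyy, e.g. 09Jan12; parsing it gives a value that sorts by date
    return name, dir4, datetime.strptime(dir5, "%d%b%y").date()

print(split_fields("/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3"))
```

Parsing dir5 into a real date matters because the raw ddMONyy strings do
not sort chronologically as text.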

My solution involves multiple maps and multiple iterations through the
data. How would you folks do this?
--
http://mail.python.org/mailman/listinfo/python-list


no.email at nospam

Jul 18, 2012, 3:49 PM

Post #2 of 18 (717 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

"Larry.Martell [at] gmail" <larry.martell [at] gmail> writes:
> I have an interesting problem I'm trying to solve. I have a solution
> almost working, but it's super ugly, and know there has to be a
> better, cleaner way to do it. ...
>
> My solution involves multiple maps and multiple iterations through the
> data. How would you folks do this?

You could post your code and ask for suggestions how to improve it.
There are a lot of not-so-natural constraints in that problem, so it
stands to reason that the code will be a bit messy. The whole
specification seems like an antipattern though. You should just give a
sensible encoding for the filename regardless of whether other fields
are duplicated or not. You also don't seem to address the case where
basename, dir4, and dir5 are all duplicated.

The approach I'd take for the spec as you wrote it is:

1. Sort the list on the (basename, dir4, dir5) triple, saving original
location (numeric index) of each item
2. Use itertools.groupby to group together duplicate basenames.
3. Within the groups, use groupby again to gather duplicate dir4's,
4. Within -those- groups, group by dir5 and assign sequence numbers in
groups where there's more than one file
5. Unsort to get the rewritten items back into the original order.

Actual code is left as an exercise.
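
The five steps above can be sketched roughly as follows. This is one
possible reading, not a definitive implementation: the function name
rename_duplicates is hypothetical, the fixed eight-component path layout
is assumed, and step 4 is simplified so that entries within a dir4 group
are numbered in date order instead of being grouped again by dir5. Under
that simplification, entries whose basename, dir4 and dir5 all match
simply get consecutive sequence numbers.

```python
from datetime import datetime
from itertools import groupby

def rename_duplicates(paths):
    """Sketch of the sort / nested-groupby plan; returns paths in input order."""
    def fields(item):
        index, path = item
        parts = path.split("/")
        date = datetime.strptime(parts[6], "%d%b%y")
        return parts[8], parts[5], date       # (basename, dir4, dir5)

    result = list(paths)
    # Step 1: sort on the triple, carrying each item's original index along.
    items = sorted(enumerate(paths), key=fields)
    # Step 2: group equal basenames; materialise each group so len() works.
    for name, by_name in groupby(items, key=lambda it: fields(it)[0]):
        by_name = list(by_name)
        if len(by_name) == 1:
            continue                          # unique basename: leave it alone
        # Step 3: within one basename, group equal dir4 values.
        for dir4, by_dir4 in groupby(by_name, key=lambda it: fields(it)[1]):
            by_dir4 = list(by_dir4)           # already in dir5 (date) order
            # Step 4: add a sequence number only if dir4/file is not unique.
            for seq, (index, path) in enumerate(by_dir4):
                prefix = dir4 if len(by_dir4) == 1 else "%s_%02d" % (dir4, seq)
                head = path.rsplit("/", 1)[0]
                # Step 5: writing back by original index undoes the sort.
                result[index] = "%s/%s_%s" % (head, prefix, name)
    return result
```

Each group is turned into a list before its length is checked, because the
group objects that groupby yields are one-shot iterators with no len().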


simoncropper at fossworkflowguides

Jul 18, 2012, 5:36 PM

Post #3 of 18 (714 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On 19/07/12 08:20, Larry.Martell [at] gmail wrote:
> I have an interesting problem I'm trying to solve. I have a solution
> almost working, but it's super ugly, and know there has to be a
> better, cleaner way to do it.
>
> I have a list of path names that have this form:
>
> /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> I need to find all the file names (basenames) in the list that are
> duplicates, and for each one that is a dup, prepend dir4 to the
> filename as long as the dir4/file pair is unique. If there are
> multiple dir4/files in the list, then I also need to add a sequence
> number based on the sorted value of dir5 (which is a date in ddMONyy
> format).
>
> For example, if my list contains:
>
> /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> Then I want to end up with:
>
> /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> My solution involves multiple maps and multiple iterations through the
> data. How would you folks do this?
>

Hi Larry,

I am making the assumption that you intend to collapse the directory
tree and store each file in the same directory, otherwise I can't think
of why you need to do this.

If this is the case, then I would...

1. import all the files into an array
2. parse path to extract fourth-level directory name and base name.
3. reiterate through the array
    3.1 check if base filename exists in recipient directory
    3.2 if not, copy to recipient directory
    3.3 if present, append the directory path then save
    3.4 create log of success or failure
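
A rough sketch of the steps above, planning the names only and copying
nothing. The function name flat_names is hypothetical, a set of names
already used stands in for the recipient directory, and the directory
named in step 2 is assumed to be dir4 from the original post.

```python
def flat_names(paths):
    """Plan a flat-directory name for each path (hypothetical helper)."""
    taken = set()     # names already present in the recipient directory
    plan = []
    for path in paths:
        parts = path.split("/")
        name = parts[-1]
        if name in taken:
            # Step 3.3: the plain name is in use, so qualify it with dir4.
            name = "%s_%s" % (parts[5], name)
        # Step 3.4: record success or failure; a second clash is not resolved.
        ok = name not in taken
        taken.add(name)
        plan.append((path, name, ok))
    return plan
```

With this approach the first file seen keeps its plain name and only later
arrivals are renamed, so the result depends on input order.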

Personally, I would not have some files with abcd_file1 and others as
file2 because if it is important enough to store a file in a separate
directory you should also note where file2 came from as well. When
looking at your results at a later date you are going to have to open
file2 (which I presume must record where it relates to) to figure out
where it came from. If it is in the name it is easier to review.

In short, consistency is the name of the game; if you are going to do it
for some then do it for all; and finally it will be easier for others
later to work out what you have done.

--
Cheers Simon

Simon Cropper - Open Content Creator

Free and Open Source Software Workflow Guides
------------------------------------------------------------
Introduction http://www.fossworkflowguides.com
GIS Packages http://www.fossworkflowguides.com/gis
bash / Python http://www.fossworkflowguides.com/scripting




larry.martell at gmail

Jul 19, 2012, 11:52 AM

Post #4 of 18 (709 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having 1 issue, how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.

reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):

for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell
    # if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but
        # how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group, lambda x: x[2]):
            pass


You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?

Thanks!
-larry


larry.martell at gmail

Jul 19, 2012, 11:54 AM

Post #5 of 18 (711 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 18, 6:36 pm, Simon Cropper
<simoncrop...@fossworkflowguides.com> wrote:
> On 19/07/12 08:20, Larry.Mart...@gmail.com wrote:
>
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it.
>
> > I have a list of path names that have this form:
>
> > /dir0/dir1/dir2/dir3/dir4/dir5/dir6/file
>
> > I need to find all the file names (basenames) in the list that are
> > duplicates, and for each one that is a dup, prepend dir4 to the
> > filename as long as the dir4/file pair is unique. If there are
> > multiple dir4/files in the list, then I also need to add a sequence
> > number based on the sorted value of dir5 (which is a date in ddMONyy
> > format).
>
> > For example, if my list contains:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3
>
> > Then I want to end up with:
>
> > /dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
> > /dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
> > /dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
> > /dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> Hi Larry,
>
> I am making the assumption that you intend to collapse the directory
> tree and store each file in the same directory, otherwise I can't think
> of why you need to do this.

Hi Simon, thanks for the reply. It's not quite this - what I am doing
is creating a zip file with relative path names, and if there are
duplicate files, the parts of the path that are not carried over
need to get prepended to the file names to make them unique.
>
> If this is the case, then I would...
>
> 1. import all the files into an array
> 2. parse path to extract fourth-level directory name and base name.
> 3. reiterate through the array
>     3.1 check if base filename exists in recipient directory
>     3.2 if not, copy to recipient directory
>     3.3 if present, append the directory path then save
>     3.4 create log of success or failure
>
> Personally, I would not have some files with abcd_file1 and others as
> file2 because if it is important enough to store a file in a separate
> directory you should also note where file2 came from as well. When
> looking at your results at a later date you are going to have to open
> file2 (which I presume must record where it relates to) to figure out
> where it came from. If it is in the name it is easier to review.
>
> In short, consistency is the name of the game; if you are going to do it
> for some then do it for all; and finally it will be easier for others
> later to work out what you have done.

Yeah, I know, but this is for a client, and this is what they want.


larry.martell at gmail

Jul 19, 2012, 12:00 PM

Post #6 of 18 (711 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

I replied to this before, but I don't see it, so if this is a duplicate,
sorry.

Thanks for the reply Paul. I had not heard of itertools. It sounds
like just what I need for this. But I am having 1 issue - how do you
know how many items are in each group? Without knowing that I have to
either make 2 passes through the data, or else work on the previous
item (when I'm in an iteration after the first then I know I have
dups). But that very quickly gets crazy with trying to keep the
previous values.


larry.martell at gmail

Jul 19, 2012, 12:06 PM

Post #7 of 18 (711 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> > > I am making the assumption that you intend to collapse the directory
> > > tree and store each file in the same directory, otherwise I can't think
> > > of why you need to do this.
>
> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > is creating a zip file with relative path names, and if there are
> > duplicate files, the parts of the path that are not carried over
> > need to get prepended to the file names to make them unique.
>
> Depending on the file system of the client, you can hit file name
> length limits. I would think it would be better to just create
> the full structure in the zip.
>
> Just something to keep in mind, especially if you see funky behavior.

Thanks, but it's not what the client wants.


no.email at nospam

Jul 19, 2012, 12:43 PM

Post #8 of 18 (710 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

"Larry.Martell [at] gmail" <larry.martell [at] gmail> writes:
> Thanks for the reply Paul. I had not heard of itertools. It sounds
> like just what I need for this. But I am having 1 issue - how do you
> know how many items are in each group?

Simplest is:

for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
    gs = list(group)  # convert iterator to a list
    n = len(gs)       # this is the number of elements

there is some theoretical inelegance in that it requires each group to
fit in memory, but you weren't really going to have billions of files
with the same basename.
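
A small self-contained illustration of that pattern, written for Python 3.
The rows are made-up (basename, dir4) pairs, and the input is sorted first
because groupby only merges adjacent items with equal keys.

```python
from itertools import groupby

rows = sorted([("file1", "abcd"), ("file3", "qwer"),
               ("file1", "xyz"), ("file2", "abcd")])

counts = {}
for name, group in groupby(rows, key=lambda row: row[0]):
    members = list(group)         # materialise before asking for the size
    counts[name] = len(members)

print(counts)
```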

If you're not used to iterators and itertools, note there are some
subtleties to using groupby to iterate over files, because an iterator
actually has state. It bumps a pointer and maybe consumes some input
every time you advance it. In a situation like the above, you've got
some nested iterators (the groupby iterator generating groups, and the
individual group iterators that come out of the groupby) that wrap the
same file handle, so bad confusion can result if you advance both
iterators without being careful (one can consume file input that you
thought would go to another).

This isn't as bad as it sounds once you get used to it, but it can be
a source of frustration at first.

BTW, if you just want to count the elements of an iterator (while
consuming it),

n = sum(1 for x in xs)

counts the elements of xs without having to expand it into an in-memory
list.
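
A short demonstration of that idiom, using a made-up three-element
iterator; note that nothing is left in the iterator afterwards.

```python
xs = iter(["a", "b", "c"])
n = sum(1 for _ in xs)    # walks the iterator once, storing nothing
leftover = list(xs)       # the iterator is exhausted by now
print(n, leftover)        # 3 []
```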

Itertools really makes Python feel a lot more expressive and clean,
despite little kinks like the above.


no.email at nospam

Jul 19, 2012, 12:56 PM

Post #9 of 18 (718 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

"Larry.Martell [at] gmail" <larry.martell [at] gmail> writes:
> You can't do a len on the iterator that is returned from groupby, and
> I've tried to do something with imap or defaultdict, but I'm not
> getting anywhere. I guess I can just make 2 passes through the data,
> the first time getting counts. Or am I missing something about how
> groupby works?

I posted another reply to your other message, which reached me earlier.
If you're still stuck, post again, though I probably won't be able to
reply til tomorrow or the next day.


python at mrabarnett

Jul 19, 2012, 2:32 PM

Post #10 of 18 (713 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On 19/07/2012 20:06, Larry.Martell [at] gmail wrote:
> On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
>> > > I am making the assumption that you intend to collapse the directory
>> > > tree and store each file in the same directory, otherwise I can't think
>> > > of why you need to do this.
>>
>> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
>> > is creating a zip file with relative path names, and if there are
>> > duplicate files, the parts of the path that are not carried over
>> > need to get prepended to the file names to make them unique.
>>
>> Depending on the file system of the client, you can hit file name
>> length limits. I would think it would be better to just create
>> the full structure in the zip.
>>
>> Just something to keep in mind, especially if you see funky behavior.
>
> Thanks, but it's not what the client wants.
>
Here's another solution, not using itertools:

from collections import defaultdict
from os.path import basename, dirname
from time import strftime, strptime

# Starting with the original paths

paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
    "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

def make_dir5_key(path):
    date = strptime(path.split("/")[6], "%d%b%y")
    return strftime("%y%b%d", date)

# Collect the paths into a dict keyed by the basename

files = defaultdict(list)
for path in paths:
    files[basename(path)].append(path)

# Process a list of paths if there's more than one entry

renaming = []

for name, entries in files.items():
    if len(entries) > 1:
        # Collect the paths in each subgroup into a dict keyed by dir4

        subgroup = defaultdict(list)
        for path in entries:
            subgroup[path.split("/")[5]].append(path)

        for dir4, subentries in subgroup.items():
            # Sort the subentries by dir5 (date)
            subentries.sort(key=make_dir5_key)

            if len(subentries) > 1:
                for index, path in enumerate(subentries):
                    renaming.append((path,
                        "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
            else:
                path = subentries[0]
                renaming.append((path,
                    "{}/{}_{}".format(dirname(path), dir4, name)))
    else:
        path = entries[0]

for old_path, new_path in renaming:
    print("Rename {!r} to {!r}".format(old_path, new_path))



larry.martell at gmail

Jul 19, 2012, 5:58 PM

Post #11 of 18 (713 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 19, 1:56 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > You can't do a len on the iterator that is returned from groupby, and
> > I've tried to do something with imap or defaultdict, but I'm not
> > getting anywhere. I guess I can just make 2 passes through the data,
> > the first time getting counts. Or am I missing something about how
> > groupby works?
>
> I posted another reply to your other message, which reached me earlier.
> If you're still stuck, post again, though I probably won't be able to
> reply til tomorrow or the next day.

I really appreciate the offer, but I'm going to go with MRAB's
solution. It works, and I understand it ;-)


larry.martell at gmail

Jul 19, 2012, 6:01 PM

Post #12 of 18 (709 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 19, 3:32 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 19/07/2012 20:06, Larry.Mart...@gmail.com wrote:
>
> > On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> >> > > I am making the assumption that you intend to collapse the directory
> >> > > tree and store each file in the same directory, otherwise I can't think
> >> > > of why you need to do this.
>
> >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> >> > is creating a zip file with relative path names, and if there are
> >> > duplicate files, the parts of the path that are not carried over
> >> > need to get prepended to the file names to make them unique.
>
> >> Depending on the file system of the client, you can hit file name
> >> length limits. I would think it would be better to just create
> >> the full structure in the zip.
>
> >> Just something to keep in mind, especially if you see funky behavior.
>
> > Thanks, but it's not what the client wants.
>
> Here's another solution, not using itertools:
>
> from collections import defaultdict
> from os.path import basename, dirname
> from time import strftime, strptime
>
> # Starting with the original paths
>
> paths = [
>      "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
>      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
>      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
>      "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
>      "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> ]
>
> def make_dir5_key(path):
>      date = strptime(path.split("/")[6], "%d%b%y")
>      return strftime("%y%b%d", date)
>
> # Collect the paths into a dict keyed by the basename
>
> files = defaultdict(list)
> for path in paths:
>      files[basename(path)].append(path)
>
> # Process a list of paths if there's more than one entry
>
> renaming = []
>
> for name, entries in files.items():
>      if len(entries) > 1:
>          # Collect the paths in each subgroup into a dict keyed by dir4
>
>          subgroup = defaultdict(list)
>          for path in entries:
>              subgroup[path.split("/")[5]].append(path)
>
>          for dir4, subentries in subgroup.items():
>              # Sort the subentries by dir5 (date)
>              subentries.sort(key=make_dir5_key)
>
>              if len(subentries) > 1:
>                  for index, path in enumerate(subentries):
>                      renaming.append((path,
> "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
>              else:
>                  path = subentries[0]
>                  renaming.append((path, "{}/{}_{}".format(dirname(path),
> dir4, name)))
>      else:
>          path = entries[0]
>
> for old_path, new_path in renaming:
>      print("Rename {!r} to {!r}".format(old_path, new_path))

Thanks a million MRAB. I really like this solution. It's very
understandable and it works! I had never seen .format before. I had to
add the index of the positional args to them to make it work.
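
A sketch of what that change probably looked like. The thread does not say
which Python version was in use; the assumption here is Python 2.6, where
empty {} fields are rejected and every field needs an explicit index
(automatic numbering arrived in 2.7 and 3.1). The values are taken from
the example paths.

```python
directory = "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6"
dir4, index, name = "qwer", 0, "file3"

# Explicit positional indices; this spelling works on 2.6 and later.
new_path = "{0}/{1}_{2:02}_{3}".format(directory, dir4, index, name)
print(new_path)
```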



larry.martell at gmail

Jul 19, 2012, 6:01 PM

Post #13 of 18 (713 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 19, 1:43 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
>   for key, group in groupby(xs, lambda x:(x[-1],x[4],x[5])):
>      gs = list(group)  # convert iterator to a list
>      n = len(gs)       # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state.  It bumps a pointer and maybe consumes some input
> every time you advance it.  In a situation like the above, you've got
> some nested iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that if you do a list(group) you have consumed the list. This
screwed me up for a while, and seems very counter-intuitive.

> This isn't as bad as it sounds once you get used to it, but it can be
> a source of frustration at first.
>
> BTW, if you just want to count the elements of an iterator (while
> consuming it),
>
>      n = sum(1 for x in xs)
>
> counts the elements of xs without having to expand it into an in-memory
> list.
>
> Itertools really makes Python feel a lot more expressive and clean,
> despite little kinks like the above.



larry.martell at gmail

Jul 19, 2012, 8:07 PM

Post #14 of 18 (712 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On Jul 19, 7:01 pm, "Larry.Mart...@gmail.com"
<larry.mart...@gmail.com> wrote:
> On Jul 19, 3:32 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
>
> > On 19/07/2012 20:06, Larry.Mart...@gmail.com wrote:
>
> > > On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> > >> > > I am making the assumption that you intend to collapse the directory
> > >> > > tree and store each file in the same directory, otherwise I can't think
> > >> > > of why you need to do this.
>
> > >> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > >> > is creating a zip file with relative path names, and if there are
> > >> > duplicate files, the parts of the path that are not carried over
> > >> > need to get prepended to the file names to make them unique.
>
> > >> Depending on the file system of the client, you can hit file name
> > >> length limits. I would think it would be better to just create
> > >> the full structure in the zip.
>
> > >> Just something to keep in mind, especially if you see funky behavior.
>
> > > Thanks, but it's not what the client wants.
>
> > Here's another solution, not using itertools:
>
> > from collections import defaultdict
> > from os.path import basename, dirname
> > from time import strftime, strptime
>
> > # Starting with the original paths
>
> > paths = [
> >      "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
> >      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
> >      "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
> >      "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
> >      "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
> > ]
>
> > def make_dir5_key(path):
> >      date = strptime(path.split("/")[6], "%d%b%y")
> >      return strftime("%y%b%d", date)
>
> > # Collect the paths into a dict keyed by the basename
>
> > files = defaultdict(list)
> > for path in paths:
> >      files[basename(path)].append(path)
>
> > # Process a list of paths if there's more than one entry
>
> > renaming = []
>
> > for name, entries in files.items():
> >      if len(entries) > 1:
> >          # Collect the paths in each subgroup into a dict keyed by dir4
>
> >          subgroup = defaultdict(list)
> >          for path in entries:
> >              subgroup[path.split("/")[5]].append(path)
>
> >          for dir4, subentries in subgroup.items():
> >              # Sort the subentries by dir5 (date)
> >              subentries.sort(key=make_dir5_key)
>
> >              if len(subentries) > 1:
> >                  for index, path in enumerate(subentries):
> >                      renaming.append((path,
> > "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
> >              else:
> >                  path = subentries[0]
> >                  renaming.append((path, "{}/{}_{}".format(dirname(path),
> > dir4, name)))
> >      else:
> >          path = entries[0]
>
> > for old_path, new_path in renaming:
> >      print("Rename {!r} to {!r}".format(old_path, new_path))
>
> Thanks a million MRAB. I really like this solution. It's very
> understandable and it works! I had never seen .format before. I had to
> add the index of the positional args to them to make it work.

Also, in make_dir5_key the format specifier for strftime should be
%y%m%d so they sort properly.
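
A small sketch of why the specifier matters. The fmt parameter is added
here only so both variants can be compared side by side, and the two
paths use made-up dates: with %b the month is text, so "Feb" sorts before
"Jan".

```python
from time import strftime, strptime

def make_dir5_key(path, fmt="%y%m%d"):
    # fmt is a hypothetical parameter, added for the comparison below
    date = strptime(path.split("/")[6], "%d%b%y")
    return strftime(fmt, date)

paths = [
    "/dir0/dir1/dir2/dir3/qwer/01Feb12/dir6/file3",
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
]

by_number = sorted(paths, key=make_dir5_key)                        # keys 120109, 120201
by_name = sorted(paths, key=lambda p: make_dir5_key(p, "%y%b%d"))   # keys 12Feb01, 12Jan09
print(by_number[0])   # the January path comes first
print(by_name[0])     # the February path comes first, which is wrong
```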


__peter__ at web

Jul 20, 2012, 12:35 AM

Post #15 of 18 (717 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

Larry.Martell [at] gmail wrote:

> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Many itertools functions work that way. It allows you to iterate over the
items even if there is more data than fits into memory.
If you need to keep all items and are sure that your computer can cope with
them at once you can always throw in a

group = list(group)
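
A minimal illustration of the one-shot behaviour, using made-up data: a
second pass over the same group iterator yields nothing, while the list
built on the first pass can be reused freely.

```python
from itertools import groupby

groups = groupby(["a1", "a2", "b1"], key=lambda s: s[0])
key, group = next(groups)     # first group, key "a"

first_pass = list(group)      # consumes the group iterator
second_pass = list(group)     # nothing left to read
print(key, first_pass, second_pass)
```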




no.email at nospam

Jul 20, 2012, 12:51 AM

Post #16 of 18 (713 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

"Larry.Martell [at] gmail" <larry.martell [at] gmail> writes:
> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Yes, that is correct, you have to carefully watch where the stuff in the
iterators is getting consumed, including when there are nested
iterators. That's what I was mentioning earlier--it got me confused at
first, but I use that style all the time now and it is pretty natural.


paul.nospam at rudin

Jul 20, 2012, 1:37 AM

Post #17 of 18 (714 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

"Larry.Martell [at] gmail" <larry.martell [at] gmail> writes:

> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

You've consumed the *group* which is an iterator, in order to construct
a list from its elements. Sorry if this is excessively nit-picking, but
it generally helps to keep these things very clear in your own mind.



python at mrabarnett

Jul 20, 2012, 8:45 AM

Post #18 of 18 (712 views)
Permalink
Re: Finding duplicate file names and modifying them based on elements of the path [In reply to]

On 20/07/2012 04:07, Larry.Martell [at] gmail wrote:
[snip]
> Also, in make_dir5_key the format specifier for strftime should be
> %y%m%d so they sort properly.
>
Correct. I realised that only some time later, after I'd turned off my
computer for the night. :-(
