Mailing List Archive: Python: Python

comparison of files using set function

 

 



beema.shafreen at gmail

May 17, 2008, 7:47 AM

comparison of files using set function

I have files with two columns: column 1 holds an id and column 2 holds
data (a sequence).
My goal is to create a table in which the first column lists all the ids
from the files, the second column holds the sequence from file1 for each
id, the third column holds the sequence from the next file for the same
id, and so on.
The original files look like this:

45 ytut
46 erete
37 dfasf
45 dassdsd


and so on for all 10 files; each has the two columns mentioned above.

The output should look like this:

Id file1 file2 file3 file4 file5
43 ytuh ytuh ytuh ytuh ytuh
46 erteee rty ryyy ertyu
47 yutio rrr eeerr



The goal is to pick all the common ids in the files and list their
respective information alongside them.
Several conditions can occur:
1) a common id is present in all the files, with the same information
2) a common id is present in all the files, but not with the same information
3) a common id may not be present in all the files

But the goal is exactly to find the common ids in all the files and add
their corresponding information from each file to the table, as shown above.
My script:
def file1_search(*files1):
    for file1 in files1:
        gi1_lis = []
        fh = open(file1,'r')
        for line in fh.readlines():
            data1 = line.strip().split('\t')
            gi1 = data1[0].strip()
            seq1 = data1[1].strip()
            gi1_lis.append(gi1)
    return gi1_lis

def file2_search(**files2):
    for file2 in files2:
        for file in files2[file2]:
            gi2_lis = []
            fh1 = open(file,'r')
            for line1 in fh1.readlines():
                data2 = line1.strip().split('\t')
                gi2 = data2[0].strip()
                seq2 = data2[1].strip()
                gi2_lis.append(gi2)

    return gi2_lis

def set_compare(data1,data2,*files1,**files2):
    A = set(data1)
    B = set(data2)
    I = A&B # common between these two sets

    D = A-B #57 is the len of D
    C = B-A #176 is the len of C
    # print len(C)
    # print len(D)
    for file1 in files1:
        for gi in D:
            fh = open(file1,'r')
            for line in fh.readlines():
                data1 = line.strip().split('\t')
                gi1 = data1[0].strip()
                seq1 = data1[1].strip()
                if gi == gi1:
                    # print line.strip()
                    pass

    for file2 in files2:
        for file in files2[file2]:
            for gi in C:
                fh1 = open(file,'r')
                for line1 in fh1.readlines():
                    data2 = line1.strip().split('\t')
                    gi2 = data2[0].strip()
                    seq2 = data2[1].strip()
                    if gi == gi2:
                        # print line1.strip()
                        pass

if __name__ == "__main__":
    files1 = ["Fr20.txt",\
              "Fr22.txt",\
              "Fr24.txt",\
              "Fr60.txt",\
              "Fr62.txt"]
    files2 = {"data":["Fr64.txt",\
                      "Fr66.txt",\
                      "Fr68.txt",\
                      "Fr70.txt",\
                      "Fr72.txt"]}
    data1 = file1_search(*files1)

    """113 is the total number of gi"""
    data2 = file2_search(**files2)
    #for j in data2:
    #    print j
    """232 is the total number of gi found"""
    result = set_compare(data1,data2,*files1,**files2)

It does not work properly. Could somebody please suggest how I can proceed?
Thanks a lot

--
Beema Shafreen


kamhung.soh at gmail

May 17, 2008, 6:08 PM

Re: comparison of files using set function

On Sun, 18 May 2008 00:47:55 +1000, Beema shafreen
<beema.shafreen [at] gmail> wrote:

> It does not work properly. Could somebody please suggest how I can
> proceed?
> Thanks a lot
>

1. Test with a small number of short files with a clear idea of the
expected result.
2. Use better variable names. Names such as file1_search, file2_search,
gi, gi2, A, B, C and D make it nearly impossible to understand your code.
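
To illustrate both suggestions, here is a minimal sketch (Python 3), not a corrected version of the script above. It assumes each line is "id, whitespace, sequence", that a repeated id within one file keeps its first sequence, and that a missing value is shown as "-". The names read_sequences, ids_in_every_file and build_table are hypothetical, chosen only to be descriptive. The two in-memory "files" are small enough that the expected table can be worked out by hand.

```python
import io


def read_sequences(lines):
    """Map each id to its sequence from 'id<whitespace>sequence' lines."""
    sequences = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seq_id, sequence = fields[0], fields[1]
        # Assumption: when an id repeats within a file, keep its first sequence.
        sequences.setdefault(seq_id, sequence)
    return sequences


def ids_in_every_file(sequences_per_file):
    """Return the ids present in all files (set intersection)."""
    id_sets = [set(sequences) for sequences in sequences_per_file]
    return set.intersection(*id_sets) if id_sets else set()


def build_table(labels, sequences_per_file, missing="-"):
    """Return rows: a header, then one row per id found in any file."""
    all_ids = set()
    for sequences in sequences_per_file:
        all_ids.update(sequences)
    rows = [["Id"] + list(labels)]
    for seq_id in sorted(all_ids):  # ids are compared as strings
        rows.append(
            [seq_id] + [s.get(seq_id, missing) for s in sequences_per_file]
        )
    return rows


# Two tiny in-memory "files" with a result that is easy to check by hand.
file1 = io.StringIO("45 ytut\n46 erete\n37 dfasf\n45 dassdsd\n")
file2 = io.StringIO("45 ytut\n46 rty\n")
parsed = [read_sequences(file1), read_sequences(file2)]

for row in build_table(["file1", "file2"], parsed):
    print("\t".join(row))
```

For real files, the same functions could be fed open file handles, for example "with open(name) as handle: read_sequences(handle)". Keeping one dictionary per file means each file is read only once, instead of being reopened for every id.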

--
Kam-Hung Soh, Software Salariman (http://kamhungsoh.com/blog)

--
http://mail.python.org/mailman/listinfo/python-list
