
kamhung.soh at gmail
May 17, 2008, 6:08 PM
Post #2 of 2
Re: comparison of files using set function
On Sun, 18 May 2008 00:47:55 +1000, Beema shafreen <beema.shafreen [at] gmail> wrote:

> I have files with two column, column 1 is with id and column 2 is with
> data(sequence)
> My goal is to create a table in such a way, the column one of the table
> should have all the id from the files and next column will be have the
> respective seq of the file1 with correlation to the id and the third
> column
> will be sequence information of the next file with respective to the id
> original files look like this
>
> 45 ytut
> 46 erete
> 37 dfasf
> 45 dassdsd
>
>
> and so on for all the 10 files that is it has two column as mentioned
> above.
>
> The output should look like this:
>
> Id file1 file2 file3 file4 file5
> 43 ytuh ytuh ytuh ytuh ytuh
> 46 erteee rty ryyy ertyu
> 47 yutio rrr eeerr
>
>
>
> The goal is if the pick all the common id in the files and with their
> respective information in the adjacent rows.
> the various conditons ca also prevails
> 1) common id present in all the files, which have same information
> 2)common id present in all the files, which donot have same information
> 3) common id may not be present in all the files
>
> But the goal is exactly find the common id in all the files and add their
> corresponding information in the file to the table as per the view
> my script :
> def file1_search(*files1):
>     for file1 in files1:
>         gi1_lis = []
>         fh = open(file1,'r')
>         for line in fh.readlines():
>             data1 = line.strip().split('\t')
>             gi1 = data1[0].strip()
>             seq1 = data1[1].strip()
>             gi1_lis.append(gi1)
>     return gi1_lis
> def file2_search(**files2):
>     for file2 in files2:
>         for file in files2[file2]:
>             gi2_lis = []
>             fh1 = open(file,'r')
>             for line1 in fh1.readlines():
>                 data2 = line1.strip().split('\t')
>                 gi2 = data2[0].strip()
>                 seq2 = data2[1].strip()
>                 gi2_lis.append(gi2)
>
>     return gi2_lis
> def set_compare(data1,data2,*files1,**files2):
>     A = set(data1)
>     B = set(data2)
>     I = A&B # common between thesetwo sets
>
>     D = A-B #57 is the len of D
>     C = B-A #176 is the len of c
>     # print len(C)
>     # print len(D)
>     for file1 in files1:
>         for gi in D:
>             fh = open(file1,'r')
>             for line in fh.readlines():
>                 data1 = line.strip().split('\t')
>                 gi1 = data1[0].strip()
>                 seq1 = data1[1].strip()
>                 if gi == gi1:
>                     # print line.strip()
>                     pass
>
>     for file2 in files2:
>         for file in files2[file2]:
>             for gi in C:
>                 fh1 = open(file,'r')
>                 for line1 in fh1.readlines():
>                     data2 = line1.strip().split('\t')
>                     gi2 = data2[0].strip()
>                     seq2 = data2[1].strip()
>                     if gi == gi2:
>                         # print line1.strip()
>                         pass
> if __name__ == "__main__":
>     files1 = ["Fr20.txt",\
>               "Fr22.txt",\
>               "Fr24.txt",\
>               "Fr60.txt",\
>               "Fr62.txt"]
>     files2 = {"data":["Fr64.txt",\
>               "Fr66.txt",\
>               "Fr68.txt",\
>               "Fr70.txt",\
>               "Fr72.txt"]}
>     data1 = file1_search(*files1)
>
>     """113 is the total number of gi"""
>     data2 = file2_search(**files2)
>     #for j in data2:
>     #    print j
>     """232 is the total number of gi found"""
>     result = set_compare(data1,data2,*files1,**files2)
>
> It doesnot work fine... some body please suggest me the way i can
> proceed .
> Thanks a lot
>

1. Test with a small number of short files with a clear idea of the expected result.

2. Use better variable names. Names such as file1_search, file2_search, gi, gi2, A, B, C and D make it nearly impossible to understand your code.

--
Kam-Hung Soh <a href="http://kamhungsoh.com/blog">Software Salariman</a>

--
http://mail.python.org/mailman/listinfo/python-list
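
To illustrate the two suggestions above, here is a minimal sketch, not a drop-in replacement for the original script. It assumes the files are tab-separated with the id in the first column (as the quoted code's split('\t') implies) and that a repeated id within one file keeps its last sequence. The function names and the file names small1.txt and small2.txt are hypothetical, chosen only for this example, and the code is written for Python 3.

```python
import os
import tempfile


def read_sequences(path):
    """Return {id: sequence} for one tab-separated, two-column file."""
    sequences = {}
    with open(path) as handle:
        for line in handle:
            fields = line.strip().split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # Assumption: if an id repeats within a file, the last sequence wins.
            sequences[fields[0].strip()] = fields[1].strip()
    return sequences


def common_ids(per_file):
    """Return the ids present in every file's {id: sequence} dict."""
    if not per_file:
        return set()
    return set.intersection(*(set(sequences) for sequences in per_file))


def build_table(paths):
    """Return one row per id: [id, seq_in_file1, seq_in_file2, ...]."""
    per_file = [read_sequences(path) for path in paths]
    all_ids = set().union(*per_file)
    # Ids are compared as strings; an id missing from a file gets "".
    return [[seq_id] + [sequences.get(seq_id, "") for sequences in per_file]
            for seq_id in sorted(all_ids)]


def write_small_test_files(directory):
    """Write two short, hypothetical input files and return their paths."""
    samples = [
        ("small1.txt", "45\tytut\n46\terete\n45\tdassdsd\n"),
        ("small2.txt", "46\trty\n47\trrr\n"),
    ]
    paths = []
    for name, text in samples:
        path = os.path.join(directory, name)
        with open(path, "w") as handle:
            handle.write(text)
        paths.append(path)
    return paths


if __name__ == "__main__":
    paths = write_small_test_files(tempfile.mkdtemp())
    print("\t".join(["Id", "file1", "file2"]))
    for row in build_table(paths):
        print("\t".join(row))
```

With two files this small, the expected table can be worked out by hand before running anything: id 46 is the only id common to both files, 45 appears only in the first and 47 only in the second. Once that matches, the same functions can be pointed at the real Fr*.txt files.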