
report at bugs
Sep 10, 2007, 11:18 PM
Post #4 of 7
[issue1142] code sample showing errors reading large files with py 2.5/3.0
christen added the comment:

Hi Guido,

It is not the end of the file that is not read (see also below).

I found out about this about one year ago, when I was parsing very large files resulting from "blast" on the human genome. My parser choked after 4 GB, well before the end of the file: one line was missing, and my acc=li[x:y] ended up with an error because acc was never filled... This was kind of strange, because it had not happened before on my Linux box. I opened the file (which I had created myself) with an editor that could show hex code: the proper line was there and all right. If I remember well, I modified my code to see better what was going on: in fact the missing line had been concatenated to the previous line, despite the proper existence of the end of line (hex code was OK). See also below.

I forgot about that because nobody replied to my mails, and I thought it was possibly related to 32-bit Windows. I moved to 64-bit Windows recently (Windows has the best driver for SQL databases) and forgot about the bug until I ran into it again. I then decided to try Python 3k: it reads >4 GB files with no trouble but is very slow, both in reading and in writing files.

The following code produces either a <4 GB or a >4 GB file, depending upon which fichout.write line is commented out. Both files have the same number of lines, but the >4 GB one is not read completely under Windows (32 or 64). I have no such problem on Linux or BSD (Mac). Python 3k on Windows reads both files OK, but is very, very slow (change xrange to range; I guess it is presumptuous of me to advise you about that :-).
best
Richard

    import sys
    print(sys.version_info)
    import time
    print (time.strftime('%Y-%m-%d %H:%M:%S'))
    liste=[]
    start = time.time()

    # Write phase: 85014961 numbered lines (use range instead of xrange on py3k)
    fichout=open('test.txt','w')
    for i in xrange(85014961):
        if i%5000000==0 and i>0:
            print (i,time.time()-start)
        fichout.write(str(i)+' '*59+'\n') #big file
        #fichout.write(str(i)+'\n') #small file, same number of lines
    fichout.flush()
    fichout.close()
    print ('total lines written ',i)
    print (i,time.time()-start)
    print ('*'*50)

    # Read phase: iterate over the file line by line; i ends as the last line index
    fichin=open('test.txt')
    start3 = time.time()
    for i,li in enumerate(fichin):
        if i%5000000==0 and i>0:
            print (i,time.time()-start3)
    fichin.close()
    print ('total lines read ',i)
    print(time.time()-start)

> Richard, can you somehow view the end of the file to see what its last
> lines actually are? It should end like this:
>
> 85014951
> 85014952
> 85014953
> 85014954
> 85014955
> 85014956
> 85014957
> 85014958
> 85014959
> 85014960
>

Using a text editor, the end of the file reads:

85014944
85014945
85014946
85014947
85014948
85014949
85014950
85014951
85014952
85014953
85014954
85014955
85014956
85014957
85014958
85014959
85014960

Windows, py 2.5, with

    if i>85014940: print i, li.strip()

prints:

(2, 5, 0, 'final', 0)
2007-09-11 07:58:47
(5000000, 2.6720001697540283)
(10000000, 5.375)
(15000000, 8.0320000648498535)
(20000000, 10.703000068664551)
(25000000, 13.375)
(30000000, 16.047000169754028)
(35000000, 18.703000068664551)
(40000000, 21.360000133514404)
(45000000, 24.032000064849854)
(50000000, 26.687999963760376)
(55000000, 29.360000133514404)
(60000000, 32.032000064849854)
(65000000, 34.703000068664551)
(70000000, 37.407000064849854)
(75000000, 40.094000101089478)
(80000000, 42.797000169754028)
(85000000, 45.485000133514404)
85014941 85014951
85014942 85014952
85014943 85014953
85014944 85014954
85014945 85014955
85014946 85014956
85014947 85014957
85014948 85014958
85014949 85014959
85014950 85014960

==> the missing lines are from within the file

Now introduce in the loop:

    if len(li)>80: print li.strip()

(2, 5, 0, 'final', 0)
2007-09-11 08:08:16
(5000000, 3.1559998989105225)
(10000000, 6.3280000686645508)
(15000000, 9.4839999675750732)
(20000000, 12.655999898910522)
(25000000, 15.843999862670898)
(30000000, 19.016000032424927)
(35000000, 22.187999963760376)
(40000000, 25.358999967575073)
(45000000, 28.530999898910522)
(50000000, 31.703000068664551)
(55000000, 34.858999967575073)
(60000000, 38.030999898910522)
* 62410138 62410139 *
* 62414887 62414888 *
* 62415540 62415541 *
* 62420289 62420290 *
* 62420942 62420943 *
* 62421595 62421596 *
* 62422248 62422249 *
* 62422901 62422902 *
* 62427650 62427651 *
* 62428303 62428304 *
(65000000, 41.233999967575073)
(70000000, 44.437999963760376)
(75000000, 47.625)
(80000000, 50.828000068664551)
(85000000, 54.016000032424927)
('total lines read ', 85014950)
54.0309998989

==> end of line not read for 10 lines in the middle of the file!

NTFS file system

best
Richard

__________________________________
Tracker <report[at]bugs.python.org>
<http://bugs.python.org/issue1142>
__________________________________
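
The symptom described above (lines merged in text mode although the newline bytes are present on disk) could be checked independently of the reproduction script along the following lines. This is only a minimal sketch and not part of the original report: it assumes Python 3, and the names `count_newlines_binary`, `scan_text_mode`, `make_sample` and the file name `test_small.txt` are hypothetical. It counts raw newline bytes in binary mode, then compares that with what a text-mode loop yields and records any line longer than expected.

```python
import os
import tempfile


def count_newlines_binary(path, chunk_size=1 << 20):
    """Count b'\\n' bytes by reading raw chunks, bypassing text-mode line splitting."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def scan_text_mode(path, max_len=80):
    """Iterate in text mode; return (line count, indexes of lines longer than max_len)."""
    count = 0
    too_long = []
    with open(path) as handle:
        for index, line in enumerate(handle):
            count += 1
            if len(line) > max_len:
                too_long.append(index)
    return count, too_long


def make_sample(path, n_lines=1000, padding=59):
    """Write a small file using the same line layout as the report's big file."""
    with open(path, "w") as handle:
        for i in range(n_lines):
            handle.write(str(i) + " " * padding + "\n")


if __name__ == "__main__":
    # Small stand-in file; point `sample` at the real multi-GB file to test the bug.
    sample = os.path.join(tempfile.mkdtemp(), "test_small.txt")
    make_sample(sample)
    raw_newlines = count_newlines_binary(sample)
    text_lines, over_long = scan_text_mode(sample)
    print("newline bytes:", raw_newlines)
    print("text-mode lines:", text_lines)
    print("over-long line indexes:", over_long)
```

On a file that shows the problem, the text-mode line count would presumably come out lower than the newline count, and the over-long indexes would point at the merged lines; on a healthy file the two counts agree and the list is empty.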