Re: Working with large files
Kai Peters (kpeters)
11-Aug-2008/19:59:32-4:00
#44193
On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
> I'm looking to read 800+ MB web log files and process the log prior to running through an
> analysis tool. I'm running into "Out of Memory" errors and the odd Rebol Crash in attempting to
> do this.
>
> I started out simply reading the data directly into a word and looping through the data. This
> worked great for the sample data set of 45 MB. This then failed on a 430+ MB file, i.e.
> data: read/lines %file-name.log
>
> I then changed the direct read to use a port, i.e. data-port: open/lines %file-name.log. This
> worked for the 430+ MB file, but then I started getting the errors again for the 800+ MB files.
>
> It's now obvious that I will need to read in portions of the file at a time. However, I am
> unsure how to do this while also ensuring I get all the data. As you can see from my earlier
> example code, I'm interested in reading a line at a time for simplicity in processing the records,
> as they are not fixed width (they vary in length). My fear is that I will not be able to properly
> handle the records that are truncated due to the size of the data block I retrieve from the file.
> Or at least not be able to do this easily. Are there any suggestions?
>
> My guess is that I will need to:
> - pull in a fixed length block of data
> - read through the data until I reach the first occurrence of a newline
> - track the index of the location of the newline
> - continue reading the data until I reach the end of the data block
> - once reaching the end of the data retrieved, calculate where the last processed record ended
> - read the next data block from that point
> - continue until reaching the end of file
>
> Any other suggestions?
>
> Regards,
> Brock Kalef
Sounds like a plan to me. Just ran this on a 1.9 GB file and it was surprisingly fast (kept my HD
busy for sure):
port: open/seek %/c/apache.log          ; /seek opens the file for random access
chunksize: 1'048'576                    ; 1 MB chunks
forskip port chunksize [
    chunk: copy/part port chunksize     ; read the next chunk; process it here
]
close port
Do you really need to process it line by line though? That would really slow it down.
Sure you cannot operate on the chunks in their entirety somehow?
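If line-by-line handling does turn out to be necessary, the truncated-record worry can be handled by carrying the unfinished tail of each chunk over to the next one. Below is a minimal sketch of that idea, written in Python rather than REBOL since the logic is the same in either. The name `iter_lines` and the default chunk size are illustrative assumptions, not something taken from the run above:

```python
def iter_lines(path, chunksize=1_048_576):
    """Yield complete lines from path, reading chunksize bytes at a time.

    Hypothetical helper for illustration only.
    """
    rest = b""  # unfinished record carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break  # end of file
            # Prepend the carried-over tail, then split on newlines.
            lines = (rest + chunk).split(b"\n")
            # The last piece has no newline after it yet, so it may be
            # a truncated record: hold it back for the next chunk.
            rest = lines.pop()
            for line in lines:
                yield line.rstrip(b"\r")  # tolerate CRLF line endings
    if rest:
        # Final record when the file does not end with a newline.
        yield rest.rstrip(b"\r")
```

Each record is only handed out once its terminating newline has been seen, so no record is lost or split regardless of where the chunk boundaries fall. Memory use stays on the order of the chunk size plus the longest record, not the size of the file.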
Cheers,
Kai