Re: Working with large files
CarleySimon (jonwhispa)
12-Aug-2008/11:23:34-4:00
#44197
There is also a /with refinement to specify additional line terminators
open/direct/lines/with %file ","
It seems that this splits on both the "," and the newline.
Using Tim's suggestion of checking the last char for a newline, then doing a
remove, a second pick and a rejoin, should fix that.
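
For anyone following along, here is a rough sketch of the carry-over idea when reading fixed-size chunks. It is not Tim's or Brock's actual code: split-chunk is a made-up helper name, the two sample chunks are invented, and it assumes each chunk arrives as a string! with records ending in a plain newline.

```rebol
REBOL [Title: "Carry partial records across chunks (sketch)"]

; split-chunk is a hypothetical helper, not part of REBOL.
; It prepends the leftover from the previous chunk, collects every complete
; newline-terminated record, and returns a block: [records leftover].
split-chunk: func [
    carry [string!] "partial record left over from the previous chunk"
    chunk [string!] "data just read from the file"
    /local data records pos
][
    data: join carry chunk                  ; join copies, carry is untouched
    records: copy []
    while [pos: find data newline] [
        append records copy/part data pos   ; record without its newline
        data: next pos                      ; step past the newline
    ]
    reduce [records copy data]              ; whatever remains is incomplete
]

; Two invented chunks that cut a tab-delimited record in half.
carry: copy ""
foreach chunk ["a^-1^/b^-" "2^/c^-3^/"] [
    result: split-chunk carry chunk
    foreach record first result [probe record]
    carry: second result
]
```

The record split across the two chunks is only emitted once its newline shows up in the second chunk. Anything still in carry after the last chunk is a final record with no trailing newline and needs handling separately.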
Jon
----- Original Message -----
From: "Brock Kalef" <brock.kalef-innovapost.com>
To: <<rebol list address>>
Sent: Tuesday, August 12, 2008 2:05 PM
Subject: [REBOL] Re: Working with large files
>
> Kai,
> Yes, I'm going to need to use the /seek option. I was trying to avoid
> it but it looks like it is the only way to go.
>
> The records that I am working with although not fixed width are tab
> delimited. I could likely come up with a way to work on the fixed
> record size using skip etc, but think it may be just as easy to manage
> by checking if the last character of the block is a #"^/", and if not
> ignoring that record, then starting the next block with the start of
> this record. I should be able to do that easily enough using 'index?.
> I've been playing with it a little and looks very feasible to implement
> with minimal pain. Whether it will slow it down or not isn't too big a
> concern.
>
> Cheers, and thanks for your reply.
>
> Brock
>
> -----Original Message-----
> From: rebol-bounce-rebol.com [mailto:rebol-bounce-rebol.com] On Behalf
> Of Kai Peters
> Sent: August 11, 2008 7:59 PM
> To: Brock Kalef
> Subject: [REBOL] Re: Working with large files
>
>
> On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
>> I'm looking to read 800+ MB web log files and process the log prior to
>> running through an analysis tool. I'm running into "Out of Memory"
>> errors and the odd Rebol crash in attempting to do this.
>>
>> I started out simply reading the data directly into a word and looping
>> through the data. This worked great for the sample data set of 45 MB.
>> This then failed on a 430+ MB file, i.e. data: read/lines %file-name.log
>>
>> I then changed the direct read to use a port, i.e. data-port:
>> open/lines %file-name.log. This worked for the 430+ MB file but then I
>> started getting the errors again for the 800+ MB files.
>>
>> It's now obvious that I will need to read in portions of the file at a
>> time. However, I am unsure how to do this while also ensuring I get all
>> the data. As you can see from my earlier example code, I'm interested in
>> reading a line at a time for simplicity in processing the records, as
>> they are not fixed width (vary in length). My fear is that I will not
>> be able to properly handle the records that are truncated due to the
>> size of the data block I retrieve from the file. Or at least not be able
>> to do this easily. Are there any suggestions?
>>
>> My guess is that I will need to:
>> - pull in a fixed length block of data
>> - read through the data until I reach the first occurrence of a newline
>> - track the index of the location of the newline
>> - continue reading the data until I reach the end of the data block
>> - once reaching the end of the data retrieved, calculate where the last
>>   record processed ended
>> - read the next data block from that point
>> - continue until reaching the end of file
>>
>> Any other suggestions?
>>
>> Regards,
>> Brock Kalef
>
>
> Sounds like a plan to me. Just ran this on a 1.9 GB file and it was
> surprisingly fast (kept my HD busy for sure):
>
> port: open/seek %/c/apache.log
> chunksize: 1'048'576 ; 1 MB chunks
> forskip port chunksize [
>     ; read the next chunk; process it here before the loop moves on
>     chunk: copy/part port chunksize
> ]
> close port
>
> Do you really need to process it line by line though? That would really
> slow it down.
> Sure you cannot operate on the chunks in their entirety somehow?
>
> Cheers,
> Kai
> --
> To unsubscribe from the list, just send an email to lists at rebol.com
> with unsubscribe as the subject.
>