REBOL.net

Re: Working with large files

Brock Kalef (brock.kalef)
12-Aug-2008/13:52:08-4:00
#44198

Thanks to everyone for their feedback/suggestions.

I seem to have a solution that backtracks to the start of any
incomplete record.  It should work on any data that is newline
terminated.  The 'size word sets how much data each read grabs: the
number is the count of bytes to copy from the data file.

rebol[]
port: open/seek %"Sample data/simplified.log"
size: 130

cnt: 1
while [not tail? port] [
	data: copy/part port size
	working-data: copy data
	either (last working-data) = #"^/" [
		use-last-record?: true
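		; chunk ends on a record boundary: advance by exactly the bytes read
		; (index? data is 1, so 1 + size would skip one byte too many)
		start-at: length? data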
	][
		use-last-record?: false
		either not error? try [start-at: (index? find/reverse tail data "^/")][
			; new starting point of next read since block didn't end in full record
			start-at: (index? find/reverse tail data "^/")
		][
			; no newline anywhere in this chunk: advance by the bytes read
			start-at: length? data
		]
	]
	working-data: parse/all working-data "^/"
	record-cnt: length? working-data
	print ["Record Count " :cnt ": " record-cnt]
	print ["First Record:^/" first working-data]
	print ["Use last record?: " use-last-record?]
	print ["Last Record:^/" last working-data newline newline]

	port: skip port start-at
	cnt: cnt + 1

]
close port
halt


If anyone wants to try this for themselves, here's a sample data file.
Copy it, save it to disk, and change the file path in the script above.
I used this data because it makes it easy to see which record you are
in.  When you save the file, make sure there is an empty line at the
end of the data file.

1 record1 record1recordonerecord1 end
2 recordtwo record2 record 2 record 2 end
3 rec3 recordthree record3 record 3 record3 end
4 record 4 record4 recordfour record14 end
5 recordfive record5 record 5 record 5 end
6 rec6 recordsix record6 record 6 record6 end
7 record 7 record7 recordseven record7 end
8 record 8 record8 recordeight record8 end
9 recordnine record9 record 9 record 9 end
10 rec10 recordten record10 record 10 record10 end
11 record 11 record11 recordeleven record11 end
12 recordtwelve record12 record 12 record 12 end
13 rec13 recordthirteen record13 record 13 record13 end
14 record 14 record14 recordfourteen record14 end
15 recordfifteen record15 record 15 record 15 end
16 rec16 recordsixteen record16 record 16 record16 end

I just finished running the above script on a 900+ MB file and it
processed through to the end no problem.
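
To do real work on each record instead of printing, the same loop could
be wrapped in a function along these lines.  This is only an untested
sketch: 'for-each-record, 'chunk-size and 'handler are made-up names,
and it assumes the chunk size is larger than the longest record (a
longer record would be handed over in pieces).

```rebol
rebol []

; Hypothetical helper, not part of the script above: calls handler once
; for every newline-terminated record, reading chunk-size bytes at a time.
for-each-record: func [
	file [file!] chunk-size [integer!] handler [any-function!]
	/local port data mark
][
	port: open/seek file
	while [not tail? port] [
		data: copy/part port chunk-size
		; position of the last newline in this chunk, or none
		mark: find/last data "^/"
		either mark [
			; advance past the last newline; a partial record after it
			; is read again at the start of the next chunk
			port: skip port index? mark
			; keep only the complete records (drops the final newline)
			data: copy/part data mark
		][
			; no newline at all: assumed to be the unterminated tail of the file
			port: skip port length? data
		]
		foreach record parse/all data "^/" [handler record]
	]
	close port
]
```

It could then be called like this:

```rebol
for-each-record %"Sample data/simplified.log" 130 func [record] [print record]
```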

Brock


-----Original Message-----
From: rebol-bounce-rebol.com [mailto:rebol-bounce-rebol.com] On Behalf
Of CarleySimon
Sent: August 12, 2008 11:23 AM
To: <rebol list address>
Subject: [REBOL] Re: Working with large files


There is also a /with refinement to specify additional line terminators

open/direct/lines/with %file ","

It seems that it then splits on both the "," and newline.
Using Tim's suggestion and checking the last char for a newline and
doing a remove, second pick and a rejoin should fix that.
Jon

----- Original Message -----
From: "Brock Kalef" <brock.kalef-innovapost.com>
To: <<rebol list address>>
Sent: Tuesday, August 12, 2008 2:05 PM
Subject: [REBOL] Re: Working with large files


>
> Kai,
> Yes, I'm going to need to use the /seek option.  I was trying to avoid
> it but it looks like it is the only way to go.
>
> The records that I am working with although not fixed width are tab
> delimited.  I could likely come up with a way to work on the fixed
> record size using skip etc, but think it may be just as easy to manage
> by checking if the last character of the block is a #"^/", and if not
> ignoring that record, then starting the next block with the start of
> this record.  I should be able to do that easily enough using 'index?.
> I've been playing with it a little and looks very feasible to implement
> with minimal pain.  Whether it will slow it down or not isn't too big a
> concern.
>
> Cheers, and thanks for your reply.
>
> Brock
>
> -----Original Message-----
> From: rebol-bounce-rebol.com [mailto:rebol-bounce-rebol.com] On Behalf
> Of Kai Peters
> Sent: August 11, 2008 7:59 PM
> To: Brock Kalef
> Subject: [REBOL] Re: Working with large files
>
>
> On Mon, 11 Aug 2008 15:11:56 -0400, Brock Kalef wrote:
>> I'm looking to read 800+ MB web log files and process the log prior
>> to running through an analysis tool.  I'm running into "Out of Memory"
>> errors and the odd Rebol Crash in attempting to do this.
>>
>> I started out simply reading the data directly into a word and looping
>> through the data.  This worked great for the sample data set of 45 MB.
>> This then failed on a 430+ MB file.  i.e.  data: read/lines %file-name.log
>>
>> I then changed the direct read to use a port, i.e.
>> data-port: open/lines %file-name.log.  This worked for the 430+ MB
>> file but then I started getting the errors again for the 800+ MB files.
>>
>> It's now obvious that I will need to read in portions of the file at a
>> time.  However, I am unsure how to do this while also ensuring I get
>> all the data.  As you can see from my earlier example code, I'm
>> interested in reading a line at a time for simplicity in processing
>> the records as they are not fixed width (vary in length).  My fear is
>> that I will not be able to properly handle the records that are
>> truncated due to the size of the data block I retrieve from the file.
>> Or at least not be able to do this easily.  Are there any suggestions?
>>
>> My guess is that I will need to;
>> -  pull in a fixed length block of data
>> -  read through the data until I reach the first occurrence of a newline
>> -  track the index of the location of the newline
>> -  continue reading the data until I reach the end of the data-block
>> -  once reaching the end of the data retrieved, calculate where the
>>    last record processed ended
>> -  read the next data block from that point
>> -  continue until reaching the end of file
>>
>> Any other suggestions?
>>
>> Regards,
>> Brock Kalef
>
>
> Sounds like a plan to me. Just ran this on a 1.9 GB file and it was
> surprisingly fast (kept my HD busy for sure):
>
> port: open/seek %/c/apache.log
> chunksize: 1'048'576  ; 1 MB chunks
> forskip port chunksize [
>  chunk: copy/part port chunksize
> ]
> close port
>
> Do you really need to process it line by line though? That would really
> slow it down.
> Sure you cannot operate on the chunks in their entirety somehow?
>
> Cheers,
> Kai
> --
> To unsubscribe from the list, just send an email to lists at rebol.com
> with unsubscribe as the subject.


