<phil_bb>
Hey folks, kinda struggling with CL atm... I have a 1.7GB CSV file that I need to parse into something more coherent, and it looks like every single library I stumble upon has issues with it: some fields contain quotes, some have parens and embedded quotes, and the file is just plain huge. I'm genuinely not sure what I can do to get the full dataset processed. (The dataset is the allCountries file
<phil_bb>
from GeoNames.org)
<phil_bb>
Most of the libraries I have tried end up maxing out my RAM and locking SBCL up. This includes cl-csv, the reader from lisp-stat, and read-csv.
<phil_bb>
A lot of them don't handle quoted fields that contain separators well.
<aeth>
Maybe try writing FFI bindings for a C CSV library?
<aeth>
Or use UIOP:LAUNCH-PROGRAM and use a shell CSV thing
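[A minimal sketch of aeth's suggestion: shell out to an external tool and read its output as a stream. csvkit's csvformat with -t for tab-delimited input is an assumption; any program that writes lines to stdout would do.]
    (let* ((proc (uiop:launch-program
                  '("csvformat" "-t" "allCountries.txt") ; assumed tool/flags
                  :output :stream))
           (out (uiop:process-info-output proc)))
      (unwind-protect
           ;; Consume the child's output line by line; real code would
           ;; parse each LINE here instead of just counting them.
           (loop for line = (read-line out nil)
                 while line
                 count line)
        (uiop:wait-process proc)))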
<phil_bb>
I have already tried to clean it up using csvkit
<phil_bb>
Alas, that only increased the size to 2.3GB
<phil_bb>
My plan is to process it once, and then store it using cl-store in a file for future reference.
<phil_bb>
So I don't ever have to deal with this issue again lol
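[A sketch of that parse-once-then-cache plan. CL-STORE:STORE and CL-STORE:RESTORE are the library's entry points; PARSE-ALL-COUNTRIES and the filenames are made-up placeholders.]
    (defun load-or-parse (raw-file cache-file)
      "Return the parsed dataset, restoring it from CACHE-FILE when present."
      (if (probe-file cache-file)
          (cl-store:restore cache-file)
          (let ((data (parse-all-countries raw-file))) ; hypothetical parser
            (cl-store:store data cache-file)
            data)))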
<aeth>
have you tried emacs?
<aeth>
replace-regexp, string-rectangle, replace-string, etc., are all very useful
<phil_bb>
josrr: thanks, I'll try that one. Currently experimenting with cl-csv a little; something about the way parts of the file are quoted doesn't play nice with it.
<phil_bb>
gilberth: yes, that, though I am mostly converting it to a CSV, escaping the quotes. Seems to be doing something.
<gilberth>
A mere (with-open-file (i "AllCountries.txt") (loop for x = (read-line i nil) while x collect (split-sequence #\tab x))) would have done that too.
<phil_bb>
Hm. I'm not sure why but trying to use fare-csv:read-csv-file maxes out my RAM, and the SBCL process freezes up.
<gilberth>
Does it freeze, or does it crash when the heap is exhausted? Did you pass --dynamic-space-size to SBCL?
<phil_bb>
I have 16GB allocated in the dynamic space size. It gets up to 15.5GB, and the SBCL process becomes unresponsive
<phil_bb>
All I'm running really is (fare-csv:read-csv-file *file*)
<phil_bb>
sly-interrupt does nothing
<josrr>
I'm using --dynamic-space-size 16384, but perhaps read one line at a time
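[For reference: the heap limit is given on the command line in megabytes, and the value SBCL actually picked up can be checked from the REPL with SB-EXT:DYNAMIC-SPACE-SIZE, which returns bytes.]
    ;; shell: sbcl --dynamic-space-size 16384   ; 16384 MB = 16GB
    (/ (sb-ext:dynamic-space-size) (expt 1024.0 3)) ; => configured heap in GB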
<gilberth>
Yes, perhaps it freezes while trying to crash. But something is not right. The file is 1.7GB; SBCL strings are 4 bytes per character, so about four times that, call it 7GB, for the strings. It has what, 13M lines of a dozen or so fields each? Call it 2.5GB for the cons cells. Well below 16GB.
<phil_bb>
I'm quite aware. This happens with most libraries I try to use.
* gilberth
tries at home.
<gilberth>
phil_bb: I used that LOOP from above. Heap usage according to top is 18GB.
<gilberth>
Heap limit is set to 64GB.
<phil_bb>
Good lord. How even?
<phil_bb>
That's way excessive.
<gilberth>
Heap limit was 64GB, and SBCL will happily grow the heap when it thinks it's allowed to. Annoying, but to SBCL only speed matters. I'm curious and will try with a smaller heap.
<gilberth>
Question: How to trigger a global GC with SBCL?
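[For the record, SBCL exposes this through SB-EXT:GC; :FULL forces a collection of all generations.]
    (sb-ext:gc :full t) ; trigger a global/full GC in SBCL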
<gilberth>
josrr: Only speed matters. Increase your heap limit.
<jeffrey>
phil_bb, why not process line by line and dump the result to an output stream immediately?
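[A sketch of jeffrey's streaming approach, which keeps memory flat because no line is retained; TRANSFORM-LINE and the output filename are hypothetical.]
    (with-open-file (in "allCountries.txt")
      (with-open-file (out "processed.txt"
                           :direction :output :if-exists :supersede)
        ;; Read, transform, and write each line immediately; nothing
        ;; accumulates, so memory use stays constant.
        (loop for line = (read-line in nil)
              while line
              do (write-line (transform-line line) out))))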
<phil_bb>
Mostly for ease of thinking about it. First I'm trying to build a hash-table that holds all these data points in a nested structure. Then my plan is to keep the entire thing in memory so I can inspect each point's relationships with the other points.
<phil_bb>
Basically I want to be able to validate that the hash-table matches the raw data
<phil_bb>
It's a silly side-project that helps me debrain after work, basically
<phil_bb>
nothing serious
<gilberth>
As this file is #\Tab-separated, no CSV library is even needed; a library is just another puzzle piece whose behavior we don't know without looking at it. So I would just go with that LOOP. You could also populate your hash table right in that loop.
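[A sketch of building the hash table inside that same loop. That the first tab-separated field is the geonameid follows the GeoNames readme, but treat it as an unverified assumption; split-sequence comes from the split-sequence system.]
    (defun load-geonames (path)
      (let ((table (make-hash-table :test #'equal)))
        (with-open-file (in path)
          (loop for line = (read-line in nil)
                while line
                do (let ((fields (split-sequence:split-sequence #\Tab line)))
                     ;; Key each record by its first field (assumed geonameid).
                     (setf (gethash (first fields) table) fields))))
        table))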
<phil_bb>
Suppose so. Currently it's not doing anything. I guess the TL;DR of what I'm trying to do is create a human simulator, where all the basic body processes are driven by Lisa (the expert system), and on top of that I want behaviors, but for behaviors I need an environment.
<phil_bb>
Since I never planned to finish this ever, I thought what better way than import all of Earth?
<phil_bb>
Or, at least a huge chunk of the Earth.
<phil_bb>
I'm basically creating complexity for the sake of personal satisfaction here, coding as therapy.
<gilberth>
phil_bb: WITH-OPEN-FILE just opens a stream. It doesn't read anything or pick out lines or anything like that.
<phil_bb>
ahhhhhhhh
<jeffrey>
isn't that FOR ROW doing something you do not want?
<gilberth>
If cl-csv:read-csv reads the whole file and returns a list of rows, try saying FOR ROW IN (cl-csv:read-csv ...) instead and remove the :WHILE ROW.
<phil_bb>
jeffrey: Yes but we'll see, I just want to get over this hump at the moment.
<jeffrey>
nvm, I never knew FOR could be used for variable assignment as well
<jeffrey>
as in =
<gilberth>
This loop only makes sense if READ-CSV reads a single row. I suspect it rather returns a list of rows. So it should be FOR ROW IN, not FOR ROW =, and that :WHILE ROW should go away. Or just try that SPLIT-SEQUENCE version of mine.
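[The corrected loop gilberth describes, next to cl-csv's :row-fn keyword, which, if I recall its API correctly, hands each parsed row to a function instead of building the full list; PROCESS-ROW is a placeholder.]
    ;; Whole file at once: READ-CSV returns a list of rows.
    (with-open-file (in "allCountries.csv")
      (loop for row in (cl-csv:read-csv in)
            do (process-row row)))

    ;; One row at a time, without materializing the list (assuming
    ;; cl-csv's :row-fn keyword, as I recall it):
    (with-open-file (in "allCountries.csv")
      (cl-csv:read-csv in :row-fn #'process-row))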
<gilberth>
And make a smaller file for testing!
<gilberth>
Or stick a REPEAT 100 clause into the LOOP for testing.
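[The SPLIT-SEQUENCE loop from above with the REPEAT cap gilberth suggests, for a quick smoke test.]
    (with-open-file (in "allCountries.txt")
      (loop repeat 100                      ; stop after 100 lines
            for line = (read-line in nil)
            while line
            collect (split-sequence:split-sequence #\Tab line)))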
<phil_bb>
I see
<phil_bb>
thanks
<josrr>
phil_bb: I tried increasing --dynamic-space-size to 20000, and fare-csv:read-csv-file still exhausted the heap; gilberth's loop worked with 16384. For splitting the string, I used the function split-sequence from the split-sequence system.
<phil_bb>
Nifty
<phil_bb>
I was also considering using the data-frames system