C#: Fastest way to read specific columns in CSV files
I have a large CSV file (millions of records). I have developed a smart search algorithm that locates specific line ranges in the file to avoid parsing the whole file.
Now I'm facing a trickier issue: I'm only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200MB file to retrieve the content of a specific column?
Unless all CSV fields have a fixed width (and, if a field is empty, there are still n bytes of blank space between the separators surrounding it), no.
If they do,
then each row, in turn, has a fixed length, and therefore you can skip straight to the first value of your column; once you've read it, you can advance to the next row's value for the same field without having to read any of the intermediate values.
I think this is pretty simple - and since I'm on a roll at the moment (and at lunch), I'm going to finish it anyway :)
To do this, you first want to know how long each row is in characters (adjusting to bytes according to the encoding - Unicode, UTF-8 etc.):
row_len = sum(widths[0..n-1]) + (n-1) + row_sep_length
where n is the total number of columns on each row - a constant for the whole file. We add n-1 to
account for the separators between column values,
and row_sep_length is the
length of the separator between two rows - a newline, or potentially a [carriage-return & line-feed] pair.
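As a quick sanity check of that formula, here is a minimal sketch; the column widths and the two-character row terminator are invented purely for illustration:

```csharp
using System;
using System.Linq;

class RowLengthDemo
{
    static void Main()
    {
        // Hypothetical fixed column widths, in characters.
        int[] widths = { 10, 4, 8 };
        int n = widths.Length;
        int rowSepLength = 2; // assuming a "\r\n" row terminator

        // row_len = sum(widths) + (n-1) separators + the row separator
        int rowLen = widths.Sum() + (n - 1) + rowSepLength;

        Console.WriteLine(rowLen); // 22 + 2 + 2
    }
}
```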
The value of column row[r]col[i] is at an offset of
offset
characters from the start of row[r], where offset is
defined as:
offset = i > 0 ? sum(widths[0..i-1]) + i : 0; //i.e. the sum of the widths of the columns before col[i] //plus 1 character for each separator between adjacent columns
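A small sketch of that offset calculation, again with hypothetical column widths and assuming single-character separators:

```csharp
using System;
using System.Linq;

class OffsetDemo
{
    static void Main()
    {
        int[] widths = { 10, 4, 8 }; // hypothetical column widths

        for (int i = 0; i < widths.Length; i++)
        {
            // Sum of the widths of the columns before col[i],
            // plus 1 character for each separator crossed.
            int offset = i > 0 ? widths.Take(i).Sum() + i : 0;
            Console.WriteLine($"col[{i}] starts at offset {offset}");
        }
    }
}
```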
And then, assuming you've read the whole column value, up to the next separator, the offset to the starting character of the next column value at row[r+1]col[i] is
calculated by subtracting the width of the column from the row length. This is yet another constant for the file:
next-field-offset = row_len - widths[i]; //widths[i] is the width of the field being read.
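To see why this stride is constant, here is the same invented layout as a sketch; after consuming the widths[i] characters of one value, exactly next-field-offset characters remain before the same column on the next row:

```csharp
using System;
using System.Linq;

class StrideDemo
{
    static void Main()
    {
        int[] widths = { 10, 4, 8 }; // hypothetical column widths
        int n = widths.Length;
        int rowSepLength = 2;        // assuming "\r\n"
        int rowLen = widths.Sum() + (n - 1) + rowSepLength;

        int i = 1; // the column we're reading
        // Characters to skip after reading one value to land on the
        // start of the same column in the next row.
        int nextFieldOffset = rowLen - widths[i];
        Console.WriteLine(nextFieldOffset); // rowLen (26) - widths[1] (4)
    }
}
```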
All the while, i is
zero-based, as is the indexing of the vectors/arrays in the pseudo-code.
To read, then, you first advance the file pointer by offset
characters - taking you to the first value you want. You read that value (taking you up to the next separator), and then advance the file pointer by next-field-offset
characters. If you reach EOF
at that point, you're done.
I might have missed a character either way in this - so if it's applicable, do check it!
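Putting the pieces together, here is one possible sketch of the seek-and-read loop. It assumes a single-byte encoding (ASCII) so that characters map 1:1 to bytes, and the column widths, separators, and sample data are all invented for the demo:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

class ColumnReader
{
    // Reads every value of column i from a fixed-width CSV,
    // seeking over the bytes in between instead of parsing each line.
    static List<string> ReadColumn(string path, int[] widths, int i, int rowSepLength)
    {
        int n = widths.Length;
        int rowLen = widths.Sum() + (n - 1) + rowSepLength;
        int offset = i > 0 ? widths.Take(i).Sum() + i : 0;
        int nextFieldOffset = rowLen - widths[i];

        var values = new List<string>();
        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        fs.Seek(offset, SeekOrigin.Begin); // jump to the first value

        var buffer = new byte[widths[i]];
        while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
        {
            values.Add(Encoding.ASCII.GetString(buffer).Trim());
            // Skip ahead to the same column on the next row.
            fs.Seek(nextFieldOffset, SeekOrigin.Current);
        }
        return values;
    }

    static void Main()
    {
        // Build a tiny fixed-width CSV in a temp file (hypothetical data):
        // widths are 10, 4 and 8 characters, "," separators, "\r\n" rows.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "Alice     ,25  ,Engineer\r\n" +
            "Bob       ,31  ,Designer\r\n");

        foreach (var v in ReadColumn(path, new[] { 10, 4, 8 }, 1, rowSepLength: 2))
            Console.WriteLine(v);

        File.Delete(path);
    }
}
```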
This only works if you can guarantee that all field values - including nulls - on all rows are the same length, that all separators are the same length, and that all row separators are the same length. If not, this approach won't work,
and you'll have to do it the slow way - find the column in each line and do whatever you need to do with it.
If you're doing a significant amount of work on each column value, one optimisation would be to pull the column values out first into a list (with a known initial capacity set, too), or in batches (of 100,000 at a time or so), and then iterate through those.
If you keep each loop focused on a single task, it should be more efficient than one big loop.
Equally, once you've batched the 100,000 column values you could use Parallel LINQ to distribute the second loop (but not the first, since there's no point parallelising reading from the file).
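A sketch of that batching pattern - the first loop reads sequentially, the second does the per-value work in parallel. The batch source and the Process work here are stand-ins invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BatchDemo
{
    const int BatchSize = 100_000;

    // Stand-in for sequentially reading column values from the file
    // in batches of a known size.
    static IEnumerable<List<string>> ReadBatches(IEnumerable<string> columnValues)
    {
        var batch = new List<string>(BatchSize);
        foreach (var v in columnValues)
        {
            batch.Add(v);
            if (batch.Count == BatchSize)
            {
                yield return batch;
                batch = new List<string>(BatchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Hypothetical per-value work.
    static int Process(string value) => value.Length;

    static void Main()
    {
        var values = Enumerable.Range(0, 250_000).Select(i => i.ToString());

        long total = 0;
        foreach (var batch in ReadBatches(values))
        {
            // Second loop: distribute the work with Parallel LINQ.
            total += batch.AsParallel().Sum(v => (long)Process(v));
        }
        Console.WriteLine(total);
    }
}
```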