C#: Fastest way to read specific columns in CSV files
I have a large CSV file (millions of records). I have developed a smart search algorithm that locates specific line ranges in the file to avoid parsing the whole file.
Now I'm facing a trickier issue: I'm only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200MB file to retrieve the content of a specific column?
Unless all CSV fields have a fixed width (and, if a field is empty, there are still n bytes of blank space between the separators surrounding it), no.
If they do,
then each row, in turn, has a fixed length, and therefore you can skip straight to the first value of your column; once you've read it, you can advance to the next row's value for the same field without having to read any of the intermediate values.
I think this is pretty simple - and since I'm on a roll at the moment (and at lunch), I'm going to finish it anyway :)
To do this, you first want to know how long each row is in characters (adjusting to bytes according to the encoding - Unicode, UTF-8 etc.):
row_len = sum(widths[0..n-1]) + (n-1) + row_sep_length
where n is the total number of columns on each row - a constant for the whole file. We add n-1 to
account for the separators between column values,
and row_sep_length is the
length of the separator between two rows - a newline, or potentially a [carriage-return & line-feed] pair.
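As a quick sanity check of that formula, here is a minimal sketch; the column widths and the two-character row terminator are invented purely for illustration:

```csharp
using System;
using System.Linq;

class RowLengthDemo
{
    static void Main()
    {
        // Hypothetical fixed column widths, in characters.
        int[] widths = { 10, 4, 8 };
        int n = widths.Length;
        int rowSepLength = 2; // assuming a "\r\n" row terminator

        // row_len = sum(widths) + (n-1) separators + the row separator
        int rowLen = widths.Sum() + (n - 1) + rowSepLength;

        Console.WriteLine(rowLen); // 22 + 2 + 2
    }
}
```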
The value of column row[r]col[i] is at an offset of
offset
characters from the start of row[r], where offset is
defined as:
offset = i > 0 ? sum(widths[0..i-1]) + i : 0; //i.e. the sum of the widths of the columns before col[i] //plus 1 character for each separator between adjacent columns
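A small sketch of that offset calculation, again with hypothetical column widths and assuming single-character separators:

```csharp
using System;
using System.Linq;

class OffsetDemo
{
    static void Main()
    {
        int[] widths = { 10, 4, 8 }; // hypothetical column widths

        for (int i = 0; i < widths.Length; i++)
        {
            // Sum of the widths of the columns before col[i],
            // plus 1 character for each separator crossed.
            int offset = i > 0 ? widths.Take(i).Sum() + i : 0;
            Console.WriteLine($"col[{i}] starts at offset {offset}");
        }
    }
}
```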
And then, assuming you've read the whole column value, up to the next separator, the offset to the starting character of the next column value at row[r+1]col[i] is
calculated by subtracting the width of the column from the row length. This is yet another constant for the file:
next-field-offset = row_len - widths[i]; //widths[i] is the width of the field being read.
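To see why this stride is constant, here is the same invented layout as a sketch; after consuming the widths[i] characters of one value, exactly next-field-offset characters remain before the same column on the next row:

```csharp
using System;
using System.Linq;

class StrideDemo
{
    static void Main()
    {
        int[] widths = { 10, 4, 8 }; // hypothetical column widths
        int n = widths.Length;
        int rowSepLength = 2;        // assuming "\r\n"
        int rowLen = widths.Sum() + (n - 1) + rowSepLength;

        int i = 1; // the column we're reading
        // Characters to skip after reading one value to land on the
        // start of the same column in the next row.
        int nextFieldOffset = rowLen - widths[i];
        Console.WriteLine(nextFieldOffset); // rowLen (26) - widths[1] (4)
    }
}
```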
All the while, i is
zero-based, as is the indexing of the vectors/arrays in the pseudo-code.
To read, then, you first advance the file pointer by offset
characters - taking you to the first value you want. You read that value (taking you up to the next separator), and then advance the file pointer by next-field-offset
characters. If you reach EOF
at that point, you're done.
I might have missed a character either way in this - so if it's applicable, do check it!
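Putting the pieces together, here is one possible sketch of the seek-and-read loop. It assumes a single-byte encoding (ASCII) so that characters map 1:1 to bytes, and the column widths, separators, and sample data are all invented for the demo:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

class ColumnReader
{
    // Reads every value of column i from a fixed-width CSV,
    // seeking over the bytes in between instead of parsing each line.
    static List<string> ReadColumn(string path, int[] widths, int i, int rowSepLength)
    {
        int n = widths.Length;
        int rowLen = widths.Sum() + (n - 1) + rowSepLength;
        int offset = i > 0 ? widths.Take(i).Sum() + i : 0;
        int nextFieldOffset = rowLen - widths[i];

        var values = new List<string>();
        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        fs.Seek(offset, SeekOrigin.Begin); // jump to the first value

        var buffer = new byte[widths[i]];
        while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
        {
            values.Add(Encoding.ASCII.GetString(buffer).Trim());
            // Skip ahead to the same column on the next row.
            fs.Seek(nextFieldOffset, SeekOrigin.Current);
        }
        return values;
    }

    static void Main()
    {
        // Build a tiny fixed-width CSV in a temp file (hypothetical data):
        // widths are 10, 4 and 8 characters, "," separators, "\r\n" rows.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "Alice     ,25  ,Engineer\r\n" +
            "Bob       ,31  ,Designer\r\n");

        foreach (var v in ReadColumn(path, new[] { 10, 4, 8 }, 1, rowSepLength: 2))
            Console.WriteLine(v);

        File.Delete(path);
    }
}
```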
This only works if you can guarantee that all field values - including nulls - on all rows are the same length, that all separators are the same length, and that all row separators are the same length. If not, this approach won't work,
and you'll have to do it the slow way - find the column in each line and do whatever you need to do with it.
If you're doing a significant amount of work on each column value, one optimisation would be to pull the column values out first into a list (with a known initial capacity set, too), or in batches (of 100,000 at a time or so), and then iterate through those.
If you keep each loop focused on a single task, it should be more efficient than one big loop.
Equally, once you've batched the 100,000 column values you could use Parallel LINQ to distribute the second loop (but not the first, since there's no point parallelising reading from the file).
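A sketch of that batching pattern - the first loop reads sequentially, the second does the per-value work in parallel. The batch source and the Process work here are stand-ins invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BatchDemo
{
    const int BatchSize = 100_000;

    // Stand-in for sequentially reading column values from the file
    // in batches of a known size.
    static IEnumerable<List<string>> ReadBatches(IEnumerable<string> columnValues)
    {
        var batch = new List<string>(BatchSize);
        foreach (var v in columnValues)
        {
            batch.Add(v);
            if (batch.Count == BatchSize)
            {
                yield return batch;
                batch = new List<string>(BatchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }

    // Hypothetical per-value work.
    static int Process(string value) => value.Length;

    static void Main()
    {
        var values = Enumerable.Range(0, 250_000).Select(i => i.ToString());

        long total = 0;
        foreach (var batch in ReadBatches(values))
        {
            // Second loop: distribute the work with Parallel LINQ.
            total += batch.AsParallel().Sum(v => (long)Process(v));
        }
        Console.WriteLine(total);
    }
}
```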