C#: Fastest way to read specific columns in CSV files


I have a large CSV file (millions of records). I have developed a smart search algorithm that locates specific line ranges in the file, to avoid parsing the whole file.

Now I am facing a trickier issue: I am only interested in the content of a specific column. Is there a smart way to avoid looping line by line through a 200 MB file and retrieve only the content of a specific column?

Unless all the CSV fields have a fixed width (and if a field is empty there are still n bytes of blank space between the separators surrounding it), no.

If yes

Then each row, in turn, has a fixed length, and therefore you can skip straight to the first value of your column and, once you've read it, advance to the next row's value for the same field, without having to read any of the intermediate values.

I think this is pretty simple - but I'm on a roll at the moment (and at lunch), so I'm going to finish it anyway :)

To do this, you first want to know how long each row is in characters (adjust for bytes according to Unicode, UTF-8 etc.):

row_len = sum(widths[0..n-1]) + n-1 + row_sep_length 

Where n is the total number of columns on each row - this is constant for the whole file. We add n-1 to account for the separators between column values within a row.

And row_sep_length is the length of the separator between two rows - a newline, or potentially a [carriage-return & line-feed] pair.
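For example, a minimal C# sketch of that calculation (the widths array and its values are just placeholder assumptions):

using System;
using System.Linq;

int[] widths = { 10, 8, 24, 12 };   // hypothetical fixed column widths
int n = widths.Length;              // total number of columns
int rowSepLength = 2;               // "\r\n"; use 1 for a bare "\n"

// row_len = sum(widths[0..n-1]) + n-1 + row_sep_length
int rowLen = widths.Sum() + (n - 1) + rowSepLength;
Console.WriteLine(rowLen);          // 59 for the widths above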

The value of column row[r]col[i] is offset characters from the start of row[r], where offset is defined as:

offset = i > 0 ? sum(widths[0..i-1]) + i : 0; //the sum of the widths of the columns before col[i] //plus 1 character for each separator between adjacent columns 

And then, assuming you've read the whole column value, up to the next separator, the offset to the starting character of the next column value row[r+1]col[i] is calculated by subtracting the width of your column from the row length. This is yet another constant for the file:

next-field-offset = row_len - widths[i]; //widths[i] is the width of the field you are reading. 
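As a sketch, both offsets in C# (continuing with the hypothetical widths array from above):

using System.Linq;

// Offset of col[i] from the start of a row: the widths of all the
// columns before it, plus one separator character after each of them.
static int Offset(int[] widths, int i) =>
    i > 0 ? widths.Take(i).Sum() + i : 0;

// Characters to skip from the end of one value to reach the start of
// the same column's value on the next row.
static int NextFieldOffset(int[] widths, int rowLen, int i) =>
    rowLen - widths[i];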

All the while - i is zero-based in the pseudo code, as is the indexing of the vectors/arrays.

To read, then, you first advance the file pointer offset characters - taking you to the first value you want. You read the value (taking you up to the next separator) and then advance the file pointer next-field-offset characters. If you reach EOF at that point, you're done.

I might have missed a character either way in this - so if it's applicable to you - do check it!
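Put together, a minimal sketch of the whole seek-and-read loop (assuming a single-byte encoding such as ASCII so characters map 1:1 to bytes; the path and widths are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static List<string> ReadColumn(string path, int[] widths, int i)
{
    int rowSepLength = 2;                               // "\r\n"
    int rowLen = widths.Sum() + (widths.Length - 1) + rowSepLength;
    int offset = i > 0 ? widths.Take(i).Sum() + i : 0;
    int skip = rowLen - widths[i];                      // next-field-offset

    var values = new List<string>();
    var buffer = new byte[widths[i]];

    using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
    fs.Seek(offset, SeekOrigin.Begin);                  // first value we want
    while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
    {
        values.Add(Encoding.ASCII.GetString(buffer));
        fs.Seek(skip, SeekOrigin.Current);              // same column, next row
    }
    return values;                                      // loop ends at EOF
}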

This only works if you can guarantee that all field values - including nulls - keep all rows the same length, that all the field separators are the same length, and that all the row separators are the same length. If not - this approach won't work.

If not

Then you'll have to do it the slow way - find the column in each line and do whatever you need to do.
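A minimal sketch of that (File.ReadLines streams the file a line at a time; the path, separator and column index are assumptions, and a plain Split won't handle quoted fields containing commas):

using System;
using System.IO;

int colIndex = 2;                            // hypothetical target column
foreach (string line in File.ReadLines("data.csv"))
{
    string value = line.Split(',')[colIndex];
    // ... do whatever you need to do with value ...
}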

If you're doing a significant amount of work on the column value each time, one optimisation would be to pull out all the column values first into a list (with a known initial capacity set, too) or an array (batching at 100,000 at a time or something like that), and then iterate through those.

If you keep each loop focused on a single task, it should be more efficient than one big loop.

Equally, once you've batched those 100,000 column values you could use Parallel LINQ to distribute the second loop (but not the first, since there's no point parallelising reading from the file).
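A sketch of both ideas together - the first loop only extracts values into a pre-sized batch, and PLINQ fans out the actual work (BatchSize and Process are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

const int BatchSize = 100_000;
int colIndex = 2;                                // hypothetical target column
var batch = new List<string>(BatchSize);         // known initial capacity

foreach (string line in File.ReadLines("data.csv"))
{
    batch.Add(line.Split(',')[colIndex]);        // loop 1: extraction only
    if (batch.Count == BatchSize)
    {
        batch.AsParallel().ForAll(Process);      // loop 2: parallel work
        batch.Clear();
    }
}
if (batch.Count > 0)
    batch.AsParallel().ForAll(Process);          // the final partial batch

static void Process(string value)
{
    // the significant per-value work would go here
}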

