python - Is there a way to efficiently yield every file in a directory containing millions of files?
I'm aware of os.listdir, but as far as I can gather, it reads all the filenames in a directory into memory and then returns the list. What I want is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
Is there any way to do this? I worry about the case where filenames change, new files are added, and files are deleted while using such a method. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the state of the collection at the beginning and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if filesystem changes (adding, removing, or renaming files within the iterated directory) modify the collection?
There are potentially a few cases that could cause the iterator to fail, and it all depends on how the iterator maintains state. Using S.Lott's example:
filea.txt
fileb.txt
filec.txt
The iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt, and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it were to use the filename filea.txt to find its current position in order to find the next file, and filea.txt is not there, what would happen? It may not be able to recover its position in the collection. Similarly, if the iterator were to fetch fileb.txt when yielding filea.txt, it could later look up the position of fileb.txt, fail, and produce an error.
If the iterator was instead able to somehow maintain an index, as in dir.get_file(0), then its positional state would not be affected, but some files could be missed, as their indexes could be moved to an index 'behind' the iterator.
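The index-based case can be illustrated with a toy model in which a plain Python list stands in for the directory. This is only a sketch under that assumption: iter_by_index and the in-memory listing are hypothetical names, and a real filesystem does not promise any stable ordering of entries.

```python
def iter_by_index(listing):
    """Yield entries of a mutable list by position, re-checking its length each step."""
    index = 0
    while index < len(listing):
        yield listing[index]
        index += 1


# Toy stand-in for a directory; a real directory has no guaranteed order.
listing = ["filea.txt", "fileb.txt", "filec.txt"]
seen = []
for name in iter_by_index(listing):
    seen.append(name)
    if name == "filea.txt":
        # Simulate the file being moved away while it is being processed:
        # every later entry shifts one position towards the front.
        listing.remove("filea.txt")

print(seen)  # fileb.txt slipped 'behind' the iterator and was never yielded
```

No error is raised here; the iterator simply skips an entry, which is the silent failure mode described above.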
This is all theoretical of course, since there appears to be no built-in (Python) way of iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.
Edit:
The OS of concern is Redhat. My use case is this:
Process A is continuously writing files to a storage location. Process B (the one I'm writing) will be iterating over these files, doing some processing based on the filename, and moving the files to another location.
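A rough sketch of what process B could look like is shown here. It assumes Python 3.6 or later (where os.scandir can be used in a with statement); drain_directory and the handle callback are hypothetical names, and the sketch does not check whether process A has finished writing a file before moving it.

```python
import os
import shutil


def drain_directory(src, dest, handle):
    """Process and move each regular file in src, one entry at a time.

    Returns the number of files moved. Entries created or removed while the
    scan is running may or may not be seen, so call this repeatedly.
    """
    moved = 0
    with os.scandir(src) as entries:
        for entry in entries:
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                handle(entry.name)  # processing based on the filename only
                shutil.move(entry.path, os.path.join(dest, entry.name))
            except FileNotFoundError:
                # The file disappeared between listing and processing.
                continue
            moved += 1
    return moved
```

Because each pass only sees whatever the directory contained while it was scanned, process B would call drain_directory in a loop, sleeping briefly whenever it returns 0.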
Edit:
Definition of valid:
Adjective 1. Well grounded or justifiable, pertinent.
(Sorry S.Lott, I couldn't resist).
I've edited the paragraph in question above.
tl;dr <update>: As of Python 3.5 (currently in beta), just use os.scandir </update>
As I've written earlier, since "iglob" is just a facade rather than a real iterator, you will have to call low-level system functions in order to get the entries one at a time, as you want. Fortunately, that is doable from Python. You have not told us whether you are on a POSIX (Linux/Mac OS X/other Unix) or Windows system. In the latter case, you should check whether win32api has any call to read "the next entry from a dir", or how to proceed otherwise.
In the former case, you can call the libc functions straight through ctypes and get one directory entry (including naming information) at a time.
The documentation on the C functions is here: http://www.gnu.org/s/libc/manual/html_node/opening-a-directory.html#opening-a-directory
Unfortunately, the "dirent64" C structure is determined at C compile time for each system. I have figured it out for my system, and probably for most others, and put it into Python in the snippet below, but you might want to check "dirent.h" and the other files it includes under /usr/include.
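Here is the snippet using ctypes and libc that I've put together; it lets you get each filename and perform actions on it. Note that reading the char array field defined on the structure gives you a bytes object cut at the first NUL, which os.fsdecode turns into a regular Python string before printing.

    import os
    from ctypes import *

    libc = CDLL("libc.so.6")
    # Directory handles are pointers: declare them, or ctypes truncates them to a C int.
    libc.opendir.argtypes, libc.opendir.restype = [c_char_p], c_void_p
    libc.readdir64.argtypes, libc.readdir64.restype = [c_void_p], c_void_p
    libc.closedir.argtypes = [c_void_p]

    class Dirent(Structure):
        # struct dirent64 as laid out by glibc on Linux; check dirent.h elsewhere.
        _fields_ = [("d_ino", c_uint64), ("d_off", c_int64),
                    ("d_reclen", c_ushort), ("d_type", c_ubyte),
                    ("d_name", c_char * 256)]

    dir_ = libc.opendir(b"/home/jsbueno")
    if not dir_:
        raise OSError("could not open directory")
    while True:
        p = libc.readdir64(dir_)  # returns NULL once the listing is exhausted
        if not p:
            break
        entry = Dirent.from_address(p)
        print(os.fsdecode(entry.d_name))
    libc.closedir(dir_)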
Update: Python 3.5 is now in beta, and in this version the new os.scandir function is available as the materialization of PEP 471 ("a better and faster directory iterator"). It does exactly what is asked here, besides a lot of other optimizations that can deliver a 9-fold speed increase over os.listdir when listing large directories under Windows (a 2-3 fold increase on POSIX systems).
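A minimal sketch of the os.scandir approach might look like this. It assumes Python 3.6 or later, where the scandir iterator can be used as a context manager; iter_filenames is a hypothetical helper name, not part of the standard library.

```python
import os


def iter_filenames(path):
    """Lazily yield the name of each regular file directly inside path."""
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name


# Work on each name as it arrives instead of building a full list first.
for name in iter_filenames("."):
    print(name)
```

Regarding the concern in the question: according to the os.scandir documentation, if a file is added to or removed from the directory after the iterator is created, whether an entry for it is included is unspecified, so no error is raised on concurrent changes.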