Users can upload files to my server's file system via my web app. An uploaded file gets an id from the database. Since all uploads go to the same directory, could there be problems with storing many uploads in the same directory (on linux)? Should I create some subdirectory structure and store subsets of the files in subdirectories instead? If I set up y subdirectories then the program can put a file in subdirectory x if x is remainder of the file id after division with y. e.g. if I keep 100 subdirectories:
If so, what's a future-proof value for y? 10? 100? 1000?
It's not a problem to store all the files in one directory. However, to make it easier for yourself, such as when backing up the files, it's a good idea to store each user's files in its own directory (ex. using the user ID as directory name).
Joined: Apr 08, 2006
No problem then. Thanks a lot Wilson. [ May 20, 2007: Message edited by: Niklas Rosencrantz ]
Without specifying what 'many files' means, it's hard to find an answer. The answer will depend on the filesystem as well.
Wikipedia mentions for ext3 - a popular filesystem on linux:
The maximum number of inodes (and hence the maximum number of files and directories) is set when the file system is created. If V is the volume size in bytes, then the default number of inodes is given by V/2^13 (or the number of blocks, whichever is less), and the minimum by V/2^23. The default was deemed sufficient for most applications.
Interesting. I mean millions of files. Linux can't copy * when there are thousands of files in the same directory, so we need subdirectories. One solution I saw but haven't implemented is that you have 3 directory level, /a/a/a/, /b/a/a, /a/b/a/, /c/a/a/, b/a/a, and so on. Every file get an autoincremented id from a database and some modular transform calculates which directory the file should be in according to the file id. But how should this transform look? Thank you Niklas R
Somebody locally was asking this the other day. The actual capacity of a directory is dependent on what type of filesystem the directory belongs to. However, in addition to the raw storage ability, there are some other things to keep in mind.
For example, the search speed of a directory may increase dramatically for very large directories, depending on the internal directory organization. This can slow down opening (and sometimes updating or closing) files.
Also, you can blow out all sorts of secondary buffers, which can cause commands such as "ls" or "find" to fail.
So in general, I recommend keeping the directories small if you can.
An IDE is no substitute for an Intelligent Developer.
Joined: Apr 08, 2006
My current setup serves thousands of static files from the same directory with httpd. Commands like mv * fails due to "argument list too long..." So we need something better to prepare for millions of files. 3 levels of directories will probably be enough to serve millions of files. I also need a good naming convention to avoid file names with the name name overwrites one another. So I think file abacus.xml will go to ./a/b/a/<id>.abacus.xml. Image files will have thumbnails named such as for abnormal.gif in ./a/b/n/<id>.thumb.abnormal.gif
This is a very practical idea, although you may run into statistical clumping, where some directories are empty and others are jam-packed. If that becomes a problem, some sort of fancy hashing technique based on statistical analysis may be useful. But that's extra work, and it's not as easy to figure out where things are (or the reverse) based on the simple filename when you do that.
My last job ran into a similar problem many years ago with SCO Unix. They were storing large numbers of files for a document imaging system in a directory. Over time, the system took a proportional performance hit as the number of files increased. As a result, longer access times.
The solution was to break the directory structure every 1000 files.
Back then, the os used a linked list scheme in the filesystem, which makes sense for the performance hit. I'm not sure what it's using these days, since filesystems have evolved quite far since then.
Tim is right though, for directories with large numbers of files, find, ls, etc. will fail with a list to large error or something like that (even today).
-- Nothing is impossible if I'mPossible
Joined: Apr 08, 2006
Thank you for the very informative replies. Indeed, my round calculation resulted in about maximum 1000 files per directory if you store a million files and have 3 directory levels, if the first 3 letters in the filenames are somewhat evenly distributed. I assume there would be about the same number of files starting with "aba" as with "abn" and that no three-letter-combination is significantly more popular than another. I can run tests on my thousands of stored files so far to see how naming of the first 3 letters is statistically. Sincerely, Niklas