| Author |
file uploads
|
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
Users can upload files to my server's file system via my web app. Each uploaded file gets an id from the database. Since all uploads go to the same directory, could storing many uploads in a single directory cause problems on Linux? Should I create a subdirectory structure and store subsets of the files in subdirectories instead? If I set up y subdirectories, the program can put a file in subdirectory x, where x is the remainder of the file id after division by y (e.g. x = id mod 100 if I keep 100 subdirectories). If so, what's a future-proof value for y? 10? 100? 1000? Thanks in advance, Niklas
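A minimal sketch of the remainder scheme described in the question, assuming non-negative integer ids. The function names and the /srv/uploads root are hypothetical and only serve the illustration.

```python
from pathlib import PurePosixPath


def modulo_subdir(file_id: int, num_dirs: int = 100) -> int:
    """Return the subdirectory index x = file_id mod num_dirs."""
    if num_dirs <= 0:
        raise ValueError("num_dirs must be positive")
    return file_id % num_dirs


def modulo_path(root: str, file_id: int, num_dirs: int = 100) -> PurePosixPath:
    """Build <root>/<x>/<file_id>; the root directory is an assumed example."""
    return PurePosixPath(root) / str(modulo_subdir(file_id, num_dirs)) / str(file_id)


print(modulo_path("/srv/uploads", 1234))  # id 1234 with y = 100 lands in subdirectory 34
```

One thing to keep in mind with this scheme: if y is changed later, most existing ids map to a different subdirectory, so the files already stored would have to be moved.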
|
 |
Wilson Gordon
Ranch Hand
Joined: Apr 07, 2007
Posts: 89
|
|
|
It's not a problem to store all the files in one directory. However, to make things easier for yourself, such as when backing up the files, it's a good idea to store each user's files in their own directory (e.g. using the user ID as the directory name).
|
 |
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
No problem then. Thanks a lot Wilson. [ May 20, 2007: Message edited by: Niklas Rosencrantz ]
|
 |
Bear Bibeault
Author and ninkuma
Marshal
Joined: Jan 10, 2002
Posts: 56157
|
|
Originally posted by Wilson Gordon: It's not a problem to store all the files in one directory.
I'm not so sure about that. There is a limit to the number of files that can be stored in a single folder. Since this is much more about UNIX file storage techniques than servlets, I've moved it to the UNIX forum for further discussion.
|
|
 |
Jeanne Boyarsky
internet detective
Marshal
Joined: May 26, 2003
Posts: 26144
|
|
Niklas, how about storing the first X files in one directory, the next X in the next directory, and so on? If you use integer division ("/") rather than the remainder ("%"), you don't need to guess how many subdirectories there will be. X could be the maximum number of files allowed in a subdirectory on your system, or better yet on any common system, so you aren't tied to your OS.
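A small sketch of the division-based variant, assuming non-negative integer ids. The function name and the default of 1000 files per directory are illustrative choices, not values from this thread's systems.

```python
def range_subdir(file_id: int, files_per_dir: int = 1000) -> int:
    """Return the directory index for file_id using integer division.

    Ids 0..999 go to directory 0, ids 1000..1999 to directory 1, and so on,
    so new directories appear only as the id counter grows.
    """
    if files_per_dir <= 0:
        raise ValueError("files_per_dir must be positive")
    return file_id // files_per_dir


for file_id in (0, 999, 1000, 2500):
    print(file_id, "->", range_subdir(file_id))
```

Unlike the remainder scheme, the number of directories does not have to be chosen up front, and a file never needs to move once it has been stored.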
|
|
 |
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
Thanks for letting me know. I will implement the solution accordingly. Kind regards, Niklas
|
 |
Dan Howard
Ranch Hand
Joined: Feb 22, 2004
Posts: 47
|
|
We had a similar issue. Storing many files in a single directory is a problem on both Windows and Linux. What we did was base the storage on the date, so the folder structure was organized by date. That way there were never too many files in one folder. It's also easier to archive old years and move them to other volumes.
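The exact folder layout from this post is not shown, so the following is only a sketch under an assumed year/month/day layout. The function name and the /srv/uploads root are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_path(root: str, uploaded_on: date, filename: str) -> PurePosixPath:
    """Build <root>/<YYYY>/<MM>/<DD>/<filename> (assumed layout)."""
    return (
        PurePosixPath(root)
        / f"{uploaded_on:%Y}"
        / f"{uploaded_on:%m}"
        / f"{uploaded_on:%d}"
        / filename
    )


print(dated_path("/srv/uploads", date(2007, 5, 21), "report.pdf"))
```

With the year as the top level, archiving an old year amounts to moving a single directory tree.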
|
 |
Stefan Wagner
Ranch Hand
Joined: Jun 02, 2003
Posts: 1923
|
|
Without specifying what 'many files' means, it's hard to give an answer. The answer also depends on the filesystem. For ext3, a popular filesystem on Linux, Wikipedia says:
The maximum number of inodes (and hence the maximum number of files and directories) is set when the file system is created. If V is the volume size in bytes, then the default number of inodes is given by V/2^13 (or the number of blocks, whichever is less), and the minimum by V/2^23. The default was deemed sufficient for most applications.
(source: http://en.wikipedia.org/wiki/Ext3#_note-0). For reiserfs, http://en.wikipedia.org/wiki/Reiserfs and http://namesys.com/faq.html#reiserfsspecs give the maximum number of files in a directory as:
2^32 - 4 => 4 Gi - 4 but in practice this value is limited by hash function. r5 hash allows about 1 200 000 file names without collisions
[ May 21, 2007: Message edited by: Stefan Wagner ]
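To put the quoted ext3 formula into numbers, here is a small worked calculation. The 100 GiB volume size is an arbitrary example, and the sketch ignores the "or the number of blocks, whichever is less" cap from the quote. Note that these figures limit the whole file system, not a single directory.

```python
def ext3_default_inodes(volume_bytes: int) -> int:
    """Default inode count per the quoted formula V / 2^13 (block cap ignored)."""
    return volume_bytes // 2**13


def ext3_min_inodes(volume_bytes: int) -> int:
    """Minimum inode count per the quoted formula V / 2^23."""
    return volume_bytes // 2**23


volume = 100 * 2**30  # 100 GiB, an assumed example size
print(ext3_default_inodes(volume))  # 13107200
print(ext3_min_inodes(volume))      # 12800
```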
|
|
 |
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
Interesting. I mean millions of files. A copy with * fails on Linux when there are thousands of files in the same directory, so we need subdirectories. One solution I have seen but not implemented uses 3 directory levels: /a/a/a/, /b/a/a/, /a/b/a/, /c/a/a/, and so on. Every file gets an auto-incremented id from a database, and some modular transform calculates which directory the file should go in from the file id. But what should this transform look like? Thank you, Niklas R
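One possible shape for such a transform, offered as a sketch rather than a recommendation: treat the id as a number in base 26 and use its three lowest digits as directory letters. The function name and the choice of the lowercase ASCII alphabet are assumptions.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters, so 26**3 = 17576 directories


def letter_dirs(file_id: int, levels: int = 3) -> str:
    """Map a non-negative id to a directory like 'b/a/a'.

    The lowest base-26 digit comes first, so consecutive ids
    spread across the top-level directories.
    """
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    parts = []
    n = file_id
    for _ in range(levels):
        n, digit = divmod(n, len(ALPHABET))
        parts.append(ALPHABET[digit])
    return "/".join(parts)


for file_id in (0, 1, 2, 26):
    print(file_id, "->", letter_dirs(file_id))
```

Ids wrap around after 17576 values, so with sequential ids a million files would come to roughly 57 files per directory.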
|
 |
Stefan Wagner
Ranch Hand
Joined: Jun 02, 2003
Posts: 1923
|
|
Perhaps not with a modular transformation, but with pattern matching on the file name:
A file "abacus" -> ./a/b/a/abacus
A file "abnormal" -> ./a/b/n/abnormal
and so on.
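A sketch of the name-based mapping above. The handling of awkward names is an assumption added for the example: any character that is not an ASCII letter or digit maps to "_", and names shorter than three characters are padded with "_". The function name and root are hypothetical, and the filename is assumed to contain no "/".

```python
from pathlib import PurePosixPath


def prefix_path(root: str, filename: str, levels: int = 3) -> PurePosixPath:
    """Build <root>/<c1>/<c2>/<c3>/<filename> from the first characters."""
    parts = [
        ch if (ch.isascii() and ch.isalnum()) else "_"
        for ch in filename.lower()[:levels]
    ]
    parts += ["_"] * (levels - len(parts))  # pad names shorter than `levels`
    return PurePosixPath(root).joinpath(*parts, filename)


print(prefix_path("/srv/uploads", "abacus"))
print(prefix_path("/srv/uploads", "abnormal"))
```

Replacing "." in particular matters here, because a directory component of "." or ".." would not name a new subdirectory.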
|
 |
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
|
Many thanks. It's a very good solution you present. All files whose names start with a non-ASCII character could go into a remainder directory, for example if a filename is in Chinese.
|
 |
Tim Holloway
Saloon Keeper
Joined: Jun 25, 2001
Posts: 14456
|
|
Somebody locally was asking this the other day. The actual capacity of a directory depends on what type of filesystem the directory belongs to. However, in addition to the raw storage ability, there are some other things to keep in mind. For example, the time it takes to search a directory may increase dramatically for very large directories, depending on the internal directory organization. This can slow down opening (and sometimes updating or closing) files. Also, you can blow out all sorts of secondary buffers, which can cause commands such as "ls" or "find" to fail. So in general, I recommend keeping the directories small if you can.
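When a large directory does have to be processed, one way to avoid building the whole listing in memory or on a command line is to stream the entries one at a time. A minimal sketch using os.scandir from the Python standard library; the function name is hypothetical.

```python
import os
from typing import Iterator


def iter_files(directory: str) -> Iterator[str]:
    """Yield the names of regular files in `directory`, one at a time.

    os.scandir reads entries lazily, so no complete listing is held
    in memory and no shell wildcard expansion is involved.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name


# Example: count the files in the current directory without listing them all.
print(sum(1 for _ in iter_files(".")))
```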
|
Customer surveys are for companies who didn't pay proper attention to begin with.
|
 |
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
My current setup serves thousands of static files from the same directory with httpd. Commands like mv * fail with "argument list too long...", so we need something better to prepare for millions of files. 3 levels of directories will probably be enough to serve millions of files. I also need a good naming convention to keep files with the same name from overwriting one another. So I think the file abacus.xml will go to ./a/b/a/<id>.abacus.xml. Image files will have thumbnails named accordingly, e.g. the thumbnail for abnormal.gif goes in ./a/b/n/<id>.thumb.abnormal.gif
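A sketch of the naming convention just described, with the database id as a prefix so that two uploads with the same original name cannot collide. The function name and the ids 42 and 43 are made up for the example.

```python
def stored_name(file_id: int, original_name: str, thumb: bool = False) -> str:
    """Return '<id>.<original_name>' or '<id>.thumb.<original_name>'."""
    parts = [str(file_id)]
    if thumb:
        parts.append("thumb")
    parts.append(original_name)
    return ".".join(parts)


print(stored_name(42, "abacus.xml"))                # 42.abacus.xml
print(stored_name(43, "abnormal.gif", thumb=True))  # 43.thumb.abnormal.gif
```

Because the id comes first, the directory would still have to be derived from the original name (or from the id) rather than from the first letters of the stored name.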
|
 |
Tim Holloway
Saloon Keeper
Joined: Jun 25, 2001
Posts: 14456
|
|
|
This is a very practical idea, although you may run into statistical clumping, where some directories are empty and others are jam-packed. If that becomes a problem, some sort of fancy hashing technique based on statistical analysis may be useful. But that's extra work, and it's not as easy to figure out where things are (or the reverse) based on the simple filename when you do that.
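As an illustration of the hashing idea, here is a sketch that derives the directory from a hash of the filename instead of its first letters. SHA-1 from hashlib is used only as a convenient, evenly spreading hash; it is an assumed choice, not something proposed in this thread, and the function name and root are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(root: str, filename: str, levels: int = 3) -> PurePosixPath:
    """Build <root>/<h1>/<h2>/<h3>/<filename> from the hex digest of the name.

    Three hex characters give 16**3 = 4096 directories, filled roughly
    evenly regardless of how the names themselves are distributed.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return PurePosixPath(root).joinpath(*digest[:levels], filename)


print(hashed_path("/srv/uploads", "abacus.xml"))
```

The cost is the one mentioned above: the location can no longer be read off the filename by eye and has to be recomputed from the hash.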
|
 |
Doug Slattery
Ranch Hand
Joined: Sep 15, 2007
Posts: 294
|
|
My last job ran into a similar problem many years ago with SCO Unix. They were storing large numbers of files for a document imaging system in a single directory. Over time, the system took a proportional performance hit as the number of files increased, which showed up as longer access times. The solution was to start a new directory every 1000 files. Back then, the OS used a linked list scheme in the filesystem, which would explain the performance hit. I'm not sure what it's using these days, since filesystems have evolved quite far since then. Tim is right though: for directories with large numbers of files, find, ls, etc. will fail with a "list too large" error or something like that (even today). Aloha, Doug -- Nothing is impossible if I'mPossible
|
 |
Niklas Rosencrantz
Ranch Hand
Joined: Apr 08, 2006
Posts: 49
|
|
Thank you for the very informative replies. Indeed, my rough calculation resulted in at most about 1000 files per directory if you store a million files and have 3 directory levels, provided the first 3 letters of the filenames are somewhat evenly distributed. I assume there would be about the same number of files starting with "aba" as with "abn", and that no three-letter combination is significantly more popular than another. I can run tests on the thousands of files I have stored so far to see how the first 3 letters are distributed. Sincerely, Niklas
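Such a test could be as simple as counting how often each three-letter prefix occurs. A minimal sketch; the function name and the sample filenames are invented for the example, and in practice the names would come from the existing upload directory.

```python
from collections import Counter
from typing import Iterable


def prefix_counts(filenames: Iterable[str], length: int = 3) -> Counter:
    """Count how many filenames share each (lowercased) leading prefix."""
    return Counter(name[:length].lower() for name in filenames)


sample = ["abacus.xml", "abnormal.gif", "Abacus2.xml", "zebra.png"]
counts = prefix_counts(sample)
print(counts.most_common(3))  # the most crowded prefixes come first
```

Comparing the largest counts with the average per prefix gives a quick picture of how uneven the directories would be.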
|
 |
 |