
file uploads

 
Niklas Rosencrantz
Ranch Hand
Posts: 49
Users can upload files to my server's file system via my web app. An uploaded file gets an id from the database. Since all uploads go to the same directory, could there be problems with storing many uploads in the same directory (on Linux)? Should I create some subdirectory structure and store subsets of the files in subdirectories instead? If I set up y subdirectories, the program can put a file in subdirectory x, where x is the remainder of the file id after division by y, e.g. if I keep 100 subdirectories:
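Something like this (a minimal sketch; the names and values are just for illustration):

public class ModuloPath {
    public static void main(String[] args) {
        long fileId = 4711;        // id assigned by the database
        int y = 100;               // number of subdirectories
        long x = fileId % y;       // 4711 % 100 = 11
        System.out.println("uploads/" + x + "/" + fileId);  // uploads/11/4711
    }
}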

If so, what's a future-proof value for y? 10? 100? 1000?

Thanks in advance
Niklas
 
Wilson Gordon
Ranch Hand
Posts: 89
It's not a problem to store all the files in one directory. However, to make it easier for yourself, such as when backing up the files, it's a good idea to store each user's files in their own directory (e.g. using the user ID as the directory name).
 
Niklas Rosencrantz
Ranch Hand
Posts: 49
No problem then. Thanks a lot Wilson.
[ May 20, 2007: Message edited by: Niklas Rosencrantz ]
 
Bear Bibeault
Author and ninkuma
Marshal
Posts: 64192
Originally posted by Wilson Gordon:
It's not a problem to store all the files in one directory.


I'm not so sure about that. There is a limit to the number of files that can be stored in a single folder.

Since this is much more about UNIX file storage techniques than servlets, I've moved it to the UNIX forum for further discussion.
 
Jeanne Boyarsky
author & internet detective
Marshal
Posts: 33697
Niklas,
How about storing the first X files in one directory, the next X in the next directory, and so on? If you use "/" (integer division) rather than "%" (modulo), you don't need to guess in advance how many subdirectories there will be.

X could be the maximum number of files allowed in a subdirectory on your system. Or better yet, the smallest such limit across common systems, so you aren't tied to your OS.
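A minimal sketch of that scheme, assuming sequential ids and X = 1000 (both just for illustration):

public class DivisionPath {
    public static void main(String[] args) {
        long fileId = 4711;
        int filesPerDir = 1000;           // X: at most this many files per subdirectory
        long dir = fileId / filesPerDir;  // ids 0-999 -> dir 0, 1000-1999 -> dir 1, ...
        System.out.println("uploads/" + dir + "/" + fileId);  // uploads/4/4711
    }
}

New directories appear on their own as the ids grow, which is why there's no need to guess the total count in advance.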
 
Niklas Rosencrantz
Ranch Hand
Posts: 49
Thanks for letting me know. I will implement the solution accordingly.
Kind regards,
Niklas
 
Dan Howard
Ranch Hand
Posts: 47
We had a similar issue. Storing many files in a single directory is a problem on both Windows and Linux.

What we did was base the storage on the date so the folder structure would look like:
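2007/05/20/...
2007/05/21/...
2007/05/22/...

A minimal sketch that builds such a path (the exact year/month/day depth is an assumption):

import java.text.SimpleDateFormat;
import java.util.Date;

public class DatedPath {
    public static void main(String[] args) {
        // Build a dated storage path, e.g. uploads/2007/05/21/report.pdf
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        System.out.println("uploads/" + fmt.format(new Date()) + "/report.pdf");
    }
}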


That way there were never too many files in any one folder. Additionally, it's easier to archive old years and move them to other volumes.
 
Stefan Wagner
Ranch Hand
Posts: 1923
Without specifying what 'many files' means, it's hard to find an answer.
The answer will depend on the filesystem as well.

Wikipedia mentions for ext3, a popular filesystem on Linux:
The maximum number of inodes (and hence the maximum number of files and directories) is set when the file system is created. If V is the volume size in bytes, then the default number of inodes is given by V/2^13 (or the number of blocks, whichever is less), and the minimum by V/2^23. The default was deemed sufficient for most applications.

here: http://en.wikipedia.org/wiki/Ext3#_note-0

and for reiserfs:
http://en.wikipedia.org/wiki/Reiserfs
and
http://namesys.com/faq.html#reiserfsspecs
2^32 - 4 (= 4 Gi - 4), but in practice this value is limited by the hash function; the r5 hash allows about 1,200,000 file names without collisions.

[ May 21, 2007: Message edited by: Stefan Wagner ]
 
Niklas Rosencrantz
Ranch Hand
Posts: 49
Interesting. I mean millions of files. On Linux, commands like cp * fail with "argument list too long" when there are thousands of files in the same directory, so we need subdirectories. One solution I saw but haven't implemented uses 3 directory levels: /a/a/a/, /a/a/b/, /a/b/a/, /b/a/a/, and so on. Every file gets an auto-incremented id from a database, and some modular transform calculates which directory the file should be in according to the file id. But what should this transform look like?
Thank you
Niklas R
 
Stefan Wagner
Ranch Hand
Posts: 1923
Perhaps not with a modular transformation, but pattern matching.

A file "abacus" -> ./a/b/a/abacus
A file "abnormal"->./a/b/n/abnormal and so on.
 
Niklas Rosencrantz
Ranch Hand
Posts: 49
Many thanks. That's a very good solution you present. And all files whose names start with a non-ASCII character could go into a catch-all directory, for example if a filename is in Chinese.
 
Tim Holloway
Saloon Keeper
Posts: 17628
Somebody locally was asking this the other day. The actual capacity of a directory is dependent on what type of filesystem the directory belongs to. However, in addition to the raw storage ability, there are some other things to keep in mind.

For example, the time to search a directory may increase dramatically for very large directories, depending on the internal directory organization. This can slow down opening (and sometimes updating or closing) files.

Also, you can blow out all sorts of secondary buffers, which can cause commands such as "ls" or "find" to fail.

So in general, I recommend keeping the directories small if you can.
 
Niklas Rosencrantz
Ranch Hand
Posts: 49
My current setup serves thousands of static files from the same directory with httpd. Commands like mv * fail due to "argument list too long..."
So we need something better to prepare for millions of files. 3 levels of directories will probably be enough to serve millions of files. I also need a good naming convention so that files with the same name don't overwrite one another. So I think the file abacus.xml will go to ./a/b/a/<id>.abacus.xml, and image files will have thumbnails, e.g. for abnormal.gif at ./a/b/n/<id>.thumb.abnormal.gif.
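A sketch of that convention (the ids 42 and 43 are made-up examples):

public class UploadPath {
    // abacus.xml with id 42 -> ./a/b/a/42.abacus.xml
    static String pathFor(long id, String name) {
        String dirs = "./" + name.charAt(0) + "/" + name.charAt(1) + "/" + name.charAt(2);
        return dirs + "/" + id + "." + name;
    }
    // Thumbnails keep the original name's directory: abnormal.gif -> ./a/b/n/43.thumb.abnormal.gif
    static String thumbPathFor(long id, String name) {
        String dirs = "./" + name.charAt(0) + "/" + name.charAt(1) + "/" + name.charAt(2);
        return dirs + "/" + id + ".thumb." + name;
    }
    public static void main(String[] args) {
        System.out.println(pathFor(42, "abacus.xml"));        // ./a/b/a/42.abacus.xml
        System.out.println(thumbPathFor(43, "abnormal.gif")); // ./a/b/n/43.thumb.abnormal.gif
    }
}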
 
Tim Holloway
Saloon Keeper
Posts: 17628
This is a very practical idea, although you may run into statistical clumping, where some directories are empty and others are jam-packed. If that becomes a problem, some sort of fancy hashing technique based on statistical analysis may be useful. But that's extra work, and it's not as easy to figure out where things are (or the reverse) based on the simple filename when you do that.
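One common variant (a sketch, using a uniform hash rather than the statistical analysis Tim mentions) spreads files by the first hex digits of an MD5 of the name:

import java.security.MessageDigest;

public class HashedPath {
    static String pathFor(String name) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(name.getBytes("UTF-8"));
        String hex = String.format("%02x%02x", d[0], d[1]);  // first two hash bytes as hex
        return hex.substring(0, 2) + "/" + hex.substring(2) + "/" + name;
    }
    public static void main(String[] args) throws Exception {
        // Directories come from the hash, so they fill evenly regardless of the names.
        System.out.println(pathFor("abacus"));
    }
}

A program can still recompute the hash to locate a file, but, as Tim says, a human can no longer guess the directory from the name alone.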
 
Doug Slattery
Ranch Hand
Posts: 294
My last job ran into a similar problem many years ago with SCO Unix. They were storing large numbers of files for a document imaging system in a single directory. Over time, the system took a proportional performance hit as the number of files increased, resulting in longer access times.

The solution was to break the directory structure every 1000 files.

Back then, the OS used a linked-list scheme in the filesystem, which explains the performance hit. I'm not sure what it uses these days, since filesystems have evolved quite far since then.

Tim is right though: for directories with large numbers of files, find, ls, etc. will fail with a "list too large" error or something like that (even today).

Aloha,
Doug

-- Nothing is impossible if I'mPossible
 
Niklas Rosencrantz
Ranch Hand
Posts: 49
Thank you for the very informative replies. Indeed, my rough calculation gives a maximum of about 1000 files per directory if you store a million files over 3 directory levels, provided the first 3 letters of the filenames are somewhat evenly distributed. I assume there would be about the same number of files starting with "aba" as with "abn", and that no three-letter combination is significantly more popular than another. I can run tests on the thousands of files stored so far to see how the first 3 letters are actually distributed.
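A quick way to run that test (a sketch; /var/www/uploads stands in for the real upload directory):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class PrefixStats {
    public static void main(String[] args) {
        // Count how many stored filenames share each 3-letter prefix.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (File f : new File("/var/www/uploads").listFiles()) {
            String name = f.getName().toLowerCase();
            if (name.length() < 3) continue;
            String prefix = name.substring(0, 3);
            Integer n = counts.get(prefix);
            counts.put(prefix, n == null ? 1 : n + 1);
        }
        System.out.println(counts);
    }
}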
Sincerely,
Niklas
 