file uploads

Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
Users can upload files to my server's file system via my web app. Each uploaded file gets an id from the database. Since all uploads go to the same directory, could there be problems with storing many uploads in a single directory (on Linux)? Should I create some subdirectory structure and store subsets of the files in subdirectories instead? If I set up y subdirectories, the program can put a file in subdirectory x, where x is the remainder of the file id after division by y, e.g. if I keep 100 subdirectories:
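Something like this minimal Java sketch (the base directory and y = 100 are just placeholder values):

import java.io.File;

// Sketch: a file with id 1234 and y = 100 goes into subdirectory "34" (1234 % 100).
public class UploadBuckets {
    static final int Y = 100; // number of subdirectories (placeholder)

    static File dirFor(long id, File baseDir) {
        long x = id % Y;                       // remainder of the id after division by y
        File dir = new File(baseDir, String.valueOf(x));
        dir.mkdirs();                          // create the subdirectory if it doesn't exist
        return dir;
    }
}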

If so, what's a future-proof value for y? 10? 100? 1000?

Thanks in advance
Niklas
Wilson Gordon
Ranch Hand

Joined: Apr 07, 2007
Posts: 89
It's not a problem to store all the files in one directory. However, to make it easier for yourself, such as when backing up the files, it's a good idea to store each user's files in their own directory (e.g. using the user ID as the directory name).
Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
No problem then. Thanks a lot Wilson.
[ May 20, 2007: Message edited by: Niklas Rosencrantz ]
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 61315

Originally posted by Wilson Gordon:
It's not a problem to store all the files in one directory.


I'm not so sure about that. There is a limit to the number of files that can be stored in a single folder.

Since this is much more about UNIX file storage techniques than servlets, I've moved it to the UNIX forum for further discussion.


[Asking smart questions] [Bear's FrontMan] [About Bear] [Books by Bear]
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30586

Niklas,
How about storing the first X files in one directory, the next X in the next directory, and so on? If you use integer division ("/") rather than the remainder ("%"), you don't need to guess how many subdirectories there will be.

X could be the maximum number of files allowed in a subdirectory on your system. Or better yet, the maximum on any common system, so you aren't tied to your OS.
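A minimal Java sketch of that, assuming X = 1000 (the value is a placeholder):

import java.io.File;

// Sketch: ids 0..999 land in directory "0", 1000..1999 in "1", and so on.
public class SequentialBuckets {
    static final long X = 1000; // max files per subdirectory (placeholder)

    static File dirFor(long id, File baseDir) {
        long bucket = id / X;                  // "/" instead of "%": buckets are created as needed
        File dir = new File(baseDir, String.valueOf(bucket));
        dir.mkdirs();
        return dir;
    }
}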


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
Thanks for letting me know. I will implement the solution accordingly.
Kind regards,
Niklas
Dan Howard
Ranch Hand

Joined: Feb 22, 2004
Posts: 47
We had a similar issue. Storing many files in a single directory is a problem on both Windows and Linux.

What we did was base the storage on the date so the folder structure would look like:
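Presumably something along these lines; a Java sketch, where the exact year/month/day layout is my assumption:

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch: a file stored on May 21, 2007 lands in <base>/2007/05/21/.
public class DateBuckets {
    static File dirFor(Date when, File baseDir) {
        String path = new SimpleDateFormat("yyyy/MM/dd").format(when);
        File dir = new File(baseDir, path);
        dir.mkdirs();
        return dir;
    }
}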


That way there were never too many files in one place. Additionally, it's easier to archive old years and move them to other volumes.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Without specifying what 'many files' means, it's hard to find an answer.
The answer will depend on the filesystem as well.

Wikipedia mentions for ext3, a popular filesystem on Linux:
The maximum number of inodes (and hence the maximum number of files and directories) is set when the file system is created. If V is the volume size in bytes, then the default number of inodes is given by V/2^13 (or the number of blocks, whichever is less), and the minimum by V/2^23. The default was deemed sufficient for most applications.

here: http://en.wikipedia.org/wiki/Ext3#_note-0
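By that formula, a 100 GiB ext3 volume would get roughly (100 * 2^30) / 2^13, which is about 13 million inodes by default, so for most setups the inode count isn't the first limit you hit.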

and for reiserfs:
http://en.wikipedia.org/wiki/Reiserfs
and
http://namesys.com/faq.html#reiserfsspecs
2^32 - 4 (i.e. 4 Gi - 4) files, but in practice this value is limited by the hash function: the r5 hash allows about 1,200,000 file names without collisions.

[ May 21, 2007: Message edited by: Stefan Wagner ]

http://home.arcor.de/hirnstrom/bewerbung
Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
Interesting. I mean millions of files. On Linux, cp * fails when there are thousands of files in the same directory (the shell's argument list gets too long), so we need subdirectories. One solution I saw but haven't implemented uses 3 directory levels: /a/a/a/, /b/a/a/, /a/b/a/, /c/a/a/, and so on. Every file gets an auto-incremented id from the database, and some modular transform calculates which directory the file should go in according to the file id. But how should this transform look?
Thank you
Niklas R
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Perhaps not with a modular transformation, but pattern matching.

A file "abacus" -> ./a/b/a/abacus
A file "abnormal"->./a/b/n/abnormal and so on.
Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
Many thanks. It's a very good solution you present. All files which start with a non-ASCII character could go into a remainder directory, for example if a filename is in Chinese.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16101

Somebody locally was asking this the other day. The actual capacity of a directory is dependent on what type of filesystem the directory belongs to. However, in addition to the raw storage ability, there are some other things to keep in mind.

For example, the time to search a directory may increase dramatically for very large directories, depending on the internal directory organization. This can slow down opening (and sometimes updating or closing) files.

Also, you can blow out all sorts of secondary buffers, which can cause commands such as "ls" or "find" to fail.

So in general, I recommend keeping the directories small if you can.


Customer surveys are for companies who didn't pay proper attention to begin with.
Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
My current setup serves thousands of static files from the same directory with httpd. Commands like mv * fail with "argument list too long...".
So we need something better to prepare for millions of files; 3 levels of directories will probably be enough. I also need a good naming convention so that files with the same name don't overwrite one another. So I think the file abacus.xml will go to ./a/b/a/<id>.abacus.xml, and image files will have thumbnails named like ./a/b/n/<id>.thumb.abnormal.gif for abnormal.gif.
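As a sketch, combining the prefix directories with the id-based naming (PrefixBuckets.dirFor is the hypothetical helper from Stefan's scheme above):

// Sketch: abacus.xml with database id 42 is stored as <base>/a/b/a/42.abacus.xml.
static java.io.File targetFor(long id, String originalName, java.io.File baseDir) {
    java.io.File dir = PrefixBuckets.dirFor(originalName, baseDir); // helper from the earlier sketch
    return new java.io.File(dir, id + "." + originalName);
}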
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16101

This is a very practical idea, although you may run into statistical clumping, where some directories are empty and others are jam-packed. If that becomes a problem, some sort of fancy hashing technique based on statistical analysis may be useful. But that's extra work, and when you do that it's not as easy to figure out where a file lives (or what lives where) from the filename alone.
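One simple version of such a hashing scheme in Java (the bucket count and the use of String.hashCode are placeholders; a real scheme might use a stronger hash):

import java.io.File;

// Sketch: spreads files over N buckets regardless of how their first letters cluster.
public class HashBuckets {
    static final int BUCKETS = 1024; // placeholder

    static File dirFor(String fileName, File baseDir) {
        int bucket = (fileName.hashCode() & 0x7fffffff) % BUCKETS; // force non-negative
        File dir = new File(baseDir, Integer.toString(bucket));
        dir.mkdirs();
        return dir;
    }
}

The trade-off mentioned above shows up here: you can no longer tell where a file lives without re-running the hash.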
Doug Slattery
Ranch Hand

Joined: Sep 15, 2007
Posts: 294
My last job ran into a similar problem many years ago with SCO Unix. They were storing large numbers of files for a document imaging system in a single directory. Over time, the system took a performance hit proportional to the number of files, resulting in longer access times.

The solution was to break the directory structure every 1000 files.

Back then, the OS used a linked-list scheme in the filesystem, which would explain the performance hit. I'm not sure what it uses these days, since filesystems have evolved quite a bit since then.

Tim is right though: for directories with large numbers of files, find, ls, etc. will fail with an "argument list too long" error or something like that (even today).

Aloha,
Doug

-- Nothing is impossible if I'mPossible
Niklas Rosencrantz
Ranch Hand

Joined: Apr 08, 2006
Posts: 49
Thank you for the very informative replies. Indeed, my rough calculation gives a maximum of about 1000 files per directory when storing a million files across 3 directory levels, provided the first 3 letters of the filenames are somewhat evenly distributed. I assume there would be about the same number of files starting with "aba" as with "abn", and that no three-letter combination is significantly more popular than another. I can run tests on the thousands of files stored so far to see how the first 3 letters are actually distributed.
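A quick-and-dirty Java check of that distribution (the uploads path is a placeholder):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Counts how many existing files share each 3-letter prefix.
public class PrefixStats {
    public static void main(String[] args) {
        File dir = new File("/var/www/uploads");   // placeholder path
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String name : dir.list()) {
            String key = name.length() >= 3 ? name.substring(0, 3).toLowerCase() : name;
            Integer n = counts.get(key);
            counts.put(key, n == null ? 1 : n + 1);
        }
        System.out.println(counts);                // e.g. {aba=57, abn=61, ...}
    }
}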
Sincerely,
Niklas
 