
File Storage Design for quant development

(Last Updated On: May 16, 2011)



What is the best design to store 500 thousand images in a file system?

Is it better to use many nested folders, one flat folder, or something else?

It probably depends on the OS. On Windows Server we started seeing performance issues with anything more than a couple thousand images in one directory. After switching to nested directories we were able to scale up to several million images.

Ten years ago we simply MD5-hashed the image file name and used the first ‘n’ characters of the hash as the directory name to put it in. That worked well on Linux and Windows up to millions of files. Today I’d do the same job with Riak or some other NoSQL store. Simpler and more reliable.
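For concreteness, here is a minimal Python sketch of that hash-prefix layout; STORE_ROOT and PREFIX_LEN are illustrative names, not from the original post:

    import hashlib
    import os
    import shutil

    STORE_ROOT = "/files"   # hypothetical storage root
    PREFIX_LEN = 2          # first 'n' hex characters of the hash name the directory

    def store_image(src_path):
        """Copy an image into a directory named by the first PREFIX_LEN
        characters of the MD5 hash of its file name; return the new path."""
        name = os.path.basename(src_path)
        prefix = hashlib.md5(name.encode("utf-8")).hexdigest()[:PREFIX_LEN]
        dest_dir = os.path.join(STORE_ROOT, prefix)
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, name)
        shutil.copy2(src_path, dest)
        return dest

With a prefix length of 2 you get 256 directories, so a million files averages under 4,000 per directory; a prefix length of 3 gives 4,096 directories.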

I didn’t consider NoSQL as a storage option; good idea. I also see some potential in using a FUSE driver to mount Riak or CouchDB.

What are you storing them for? Archiving? Easy retrieval? Search? Updates?

The “storing” part is usually easy.

Easy retrieval, since pretty much all the images are 100×100 pixels or smaller PNG files used as part of a person’s account. The files aren’t really searched or updated, since they are mostly meant to be viewed all on one page.

Then you have a serious breakage (wasted-space) problem: how big are those files compared to the disk block size?

 

I don’t know the disk block size; it was a default file system on a VPS running Debian.
I had one folder, say /files/pi/, with each file in its own subfolder. The files kept their original filenames, which also let me post-process the images, for example to create different sized versions. The folder names were incrementally numbered, more or less. Once that folder reached about 8 thousand subfolders it stopped creating new ones, so all new uploads were sent to /files/pi2/. Then that got full, and so on.

Storage is RAID, but there are no other features like server or file-system redundancy to ease the network bottleneck, etc.

 

Assuming you can compress the images to 4K or less, you’re still using a 4K block on the filesystem for each file by default, plus another 4K for the directory, plus room in the inode tree. Not too efficient.
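A quick back-of-the-envelope calculation shows the scale of that overhead for the 500 thousand images in the question, assuming a 4 KiB block size:

    FILES = 500_000
    BLOCK = 4096  # bytes; a common filesystem default

    # Each file occupies at least one block, even if the PNG is much smaller.
    min_data_bytes = FILES * BLOCK
    print("%.1f GiB just in data blocks" % (min_data_bytes / 2**30))  # -> 1.9 GiB

So a few hundred thousand tiny PNGs cost roughly 2 GiB of blocks before counting directories and inodes.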

If you decide to stick with the file/folder layout scheme, try to put about 1000 files in each directory, and about 1000 subdirectories in each directory if you have to nest them. Most filesystems will handle that gracefully up to their size limit, and it gives you room for a million files with only one directory level, and a billion with just two.
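Stripped of error handling, a minimal sketch of that layout might look like this, assuming an incrementing numeric ID per image; ROOT and path_for are illustrative names:

    import os

    ROOT = "/files"  # hypothetical storage root

    def path_for(image_id, filename):
        """Map an incrementing ID to ROOT/<level1>/<level2>/filename, giving
        about 1000 subdirectories per directory and 1000 files per leaf."""
        level1 = image_id // 1_000_000        # 0..999 covers up to a billion files
        level2 = (image_id // 1000) % 1000    # 0..999
        return os.path.join(ROOT, str(level1), str(level2), filename)

    # Example: image 123_456_789 lands in /files/123/456/<filename>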

The indexing code for the above might be a whole page long if you’re liberal with the whitespace. It’s hard to believe adding NoSQL to the system is going to improve things, unless it’s a big system with lots of parallel processing power.

Those files are probably small enough that even keeping them in a database such as MySQL will work. 50,000 files stored as MySQL BLOBs is nothing.
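If you went that route, a minimal sketch using the mysql-connector-python package might look like this; the database, table, and column names are illustrative assumptions:

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="avatars")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS avatar (
            user_id BIGINT PRIMARY KEY,
            png     BLOB NOT NULL
        )
    """)

    def save_avatar(user_id, path):
        """Store a small PNG as a BLOB keyed by user ID."""
        with open(path, "rb") as f:
            cur.execute("REPLACE INTO avatar (user_id, png) VALUES (%s, %s)",
                        (user_id, f.read()))
        conn.commit()

MySQL’s BLOB type caps out at 64 KB, which is plenty for 100×100 PNGs; use MEDIUMBLOB if some files run larger.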

 

 
