Ticket #59 (closed enhancement: fixed)

Opened 4 years ago

Last modified 3 years ago

Gather disk space used in addition to file size

Reported by: julians@rsp.com.au
Assigned to: mlandauer
Priority: minor
Milestone: 2007.02.23
Component: Daemon
Version:
Keywords: file size disk space
Cc:

Description

Ideally, Earth would find out the real disk space usage of each file in addition to its file size.

Depending on the file system, the disk space occupied by a file can differ substantially from its actual size. For instance, a single-byte file (file size 1) occupies at least one block on many file systems, which - depending on block size - can translate to several kilobytes of real disk usage.

It might be worthwhile to investigate the source code of the "du" tool from the GNU core utilities (http://www.gnu.org/software/coreutils/coreutils.html) to find a good way of doing this. For example, "echo x > /tmp/foo; du -sh /tmp/foo" reports 4K on my workstation.

Change History

12/22/06 13:36:36 changed by julians@rsp.com.au

Conversely, a multi-GB file consisting only of zeroes might occupy only a couple of blocks if the file system stores it as a sparse file (http://www.lrdev.com/lr/unix/sparsefile.html) or compresses it transparently (for example, the LZ77-based compression provided by NTFS).
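The sparse-file effect can be observed with a short Python sketch. The function name sparse_file_sizes is hypothetical, and the code assumes a Unix-like system where st_blocks is available and counted in 512-byte units; whether the hole is actually left unallocated depends on the file system holding the temporary directory.

```python
import os
import tempfile


def sparse_file_sizes(logical_size=64 * 1024 * 1024):
    """Create a temporary file with a large hole.

    Returns (apparent_size, allocated_bytes) as reported by fstat.
    """
    with tempfile.NamedTemporaryFile() as f:
        # Seek far past the start and write a single byte; the bytes
        # before it are never written, so they form a hole.
        f.seek(logical_size - 1)
        f.write(b"\0")
        f.flush()
        st = os.fstat(f.fileno())
        # Assumption: st_blocks is counted in 512-byte units.
        return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    apparent, allocated = sparse_file_sizes()
    print(f"apparent size: {apparent} bytes, allocated: {allocated} bytes")
```

On a file system with sparse-file support the allocated figure is typically far smaller than the apparent size; on one without it, the two are roughly equal.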

A naive approach to finding the actual disk usage of a file is to use stat(2)'s st_blksize*st_blocks. du also derives disk usage from the block count reported by stat(2). However, this does not account for the capability of some modern file systems (e.g. Reiser4) to handle small files efficiently by putting the data of multiple small files into a single block, or rather by doing away with block alignment requirements altogether.

Quoting http://www.namesys.com:

"Reiser4 uses dancing trees (...). This makes Reiser4 more space efficient than other filesystems because we squish small files together rather than wasting space due to block alignment like they do."

01/23/07 19:44:52 changed by julians@rsp.com.au

It turns out that st_blksize*st_blocks is not the correct formula to calculate disk usage of a file. According to the stat(2) man page on both OS X 10.4 and a somewhat old Linux distribution (glibc 2.3.2), st_blksize is not to be used as the scale of st_blocks; rather, st_blocks is given in 512-byte blocks regardless of the actual block size of the file system. (st_blksize doesn't necessarily give the actual block size of the file system either; rather, it's the "recommended block size for optimal I/O.")

In summary, the disk usage of a file should be calculated as st_blocks * 512.
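As a rough illustration of this formula (not Earth's actual implementation), the following Python sketch compares a file's apparent size with st_blocks * 512. The names disk_usage, apparent_size and demo are hypothetical, and the 512-byte unit is assumed to hold on the platform in use, as the man pages cited above state for Linux and OS X.

```python
import os
import tempfile

# Unit of st_blocks according to stat(2) on Linux and OS X.
# Assumption: the platform running this code uses the same unit.
BLOCK_UNIT = 512


def apparent_size(path):
    """Return the logical file size in bytes (st_size)."""
    return os.lstat(path).st_size


def disk_usage(path):
    """Return the bytes allocated on disk, computed as st_blocks * 512."""
    return os.lstat(path).st_blocks * BLOCK_UNIT


def demo():
    """Write a two-byte file, like `echo x > foo`, and report both sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "foo")
        with open(path, "wb") as f:
            f.write(b"x\n")
        return apparent_size(path), disk_usage(path)


if __name__ == "__main__":
    size, usage = demo()
    print(f"file size: {size} bytes, disk usage: {usage} bytes")
```

os.lstat is used here so that a symbolic link is measured itself rather than its target; os.stat would be the choice if links should be followed.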

01/29/07 23:15:02 changed by mlandauer

Can I close this ticket now?

01/29/07 23:21:19 changed by mlandauer

  • milestone set to milestone2.

02/15/07 11:28:34 changed by julians

  • status changed from new to closed.
  • resolution set to fixed.

Earth now stores disk space usage, so I am closing this ticket.