Operating Systems Notes Part 5 – File Systems and the Disk

This is the last post of the series, since Security wasn’t covered in full.

A storage device (disk) has a Master Boot Record (MBR), a Partition Table and one or more partitions. Each partition can have a different file system, but there’s no guarantee your OS can use it.
Most of a partition consists of the Blocks, which are to the disk what pages are to memory. Blocks that are too large cause internal fragmentation, and blocks that are too small make sequential access to files slower (since each block requires a separate disk I/O operation, and those are slow).

The Unix File System (UFS) – i-nodes

  • “Everything is a file” is the guiding principle in Unix-based systems.
  • A part of the partition holds the i-nodes, which are the data structures used to describe files and the locations of their contents.
    An i-node is made up of five parts:
    1. File Attributes, like owner ID, last access time, etc.
    2. n Direct Blocks (usually n = 10), which are pointers to the blocks containing the first (n × BlockSize) bytes of the file’s data.
    3. Indirect Block Pointer, which is a pointer to a block that contains (BlockSize / PointerSize) pointers, each to a block of data.
    4. Double Indirect Block Pointer, which is a pointer to a block that contains (BlockSize / PointerSize) indirect block pointers.
    5. Triple Indirect Block Pointer, which is a pointer to a block that contains (BlockSize / PointerSize) double indirect block pointers.

    Overall, this means that a file’s maximum size is: BlockSize × (n + (BlockSize/PointerSize) + (BlockSize/PointerSize)² + (BlockSize/PointerSize)³). If we have 1KB blocks and 4-byte pointers, we can’t create a file larger than ~16GB.
    This structure means that smaller files have the advantage of requiring fewer disk reads than larger files.
    Example: block size is 1K (2¹⁰ bytes) and pointer size is 4 bytes. Byte 355,000 is in the double indirect’s 80th block, since 355,000 = (10 + 2¹⁰/4 + 80) × 2¹⁰ + 696 (a sketch of this calculation appears after this list).

  • Hard Links are a way to create more than one file that references the same i-node. The i-node keeps a Reference Counter, since deleting a hard link frees the i-node and blocks only if no references to the file remain other than the one being deleted now. Hard links to directories are allowed (and therefore turn the directory tree into a graph), but only for administrators, since they could create a cycle in the graph, which must remain acyclic (a DAG).
  • Symbolic Links (or Soft Links) are files whose contents are the path to another file. Such a file is simply a shortcut, but it is treated as if it were the original file when executed, read, etc.
  • Directories are files (as we’ve stated before), whose contents are simply records of <i-node number, file name> pairs. The root directory is the first i-node.
  • lockf is a system call that allows processes to avoid working with the same locations in files, by locking ranges of bytes in a file. Code can decide not to play nice and simply never call lockf, and there’s nothing you can do about it, so all code accessing the file needs to agree to use lockf or it’s simply useless (a usage sketch follows the i-node example below).
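
To make the i-node arithmetic concrete, here is a minimal C sketch of the byte-offset-to-block calculation and the maximum-size formula, assuming the 1K blocks, 4-byte pointers and 10 direct blocks from the example above (the function and constant names are made up for illustration):

    #include <stdio.h>

    #define BLOCK_SIZE   1024                       /* 1K blocks             */
    #define PTR_SIZE     4                          /* 4-byte block pointers */
    #define N_DIRECT     10                         /* n direct blocks       */
    #define PTRS_PER_BLK (BLOCK_SIZE / PTR_SIZE)    /* 256 pointers/block    */

    /* Report which part of the i-node locates the given byte offset. */
    static void locate(unsigned long byte)
    {
        unsigned long blk = byte / BLOCK_SIZE;      /* logical block number    */
        unsigned long off = byte % BLOCK_SIZE;      /* offset inside the block */

        if (blk < N_DIRECT) {
            printf("direct block %lu, offset %lu\n", blk, off);
            return;
        }
        blk -= N_DIRECT;
        if (blk < PTRS_PER_BLK) {
            printf("single indirect, block %lu, offset %lu\n", blk, off);
            return;
        }
        blk -= PTRS_PER_BLK;
        if (blk < (unsigned long)PTRS_PER_BLK * PTRS_PER_BLK) {
            printf("double indirect, block %lu, offset %lu\n", blk, off);
            return;
        }
        blk -= (unsigned long)PTRS_PER_BLK * PTRS_PER_BLK;
        printf("triple indirect, block %lu, offset %lu\n", blk, off);
    }

    int main(void)
    {
        locate(355000);   /* prints: double indirect, block 80, offset 696 */

        /* The maximum file size formula from above (~16GB): */
        unsigned long long p = PTRS_PER_BLK;
        unsigned long long max =
            (unsigned long long)BLOCK_SIZE * (N_DIRECT + p + p * p + p * p * p);
        printf("max file size: %llu bytes\n", max);
        return 0;
    }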
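
And a minimal usage sketch of lockf itself; the file name is hypothetical and error handling is mostly omitted. Remember that this only protects against other processes that also call lockf:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.db", O_RDWR);   /* hypothetical shared file */
        if (fd < 0) { perror("open"); return 1; }

        lseek(fd, 100, SEEK_SET);             /* lockf acts from the current offset    */
        if (lockf(fd, F_LOCK, 50) == 0) {     /* lock bytes 100..149 (blocks if taken) */
            /* ... safely read/modify/write those 50 bytes here ... */
            lseek(fd, 100, SEEK_SET);
            lockf(fd, F_ULOCK, 50);           /* release the range */
        }
        close(fd);
        return 0;
    }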

File Allocation Table (FAT)

  • A part of the partition is the File Allocation Table itself, which has an entry for each block, holding the number of the file’s next block. Each file is thus a linked list threaded through a fixed-length table, where the only thing that ever changes is the pointers (the nodes’ order); a toy traversal is sketched after this list.
  • Directories are implemented as files with records of <file name+extension (8.3), file attributes, first data block number> tuples. Since the file attributes are kept inside this fixed-length record (rather than in a shared structure like an i-node), hard links can’t be added to the file system: there’s no room for the reference counting. The root directory is the first file in the table.
  • Long File Names – in later implementations of FAT (starting with Windows NT 3.5), long filenames were allowed by using the filename part of the record as a pointer to an offset in a heap, kept in the directory’s blocks, that contains the file names.
  • Since storage (disk) is non-volatile, corruption persists across reboots, so FAT has an Optional Duplicate FAT, which is a backup of the FAT but, as the name indicates, is optional.
  • The maximum size of FAT-12 is 16MB, that of FAT-16 is 2GB, and that of FAT-32 is 2TB. This is due to limitations on the table entry’s width in bits (the n in FAT-n) and on the block size (at most 4KB in FAT-12 and 32KB in the others).
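
Here is a toy sketch of walking a file’s chain of blocks through an in-memory FAT. The table contents and the end-of-chain marker are invented for illustration and don’t match any real FAT variant’s on-disk values:

    #include <stdio.h>
    #include <stdint.h>

    #define EOC 0xFFFF   /* toy end-of-chain marker */

    int main(void)
    {
        /* fat[i] holds the number of the block that follows block i.
         * A toy file starts at block 2 and occupies blocks 2 -> 5 -> 3. */
        uint16_t fat[8] = { EOC, EOC, 5, EOC, EOC, 3, EOC, EOC };

        for (int blk = 2; blk != EOC; blk = fat[blk])
            printf("read block %d\n", blk);   /* prints 2, 5, 3 */
        return 0;
    }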

New Technology File System (NTFS)

  • A file in the partition is the Master File Table (MFT), which contains modular records. Since the MFT is a file, the only limit to the number of files is how much space you have on the disk.
  • Except for the record’s header, each module in the record is called an Attribute and is made up of a <header, data> pair. The following attributes are mandatory:
    1. Standard Information – these are the same old file attributes we’ve seen in UFS (i-nodes) and FAT.
    2. File Name – since this is an independent attribute that need not be inlined, there’s no inherent limit to the length of the filename (but one is set anyway).
    3. Data – the data portion of this attribute is divided into ordered Runs, which indicate the locations of the file’s fragments as <starting block, count> pairs. The first run is <0, total number of runs>.
  • An attribute’s data may either be Inline (that is, inside the MFT record itself) or not (which means there’s a pointer to a block containing the attribute’s data). An attribute’s header is always in the MFT record. Inlining is a consideration of the file system itself and need not be controlled by the OS.
  • If a record is too long, we can add an Extension Records attribute and create more records to hold the rest.
  • Directories are implemented as files with an index attribute, the data portion of which holds records of <MFT record #, file name length, file name, some flags, …> tuples. Yes, this means the file name is duplicated. The root directory is record 5 in the MFT.
  • One of the records in the MFT points to itself for consistency. Another points to a backup duplicate (like in FAT, only not optional).
  • Transparent File Compression is an ability of NTFS that allows some of a file’s runs to be stored compressed if their compression ratio is high enough (if it weren’t, it wouldn’t be worth the effort to compress/decompress). Each compressed run is followed by a run of the form <0, number of blocks saved by compression> (see the sketch below).
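
A small sketch of interpreting such a run list, including the <0, saved> convention for compressed runs. The structures and values are invented for illustration (this is not NTFS’s on-disk format), and the header run <0, total number of runs> is omitted for simplicity:

    #include <stdio.h>

    /* A run locates one fragment of the file: <starting block, count>.
     * By the convention above, a run with start == 0 records how many
     * blocks the compression of the preceding run saved.               */
    struct run { unsigned start; unsigned count; };

    int main(void)
    {
        /* Toy run list: 20 blocks at 100, then 12 blocks at 300 that
         * decompress to 12 + 4 = 16 logical blocks.                    */
        struct run runs[] = { {100, 20}, {300, 12}, {0, 4} };
        unsigned logical = 0;

        for (unsigned i = 0; i < sizeof runs / sizeof runs[0]; i++) {
            if (runs[i].start == 0) {
                printf("previous run was compressed; %u blocks saved\n",
                       runs[i].count);
            } else {
                printf("blocks %u..%u on disk -> logical %u..%u\n",
                       runs[i].start, runs[i].start + runs[i].count - 1,
                       logical, logical + runs[i].count - 1);
            }
            logical += runs[i].count;   /* both kinds consume logical blocks */
        }
        return 0;
    }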

The Buffer Cache

  • Both Unix and Windows have a cache, held by their kernels, that allows for faster disk access. Blocks that are read from or written to disk are cached, and high-level algorithms decide which blocks to pre-cache or delay-write.
  • Unix implements this cache as a hashtable of size n linking into a “triply linked” list: a doubly linked list in which each node carries a third pointer that chains together the nodes of its hashtable entry (each chain ends at the free list’s header). The doubly linked list is sorted from the most recently used block (list head) to the least recently used block (list tail).
    When a block is looked up, it is searched for in the chain pointed at by the relevant hashtable entry. If we reach the free list’s header, the block isn’t cached. Otherwise, the block is moved to the front of the list and is served from the cache.
    When a new block needs to be added, the doubly linked list’s tail is evicted and the new block is placed at the head of the list (as well as chained into its hashtable entry’s list). A sketch follows.
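
A compact sketch of that lookup/eviction logic, simplified in two ways: the bucket chains end in NULL rather than at the free list’s header, and buffers hold no data. The sizes and names are made up; this is not the actual Unix kernel code:

    #include <stdio.h>
    #include <stdlib.h>

    #define NBUCKETS 16
    #define NBUFS    4              /* tiny cache so eviction is easy to see */

    struct buf {
        int blockno;                /* which disk block this buffer caches   */
        struct buf *hnext;          /* the third pointer: hash-bucket chain  */
        struct buf *prev, *next;    /* doubly linked LRU list                */
    };

    static struct buf *bucket[NBUCKETS];
    static struct buf lru = { -1, NULL, &lru, &lru };   /* dummy list head   */

    /* Unchain a buffer from its hash bucket (singly linked removal). */
    static void hash_remove(struct buf *b)
    {
        struct buf **p = &bucket[b->blockno % NBUCKETS];
        while (*p != b)
            p = &(*p)->hnext;
        *p = b->hnext;
    }

    /* Move a buffer to the head (MRU end) of the LRU list. */
    static void lru_to_front(struct buf *b)
    {
        b->prev->next = b->next;
        b->next->prev = b->prev;
        b->next = lru.next;
        b->prev = &lru;
        lru.next->prev = b;
        lru.next = b;
    }

    /* Get the buffer caching blockno, loading (and maybe evicting) on a miss. */
    static struct buf *bget(int blockno)
    {
        static int used = 0;
        struct buf *b;

        for (b = bucket[blockno % NBUCKETS]; b; b = b->hnext)
            if (b->blockno == blockno) {        /* hit: serve from the cache */
                lru_to_front(b);
                return b;
            }

        if (used < NBUFS) {                     /* miss, and room is left    */
            b = malloc(sizeof *b);
            b->next = lru.next;                 /* link fresh buffer as MRU  */
            b->prev = &lru;
            lru.next->prev = b;
            lru.next = b;
            used++;
        } else {                                /* miss: evict the LRU tail  */
            b = lru.prev;
            hash_remove(b);                     /* unchain from old bucket   */
            lru_to_front(b);                    /* the tail becomes the MRU  */
        }
        /* (a real cache would read blockno's contents from disk here)       */
        b->blockno = blockno;
        b->hnext = bucket[blockno % NBUCKETS];  /* chain into its new bucket */
        bucket[blockno % NBUCKETS] = b;
        return b;
    }

    int main(void)
    {
        int refs[] = { 1, 2, 3, 4, 1, 5 };      /* block 5 evicts block 2    */
        for (int i = 0; i < 6; i++)
            printf("block %d -> buffer %p\n", refs[i], (void *)bget(refs[i]));
        return 0;
    }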

Log-Structured File Systems (LFS)

  • These file systems are based on the assumption that there is a big buffer cache, which means that most disk operations will be writes.
  • All write operations are kept in the cache and, once in a while, they are sequentially committed to the disk in one fell swoop (to a ‘log’), as sketched below.
  • Re-organization can now be a separate daemon (service).
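
A toy sketch of the batching idea, assuming an invented segment size and log file name; a real LFS also writes i-node maps and segment summaries, which are omitted here:

    #include <stdio.h>
    #include <string.h>

    #define SEG_SIZE 4096   /* collect this many bytes before touching disk */

    static char   segment[SEG_SIZE];
    static size_t seg_used = 0;

    /* Buffer a write; only a full segment causes actual, sequential disk I/O. */
    static void lfs_write(FILE *log, const char *data, size_t len)
    {
        if (seg_used + len > SEG_SIZE) {        /* segment full: flush it    */
            fwrite(segment, 1, seg_used, log);  /* one big sequential write  */
            seg_used = 0;
        }
        memcpy(segment + seg_used, data, len);
        seg_used += len;
    }

    int main(void)
    {
        FILE *log = fopen("fs.log", "ab");      /* the on-disk log           */
        if (!log)
            return 1;
        for (int i = 0; i < 1000; i++)          /* many small writes...      */
            lfs_write(log, "block update\n", 13);
        fwrite(segment, 1, seg_used, log);      /* flush the final partial segment */
        fclose(log);
        return 0;
    }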

Reliability

  • The file system must be consistent. UFS (i-nodes) has a program named fsck (File System Check) that checks two things:
    • It checks the disk’s blocks by counting, for each block, the number of times it appears on the free list and the number of times it is used in i-nodes, and then acts in the following manner (see the first sketch after this list):
      1. If the block appears in neither count, it is added to the free list.
      2. If it appears more than once on the free list, the excess entries are deleted so that only one is kept.
      3. If it appears on the free list and is also referenced by an i-node, it’s removed from the free list.
      4. If n i-nodes reference the block (where n > 1), the block is copied so that each i-node references its own copy.
    • It checks the reference counts for hard links. The number of directory entries actually found pointing at the file always trumps the number currently stored in the reference count and replaces it. If either number is 0, the file is erased!
  • Dumps are used to keep a backup of a disk. Full dumps are long and wasteful, so we use Incremental Dumps, which only copy the files changed since the last dump.
    Incremental dumps use bit-maps to decide which files and directories are to be dumped, in the following manner (see the second sketch below):
    1. Create a bit-map the length of the number of files in the system.
    2. Turn on the bits of all changed files and of all directories. The rest stay off.
    3. Turn off the bits of directories that are not on the path of any changed file.
    4. Write all marked directories to the dump.
    5. Write all marked files to the dump.
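
To illustrate the block check, here is a minimal sketch over invented counters, as if they had been gathered by scanning the free list and every i-node (the values are chosen just to trigger each rule once):

    #include <stdio.h>

    #define NBLOCKS 8

    int main(void)
    {
        /* For each block: times used by i-nodes, times on the free list. */
        int in_use[NBLOCKS]  = { 1, 0, 0, 1, 2, 0, 1, 0 };
        int on_free[NBLOCKS] = { 0, 1, 0, 1, 0, 2, 0, 1 };

        for (int b = 0; b < NBLOCKS; b++) {
            if (in_use[b] == 0 && on_free[b] == 0)          /* rule 1 */
                printf("block %d: missing -> add to free list\n", b);
            else if (on_free[b] > 1)                        /* rule 2 */
                printf("block %d: duplicated on free list -> keep one entry\n", b);
            else if (in_use[b] > 0 && on_free[b] > 0)       /* rule 3 */
                printf("block %d: used AND free -> remove from free list\n", b);
            else if (in_use[b] > 1)                         /* rule 4 */
                printf("block %d: used by %d i-nodes -> give each its own copy\n",
                       b, in_use[b]);
        }
        return 0;
    }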
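
And a toy sketch of the incremental-dump bit-map phases over an invented five-entry file tree. Phase 3 is done here by re-marking only directories on a changed file’s path to the root, which implements the same rule:

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy file tree: each entry knows its parent (-1 for the root). */
    struct node { const char *name; int parent; bool is_dir; bool changed; };

    static struct node fs[] = {
        { "/",         -1, true,  false },   /* 0 */
        { "/usr",       0, true,  false },   /* 1 */
        { "/usr/a.c",   1, false, true  },   /* 2: modified since last dump */
        { "/tmp",       0, true,  false },   /* 3 */
        { "/tmp/junk",  3, false, false },   /* 4: unchanged                */
    };
    #define N (sizeof fs / sizeof fs[0])

    int main(void)
    {
        bool bit[N];

        /* Steps 1+2: mark all changed files and ALL directories. */
        for (int i = 0; i < (int)N; i++)
            bit[i] = fs[i].is_dir || fs[i].changed;

        /* Step 3: keep only directories on some changed file's path. */
        for (int i = 0; i < (int)N; i++)
            if (fs[i].is_dir)
                bit[i] = false;
        for (int i = 0; i < (int)N; i++)
            if (!fs[i].is_dir && fs[i].changed)
                for (int d = fs[i].parent; d != -1; d = fs[d].parent)
                    bit[d] = true;

        /* Steps 4+5: dump marked directories first, then marked files. */
        for (int i = 0; i < (int)N; i++)
            if (bit[i] && fs[i].is_dir)  printf("dump dir  %s\n", fs[i].name);
        for (int i = 0; i < (int)N; i++)
            if (bit[i] && !fs[i].is_dir) printf("dump file %s\n", fs[i].name);
        return 0;
    }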

An option to optimize the file system’s speed is to use the physical makeup of the disk and the knowledge of which operations cost it a lot of time: for instance, placing the i-nodes/FAT/MFT at the beginning of the disk, organizing related blocks into groups of clusters, etc.


3 thoughts on “Operating Systems Notes Part 5 – File Systems and the Disk”

  1. The file size limit for UFS is the result of a calculation, so you can always use bigger blocks or longer pointers for larger files. However, a limit will always be in place (although it might be bigger than the disk’s whole size), since there’s a limited number of pointers (divide the maximum file size by BlockSize and you’ll get the maximum number of available pointers per file).
    FAT and NTFS are only limited by the size of the disk, since both of them employ linked lists of allocated blocks (the FAT table is a list, and NTFS can have extension records and non-inline runs).

  2. Good luck this afternoon :) I just found your OS summary this morning … after making one myself the old-fashioned way: 20 hand-written pages.
