MooseFS is a fault tolerant, network distributed file system. It spreads data over several physical servers which are visible to the user as one resource. For standard file operations MooseFS acts as other Unix-alike file systems:
- A hierarchical structure (directory tree)
- Stores POSIX file attributes (permissions, last access and modification times)
- Supports special files (block and character devices, pipes and sockets)
- Symbolic links (file names pointing to target files, not necessarily on MooseFS) and hard links (different names of files which refer to the same data on MooseFS)
- Access to the file system can be limited based on IP address and/or password
Distinctive features of MooseFS are:
- High reliability (several copies of the data can be stored across separate computers)
- Capacity is dynamically expandable by attaching new computers/disks
- Deleted files are retained for a configurable period of time (a file system level "trash bin")
- Coherent snapshots of files, even while the file is being written/accessed
MooseFS consists of four components:
- Managing server (master server) – a single machine managing the whole filesystem, storing metadata for every file (information on size, attributes and file location(s), including all information about non-regular files, i.e. directories, sockets, pipes and devices).
- Data servers (chunk servers) - any number of commodity servers storing files data and synchronizing it among themselves (if a certain file is supposed to exist in more than one copy).
- Metadata backup server(s) (metalogger server) - any number of servers, all of which store metadata changelogs and periodically downloading main metadata file; so as to promote these servers to the the role of the Managing server when primary master stops working.
- Client computers that access (mount) the files in MooseFS - any number of machines using mfsmount process to communicate with the managing server (to receive and modify file metadata) and with chunkservers (to exchange actual file data).
Metadata is stored in the memory of the managing server and simultaneously saved to disk (as a periodically updated binary file and immediately updated incremental logs). The main binary file as well as the logs are synchronized to the metaloggers (if present).
File data is divided into fragments (chunks) with a maximum of 64MiB each. Each chunk is itself a file on selected disks on data servers (chunkservers).
High reliability is achieved by configuring as many different data servers as appropriate to realize the "goal" value (number of copies to keep) set for the given file.
HOW THE SYSTEM WORKS
All file operations on a client computer that has mounted MooseFS are exactly the same as they would be with other file systems. The operating system kernel transfers all file operations to the FUSE module, which communicates with the mfsmount process. The mfsmount process communicates through the network subsequently with the managing server and data servers (chunk servers). This entire process is fully transparent to the user.
mfsmount communicates with the managing server every time an operation on file metadata is required:
- creating files
- deleting files
- reading directories
- reading and changing attributes
- changing file sizes
- at the start of reading or writing data
- on any access to special files on MFSMETA
mfsmount uses a direct connection to the data server (chunk server) that stores the relevant chunk of a file. When writing a file, after finishing the write process the managing server receives information from mfsmount to update a file's length and the last modification time.
Furthermore, data servers (chunk servers) communicate with each other to replicate data in order to achieve the appropriate number of copies of a file on different machines.
This of course does not refer to files with the "goal" set to 1, in which case the file will only exist on a single data server irrespective of how many data servers are deployed in the system.
Exceptionally important files may have their goal set to a number higher than two, which will allow these files to be resistant to a breakdown of more than one server at once.
In general the setting for the number of copies available should be one more than the anticipated number of inaccessible or out-of-order servers.
In the case where a single data server experiences a failure or disconnection from the network, the files stored within it that had at least two copies, will remain accessible from another data server. The data that is now 'under its goal' will be replicated on another accessible data server to again provide the required number of copies.
It should be noted that if the number of available servers is lower than the "goal" set for a given file, the required number of copies cannot be preserved. Similarly if there are the same number of servers as the currently set goal and if a data server has reached 100% of its capacity, it will be unable to begin to hold a copy of a file that is now below its goal threshold due to another data server going offline. In these cases a new data server should be connected to the system as soon as possible in order to maintain the desired number of copies of the file.
A new data server can be connected to the system at any time. The new capacity will immediately become available for use to store new files or to hold replicated copies of files from other data servers.
Administrative utilities exist to query the status of the files within the file system to determine if any of the files are currently below their goal (set number of copies). This utility can also be used to alter the goal setting as required.
The data fragments stored in the chunks are versioned, so re-connecting a data server with older copy of data (such as if it had been offline for a period of time), will not cause the files to become incoherent. The data server will synchronize itself to hold the current versions of the chunks, where the obsolete chunks will be removed and the free space will be reallocated to hold the new chunks.
Failures of a client machine (that runs the mfsmount process) will have no influence on the coherence of the file system or on the other client's operations. In the worst case scenario the data that has not yet been sent from the failed client computer may be lost.
PLATFORMSMooseFS is available on every Operating System with a working FUSE implementation:
- Linux (Linux 2.6.14 and up have FUSE support included in the official kernel)
- MacOS X
The master server, metalogger server and chunkservers can also be run on Solaris or Windows with Cygwin. Unfortunately without FUSE it won't be possible to mount the filesystem within these operating systems.