FAQ

 

What average write/read speeds can we expect?

 

The raw reading and writing speed depends mainly on:
  • the performance of the used hard disk drives
  • the network capacity and its topology
The better the performance of the hard drives used and the better the throughput of the network, the higher the performance of the whole system.
 
On our in-house commodity servers (which are also used by other applications), connected by a simple gigabit Ethernet network, we run a petabyte-class installation on Linux (Debian) with goal=2. We see write speeds of about 20-30 MiB/s and read speeds of 30-50 MiB/s. For smaller blocks the write speed decreases, but reading is not affected that much.

A similar FreeBSD-based installation has shown slightly better writes and worse reads, giving slightly better performance overall.

 

Does the goal setting influence writing/reading speeds?

 

Generally speaking, it does not. The goal setting can influence the reading speed only under certain conditions. For example, when a file has its goal set to two or more, several clients reading that file at the same time can be served faster, because the reads are balanced across the chunkservers that hold its copies. In practice, however, it is very rare for several computers to read the same file at the same moment, so the goal setting has little influence on reading speeds.

Similarly, the goal setting has a negligible influence on the writing speed.

 

Are concurrent read operations supported?

 

All read operations are parallel - there is no problem with concurrent reading of the same data by several clients at the same moment.

 

How much CPU/RAM resources are used?

 

In our environment (ca. 500 TiB, 25 million files and 2 million folders distributed over 26 million chunks on 70 machines), chunkserver CPU usage (under constant file transfer) is about 15-20%, and each chunkserver usually uses about 100 MiB of RAM (independent of the amount of data). The master server uses about 30% of CPU (ca. 1500 operations per second) and 8 GiB of RAM. CPU load depends on the number of operations, and RAM usage on the total number of files and folders, not on the total size of the files themselves. RAM usage is proportional to the number of entries in the file system because the master server process keeps the entire metadata cached in memory for performance.

 

Is it possible to add/remove chunkservers and disks on the fly?

 

You can add and remove chunkservers on the fly. Keep in mind, however, that it is not wise to disconnect a chunkserver that holds the only copy of a chunk in the file system (the CGI monitor marks such chunks in orange). You can also disconnect (replace) an individual hard drive. The procedure for this operation is:
  1. Mark the disk(s) for removal (see How to mark a disk for removal?)
  2. Restart the chunkserver process
  3. Wait for the replication (there should be no “undergoal” or “missing” chunks marked in yellow, orange or red in CGI monitor)
  4. Stop the chunkserver process
  5. Delete entry(ies) of the disconnected disk(s) in 'mfshdd.cfg'
  6. Stop the chunkserver machine
  7. Remove hard drive(s)
  8. Start the machine
  9. Start the chunkserver process
If you have hot-swap disk(s), replace steps 6-9 with the following after step 5:
  6. Unmount the disk(s)
  7. Remove the hard drive(s)
  8. Start the chunkserver process
If you follow the above steps, the work of client computers will not be interrupted and the whole operation will not be noticed by MooseFS users.

 

How to mark a disk for removal?

 

When you want to mark a disk for removal from a chunkserver you need to edit the chunkserver's mfshdd.cfg configuration file and put an asterisk '*' at the start of the line of the disk that is to be removed. For example, in this mfshdd.cfg we have marked "/mnt/hdd" for removal:
/mnt/hda
/mnt/hdb
/mnt/hdc
*/mnt/hdd
/mnt/hde


After changing mfshdd.cfg you need to restart the chunkserver (mfschunkserver restart).

Once the disk has been marked for removal and the chunkserver process has been restarted, the system will make an appropriate number of copies of the chunks stored on this disk, so as to maintain the required "goal" number of copies.

Finally, before the disk can be disconnected you need to confirm there are no "undergoal" chunks on the other disks. This can be done using the CGI Monitor. On the "Info" tab select "Regular chunks state matrix" mode.
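
The marking step can also be scripted. The following is only a minimal sketch in Python, assuming mfshdd.cfg holds one disk path per line as in the example above; the helper name mark_for_removal is hypothetical and not part of MooseFS. It works on the text of the file and leaves reading, writing and the chunkserver restart to you.

```python
def mark_for_removal(cfg_text: str, disk_path: str) -> str:
    """Return cfg_text with '*' put in front of the line for disk_path.

    Hypothetical helper: assumes one disk path per line, as in the
    mfshdd.cfg example above.
    """
    out = []
    found = False
    for line in cfg_text.splitlines():
        if line.strip() == disk_path:
            out.append("*" + line.strip())  # asterisk marks the disk for removal
            found = True
        else:
            out.append(line)                # keep every other line untouched
    if not found:
        raise ValueError(f"{disk_path} is not an unmarked entry in the config")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = "/mnt/hda\n/mnt/hdb\n/mnt/hdc\n/mnt/hdd\n/mnt/hde\n"
    print(mark_for_removal(example, "/mnt/hdd"), end="")
```

A disk that is already marked (or not listed at all) raises ValueError instead of being silently skipped, so a typo in the path does not go unnoticed.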

 

My experience with clustered filesystems is that metadata operations are quite slow. How did you resolve this problem?

 

During our research and development we also observed the problem of slow metadata operations. We decided to alleviate it by caching the file system structure in RAM on the metadata server. This is why the metadata server has increased memory requirements. The cached metadata is frequently flushed to log files on the master server. In addition, the metalogger server(s) frequently receive updates to the metadata structure and log them to their own file systems.

 

When doing df -h on a filesystem the results are different from what I would expect taking into account actual sizes of written files.

 

Every chunkserver reports its own disk usage increased by 256MB for each used partition/hdd, and the master sends the sum of these to the client as the total disk usage. If you have 3 chunkservers with 7 hdd each, your disk usage will be increased by 3*7*256MB (about 5GB). In practice this is not usually a concern, for example when you have 150TB of HDD space.

There is one other thing. If you use disks exclusively for MooseFS on chunkservers df will show correct disk usage, but if you have other data on your MooseFS disks df will count your own files too.  

If you want to see usage of your MooseFS files use 'mfsdirinfo' command.
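
The 256MB-per-disk figure above is easy to apply to your own installation. This small Python sketch only reproduces that arithmetic; the function name is made up for illustration.

```python
RESERVED_PER_DISK_MB = 256  # per-partition figure quoted in the answer above


def reported_overhead_mb(chunkservers: int, disks_per_server: int) -> int:
    """Extra disk usage (in MB) that df shows on top of the real usage."""
    return chunkservers * disks_per_server * RESERVED_PER_DISK_MB


if __name__ == "__main__":
    # 3 chunkservers with 7 disks each, as in the example above
    print(reported_overhead_mb(3, 7), "MB")
```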

 

Can I keep source code in MooseFS? Why do small files occupy more space than I would have expected?

 

The system was initially designed for keeping large numbers (like several thousands) of very big files (tens of gigabytes each) and has a hard-coded chunk size of 64MiB and block size of 64KiB. Using a consistent block size helps improve networking performance and efficiency, as all nodes in the system are able to work with a single 'bucket' size. That’s why even a small file will occupy 64KiB plus an additional 4KiB of checksums and 1KiB for the header. All transfers in the system are done in blocks of 64KiB. However, this doesn’t have any impact on performance. (A normal file system will typically also use some degree of block read-ahead, which sometimes fetches some superfluous data.)

The issue regarding the occupied space of a small file stored inside a MooseFS chunk is really more significant, but in our opinion it is still negligible. Let’s take 25 million files with a goal set to 2. Counting the storage overhead, this could create about 50 million 69 KiB chunks, that may not be completely utilized due to internal fragmentation (wherever the file size was less than the chunk size). So the overall wasted space for the 50 million chunks would be approximately 3.2TiB. By modern standards, this should not be a significant concern. A more typical, medium to large project with 100,000 small files would consume at most 13GiB of extra space due to this internal chunk fragmentation.
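
The figures above can be checked with a few lines of Python. This is a sketch of the arithmetic only, assuming every small file occupies one 69 KiB chunk per copy (64 KiB block + 4 KiB checksums + 1 KiB header) and that the 100,000-file example also uses goal 2; the function names are illustrative.

```python
KIB = 1024
PER_COPY_KIB = 64 + 4 + 1  # data block + checksums + header = 69 KiB


def small_file_footprint_bytes(files: int, goal: int) -> int:
    """Upper bound on disk space used by `files` small files kept in `goal` copies."""
    return files * goal * PER_COPY_KIB * KIB


def to_gib(n_bytes: int) -> float:
    return n_bytes / 1024**3


def to_tib(n_bytes: int) -> float:
    return n_bytes / 1024**4


if __name__ == "__main__":
    big = small_file_footprint_bytes(25_000_000, goal=2)
    small = small_file_footprint_bytes(100_000, goal=2)
    print(f"25 million files: {to_tib(big):.1f} TiB")
    print(f"100,000 files:    {to_gib(small):.1f} GiB")
```

Both results are upper bounds on the space occupied: they count the full 69 KiB for every copy, regardless of how much of the 64 KiB block the file actually fills.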

So it is quite reasonable to store source code files in a MooseFS system, either for active use during development or for long term reliable storage or archival purposes.

Perhaps the larger factor to consider is the comfort of developing code given the performance of a network file system. When using MooseFS (or any other network-based file system such as NFS or CIFS, for that matter) for a project under active development, the network file system may not be able to perform file IO operations at the same speed as a directly attached regular hard drive.

Some modern integrated development environments (IDEs) such as Eclipse make frequent IO requests on several small workspace metadata files. Running Eclipse with the workspace folder on a MooseFS file system (and, again, on any other networked file system) will give slightly slower user interface performance than running Eclipse with the workspace on a local hard drive.

This is more of a feature of the way these IDE products are designed, utilizing the file system as its active model and not employing memory based caching. So you may need to evaluate for yourself if using MooseFS for your working copy of active development within an IDE is right for you.

In a different example, using a typical text editor for source code editing and a version control system such as Subversion to check out project files into a MooseFS file system does not typically cause any observable performance degradation. The IO overhead of the network file system nature of MooseFS is offset by the larger IO latency of interacting with the remote Subversion repository. And the individual file operations (open, save) do not have any observable latency when using simple text editors (as opposed to complicated IDE products).

A more likely situation would be to have the Subversion repository files hosted within a MooseFS file system, where svnserve or Apache + mod_dav_svn would service requests to the Subversion repository and users would check out working sandboxes onto their local hard drives (which possibly would not be using MooseFS).

 

Do chunkservers and metadata server do their own checksumming?

 

Yes, the system does its own checksumming. We expected it to be CPU-intensive, but in practice it is not. The overhead is about 4B per 64KiB block, which is 4KiB per 64MiB chunk (for each copy, i.e. per goal).

 

What sort of sizing is required for the Master server?

 

The most important factor is the RAM of the mfsmaster machine, as the full file system structure is cached in RAM for speed. Besides RAM, the mfsmaster machine needs some space on HDD for the main metadata file together with incremental logs.
The size of the metadata file depends on the number of files (not on their sizes). The size of the incremental logs depends on the number of operations per hour, but the length (in hours) of this incremental log is configurable.

1 million files take approximately 300 MiB of RAM. An installation with 25 million files requires about 8GiB of RAM and 25GiB of space on HDD.
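
For a rough first estimate, the 300 MiB per million files rule of thumb can be extrapolated linearly. The Python sketch below does just that; the linear scaling and the function name are assumptions for illustration, and real usage also depends on the number of folders.

```python
MIB_PER_MILLION_FILES = 300  # rule of thumb quoted above


def master_ram_gib(n_files: int) -> float:
    """Rough master server RAM estimate in GiB, assuming linear scaling."""
    return n_files / 1_000_000 * MIB_PER_MILLION_FILES / 1024


if __name__ == "__main__":
    # 25 million files, as in the installation described above
    print(f"{master_ram_gib(25_000_000):.1f} GiB")
```

For 25 million files this gives a little over 7 GiB, which is consistent with the 8GiB figure quoted above once some headroom is added.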

 

When I delete files or directories the MooseFS size doesn’t change. Why?

 

MooseFS does not erase files immediately on deletion, so that you can revert the delete operation. Deleted files are kept in the trash bin for a configured amount of time before they are removed.

You can configure how long files are kept in the trash and empty the trash manually (to release the space). There are more details in the Reference Guide, in the section "Operations specific for MooseFS".

In short - the time of storing a deleted file can be verified by the mfsgettrashtime command and changed with mfssettrashtime.

 

When I added a third server as an extra chunkserver it looked like it started replicating data to the 3rd server even though the file goal was still set to 2.

 

Yes. The disk usage balancer works on chunks independently, so the chunks of one file could be redistributed across all of your chunkservers.

 

Is MooseFS 64bit compatible?

 

Yes!

 

Can I modify the chunk size?

 

No. File data is divided into fragments (chunks) of at most 64MiB each. The value of 64MiB is hard-coded into the system, so you cannot modify it. We based the chunk size on real-world data and determined that it was a very good compromise between the number of chunks and the speed of rebalancing / updating the filesystem. Of course, if a file is smaller than 64MiB it occupies less space.

In the systems we take care of, several files well exceed 100GB in size, with no noticeable chunk size penalty.
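
To get a feel for how many chunks a file of a given size is split into, the sketch below divides the file size by the fixed 64MiB chunk size and rounds up. It is plain arithmetic in Python, not a MooseFS API, and the function name is illustrative.

```python
CHUNK_SIZE = 64 * 1024**2  # 64 MiB, hard-coded in MooseFS according to the answer above


def chunk_count(file_size: int) -> int:
    """Number of chunks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


if __name__ == "__main__":
    print(chunk_count(100 * 1024**3))  # a 100 GiB file
    print(chunk_count(1))              # a 1-byte file still needs one chunk
```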

 

How do I know if a file has been successfully written to MooseFS?

 

Let's briefly discuss the process of writing to the file system and what programming consequences this bears.

In all contemporary filesystems, files are written through a buffer (write cache). As a result, execution of the write command itself only transfers the data to a buffer (cache), with no actual writing taking place. Hence, a confirmed execution of the write command does not mean that the data has been correctly written to disk. It is only the invocation and completion of the fsync (or close) command that causes all data kept within the buffers (cache) to be physically written out. If an error occurs while such buffered data is being written, the fsync (or close) command may return an error.

The problem is that the vast majority of programmers do not check the status of the close command (a very common mistake). Consequently, a program writing data to disk may "assume" that the data has been written correctly because the write command succeeded, while in actuality the write could have failed during the subsequent close command.

As far as MooseFS is concerned: first, its write buffers are larger than those of classic file systems (for efficiency); second, write errors may be more frequent than with a classic hard drive (the network nature of MooseFS introduces some additional error-inducing situations). As a consequence, the amount of data processed during execution of the close command is often significant, and if an error occurs while that data is being written, it is returned as an error from the close command. Hence it is recommended (especially when using MooseFS) to perform an fsync operation after writing to a file and before executing close, and to check the status of the fsync result. Then, for good measure, check the return status of close as well.

NOTE! When stdio is used, the fflush function only executes the "write" command, so correct execution of fflush is not sufficient to be sure that all data has been written successfully – you should also check the status of fclose.

The above problem frequently occurs when redirecting the standard output of a program to a file in a shell. Bash (and many other programs) does not check the status of the close call, so a command of the form "application > outcome.txt" may finish successfully in the shell while in fact there was an error writing "outcome.txt". You are strongly advised to avoid this shell output redirection syntax when writing to a MooseFS mount point. If necessary, you can create a simple program that reads the standard input and writes everything to a chosen file, correctly checking the result status of the fsync command. For example, use "application | mysaver outcome.txt", where mysaver is the name of your writing program, instead of "application > outcome.txt".
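
A program like mysaver could look roughly as follows. This is only a sketch in Python, not a tool shipped with MooseFS: the names save_stream and main are made up here. It relies on the fact that Python's os.write, os.fsync and os.close raise OSError when the underlying call fails, so no error can pass unnoticed.

```python
import os
import sys


def save_stream(src, path, bufsize=1 << 20):
    """Copy the binary stream src to path.

    Raises OSError if any write, the fsync, or the close fails.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        while True:
            data = src.read(bufsize)
            if not data:
                break
            view = memoryview(data)
            while view:                     # os.write may write fewer bytes than asked
                written = os.write(fd, view)
                view = view[written:]
        os.fsync(fd)                        # force buffered data out; fails loudly
    except BaseException:
        try:
            os.close(fd)                    # best effort, keep the original error
        except OSError:
            pass
        raise
    os.close(fd)                            # the close status is checked as well


def main(argv):
    """Entry point for 'application | mysaver outcome.txt'; returns an exit status."""
    if len(argv) != 2:
        print("usage: application | mysaver outcome.txt", file=sys.stderr)
        return 2
    try:
        save_stream(sys.stdin.buffer, argv[1])
    except OSError as exc:
        print(f"mysaver: writing {argv[1]} failed: {exc}", file=sys.stderr)
        return 1
    return 0

# To use this as a script, finish the file with: sys.exit(main(sys.argv))
```

With that last line added, a failed fsync or close turns into a non-zero exit status of the pipeline's last command, which the calling shell script can test.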

Please note that the problem discussed above is in no way exceptional and does not stem directly from the characteristics of MooseFS itself. It may affect any file system; network file systems are simply more prone to such difficulties. Technically speaking, the above recommendations should be followed at all times (also when classic file systems are used).

 

Does MooseFS have file size limit?

 

Currently MooseFS imposes a maximum file size limit of 2 TiB (2,199,023,255,552 bytes). However we are considering removing this limitation in the near future, at which point the maximum file size will reach the limits of the operating system, which is currently 16EiB (18,446,744,073,709,551,616 bytes).

 

How do I set up the CGI monitor?

 

If you installed the system in the normal way using "make install", you should be able to just run "mfscgiserv" on the master machine:

/usr/local/sbin/mfscgiserv

You can also specify the HTTP port number to use:

/usr/local/sbin/mfscgiserv -P 9425

And then navigate your browser to the URL of the master server and this port, such as:

http://MASTER_IP:9425/

 

Can I set up HTTP basic authentication for the mfscgiserv?

 

Mfscgiserv is a very simple HTTP server written just to run the MooseFS CGI scripts. It does not support any additional features like HTTP authentication. However, the MooseFS CGI scripts may be served from another full-featured HTTP server with CGI support, such as lighttpd or Apache. When using a full-featured HTTP server such as Apache, you may also take advantage of features offered by other modules, such as HTTPS transport. Just place the CGI and its data files (index.html, mfs.cgi, chart.cgi, mfs.css, logomini.png, err.gif) under the chosen DocumentRoot. If you already have an HTTP server instance on a given host, you may optionally create a virtual host to allow access to the MooseFS CGI monitor through a different hostname or port.

Can I run a mail server application on MooseFS? Mail server is a very busy application with a large number of small files - will I not lose any files?

 
You can run a mail server on MooseFS. You won’t lose any files under a large system load. When the file system is busy, it will block until its operations complete, which will just cause the mail server to slow down. 
 

Are there any suggestions for the network, MTU or bandwidth?

 
We recommend using jumbo frames (MTU=9000). With a larger number of chunkservers, switches should be connected through optical fiber or use aggregated links. The network should be built upon 1 Gb Ethernet. It is unlikely that there would be any performance benefit from using 10 Gb Ethernet, as the inherent latencies of the hard drives [on the chunkservers] would become the bottleneck of the entire system.

Some users noticed problems when the GSO/TSO (Generic Segmentation Offload / TCP Segmentation Offload) driver setting was switched off on some network cards (e.g. Broadcom). To switch it on, use "ethtool -K DEVNAME gso on", where DEVNAME is the name of the network interface.

 

Does MooseFS support supplementary groups?

 

FUSE has supported supplementary groups since version 2.8.0, but unfortunately this has only been implemented on the Linux platform, and not in an efficient way, according to the FUSE release notes:

Add fuse_getgroups (high level lib) and fuse_req_getgroups (low level lib) functions to query the supplementary group IDs for the current request. Currently this is implemented on Linux by reading from the /proc filesystem.

The only reasonable option for MooseFS would be to ignore group privileges and make calls to the kernel to test main and supplementary groups (on the kernel side FUSE fully tests privileges). However this facility for testing privileges in the kernel is optional and can be disabled.

So by default MooseFS tests privileges only for main groups, because this is much safer. To change this behaviour you can add the ignoregid option in mfsexports.cfg (see man mfsexports.cfg); the master server will then not test group privileges by itself and will fully depend on the privilege checks done by the kernel in FUSE.

 

Does MooseFS support file locking?

 

At the moment file locking works only "locally", where if a file is locked on a given client node (e.g. machine A) its kernel remembers it is locked for any other process on the same node. However another client node (e.g. machine B) doesn't know about it. A “global” locking feature should be introduced in the 1.7 branch.


For now, one workaround could be to use the mechanism of chunk locking for write, which causes the entire chunk containing the file to be locked while one node is writing to the file. Another client would then not be able to write to this chunk and would retry the write every second. In certain edge cases this can in theory lead to starvation, but in practice it shouldn’t.


The only problem would be with simultaneous appending (writing at the end) to the same file by two clients. See also the thread “Append and seek while writing functionality” on the group archive:
https://sourceforge.net/mailarchive/message.php?msg_id=25488203


While chunk locking for write is a safe operation, it is not recommended. It is better for different processes on different machines to write to different files and for some other operation to later combine the data from these files into a single target file (as is typically done in “map-reduce” processing).

 

Is it possible to assign IP addresses to chunk servers via DHCP?

 

Yes, at the moment it is possible to assign (any) IP address to a chunkserver via DHCP.


However, a planned enhancement for a future release is "chunkserver awareness", where the master server remembers the chunkservers that have connected to it and, if any of them is disconnected, is able to report the disconnection.


For chunkserver awareness to work, the master server would need to know the IP addresses of the chunkservers, and these IP addresses should not change between chunkserver restarts. Thus, if you do use DHCP, it is advisable to configure it to always give the same IP address to a chunkserver machine based on its MAC address, i.e. to use “DHCP reservations”.

 

Space on some of my chunkservers is 90% utilized while on others it is only 10%. Why does the rebalancing process take so long?

 

Our experience in a production environment has shown that aggressive replication is not desirable, as it can substantially slow down the whole system. The overall performance of the system is more important than equal utilization of hard drives across all of the chunkservers. By default, replication is configured to be a non-aggressive operation. In our environment it normally takes 3-4 weeks for a new chunkserver to reach a standard hdd utilization. Aggressive replication would make the whole system considerably slower for several days.


Replication speeds can be adjusted on master server startup by setting these two options:
  • CHUNKS_WRITE_REP_LIMIT: how many chunks may be saved in parallel on one chunkserver while replicating (by default 1).
  • CHUNKS_READ_REP_LIMIT: how many chunks may be read in parallel from one chunkserver while replicating (by default 5).


Tuning these in your environment will require experimentation. When adding a new chunkserver, try setting the first option to 5 and the second to 15 and restart the master server. After replication finishes you should restore these settings back to their default values and again restart the master server.

 

I have metalogger(s) running - should I additionally back up the metadata file on the master server?

 

Yes, it is highly advisable to additionally back up the metadata file. This provides a worst-case recovery option in the event that, for some reason, the metalogger data is not usable for restoring the master server (for example, if the metalogger server is also destroyed).


The master server flushes metadata kept in RAM to the metadata.mfs.back binary file every hour on the hour (xx:00). So a good time to copy the metadata file is every hour on the half hour (30 minutes after the dump). This would limit the amount of data loss to about 1.5h of data. Backing up the file can be done using any conventional method of copying the metadata file – cp, scp, rsync, etc.
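
As an illustration, a backup step that can be run at half past every hour might look like the Python sketch below. It is not part of MooseFS: the function name, the timestamped naming scheme and the temporary ".part" suffix are choices made for this example, and the source path in the usage note assumes the default data directory /usr/local/var/mfs.

```python
import shutil
import time
from pathlib import Path


def backup_metadata(src, dest_dir, now=None):
    """Copy the metadata dump to dest_dir under a timestamped name.

    Returns the path of the new copy. The file is first written under a
    temporary name and then renamed, so a half-copied file is never
    mistaken for a complete backup.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M", time.localtime(now))
    target = dest_dir / f"{src.name}.{stamp}"
    tmp = target.with_name(target.name + ".part")
    shutil.copy2(src, tmp)   # copy contents and timestamps
    tmp.replace(target)      # rename within the same directory
    return target
```

A call such as backup_metadata("/usr/local/var/mfs/metadata.mfs.back", "/backup/mfs") could then be scheduled with a cron entry starting with "30 * * * *"; the destination directory here is only a placeholder and should ideally be on a different machine or disk.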


After restoring the system from this backed-up metadata file, the most recently created files will have been lost. Additionally, files that were appended to will have the size they had at the time of the metadata backup, files that were deleted will exist again, and files that were renamed or moved will be back under their previous names (and locations). But you would still have all of the data for the files created in the past X years before the crash occurred.

 

I think one of my disks is slower / damaged. How should I find it?

 

In the CGI monitor go to the “Disks” tab and choose “switch to hour” in “I/O stats” column and sort the results by “write” in “max time” column. Now look for disks which have a significantly larger write time. You can also sort by the “fsync” column and look at the results. Maximum times should not exceed 2 seconds (2 million microseconds). It is a good idea to find individual disks that are operating slower as they may be a bottleneck to the system.


It might be helpful to create a test operation that continuously copies some data, to create enough load on the system for there to be observable statistics in the CGI monitor. On the “Disks” tab specify units of “minutes” instead of hours for the “I/O stats” column.


Once a disk that needs replacing has been found, follow the usual procedure: mark it for removal (see How to mark a disk for removal?) and wait until the color in the CGI monitor changes to indicate that all of the chunks stored on it have been replicated onto other chunkservers to reach their goal.

 

How can I find the master server PID?

 

In the directory where all metadata and changelogs are located (by default /usr/local/var/mfs) there is also a .master.lock file. The PID of the process which keeps a lock on this file is the PID of the master server. So the PID can be obtained by using the fcntl function:


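#include <fcntl.h>
#include <unistd.h>
pid_t master_pid(int fd) { // fd: open descriptor of the .master.lock file
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET }; // probe the whole file (l_start = l_len = 0)
	if (fcntl(fd,F_GETLK,&fl)<0) { // get lock owner
		return -1; // getting lock error
	}
	if (fl.l_type!=F_UNLCK) {  // found lock
		return fl.l_pid; // return lock owner
	}
	return -1; // nobody holds a lock on the file
}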

 

The web interface shows there are some copies of chunks with goal 0. What does it mean?

 

This is a way to mark chunks belonging to non-existing (i.e. deleted) files. Deleting a file is done asynchronously in MooseFS. First the file is removed from the metadata and its chunks are marked as unnecessary (goal=0). Later the chunks are removed during "idle" time. This is much more efficient than erasing everything at the moment the file is deleted.


Unnecessary chunks may also appear after a recovery of the master server if they were created shortly before the failure and were not available in the restored metadata file.


Another case is when the master tries to delete a chunk and the operation returns an error. In this scenario if the master deletes information about it from metadata before the error occurs there is no guarantee that the chunk has actually been deleted, or if it still resides on the disk on the chunkserver.


Such chunks will be discovered by the master server and, unless the administrator does something with them, they will eventually be deleted after a week.

 

Is every error message reported by mfsmount a serious problem?

 

No. Mfsmount writes every failure encountered during communication with chunkservers to the syslog. Transient communication problems with the network might cause IO errors to be displayed, but this does not mean data loss or that mfsmount will return an error code to the application. Each operation is retried by the client (mfsmount) several times and only after the number of failures (reported as try counter) reaches a certain limit (typically 30) is the error returned to the application that data was not read/saved.


Of course it is important to monitor these messages. When messages appear more often from one chunkserver than from the others it may mean there are issues with this chunkserver - maybe hard drive is broken, maybe network card has some problems - check its charts, hard disk operation times, etc. in the CGI monitor.


Note: Chunkserver IP is written in hexadecimal format.

 

What does 'file: NNN, index: NNN, chunk: NNN, version: NNN - writeworker: connection with (XXXXXXXX:PPPP) was timed out (unfinished writes: Y; try counter: Z)' message mean?

 

This means that the Zth attempt to write the chunk was not successful and that the writing of Y blocks sent to the chunkserver was not confirmed. After reconnecting, these blocks will be sent again for saving. The limit of attempts is set to 30 by default.


This message is for informational purposes and doesn't mean data loss.

 

What does 'file: NNN, index: NNN, chunk: NNN, version: NNN, cs: XXXXXXXX:PPPP - readblock error (try counter: Z)' message mean?

 

This means that the Zth attempt to read the chunk was not successful and the system will try to read the block again. If the value of Z equals 1, it is a transitory problem and you should not worry about it. The limit of attempts is set to 30 by default.


XXXXXXXX is a chunkserver IP written in hexadecimal format. This can be helpful for determining if there is an issue with a chunkserver. When messages appear more often from one chunkserver than from the others it may mean there are issues with this chunkserver – check its charts, hard disk operation times, etc. in the CGI monitor.
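
To turn the hexadecimal address from such a message into the usual dotted form, something like the following Python sketch can be used. It assumes that the eight hex digits are the IPv4 address with the most significant byte first and that the port after the colon is decimal; the function name is illustrative.

```python
import ipaddress


def decode_chunkserver_addr(addr: str):
    """Turn 'XXXXXXXX:PPPP' from a log message into ('a.b.c.d', port).

    Assumption: the hex digits are the IPv4 address, most significant
    byte first, and the port is written in decimal.
    """
    hex_ip, _, port = addr.partition(":")
    ip = ipaddress.IPv4Address(int(hex_ip, 16))
    return str(ip), int(port)


if __name__ == "__main__":
    # made-up sample value, not taken from a real log
    print(decode_chunkserver_addr("C0A80001:9422"))
```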

 

How do I verify that the MooseFS cluster is online? What happens with mfsmount when the master server goes down?

 

When the master server goes down while mfsmount is already running, mfsmount doesn't disconnect the mounted resource, and files waiting to be saved stay in the queue for quite a long time while mfsmount tries to reconnect to the master server. After a specified number of tries, the operations eventually return EIO - "input/output error". On the other hand, it is not possible to start mfsmount when the master server is offline.


There are several ways to make sure that the master server is online; we present a few of these below.

  1. Check if you can connect to the TCP port of the master server (e.g. socket connection test).
  2. In order to assure that a MooseFS resource is mounted it is enough to check the inode number – MooseFS root will always have inode equal to 1. For example if we have MooseFS installation in /mnt/mfs then stat /mnt/mfs command (in Linux) will show:
    
    $ stat /mnt/mfs
      File: `/mnt/mfs'
      Size: xxxxxx    	Blocks: xxx        IO Block: 4096   directory
    Device: 13h/19d	Inode: 1           Links: xx
    (...)
    
  3. Additionally, mfsmount creates a virtual hidden file .stats in the root of the mounted folder. For example, to get the statistics of mfsmount when MooseFS is mounted we can cat this .stats file, e.g.:
    
    $ cat /mnt/mfs/.stats
    fuse_ops.statfs: 241
    fuse_ops.access: 0
    fuse_ops.lookup-cached: 707553
    fuse_ops.lookup: 603335
    fuse_ops.getattr-cached: 24927
    fuse_ops.getattr: 687750
    fuse_ops.setattr: 24018
    fuse_ops.mknod: 0
    fuse_ops.unlink: 23083
    fuse_ops.mkdir: 4
    fuse_ops.rmdir: 1
    fuse_ops.symlink: 3
    fuse_ops.readlink: 454
    fuse_ops.rename: 269
    (...)
    
  4. If you want to be sure that the master server responds properly, try to read the goal of any object, for example of the root folder:
    
    $ mfsgetgoal /mnt/mfs
    /mnt/mfs: 2
    

    If you get a proper goal of the root folder you can be sure that the master server is up and running.
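
Checks 1 and 2 from the list above are easy to automate, for example in a monitoring script. The Python sketch below shows one possible way; the function names are made up for this example, and the host name and port you pass in depend on your installation.

```python
import os
import socket


def master_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Check 1: can a TCP connection to the master server be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_mfs_root(mount_point: str) -> bool:
    """Check 2: a mounted MooseFS root always has inode number 1."""
    try:
        return os.stat(mount_point).st_ino == 1
    except OSError:
        return False
```

For example, master_port_open("mfsmaster", 9421) and is_mfs_root("/mnt/mfs") could be combined into a single health check. The host name "mfsmaster" and port 9421 are assumptions here; use the values from your own configuration. Note that an open port only shows that something is listening, so check 4 remains the most reliable test.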