CSYNC User Guide

1. Introduction

It is often the case that we have multiple copies (called replicas) of a filesystem or part of a filesystem (for example on a notebook and desktop computer). Changes to each replica are often made independently, and as a result, they do not contain the same information. In that case, a file synchronizer is used to make them consistent again, without losing any information.

The goal is to detect conflicting updates (files which have been modified) and propagate non-conflicting updates to each replica. If there are no conflicts left, we are done, and the replicas are identical. Several algorithms are available to resolve or handle conflicts. They are discussed later in this document.

2. Basics

This section describes some basics of file synchronization.

2.1. Paths

A path normally refers to a location which contains the set of files to be synchronized. For a local replica it is specified relative to the root of the replica, but it has to be absolute if you use a protocol. A path is just a sequence of names separated by /.

The path separator is always a forward slash /, even on Windows.

csync always uses the absolute path on remote replicas. For sftp, this could be sftp://gladiac:secret@myserver/home/gladiac.
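To see which parts such a remote path consists of, the following Python sketch splits the example URI with the standard library. It is only an illustration and makes no claim about how csync parses URIs internally; the function name split_replica_uri is hypothetical.

```python
from urllib.parse import urlsplit

def split_replica_uri(uri):
    """Split a remote replica URI into its components (illustrative helper)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # protocol, e.g. sftp or smb
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,          # None if no port was given
        "path": parts.path,          # absolute path on the remote replica
    }

info = split_replica_uri("sftp://gladiac:secret@myserver/home/gladiac")
```

Note that the path component starts with a forward slash, which matches the requirement that remote paths are absolute.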

2.2. What is an update?

The contents of a path could be a file, a directory or a symbolic link (symbolic links are not supported yet). To be more precise, if the path refers to:

a regular file: the contents of the file are the byte stream and the metadata of the file.
a directory: the content is the metadata of the directory.
a symbolic link: the content is the named file the link points to.

csync keeps a record of each path which has been successfully synchronized. The path is compared with its record, and if it has changed since the last synchronization, we have an update. The comparison uses the modification time or the change time (the modification time of the metadata). This is how updates are detected.

2.3. What is a conflict?

A path is conflicting if it fulfills the following conditions:

it has been updated in one replica,
it or any of its descendants has been updated on the other replica too, and
its contents in the two replicas are not identical.
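The three conditions above can be sketched as a small predicate. This is a minimal Python illustration, not csync code: the sets of updated paths and the same_content callback are assumptions made for the example.

```python
def is_conflict(path, updated_a, updated_b, same_content):
    """Return True if `path` is conflicting between replica A and replica B.

    updated_a, updated_b: sets of paths updated in each replica (assumed input).
    same_content: callable(path) -> bool, True if both replicas hold
                  identical content for the path (assumed helper).
    """
    def touched(updates):
        # True if the path itself or any of its descendants was updated.
        prefix = path.rstrip("/") + "/"
        return any(p == path or p.startswith(prefix) for p in updates)

    # Updated in one replica, and it or a descendant updated in the other.
    changed_both = (
        (path in updated_a and touched(updated_b))
        or (path in updated_b and touched(updated_a))
    )
    return changed_both and not same_content(path)
```

For example, is_conflict("docs", {"docs"}, {"docs/a.txt"}, lambda p: False) reports a conflict, because the directory was updated in one replica and one of its descendants in the other.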

3. File Synchronization

The primary goal of the file synchronizer is correctness. It may change scattered or large parts of the filesystem. Since this is mostly not monitored by the user, and the file synchronizer is in a position to harm the system, csync must be safe, even in the case of unexpected errors (e.g. disk full). What was done to make csync safe is described in the following sections.

One problem concerning correctness is the handling of conflicts. Each file synchronizer tries to propagate conflicting changes to the other replica. At the end both replicas should be identical. There are different strategies to fulfill these goals.

csync is a three-phase file synchronizer. This design was chosen so that user interaction is possible and the process is easy to understand. The three phases are update detection, reconciliation and propagation. These will be described in the following sections.

3.1. Update detection

There are different strategies for update detection. csync uses a state-based modtime-inode update detector. This means it uses the modification time to detect updates. It doesn’t require many resources. A record of each file is stored in a database (called statedb) and compared with the current modification time during update detection. If the file has changed since the last synchronization an instruction is set to evaluate it during the reconciliation phase. If we don’t have a record for a file we investigate, it is marked as new.

It can be difficult to detect renaming of files. This problem is also solved by the record we store in the statedb. If we don’t find the file by the name in the database, we search for the inode number. If the inode number is found then the file has been renamed.
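The lookup order described above (first by name, then by inode number) can be sketched in Python as follows. The Record class and the dictionary standing in for the statedb are assumptions of this example and do not reflect the real database schema.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One entry of the (simplified) state database."""
    path: str
    inode: int
    mtime: float

def detect(path, inode, mtime, statedb):
    """Classify one file against the records of the last synchronization.

    statedb: dict mapping path -> Record (a stand-in for the real statedb).
    Returns "new", "renamed", "updated" or "unchanged".
    """
    record = statedb.get(path)
    if record is None:
        # Not found by name: fall back to a search by inode number.
        if any(r.inode == inode for r in statedb.values()):
            return "renamed"
        return "new"
    if mtime != record.mtime:
        # Modification time differs from the stored record: an update.
        return "updated"
    return "unchanged"
```

In practice the inode number and modification time of a local file would come from a stat call, for example the st_ino and st_mtime fields returned by os.stat in Python.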

3.2. Reconciliation

The most important component is the update detector, because the reconciler depends on it. The correctness of the reconciler is mandatory because it can damage a filesystem. It decides which file:

Stays untouched
Has a conflict
Gets synchronized
Is deleted

A wrong decision of the reconciler leads in most cases to a loss of data. So there are several conditions which a file synchronizer has to follow.

3.2.1. Algorithms

For conflict resolution several different algorithms could be implemented. The most common algorithms are the merge and the conflict algorithm. The first is a batch algorithm and the second is one which needs user interaction.

Merge algorithm

The merge algorithm is an algorithm which doesn’t need any user interaction. It is simple and used for example by Microsoft for Roaming Profiles. If it detects a conflict (the same file changed on both replicas) then it will use the most recent file and overwrite the other. This means you can lose some data, but normally you want the latest file.
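The rule "the most recent file wins" amounts to a single comparison. The following Python sketch shows it under the assumption that both modification times are available as numbers; the function name and the tie-breaking rule are choices of this example, not taken from csync.

```python
def merge_resolve(mtime_a, mtime_b):
    """Pick the winning replica for a conflicting file.

    Returns "a" or "b"; the copy in the other replica would be overwritten.
    On equal timestamps this sketch arbitrarily prefers replica "a".
    """
    return "a" if mtime_a >= mtime_b else "b"
```

The data loss mentioned above follows directly from this rule: whatever was changed in the older copy is discarded when it is overwritten.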

Conflict algorithm

This is not implemented yet.

If a file has a conflict the user has to decide which file should be used.

3.3. Propagation

The next component of the file synchronizer is the propagator. It applies the calculated records to the current replica.

The propagator uses a two-phase-commit mechanism to simulate an atomic filesystem operation.

In the first phase we copy the file to a temporary file on the opposite replica. This has the advantage that we can check whether the file has been transferred successfully. If the connection gets interrupted during the transfer, we still have the original state of the file. This means no data will be lost.

In the second phase the file on the opposite replica will be overwritten by the temporary file.
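For two local paths, the two phases can be sketched in Python as shown here. This is a simplified illustration under stated assumptions: it only handles local files, the temporary file prefix is made up, and the verification step compares file sizes only. Remote replicas would go through the protocol plugins instead.

```python
import os
import shutil
import tempfile

def propagate(src, dst):
    """Copy src over dst without ever leaving dst half-written."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Phase 1: copy into a temporary file next to the destination.
    # The prefix is a hypothetical name chosen for this sketch.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".csync_tmp_")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        # Verify the transfer; here only the size is compared.
        if os.path.getsize(tmp) != os.path.getsize(src):
            raise IOError("transfer incomplete")
        # Phase 2: replace the destination with the verified copy.
        os.replace(tmp, dst)
    except BaseException:
        # On any failure the destination is untouched; drop the temp file.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

The temporary file is created in the same directory as the destination so that the final rename stays within one filesystem, where it can replace the old file in a single step.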

After a successful propagation we have to merge the trees to reflect the current state of the filesystem tree. This updated tree will be written as a journal into the state database. It will be used during the update detection of the next synchronization. See above for a description of the state database during synchronization.

3.4. Robustness

This is a very important topic. The file synchronizer should not crash, and if it has crashed, there should be no loss of data. To achieve this goal there are several mechanisms which will be discussed in the following sections.

3.4.1. Crash resistance

The synchronization process can be interrupted by different events:

the system could be halted due to errors.
the disk could be full or the quota exceeded.
the network or power cable could be pulled out.
the user could force a stop of the synchronization process.
various communication errors could occur.

To ensure that no data is lost due to such an event, we enforce the following invariant:

At every moment of the synchronization, each file has either its original content or its correct final content.

This means that the original content cannot be incorrect: no data can be lost until we overwrite it after a successful synchronization. Therefore, each interrupted synchronization process is a partial sync and can be continued and completed by simply running csync again. The only problem could be an error of the filesystem, so we reach this invariant only approximately.

3.4.2. Transfer errors

With the Two-Phase-Commit we check the file size after the file has been transferred, so we are able to detect transfer errors. A more robust approach would be a transfer protocol with checksums, but this is not feasible at the moment. We may add this in the future.

Future filesystems, like btrfs, will help to compare checksums instead of the filesize. This will make the synchronization safer. This does not imply that it is unsafe now, but checksums are safer than simple filesize checks.

3.4.3. Database loss

It is possible that the state database could get corrupted. If this happens, all files get evaluated. In this case the file synchronizer won’t delete any file, but it could occur that deleted files will be restored from the other replica.
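Why a lost database leads to restored files rather than deleted ones can be seen in a simplified decision rule. The following Python sketch is an assumption about the general idea, not the actual reconciler: it only looks at a path that exists in exactly one replica and ignores whether the remaining copy was modified.

```python
def decide(path, present_a, present_b, statedb):
    """Decide the action for a path that exists in exactly one replica.

    statedb: set of paths recorded at the last successful synchronization
             (a stand-in for the real state database).
    """
    if present_a == present_b:
        raise ValueError("path must exist in exactly one replica")
    have, miss = ("A", "B") if present_a else ("B", "A")
    if path in statedb:
        # Known from the last run and now missing on one side:
        # it was deleted there, so the deletion is propagated.
        return f"delete from {have}"
    # No record: indistinguishable from a new file, so it is copied.
    return f"copy from {have} to {miss}"
```

With an empty statedb the first branch is never taken, so nothing is deleted, and a file that was deleted on one replica is copied back from the other.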

To prevent corruption or loss of the database when an error occurs or the user forces an abort, the synchronizer works on a copy of the database and uses a Two-Phase-Commit to save it at the end.

4. Getting started

4.1. Installing csync

See the README and INSTALL files for install prerequisites and procedures. Packagers should take a look at Appendix A: Packager Notes.

4.2. Using the commandline client

The synopsis of the commandline client is

csync [OPTION...] SOURCE DESTINATION

It synchronizes the content of SOURCE with DESTINATION and vice versa. The DESTINATION can be a local directory or a remote file server.

csync /home/csync scheme://user:password@server:port/full/path

4.2.1. Examples

To synchronize two local directories:

csync /home/csync/replica1 /home/csync/replica2

To synchronize a local directory with an smb server, use:

csync /home/csync smb://rupert.galaxy.site/Users/csync

If you use kerberos, you don’t have to specify a username or a password. If you don’t use kerberos, the commandline client will prompt for the username and the password. If you don’t want to be prompted, you can specify them on the commandline:

csync /home/csync smb://csync:secret@rupert.galaxy.site/Users/csync

If you use the sftp protocol and want to specify a port, you do it the following way:

csync /home/csync sftp://csync@krikkit.galaxy.site:2222/home/csync

The remote destination is supported by plugins. By default csync ships with smb and sftp support. For more information, see the manpage of csync(1).

4.3. Exclude lists

csync provides exclude lists with simple shell wildcard patterns. There is a global exclude list, which is normally located in /etc/csync/csync_exclude.conf and already contains some sane defaults. If you run csync for the first time, it will create an empty exclude list for the user. This file will be ~/.csync/csync_exclude.conf. When you run csync, both files are used, plus an additional one if you specify it.

The entries in the file are newline separated. Use /etc/csync/csync_exclude.conf as an example.
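How newline-separated shell wildcard patterns behave can be tried out with Python’s fnmatch module. This is only an approximation for experimenting with patterns: the exact matching rules of csync may differ, and the patterns used here are made-up examples rather than the shipped defaults.

```python
from fnmatch import fnmatchcase

def load_patterns(text):
    """Parse exclude-list text: one pattern per line, blank lines skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def is_excluded(path, patterns):
    """Return True if the path or its file name matches any pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(
        fnmatchcase(name, pat) or fnmatchcase(path, pat) for pat in patterns
    )

# Hypothetical exclude list with two patterns and one blank line.
patterns = load_patterns("*~\n\n*.tmp\n")
```

With these patterns, is_excluded("docs/notes.txt~", patterns) is true, while "docs/notes.txt" is not excluded.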

4.4. Debug messages and dry run

For log messages csync uses log4c. It is a logging mechanism which uses debug levels and categories. There is a config file where you can specify the debug level for each component. It is located at ~/.csync/csync_log.conf.

Available debug priorities are:

trace
debug
info
warn
error
fatal
none

A more detailed description can be found at the log4c homepage. A good introduction can be found here.

To simulate a run of the file synchronizer, you should set the priority to debug for the categories csync.updater and csync.reconciler in the config file ~/.csync/csync_log.conf. Then run csync with the --dry-run option. This will only run update detection and reconciliation.

4.5. The PAM module

pam_csync is a PAM module to provide roaming home directories for a user session. This module is aimed at environments with central file servers where users wish to store their home directories. The Authentication Module verifies the identity of a user and triggers a synchronization with the server on the first login and the last logout. More information can be found in the manpage of the module, pam_csync(8), or of PAM itself, pam(8).