Transferring files to and from the clusters

Copying a file or directory

The simplest way to copy a file to or from a cluster is to use the scp command. Provided your SSH client configuration file is correct and you are using an SSH agent, as explained in the Connecting from a UNIX/Linux or MacOS computer section, copying a file to a cluster (Hmem in this example) is as simple as:

 scp ./file.txt hmem:destination/path/

Copying it back is done with

 scp hmem:path/to/file.txt .

If you want to copy a directory and its content, use the -r option, just like with cp.
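
As a minimal sketch of the -r option, the helper below wraps a recursive copy. The name copy_dir is hypothetical and only used for illustration; hmem is assumed to be the host alias defined in your SSH client configuration.

```shell
# copy_dir is a hypothetical helper name, not part of scp.
copy_dir() {
    src_dir=$1       # directory to copy, e.g. ./source_dir
    destination=$2   # target, e.g. hmem:destination/path/
    # -r copies the directory and everything below it, like cp -r
    scp -r "$src_dir" "$destination"
}
```

It would be called as copy_dir ./source_dir hmem:destination/path/ to send a directory, or as copy_dir hmem:path/to/dir . to fetch one.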

If your SSH client is configured to forward your SSH agent (see above), and your client configuration is copied to the clusters, then copying from one cluster (e.g. Hmem) to another cluster (e.g. Lemaitre2) is as simple as:

 scp hmem:path/to/file.txt lemaitre2:/path/

With older OpenSSH clients, the above command initiates a connection from Hmem to Lemaitre2 directly, whereas recent OpenSSH releases route the traffic through your computer by default. If, because of a firewall, the source and destination cannot see each other (which should not be the case with the CÉCI clusters) but your laptop can see both, you can use the -3 option to explicitly route the traffic through your laptop.

Transferring a large number of small files

Transferring a lot of small files takes a very long time with scp because of the overhead of copying every file individually. In such a case, using the tar command reduces the transfer time significantly. You can first create a tar archive, then scp it as a single file, and then extract it on the other side. But the most efficient way is to do all three operations in one go, without creating an intermediate file, like this:

 tar czf - ./source_dir | ssh hmem 'tar xzvf - -C destination/path'

This packs the small files into a single compressed stream that is unpacked on the fly at the destination, which removes the overhead of dealing with many small files.
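
The one-go pipeline can be sketched as a small helper. The names send_dir and REMOTE_RUN are hypothetical and introduced only for illustration: REMOTE_RUN defaults to "ssh hmem" and can be set to "sh -c" to try the pipeline on a single machine before using it on a cluster. The destination directory is assumed to exist already.

```shell
# send_dir and REMOTE_RUN are hypothetical names, not part of tar or ssh.
send_dir() {
    src_dir=$1     # directory to send, e.g. ./source_dir
    dest_dir=$2    # existing directory on the receiving side
    # Sending side: write a gzip-compressed archive to standard output (-f -),
    # storing paths relative to the parent of src_dir.
    # Receiving side: read the archive from standard input and extract it
    # into dest_dir (use xzvf instead of xzf to list the files).
    tar czf - -C "$(dirname "$src_dir")" "$(basename "$src_dir")" |
        ${REMOTE_RUN:-ssh hmem} "tar xzf - -C '$dest_dir'"
}
```

For instance, send_dir ./source_dir destination/path would recreate source_dir under destination/path on Hmem.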

Transferring large files

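When transferring large files, it is often worthwhile to use the -C option of scp, which compresses the data on the fly during the transfer. This helps most when the file compresses well (plain text, for instance) and the network is the bottleneck. Use it simply with: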

 scp -C ./large_file.txt hmem:destination/path/

Resuming interrupted transfers

If, for any reason, a transfer is interrupted, you might end up with only part of the files transferred. Rather than restarting the transfer from scratch, you should then use the rsync command. The rsync command compares the source and destination directories and only transfers what needs to be transferred: missing files, modified files, etc.

Use it this way (assuming again that your SSH client is properly configured):

 rsync -va ./source_dir hmem:destination/path

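Pay attention to trailing slashes in the source path: ./source_dir copies the directory itself into destination/path, whereas ./source_dir/ copies only its contents. If you do not use the same form as in the interrupted transfer, you might end up with a full copy of the directory inside the existing, partial, one. Use the -n (dry-run) option of rsync to check what will happen before you run the actual command.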

If one large file is left half-transferred, you can resume it using the --partial option of rsync (or -P, which is equivalent to --partial --progress).
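
The steps above can be sketched as two small helpers: a dry run first, then the actual transfer. The names preview_sync and resume_sync are hypothetical and only used for illustration; the options themselves (-v, -a, -n, --partial) are standard rsync options.

```shell
# preview_sync and resume_sync are hypothetical helper names.

# Dry run: list what would be transferred without changing anything.
preview_sync() {
    rsync -van --partial "$1" "$2"
}

# Actual run: transfer what is missing or modified; --partial keeps
# half-transferred files so that a later run can complete them.
resume_sync() {
    rsync -va --partial "$1" "$2"
}
```

A typical sequence would be preview_sync ./source_dir hmem:destination/path, a check of the listed files, and then resume_sync with the same two arguments.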

Transferring code

Source code is a specific type of data and should be treated as such. The best way to transfer code from one computer to another is to host the code in a source code repository, using a version control system such as Git (more common) or Mercurial (easier to use), and to clone the repository on the cluster.
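
With Git, this workflow could look like the sketch below, meant to be run on the cluster. The helper name get_code is hypothetical, and the repository URL is whatever your hosting service provides; it clones the repository the first time and updates the existing clone afterwards.

```shell
# get_code is a hypothetical helper name, not a Git command.
get_code() {
    repo_url=$1      # URL or path of the repository
    target_dir=$2    # where the working copy lives on the cluster
    if [ -d "$target_dir/.git" ]; then
        # Already cloned: fetch and fast-forward to the latest commits.
        git -C "$target_dir" pull --ff-only
    else
        # First use: create the working copy.
        git clone "$repo_url" "$target_dir"
    fi
}
```

The --ff-only option makes the update fail rather than merge if the copy on the cluster has diverged, which is a cautious default when the cluster copy is only meant to follow the repository.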

Synchronising with a local directory

If you want to keep two directories (one on your laptop, and one on the cluster) in sync, you can do that with rsync using its --delete option. But that is only one-way, so you need to think carefully about the direction in which you run it, and it does not scale beyond two synchronized directories.
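
A one-way mirror with --delete might be sketched as follows. The name mirror_dir is hypothetical; any extra arguments (such as -n for a dry run) are passed on to rsync. Because --delete removes files at the destination that no longer exist in the source, a dry run beforehand is strongly advisable.

```shell
# mirror_dir is a hypothetical helper name.
# It makes DEST an exact copy of the contents of SRC (one-way only).
mirror_dir() {
    src=${1%/}    # strip a trailing slash, if any
    dest=$2
    shift 2       # remaining arguments are extra rsync options, e.g. -n
    # The trailing slash on the source means "the contents of src".
    rsync -a --delete "$@" "$src/" "$dest"
}
```

For example, mirror_dir ./project hmem:project -n would show what a laptop-to-cluster mirror would change, and the same command without -n would apply it.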

A more robust option is to use Unison, a piece of software that can detect and handle conflicts (incompatible changes made to the same file in the two directories that must be kept in sync).

© CÉCI.