Working with remote servers

With the advent of cloud computing, much of our work may be done on machines that we don’t have physical access to (in fact, they may be virtual machines running on a shared server). Services like AWS make it cheap to rent a large number of cloud servers for data processing work. These services also offer bulk online storage, which can hold data sets that don’t fit on a personal laptop.

The main tool we use to work on remote servers is ssh (secure shell). It opens an encrypted connection to a remote server, over which you can execute commands as if you were sitting at that computer.
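As a minimal sketch of what this looks like in practice, the snippet below wraps the two most common uses of ssh in small shell functions: opening an interactive shell, and running a single command remotely. The account alice, the host server.example.com, and the function names are assumptions made for illustration; substitute your own.

```shell
# Hypothetical account and host; substitute your own.
REMOTE_HOST="alice@server.example.com"

# Open an interactive shell on the remote machine.
# Equivalent to typing: ssh alice@server.example.com
remote_shell() {
    ssh "$REMOTE_HOST"
}

# Run a single command on the remote machine and print its output locally.
# Equivalent to typing: ssh alice@server.example.com uptime
remote_run() {
    ssh "$REMOTE_HOST" "$@"
}
```

With these defined, remote_run uptime would print the remote machine's uptime in your local terminal, while remote_shell leaves you at a prompt on the server until you type exit.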

Long-running jobs

While establishing an ssh connection is easy, a long-running job (say, an analysis of a large data set) will be terminated if your connection is interrupted before the job is done. To avoid this, we can use a program such as tmux. tmux establishes a session on the remote machine that survives network disconnects, both intentional and accidental. A session is a collection of virtual terminal windows with processes running in them. When you log into a machine with an existing tmux session, you can run tmux attach-session to reconnect to it. This allows you to start a long-running job, disconnect, and reconnect later to check its progress. It also lets you leave a programming environment or editing session in progress.
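The workflow described above might be sketched as follows. These functions are meant to be run on the remote machine, after logging in with ssh. The session name analysis and the function names are hypothetical, chosen only for this example.

```shell
# Hypothetical session name; pick anything memorable.
SESSION="analysis"

# Start a new named session and attach to it.
# Inside the session, launch your job as usual, then press
# Ctrl-b followed by d (the default key binding) to detach.
start_session() {
    tmux new-session -s "$SESSION"
}

# Alternatively, start the job in a detached session right away.
# Pass the command as a single string, e.g. start_job './run-analysis.sh'
# (run-analysis.sh is a made-up script name).
start_job() {
    tmux new-session -d -s "$SESSION" "$1"
}

# Show the sessions that currently exist on this machine.
list_sessions() {
    tmux list-sessions
}

# After logging back in, reconnect to the session to check progress.
resume_session() {
    tmux attach-session -t "$SESSION"
}
```

Naming the session is optional, but it makes reattaching unambiguous when more than one session exists on the machine.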

Getting files to and from the server

The simplest tool for moving files to and from the server is scp, the secure copy command. For example, scp local-file server.com:files/new-file copies local-file to the server and names it new-file in the files directory (relative to your home directory on the server). Note the colon separating the host name from the remote path; without it, scp treats the destination as a local path. Another option is to sync files using Git (pushes and pulls), perhaps through an intermediary server such as Bitbucket.org or GitHub.com.
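To make the direction of each copy explicit, here is a small sketch with one function per direction. scp expects a remote location in the form host:path, with a colon between the two. The account alice, the host server.example.com, and the function names are assumptions made for illustration.

```shell
# Hypothetical account and host; substitute your own.
REMOTE_HOST="alice@server.example.com"

# Copy a local file to the server.
# usage: upload LOCAL_PATH REMOTE_PATH
upload() {
    scp "$1" "${REMOTE_HOST}:$2"
}

# Copy a file from the server to the local machine.
# usage: download REMOTE_PATH LOCAL_PATH
download() {
    scp "${REMOTE_HOST}:$1" "$2"
}

# Copy a whole local directory to the server (-r means recursive).
# usage: upload_dir LOCAL_DIR REMOTE_PATH
upload_dir() {
    scp -r "$1" "${REMOTE_HOST}:$2"
}
```

For instance, upload local-file files/new-file would place the file in the files directory on the server under the name new-file, and download files/new-file . would bring it back into the current local directory.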