What is an FTP cluster?

To make it short, FTP cluster does not mean a setup where several (more or less identical) FTP servers run in a high-availability failover installation.

What I call an FTP cluster is the following: Say you have 5 Linux boxes. Every machine has an unused 20GB (or, better, larger) hard disk drive. The FTP cluster integrates the individual FTP servers running on each machine into a single virtual FTP server. The user doesn't really notice that the server he's using is distributed; he simply sees a 100GB FTP server. If you decide to add another 80GB drive to the cluster you'll get a 180GB FTP server.

In other words: we are talking about a network hard disk stripe that uses FTP as its access protocol.

Term definition

Before we continue we should define two terms:

Cluster Server
The term cluster server stands for the server that "creates the cluster", the server the user connects to.
Cluster Node
The term cluster node refers to a machine that stores the files. The nodes are the machines that "export" their unused hard drives.

While we are on the basics: the cluster nodes run standard FTP server software; only the cluster server runs special server software.

How can this work?

Well, the basic idea is that the cluster server doesn't really store the files that users upload to it. Instead, it only knows which cluster node has each file. Whenever a user wants to retrieve a file, the user's FTP client is redirected to the cluster node that has it.
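To make the idea concrete, here is a minimal sketch of the bookkeeping the cluster server needs: a table that maps each file path to the node holding it. The class name ClusterIndex, its methods, and the node names are made up for this illustration; the real cluster server may keep this information quite differently.

```python
class ClusterIndex:
    """Hypothetical lookup table: which cluster node stores which file."""

    def __init__(self):
        # Maps a file path to the name of the node that stores it.
        self._location = {}

    def record(self, path, node):
        """Remember that `path` was stored on `node`."""
        self._location[path] = node

    def locate(self, path):
        """Return the node holding `path`, or None if the file is unknown."""
        return self._location.get(path)


index = ClusterIndex()
index.record("/pub/readme.txt", "node3")
print(index.locate("/pub/readme.txt"))  # node3
```

Note that the cluster server only stores this small piece of metadata per file; the file contents stay on the nodes.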

Notice another "core feature" of the cluster server: the user receives the data directly from the cluster node. If you have 5 nodes up and running, each connected to your network with a 100Mbit/s link, you may get a total cluster network throughput of up to 500Mbit/s. This depends only on how the data is distributed across the nodes.
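The dependence on data distribution can be shown with a deliberately simplified model. It assumes that each node's network link is the only bottleneck and that a node serving any number of downloads saturates exactly one link. The function name and the sample file-to-node table are hypothetical.

```python
def aggregate_throughput(requested_files, location, link_mbit=100):
    """Upper bound on total throughput (Mbit/s) for concurrent downloads.

    Simplified model: every node that serves at least one of the
    requested files contributes one full link; nodes that serve
    nothing contribute nothing.
    """
    busy_nodes = {location[name] for name in requested_files}
    return len(busy_nodes) * link_mbit


# Hypothetical placement of three files on two nodes.
location = {"a.iso": "node1", "b.iso": "node2", "c.iso": "node1"}

print(aggregate_throughput(["a.iso", "b.iso"], location))  # 200
print(aggregate_throughput(["a.iso", "c.iso"], location))  # 100
```

In the second call both files live on the same node, so the two downloads share a single 100Mbit/s link.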

But how does redirection with FTP work?
Are you familiar with the FTP protocol? I don't expect you to be, so I'll try to explain how it works.

FTP uses a control connection for the client/server communication. The user (or the user's program) sends commands to the server through this control connection, and the server sends its responses back the same way. This is the same scheme used by other common TCP/IP application protocols like SMTP, POP3 or HTTP.
But when it comes to file transmission (storing or retrieving), client and server use a second connection, the so-called data channel, for the file transport. How is this done?

Step 1
First of all, the client allocates a listening port on which it is willing to accept a connection. It then sends information about this data port to the server through the control connection, and the server takes note of it.
C --> S: PORT 192,168,5,9,4,5
S --> C: 200 ok
The PORT command above tells the server that the client is listening on port 1029 (4 * 256 + 5) at IP address 192.168.5.9; the response code 200 means that the server accepts this.
Step 2
The client sends the transfer command to the server, the server connects to the client on the port it got in step 1, and the file is transmitted.
C --> S: RETR readme.txt
S --> C: 150 sending file
server connects on client's data port and sends file
Step 3
When client and server are done with the file transport the data connection is closed. The next file transmission will use a different data port.
server closes data connection
S --> C: 226 transfer complete
The important thing here is that when it comes to data transmission, the client becomes a server (it opens the listening port) waiting for an incoming connection. But the data connection doesn't have to come from the server to which the client has the active control connection; it can come from a different server.
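The PORT argument from step 1 packs the IP address and the port number into six decimal numbers: four for the address, then the high and low byte of the port. The two helpers below sketch the conversion in both directions. The function names are my own and are not taken from any FTP library.

```python
def parse_port_argument(argument):
    """Turn a PORT argument such as '192,168,5,9,4,5' into (host, port)."""
    numbers = [int(part) for part in argument.split(",")]
    if len(numbers) != 6 or any(not 0 <= n <= 255 for n in numbers):
        raise ValueError("PORT argument must be six numbers in 0..255")
    host = ".".join(str(n) for n in numbers[:4])
    # The port is sent as high byte, low byte.
    port = numbers[4] * 256 + numbers[5]
    return host, port


def format_port_argument(host, port):
    """Turn (host, port) back into the comma-separated PORT argument."""
    return ",".join(host.split(".") + [str(port // 256), str(port % 256)])


print(parse_port_argument("192,168,5,9,4,5"))     # ('192.168.5.9', 1029)
print(format_port_argument("192.168.5.9", 1029))  # 192,168,5,9,4,5
```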

The basic idea now is as follows: the cluster server accepts the client's PORT command, and when it receives the RETR command it looks up which node has the file. The cluster server then connects to that node (if it is not already connected), sends the client's original PORT command followed by the RETR command, and so initiates the data transfer.
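The forwarding decision can be sketched as a small function. It only works out which node to contact and which commands to send; the actual control connection to the node is left out. The function name plan_retrieve, the plain dictionary used as the file index, and the choice to return None for an unknown file are assumptions made for this example.

```python
def plan_retrieve(file_index, client_port_argument, path):
    """Decide how the cluster server handles a RETR in active mode.

    file_index is a hypothetical dict mapping file paths to node names.
    Returns (node, commands) where commands are the control-connection
    lines to send to that node, or None if no node has the file.
    """
    node = file_index.get(path)
    if node is None:
        return None
    # Forward the client's PORT argument unchanged, so that the node
    # opens the data connection straight to the client.
    commands = [f"PORT {client_port_argument}", f"RETR {path}"]
    return node, commands


file_index = {"readme.txt": "node2"}
print(plan_retrieve(file_index, "192,168,5,9,4,5", "readme.txt"))
# ('node2', ['PORT 192,168,5,9,4,5', 'RETR readme.txt'])
```

Because the PORT argument still names the client's address and port, the file data never passes through the cluster server.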

Passive transfer mode

To be exact, the scheme above is not the only way to transfer data. Client and server can swap roles: the server allocates the listening data port and the client connects to it. Firewall administrators love passive mode, and most FTP clients try passive mode by default. However, passive mode doesn't work well for the FTP cluster. When the client asks for a passive data port, the cluster server doesn't yet know which file the user wants and therefore cannot determine which node will do the actual transfer. The best it can do is allocate the port on its own and act as a proxy (or better, a relay) in the data transmission. This is a workaround, but I don't like it. Passive mode brings the stripe's performance down, and active transfer mode isn't really a problem.
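For illustration, the relay workaround boils down to a copy loop like the one below: the cluster server reads from one data connection and writes everything to the other. This is only a sketch with a made-up function name, not the cluster server's actual code, and it ignores timeouts and error handling.

```python
import socket


def relay(source, destination, chunk_size=65536):
    """Copy bytes from source to destination until source is closed.

    Returns the number of bytes copied. Every byte passes through the
    machine running this loop, which is why relaying costs throughput.
    """
    total = 0
    while True:
        chunk = source.recv(chunk_size)
        if not chunk:  # empty result: the sender closed the connection
            break
        destination.sendall(chunk)
        total += len(chunk)
    return total


# Local demonstration with two connected socket pairs standing in for
# the node-side and client-side data connections.
node_end, relay_in = socket.socketpair()
relay_out, client_end = socket.socketpair()
node_end.sendall(b"file contents")
node_end.close()
copied = relay(relay_in, relay_out)
relay_out.close()
received = client_end.recv(1024)
```

With the relay in place, the total throughput is capped by the cluster server's own network link, no matter how many nodes hold the data.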

Key features

To repeat the key features:

Interested? Then read on.