To make it short: FTP cluster does not mean a setup where several (more or less identical) FTP servers run in a high-availability failover installation.
What I call an FTP cluster is the following: Consider you have 5 Linux boxes. Every machine has an unused 20GB (or, better, larger) hard disk drive. The FTP cluster integrates the individual FTP servers running on each machine into a single virtual FTP server. The user doesn't really notice that the server he's using is distributed; he simply sees a 100GB FTP server. If you decide to add another 80GB drive to the cluster you'll get a 180GB FTP server.
In other words: we are talking about a network hard disk stripe that uses FTP as its access protocol.
Term definition
Before we continue we should define two terms:
How can this work?
Well, the basic idea is that the cluster server doesn't really store the files the users upload to it. Instead it only knows which cluster node has each file. Whenever a user wants to retrieve a file, the user's FTP client is redirected to the cluster node that has the file.
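To make the bookkeeping concrete, here is a minimal sketch of such a lookup table. The class name ClusterIndex, its methods and the node name "node3" are made up for illustration; how the real cluster server stores this information is not described here.

```python
from typing import Dict, Optional


class ClusterIndex:
    """Hypothetical in-memory index: maps a file path to the node storing it."""

    def __init__(self) -> None:
        self._location: Dict[str, str] = {}

    def record(self, path: str, node: str) -> None:
        # Remember which node received the file when the user uploaded it.
        self._location[path] = node

    def locate(self, path: str) -> Optional[str]:
        # Return the node holding the file, or None if the file is unknown.
        return self._location.get(path)


index = ClusterIndex()
index.record("/pub/readme.txt", "node3")  # assumed node name
print(index.locate("/pub/readme.txt"))    # prints: node3
```

The point is only that the cluster server keeps names and locations, never file contents.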
Notice another "core feature" of the cluster server: the user receives the data directly from the cluster node. If you have 5 nodes up and running, each one connected to your network with a 100Mbit/s cable, you may get a total cluster network throughput of 500Mbit/s. How close you get to that figure depends only on how the data is distributed over the nodes.
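A rough way to see why the distribution matters is sketched below. It assumes a deliberately simple model that is not part of the cluster software: every node that is serving at least one file saturates its own link, and nothing else (clients, switches) is a bottleneck. The function name and file names are hypothetical.

```python
from typing import Dict, Iterable


def aggregate_throughput(requested_files: Iterable[str],
                         location: Dict[str, str],
                         link_mbit: int = 100) -> int:
    """Estimate total throughput in Mbit/s for files requested at the same time."""
    # Each distinct node that serves a file contributes one full link.
    busy_nodes = {location[name] for name in requested_files}
    return len(busy_nodes) * link_mbit


location = {"a.iso": "node1", "b.iso": "node2", "c.iso": "node1"}
print(aggregate_throughput(["a.iso", "b.iso"], location))  # prints: 200
print(aggregate_throughput(["a.iso", "c.iso"], location))  # prints: 100
```

Two files on different nodes use two links; two files on the same node share one.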
But how does redirection with FTP work?
Are you familiar with the FTP protocol? I don't assume so, so I'll try
to explain the plot.
FTP uses a control connection for the client/server communication. The
user (or the user's program) sends commands through this control
connection to the server and the server sends its responses back the
same way. This is the same scheme as for other common TCP/IP application
protocols like SMTP, POP3 or HTTP.
But when it comes to file transmission (storing or retrieving), client
and server use a second connection, the so-called data channel, for the
file transport.
How is this done?
    C --> S: PORT 192,168,5,9,4,5
    S --> C: 200 ok
The PORT command above tells the server that the client listens on port 1029 (4 * 256 + 5) at the IP address 192.168.5.9; the response code 200 means that this is ok for the server.
    C --> S: RETR readme.txt
    S --> C: 150 sending file
    (server connects to the client's data port and sends the file)
    (server closes the data connection)
    S --> C: 226 transfer complete
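The arithmetic behind the PORT argument can be sketched in a few lines. The two helper names below are hypothetical; the encoding itself (four address bytes followed by the high and low byte of the port) is the one shown in the example above.

```python
from typing import Tuple


def parse_port_argument(argument: str) -> Tuple[str, int]:
    """Turn 'h1,h2,h3,h4,p1,p2' into an (ip address, port) pair."""
    fields = [int(field) for field in argument.split(",")]
    if len(fields) != 6 or any(not 0 <= field <= 255 for field in fields):
        raise ValueError(f"malformed PORT argument: {argument!r}")
    host = ".".join(str(field) for field in fields[:4])
    port = fields[4] * 256 + fields[5]  # high byte * 256 + low byte
    return host, port


def format_port_argument(host: str, port: int) -> str:
    """Inverse operation: build the PORT argument from an address and a port."""
    return ",".join(host.split(".") + [str(port // 256), str(port % 256)])


print(parse_port_argument("192,168,5,9,4,5"))     # prints: ('192.168.5.9', 1029)
print(format_port_argument("192.168.5.9", 1029))  # prints: 192,168,5,9,4,5
```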
The basic idea now is as follows: the cluster server accepts the client's PORT command. When it receives the RETR command, it looks up which node server has the file, connects to that node (if it has not already done so), and sends the client's original PORT command followed by the RETR command. The node then opens the data connection directly to the client and the transfer starts.
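The following sketch shows only the decision logic of that redirection, not a working FTP server. The class RedirectingSession, its handle method and the node name are assumptions made for this example; the replies sent back to the client (200, 150, 226) and all network code are left out.

```python
from typing import Dict, List, Optional, Tuple


class RedirectingSession:
    """Hypothetical per-client state kept by the cluster server."""

    def __init__(self, file_to_node: Dict[str, str]) -> None:
        self.file_to_node = file_to_node
        self.pending_port: Optional[str] = None

    def handle(self, line: str) -> Tuple[Optional[str], List[str]]:
        """Return (node, commands to forward to that node) for one client command."""
        command, _, argument = line.strip().partition(" ")
        command = command.upper()
        if command == "PORT":
            # Accept PORT but only remember it; no node has been chosen yet.
            self.pending_port = argument
            return None, []
        if command == "RETR":
            node = self.file_to_node.get(argument)
            if node is None or self.pending_port is None:
                return None, []  # unknown file or no PORT seen: nothing to forward
            forward = [f"PORT {self.pending_port}", f"RETR {argument}"]
            self.pending_port = None
            return node, forward
        return None, []


session = RedirectingSession({"readme.txt": "node2"})
print(session.handle("PORT 192,168,5,9,4,5"))  # prints: (None, [])
print(session.handle("RETR readme.txt"))
# prints: ('node2', ['PORT 192,168,5,9,4,5', 'RETR readme.txt'])
```

Because the node receives the client's own address in the forwarded PORT command, it connects to the client itself and the cluster server never touches the file data.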
Passive transfer mode
To be exact, the scheme above is not the only one to retrieve data. Client and server can change roles: the server allocates the listening data port and the client connects to it. Firewall administrators love passive mode, and most FTP clients try passive mode by default. Anyway, passive mode doesn't work well for the FTP cluster. When the data port has to be allocated, the cluster server doesn't yet know which file the user wants and therefore cannot determine which node will do the actual transfer. The best it can do is allocate the port on its own and act as a proxy (or better, a relay) in the data transmission. This is a workaround, but I don't like it. Passive mode brings the stripe's performance down, and active transfer mode isn't really a problem.
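To illustrate why the relay workaround costs performance, here is a minimal sketch of the copy loop such a relay would need. The function name relay and the use of socket.socketpair() as a stand-in for the real node and client connections are assumptions for this example only.

```python
import socket


def relay(source: socket.socket, destination: socket.socket,
          chunk_size: int = 65536) -> int:
    """Copy everything from source to destination; return the number of bytes."""
    total = 0
    while True:
        chunk = source.recv(chunk_size)
        if not chunk:  # the sending side closed its data connection
            break
        destination.sendall(chunk)
        total += len(chunk)
    return total


# Stand-ins for the node-to-relay and relay-to-client data connections.
node_side, relay_in = socket.socketpair()
relay_out, client_side = socket.socketpair()

node_side.sendall(b"file contents")
node_side.close()                  # node signals end of file
print(relay(relay_in, relay_out))  # prints: 13
relay_out.close()
```

Every byte crosses the cluster server's network link twice, once in and once out, so the single relay becomes the bottleneck that the direct node-to-client transfer avoids.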
Key features
To repeat the key features: