Available for: Windows, Linux, macOS Agents. Full sync shares. NFS/SMB/cloud storage. MC API
Not available for: NAS, mobile Agents, MC Agent regardless of its OS. Caching gateways, TSS shares. Script jobs.
Understanding scaleout cluster
System requirements and expected performance
Load balancing and using scaleout in cloud
Scaleout cluster roles
Failover
Using scaleout clusters in jobs
Peculiarities and limitations
Understanding scaleout cluster
Scaleout clusters are designed to maximize transfer speed and scalability by distributing workloads across multiple agents. This architecture is ideal for environments that demand high-performance data movement and fault tolerance. Unlike traditional single-server configurations, a scaleout cluster dynamically assigns roles and optimizes resource utilization to eliminate single points of failure, ensuring seamless operation even if individual nodes go offline.
System requirements and expected performance
Management Console
Ensure that the Management Console meets the general system requirements.
Agents
A scale-out cluster typically includes multiple industry-standard systems (VMs or physical nodes), each running 1–2 Agents depending on available CPU cores. To maximize efficiency, scale up first: run multiple Agents per machine to utilize up to 10–20 Gbps of throughput, then scale out by adding more nodes as needed.
Maximum verified setup: 40 Agents per cluster.
Reference VM recommendations:
- Google Cloud: n2-standard-8 for a single Agent
- ARM-based: c4a-standard-8
- Example cluster: Five VMs (c6gn.2xlarge, 8 vCPU each) with one Agent per VM achieving 40 Gbps upload to S3 within the same cluster. Spot instances can be used for Agents with the helper role, as assigned by the admin.
Resource requirements:
- CPU: 8 cores per Agent on a VM. Total core count depends on the number of Agents per VM.
- RAM: 32 GB or more, depending on the number of files. In a scale-out scenario, folder trees are distributed unevenly across Agents: if a branch leader goes offline and a new branch follower is assigned, the previous branch follower retains the folder tree even if it comes back online, continuing to consume RAM. The same applies to instances manually downgraded to the helper role.
- Storage: Minimum 1 GB/s read/write speed per Agent, as measured by fio. Average indexing speed is 20k files/sec.
- Network: Minimum 10 Gbps per Agent, as measured by iperf between source and destination machines (TCP or UDP). To maximize scale-out performance, TCP multistreaming must be enabled for the Agents.
TCP is recommended for stable connections without packet loss, such as between data centers or between a data center and a cloud region. Most cloud providers and enterprise firewalls limit UDP traffic, while TCP connections are usually unrestricted.
ZGT is recommended for:
- High-latency connections (RTT > 200 ms)
- Connections with packet loss
- Unstable network environments
Load balancing in a scaleout cluster
Scale-out clusters incorporate automatic and configurable load-balancing mechanisms to optimize resource allocation and job execution. The key principles of load balancing include:
Job load balancing
The cluster leader assigns branch leader and branch follower roles only to a specific number of Agents, based on their load, number of jobs, last assigned role, and file errors. The maximum number of branch followers per branch leader is controlled by the parameter scaleout.redundancy_factor (default: 10). Any remaining Agents are assigned the helper role.
Jobs are load-balanced in batches per the parameter scaleout.leader_job_block_size (default: 10). Jobs are first assigned to Agents to fill a block (up to 10 jobs), then to other Agents in a pseudo-random manner, and finally by available RAM.
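For illustration, the two load-balancing parameters above could be set together. The block below uses the same notation as the job parameters shown later in this article and the documented default values; verify the exact parameter scope for your product version:
{
'scaleout.redundancy_factor': '10', // maximum number of branch followers per branch leader
'scaleout.leader_job_block_size': '10' // number of jobs load-balanced per batch
}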
File transfers load balancing
File transfers are handled by helpers, ensuring optimized distribution of workload across the cluster. This follows standard scale-out operational logic.
Connectivity Management
All cluster communication is routed through the branch leader, which determines internal connection endpoints. The scale-out implementation uses a redirect mechanism at the tunnel handshake level, allowing remote helpers or peers to locate and establish connections with an appropriate helper within the cluster.
Scaleout cluster roles
Scaleout clusters must meet the following requirements:
- All Agents must be running the same version.
- All Agents must be on the same OS.
- Minimum: 1 Agent. Recommended maximum: 50 Agents per scale-out cluster.
Auto-group assignment rules "Add all new Agents" and "Add new Agents that match the rule" require at least one cluster member at creation. Auto-groups are supported, with the default role assigned as "helper". Additional automation assigns specific roles to Agents through tagging. Possible tag values: helper, cluster member, or cluster member and helper.
The following configurations are available. Changing an Agent's role after cluster creation is supported:
- Cluster Member and Helper: The Agent can take any role in the cluster.
- Cluster Member: The Agent can be a cluster leader, branch leader, or branch follower but not a helper. At least one cluster member or cluster member and helper is required in the cluster. A cluster leader is elected among Agents with these roles. The cluster remains idle if a leader is not elected (e.g., all agents with these roles are offline).
- Helper: The Agent can only serve as a helper and cannot take on other roles.
Agents perform different roles within the cluster, depending on their assigned configuration:
Cluster Leader: responsible for assigning branch leader, branch follower, and helper roles (including to itself). Manages job status reporting. Each cluster configuration will have one branch leader, with additional branch followers for load balancing. The leader builds cluster configurations based on updates from members, who report job load, memory usage, and previous role assignments.
Branch leader: responsible for external communication and is the primary source of the file tree. It maintains file metadata (the file tree), scans, uploads, and downloads files, and assigns download operations to helpers.
Branch follower: a cluster member that holds a copy of the file tree but doesn't share it outside the cluster. It maintains the file tree (but not files' metadata), merges it with the branch leader's inside its own cluster, and scans. Its purpose is to become the new branch leader if the current leader fails. A branch follower may be a helper at the same time.
Helper: Responsible for data transfer. Requests download operations from the branch leader, executes them, and reports back. Does not store the file tree (although branch leaders or followers downgraded to helper retain it temporarily). Handles only delegated tasks.
Cluster follower: a passive cluster member that participates in leader elections (an HA follower).
Disconnected: a cluster member without a connection to the leader.
Unassigned Role: An Agent that does not participate in leader elections and can go offline without disrupting the cluster.
External connections outside cluster
External connections outside the cluster are managed by the branch leader but can be delegated to a helper. Helpers attempt to reuse already established connections for downloading data, generally preferring to connect to the nearest Agent in a remote cluster or an external Agent.
Failover
Two types of failovers may occur in a scale-out cluster.
- Cluster Role Failover: Triggered when the cluster leader loses connection to other cluster members and a new leader is elected.
- Cluster Transfer Failover: Occurs when a branch leader or branch follower changes due to failure.
Cluster Roles Failover
This failover involves switching roles within the cluster. It occurs when the cluster leader disconnects from other cluster members due to a network disruption, or an Agent process stop or restart. Connectivity to other Agents, not to the Management Console (MC), determines failover: if the leader stays connected to Agents but loses MC access, failover won't occur.
During failover, Agents in the scaleout cluster report the status "failover" in the job run and the job's event logs, and job progress is temporarily halted.
The process typically takes around 30 seconds. A new cluster leader is elected using the Raft algorithm, and it continues the job. If no leader is elected, failover continues until one is elected or the process times out.
The cluster leader configures the cluster based on updates from members, who report their number of active jobs, memory usage, and their role in the last configuration known to them. A new cluster configuration is built in two steps:
a) selecting from Agents that are already branch leaders or branch followers;
b) filling the rest of the roles.
At all stages, Agents with file errors or excessive memory usage (over 90%) are excluded from selection.
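The two-step selection described above can be sketched as follows. This is illustrative pseudologic only, not Resilio's actual implementation; the agent field names and the tie-breaking order are assumptions based on this article:

```python
# Illustrative sketch of the two-step role selection described above.
# Not the actual product implementation; field names are assumptions.

MEMORY_LIMIT = 0.90  # Agents above 90% memory usage are excluded


def eligible(agent):
    # Agents with file errors or excessive memory usage are excluded.
    return not agent["file_errors"] and agent["memory_usage"] < MEMORY_LIMIT


def build_configuration(agents, branch_roles_needed):
    candidates = [a for a in agents if eligible(a)]
    # Step a) prefer Agents that are already branch leaders or followers.
    keep = [a for a in candidates
            if a["last_role"] in ("branch leader", "branch follower")]
    # Step b) fill the remaining roles from the other eligible Agents,
    # least-loaded first (fewest active jobs, then lowest memory usage).
    rest = sorted((a for a in candidates if a not in keep),
                  key=lambda a: (a["active_jobs"], a["memory_usage"]))
    chosen = (keep + rest)[:branch_roles_needed]
    helpers = [a for a in agents if a not in chosen]
    return chosen, helpers
```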
Cluster Transfer Failover
It is similar to failover in High Availability groups. It is triggered by branch leader failure due to:
- disconnection from branch followers in the job;
- deletion of the identifying .sync/ID file;
- a database error;
- storage misconfiguration, causing branch followers to reference a different storage location than the branch leader.
Some operations are interrupted by the failover and started again after it, for example:
- initial indexing of the job (the Agent scanning the folder and building the folder database). The new branch follower will have to scan the whole folder before it continues the job.
- trigger execution in a Distribution/Consolidation job; the new leader will start these from scratch.
- file download from the leader. If the Lazy indexing parameter in the Profile is enabled, remote Agents outside the scaleout cluster will restart active downloads from the new leader.
- metadata recalculation. While file hashes are retained, metadata remains on the previous leader and needs to be recalculated.
Using scaleout clusters in jobs
All Agents in the scaleout cluster synchronize the same physical location. If the storage location is misconfigured, the followers report the error and cease synchronization.
Only a direct path or a storage connector location is supported. Path macros are not supported for scaleout clusters; if used, the behavior is undefined. Changing the job path for an already configured job is not supported: remove the cluster from the job and add it again with a new path.
Resilio supports different job configurations: cluster-to-cluster, cluster-to-Agents.
It's recommended to enable TCP multistreaming and increase the number of allowed connections for Agents in the cluster before creating the job:
{
'net.tcp.streams': '10', // 10 is the optimal value for the majority of cases
'transfer_peers_limit': 'X', // where X = (size of cluster) * 3
'overwrite_changes': 'yes' // otherwise, Agents in the cluster won't be able to properly detect file changes and the job will get stuck
}
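The transfer_peers_limit rule of thumb above can be computed from the cluster size. A small helper (purely illustrative, not part of the product; the parameter names come from this article) might look like:

```python
def recommended_job_params(cluster_size):
    """Build the recommended job parameter values described above.

    Illustrative helper only: the parameter names are taken from this
    article, but this function is not part of the product itself.
    """
    return {
        "net.tcp.streams": "10",                        # optimal for most cases
        "transfer_peers_limit": str(cluster_size * 3),  # (size of cluster) * 3
        "overwrite_changes": "yes",                     # needed for change detection
    }
```

For example, a five-Agent cluster would get transfer_peers_limit = 15.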
The following parameters are enforced for scaleout cluster in the jobs:
{
'disk.unwritten_max_load_factor': '5',
'disk.unwritten_async_max_load_factor': '10',
'disk.out_of_order_threads': '1'
}
Job run details and progress - number of files, size of the job, etc. - are reported by the cluster leader, which aggregates them from the branch leaders. The leader reports the total dataset, while each branch leader reports the part of the data set it manages. Helpers' information is not taken into consideration.
The overall scaleout cluster transfer speed is calculated as the sum of the speeds of all Agents in the cluster.
The total cluster file counter and size are reported per branch leader.
To get a correct picture of how the job run is progressing, all leaders of the scaleout clusters in the job must be elected (not in the process of failover) and online (connected to the MC). Otherwise, the reported job progress will be reduced accordingly.
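The aggregation rules above (file and size totals come from branch leaders, helpers' counters are ignored, and the overall speed is the sum of all Agents' speeds) can be sketched as:

```python
def cluster_progress(agents):
    """Aggregate job progress the way this article describes it.

    Illustrative only; field names are assumptions. File/size totals are
    summed over branch leaders, helpers' counters are ignored, and the
    cluster transfer speed is the sum of the speed of every Agent.
    """
    leaders = [a for a in agents if a["role"] == "branch leader"]
    return {
        "files": sum(a["files"] for a in leaders),
        "bytes": sum(a["bytes"] for a in leaders),
        "speed": sum(a["speed"] for a in agents),  # all Agents contribute speed
    }
```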
Additionally, a more detailed transfer status is available for leaders from their overview in the job run.
Scaleout cluster members, in addition to existing statuses, report the following statuses:
working - Agent is a branch leader in the scaleout cluster and is performing a task. Check the leader Agent for details.
active - Agent is a helper in scaleout cluster and is downloading or seeding files.
inactive - Agent is passive and does not perform any activity in the scaleout cluster.
Sync jobs/File cache/Hybrid work
Scaleout clusters can be used in these jobs, subject to the limitations listed in the Peculiarities and limitations section below.
Synchronization of file permissions is supported. A whole scaleout cluster can be selected as a Reference Agent. See here for more details about Reference Agent in general and working with scaleout clusters in particular.
Cross-platform synchronization of file permissions
Using scaleout clusters with only one Agent assigned the cluster member role on a system that cannot apply the replicated permissions (for example, on Linux with replicated NTFS permissions or vice versa) should be avoided. It may lead to unexpected permissions issues and access problems. Always ensure that scaleout clusters are used on systems with compatible file permission structures.
Distribution/Consolidation jobs
Scaleout clusters can be used as source or destination in these jobs.
Triggers in the job are executed by the leader in the scaleout cluster. If failover happens during script execution, the newly elected leader starts executing the script from scratch. The "Before finalizing download" trigger is not supported for scaleout clusters.
An active job run can be stopped on an Agent from the scaleout cluster; the job run is then stopped for all Agents in the cluster.
Adding new Agents to an active job run with a scaleout cluster in it is not supported.
Restarting a job run that has a scaleout cluster in it is not supported; Agents will report an error about the misconfigured storage path.
A job run will be aborted on all Agents in the scaleout cluster if an error from the "Abort on error" list occurs on the cluster leader. The "Agent offline" error is ignored on all scaleout Agents.
Tags AGENT_NAME and AGENT_ID are supported for scaleout clusters. The name and ID of the scaleout cluster will be used.
Peculiarities and limitations
Not supported:
- Agents of a version older than 5.0.0
- mobile Agents, Agents installed on NAS devices, the MC Agent
- Script jobs
- path macros for scaleout clusters
- changing the job path for scaleout cluster
- adding new agents to a job run or restarting the job run that has a scaleout cluster
- network policy rules
- Before finalizing download trigger
- Job priorities
File query results are reported only from the cluster leader.
Scaleout clusters are not compatible with Priority Agent functionality.
No automatic support for scaling up or down.
Temporary error "Share's identifying .sync/ID file is broken" for scaleout clusters may appear after changing the job type or recreating the job using the same files storage.
Deduplication implementation is incomplete; helpers may re-download pieces.
Single large file split download in OneDrive is not supported due to OneDrive's sequential write requirement.
Helpers won't join a download if their tunnel peer is already seeding to their cluster.