Receive (RX)

flowchart LR
    nic[NIC hardware buffer]
    irq[hard IRQ]
    softirq[soft IRQ]
    recvq[app socket queue]
    app[application]
    nic==>irq==>softirq==>recvq==>app

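Scaling across multiple CPU cores is implemented with hardware-based Receive-Side Scaling (RSS), which spreads incoming packets over several NIC receive queues so that different CPUs can service them.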

If the number of hardware queues on a single network interface card becomes a bottleneck, software-based Receive Packet Steering (RPS) can be used to further distribute load across CPU cores, at the cost of increased inter-processor interrupts.
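RPS is configured per receive queue by writing a hexadecimal CPU bitmap to the queue's `rps_cpus` file in sysfs. The following is a minimal sketch of that step, not a tuned configuration: the helper names `cpu_bitmap`, `rps_cpus_path`, and `enable_rps` are hypothetical, the interface name `eth0` in the usage note is an assumption, and the write requires root.

```python
from pathlib import Path


def cpu_bitmap(cpus):
    """Format CPU indices as comma-separated 32-bit hex groups (sysfs CPU mask format)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    groups = []
    while True:
        # Emit the lowest 32 bits first, then reverse so the highest group leads.
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(groups))


def rps_cpus_path(dev, rx_queue):
    """Return the sysfs file that holds the RPS CPU mask for one receive queue."""
    return Path("/sys/class/net") / dev / "queues" / f"rx-{rx_queue}" / "rps_cpus"


def enable_rps(dev, rx_queue, cpus):
    """Write the mask; an empty CPU list writes zero, which disables RPS for the queue."""
    rps_cpus_path(dev, rx_queue).write_text(cpu_bitmap(cpus))
```

For example, `enable_rps("eth0", 0, [0, 1, 2, 3])` would let CPUs 0 to 3 process packets received on queue `rx-0`.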

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet.
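RFS is controlled by two table sizes: the global `rps_sock_flow_entries` sysctl and a per-queue `rps_flow_cnt`. A minimal sketch of how the values could be derived is shown here, assuming the per-queue count is simply the global table size divided evenly by the number of receive queues. The function name `rfs_settings` is hypothetical, and the default of 32768 entries is only a starting point to tune for the workload.

```python
def rfs_settings(dev, num_rx_queues, sock_flow_entries=32768):
    """Return a {path: value} mapping of the files to write to enable RFS on `dev`."""
    if num_rx_queues < 1:
        raise ValueError("num_rx_queues must be at least 1")
    # Split the global flow table evenly across the receive queues.
    per_queue = sock_flow_entries // num_rx_queues
    settings = {
        "/proc/sys/net/core/rps_sock_flow_entries": str(sock_flow_entries),
    }
    for queue in range(num_rx_queues):
        path = f"/sys/class/net/{dev}/queues/rx-{queue}/rps_flow_cnt"
        settings[path] = str(per_queue)
    return settings
```

The function only computes the values; writing them to the listed files requires root. RFS builds on RPS, so the `rps_cpus` masks for the same queues must also be set.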

Accelerated RFS (aRFS) speeds up RFS by adding hardware assistance. Unlike software RFS, the NIC itself steers packets directly to a CPU that is local to the thread consuming the data.

Alternatively, the Linux kernel's SO_ATTACH_REUSEPORT_EBPF socket option allows a program to attach an eBPF program as the load-balancing algorithm for a group of SO_REUSEPORT sockets, which can be used to steer packets as well.

Both aRFS and SO_REUSEPORT locality can improve CPU cache efficiency, but the gain is usually negligible compared with the cost of other CPU-intensive or I/O-intensive operations.

Transmit (TX)

On hosts with a network interface controller (NIC) that supports multiple queues, Transmit Packet Steering (XPS) distributes the processing of outgoing network packets among several queues. This lets multiple CPUs process outgoing traffic, which avoids transmit queue lock contention and, consequently, packet drops.

Certain drivers, such as ixgbe, i40e, and mlx5, automatically configure XPS. Consult your NIC driver's documentation to identify whether the driver supports this capability. If the driver does not support XPS auto-tuning, you can manually assign CPU cores to the transmit queues.

References