NUMA Topology

AMD recommends pinning instances within a NUMA node, but does not recommend doing so via the application-level worker_cpu_affinity option.

NGINX’s worker_cpu_affinity option may lower performance by increasing the time a process spends waiting for a free CPU. This can be monitored by running runqlat on one of the NGINX worker PIDs. On the other hand, worker_cpu_affinity eliminates CPU migrations, reduces cache misses and page faults, and slightly increases instructions per cycle, all of which can be verified with perf stat.
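
For example, run queue latency and the hardware counters mentioned above could be sampled with something like the following. runqlat is part of the BCC tools (packaged as runqlat-bpfcc on some distributions); the pgrep filter and the 10/60-second windows are illustrative.

# Measure run queue (scheduler) latency for one NGINX worker for 10 seconds
sudo runqlat -p $(pgrep -f 'nginx: worker' | head -n 1) 10 1

# Count migrations, cache misses, page faults, and instructions per cycle over 60 seconds
sudo perf stat -p $(pgrep -f 'nginx: worker' | head -n 1) \
    -e cpu-migrations,cache-misses,page-faults,instructions,cycles -- sleep 60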

NUMA Nodes Per Socket (NPS) Settings

For example, with a 96-core processor running two NGINX instances of 96 vCPUs each, set NPS2 in BIOS and pin each NGINX instance to one NUMA node.

For NIC tuning, AMD recommends combining NPS1 with the LLC as NUMA setting enabled. Here, LLC means Last Level Cache, or L3 cache, so the OS will see one NUMA node per L3 cache. This helps the OS scheduler maintain locality to the LLC without causing unnecessary cache-to-cache transactions.

If deployment restrictions prevent pinning of VM or NGINX instances, NPS1 will deliver the most consistent performance; according to AMD, this is the best trade-off for that situation.

NIC Configuration

Configure NIC Queues

Broadcom recommends using combined queues, with no more than a single IRQ per physical core.

ethtool -L [interface] combined 8 tx 0 rx 0

Ensure IRQ distribution, i.e. CPU affinity for the NIC queue interrupts, is properly set up. AWS does not recommend disabling the irqbalance service, because its ENA driver doesn’t provide affinity hints, and if a device reset happens while irqbalance is disabled, this might cause undesirable IRQ distribution. On bare metal, the situation may be different.
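
As a rough check (the interface name and IRQ number below are placeholders), the current spread of queue interrupts and the affinity mask of a given IRQ can be inspected like this:

# Show which CPUs are servicing the NIC queue interrupts
grep [interface] /proc/interrupts

# Show the CPU affinity mask of one of those IRQs
cat /proc/irq/[irq-number]/smp_affinity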

RX and TX ring sizes

AMD recommends setting the maximum allowable ring size to boost network performance, but not on older kernels or drivers without byte queue limit support (non-BQL drivers).

Broadcom does not suggest this for all cases as it could result in higher latency and other side effects.

ethtool -G [interface] tx 2047 rx 2047
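
The maximum ring sizes supported by the driver can be queried first; the 2047 values above assume a NIC whose preset maximums are at least that large.

ethtool -g [interface]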

Interrupt Coalescing

Broadcom recommends enabling adaptive-rx to improve RX latency or throughput adaptively.

ethtool -C [interface] adaptive-rx on
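
To verify the change, the current coalescing settings can be displayed with:

ethtool -c [interface]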

GRO (Generic Receive Offload)

GRO should be disabled on routers and bridges, including virtual hosts that use bridging. See also NIC Offload.
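
On such hosts, software GRO can be turned off per interface, for example:

ethtool -K [interface] gro off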

Broadcom NICs support Hardware GRO, which can be enabled with the following command.

ethtool -K [interface] rx-gro-hw on lro off gro on

System Configuration

Linux Kernel version

AMD recommends using Linux kernel 5.20 or newer, which includes IOMMU optimization patches (the 5.20 development series was released as kernel 6.0).

CPU Scaling Governor

Set the CPU scaling governor to Performance mode.

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

On Ubuntu, you can use cpupower as well.

sudo cpupower frequency-set -g performance

Note: in the output of cpupower monitor -i 60 -m Mperf, C0 is the active state and Cx are sleep states.
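
To confirm the governor actually took effect and to sample C-state residency as described above, something like the following can be used:

# Confirm the active governor across all CPUs
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# Sample active (C0) vs. sleep (Cx) state residency over 60 seconds
cpupower monitor -i 60 -m Mperf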

Configure Limits

Add the following configuration to a new .conf file under /etc/security/limits.d/, if your current limits are lower.

#nproc – number of processes
#nofile – number of file descriptors
* soft nproc 32768
* hard nproc 65535
* soft nofile 32768
* hard nofile 65535
root soft nproc 32768
root hard nproc 65535
root soft nofile 32768
root hard nofile 65535

Then, add the LimitNOFILE=65535 option to nginx.service, or set worker_rlimit_nofile in the NGINX configuration, to increase the maximum number of open files for worker processes.
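
For example, the systemd route could use a drop-in file (the drop-in path below is an example), while the NGINX route is a single directive in the main configuration:

# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=65535

# nginx.conf (main context)
worker_rlimit_nofile 65535;

After adding the drop-in, run systemctl daemon-reload and restart NGINX for the new limit to take effect.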

Firewall

AMD recommends disabling the firewall if possible to improve performance.

If you have a firewall and connection tracking enabled, make sure nf_conntrack_max is set to an appropriate value.
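
As a quick sanity check (the limit shown is only an example; size it to your connection volume and available memory):

# Compare current tracked connections against the table limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Raise the limit if the count approaches the maximum
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576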

Transparent Hugepages (THP)

Make THP opt-in with madvise. Only enable it more broadly if you are sure it is beneficial.

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

sysctl

This is a sample sysctl.conf based on various sources including AMD’s recommendations.

# /etc/sysctl.conf
########## Kernel ##############
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
 
# Controls whether core dumps will append the PID to the core
# filename. Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1
 
# increase system file descriptor limit
fs.file-max = 65535
 
# Allow for more PIDs
kernel.pid_max = 65536
 
########## Swap ##############
vm.swappiness = 10 # Favor RAM over swap
 
# Disk Caching. Data isn't critical and can be lost? Favor raising the cache.
# NOT recommended on modern systems with very large amounts of RAM. Comment it out!
vm.vfs_cache_pressure = 50
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
 
########## IPv4 networking ##############
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
 
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
 
# Do not send redirects; this host is a server, not a router
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
 
# Accept packets with SRR option? No
net.ipv4.conf.all.accept_source_route = 0
 
# Accept Redirects? No, this is not a router
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
 
# Log packets with impossible addresses to kernel log? yes
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
 
# Ignore all ICMP ECHO and TIMESTAMP requests sent to it via broadcast/multicast
net.ipv4.icmp_echo_ignore_broadcasts = 1
# Turn on protection for bad icmp error messages
net.ipv4.icmp_ignore_bogus_error_responses = 1
 
# Protect against the common 'SYN flood' attack
net.ipv4.tcp_syncookies = 1
 
# Limit SYN-ACK retransmissions for half-open connections
net.ipv4.tcp_synack_retries = 2
 
# Enable source validation by reversed path, as specified in RFC1812
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
 
# TCP and memory optimization
# increase TCP max buffer size to 8MiB
net.ipv4.tcp_rmem = 4096 131072 8388608
net.ipv4.tcp_wmem = 4096 16384 8388608
 
# increase Linux auto tuning TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_window_scaling = 1
 
#Increase system IP port limits
net.ipv4.ip_local_port_range = 2000 65499
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 60
 
net.ipv4.tcp_slow_start_after_idle = 0
 
# Recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing = 1
 
# TCP Fast Open
net.ipv4.tcp_fastopen = 3
 
#net.ipv4.tcp_congestion_control = cubic
 
########## IPv4 networking ends ##############
 
########## IPv6 networking start ##############
 
# Number of Router Solicitations to send until assuming no routers are present.
# This is host and not router
net.ipv6.conf.default.router_solicitations = 0
 
# Accept Router Preference in RA?
net.ipv6.conf.default.accept_ra_rtr_pref = 0
 
# Learn Prefix Information in Router Advertisement
net.ipv6.conf.default.accept_ra_pinfo = 0
 
# Accept a default router from Router Advertisements?
net.ipv6.conf.default.accept_ra_defrtr = 0
 
# Router Advertisements can cause the system to assign a global unicast
# address to an interface
net.ipv6.conf.default.autoconf = 0
 
# how many neighbor solicitations to send out per address?
net.ipv6.conf.default.dad_transmits = 0
 
# How many global unicast IPv6 addresses can be assigned to each interface?
net.ipv6.conf.default.max_addresses = 1
 
########## IPv6 networking ends ##############

Nginx Configuration

Open file cache

Useful for serving lots of static or cached data. See official documentation for the open_file_cache* directives. Check potential benefits with:

# funclatency /srv/nginx-bazel/sbin/nginx:ngx_open_cached_file -u
     usecs               : count     distribution
         0 -> 1          : 10219    |****************************************|
         2 -> 3          : 21       |                                        |
         4 -> 7          : 3        |                                        |
         8 -> 15         : 1        |                                        |

If there are too many open() calls, or some of them take too long, consider enabling the open file cache.
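
A minimal sketch of such a configuration (the sizes and timings are illustrative and should be tuned to your file set):

open_file_cache max=10000 inactive=60s;
open_file_cache_valid 120s;
open_file_cache_min_uses 2;
open_file_cache_errors on;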

HTTP Keepalive

keepalive_requests can be increased, at the risk of introducing additional DDoS attack vectors.
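
For example (the value is illustrative; weigh it against your exposure to abusive clients):

keepalive_requests 10000;    # default is 1000 in recent NGINX versions (100 in older ones)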

Buffering requests and responses with large body

When buffering requests and responses with large bodies, enabling sendfile improves performance for requests with large content sizes (>1 MB) and causes a small performance loss for requests with small content sizes. AMD recommends setting sendfile_max_chunk to the typical average request size.
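
For instance, if the typical content size is around 512 KB, that recommendation might translate to the following directive (the value is purely illustrative):

sendfile_max_chunk 512k;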

Enabling tcp_nopush can be beneficial when serving large content, as it fills packets to their maximum size until the file is fully sent.

sendfile on;
tcp_nopush on;

Worker configuration

worker_processes should be set to the number of vCPUs. worker_connections should be increased as needed.
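
A minimal sketch (the worker_connections value is an example; size it to your expected concurrency):

worker_processes auto;    # "auto" starts one worker per available vCPU

events {
    worker_connections 8192;
}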

Logging

Use ext4slower to identify disk I/O latency issues. Enable buffering and gzip for the access_log directive to help reduce blocking on I/O.
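
For example, buffered and compressed access logging might look like this, assuming NGINX was built with gzip support (buffer size, compression level, and flush interval are illustrative):

access_log /var/log/nginx/access.log combined buffer=64k gzip=1 flush=5s;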

If multiple NGINX processes attempt to write to the same log file, lock contention could be a dominating factor in your CPU profile.

Caching and Compression

Properly configured cache can increase performance, especially if serving stale content is enabled. Make sure there is sufficient RAM to store the hot cached content in OS page cache.

NGINX supports Gzip compression, which accelerates the transfer rate from the server to the client and reduces bandwidth usage.
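
A sketch of both ideas, assuming a proxy cache zone is already configured elsewhere; the compression level, minimum length, and MIME type list are examples:

proxy_cache_use_stale error timeout updating;

gzip on;
gzip_comp_level 4;
gzip_min_length 1024;
gzip_types text/plain text/css application/json application/javascript;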

Enable AIO (Asynchronous file I/O)

Use AIO when writing temporary files with data received from proxied servers to boost performance.

If you have a reasonable amount of RAM, you are not using spinning disks, and your working data set isn’t very big, NGINX can rely on the OS page cache; in that case, it is recommended not to enable AIO, to avoid the overhead of offloading.

aio threads;
aio_write on;

Enable PCRE JIT

JIT can speed up processing of regular expressions significantly if you have a lot of them.

pcre_jit on;

Spot Event Loop Stalls

If you notice that NGINX is spending too much time inside ngx_process_events_and_timers, and the latency distribution is bimodal, then you are probably affected by event loop stalls.

# funclatency '/srv/nginx-bazel/sbin/nginx:ngx_process_events_and_timers' -m
     msecs               : count     distribution
         0 -> 1          : 3799     |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 409      |****                                    |
        32 -> 63         : 313      |***                                     |
        64 -> 127        : 128      |*                                       |

You will need more skills to root cause and fix such issues, which is beyond the scope of this article.

References