NUMA Topology
AMD recommends pinning instances within a NUMA node, but does not recommend doing so via the application-level worker_cpu_affinity option.
NGINX's worker_cpu_affinity option may lower performance by increasing the time a process spends waiting for a free CPU, which can be monitored by running runqlat on one of the NGINX workers' PIDs. On the other hand, worker_cpu_affinity eliminates CPU migrations, reduces cache misses and page faults, and slightly increases instructions per cycle, all of which can be verified with perf stat.
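As a rough sketch of how to check both (assuming the BCC tools and perf are installed; the worker PID lookup via pgrep is only illustrative):
# Run-queue latency histogram for a single NGINX worker (10-second interval, one report)
sudo runqlat -p "$(pgrep -f 'nginx: worker' | head -n1)" 10 1
# CPU migrations, cache misses, page faults and IPC for the same worker over 60 seconds
sudo perf stat -e cpu-migrations,cache-misses,page-faults,cycles,instructions -p "$(pgrep -f 'nginx: worker' | head -n1)" -- sleep 60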
NUMA Nodes Per Socket (NPS) Settings
For example, on a 96-core processor running two NGINX instances with 96 vCPUs each, set NPS2 in the BIOS and pin each NGINX instance to its own NUMA node.
For NIC tuning, AMD recommends a combination of NPS1 with "LLC as NUMA" enabled. Here, LLC means Last Level Cache (L3 cache), so the OS sees one NUMA node per L3 cache. This helps the OS scheduler maintain locality to the LLC without causing unnecessary cache-to-cache transactions.
If deployment restrictions prevent pinning of VM or NGINX instances, NPS1 will deliver the most consistent performance. This is the best trade-off for this situation, according to AMD.
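As an illustration of the pinning itself (the configuration file paths are hypothetical and the node numbers assume an NPS2 layout), numactl can show the topology and bind each instance:
numactl --hardware        # show NUMA nodes and their CPUs/memory
numactl --cpunodebind=0 --membind=0 nginx -c /etc/nginx/nginx-node0.conf
numactl --cpunodebind=1 --membind=1 nginx -c /etc/nginx/nginx-node1.conf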
NIC Configuration
Configure NIC Queues
Broadcom recommends using combined queues, with no more than a single IRQ per physical core.
ethtool -L [interface] combined 8 tx 0 rx 0
Ensure IRQ distribution, i.e. CPU affinity for the NIC queue interrupts, is set up properly. AWS does not recommend disabling the irqbalance service, because its ENA driver doesn't provide affinity hints, and if a device reset happens while irqbalance is disabled, the resulting IRQ distribution may be undesirable. On bare metal, the situation may be different.
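One way to verify the current distribution (interrupt numbers differ per system):
grep [interface] /proc/interrupts               # list the NIC queue IRQs and per-CPU counts
cat /proc/irq/<irq_number>/smp_affinity_list    # CPUs allowed to service a given IRQ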
RX and TX ring sizes
AMD recommends setting the maximum allowable ring size to boost network performance, but not on older kernels or drivers without byte queue limit support (non-BQL drivers).
Broadcom does not suggest this in all cases, as it can result in higher latency and other side effects.
ethtool -G [interface] tx 2047 rx 2047
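Before changing them, you can query the preset maximums and current values; the 2047 above is NIC-specific, and other adapters report different limits:
ethtool -g [interface]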
Interrupt Coalescing
Broadcom recommends enabling adaptive-rx so the driver can adaptively balance RX latency and throughput.
ethtool -C [interface] adaptive-rx on
GRO (Generic Receive Offload)
This should be disabled on routers and bridges, including virtual hosts using bridging. See also NIC Offload.
Broadcom NICs support Hardware GRO, which can be enabled with the following command.
ethtool -K [interface] rx-gro-hw on lro off gro on
System Configuration
Linux Kernel version
AMD recommends using Linux kernel 5.20 (released as mainline 6.0) or newer, which includes the IOMMU optimization patches.
CPU Scaling Governor
Set the CPU scaling governor to Performance mode.
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
On Ubuntu, you can use cpupower as well.
sudo cpupower frequency-set -g performance
Note: in the output of cpupower monitor -i 60 -m Mperf, C0 is the active state and Cx are sleep states.
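To confirm the governor actually changed, you can check, for example:
cpupower frequency-info | grep governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor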
Configure Limits
Add the following configuration to a new .conf file under /etc/security/limits.d/, if your current limits are lower.
#nproc – number of processes
#nofile – number of file descriptors
* soft nproc 32768
* hard nproc 65535
* soft nofile 32768
* hard nofile 65535
root soft nproc 32768
root hard nproc 65535
root soft nofile 32768
root hard nofile 65535
Then, add the LimitNOFILE=65535 option to nginx.service, or set worker_rlimit_nofile in the NGINX configuration, to increase the maximum number of open files for worker processes.
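For example, via a systemd drop-in (the path is a conventional location, the value mirrors the limits above):
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=65535
sudo systemctl daemon-reload && sudo systemctl restart nginx
Or, equivalently, in nginx.conf:
worker_rlimit_nofile 65535;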
Firewall
AMD recommends disabling the firewall if possible to improve performance.
If you have the firewall and connection tracking enabled, make sure nf_conntrack_max is set to an appropriate value.
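A quick way to check how close you are to the limit and to raise it if needed (assuming the nf_conntrack module is loaded; 1048576 is only an example value):
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576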
Transparent Hugepages (THP)
Make it opt-in with madvise. Only enable THP if you are sure it is beneficial.
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
sysctl
This is a sample sysctl.conf based on various sources, including AMD's recommendations.
# /etc/sysctl.conf
########## Kernel ##############
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core
# filename. Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1
# increase system file descriptor limit
fs.file-max = 65535
# Allow for more PIDs
kernel.pid_max = 65536
########## Swap ##############
vm.swappiness = 10 # Favor RAM over swap
# Disk Caching. Data isn't critical and can be lost? Favor raising the cache.
# NOT recommended on modern systems with very large amounts of RAM. Comment it out!
vm.vfs_cache_pressure = 50
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
########## IPv4 networking ##############
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Send redirects? Only routers should; this is just a server
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
# Accept packets with SRR option? No
net.ipv4.conf.all.accept_source_route = 0
# Accept Redirects? No, this is not a router
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
# Log packets with impossible addresses to kernel log? yes
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
# Ignore all ICMP ECHO and TIMESTAMP requests sent to it via broadcast/multicast
net.ipv4.icmp_echo_ignore_broadcasts = 1
# Turn on protection for bad icmp error messages
net.ipv4.icmp_ignore_bogus_error_responses = 1
# Protect against the common 'SYN flood' attack
net.ipv4.tcp_syncookies = 1
# Number of SYN+ACK retransmits before giving up on a passive connection
net.ipv4.tcp_synack_retries = 2
# Enable source validation by reversed path, as specified in RFC1812
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# TCP and memory optimization
# increase TCP max buffer size to 8MiB
net.ipv4.tcp_rmem = 4096 131072 8388608
net.ipv4.tcp_wmem = 4096 16384 8388608
# increase Linux auto tuning TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_window_scaling = 1
#Increase system IP port limits
net.ipv4.ip_local_port_range = 2000 65499
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_slow_start_after_idle = 0
# Recommended for hosts with jumbo frames enabled
net.ipv4.tcp_mtu_probing = 1
# TCP Fast Open
net.ipv4.tcp_fastopen = 3
#net.ipv4.tcp_congestion_control = cubic
########## IPv4 networking ends ##############
########## IPv6 networking start ##############
# Number of Router Solicitations to send until assuming no routers are present.
# This is host and not router
net.ipv6.conf.default.router_solicitations = 0
# Accept Router Preference in RA?
net.ipv6.conf.default.accept_ra_rtr_pref = 0
# Learn Prefix Information in Router Advertisement
net.ipv6.conf.default.accept_ra_pinfo = 0
# Accept a default router from Router Advertisements? No
net.ipv6.conf.default.accept_ra_defrtr = 0
# Router Advertisements can cause the system to assign a global unicast
# address to an interface
net.ipv6.conf.default.autoconf = 0
# how many neighbor solicitations to send out per address?
net.ipv6.conf.default.dad_transmits = 0
# How many global unicast IPv6 addresses can be assigned to each interface?
net.ipv6.conf.default.max_addresses = 1
########## IPv6 networking ends ##############
NGINX Configuration
Open file cache
Useful for serving lots of static or cached data. See the official documentation for the open_file_cache* directives. Check potential benefits with:
# funclatency /srv/nginx-bazel/sbin/nginx:ngx_open_cached_file -u
usecs : count distribution
0 -> 1 : 10219 |****************************************|
2 -> 3 : 21 | |
4 -> 7 : 3 | |
8 -> 15 : 1 | |
If there are too many open calls, or some of them take too much time, you can look at enabling the open file cache.
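A minimal sketch of the relevant directives (the values are illustrative and should be tuned to your workload):
open_file_cache          max=10000 inactive=60s;
open_file_cache_valid    120s;
open_file_cache_min_uses 2;
open_file_cache_errors   on;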
HTTP Keepalive
keepalive_requests can be increased, at the risk of introducing additional DDoS attack vectors.
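For example (the values are illustrative, not a recommendation):
keepalive_timeout  75s;
keepalive_requests 10000;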
Buffering requests and responses with large bodies
For buffering the request body, enabling sendfile improves performance for requests with large content sizes (>1 MB) and results in a small performance loss for requests with small content sizes. AMD recommends setting sendfile_max_chunk to the typical average request size.
Enabling tcp_nopush can be beneficial when serving large content, as it maxes out the packet size until the file is fully sent.
sendfile on;
tcp_nopush on;
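If you do set sendfile_max_chunk, it sits alongside the directives above; the 512k here is only an assumed example of a typical average request size:
sendfile_max_chunk 512k;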
Worker configuration
worker_processes should be set to the number of vCPUs. worker_connections should be increased as needed.
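A minimal sketch (worker_processes auto starts one worker per available vCPU; the connection count is illustrative):
worker_processes auto;
events {
    worker_connections 8192;
}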
Logging
Use ext4slower to identify disk I/O latency issues. Enable buffering and gzip for the access_log directive to help reduce blocking on I/O.
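For example, a buffered and compressed access log (buffer size, flush interval, and gzip level are illustrative):
access_log /var/log/nginx/access.log combined buffer=64k flush=5s gzip=1;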
If multiple NGINX processes attempt to write to the same log file, lock contention could be a dominating factor in your CPU profile.
Caching and Compression
A properly configured cache can increase performance, especially if serving stale content is enabled. Make sure there is sufficient RAM to hold the hot cached content in the OS page cache.
NGINX supports Gzip compression, which accelerates the transfer rate from the server to the client and reduces bandwidth usage.
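A sketch of both pieces, assuming a proxy cache zone is already configured elsewhere; the compression level and types are illustrative:
# Serve stale content while a cache entry is being refreshed or the upstream errors out
proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
# A cheap compression level keeps CPU cost low while still saving bandwidth
gzip            on;
gzip_comp_level 1;
gzip_min_length 1024;
gzip_types      text/css application/javascript application/json image/svg+xml;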
Enable AIO (Asynchronous file I/O)
Writing temporary files with data received from proxied servers using AIO can boost performance.
If you have a reasonable amount of RAM, you are not using spinning disks, and your working data set isn't very big, NGINX can make use of the OS page cache; in that case it is recommended not to enable AIO, to avoid the overhead of offloading I/O to threads.
aio threads;
aio_write on;
Enable PCRE JIT
JIT can speed up processing of regular expressions significantly if you have a lot of them.
pcre_jit on;
Spot Event Loop Stalls
If you start noticing that your NGINX is spending too much time inside ngx_process_events_and_timers, and the latency distribution is bimodal, then you are probably affected by event loop stalls.
# funclatency '/srv/nginx-bazel/sbin/nginx:ngx_process_events_and_timers' -m
msecs : count distribution
0 -> 1 : 3799 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 409 |**** |
32 -> 63 : 313 |*** |
64 -> 127 : 128 |* |
You will need more skills to root cause and fix such issues, which is beyond the scope of this article.
References
- https://dropbox.tech/infrastructure/optimizing-web-servers-for-high-throughput-and-low-latency
- https://netdevconf.org/1.2/papers/bbr-netdev-1.2.new.new.pdf
- https://www.youtube.com/watch?v=aGL8a3Agj-c
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning.html
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58489_amd-epyc-9005-tg-nginx.pdf
- https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58472_amd-epyc-9005-tg-linux-network.pdf
- https://cdrdv2-public.intel.com/334019/334019_Intel%20Ethernet%20700%20Series%20Linux%20Performance%20Tuning%20Guide.pdf