
The SO_REUSEPORT socket option


By Michael Kerrisk
March 13, 2013

One of the features merged in the 3.9 development cycle was TCP and UDP support for the SO_REUSEPORT socket option; that support was implemented in a series of patches by Tom Herbert. The new socket option allows multiple sockets on the same host to bind to the same port, and is intended to improve the performance of multithreaded network server applications running on top of multicore systems.

The basic concept of SO_REUSEPORT is simple enough. Multiple servers (processes or threads) can bind to the same port if they each set the option as follows:

    int sfd = socket(domain, socktype, 0);

    /* Enable SO_REUSEPORT before the bind() call */
    int optval = 1;
    setsockopt(sfd, SOL_SOCKET, SO_REUSEPORT, &optval, sizeof(optval));

    bind(sfd, (struct sockaddr *) &addr, addrlen);

So long as the first server sets this option before binding its socket, then any number of other servers can also bind to the same port if they also set the option beforehand. The requirement that the first server must specify this option prevents port hijacking—the possibility that a rogue application binds to a port already used by an existing server in order to capture (some of) its incoming connections or datagrams. To prevent unwanted processes from hijacking a port that has already been bound by a server using SO_REUSEPORT, all of the servers that later bind to that port must have an effective user ID that matches the effective user ID used to perform the first bind on the socket.

SO_REUSEPORT can be used with both TCP and UDP sockets. With TCP sockets, it allows multiple listening sockets—normally each in a different thread—to be bound to the same port. Each thread can then accept incoming connections on the port by calling accept(). This presents an alternative to the traditional approaches used by multithreaded servers that accept incoming connections on a single socket.
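
As a rough illustration (not taken from Tom's patches or from the article; the port number and the handle_connection() helper are invented for the example), each listening thread might set up its own socket and run its own accept() loop along these lines:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    extern void handle_connection(int cfd);      /* hypothetical application handler */

    /* Illustrative per-thread listener: each thread owns a listening socket
       bound to the same port (error handling omitted for brevity). */
    static void *listener_thread(void *arg)
    {
        struct sockaddr_in addr = {
            .sin_family      = AF_INET,
            .sin_port        = htons(8080),          /* illustrative port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int optval = 1;

        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &optval, sizeof(optval));
        bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
        listen(lfd, SOMAXCONN);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);   /* only this thread's connections */
            handle_connection(cfd);
            close(cfd);
        }
    }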

The first of the traditional approaches is to have a single listener thread that accepts all incoming connections and then passes these off to other threads for processing. The problem with this approach is that the listening thread can become a bottleneck in extreme cases. In early discussions on SO_REUSEPORT, Tom noted that he was dealing with applications that accepted 40,000 connections per second. Given that sort of number, it's unsurprising to learn that Tom works at Google.

The second of the traditional approaches used by multithreaded servers operating on a single port is to have all of the threads (or processes) perform an accept() call on a single listening socket in a simple event loop of the form:

    while (1) {
        new_fd = accept(...);
        process_connection(new_fd);
    }

The problem with this technique, as Tom pointed out, is that when multiple threads are waiting in the accept() call, wake-ups are not fair, so that, under high load, incoming connections may be distributed across threads in a very unbalanced fashion. At Google, they have seen a factor-of-three difference between the thread accepting the most connections and the thread accepting the fewest connections; that sort of imbalance can lead to underutilization of CPU cores. By contrast, the SO_REUSEPORT implementation distributes connections evenly across all of the threads (or processes) that are blocked in accept() on the same port.

As with TCP, SO_REUSEPORT allows multiple UDP sockets to be bound to the same port. This facility could, for example, be useful in a DNS server operating over UDP. With SO_REUSEPORT, each thread could use recv() on its own socket to accept datagrams arriving on the port. The traditional approach is that all threads would compete to perform recv() calls on a single shared socket. As with the second of the traditional TCP scenarios described above, this can lead to unbalanced loads across the threads. By contrast, SO_REUSEPORT distributes datagrams evenly across all of the receiving threads.
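
A sketch of the UDP case (again illustrative only; the port and the handle_query() helper are invented for the example) looks much the same, with each thread receiving on its own bound socket:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    extern void handle_query(char *buf, ssize_t len,
                             struct sockaddr_in *peer);  /* hypothetical handler */

    /* Illustrative per-thread UDP receiver (error handling omitted). */
    static void *udp_worker(void *arg)
    {
        struct sockaddr_in addr = {
            .sin_family      = AF_INET,
            .sin_port        = htons(53),            /* illustrative port */
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int optval = 1;
        char buf[1500];

        int sfd = socket(AF_INET, SOCK_DGRAM, 0);
        setsockopt(sfd, SOL_SOCKET, SO_REUSEPORT, &optval, sizeof(optval));
        bind(sfd, (struct sockaddr *) &addr, sizeof(addr));

        for (;;) {
            struct sockaddr_in peer;
            socklen_t peerlen = sizeof(peer);
            ssize_t n = recvfrom(sfd, buf, sizeof(buf), 0,
                                 (struct sockaddr *) &peer, &peerlen);
            if (n >= 0)
                handle_query(buf, n, &peer);     /* reply using sendto(sfd, ...) */
        }
    }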

Tom noted that the traditional SO_REUSEADDR socket option already allows multiple UDP sockets to be bound to, and accept datagrams on, the same UDP port. However, by contrast with SO_REUSEPORT, SO_REUSEADDR does not prevent port hijacking and does not distribute datagrams evenly across the receiving threads.

There are two other noteworthy points about Tom's patches. The first of these is a useful aspect of the implementation. Incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection—that is, the peer IP address and port plus the local IP address and port. This means, for example, that if a client uses the same socket to send a series of datagrams to the server port, then those datagrams will all be directed to the same receiving server (as long as it continues to exist). This eases the task of conducting stateful conversations between the client and server.

The other noteworthy point is that there is a defect in the current implementation of TCP SO_REUSEPORT. If the number of listening sockets bound to a port changes because new servers are started or existing servers terminate, it is possible that incoming connections can be dropped during the three-way handshake. The problem is that connection requests are tied to a specific listening socket when the initial SYN packet is received during the handshake. If the number of servers bound to the port changes, then the SO_REUSEPORT logic might not route the final ACK of the handshake to the correct listening socket. In this case, the client connection will be reset, and the server is left with an orphaned request structure. A solution to the problem is still being worked on, and may consist of implementing a connection request table that can be shared among multiple listening sockets.

The SO_REUSEPORT option is non-standard, but available in a similar form on a number of other UNIX systems (notably, the BSDs, where the idea originated). It seems to offer a useful alternative for squeezing the maximum performance out of network applications running on multicore systems, and thus is likely to be a welcome addition for some application developers.




The SO_REUSEPORT socket option

Posted Mar 14, 2013 7:25 UTC (Thu) by kugel (subscriber, #70540) [Link]

I guess it was not feasible to fix "multiple servers accept() on the same socket" to distribute connections more evenly?

The SO_REUSEPORT socket option

Posted Mar 14, 2013 10:55 UTC (Thu) by jezuch (subscriber, #52988) [Link]

> I guess it was not feasible to fix "multiple servers accept() on the same socket" to distribute connections more evenly?

My thought exactly.

Also:

"Incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection—that is, the peer IP address and port plus the local IP address and port. (...) This eases the task of conducting stateful conversations between the client and server."

It is presented as an "implementation detail" so I guess one should not be surprised if it stops working this way sometime in the future? :)

The SO_REUSEPORT socket option

Posted Mar 14, 2013 11:34 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> > I guess it was not feasible to fix "multiple servers accept() on the same socket" to distribute connections more evenly?

> My thought exactly.

One argument for SO_REUSEPORT is that it makes it easier to use independently started processes on the same socket. E.g. with it you can simply start the new server - potentially in a new version - and shut down the old one after that, without any service interruption. At the moment you need to have a unix socket between the servers, send over the tcp socket file handle, start accept()ing in the new server and then shut down the old one.
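
For reference, the hand-off described above is done with an SCM_RIGHTS control message over a UNIX-domain socket. A minimal sketch of the sending side (illustrative only, not from the comment; error handling omitted) might look like:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Illustrative sketch: hand a listening socket 'listen_fd' to another
       process over an already-connected UNIX-domain socket 'unix_fd'. */
    static int send_listen_fd(int unix_fd, int listen_fd)
    {
        char dummy = 'x';                    /* must transmit at least one byte */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            struct cmsghdr align;            /* ensures correct alignment */
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov        = &iov,
            .msg_iovlen     = 1,
            .msg_control    = u.buf,
            .msg_controllen = sizeof(u.buf),
        };

        memset(u.buf, 0, sizeof(u.buf));
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

        return sendmsg(unix_fd, &msg, 0) == 1 ? 0 : -1;
    }

The receiving server would call recvmsg() with a matching control-message buffer to pick up its copy of the descriptor and then start accept()ing on it.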

The SO_REUSEPORT socket option

Posted Jun 8, 2014 5:40 UTC (Sun) by wahern (subscriber, #37304) [Link]

The current implementation doesn't actually support that. That's because each socket has its own queue, and when the 3-way handshake completes a connection is assigned to exactly one of those queues. That creates a race condition between accept(2) and close(2).

So, no, this doesn't support seamless server restarts.

Ironically it's the BSD semantics which support seamless server restarts. In my tests OS X's behavior (which I presume is identical to FreeBSD and other BSDs) is that the last socket to bind is the only one to receive new connections. That allows the old server to drain its queue and retire without worrying about any dropped connections.

The SO_REUSEPORT socket option

Posted Mar 15, 2013 16:51 UTC (Fri) by giraffedata (guest, #1954) [Link]

> I guess it was not feasible to fix "multiple servers accept() on the same socket" to distribute connections more evenly?

> "Incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection—that is, the peer IP address and port plus the local IP address and port."

That probably explains it. If you used this technique on multiple threads accepting on the same traditional socket, you would be fixing one thing and breaking another. Today, if a thread is blocked in accept() and no other thread is, and a connection request arrives, the thread gets it. It sounds like with a SO_REUSEPORT socket, the connection request would wait until its predetermined accepter is ready to take it.

The SO_REUSEPORT socket option

Posted Mar 15, 2013 19:00 UTC (Fri) by dlang (guest, #313) [Link]

I'll point out that this is very similar to the CLUSTERIP capability that iptables provides.

CLUSTERIP shares one IP across multiple machines, doing a hash to decide which machine gets the packet to userspace (they have the option of using the full 4-tuple or only part of it). You then use heartbeat or other clustering software to make sure that there is always some box that will handle the connection (with the expected gaps due to races when machines go down and the clustering software hasn't responded yet).

SO_REUSEPORT sounds like it extends this capability to multiple processes/threads inside one box, but with the re-balancing being done automatically by the kernel (with very similar races when processes/threads go down and the kernel hasn't responded yet).

This is a very nice new feature to have available.

The SO_REUSEPORT socket option

Posted Feb 8, 2014 18:19 UTC (Sat) by batth.maninder@gmail.com (guest, #81449) [Link]

The ability to route to the same tuple sounds like premature optimization. For example, if I have spawned a few stateless servers and a connection request arrives, would it simply sit in the queue until the original server is ready? The most typical architecture for stateful servers is to spawn three of them on different physical machines for reliability purposes and have a load balancer perform sticky sessions. Another option is to spawn stateless servers backed by a distributed cache. The ability to route connections and even datagrams to the same tuple is really an "application" level concern. What if the original server tuple is not running any more? Do I need to plug in policies for SO_REUSEPORT to determine how to handle failure conditions?

The SO_REUSEPORT socket option

Posted Jun 2, 2020 21:14 UTC (Tue) by nilsocket (guest, #135507) [Link]

I think one should remember that this is done for the sake of performance, for those who need it.

It isn't meant to replace load balancers or anything similar.

Within a single application, with multiple threads handling the load, it seems like a good option to have.

The SO_REUSEPORT socket option

Posted Mar 14, 2013 11:07 UTC (Thu) by jengelh (subscriber, #33263) [Link]

> The SO_REUSEPORT option is non-standard, but available in a similar form on a number of other UNIX systems (notably, the BSDs, where the idea originated).

In other news, what Linux comes up with is setting standards, because there have not been any standards before :)

The SO_REUSEPORT socket option

Posted Mar 15, 2013 2:02 UTC (Fri) by pr1268 (subscriber, #24648) [Link]

> In other news, what Linux comes up is setting standards, because there have not been any standards before :)

Hey, Linux is THE trendsetter here. Linux hackers and organizations behind its development are certainly not going to wait for some standards body like POSIX, IETF, or whoever to get off their keisters to address this issue! Especially when we're...

...dealing with applications that accept 40,000 connections per second. :)

The SO_REUSEPORT socket option

Posted Aug 24, 2013 21:12 UTC (Sat) by lyda (guest, #7429) [Link]

Well, again, it was in BSD first. Credit where it's due.

But generally, Unix "standards" have always trailed implementations. It's just that now Linux and BSD are in reality the primary Unix implementations.

(cue screams about which is the real unix)

The SO_REUSEPORT socket option

Posted Oct 1, 2014 16:09 UTC (Wed) by vsrinivas (subscriber, #56913) [Link]

DragonFly BSD has implemented SO_REUSEPORT since July 2013 too; http://lists.dragonflybsd.org/pipermail/users/2013-July/0... has some very interesting performance numbers.

SO_REUSEPORT naturally aligns with network stacks parallelized like DFly's/Solaris's (hash connection state early, map & fan out to a fixed CPU per connection, no locking until you hit the socket buffer layer).

The SO_REUSEPORT socket option

Posted Mar 15, 2013 5:48 UTC (Fri) by kynde (guest, #73236) [Link]

I'm curious about the described defect, in the sense that where this option would be useful I would've expected to see SYN cookies being used.

The SO_REUSEPORT socket option

Posted Mar 28, 2013 6:04 UTC (Thu) by jpb (guest, #87584) [Link]

Related to "The second of the traditional approaches [...] wake-ups are not fair"

I am just wondering, does that really matter? I mean, all that matters is that whoever has CPU cycles should pick up the next workload. Now I am not sure why the thread itself picking up the load is actually relevant. The CPU being underutilized seems to solve itself: if a core is saturated then effectively an extra thread will pick up the next connection, and this should work well with what the CPU scheduler is already doing for balancing.

The SO_REUSEPORT socket option

Posted Feb 8, 2014 18:23 UTC (Sat) by batth.maninder@gmail.com (guest, #81449) [Link]

The author seems to be talking in the context of multi-core machines. Imagine a thread pinned to each core. It sounds like the author is saying that the load is improperly balanced, hence one core would be at 100% and others maybe at 30%, leading to underutilization of cores.

The SO_REUSEPORT socket option

Posted Aug 2, 2013 16:56 UTC (Fri) by edsiper (guest, #65392) [Link]

It looks very interesting just when thousands of connections are getting stuck in the queue before being accepted. For our case, which is an HTTP server with multiple threads, each with its own epoll(7) queue, it's critical to decide just after the accept(2) which thread will handle that new connection, to keep a balanced load.

The SO_REUSEPORT socket option

Posted Apr 30, 2014 2:45 UTC (Wed) by edsiper (guest, #65392) [Link]

just using SO_REUSEPORT on Monkey HTTP Server:

https://github.com/monkey/monkey/commit/d1da249a0b5e8f576...

The SO_REUSEPORT socket option

Posted Sep 19, 2014 13:51 UTC (Fri) by RamanGupta16 (guest, #98942) [Link]

Is this socket option valid and usable for SCTP also? Using this option, can multiple SCTP sockets on the same host bind to the same port and work precisely as described in this article for TCP/UDP?

The SO_REUSEPORT socket option

Posted Oct 27, 2016 3:13 UTC (Thu) by jaybuff (guest, #97725) [Link]

> the SO_REUSEPORT implementation distributes connections evenly across all of the threads (or processes) that are blocked in accept() on the same port.

I read this to mean that single-threaded processes that have established connections will not receive new connections because they are not waiting in accept(). After testing and reading some code it's clear that processes that aren't in accept() continue to receive new connections.

Workaround for SO_REUSEPORT for seamless reloads

Posted Jul 26, 2017 18:54 UTC (Wed) by kazan417 (guest, #117591) [Link]

A nice workaround for the defect was provided by using SCM_RIGHTS:
https://www.haproxy.com/blog/truly-seamless-reloads-with-...

The SO_REUSEPORT socket option

Posted Feb 17, 2022 14:53 UTC (Thu) by clohr (guest, #156932) [Link]

Hello all,
I'm terribly sorry, but things are still not so clear for me.
What about _multicast_?

It is rather common for multicast listeners to use the SO_REUSEADDR option.
Should this be replaced by SO_REUSEPORT? (or not?)
Should one use both? (if yes, in which order?)
What is the impact?
What are recommendations?

Many thanks for your guidance.
Best regards.

The SO_REUSEPORT socket option

Posted Feb 27, 2023 14:00 UTC (Mon) by deavmi (guest, #163870) [Link]

Just learnt of this today, that's rather neat.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds