problem with default local port(nl_pid) when netlink used both via libnl and directly in same application

Mon May 7 05:05:32 EDT 2012

I've just diagnosed a problem in libvirt that traces back to libnl's
unilateral decision to use getpid() of the calling process as the
default "local port" (nl_pid) for the first netlink socket it creates
for each process.

The problem is that this is also the default value used it a piece of
code running in that process uses direct system calls to create/bind a
netlink socket. In our example, this was the result of calling glibc's
getaddrinfo() function, so we weren't even aware that it was happening.
Even though getaddrinfo() only keeps its netlink socket connected for a
short period, if that is running in a separate thread from the thread
that calls nl_handle_alloc()/nl_connect(), the result will be that the
bind() in nl_connect() fails with EADDRINUSE.

Although we're working around the problem in libvirt, I thought I should
bring it up here as well, since this same problem could bite any other
application that has has similar dual uses of netlink both directly and
via libnl (in many cases without even realizing it).

Here are pointers to the two emails describing the problem I found, as
well as a bug filed against libvirt in RHEL6:


   https://www.redhat.com/archives/libvir-list/2012-May/msg00202.html
   https://www.redhat.com/archives/libvir-list/2012-May/msg00238.html
   https://bugzilla.redhat.com/show_bug.cgi?id=816465

Although the system in question is using libnl-1.1, as far as I can see
the latest upstream from git would have the same behavior.

Note that during the discussion (duplicated below for convenience) I
point out that, while it avoids the collision, simply modifying libnl to
skip the first local port is not a good solution in the general case
because existing applications that use libnl may be depending on that
behavior; as a matter of fact that was the case with communication
between libvirt and lldpad.

Also note that changing nl_connect() to retry the bind with a different
port is also not a full solution, both for the above reason as well as
because libnl allows the application to retrieve local port with
nl_socket_get_local_port() before nl_connect() is called, so an
application may have already stored the local port information, and
silently changing it during nl_connect() would lead to inconsistency
between what the application believes and reality.

-------- Original Message --------
Message-ID: 	<4FA19AA4.2030203 at laine.org>
Date: 	Wed, 02 May 2012 16:35:48 -0400
From: 	Laine Stump <laine at laine.org>
To: 	libvir-list at redhat.com
Subject: 	Re: [libvirt] [PATCH alternative 1] util: fix libvirtd startup
failure due to netlink error
List-Archive: 	<https://www.redhat.com/archives/libvir-list>


On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
> On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
>> On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
>>> On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
>>>> This patch is one alternative to solve the problem detailed in:
>>>>
>>>>   https://bugzilla.redhat.com/show_bug.cgi?id=816465
>>>>
>>>> Some other unidentified library in use by libvirtd (in another thread)
>>>> is apparently temporarily binding to a NETLINK_ROUTE raw socket with
>>>> an address of "pid of libvirtd" during startup. 
>>> Can you identify this library.
>>
>> I made a few attempts, but didn't have any luck and decided to post
>> these patches based on the other evidence I'd gathered. I agree that I
>> would much prefer understanding who is doing this, even if it doesn't
>> change the workaround method.
>>
>>
>>>  It should be possible to do so using
>>> systemtap without all that much trouble.
>> My full experience with systemtap is using some of the examples from
>> your blog posting on the topic :-)
> I assume you mean this one
>
> http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using-systemtap/

Yes, that's the one. I wasn't actually interested in watching the rpc
protocol, but the interaction between libvirtd and the qemu monitor,
which was very helpful.


>> I would love to figure this out, though. The complicating factor I can
>> see (aside from me needing to learn how to write a systemtap script) is
>> that in this case stap needs to be run on a daemonizing process, from
>> the very beginning. If you can give me any better advice than "go read
>> the systemtap website", please do.
> I can't help today, but ping me on IRC tomorrow and I'll help you
> get sorted with systemtap. You can start the stap scripts before
> even running libvirtd, so there's no issue with the daemonizing
> side of things.


With some help from mjw in #systemtap on freenode, I was able to figure
out how to use systemtap to print a backtrace all calls to bind, and
although the failures ceased as soon as I turned on the tracing (of
course), it did at least give me a list of bind calls to research.

It turns out that this is the interesting one (or one example of it,
anyway):

[23876,init
 0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so]
 0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so]
 0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so]
 0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190
[/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0
[/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0
[/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f49d77 : virNetworkDefParse+0x57/0x70
[/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0
[/usr/lib64/libvirt.so.0.9.10]
 0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100
[/usr/lib64/libvirt.so.0.9.10]
 0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd]
 0x7fa695f806d0 : virStateInitialize+0x60/0xd0
[/usr/lib64/libvirt.so.0.9.10]
 0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd]
 0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10]
 0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
 0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
]

__check_pf() is in glibc - sysdeps/unix/sysv/linux/check_pf.c, and it
does directly (not through libnl) call socket(PF_NETLINK, SOCK_RAW,
NETLINK_ROUTE), set the nladdr to 0's, then bind() it. In the kernel,
netlink_bind() uses 0 as an indicator that it should auto-bind,
preferring the pid of the calling process (i.e. "pid of libvirtd") as
its nl_pid in the nladdr. This NETLINK socket is used for a short period
to get a list of interface addresses, and is then closed.

Once main() has started up its other threads, these threads may call
virSocketAddrParse (and thus __check_pf()) any number of times, creating
many socket/bind/close cycles of NETLINK sockets. Meanwhile, in the main
thread, virNetlinkEventServiceStart() is the first function in libvirtd
to call libnl's nl_handle_alloc(), which mistakenly assumes that it has
all control over netlink sockets, and that it can assign the address of
"pid of libvirtd" to this nlhandle. Shortly after that, nl_connect() is
called, which calls bind() with a *fixed* address of "pid of libvirtd".
If another thread happens to currently be in a call to __pf_check(), we
lose the lottery and bind() fails. If not, we win the lottery, bind()
succeeds, and future calls to bind() by __check_pf() will auto-bind to a
different address (unlike with libnl, which assigns subsequent sockets
the address of "pid + (n << 22)" with a maximum of 1024 sockets per
process (i.e. it will always be positive), auto-binds in the kernel will
assign the first free address found between -2047 and -2,147,483,648
(i.e. it will always be negative)).

So, the conclusions to draw from this analysis are:

1) my "alternative 1" patch was only coincidentally succeeding, and
would be about as useful as everyone removing their shoes at airport
security checkpoints.

2) If libvirtd has multiple threads started up before any netlink
sockets have been bound to "pid of libvirtd", there is a possibility
that the first call to nl_connect will fail (due to another thread being
in getaddrinfo/__check_pf()). This is just as true for the macvtap and
netcf uses of libnl as for the virNetlinkEventService use.

3) Once the first call to nl_connect is successfully completed (and/or
if an extra (and otherwise unused) nlhandle is created with
nl_handle_alloc() before creating any nlhandles that are subsequently
nl_connect()ed), the likelyhood of a subsequent nl_connect() failure is
effectively 0, since the address space used by libnl is all positive 32
bit numbers, and the address space used by the auto-bind address in the
kernel is (almost) all negative 32 bit numbers.

4) libnl should, at the very least, be modified to not use exactly
nl_pid = pid, since there is a very high likelihood that particular
address will already be taken by a library function that is calling bind
directly, rather than through libnl. Really, its API shouldn't allow
applications to retrieve the bind address used until after nl_connect()
has already completed successfully; unfortunately, that would require an
incompatible change in the API.

Now that I completely understand the problem, I actually think that
neither of these patches is quite correct; the first because it is
simply bogus, and the second because it only solves the problem to
virNetlinkEventService - it still leaves open the possibility that
macvtap or netcf usage of libnl could result in a failure (although
*only* if one of those uses happened to be called prior to
virNetlinkEventService).

To be 100% safe, I think what we need to do is put an extra call to
nl_handle_alloc() very early in main, prior to calling
virNetServerNew(), which is when all the other worker threads are
created. I'll put together such a patch and send it to the list later
tonight.

[...]


As far as I can tell, the libnl3 code is unchanged in this respect -
nl_handle_alloc() is for some reason replaced with nl_socket_alloc(),
but the address (stored in nl_pid) is still generated at that time, and
in the same manner, is available at any time via
nl_socket_get_local_port(), and is used by nl_connect() to call bind(),
with no attempts to adjust it if EADDRINUSE is encountered.

By the way, a workaround in libnl (as suggested in my conclusion (4)
would also be very simple - that could be done by patching libnl's
generate_local_port to skip over "pid + (0 << 22)", like this:

--- a/lib/socket.c
+++ b/lib/socket.c
@@ -58,7 +58,7 @@ static uint32_t generate_local_port(void)
                if (used_ports_map[i] == 0xFFFFFFFF)
                        continue;
 
-               for (n = 0; n < 32; n++) {
+               for (n = (i ? 0 : 1); n < 32; n++) {
                        if (1UL & (used_ports_map[i] >> n))
                                continue;
 
(this is based on upstream git, but generate_local_port() is identical
in libnl-1.1). Note that I call this a "workaround" rather than a "fix"
- in my opinion, a true fix isn't possible with the existing libnl API -
as I've said before, not only should libnl not be unilaterally deciding
the bind address, it certainly shouldn't be allowing the application to
retrieve the bind address before a successful bind has taken place, as
this eliminates the possibility of changing that address in case of
EADDRINUSE.

The downside to the above patch (compared to patching libvirt) is that
there may be some other application that is (knowingly or unknowingly)
depending on the bind address being based on "pid" rather than "pid + (1
<< 22)", and that application would mysteriously begin to fail. So as a
localized patch that is certain to fix libvirt's libnl problem with no
chance of making life worse for anyone else, I still prefer a patch to
libvirt.

> Agree that creating a netlink handle in libvirtd main() sounds like a
> way to workaround it.
>

I just sent a new patch upstream that does this:

  https://www.redhat.com/archives/libvir-list/2012-May/msg00235.html

As noted there, I don't want to push it until I get experimental
verification that it doesn't damage lldpad<-->libvirtd communication.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.infradead.org/pipermail/libnl/attachments/20120507/a23635ba/attachment-0001.html>