-
Notifications
You must be signed in to change notification settings - Fork 18.9k
Create bridge veth in container netns #49302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1326967 to
9d90ced
Compare
| } | ||
|
|
||
| if err := nlh.LinkAdd(veth); err != nil { | ||
| return nil, types.InternalErrorf("failed to add the host (%s) <=> sandbox (%s) pair interfaces: %v", hostIfName, containerIfName, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to ask whether the underlying error should be preserved (%w), but then noticed that types.InternalErrorf uses fmt.Sprintf, which doesn't allow %w.
And then I noticed that libnetwork's InternalError does NOT implement errdefs.System, which may be intentional or unintentional, but something to look into.
(It not implementing errdefs.System or any other errdefs will mean that while it gets returned as a 500 status, it will also trigger the "FIXME" debug logs;
moby/api/server/httpstatus/status.go
Lines 64 to 68 in 29e2353
| log.G(context.TODO()).WithFields(log.Fields{ | |
| "module": "api", | |
| "error": err, | |
| "error_type": fmt.Sprintf("%T", err), | |
| }).Debug("FIXME: Got an API for which error does not match any expected type!!!") |
In either case; I started to somewhat dismantle the libnetwork errors; first step is in;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, yes - that PR looks good. In terms of this PR, that error is unchanged. It's just moved to here from CreateEndpoint.
| } | ||
| } | ||
|
|
||
| if path, ok := sb.NetnsPath(); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got confused for a second because the InterfaceInfo interface change didn't match this signature, but they're separate things, just with the same name 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I had an "ok" in the driver interface's version too, but got rid of it to make things a little simpler on the basis that the thing that's about to use the path should understand that an empty path is invalid. But, left it in the Sandbox version so that this code didn't need to understand the rules for a usable path.
But, it's all a bit marginal - I'm happy to align them, either by adding back the "ok" for the driver interface, or removing this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably fine as-is; it was just my confusing after seeing the interface changes below, and thinking "they're not the same?".
|
|
||
| // NetnsPath returns the path of the network namespace, if there is one. Else "". | ||
| NetnsPath() string | ||
|
|
||
| // SetCreatedInContainer can be called by the driver to indicate that it's | ||
| // created the network interface in the container's network namespace (so, | ||
| // it doesn't need to be moved there). | ||
| SetCreatedInContainer(bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly question here; changing the interface here means that every implementation now MUST implement these. I think there's quite some places where this resulted in stub-implementations because we tried to shoehorn too many things into a single interface.
I see this is currently implemented in 4 places, but the one that stood out to me were the ones in "remote" files; are those related to plugins? (and if so, do we expect those to implement this?).
Without having looked closely if that'd be better (👈 so possibly silly ideas ahead), we could consider;
- making this an optional "extra" interface, similar to (I think)
InterfaceNameIntobelow;moby/libnetwork/driverapi/driverapi.go
Lines 132 to 136 in 29e2353
// InterfaceNameInfo provides a go interface for the drivers to assign names // to interfaces. type InterfaceNameInfo interface { // SetNames method assigns the srcName and dstPrefix for the interface. SetNames(srcName, dstPrefix string) error
Or (but that may be more of a separate change), we should look at separate interfaces for removes / plugins and some of the builtin parts (ISTR there were many more parts already that effectively were only implemented for the bridge interface, and stubbed-out everywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. Initially, to try this out, I added an extra interface (driverapi.NetnsInfo) that was implemented by libnetwork.EndpointInterface on Linux only.
The EndpointInterface (which is a struct describing a network interface, not a Go interface) is passed to CreateEndpoint as Go interface ifInfo driverapi.InterfaceInfo.
So, in the bridge driver, if it was possible to cast ifInfo to driverapi.NetnsInfo, it got the netns path and set the createdInContainer flag using that interface. Then, I didn't need to add these new methods to all the test implementations of InterfaceInfo, and the new interface wasn't implemented at all for Windows.
But, converting ifInfo to a NetnsInfo seemed a bit hacky. And, @akerouanton's comment from a first glance at the code was "I'm wondering if we could avoid introducing an extra interface though" ... so, I merged it into the InterfaceInfo interface instead. That force-push change is here (for now at least, not sure if this link will carry on working) ... https://github.com/moby/moby/compare/1ecac4d6befa42fb5f4f12b1a69aa4cf21635df5..13269676b23c9e0145b98bf82b597be3be174119
I also considered the InterfaceNameInfo style - but it's fetched from a method in the JoinInfo interface (which CreateEndpoint doesn't have access to). I could have added a similar method to InterfaceInfo, to fetch the NetnsInfo interface - but that seemed a bit pointless ... the test implementations of InterfaceInfo would still have needed an update.
moby/libnetwork/driverapi/driverapi.go
Lines 141 to 144 in 38e76eb
| type JoinInfo interface { | |
| // InterfaceName returns an InterfaceNameInfo go interface to facilitate | |
| // setting the names for the interface. | |
| InterfaceName() InterfaceNameInfo |
Any of these approaches work, and I'm happy to change things ... whatever we all prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gotta be honest here that I don't know what the best approach would be. Let me try to put in words what's all in the back of my head 🙈
- First and foremost; I never know what the scope is for interface changes in the
driverapipackage; I know that libnetwork originally went 100% the reverse of go conventions, and was created with an "interface first" design. This meant that the interface was defined before the implementation, but also that those interfaces became a contract - ☝️ the part I'm always unsure about is the scope of that contract; if it's only what's implemented in this repository, that's fine, but if it's also for external (plugin) implementations, that's more problematic, as changing the interface would break those (but maybe that's not at hand here at all)
As to "where to define";
- Trying to avoid broad interfaces, to prevent too many stubs, but also because they make it really hard to detect dead code; while the stubs on their own are "meh" (just a few lines), I'm more concerned about the ripple effect
- ☝️ I once went cleaning up some parts, to discover we had stub-implementations on Windows that were ONLY there to statisfy stub-implementations of another interface
- 👉 in reality there only was a single use of that part of the interface, and the only use of it was in Linux code, not Windows.
So, we could even define the interface locally in the bridge code, and have something like
if ifInfo2, ok := ifInfo.(interfaceInfoWithFoo); ok {
nlhCtr, ok = createVethInContainerNS(ctx, ifInfo2, containerIfName, veth)
if ok {
ifInfo2.SetCreatedInContainer(true)
defer func() {
if retErr != nil {
nlhCtr.Close()
}
}()
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly curious about that ifInfo2.SetCreatedInContainer(true), because it only has a public "setter" but nothing to read it back (which is all "internal"); would there be a way to deduct this from any of the other properties? (e.g. NetnsPath() being non-empty?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of the contract ... we've been working on the assumption that libnetwork is effectively "internal", and occasionally discussed making it actually internal to clarify that. The "remote" network driver translates all this stuff into an HTTP interface, so there's some separation.
I originally put the NetnsInfo in the driverapi package because that's where everything else was, it seemed a bit less hidden, and (although I've not checked in detail) we can probably give the ipvlan/macvlan drivers the same treatment - so, they'd either need to use the driverapi interface or also define their own.
The explicit SetCreatedInContainer(true) is there, instead of just relying on the path being set, in case the driver fails to create the veth in the container's netns for any reason - in which case, it just falls back to the old behaviour. No "getter" is needed in the interface because the info only flows upwards from the driver, and libnet has the value in the EndpointInterface struct that implements the Go interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "remote" network driver translates all this stuff into an HTTP interface, so there's some separation.
Thanks! That's the part I always got confused about; most notably because those wouldn't be tested in our CI, so breaking changes would not be detected, and wording like this doesn't help there 🙈;
moby/libnetwork/driverapi/driverapi.go
Lines 13 to 14 in f32ec6b
| // Driver is an interface that every plugin driver needs to implement. | |
| type Driver interface { |
More so because that one includes InterfaceInfo in its signature, so changing that interface also changes the Driver interface;
moby/libnetwork/driverapi/driverapi.go
Line 44 in f32ec6b
| CreateEndpoint(ctx context.Context, nid, eid string, ifInfo InterfaceInfo, options map[string]interface{}) error |
If this is indeed all "internal interfaces", then I have no problem with the current implementation. We could still (and perhaps should) splitting up the interface to address my concerns about the ripple-effect of the stubs, but if those are all (effectively) internal changes, then that's something we can do in follow-ups without concerns of that meaning a change in contract elsewhere.
In terms of the contract ... we've been working on the assumption that libnetwork is effectively "internal", and occasionally discussed making it actually internal to clarify that.
Perhaps we should consider this indeed; at least it could help with confusions I had like the above; "driverapi" is for sure advertising too much as (public) contract; having it internal could help take away any doubt.
Before that, we could consider addressing these concerns through documentation, and updating the descriptions of these interfaces to define scope.
The explicit SetCreatedInContainer(true) is there, instead of just relying on the path being set, in case the driver fails to create the veth in the container's netns for any reason - in which case, it just falls back to the old behaviour. No "getter" is needed in the interface because the info only flows upwards from the driver, and libnet has the value in the EndpointInterface struct that implements the Go interface.
Gotcha! Yeah, part of that was also driven by my other concerns; if this was an interface we'd be stuck with, then I was looking for cleaner (or "as small as possible") changes; if this is all internal, then definitely not a concern (i.e., we can cleanup / improve as we go).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's stick with this for now (and get thoughts from others if there are any) ... not having to assert the driverinfo.InterfaceInfo to a different interface type in the bridge driver probably makes things a bit less confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best approach is to totally get rid of those interfaces and go The Go Way. I experimented with that approach last year, see #47217. It's definitely doable, but libnetwork being quite fragile, it's not so easy.
And, @akerouanton's comment from a first glance at the code was "I'm wondering if we could avoid introducing an extra interface though"
Ultimately we should merge CreateEndpoint and Join into a single operation. This would help making those interfaces useless.
thaJeztah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
but could use a review from @akerouanton 🤗
| if nspath := ifInfo.NetnsPath(); nspath == "" { | ||
| log.G(ctx).WithField("ifname", containerIfName).Debug("No container netns path, creating interface in host netns") | ||
| } else if netnsh, err := netns.GetFromPath(nspath); err != nil { | ||
| log.G(ctx).WithFields(log.Fields{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for a follow-up; wondering if extracting this to a separate function could make the logic easier (being able to do an early return from that function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I've tried it a few ways, this seemed least-bad.
I wanted to split the whole createVeth out (including the LinkAdd) to make it unit-testable. But splitting out the netns-opening part gets messy in terms of making sure handles are closed when they should be, and passing enough info around for logging.
9d90ced to
e58382a
Compare
|
Rebased to resolve conflicts. |
e58382a to
e3d98b7
Compare
Signed-off-by: Rob Murray <[email protected]>
Signed-off-by: Rob Murray <[email protected]>
e3d98b7 to
fda5868
Compare
|
@akerouanton PTAL 🤗 |
|
The second of the two commits I just added is to make sure |
| }() | ||
|
|
||
| iface := &testInterface{netnsPath: netnsPath} | ||
| nlhCtr, err := createVeth(context.TODO(), hostIfName, containerIfName, iface, nlh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no plan to plumb OTel in unit tests, so there no need to mark this context as a TODO:
| nlhCtr, err := createVeth(context.TODO(), hostIfName, containerIfName, iface, nlh) | |
| nlhCtr, err := createVeth(context.Background(), hostIfName, containerIfName, iface, nlh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| } else { | ||
| defer netnsh.Close() | ||
|
|
||
| if nh, err := nlwrap.NewHandleAt(netnsh, syscall.NETLINK_ROUTE); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tend to abbreviate 'netlink handle' to nlh everywhere, better stick with that here:
| if nh, err := nlwrap.NewHandleAt(netnsh, syscall.NETLINK_ROUTE); err != nil { | |
| if nlh, err := nlwrap.NewHandleAt(netnsh, syscall.NETLINK_ROUTE); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a good idea, it'd shadow the function's param.
| dstEpi.dstPrefix = epi.dstPrefix | ||
| dstEpi.v4PoolID = epi.v4PoolID | ||
| dstEpi.v6PoolID = epi.v6PoolID | ||
| dstEpi.createdInContainer = epi.createdInContainer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that needed? Once the veth pair is created, it doesn't matter if both ends where created in the same netns initially. If the daemon crashes midway during veth creation, and both ends are in the same namespace, deleting the host end should be enough to delete both interfaces.
Am I missing somnething?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes ... even though it looks like the Daemon passes the Endpoint from the Create (where the flag is set) to the Join (where it's needed) - sbJoin pulls the Endpoint from the store ...
Line 500 in fda5868
| ep, err = n.getEndpointFromStore(ep.ID()) |
| } | ||
|
|
||
| if path, ok := sb.NetnsPath(); ok { | ||
| createOptions = append(createOptions, libnetwork.WithNetnsPath(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems to be the only place where the daemon has to know about netns, otherwise this concept is abstracted away through Sandbox. I think it'd be best to keep it that way.
Instead we could add an extra argument to the CreateEndpoint method (or abuse the options argument) to let the driver know where to create the endpoint (if the valus isn't empty). This would also remove the need for NetnsPath in InterfaceInfo.
In an ideal world, with Join merged into CreateEndpoint, that's probably what the method would look like. Better try to get closer to that state. That will be one less thing to change in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned offline - I might be missing something, but...
Making it a param to libnetwork's CreateEndpoint wouldn't help, it'd just be equivalent to CreateEndpoint option set up in the current code. So, I think you mean make it a param to the driver's CreateEndpoint, where the veth is created.
But, until the Join, libnetwork doesn't know which Sandbox (and therefore netns) the Endpoint will end up in, only the Daemon knows that. In the Join it's too late, the veth has already been created.
So, I don't see a way libnetwork could pass the netns path into the driver's CreateEndpoint, without this extra info from the Daemon.
Merging Create/Join would sort it out ... but that's certainly out of scope for this PR. If/when they're merged, it'll just naturally go-away and the Daemon won't need to be involved any more - because it'll tell libnetwork about the Network and the Sandbox in the same "CreateAndJoin" call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is though - the only thing the Daemon knows about the netns path is that it's a thing it needs to forward if it can. That's why there's an "ok" return val from sb.NetnsPath, the Daemon doesn't need to know what an empty path means. Although, at the moment, passing on an empty path would be fine.
IPv6-ness of the Sandbox is in the same category - the Daemon shouldn't need to know about it, but it has to tell CreateEndpoint about it, because libnetwork doesn't know which Sandbox is involved until the Join. (It's handled immediately above the lines you've highlighted.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just pass the Sandbox to network.CreateEndpoint, as a WithSandbox endpoint option. That'd solve the problem here, as well as for IPv6-ness, and there might be more we can tidy up.
I'll look at doing that in a separate PR that we can merge first, then update this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing this more during a 1-1, we decided to get it in as is and address my original comment in a follow-up.
Since commit 933fcc9 (Re-remove the SetKey OCI prestart hook), the network namespace will be set up before endpoints are created in most cases, apart from build containers. So, when possible, create the veth with one end in that netns to save moving it in later. On my host, that saves about 20ms for each bridge network a container is connected to. Signed-off-by: Rob Murray <[email protected]>
fda5868 to
65120d5
Compare
- What I did
Since commit 933fcc9 (Re-remove the SetKey OCI prestart hook), the network namespace will be set up before endpoints are created in most cases (apart from build containers).
So, when possible, create the veth with one end in that netns to save moving it in later.
On my Mac M2 host, that reduces container startup time by about 20ms for each bridge network it's connected to.
- How I did it
Added methods for a driver to ask for the network namespace, and to announce that it's created the interface in the container's netns ... if anything goes wrong, fall back to the old behaviour, creating the veth in the host's namespace.
Tell the
oslcode when it doesn't need to move the interface into the namespace.No change to the
srcNameproperty, the name the interface would have been created with in the host namespace. When created in the container's namespace, the interface has that same name (it has to be renamed, as it was before).- How to verify it
New and existing tests.
Before -
moveLinkin this run took 26ms of the total 94ms ...After -
moveLinkisn't called, reducing the total time to 68ms ...- Description for the changelog