Skip to content

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Jan 19, 2025

- What I did

Since commit 933fcc9 (Re-remove the SetKey OCI prestart hook), the network namespace will be set up before endpoints are created in most cases (apart from build containers).

So, when possible, create the veth with one end in that netns to save moving it in later.

On my Mac M2 host, that reduces container startup time by about 20ms for each bridge network it's connected to.

- How I did it

Added methods for a driver to ask for the network namespace, and to announce that it's created the interface in the container's netns ... if anything goes wrong, fall back to the old behaviour, creating the veth in the host's namespace.

Tell the osl code when it doesn't need to move the interface into the namespace.

No change to the srcName property, the name the interface would have been created with in the host namespace. When created in the container's namespace, the interface has that same name (it has to be renamed, as it was before).

- How to verify it

New and existing tests.

Before - moveLink in this run took 26ms of the total 94ms ...

before

After - moveLink isn't called, reducing the total time to 68ms ...

after

- Description for the changelog

Faster connection to bridge networks, in most cases.

@robmry robmry added kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny area/networking Networking area/networking/d/bridge Networking labels Jan 19, 2025
@robmry robmry self-assigned this Jan 19, 2025
@robmry robmry force-pushed the create_veth_in_container branch 2 times, most recently from 1326967 to 9d90ced Compare January 20, 2025 17:14
@robmry robmry added this to the 28.0.0 milestone Jan 20, 2025
@robmry robmry marked this pull request as ready for review January 20, 2025 18:38
@robmry robmry changed the title WIP - create bridge veth in container netns Create bridge veth in container netns Jan 21, 2025
}

if err := nlh.LinkAdd(veth); err != nil {
return nil, types.InternalErrorf("failed to add the host (%s) <=> sandbox (%s) pair interfaces: %v", hostIfName, containerIfName, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to ask whether the underlying error should be preserved (%w), but then noticed that types.InternalErrorf uses fmt.Sprintf, which doesn't allow %w.

And then I noticed that libnetwork's InternalError does NOT implement errdefs.System, which may be intentional or unintentional, but something to look into.

(It not implementing errdefs.System or any other errdefs will mean that while it gets returned as a 500 status, it will also trigger the "FIXME" debug logs;

log.G(context.TODO()).WithFields(log.Fields{
"module": "api",
"error": err,
"error_type": fmt.Sprintf("%T", err),
}).Debug("FIXME: Got an API for which error does not match any expected type!!!")

In either case; I started to somewhat dismantle the libnetwork errors; first step is in;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, yes - that PR looks good. In terms of this PR, that error is unchanged. It's just moved to here from CreateEndpoint.

}
}

if path, ok := sb.NetnsPath(); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got confused for a second because the InterfaceInfo interface change didn't match this signature, but they're separate things, just with the same name 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I had an "ok" in the driver interface's version too, but got rid of it to make things a little simpler on the basis that the thing that's about to use the path should understand that an empty path is invalid. But, left it in the Sandbox version so that this code didn't need to understand the rules for a usable path.

But, it's all a bit marginal - I'm happy to align them, either by adding back the "ok" for the driver interface, or removing this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably fine as-is; it was just my confusing after seeing the interface changes below, and thinking "they're not the same?".

Comment on lines +130 to +137

// NetnsPath returns the path of the network namespace, if there is one. Else "".
NetnsPath() string

// SetCreatedInContainer can be called by the driver to indicate that it's
// created the network interface in the container's network namespace (so,
// it doesn't need to be moved there).
SetCreatedInContainer(bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly question here; changing the interface here means that every implementation now MUST implement these. I think there's quite some places where this resulted in stub-implementations because we tried to shoehorn too many things into a single interface.

I see this is currently implemented in 4 places, but the one that stood out to me were the ones in "remote" files; are those related to plugins? (and if so, do we expect those to implement this?).

Screenshot 2025-01-21 at 12 51 40

Without having looked closely if that'd be better (👈 so possibly silly ideas ahead), we could consider;

  • making this an optional "extra" interface, similar to (I think) InterfaceNameInto below;
    // InterfaceNameInfo provides a go interface for the drivers to assign names
    // to interfaces.
    type InterfaceNameInfo interface {
    // SetNames method assigns the srcName and dstPrefix for the interface.
    SetNames(srcName, dstPrefix string) error

Or (but that may be more of a separate change), we should look at separate interfaces for removes / plugins and some of the builtin parts (ISTR there were many more parts already that effectively were only implemented for the bridge interface, and stubbed-out everywhere else.

Copy link
Contributor Author

@robmry robmry Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed. Initially, to try this out, I added an extra interface (driverapi.NetnsInfo) that was implemented by libnetwork.EndpointInterface on Linux only.

The EndpointInterface (which is a struct describing a network interface, not a Go interface) is passed to CreateEndpoint as Go interface ifInfo driverapi.InterfaceInfo.

So, in the bridge driver, if it was possible to cast ifInfo to driverapi.NetnsInfo, it got the netns path and set the createdInContainer flag using that interface. Then, I didn't need to add these new methods to all the test implementations of InterfaceInfo, and the new interface wasn't implemented at all for Windows.

But, converting ifInfo to a NetnsInfo seemed a bit hacky. And, @akerouanton's comment from a first glance at the code was "I'm wondering if we could avoid introducing an extra interface though" ... so, I merged it into the InterfaceInfo interface instead. That force-push change is here (for now at least, not sure if this link will carry on working) ... https://github.com/moby/moby/compare/1ecac4d6befa42fb5f4f12b1a69aa4cf21635df5..13269676b23c9e0145b98bf82b597be3be174119

I also considered the InterfaceNameInfo style - but it's fetched from a method in the JoinInfo interface (which CreateEndpoint doesn't have access to). I could have added a similar method to InterfaceInfo, to fetch the NetnsInfo interface - but that seemed a bit pointless ... the test implementations of InterfaceInfo would still have needed an update.

type JoinInfo interface {
// InterfaceName returns an InterfaceNameInfo go interface to facilitate
// setting the names for the interface.
InterfaceName() InterfaceNameInfo

Any of these approaches work, and I'm happy to change things ... whatever we all prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotta be honest here that I don't know what the best approach would be. Let me try to put in words what's all in the back of my head 🙈

  • First and foremost; I never know what the scope is for interface changes in the driverapi package; I know that libnetwork originally went 100% the reverse of go conventions, and was created with an "interface first" design. This meant that the interface was defined before the implementation, but also that those interfaces became a contract
  • ☝️ the part I'm always unsure about is the scope of that contract; if it's only what's implemented in this repository, that's fine, but if it's also for external (plugin) implementations, that's more problematic, as changing the interface would break those (but maybe that's not at hand here at all)

As to "where to define";

  • Trying to avoid broad interfaces, to prevent too many stubs, but also because they make it really hard to detect dead code; while the stubs on their own are "meh" (just a few lines), I'm more concerned about the ripple effect
  • ☝️ I once went cleaning up some parts, to discover we had stub-implementations on Windows that were ONLY there to statisfy stub-implementations of another interface
  • 👉 in reality there only was a single use of that part of the interface, and the only use of it was in Linux code, not Windows.

So, we could even define the interface locally in the bridge code, and have something like

if ifInfo2, ok := ifInfo.(interfaceInfoWithFoo); ok {
    nlhCtr, ok = createVethInContainerNS(ctx, ifInfo2, containerIfName, veth)
    if ok {
        ifInfo2.SetCreatedInContainer(true)
        defer func() {
            if retErr != nil {
                nlhCtr.Close()
            }
        }()
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly curious about that ifInfo2.SetCreatedInContainer(true), because it only has a public "setter" but nothing to read it back (which is all "internal"); would there be a way to deduct this from any of the other properties? (e.g. NetnsPath() being non-empty?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the contract ... we've been working on the assumption that libnetwork is effectively "internal", and occasionally discussed making it actually internal to clarify that. The "remote" network driver translates all this stuff into an HTTP interface, so there's some separation.

I originally put the NetnsInfo in the driverapi package because that's where everything else was, it seemed a bit less hidden, and (although I've not checked in detail) we can probably give the ipvlan/macvlan drivers the same treatment - so, they'd either need to use the driverapi interface or also define their own.

The explicit SetCreatedInContainer(true) is there, instead of just relying on the path being set, in case the driver fails to create the veth in the container's netns for any reason - in which case, it just falls back to the old behaviour. No "getter" is needed in the interface because the info only flows upwards from the driver, and libnet has the value in the EndpointInterface struct that implements the Go interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "remote" network driver translates all this stuff into an HTTP interface, so there's some separation.

Thanks! That's the part I always got confused about; most notably because those wouldn't be tested in our CI, so breaking changes would not be detected, and wording like this doesn't help there 🙈;

// Driver is an interface that every plugin driver needs to implement.
type Driver interface {

More so because that one includes InterfaceInfo in its signature, so changing that interface also changes the Driver interface;

CreateEndpoint(ctx context.Context, nid, eid string, ifInfo InterfaceInfo, options map[string]interface{}) error

If this is indeed all "internal interfaces", then I have no problem with the current implementation. We could still (and perhaps should) splitting up the interface to address my concerns about the ripple-effect of the stubs, but if those are all (effectively) internal changes, then that's something we can do in follow-ups without concerns of that meaning a change in contract elsewhere.

In terms of the contract ... we've been working on the assumption that libnetwork is effectively "internal", and occasionally discussed making it actually internal to clarify that.

Perhaps we should consider this indeed; at least it could help with confusions I had like the above; "driverapi" is for sure advertising too much as (public) contract; having it internal could help take away any doubt.

Before that, we could consider addressing these concerns through documentation, and updating the descriptions of these interfaces to define scope.

The explicit SetCreatedInContainer(true) is there, instead of just relying on the path being set, in case the driver fails to create the veth in the container's netns for any reason - in which case, it just falls back to the old behaviour. No "getter" is needed in the interface because the info only flows upwards from the driver, and libnet has the value in the EndpointInterface struct that implements the Go interface.

Gotcha! Yeah, part of that was also driven by my other concerns; if this was an interface we'd be stuck with, then I was looking for cleaner (or "as small as possible") changes; if this is all internal, then definitely not a concern (i.e., we can cleanup / improve as we go).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's stick with this for now (and get thoughts from others if there are any) ... not having to assert the driverinfo.InterfaceInfo to a different interface type in the bridge driver probably makes things a bit less confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best approach is to totally get rid of those interfaces and go The Go Way. I experimented with that approach last year, see #47217. It's definitely doable, but libnetwork being quite fragile, it's not so easy.

And, @akerouanton's comment from a first glance at the code was "I'm wondering if we could avoid introducing an extra interface though"

Ultimately we should merge CreateEndpoint and Join into a single operation. This would help making those interfaces useless.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

but could use a review from @akerouanton 🤗

Comment on lines +1307 to +1273
if nspath := ifInfo.NetnsPath(); nspath == "" {
log.G(ctx).WithField("ifname", containerIfName).Debug("No container netns path, creating interface in host netns")
} else if netnsh, err := netns.GetFromPath(nspath); err != nil {
log.G(ctx).WithFields(log.Fields{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for a follow-up; wondering if extracting this to a separate function could make the logic easier (being able to do an early return from that function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I've tried it a few ways, this seemed least-bad.

I wanted to split the whole createVeth out (including the LinkAdd) to make it unit-testable. But splitting out the netns-opening part gets messy in terms of making sure handles are closed when they should be, and passing enough info around for logging.

@robmry robmry force-pushed the create_veth_in_container branch from 9d90ced to e58382a Compare January 23, 2025 09:51
@robmry
Copy link
Contributor Author

robmry commented Jan 23, 2025

Rebased to resolve conflicts.

@robmry robmry force-pushed the create_veth_in_container branch from e58382a to e3d98b7 Compare January 23, 2025 10:56
@robmry robmry force-pushed the create_veth_in_container branch from e3d98b7 to fda5868 Compare January 23, 2025 14:55
@thaJeztah
Copy link
Member

@akerouanton PTAL 🤗

@robmry
Copy link
Contributor Author

robmry commented Jan 23, 2025

The second of the two commits I just added is to make sure waitForIfUpped is looking at the interface in the container's netns ... originally, moveLink returned the netns handle it should use, but it's not called in the new flow. So, things appeared to work when the host's netns had an interface with the same ifIndex as the container's interface, but a 5s timeout was hit when the host had no such interface. (And the first commit would have made that easier to debug.)

}()

iface := &testInterface{netnsPath: netnsPath}
nlhCtr, err := createVeth(context.TODO(), hostIfName, containerIfName, iface, nlh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no plan to plumb OTel in unit tests, so there no need to mark this context as a TODO:

Suggested change
nlhCtr, err := createVeth(context.TODO(), hostIfName, containerIfName, iface, nlh)
nlhCtr, err := createVeth(context.Background(), hostIfName, containerIfName, iface, nlh)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} else {
defer netnsh.Close()

if nh, err := nlwrap.NewHandleAt(netnsh, syscall.NETLINK_ROUTE); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tend to abbreviate 'netlink handle' to nlh everywhere, better stick with that here:

Suggested change
if nh, err := nlwrap.NewHandleAt(netnsh, syscall.NETLINK_ROUTE); err != nil {
if nlh, err := nlwrap.NewHandleAt(netnsh, syscall.NETLINK_ROUTE); err != nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a good idea, it'd shadow the function's param.

dstEpi.dstPrefix = epi.dstPrefix
dstEpi.v4PoolID = epi.v4PoolID
dstEpi.v6PoolID = epi.v6PoolID
dstEpi.createdInContainer = epi.createdInContainer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that needed? Once the veth pair is created, it doesn't matter if both ends where created in the same netns initially. If the daemon crashes midway during veth creation, and both ends are in the same namespace, deleting the host end should be enough to delete both interfaces.

Am I missing somnething?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ... even though it looks like the Daemon passes the Endpoint from the Create (where the flag is set) to the Join (where it's needed) - sbJoin pulls the Endpoint from the store ...

ep, err = n.getEndpointFromStore(ep.ID())

}

if path, ok := sb.NetnsPath(); ok {
createOptions = append(createOptions, libnetwork.WithNetnsPath(path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to be the only place where the daemon has to know about netns, otherwise this concept is abstracted away through Sandbox. I think it'd be best to keep it that way.

Instead we could add an extra argument to the CreateEndpoint method (or abuse the options argument) to let the driver know where to create the endpoint (if the valus isn't empty). This would also remove the need for NetnsPath in InterfaceInfo.

In an ideal world, with Join merged into CreateEndpoint, that's probably what the method would look like. Better try to get closer to that state. That will be one less thing to change in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned offline - I might be missing something, but...

Making it a param to libnetwork's CreateEndpoint wouldn't help, it'd just be equivalent to CreateEndpoint option set up in the current code. So, I think you mean make it a param to the driver's CreateEndpoint, where the veth is created.

But, until the Join, libnetwork doesn't know which Sandbox (and therefore netns) the Endpoint will end up in, only the Daemon knows that. In the Join it's too late, the veth has already been created.

So, I don't see a way libnetwork could pass the netns path into the driver's CreateEndpoint, without this extra info from the Daemon.

Merging Create/Join would sort it out ... but that's certainly out of scope for this PR. If/when they're merged, it'll just naturally go-away and the Daemon won't need to be involved any more - because it'll tell libnetwork about the Network and the Sandbox in the same "CreateAndJoin" call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is though - the only thing the Daemon knows about the netns path is that it's a thing it needs to forward if it can. That's why there's an "ok" return val from sb.NetnsPath, the Daemon doesn't need to know what an empty path means. Although, at the moment, passing on an empty path would be fine.

IPv6-ness of the Sandbox is in the same category - the Daemon shouldn't need to know about it, but it has to tell CreateEndpoint about it, because libnetwork doesn't know which Sandbox is involved until the Join. (It's handled immediately above the lines you've highlighted.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just pass the Sandbox to network.CreateEndpoint, as a WithSandbox endpoint option. That'd solve the problem here, as well as for IPv6-ness, and there might be more we can tidy up.

I'll look at doing that in a separate PR that we can merge first, then update this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing this more during a 1-1, we decided to get it in as is and address my original comment in a follow-up.

Since commit 933fcc9 (Re-remove the SetKey OCI prestart hook),
the network namespace will be set up before endpoints are
created in most cases, apart from build containers.

So, when possible, create the veth with one end in that netns
to save moving it in later. On my host, that saves about 20ms
for each bridge network a container is connected to.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the create_veth_in_container branch from fda5868 to 65120d5 Compare January 24, 2025 18:44
@akerouanton akerouanton merged commit ac23ddd into moby:master Jan 28, 2025
147 checks passed
@robmry robmry deleted the create_veth_in_container branch January 28, 2025 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/d/bridge Networking area/networking Networking area/performance impact/changelog kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants