Skip to content

(CEL version) Add experimental syscall_overrides config option#3369

Draft
stevenengler wants to merge 2 commits intoshadow:mainfrom
stevenengler:syscall-stub
Draft

(CEL version) Add experimental syscall_overrides config option#3369
stevenengler wants to merge 2 commits intoshadow:mainfrom
stevenengler:syscall-stub

Conversation

@stevenengler
Copy link
Contributor

@stevenengler stevenengler commented Jul 14, 2024

In #3280 we discussed allowing users to add stubs for features that Shadow is missing. There are a lot of different features that users may want to add stubs for (ex: syscalls, socket options, ioctl commands, misc syscall flags, etc).

This PR adds an experimental syscall_overrides configuration option that allows you to add custom overrides for syscalls. This allows you to add simple/limited custom syscall implementations using CEL expressions.

There have been two recent requests for stubs/workarounds: One for having setsockopt with SOL_IP and IP_BIND_ADDRESS_NO_PORT return 0, and one for returning EINVAL from sendfile. This feature would support both these usecases without needing to modify Shadow.

experimental.syscall_overrides

Default: {}
Type: Object

Override the behaviour of emulated syscalls.

A map of syscall numbers to lists of CEL expressions. For each syscall number listed, a list of expressions will be evaluated before running Shadow's syscall handler. If an expression returns a non-null integer result, that result will be used as the return value for the syscall. Otherwise if all expressions return null, Shadow will perform the syscall as usual.

The following variables are available in expressions:

  • args: Array of syscall arguments cast as integers.
  • host: The hostname of the current host.
  • process: The name of the current process.
  • pid: The pid of the current process.
  • tid: The tid of the current thread.

Example:

experimental:
  syscall_overrides:
    # SYS_SETSOCKOPT = 54, SOL_IP = 0, IP_BIND_ADDRESS_NO_PORT = 24, IP_TTL = 2
    54:
    - # setsockopt(_, SOL_IP, IP_BIND_ADDRESS_NO_PORT, ...) -> 0
      "args[1] == 0 && args[2] == 24 ? 0 : null"
    - # setsockopt(_, SOL_IP, IP_TTL, ...) -> 0
      "args[1] == 0 && args[2] == 2 ? 0 : null"

To easily find the integer value for a given libc constant, you can search for the constant name within the libc crate's documentation. For example to find the integer value of SO_REUSEADDR, you can search for SO_REUSEADDR at https://docs.rs/libc/latest/libc/.

There are some downsides:

  1. This brings in a bunch more dependencies.
  2. The "cel-rust" project isn't super popular and may not receive support for a long time. I think we can simply drop this experimental feature if it were to ever become a maintenance burden.
  3. This may reduce the simulation runtime performance, but only in cases where these expressions are used (typical tor simulations don't require any of these overrides so would not be affected). The expressions are compiled only once at the start of the simulation, so I'd expect them to be fairly performant.
  4. An expression like "args[1] == 0 && args[2] == 24 ? 0 : null" maybe looks a little ugly at first glance, but I think it's pretty straightforward. In this case if the second argument is 0 (SOL_IP) and the third is 24 (IP_BIND_ADDRESS_NO_PORT), then the syscall should return 0.

@robgjansen @sporksmith Do you have any feelings about this approach?

@stevenengler stevenengler added this to the Support missing syscalls milestone Jul 14, 2024
@stevenengler stevenengler self-assigned this Jul 14, 2024
@github-actions github-actions bot added Component: Main Composing the core Shadow executable Component: Build Build/install tools and dependencies Component: Documentation In-repository documentation, under docs/ labels Jul 14, 2024
@sporksmith
Copy link
Contributor

Intriguing! It is a significant dependency to bring in, but otoh the flexibility this would provide to give users an alternative to patching shadow would be pretty nice.

Did you consider any more-established extension languages like lua or guile?

@stevenengler
Copy link
Contributor Author

stevenengler commented Jul 14, 2024

Did you consider any more-established extension languages like lua or guile?

My reasons for using CEL were that it's pure-rust (no need to install and link to a separate interpreter), you can explicitly compile it to bytecode ahead of time so is probably decently fast, it's straightforward to add to shadow (about 100 lines of code), and it's simple (expressions are similar to a weakly typed C). That being said I'm not very familiar with lua or guile, so they might be a better choice.

I think if there's a nice way to optionally link to an interpreter so that users don't need to install one to use shadow, but only if they want to use this feature, that could work too.

@sporksmith
Copy link
Contributor

My reasons for using CEL were that it's pure-rust (no need to install and link to a separate interpreter), you can explicitly compile it ahead of time so is probably decently fast, it's straightforward to add to shadow (about 100 lines of code), and it's simple (expressions are similar to a weakly typed C). That being said I'm not very familiar with lua or guile, so they might be a better choice.

Those seem like pretty good reasons.

Looking at crate.io a bit, it looks like there's a fairly mature and popular crate mlua for lua. My main reasoning for considering it instead of CEL is that I think more devs have prior experience with lua, and that it's a quite mature and fast language designed for pretty much this purpose. It's a more full-featured and therefore complex language, which would be a tradeoff. The implementation "LuaJit" (which the mlua wrapper supports) is quite fast and does JIT-compilation.

It isn't pure rust though; I think the user would need a system lua dev package installed so that it can be linked in.

It might be worth giving it a try to see what what that would look like and how it compares. I don't really feel super strongly about lua in particular, but generally it seems worth being a bit careful about this decision. Even if we call it an "experimental" feature, in practice if users start using it they could develop a fair bit of investment into it, making it pretty disruptive if we take it away again or switch to a different language.

(I didn't see a similarly popular/mature crate for guile)

I think if there's a nice way to optionally link to an interpreter so that users don't need to install them to use shadow, but only if they want to use this feature, that could work too.

I think that could be done fairy easily at compile time by making it a cargo feature.

@sporksmith
Copy link
Contributor

[lua] is a more full-featured and therefore complex language

Even if we call it an "experimental" feature, in practice if users start using it they could develop a fair bit of investment into it

Maybe this is an argument in favor of cel over lua. It's limited nature would prevent users from doing anything much more complex than the intended use case of "custom error codes for unimplemented features"

@robgjansen
Copy link
Member

robgjansen commented Jul 18, 2024

Overall this seems positive and helpful! I think this approach is interesting and a powerful way for users to override syscalls.

Some thoughts:

  • When users need a new stub, what is blocking them? It seems to me that the hard part here is doing the debugging to know how to write the stub / cel statement, not the actual rust stub / statement itself. So the main benefit here is that users don't have to maintain a custom branch with their own stubs.
  • This feature may still be useful even if we have fully implemented all syscall suboptions, eg., if a user wants to override our internal impl.
  • How "significant" is the new dependency? If the main practical impact is longer building times, do we know how much the build time increases with the new dependencies?
  • I would argue against an approach that requires adding a new system dependency, as those tend to eventually introduce maintenance issues, and I think one of the (sub)goals of this PR is to make things nicer while we are in maintain mode.
  • I wouldn't worry too much about us removing the feature later. It would be in our experimental section, and it's reasonable to assume users should take our policy seriously.

@sporksmith
Copy link
Contributor

I wouldn't worry too much about us removing the feature later. It would be in our experimental section, and it's reasonable to assume users should take our policy seriously.

Eh, that argument loses some weight once we start advising users to use the feature to get their use-case working. OTOH I don't think we'd recommend doing anything more than the simple "one-liner"s as in Steven's example, so it hopefully wouldn't be hugely disruptive if we change or remove this feature later.

But yeah I'm fine with moving forward with CEL. No extra system dependencies + fairly constrained such that users can't do anything too involved with it SGTM

@stevenengler stevenengler changed the title Add experimental syscall_overrides config option (CEL version) Add experimental syscall_overrides config option Jul 24, 2024
@stevenengler
Copy link
Contributor Author

It might be worth giving it a try to see what what that would look like and how it compares.

I added a Lua version in #3375. I'd expect the performance to be lower since the only way I can see to make it work requires recompiling each expression every time.

It isn't pure rust though; I think the user would need a system lua dev package installed so that it can be linked in.

I noticed that mlua has the "vendored" feature, which seems to solve this problem. I'm guessing it builds the C interpreter.

Maybe this is an argument in favor of cel over lua. It's limited nature would prevent users from doing anything much more complex than the intended use case of "custom error codes for unimplemented features"

Yeah a downside of Lua is that I don't think you can require the user to write a pure function, which means they can modify global state and do other things. Maybe you could write an expression that proxies shadow's network requests to the real internet :)

This allows users to override the behaviour of emulated syscalls using
CEL expressions.
@stevenengler
Copy link
Contributor Author

stevenengler commented Jul 29, 2024

How "significant" is the new dependency? If the main practical impact is longer building times, do we know how much the build time increases with the new dependencies?

Here are the unscientific numbers when doing a clean ./setup build --test --debug on my laptop:

main (d82059b):

  • wall-clock time: 4m37.541s

syscall-stub (377e1c2; this PR):

  • wall-clock time: 4m57.528s

syscall-stub-lua (b40e02f):

  • wall-clock time: 4m37.096s

So this PR adds about 20 seconds (~7% increase) to a clean build. This biggest reason is that cel-interpreter requires the lalrpop crate:

1722291157_grim

(Aside: we build syn twice since cbindgen hasn't published a new release in a while.)

Edit: Added timings for the Lua version.

@stevenengler
Copy link
Contributor Author

stevenengler commented Jul 30, 2024

I'm becoming more confused about the build time. Creating a new project with a dependency of 'cel-interpreter' takes a total of 20 seconds to build and the lalrpop sub-dependency builds in 8.8 seconds (2.4 seconds for codegen). When adding a dependency of lalrpop to shadow without even using it, lalrpop builds in 51 seconds (43 seconds for codegen). I don't understand why it takes significantly longer to build in Shadow, especially when it's not even used. Code was compiled in debug mode with opt-level = 1. Even the syn library takes much longer to build in Shadow than in a new crate with the same feature flags (53 seconds vs 5.5 seconds).

I think the only thing we can do is to put this syscall_overrides behind a cargo feature flag like we do with --use-perf-timers so that people only build it if they need it. This becomes painful to test in the CI though, since we'd need to then rebuild shadow to test it with this feature. Otherwise it might end up like the perf-timers feature which doesn't pass all of the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Build Build/install tools and dependencies Component: Documentation In-repository documentation, under docs/ Component: Main Composing the core Shadow executable

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants