Skip to content

Fix leaking sockets #454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 4, 2021
Merged

Fix leaking sockets #454

merged 9 commits into from
Feb 4, 2021

Conversation

swamp-agr
Copy link
Contributor

@swamp-agr swamp-agr commented Feb 1, 2021

Hi @snoyberg!

This PR is intended to close the issue with leaking sockets in case where managerResponseTimeout is relatively small. With outgoing constant 1K rps application becomes stuck after 4-17 hours uptime. Force reboot required every 4 hours. Situation is pretty similar to #374. Thus, this PR will close #374.

Details

withResponse req man f = bracket (responseOpen req man) responseClose f
  • Timeout exception had been thrown in "acquire" function of bracket (i.e. in responseOpen, when connection is opened).
  • Thus, connection could not be closed via bracket.
  • It seems that reaper introduced in Data.KeyedPool did not manage such connections well.
  • It is also quite hard to get extra information from KeyedPool for further debugging purposes and produce the explicit test for such case.
  • Meantime, as it is currently implemented inside httpRaw' some exceptions are indeed handled well:
    ...
    case ex of
        -- Connection was reused, and might have been closed. Try again
        Left e | managedReused mconn && mRetryableException m e -> do
            managedRelease mconn DontReuse
            httpRaw' req m
        -- Not reused, or a non-retry, so this is a real exception
        Left e -> throwIO e
        -- Everything went ok, so the connection is good. If any exceptions get
        -- thrown in the response body, just throw them as normal.
    ...
  • In order to resolve the issue, separate case added:
     ...
    case ex of
        -- Connection was reused, and might have been closed. Try again
        Left e | managedReused mconn && mRetryableException m e -> do
            managedRelease mconn DontReuse
            httpRaw' req m
        -- Connection timed out and might have been closed.
               | mTimeoutException m e -> do
            managedRelease mconn DontReuse
            throwIO e
        -- Not reused, or a non-retry, so this is a real exception
        Left e -> throwIO e
        -- Everything went ok, so the connection is good. If any exceptions get
        -- thrown in the response body, just throw them as normal.
    ...
  • mTimeoutException added as mRetryableException, i.e. in a flexible way. Default implementation was also provided:
    ...
    , managerTimeoutException = \e ->
        case fromException e of
          Just (HttpExceptionRequest _ ConnectionTimeout) -> True
          _                                               -> False
    ...
  • It could be disabled via something like const False to fallback to current upstream.
  • Haddock for Manager contains assumption about next package version (@since 0.7.5). I could remove it if you ask.
  • I have stylish-haskell enabled by default. Thus, there are also some formatting changed. If formatting is not acceptable, I could proceed with reverting it.

I am looking forward to bring the fix to upstream.

Best Regards,
@swamp-agr

Copy link
Owner

@snoyberg snoyberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started reviewing, but I'd rather look at a clean diff before adding any comments, as I may be missing things.

@snoyberg
Copy link
Owner

snoyberg commented Feb 1, 2021

Thanks for updating. I'm not sure I fully understand the motivation behind the change here though. It seems like the problem is that, if an exception is thrown from a reused connection, we need to ensure that it is released. My question is: what makes timeout exceptions special? Shouldn't we simply call managedRelease mconn DontReuse for all types of exceptions?

@Yuras
Copy link
Collaborator

Yuras commented Feb 1, 2021

@swamp-agr Let me describe how it works right now. KeyedPool doesn't handle allocated connections at all, reaper is for idle connections (waiting to be reused).

When new connection is allocated, weak finalizer is attached to it to deallocate it in case it leaks, but it doesn't guarantee prompt deallocation, see here.

Indeed http-client doesn't even try to guarantee prompt connection deallocation, it relies mostly on finalizers.

The issue in question occurs after 4-7 hours of correct work, so I assume there where multiple major GCs during this time, so finalizers had enough time to be executed. I see three possibilities here:

  • KeyedPool is buggy and doesn't attach finalizer is some cases, or
  • Something keeps a reference to the connections (probably via Response), preventing finalizers from running, or
  • Finalizers are broken in ghc, e.g. you are using older ghc (pre 8.6 IIRC) or they are still broken in newer ghc.

So my conclusion: it's likely that you are fixing a symptom instead of the core issue.

There is nothing wrong with fixing symptoms, and actually I never liked the fact that http-client relies on finalizers. Now the following question comes: do you want http-client to guarantee prompt deallocation? If you do, then the PR is way too simple for that, but if you don't, then it's probably OK.

And I agree with @snoyberg that request timeout is not special in any way. Consider for example the connection timeout, with high probability it will trigger the same issue. Not to mention IO errors or async exceptions.

@swamp-agr
Copy link
Contributor Author

swamp-agr commented Feb 1, 2021

@Yuras, thank you for detailed explanation.

@snoyberg, I want to provide wide fix for deallocation of the httpRaw' function.
There are at least two entry points where deallocation is affected:

In this case the next steps for me would be:

  • close connections before throwing exception in getConnectionWrapper.
  • close connections for all exceptional results of getResponse (Left e -> throwIO e).
  • remove Timeout related settings and helpers introduced in previous commits.

Please expect fix soon.

- `getConn` wrapped in `bracketOnError` because `timeout ms f` applied
to it.
- All exceptions trigger connection release with DontReuse.
- Timeout settings reverted completely.
@swamp-agr swamp-agr changed the title Fix leaking sockets in case of ResponseTimeoutException Fix leaking sockets Feb 2, 2021
@swamp-agr
Copy link
Contributor Author

Changes that were described in comment above implemented now.

| otherwise =
-- Release connection in case of connection timeout:
-- https://github.com/snoyberg/http-client/pull/454
bracketOnError
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is a nop. bracketOnError will only trigger the cleanup if the in-between action throws an exception. If the allocation action throws an exception, the finalizer will never be run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. On the other hand, in case of exception Managed Connection will not be created and it will not be an issue. Is it correct or I am missing something?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. I'm not sure what problem you're trying to prevent here.

Copy link
Contributor Author

@swamp-agr swamp-agr Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I was concerned about timeout timeout' (getConn req m) call in getConnectionWrapper.
  • I thought that in some tricky case connection could be acquired but forgotten via timeout. In this case its future will be uncertain and I tried to prevent it.
  • If Managed Connection created before timeout, it definitely would be returned. Otherwise, it would not be created at all.
  • And if it would not be created, then ConnectionTimeout exception could be "safely" thrown from getConnectionWrapper.

Now I see, there is nothing to worry about. I will completely restore getConn.

Thank you for your comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@swamp-agr swamp-agr requested a review from snoyberg February 3, 2021 12:20
Copy link
Owner

@snoyberg snoyberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Only last request is a changelog entry.

@Yuras did you have any additional comments before I merge this one in?

@swamp-agr
Copy link
Contributor Author

@snoyberg changelog updated accordingly (I assume that next version will be 0.7.5).

@swamp-agr swamp-agr requested a review from snoyberg February 3, 2021 14:24
@swamp-agr
Copy link
Contributor Author

swamp-agr commented Feb 3, 2021

After quick chat with @Yuras he pointed me out on getConnectionWrapper:

  where
    getConnectionWrapper mtimeout f =
        case mtimeout of
            Nothing -> fmap ((,) Nothing) f
            Just timeout' -> do
                before <- getCurrentTime
                mres <- timeout timeout' f
                case mres of
                    Nothing -> throwHttp ConnectionTimeout
                    Just res -> do
                        now <- getCurrentTime
                        let timeSpentMicro = diffUTCTime now before * 1000000
                            remainingTime = round $ fromIntegral timeout' - timeSpentMicro
                        if remainingTime <= 0
                            then throwHttp ConnectionTimeout -- <-- In this case connection has already present, it could leak
                            else return (Just remainingTime, res)

There is also possibility of asynchronous exceptions.
Thus, his recommendation is to apply bracketOnError inside getConnectionWrapper.

@swamp-agr
Copy link
Contributor Author

Suggestion implemented.
@Yuras, @snoyberg please review.

CI errors from appveyour could be ignored, GitHub webhooks are degraded right now:

telegram-cloud-photo-size-2-5249229594207301805-y

bracketOnError getConnWithTimeout releaseConnMaybe checkConnDuration
where
-- If applicable, wait for timeout period to acquire the connection from the pool.
getConnWithTimeout = case mtimeout of
Nothing -> fmap ((,) Nothing) f
Just timeout' -> do
before <- getCurrentTime
mres <- timeout timeout' f
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problem. timeout is now running in a masked state, which we shouldn't be doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should timeout function be wrapped in interruptible explicitly?

Copy link
Collaborator

@Yuras Yuras Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snoyberg Why is it a problem? getConn is interruptible, so timeout will be able to interrupt it. And in any case, getConn basically just calls takeKeyedPool, which is mask_'ed anyway.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to rely on the current behavior of bracketOnError not masking in the acquire clause, that may change in the future. It's fragile to write code that way IMO. It seems like the primary thing we're trying to achieve is ensuring that releaseConnMaybe is called in the case of remainingTime <= 0. Why not add that in directly before the throwHttp call, or add an onException?

Copy link
Contributor Author

@swamp-agr swamp-agr Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored previous version and handled this case directly. Fixed.

@swamp-agr swamp-agr requested a review from snoyberg February 4, 2021 19:11
Copy link
Owner

@snoyberg snoyberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Let’s start with this change, and we can continue with other improvements as necessary

@snoyberg snoyberg merged commit 70c0da6 into snoyberg:master Feb 4, 2021
@swamp-agr
Copy link
Contributor Author

@snoyberg @Yuras Thank you for your support!

Even with early version of the fix situation changed.
According to netstat and application logs, bottleneck shifted from http-client side to warp side.

Please expect PR in warp on the next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RequestTimeout setting results in getProtocolByName: does not exist (no such protocol name: tcp))
3 participants