Fix leaking sockets #454

swamp-agr · 2021-02-01T08:08:15Z

This PR is intended to close the issue with leaking sockets in case where managerResponseTimeout is relatively small. With outgoing constant 1K rps application becomes stuck after 4-17 hours uptime. Force reboot required every 4 hours. Situation is pretty similar to #374. Thus, this PR will close #374.

Details

withResponse req man f = bracket (responseOpen req man) responseClose f

Timeout exception had been thrown in "acquire" function of bracket (i.e. in responseOpen, when connection is opened).
Thus, connection could not be closed via bracket.
It seems that reaper introduced in Data.KeyedPool did not manage such connections well.
It is also quite hard to get extra information from KeyedPool for further debugging purposes and produce the explicit test for such case.
Meantime, as it is currently implemented inside httpRaw' some exceptions are indeed handled well:

    ...
    case ex of
        -- Connection was reused, and might have been closed. Try again
        Left e | managedReused mconn && mRetryableException m e -> do
            managedRelease mconn DontReuse
            httpRaw' req m
        -- Not reused, or a non-retry, so this is a real exception
        Left e -> throwIO e
        -- Everything went ok, so the connection is good. If any exceptions get
        -- thrown in the response body, just throw them as normal.
    ...

In order to resolve the issue, separate case added:

     ...
    case ex of
        -- Connection was reused, and might have been closed. Try again
        Left e | managedReused mconn && mRetryableException m e -> do
            managedRelease mconn DontReuse
            httpRaw' req m
        -- Connection timed out and might have been closed.
               | mTimeoutException m e -> do
            managedRelease mconn DontReuse
            throwIO e
        -- Not reused, or a non-retry, so this is a real exception
        Left e -> throwIO e
        -- Everything went ok, so the connection is good. If any exceptions get
        -- thrown in the response body, just throw them as normal.
    ...

mTimeoutException added as mRetryableException, i.e. in a flexible way. Default implementation was also provided:

    ...
    , managerTimeoutException = \e ->
        case fromException e of
          Just (HttpExceptionRequest _ ConnectionTimeout) -> True
          _                                               -> False
    ...

It could be disabled via something like const False to fallback to current upstream.
Haddock for Manager contains assumption about next package version (@since 0.7.5). I could remove it if you ask.
~~I have stylish-haskell enabled by default. Thus, there are also some formatting changed. If formatting is not acceptable, I could proceed with reverting it.~~

I am looking forward to bring the fix to upstream.

Best Regards,
@swamp-agr

snoyberg

I've started reviewing, but I'd rather look at a clean diff before adding any comments, as I may be missing things.

http-client/Network/HTTP/Client/Manager.hs

snoyberg · 2021-02-01T15:59:53Z

Thanks for updating. I'm not sure I fully understand the motivation behind the change here though. It seems like the problem is that, if an exception is thrown from a reused connection, we need to ensure that it is released. My question is: what makes timeout exceptions special? Shouldn't we simply call managedRelease mconn DontReuse for all types of exceptions?

Yuras · 2021-02-01T19:01:03Z

@swamp-agr Let me describe how it works right now. KeyedPool doesn't handle allocated connections at all, reaper is for idle connections (waiting to be reused).

When new connection is allocated, weak finalizer is attached to it to deallocate it in case it leaks, but it doesn't guarantee prompt deallocation, see here.

Indeed http-client doesn't even try to guarantee prompt connection deallocation, it relies mostly on finalizers.

The issue in question occurs after 4-7 hours of correct work, so I assume there where multiple major GCs during this time, so finalizers had enough time to be executed. I see three possibilities here:

KeyedPool is buggy and doesn't attach finalizer is some cases, or
Something keeps a reference to the connections (probably via Response), preventing finalizers from running, or
Finalizers are broken in ghc, e.g. you are using older ghc (pre 8.6 IIRC) or they are still broken in newer ghc.

So my conclusion: it's likely that you are fixing a symptom instead of the core issue.

There is nothing wrong with fixing symptoms, and actually I never liked the fact that http-client relies on finalizers. Now the following question comes: do you want http-client to guarantee prompt deallocation? If you do, then the PR is way too simple for that, but if you don't, then it's probably OK.

And I agree with @snoyberg that request timeout is not special in any way. Consider for example the connection timeout, with high probability it will trigger the same issue. Not to mention IO errors or async exceptions.

swamp-agr · 2021-02-01T23:27:55Z

@Yuras, thank you for detailed explanation.

@snoyberg, I want to provide wide fix for deallocation of the httpRaw' function.
There are at least two entry points where deallocation is affected:

getConnectionWrapper.
And response exceptions.

In this case the next steps for me would be:

close connections before throwing exception in getConnectionWrapper.
close connections for all exceptional results of getResponse (Left e -> throwIO e).
remove Timeout related settings and helpers introduced in previous commits.

Please expect fix soon.

- `getConn` wrapped in `bracketOnError` because `timeout ms f` applied to it. - All exceptions trigger connection release with DontReuse. - Timeout settings reverted completely.

swamp-agr · 2021-02-02T15:31:21Z

Changes that were described in comment above implemented now.

snoyberg · 2021-02-03T04:24:22Z

http-client/Network/HTTP/Client/Manager.hs

+    | otherwise = 
+        -- Release connection in case of connection timeout:
+        -- https://github.com/snoyberg/http-client/pull/454
+        bracketOnError


This call is a nop. bracketOnError will only trigger the cleanup if the in-between action throws an exception. If the allocation action throws an exception, the finalizer will never be run.

I thought about that. On the other hand, in case of exception Managed Connection will not be created and it will not be an issue. Is it correct or I am missing something?

That's correct. I'm not sure what problem you're trying to prevent here.

I was concerned about timeout timeout' (getConn req m) call in getConnectionWrapper.

I thought that in some tricky case connection could be acquired but forgotten via timeout. In this case its future will be uncertain and I tried to prevent it.

If Managed Connection created before timeout, it definitely would be returned. Otherwise, it would not be created at all.

And if it would not be created, then ConnectionTimeout exception could be "safely" thrown from getConnectionWrapper.

Now I see, there is nothing to worry about. I will completely restore getConn.

Thank you for your comment.

snoyberg

LGTM! Only last request is a changelog entry.

@Yuras did you have any additional comments before I merge this one in?

swamp-agr · 2021-02-03T14:24:33Z

@snoyberg changelog updated accordingly (I assume that next version will be 0.7.5).

swamp-agr · 2021-02-03T14:58:52Z

After quick chat with @Yuras he pointed me out on getConnectionWrapper:

  where
    getConnectionWrapper mtimeout f =
        case mtimeout of
            Nothing -> fmap ((,) Nothing) f
            Just timeout' -> do
                before <- getCurrentTime
                mres <- timeout timeout' f
                case mres of
                    Nothing -> throwHttp ConnectionTimeout
                    Just res -> do
                        now <- getCurrentTime
                        let timeSpentMicro = diffUTCTime now before * 1000000
                            remainingTime = round $ fromIntegral timeout' - timeSpentMicro
                        if remainingTime <= 0
                            then throwHttp ConnectionTimeout -- <-- In this case connection has already present, it could leak
                            else return (Just remainingTime, res)

There is also possibility of asynchronous exceptions.
Thus, his recommendation is to apply bracketOnError inside getConnectionWrapper.

swamp-agr · 2021-02-03T15:52:37Z

Suggestion implemented.
@Yuras, @snoyberg please review.

CI errors from appveyour could be ignored, GitHub webhooks are degraded right now:

snoyberg · 2021-02-04T03:31:17Z

http-client/Network/HTTP/Client/Core.hs

+        bracketOnError getConnWithTimeout releaseConnMaybe checkConnDuration
+      where
+        -- If applicable, wait for timeout period to acquire the connection from the pool.
+        getConnWithTimeout = case mtimeout of
            Nothing -> fmap ((,) Nothing) f
            Just timeout' -> do
                before <- getCurrentTime
                mres <- timeout timeout' f


This is a problem. timeout is now running in a masked state, which we shouldn't be doing.

Should timeout function be wrapped in interruptible explicitly?

@snoyberg Why is it a problem? getConn is interruptible, so timeout will be able to interrupt it. And in any case, getConn basically just calls takeKeyedPool, which is mask_'ed anyway.

I don't want to rely on the current behavior of bracketOnError not masking in the acquire clause, that may change in the future. It's fragile to write code that way IMO. It seems like the primary thing we're trying to achieve is ensuring that releaseConnMaybe is called in the case of remainingTime <= 0. Why not add that in directly before the throwHttp call, or add an onException?

Restored previous version and handled this case directly. Fixed.

snoyberg

This looks good. Let’s start with this change, and we can continue with other improvements as necessary

swamp-agr · 2021-02-04T20:05:43Z

@snoyberg @Yuras Thank you for your support!

Even with early version of the fix situation changed.
According to netstat and application logs, bottleneck shifted from http-client side to warp side.

Please expect PR in warp on the next week.

snoyberg requested changes Feb 1, 2021

View reviewed changes

http-client/Network/HTTP/Client/Manager.hs Outdated Show resolved Hide resolved

Force close connection when ResponseTimeout happened

db17ae8

swamp-agr force-pushed the master branch from 55efc06 to db17ae8 Compare February 1, 2021 09:55

swamp-agr requested a review from snoyberg February 1, 2021 09:57

Fix typo and markup in haddock

ebe54a4

Close connection for ResponseTimeout

d51afae

Wrap getConn in bracketOnError and release connection for exceptions

72e599e

- `getConn` wrapped in `bracketOnError` because `timeout ms f` applied to it. - All exceptions trigger connection release with DontReuse. - Timeout settings reverted completely.

swamp-agr changed the title ~~Fix leaking sockets in case of ResponseTimeoutException~~ Fix leaking sockets Feb 2, 2021

snoyberg reviewed Feb 3, 2021

View reviewed changes

swamp-agr added 2 commits February 3, 2021 15:19

Restore getConn since it is safe to invoke timeout upon it

1bccc90

Remove unused import

bf5b111

swamp-agr requested a review from snoyberg February 3, 2021 12:20

snoyberg requested changes Feb 3, 2021

View reviewed changes

Update ChangeLog for http-client

633bb1d

swamp-agr requested a review from snoyberg February 3, 2021 14:24

Release connection inside getConnectionWrapper for error cases

06d6b85

snoyberg reviewed Feb 4, 2021

View reviewed changes

Restore getConnectionWrapper and handle explicit timeout

43deffa

swamp-agr requested a review from snoyberg February 4, 2021 19:11

snoyberg approved these changes Feb 4, 2021

View reviewed changes

snoyberg merged commit 70c0da6 into snoyberg:master Feb 4, 2021

larskuhtz mentioned this pull request Feb 5, 2021

use http-client-0.7.5 in cabal builds kadena-io/chainweb-node#1191

Merged

3 tasks

Fix leaking sockets #454

Fix leaking sockets #454

Uh oh!

Conversation

swamp-agr commented Feb 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Details

Uh oh!

snoyberg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

snoyberg commented Feb 1, 2021

Uh oh!

Yuras commented Feb 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

swamp-agr commented Feb 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

swamp-agr commented Feb 2, 2021

Uh oh!

snoyberg Feb 3, 2021

Choose a reason for hiding this comment

Uh oh!

swamp-agr Feb 3, 2021

Choose a reason for hiding this comment

Uh oh!

snoyberg Feb 3, 2021

Choose a reason for hiding this comment

Uh oh!

swamp-agr Feb 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

swamp-agr Feb 3, 2021

Choose a reason for hiding this comment

Uh oh!

snoyberg left a comment

Choose a reason for hiding this comment

Uh oh!

swamp-agr commented Feb 3, 2021

Uh oh!

swamp-agr commented Feb 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

swamp-agr commented Feb 3, 2021

Uh oh!

snoyberg Feb 4, 2021

Choose a reason for hiding this comment

Uh oh!

swamp-agr Feb 4, 2021

Choose a reason for hiding this comment

Uh oh!

Yuras Feb 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snoyberg Feb 4, 2021

Choose a reason for hiding this comment

Uh oh!

swamp-agr Feb 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

snoyberg left a comment

Choose a reason for hiding this comment

Uh oh!

swamp-agr commented Feb 4, 2021

Uh oh!

Uh oh!

swamp-agr commented Feb 1, 2021 •

edited

Loading

Yuras commented Feb 1, 2021 •

edited

Loading

swamp-agr commented Feb 1, 2021 •

edited

Loading

swamp-agr Feb 3, 2021 •

edited

Loading

swamp-agr commented Feb 3, 2021 •

edited

Loading

Yuras Feb 4, 2021 •

edited

Loading

swamp-agr Feb 4, 2021 •

edited

Loading