As presented in the issue, we might end up in a situation where parallel-processing and accepting two neutral+ trade offers results in an unwanted inventory state, because while each of them is neutral+ and therefore OK to accept standalone, the combination of both causes active badge progress degradation.
Considering the requirements we have, namely still processing trades in parallel, being performant, low on resources and with limited Steam server overhead, the solution that I came up with for this issue is quite simple (a rough sketch in code follows the list below):
- After we determine the trade to be neutral+, but before we tell the parse trade routine to accept it, we check whether the set of handled sets, which is shared with other parallel processes, contains any of the sets that we're currently processing.
- If it doesn't, we update that shared set to include everything we're dealing with, and tell the caller to accept this trade.
- If it does, we tell the caller to retry this trade after the (other) accepted trades are confirmed and handled as usual.
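To illustrate the idea, here's a minimal sketch of the check-and-claim step. All names here (TradeConflictGuard, ParseTradeResult, identifying a set by an (AppID, ClassID) tuple) are hypothetical stand-ins, not ASF's actual code:

```csharp
using System.Collections.Generic;
using System.Linq;

internal enum ParseTradeResult { Accept, TryAgain }

internal static class TradeConflictGuard {
	// Sets already claimed by accepted-but-not-yet-handled trades, shared across parallel trade parsers
	private static readonly HashSet<(uint AppID, ulong ClassID)> HandledSets = new();
	private static readonly object HandledSetsLock = new();

	// Called only AFTER the trade was already determined to be neutral+
	internal static ParseTradeResult CheckAndClaim(IReadOnlyCollection<(uint AppID, ulong ClassID)> setsInThisTrade) {
		// A plain HashSet under a lock, since the check and the claim must be atomic as a whole
		lock (HandledSetsLock) {
			if (setsInThisTrade.Any(HandledSets.Contains)) {
				// Another parallel trade already touched one of our sets, retry later
				return ParseTradeResult.TryAgain;
			}

			// Claim every set this trade touches; they'd be released again once
			// the accepted trades are confirmed and handled as usual
			HandledSets.UnionWith(setsInThisTrade);

			return ParseTradeResult.Accept;
		}
	}
}
```

The only important property is that the check and the claim happen atomically under one lock, so two trades touching the same set can never both pass.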
This solves the problem while relying on some optimistic assumptions:
- First of all, it solves the original issue, since if trades A and B both touch set S, then only one of them will be accepted. Which one is not deterministic (it's whichever gets to the check first), but that's not important anyway.
- We do not "lock" the sets before we determine that the trade is neutral+, because otherwise unrelated users could spam us with non-neutral+ trades in order to lock the bot in an infinite retry. This way they can't, since a trade that is determined not to be neutral+ never reaches the concurrency check.
- We are optimistic about resource usage. This routine could be made much more complicated and more synchronous in order to avoid unnecessary calls to inventory and matching; however, that'd slow down the whole process only because the next call MAYBE turns out to be unneeded. Due to that, ASF is optimistic that trades will (usually) be unrelated and can be processed in parallel, and if a conflict happens, we simply end up doing some extra work for no reason, which is better than postponing the work until all previous trades are processed.
- As soon as the conditions are met, the conflicting trades are retried to check whether the conditions still allow accepting them. If yes, they'll be accepted almost immediately after the previous ones; if not, they'll be rejected, since they're no longer neutral+.
This way the additional code does not hurt performance, parallel processing or anything else in the usually expected optimistic scenarios, while adding some overhead in pessimistic ones, which is justified considering we don't want to degrade badge progress.
After the changes to callbacks handling, we accidentally broke the reconnection logic. In particular, forced connection implicitly disconnected first, firing the disconnect callback, but the disconnect callback killed our callbacks handling loop for the future connection, since it was instructed not to reconnect... Pretty convoluted logic.
Let's attempt to fix and simplify it. There is no forced connection concept anymore, but rather a new reconnect function, which either triggers reconnection through the usual disconnection logic, or connects directly in the edge case where we attempted to reconnect with an already disconnected client.
This way the state transitions are more predictable, as we Connect() only in 3 cases:
- Initial start, including !start command, when we actually spawn the callbacks handling loop
- Upon disconnection, if we're configured to reconnect
- Reconnection, in case we're already disconnected and can't use the above
And we use reconnect when:
- Failure in heartbeats, which we use to detect disconnections sooner
- Failure in refreshing access tokens, since if we lose our refresh token then the only way to get a new one is to reconnect
And finally disconnect is triggered when:
- Stopping the bot, especially !stop
- Bulletproofing against trying to connect when !KeepRunning and likewise
- Usual Steam maintenance and other network issues (which usually trigger reconnection)
The codebase is too huge to analyze every possible edge case, but with this logic I can no longer reproduce the previous issue.
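To make the transitions above concrete, here's a rough sketch of how the three entry points can be wired together. The stub client and all member names are assumptions for illustration, not ASF's actual implementation:

```csharp
// Stand-in for the real Steam client, just enough for the sketch
internal sealed class SteamClientStub {
	internal bool IsConnected { get; private set; }

	internal void Connect() => IsConnected = true;
	internal void Disconnect() => IsConnected = false;
}

internal sealed class Bot {
	private readonly SteamClientStub SteamClient = new();

	private bool KeepRunning; // Set on initial start, cleared by !stop

	// Case 1: initial start, including !start, also where the callbacks handling loop would be spawned
	internal void Start() {
		KeepRunning = true;
		Connect();
	}

	// !stop: stay offline, the disconnect callback won't reconnect since KeepRunning is false
	internal void Stop() {
		KeepRunning = false;
		SteamClient.Disconnect();
		OnDisconnected(); // The real client would fire this callback itself
	}

	// Used on heartbeat failures and on access token refresh failures
	internal void Reconnect() {
		if (SteamClient.IsConnected) {
			// Usual path: disconnecting triggers OnDisconnected(), which reconnects for us
			SteamClient.Disconnect();
			OnDisconnected(); // The real client would fire this callback itself
		} else {
			// Edge case: the client is already disconnected, connect directly
			Connect();
		}
	}

	// Cases 2-3: reconnection after disconnect, or direct connect from Reconnect()
	private void Connect() {
		if (!KeepRunning || SteamClient.IsConnected) {
			return; // Bulletproofing: never connect a stopped bot or an already connected client
		}

		SteamClient.Connect();
	}

	// Fired on !stop, Steam maintenance and other network issues
	private void OnDisconnected() {
		if (!KeepRunning) {
			return; // Stopping on purpose: stay offline, but don't kill the callbacks loop of a future connection
		}

		Connect(); // Configured to reconnect
	}
}
```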
This is some next-level race condition, so for those interested:
- Concurrent collections are thread-safe in the sense that each single operation is atomic
- Naturally, if you call two atomic operations in a row, the combination is no longer atomic, since the collection can change between the first and the second
- Certain LINQ operations, such as OrderBy(), Reverse() and ToArray(), among others, buffer their input internally, with an optimization that checks whether the input is an ICollection; if it is, they call Count and then CopyTo() (that's what OrderBy() does, for example)
- As a result, such a LINQ call is not guaranteed to be thread-safe, since it assumes those two calls form an atomic pair, while in reality they don't: if the collection shrinks between Count and CopyTo(), the buffer is left padded with default values that were never in the collection, and if it grows, CopyTo() can throw (see the repro sketch below)
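Here's a self-contained repro sketch of the race (hypothetical demo code, not from ASF): one thread keeps mutating a ConcurrentDictionary while another snapshots it through OrderBy():

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

internal static class Program {
	private static async Task Main() {
		ConcurrentDictionary<int, int> items = new();
		using CancellationTokenSource cts = new();

		// Mutator: keeps adding and removing entries; keys start from 1 on purpose,
		// so a (0, 0) pair can only ever come from default-initialized buffer slots
		Task mutator = Task.Run(() => {
				Random random = new();

				while (!cts.IsCancellationRequested) {
					int key = random.Next(1, 1000);

					if (!items.TryAdd(key, key)) {
						items.TryRemove(key, out _);
					}
				}
			}
		);

		for (int i = 0; i < 100_000; i++) {
			try {
				// LINQ sees an ICollection<KeyValuePair<int, int>>: it reads Count,
				// allocates a buffer of that size, then calls CopyTo() - two separate,
				// non-atomic operations
				KeyValuePair<int, int>[] snapshot = items.OrderBy(static item => item.Key).ToArray();

				// If the dictionary shrank between Count and CopyTo(), the tail of the
				// buffer keeps default entries that were never in the dictionary
				if (snapshot.Any(static item => item.Key == 0)) {
					Console.WriteLine("Corrupted snapshot: phantom default entries!");
				}
			} catch (ArgumentException) {
				// If it grew instead, CopyTo() throws, since the buffer was sized for a stale Count
				Console.WriteLine("CopyTo() failed: buffer allocated for a stale Count!");
			}
		}

		cts.Cancel();
		await mutator.ConfigureAwait(false);
	}
}
```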
This issue is quite hard to spot in real applications, since it's not that easy to trigger (you need to call the operation on an ICollection and have another thread modifying it at exactly the wrong moment). This is probably why we've never had any real problem until I discovered this madness with @Aareksio in an entirely different project.
As a workaround, we'll explicitly convert some ICollection inputs to a plain IEnumerable, in particular around OrderBy(), so the optimization is skipped and the result is not corrupted.
I've added unit tests which ensure this workaround works properly; you can easily reproduce the problem by removing AsLinqThreadSafeEnumerable() from them.
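The whole trick behind such a wrapper is that an iterator block never implements ICollection&lt;T&gt;, so LINQ's fast path can't kick in. A minimal sketch of the idea, not necessarily ASF's exact implementation:

```csharp
using System.Collections.Generic;

internal static class EnumerableExtensions {
	internal static IEnumerable<T> AsLinqThreadSafeEnumerable<T>(this IEnumerable<T> enumerable) {
		// The compiler-generated state machine implements only IEnumerable<T>,
		// so LINQ's "is ICollection<T>" check fails and it falls back to growing
		// its buffer while enumerating, which concurrent collections support safely
		foreach (T item in enumerable) {
			yield return item;
		}
	}
}
```

With that in place, something like `concurrentDictionary.AsLinqThreadSafeEnumerable().OrderBy(...)` takes the slow-but-safe buffering path.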
See https://github.com/dotnet/runtime/discussions/50687 for more insight.
I have no clue who thought that ignoring this issue is a good idea; at the very least, concurrent collections should have an opt-out mechanism from those optimizations, and there is no reason for them not to provide one.