Author: Arvind Padmanabhan
Analyzing a race condition in a chat app and resolving it
A system that receives concurrent events can behave differently depending on the relative timing of those events. Those working with embedded systems are probably familiar with this problem. Hardware interrupts come in at unpredictable times. One layer sends a message to trigger action from another layer. Data comes in via the wireless interface. A timer expires. How does one reliably handle all these events without taking the system to an unknown state? It's worse still if the system can't recover from such a state.
Real-time operating systems (RTOS) provide primitives to handle data synchronization, critical sections and race conditions. But can we have such problems in a web app? The answer is yes, as I found out recently while developing a “chat-like” app. The chat window of this app can be updated anytime by multiple users. When user A creates a new chat message, user B can also see it as long as the chat window is open. User B can then reply to user A’s message.
One possible implementation is to use MQTT. Unfortunately, this app is hosted on a shared server where it was not possible to install MQTT. Another solution is to use long polling, but I wasn’t sure if that would scale, at least on an Apache server. For example, a hundred chat users doing long polling would mean a hundred TCP connections kept open. This is certainly not scalable. Short polling is fine so long as the polling interval is on the order of tens of seconds. I chose about 20 seconds as the interval. The app still feels responsive to users.
A chat room can have dozens of messages. With an eye on performance, only the most recent 8 messages are displayed. Users can then choose to load older messages, 8 at a time. Each message has an ID, which is used to ensure that polling retrieves only newer messages (those with a larger ID).
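The ID-based filtering can be sketched as a small function. The message shape and the function name here are illustrative, not taken from the actual app:

```javascript
// Sketch of ID-based retrieval: the client remembers the highest ID it
// has displayed, and each poll asks only for messages newer than that.
function newerThan(messages, lastId) {
  // Server-side filter: return only messages with a larger ID.
  return messages.filter(m => m.id > lastId);
}

// A hypothetical store with three messages; the client has seen up to ID 10.
const store = [
  { id: 9, text: "hi" },
  { id: 10, text: "hello" },
  { id: 11, text: "new reply" },
];

// A poll carrying lastId = 10 retrieves only message 11.
const fresh = newerThan(store, 10);
```

The client then bumps its remembered ID to the largest one it received, so the next poll starts from there.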
User A creates a new message. At about the same time, a poll goes out. The response to the create request returns the new message for display; the poll response returns the same new message as well. The chat window ends up displaying the new message twice.
This problem is in fact hard to reproduce and happens only occasionally. This is because creating a new message and getting back the response takes about 900 ms, while the poll goes out only once every 20 seconds. So the chances of a collision are quite rare. One way to reproduce the problem is to shorten the poll interval to 900 ms or less. I used 300 ms, and the problem surfaced easily. In practice, a 300 ms poll period is obeyed only if each request-response completes within that time. In the actual implementation, the next poll is started only after the response to the previous one has arrived. But even with such test parameters, duplicates sometimes do not occur and things seem to work as expected!
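The collision can be made deterministic in a few lines. This is a toy simulation of the race, not the app's actual code: a naive client renders the create response immediately, and a poll that was already in flight (still carrying the old last-seen ID) returns the same message again.

```javascript
// Toy in-memory "server": messages so far have IDs up to 10.
const server = { messages: [], nextId: 11 };

function addMessage(text) {
  const msg = { id: server.nextId++, text };
  server.messages.push(msg);
  return msg;                          // response to the sender
}

function poll(afterId) {
  return server.messages.filter(m => m.id > afterId);
}

const displayed = [];
// User A sends a message; the client renders the response right away.
displayed.push(addMessage("hello"));
// A poll that went out at about the same time still carries lastId = 10,
// so it also returns message 11 — and the naive client renders it again.
displayed.push(...poll(10));
// `displayed` now holds message 11 twice: the duplicate bug.
```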
To understand the problem, a UML sequence diagram can help. Four variants are shown in (a)-(d). In (a), the new chat request goes out before the poll request, and likewise the new chat response comes back before the poll response. The first thing that becomes obvious is that the client can make multiple AJAX requests to the same server. In fact, most browsers support at least 6 concurrent AJAX calls to the same domain. This is why both requests can be pending at the same time.
Let’s assume for a second that the server allows only a single active request per client, with other requests queued and processed serially. If that were the case, the new message would be added and assigned ID=11, and the response sent out. When the poll request arrives with ID=10, the server sees that a more recent message is present and delivers it. Hence the client ends up with duplicates. This is what happens in (a), but it doesn’t explain (c), where the server responds that there are no messages after ID=10. That can be explained in two ways.
Although the diagrams show a request being received by the server the moment it is sent, in reality there’s a delay in the network. It’s therefore possible that although the new message request is sent first, it’s received second at the server because its packets took longer through the network. But this case is similar to analyzing (b) or (d), so let’s discount network delay since we can analyze (b) and (d) instead. The other likely explanation is that the server is multithreaded, meaning that it spawns a new thread for every request. This is in fact common for most servers. Even with a single-threaded asynchronous server such as Node.js, there are multiple worker threads in the background that process concurrent requests.
Scenario (c) can be explained easily once we know that the server handles multiple requests concurrently. Although the polling request comes in second, its thread starts first, even before the new message has been added. Before the polling response can be sent out, the new chat message thread adds the new message and responds. Scenario (d) is very similar, with the polling request thread finishing before the new chat message thread. In scenario (b), the new chat message thread finishes first but its response arrives after the polling response.
A couple of things can be done to fix the problem. Not all of them are strictly needed, but in the spirit of defensive programming we want to handle problems that may occur anywhere in the sequence.
- When a new chat message is sent out, stop the poller. Resume it only when the response is received. The response must contain the most recent 8 messages (11…4). In fact, this also handles another case: another user has added a new message in the meantime, in which case the response would contain messages (12…5).
- When a response is received, ignore any message that is already displayed. Since each message has a unique ID, this is easy to determine.
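The two fixes above can be sketched together. This is a minimal illustration, assuming a hypothetical `api.create` call; the names are mine, not the app's:

```javascript
// Fix 2: a Set of displayed IDs makes rendering idempotent — a message
// that arrives twice (once via the create response, once via a poll)
// is only shown once.
const displayedIds = new Set();
const chatWindow = [];

function render(msg) {
  if (displayedIds.has(msg.id)) return;  // already on screen: ignore
  displayedIds.add(msg.id);
  chatWindow.push(msg);
}

// Fix 1: suppress the poller while a send is in flight. The poller's
// tick would check this flag before issuing its request.
let pollingPaused = false;

async function sendMessage(text, api) {
  pollingPaused = true;                  // stop polling while sending
  try {
    // Assumed contract: the server replies with the most recent 8
    // messages, covering anything other users added in the meantime.
    const recent = await api.create(text);
    recent.forEach(render);
  } finally {
    pollingPaused = false;               // resume polling after the response
  }
}

// Demonstration of fix 2: the same message arriving twice renders once.
render({ id: 11, text: "new reply" });
render({ id: 11, text: "new reply" });
```

Even if the poller pause fails or a poll was already in flight, the ID check makes the duplicate harmless, which is why both fixes together are robust.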
Arvind Padmanabhan graduated from the National University of Singapore with a master’s degree in electrical engineering. With more than fifteen years of experience, he has worked extensively on various wireless technologies including DECT, WCDMA, HSPA, WiMAX and LTE. He is passionate about tech blogging, training and supporting early stage Indian start-ups. He is a founder member of two non-profit community platforms: IEDF and Devopedia. In 2013, he published a book on the history of digital technology: http://theinfinitebit.wordpress.com.