Animats 5 days ago

Long polling has some problems of its own.

Second Life has an HTTPS long polling channel between client and server. It's used for some data that's too bulky for the UDP connection, not too time sensitive, or needs encryption. This has caused much grief.

On the client side, the poller uses libcurl. Libcurl has timeouts. If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.

On top of that, the real server is front-ended by an Apache server. This just passes through relevant requests, blocking the endless flood of junk HTTP requests from scrapers, attacks, and search engines. Apache has a timeout, and may close a connection that's in a long poll and not doing anything.

Additional trouble can come from middle boxes and proxy servers that don't like long polling.

There are a lot of things out there that just don't like holding an HTTP connection open. Years ago, a connection idle for a minute was fine. Today, hold a connection open for ten seconds without sending any data and something is likely to disconnect it.

The end result is an unreliable message channel. It has to have sequence numbers to detect duplicates, and can lose messages. For a long time, nobody had discovered that, and there were intermittent failures that were not understood.

In the original article, the chart section labelled "loop" doesn't mention timeout handling. That's not good. If you do long polling, you probably need to send something every few seconds to keep the connection alive. Not clear what a safe number is.
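
To make the sequence-number point concrete, here is a minimal receiver-side sketch. It is not Second Life's actual protocol: the class name, the numbering from 1, and the status strings are all assumptions for illustration. It only detects duplicates and gaps; recovering a lost message would still need a retransmit path on the sender.

```javascript
// Hypothetical receiver-side bookkeeping; every name here is illustrative.
class SequenceTracker {
  constructor() {
    this.nextExpected = 1;    // assumption: the sender numbers messages 1, 2, 3, ...
    this.missing = new Set(); // sequence numbers skipped so far
  }

  // Classify an incoming message by its sequence number.
  accept(seq) {
    if (seq < this.nextExpected) {
      // Either a repeat, or a message that was skipped earlier and arrived late.
      return { status: this.missing.delete(seq) ? "late" : "duplicate" };
    }
    const skipped = [];
    for (let s = this.nextExpected; s < seq; s++) {
      this.missing.add(s);
      skipped.push(s);
    }
    this.nextExpected = seq + 1;
    return skipped.length > 0 ? { status: "gap", skipped } : { status: "ok" };
  }
}
```

A "gap" result is the signal that something was lost, for example in the window between a client timeout and the next request.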

  • wutwutwat 5 days ago

    Every problem you just listed is 100% in your control and able to be configured, so the issue isn't long polling, it's your setup/configs. If your client (libcurl) times out a request, set the timeout higher. If apache is your web server and it disconnects idle clients, increase the timeout, tell it not to buffer the request and to pass it straight back to the app server. If there's a cloud lb somewhere (sounds like it because alb defaults to a 10s idle timeout), increase the timeouts...

    Every timeout in every hop of the chain is within your control to configure. Set up a subdomain and send long polling requests through that so the timeouts can be set higher and not impact regular http requests or open yourself up to slow client ddos.

    Why would you try to do long polling and not configure your request chain to be able to handle them without killing idle connections? The problems you have only exist because you're allowing them to exist. Set your idle timeouts higher. Send keepalives more often. Tell your web servers to not do request buffering, etc.

    All of that is extremely easy to test and verify functionality. Does the request live longer than your polling interval? Yes? Great you're done! No? Tune some more timeouts and log the request chain everywhere you can until you know where the problems lie. Knock them out one by one going back to the origin until you get what you want.

    Long polling is easy to get right from an operations perspective.

    • moritonal 5 days ago

      Whilst it's possible you may be correct, I do have to point out you are, I believe, lecturing John Nagle, known for Nagle's algorithm, used in most TCP stacks in the world.

      • Animats 5 days ago

        He has a valid criticism. It's not that it can't be fixed. It's that it's hard to diagnose.

        The underlying problems are those of legacy software. People long gone from Linden Lab wrote this part of Second Life. The networking system predates the widespread use of middle boxes. (It also predates the invention of conflict-free replicated data types, which would have helped some higher level consistency problems.) The internal diagnostic tools are not very helpful. The problem manifests itself as errors in the presentation of a virtual world, far from the network layer. What looked like trouble at the higher levels turned out to be, at least partially, trouble at the network layer.

        The developer who has to fix this wrote "This made me look into this part of the protocol design and I wish I hadn't."

        More than you ever wanted to know about this: [1] That discussion involves developers of four different clients, some of which talk to two different servers.

        (All this, by the way, is part of why there are very few big, seamless, high-detail metaverses. Despite all the money spent during the brief metaverse boom, nobody actually shipped a good one. There are some hard technical problems seen nowhere else. Somehow I seem to end up in areas like that.)

        [1] https://community.secondlife.com/forums/topic/503010-obscure...

      • motorest 5 days ago

        > Whilst it's possible you may be correct, I do have to point out you are, I believe, lecturing John Nagle, known for Nagle's algorithm, used in most TCP stacks in the world.

        Thank you for pointing that out. This thread alone is bound to become a meme.

        • wutwutwat 5 days ago

          [flagged]

          • moritonal 5 days ago

            I'm sorry, I didn't mean to embarrass you. I only meant to point out that people would err on the side of John likely knowing what they were talking about, whilst you seem to confidently have some misunderstandings in your comment.

            For example, you said "Every timeout in every hop of the chain is within your control to configure", but I'm quite confident that the WiFi router my office uses doesn't respect timeouts correctly, nor does my Phone's ISP, or certain VPNs. Those "hops" in the network are not within my control at all.

          • motorest 5 days ago

            > I better go kill myself from this embarrassment so my family doesn't have to live with my shame!

            There's no need to go to extremes, no matter how embarrassing and notably laughable your comment was. I'd say enjoy your fame.

            • KronisLV 4 days ago

              > There's no need to go to extremes

              Agreed, honestly if an argument is made and it makes sense, it doesn't matter who is on the other side.

              > no matter how embarrassing and notably laughable your comment was.

              I wouldn't even call the original comment laughable, they had a point - if you are in control of significant parts of the overall solution, then you can most likely mitigate most of the issues. And, while not a best practice, there's nothing really preventing you from sneaking in the occasional keepalive type response in the stream of events, if you deem it necessary.

              The less of the infrastructure and application you control, the more likely the other issues are to pop up. As usual, it depends on your circumstances and both the original comment and the response are valid in their own right. The "extreme" response was a bit more... I'm not sure, cringe? Oh well.

      • rezmason 5 days ago

        I bet there's an online college credit transfer program that'll accept this as a doctoral defense. Depending on how Nagle's finagled.

      • exe34 5 days ago

        Oh my gosh :-D

      • wutwutwat 5 days ago

        [flagged]

        • conradfr 5 days ago

          But it's easier to lecture someone on a bug they already diagnosed and explained to you.

          • wutwutwat 5 days ago

            We must be reading different comments

    • tbillington 5 days ago

      > Every timeout in every hop of the chain is within your control to configure.

      lol

      • wutwutwat 5 days ago

        I wasn't talking about network switch hops and if you're trying to do long polling and don't have control over the web servers going back to your systems then wtf are you trying to do long polling for anyway.

        I don't try to run red lights because I don't have control over the lights on the road.

        • earnestinger 5 days ago

          Thus, the advice to not run the red light..

          • wutwutwat 5 days ago

            [flagged]

            • luma 5 days ago

              Re-read the post, there’s more in the path than just your client and server code, and network switches aren’t the problem. The “middle boxes and proxy servers” are legion and you can only mitigate their presence.

              You’ve been offered the gift of wisdom. It’d be wise on your part to pay attention, because you clearly aren’t.

  • mike-cardwell 5 days ago

    That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue. Perhaps with the client specifying the last id it saw.

    • profmonocle 5 days ago

      And it's worth noting that you can't just ignore this problem if you're using websockets - websockets disconnect sometimes for a variety of reasons. It may be less frequent than a long-polling timeout, but if you don't have some mechanism of detecting that messages weren't ack'd and retransmitting them the next time the user connects, messages will get lost eventually.

    • lmm 5 days ago

      > That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue.

      How does that help? You can't pop from a queue over HTTP because when the client disconnects you don't know whether it saw your response or not.

      • profmonocle 5 days ago

        The next long-polling request can include a list of the ID(s) returned in the previous request. You keep the messages in the queue until you get the next request ack'ing them.
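
        A minimal sketch of that ack-on-next-request idea, assuming one queue per client and monotonically increasing ids. `Outbox`, `enqueue` and `poll` are hypothetical names, and the client is assumed to report only the highest id it has seen rather than a full list.

        ```javascript
        // Hypothetical per-client outbox; the names are illustrative.
        class Outbox {
          constructor() {
            this.nextId = 1;
            this.pending = []; // messages not yet acknowledged, oldest first
          }

          // Sender side: queue the message instead of writing it to a connection.
          enqueue(payload) {
            const message = { id: this.nextId++, payload };
            this.pending.push(message);
            return message.id;
          }

          // Called for each long-poll request. lastSeenId is the highest id the
          // client reports having received (0 if none). Anything at or below it
          // is treated as acknowledged and dropped; the rest is sent (again).
          poll(lastSeenId) {
            this.pending = this.pending.filter((m) => m.id > lastSeenId);
            return this.pending.slice();
          }
        }
        ```

        If a response is lost in transit, the client's next request carries the old id and the same messages go out again, so nothing is popped until it has been confirmed.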

        • thayne 5 days ago

          That means you have to keep them in the queue until the next time the client connects, which could be a very long time

          • jlokier 5 days ago

            To get reliable message delivery, you have to do that when using WebSockets or SSE too, because those also disconnect or time out depending on upstream network conditions outside the client's control, and will lose messages during reconnection if you don't have a sender-side retransmit queue.

            However, queued messages don't have to be kept for a very long time, usually. Because every connection method suffers from this problem, you wouldn't usually architect a system with no resync or reset strategy in the client when reconnection takes so long that it isn't useful to stream every individual message since the last connection.

            The client and/or server have a resync timeout, and the server's queue is limited to that timeout, plus margin for various delays.

            Once there's a resync strategy implemented, it is often reasonable for the server to be able to force a resync early, so it can flush message queues according to criteria other than a strict timeout, for example memory pressure or server restarts.
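
            A rough sketch of a retransmit queue bounded by a resync timeout, with an early flush. The 60 second value, the class and its method names are assumptions for illustration, and the clock is injectable only to keep the example easy to exercise.

            ```javascript
            // Hypothetical sender-side retransmit queue with a resync cutoff.
            // RESYNC_TIMEOUT_MS and every name here are assumptions.
            const RESYNC_TIMEOUT_MS = 60_000;

            class RetransmitQueue {
              constructor(now = Date.now) {
                this.now = now;          // injectable clock
                this.nextId = 1;
                this.entries = [];       // { id, payload, queuedAt }, oldest first
                this.highestDropped = 0; // highest id discarded without waiting for an ack
              }

              enqueue(payload) {
                this.entries.push({ id: this.nextId++, payload, queuedAt: this.now() });
              }

              // Drop anything older than the resync timeout.
              expire() {
                const cutoff = this.now() - RESYNC_TIMEOUT_MS;
                while (this.entries.length > 0 && this.entries[0].queuedAt < cutoff) {
                  this.highestDropped = this.entries.shift().id;
                }
              }

              // Flush early, e.g. under memory pressure or before a restart.
              flush() {
                if (this.entries.length > 0) {
                  this.highestDropped = this.entries[this.entries.length - 1].id;
                  this.entries = [];
                }
              }

              // On reconnect the client reports the last id it received.
              resume(lastSeenId) {
                this.expire();
                this.entries = this.entries.filter((e) => e.id > lastSeenId);
                if (lastSeenId < this.highestDropped) {
                  // Something the client never saw is gone: it must reload full state.
                  return { resync: true, lastId: this.nextId - 1, messages: [] };
                }
                return { resync: false, lastId: this.nextId - 1, messages: this.entries.slice() };
              }
            }
            ```

            A client told to resync reloads its state and then resumes from the returned `lastId`, so the queue never has to hold more than one timeout's worth of messages.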

            • thayne 5 days ago

              With websockets, the client can immediately acknowledge receipt of a message, since the connection is bidirectional. And on a resync the server can just send events that the client never acknowledged.

              • ghusbands 4 days ago

                That (to quote you) means you have to keep them in the queue until the next time the client connects, which could be a very long time.

                To be clear, there's no real difference - in both cases you have to keep messages in some queue and potentially resend them until they've been acknowledged.

          • mike-cardwell 4 days ago

            As the other guy said, that's why I mentioned using an ID. And as the other guy said, the same is required regardless of what channel you're using.

  • interroboink 5 days ago

    I'm new to websockets, please forgive my ignorance — how is sending some "heartbeat" data over long polling different from the ping/pong mechanism in websockets?

    I mean, in both cases, it's a TCP connection over (eg) port 443 that's being kept open, right? Intermediaries can't snoop the data if it's SSL, so all they know is "has some data been sent recently?" Why would they kill long-polling sessions after 10sec and not web socket ones?

    • Animats 5 days ago

      Sending an idle message periodically might help. But the Apache timeout for persistent HTTPS connections is now only 5 seconds.[1] So you need rather a lot of idle traffic if the server side isn't tolerant of idle connections.

      Why such a short timeout? Attackers can open connections and silently drop them to tie up resources. This is why we can't have nice things.

      [1] https://httpd.apache.org/docs/2.4/mod/core.html#keepalivetim...

      • ahoka 4 days ago

        Wow, 5 seconds is pretty aggressive! For example nginx has 60s as default, probably allowed by its event driven architecture, which mitigates some of the problems with “c10k” use cases.

        Anyways, the real takeaway is even if your current solution works now, one day someone will put something stupid between your server and the clients that will invalidate all current assumptions.

        For example I have created some service which consumes a very large NDJSON file over an HTTPS connection, which I expect to be open for half an hour at least, so I can process the content as a stream.

        I dread the day when I have to fight with someone’s IT to keep this possible.

  • Myrmornis 4 days ago

    > If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.

    I think it's a premise of reliable long-polling that the server can hold on to messages across the changeover from one client request to the next.

  • rednafi 5 days ago

    Yeah, some servers close connections when there’s no data transfer. When the backend holds the connection while polling the database until a timeout occurs or the database returns data, it needs to send something back to the client to keep the connection alive. I wonder what could be sent in this case and whether it would require special client-side logic.

    • mhitza 5 days ago

      In the HTTP model, technically status code 102 Processing would fit best. Though, no longer part of the HTTP specification.

      https://http.dev/102

      100 Continue could be usable as a workaround. Would probably require at a bare minimum some extra integration code on the client side.

ipnon 5 days ago

Articles like this make me happy to use Phoenix and LiveView every day. My app uses WebSockets and I don’t think about them at all.

  • pipes 5 days ago

    Is this similar to Microsoft's Blazor?

    • mattmanser 5 days ago

      MS's web socket solution is called SignalR.

      It is also fire and forget with fallback to http if web sockets aren't available. I believe if web sockets don't work it can fall back to http long polling instead, but don't quote me on that.

      All the downsides of web sockets mentioned in the article are handled for you. Plus you can re-use your existing auth solution. Easily plug logging stuff in, etc. etc. Literally all the problems mentioned by the author are dealt with.

      Given the author mentions C# is part of their stack I don't know why they didn't mention SignalR or use that instead of rolling their own solution.

      • whoknowsidont 5 days ago

        >Given the author mentions C# is part of their stack I don't know why they didn't mention signalr or use that instead of rolling their own solution.

        Not saying this is why but SignalR is notoriously buggy. I've never seen a real production instance that didn't have issues. I say that as someone who probably did one of the first real world, large scale roll outs of SignalR about a decade ago.

        • rswskg 4 days ago

          I believe it's changed a lot since then? I was a bit cynical of it from that initial experience but recent usage of it made it seem reliable and scalable.

      • hikarikuen 5 days ago

        To further clarify, I believe Blazor uses SignalR under-the-hood when doing server-side rendering. So the direct answer is probably "this is similar to a component used in Blazor"

        Edit: Whoops, I lost context here. Phoenix LiveView as a whole is probably pretty analogous to Blazor.

      • avgDev 4 days ago

        You are correct.

        I am using Blazor Server, and for some reason a server is not allowing WebSocket connections (troubleshooting this) and the app switches to long polling as fallback.

    • abrookewood 5 days ago

      Phoenix predates Blazor, but they are both server-side rendered frameworks.

      In terms of real time updates, Phoenix typically relies on LiveView, which uses web sockets and falls back to long-polling if necessary. I think SignalR is the closest equivalent in the .NET world.

    • pipes 5 days ago

      Odd, I wonder why I got down voted for it, it was a genuine question

  • dugmartin 5 days ago

    The icing on the cake is that you can also enable Phoenix channels to fall back to long polling in your endpoint config. The generator sets it to false by default.

  • cultofmetatron 5 days ago

    hah seriously. my app uses web sockets extensively but since we are also using Phoenix, it's never been a source of conflict in development. it really was just drop it and scale to thousands of users.

    • arrty88 5 days ago

      Why couldn’t nodejs with uWS library or golang + gorilla handle 10s of thousands of connections?

      • apitman 5 days ago

        I think GP's point is that they feel Phoenix is simpler to use than alternatives, not necessarily that it scales better.

        • cultofmetatron 5 days ago

          to clarify, it does scale better out of the box. a clustered websocket setup where reconnections can go to a different machine and resume state works out of the box. it's a LOT of work to do that in nodejs. I've done both.

  • zacksiri 5 days ago

    I was thinking this exact thing as I was reading the article.

  • wutwutwat 5 days ago

    every websocket setup is painless when running on a single server or handling very few connections...

    I was on the platform/devops/man w/ many hats team for an elixir shop running Phoenix in k8s. WS get complicated even in elixir when you have 2+ app instances behind a round robin load balancer. You now need to share broadcasts between app servers. Here's a situation you have to solve for w/ any app at scale regardless of language:

    app server #1 needs to send a publish/broadcast message out to a user, but the user who needs that message isn't connected to app server #1 that generated the message, that user is currently connected to app server #2.

    How do you get a message from one app server to the other one which has the user's ws connection?

    A bad option is sticky connections. User #1 always connects to server #1. Server #1 only does work for users connected to it directly. Why is this bad? Hot spots. Overloaded servers. Underutilized servers. Scaling complications. Forecasting problems. Goes against the whole concept of horizontal scaling and load balancing. It doesn't handle side-effect messages, ie user #1000 takes some action which needs to broadcast a message to user #1 which is connected to who knows where.

    The better option: You need to broadcast to a shared broker. Something all app servers share a connection to so they can themselves subscribe to messages they should handle, and then pass it to the user's ws connection. This is a message broker. postgres can be that broker, just look at oban for real world proof. Throw in pg's listen/notify and you're off to the races. But that's heavy from a resources per db conn perspective so let's avoid the acid db for this then. Ok. Redis is a good option, or since this is elixir land, use the built in distributed erlang stuff. But, we're not running raw elixir releases on linux, we're running inside of containers, on top of k8s. The whole distributed erlang concept goes to shit once the erlang procs are isolated from each other and not in their perfect Goldilocks getting started readme world. So ok, in containers in k8s, so each app server needs to know about all the other app servers running, so how do you do that? Hmm, service discovery! Ok, well, k8s has service discovery already, so how do I tell the erlang vm about the other nodes that I got from k8s etcd? Ah, a hex package cool. libcluster to the rescue https://github.com/bitwalker/libcluster

    So we'll now tie the boot process of our entire app to fetching the other app server pod ips from k8s service discovery, then get a ring of distributed erlang nodes talking to each other, sharing message passing between them, this way no matter which server the lb routes the user to, a broadcast from any one of them will be seen by all of them, and the one who holds the ws connection will then forward it down the ws to the user.
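
    The routing problem itself can be shown in a few lines, independent of Elixir. This is a single-process sketch only: `Broker` is an in-memory stand-in for whatever is shared (Redis, pg listen/notify, distributed erlang), and `AppServer`, `connect` and `sendToUser` are made-up names, not Phoenix APIs.

    ```javascript
    // In-memory stand-in for a shared broker. In production this would be an
    // external system that every app server connects to.
    class Broker {
      constructor() {
        this.subscribers = [];
      }
      subscribe(handler) {
        this.subscribers.push(handler);
      }
      publish(message) {
        for (const handler of this.subscribers) handler(message);
      }
    }

    // Hypothetical app server holding some users' websocket connections.
    class AppServer {
      constructor(name, broker) {
        this.name = name;
        this.broker = broker;
        this.connections = new Map(); // userId -> send function for that user's socket
        broker.subscribe((message) => this.onBroadcast(message));
      }

      connect(userId, send) {
        this.connections.set(userId, send);
      }

      // Any server can originate a message for any user.
      sendToUser(userId, body) {
        this.broker.publish({ userId, body });
      }

      // Every server sees every broadcast; only the one holding the socket forwards it.
      onBroadcast({ userId, body }) {
        const send = this.connections.get(userId);
        if (send) send(body);
      }
    }
    ```

    The code is the easy part. The operational cost described above comes from making the broker (or the node ring) discoverable, available at boot, and debuggable.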

    So now there's a non trivial amount of complexity and risk that was added here. More to reason about when debugging. More to consider when working on features. More to understand when scaling, deploying, etc. More things to potentially take the service down or cause it not to boot. More things to have race conditions, etc.

    Nothing is ever so easy you don't have to think about it.

    • conradfr 5 days ago

      Did you try the Redis adapter that ships with Phoenix.PubSub (which Channels use)?

      • wutwutwat 5 days ago

        Yes, and I mentioned redis above. You need a message broker with 2+ servers, be it pg, redis, or distributed erlang. It's still more complex than a single server setup, which was my point. It's easy to say you don't have to think about things working when you have 1 server, it's not easy to say that past 1 server, because the complexity added changes that picture.

        • davydog187 5 days ago

          Ok but in literally any other language you minimally need this setup to do PubSub.

          Elixir gives more options and lets you do it natively.

          Also, there are simpler options for clustering out there like https://github.com/phoenixframework/dns_cluster (Disclaimer: I am a contributor)

        • conradfr 5 days ago

          Ok it was not clear if you used Redis directly with some custom code or the integrated stuff.

          Anyway I agree that once you go with more than one server it's a whole new world but not sure if it's easier in any other language.

    • Muromec 5 days ago

      If only there was a way to send messages from one host to another in elixir

      • wutwutwat 5 days ago

        Did you even read my comment? I literally talk about how to do that, in multiple different ways.

        • abrookewood 5 days ago

          Were the servers part of an Elixir cluster? Because that functionality should be transparent. It sounds to me like you had them set up as two independent nodes that were not aware of each other.

  • leansensei 5 days ago

    Same here. It truly is a godsend.

  • tzumby 5 days ago

    I came here to say exactly this! Elixir and OTP (and by extension LiveView) are such a good match for the problem described in the post.

    • j45 5 days ago

      I was kind of wondering how something hadn't solved this at all, compared to a solution not readily being on one's path.

  • diggan 5 days ago

    Articles like this make me happy to use Microsoft FrontPage and cPanel, I don't think about HTTP or WebSockets at all.

CharlieDigital 6 days ago

Would there be any technical benefit to this over using server sent events (SSE)?

Both are similar in that they hold the HTTP connection open and have the benefit of being simply HTTP (the big plus here). SSE (at least to me) feels like it's more suitable for some use cases where updates/results could be streamed in.

A fitting use case might be where you're monitoring all job IDs on behalf of a given client. Then you could move the job monitoring loop to the server side and continuously yield results to the client.

  • vindex10 6 days ago

    I got interested and found this nice thread on SO: https://stackoverflow.com/a/5326159

    One of the drawbacks, as I learned: SSE is limited to about 6 open connections per browser and domain name. This can quickly become a limiting factor when you open the same web page in multiple tabs.

    • CharlieDigital 5 days ago

      As the other two comments mentioned, this is a restriction of HTTP/1.1, and it applies to long polling connections as well.

    • _heimdall 5 days ago

      Syncing state across multiple tabs and windows is always a bit tricky. For SSE, I'd probably reach for the BroadcastChannel API. Open the SSE connection in the first tab and have it broadcast events to any other open tab or window.

    • bn-l 5 days ago

      …if you’re using http/1.1. It’s not an issue with 2+

    • Klonoar 5 days ago

      Not an issue if you’re using HTTP/2 due to how multiplexing happens.

  • lunarcave 6 days ago

    Good point! We did consider SSE, but ultimately decided against it due to the way we have to re-implement response payloads (one for application/json and one for text/event-stream).

    I've not personally witnessed this, but people on the internets have said that _some_ proxies/LBs have problems with SSE due to the way it does buffering.

    • gorjusborg 5 days ago

      > we have to re-implement response payloads (one for application/json and one for text/event-stream)

      I am curious about what you mean here. The 'text/event-stream' allows for arbitrary event formats, it just provides structure for EventSource to be able to parse.

      You should only need one 'text/event-stream' and should be able to send the same JSON via normal or SSE response.
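
      As a small illustration of reusing one JSON payload for both response types: the `event:` and `data:` field names come from the SSE wire format, while `toSseEvent`, the `job` object and the event name are made up for this example.

      ```javascript
      // Serialize a payload as one Server-Sent Events message.
      function toSseEvent(payload, eventName) {
        const lines = [];
        if (eventName) lines.push(`event: ${eventName}`);
        // JSON.stringify without an indent argument escapes newlines inside
        // strings, so the payload always fits on a single data: line.
        lines.push(`data: ${JSON.stringify(payload)}`);
        // A blank line terminates the event.
        return lines.join("\n") + "\n\n";
      }

      const job = { id: "job-1", status: "done" };   // hypothetical payload
      const jsonBody = JSON.stringify(job);          // body for application/json
      const sseBody = toSseEvent(job, "job-update"); // chunk for text/event-stream
      ```

      On the client, the `data` property of the EventSource event holds the same string as `jsonBody`, so one `JSON.parse` path can serve both.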

      • josephg 5 days ago

        What the GP commenter might have meant is that websockets support binary message bodies. SSE does not.

  • csumtin 5 days ago

    I tried using SSE and found it didn't work for my use case, it was broken on mobile. When users switched from the browser to an app and back, the SSE connection was broken and they wouldn't receive state updates. Was easier to do long polling

    • josephg 5 days ago

      The standard way to fix that is to send ping messages every ~15 seconds or something over the SSE stream. If the client doesn’t get a ping in any 20 second window, assume the sse stream is broken somehow and restart it. It’s complex but it works.
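
      A sketch of the client half of that, using the 20 second window from above. `StreamWatchdog` and its methods are hypothetical names, and the `restart` callback is assumed to close and reopen the EventSource.

      ```javascript
      // Hypothetical client-side watchdog for an SSE stream.
      const STALE_AFTER_MS = 20_000; // window suggested in the comment above

      class StreamWatchdog {
        constructor(restart, now = Date.now) {
          this.restart = restart; // callback that tears down and reopens the stream
          this.now = now;         // injectable clock
          this.lastSeen = now();
        }

        // Call on every event received, including pings.
        touch() {
          this.lastSeen = this.now();
        }

        // Call periodically, e.g. setInterval(() => watchdog.check(), 1000).
        // Returns true when it decided the stream was dead and restarted it.
        check() {
          if (this.now() - this.lastSeen > STALE_AFTER_MS) {
            this.lastSeen = this.now(); // give the new stream a full window
            this.restart();
            return true;
          }
          return false;
        }
      }
      ```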

      The big downside of sse in mobile safari - at least a few years ago - is you got a constant loading spinner on the page. That's bad UX.

  • bythreads 5 days ago

    SSE are being removed from various stacks and implementations

    • sudodevnull 5 days ago

      That's the dumbest thing I've ever heard. SSE is just a response type for normal HTTP. Explain exactly what you mean, because that's like saying they are removing response types from HTTP.

yuliyp 5 days ago

I think this article is tying a lot of unrelated decisions to "Websocket" vs "Long-polling" when they're actually independent. A long-polling server could handle a websocket client with just a bit of extra work to handle keep-alive.

For the other direction, to support long-polling clients if your existing architecture is websockets which get data pushed to them by other parts of the system, just have two layers of servers: one which maintains the "state" of the connection, and then the HTTP server which receives the long polling request can connect to the server that has the connection state and wait for data that way.

  • harrall 5 days ago

    It sounded like the author(s) just had existing request-oriented code and didn’t want to rewrite it to be connection-oriented.

    Personally I would have enjoyed solving that problem instead of hacking around it but that’s me.

  • lunarcave 5 days ago

    Author here.

    Having done this, I don't think I'd reduce it to "just a little bit of work" to make it hum in production.

    Everything in between your UI components and the database layer needs to be reworked to work in the connection-oriented (Websockets) model of the world vs request-oriented world.

    • yuliyp 5 days ago

      > Everything in between your UI components and the database layer needs to be reworked to work in the connection-oriented (Websockets) model of the world vs request-oriented world.

      How so? As a minimal change, the thing on the server end of the websocket could just do the polling of your database on its own while the connection is open (using authorization credentials supplied as the websocket is being opened). If the connection dies, stop polling. This has the nice property that you're in full control of the refresh rate, can implement coordinated backoffs if the database is overloaded, etc.

      • lunarcave 4 days ago

        Yes, but that's one part of it though. For example, you have to:

        - change how you hydrate the initial state in the web component.
        - rework any request-oriented configurations you do at the edge based on the payloads. (For example, if you use cloudflare and use their HTTP rules, you have to rework that)

    • bythreads 5 days ago

      It's a good pattern to have a vanilla js network manager / layer in fe for this exact reason - makes swapping network technologies a lot simpler.

      Only that layer knows the URLs for endpoints, protocols and connections, and it proxies between them and your app / components.

wereHamster 5 days ago

Unrelated to the topic in the article…

    await new Promise(resolve => setTimeout(resolve, 500));

In Node.js context, it's easier to:

    import { setTimeout } from "node:timers/promises";
    await setTimeout(500);

  • hombre_fatal 5 days ago

    I haven't used that once since I found out that it exists.

    I just don't see the point. It doesn't work in the browser and it shadows global.setTimeout which is confusing. Meanwhile the idiom works everywhere.

    • joshmanders 5 days ago

      You can alias it if you're worried about shadowing.

          import { setTimeout as loiter } from "node:timers/promises";
          await loiter(500);

      • hombre_fatal 5 days ago

        Sure, and that competes with a universal idiom.

        To me it's kinda like adding a shallowClone(old) helper instead of writing const obj = { ...old }.

        But no point in arguing about it forever.

  • treve 5 days ago

    Is that easier? The first snippet is shorter and works on any runtime.

    • joshmanders 5 days ago

      In the context of Node.js, where op said, yes it is easier. But it's a new thing and most people don't realize timers in Node are awaitable yet, so the other way is less about "works everywhere" and more "this is just what I know"

      • wereHamster 5 days ago

        I guess most Node.js developers also don't realize that there's "node:fs/promises" so you don't have to use callbacks or manually wrap functions from "node:fs" with util.promisify(). Doesn't mean you need to stick with old patterns forever.

        When I said 'in the context of Node.js' I meant if you are in a JS module where you already import other node: modules, ie. when it's clear that code runs in a Node.js runtime and not in a browser. Of course when you are writing code that's supposed to be portable, don't use it. Or don't use setTimeout at all because it's not guaranteed to be available in all runtimes - it's not part of the ECMA-262 language specification after all.

baumschubser 5 days ago

I like long polling, it’s easy to understand from start to finish and from client perspective it just works like a very slow connection. You have to keep track of retries and client-side cancelled connections so that exactly one request (the right one) is at hand to answer.

One thing that seems clumsy in the code example is the loop that queries the data again and again. Would be nicer if the data update could also resolve the promise of the response directly.
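
A minimal sketch of that idea, assuming a single Node.js process and an in-memory registry. The names `waiters`, `waitForUpdate`, and `publishUpdate` are hypothetical and not taken from the article. The write path settles the parked request's promise itself, so the handler never re-queries in a loop:

```javascript
// Hypothetical in-memory registry: job id -> set of functions that settle
// the promises of long polls currently parked on that job.
const waiters = new Map();

// Called by the HTTP handler. Resolves with the update, or with null if
// nothing happened within timeoutMs (the handler then sends an empty reply).
function waitForUpdate(jobId, timeoutMs) {
  return new Promise((resolve) => {
    const pending = waiters.get(jobId) ?? new Set();
    waiters.set(jobId, pending);

    const settle = (value) => {
      clearTimeout(timer);
      pending.delete(settle);
      if (pending.size === 0 && waiters.get(jobId) === pending) {
        waiters.delete(jobId);
      }
      resolve(value);
    };
    const timer = setTimeout(() => settle(null), timeoutMs);
    pending.add(settle);
  });
}

// Called by whatever code changes the data. Wakes every request parked on
// this job and returns how many were woken.
function publishUpdate(jobId, update) {
  const pending = waiters.get(jobId);
  if (!pending) return 0;
  const count = pending.size;
  for (const settle of [...pending]) settle(update);
  return count;
}
```

An HTTP handler would `await waitForUpdate(jobId, 30000)` and reply with the result, while the code that changes a job calls `publishUpdate(jobId, job)`. With several server processes the wake-up has to cross process boundaries somehow, which this sketch does not cover.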

  • josephg 5 days ago

    Hard disagree. Long polling can have complex message ordering problems. You have completely different mechanisms for message passing from client-to-server and server-to-client. And middle boxes can and will stall long polled connections, stopping incremental message delivery. (Or you can use one http query per message - but that’s insanely inefficient over the wire).

    Websockets are simply a better technology. With long polling, the devil is in the details and it’s insanely hard to get those details right in every case.

    • wruza 5 days ago

      > you can use one http query per message - but that’s insanely inefficient over the wire

      Use one http response per message queue snapshot. Send no more than N messages at once. Send empty status if the queue is empty for more than 30-60 seconds. Send cancel status to an awaiting connection if a new connection opens successfully (per channel singleton). If needed, send and accept "last" id/timestamp. These are my usual rules for long-polling.

      Prevents: connection overhead, congestion latency, connection stalling, unwanted multiplexing, sync loss, respectively.
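
      A rough sketch of those rules as an in-memory channel, in Node.js. The `Channel` class, its method names, and the `status` values are hypothetical, and a real server would also trim messages the client has already acknowledged via `lastId`:

      ```javascript
      class Channel {
        constructor({ maxBatch = 50, idleMs = 30000 } = {}) {
          this.maxBatch = maxBatch; // "no more than N messages at once"
          this.idleMs = idleMs;     // reply with an empty status after this long
          this.messages = [];       // { id, body }, ids strictly increasing
          this.nextId = 1;
          this.waiter = null;       // at most one parked poll per channel
        }

        // One response carries a snapshot of the queue after lastId.
        snapshot(lastId) {
          const batch = this.messages
            .filter((m) => m.id > lastId)
            .slice(0, this.maxBatch);
          const last = batch.length ? batch[batch.length - 1].id : lastId;
          return { status: "messages", messages: batch, lastId: last };
        }

        // One long poll. Resolves with the body of exactly one HTTP response.
        poll(lastId = 0) {
          // A new connection cancels the one that is still waiting.
          if (this.waiter) this.waiter.finish({ status: "cancel", lastId: this.waiter.lastId });

          const ready = this.snapshot(lastId);
          if (ready.messages.length > 0) return Promise.resolve(ready);

          return new Promise((resolve) => {
            const waiter = {
              lastId,
              finish: (response) => {
                clearTimeout(timer);
                if (this.waiter === waiter) this.waiter = null;
                resolve(response);
              },
            };
            // Nothing to send for idleMs: answer anyway so nothing stalls.
            const timer = setTimeout(
              () => waiter.finish({ status: "empty", lastId }),
              this.idleMs
            );
            this.waiter = waiter;
          });
        }

        // Enqueue a message and wake the parked poll, if there is one.
        push(body) {
          this.messages.push({ id: this.nextId++, body });
          if (this.waiter) this.waiter.finish(this.snapshot(this.waiter.lastId));
        }
      }
      ```

      An HTTP handler would reply with whatever `poll(lastId)` resolves to, and the client sends the returned `lastId` back on its next request.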

      > You have completely different mechanisms for message passing from client-to-server and server-to-client

      Is this a problem? Why should this even be symmetric?

      • josephg 5 days ago

        You can certainly do all that. You also need to handle retransmission. And often you also need a way for the client to send back confirmations that each side received certain messages. So, as well as sequence numbers like you mentioned, you probably want acknowledgement numbers in messages too. (Maybe - it depends on the application).

        Implementing a stable, in-order, exactly once message delivery system on top of long polling starts to look a lot like implementing TCP on top of UDP. It's a solvable problem. I've done it - 14 years ago I wrote the first opensource implementation of the server side of Google's Browserchannel protocol, from back before websockets existed:

        https://github.com/josephg/node-browserchannel

        This supports long polling on browsers, all the way back to IE5.5. It works even when XHR isn't available! I wrote it in literate coffeescript, from back when that was a thing.

        But getting all of those little details right is really very difficult. It's a lot of code, and there are a lot of very subtle bugs lurking in this kind of code if you aren't careful. So you also need good, complex testing. You can see in that repo - I ended up with over 1000 lines of server code+comments (lib/server.coffee), and 1500 lines of testing code (test/server.coffee).

        And once you've got all that working, my implementation really wanted server affinity. Which made load balancing & failover across multiple application servers a huge headache.

        It sounds like your application allows you to simplify some details of this network protocol code. You do you. I just use websockets & server-sent events. Let TCP/IP handle all the details of in-order message delivery. It's really quite good.

        • wruza 5 days ago

          This is a common library issue, it doesn’t know and has to be defensive and featureful at the same time.

          Otoh, end-user projects usually know things and can make simplifying decisions. These two are incomparable. I respect the effort, but I also think that this level of complexity is a wrong answer to the call in general. You have to price-break requirements because they tend to oversell themselves and rarely feature-intersect as much as this library implies. Iow, when a client asks for guarantees, statuses or something we just tell them to fetch from a suitable number of seconds ago and see themselves. Everyone works like this, you need some extra - track it yourself based on your own metrics and our rate limits.

    • _nalply 5 days ago

      One of them, back in 2001, was that Netscape didn't render the page correctly while the connection was still open. Hah. I am sure this issue was fixed a long, long time ago, but perhaps there are other issues.

      Nowadays I prefer SSE to long polling and websockets.

      The idea is: the client doesn't know that the server has new data before it makes a request. With a very simple SSE the client is told that new data is there, then it can request the new data separately if it wants. That said, SSE has a few quirks. One of them is that on HTTP/1 the connection counts toward the limit of 6 concurrent connections per browser and domain, so if you have several tabs, you need a SharedWorker to share the connection between the tabs. But this quirk probably also applies to long polling and websockets. Another quirk: SSE can't transmit binary data and has some limitations in the textual data it can represent. But for this use case that doesn't matter.

      I would use websockets only if you have a real bidirectional data flow or need to transmit complex data.

      • snackbroken 5 days ago

        > if you have several tabs, you need a SharedWorker to share the connection between the tabs.

        You don't have to use a SharedWorker, you can also do domain sharding. Since the concurrent connection limit is per domain, you can add a bunch of DNS records like SSE1.example.org -> 2001:db8::f00; SSE2.example.org -> 2001:db8::f00; SSE3.example.org -> 2001:db8::f00; and so on. Then it's just a matter of picking a domain at random on each page load. A couple hundred tabs ought to be enough for anyone ;)
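
        A small sketch of the "pick a domain at random" step. The shard hostnames and the `/events` path are made up for illustration, and the commented-out browser line assumes the server sends CORS headers for the shard hosts:

        ```javascript
        // Hypothetical shard hostnames that all resolve to the same server.
        const SSE_SHARDS = ["sse1.example.org", "sse2.example.org", "sse3.example.org"];

        // Pick one shard per page load so that different tabs tend to land on
        // different hostnames and do not share one per-domain connection limit.
        // `random` is injectable only to make the function easy to test.
        function pickShardUrl(path, shards = SSE_SHARDS, random = Math.random) {
          const host = shards[Math.floor(random() * shards.length)];
          return `https://${host}${path}`;
        }

        // In the browser:
        // const source = new EventSource(pickShardUrl("/events"), { withCredentials: true });
        ```

        Random picks can still collide, so the number of shards needs to be comfortably larger than the number of tabs you expect.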

      • zazaulola 5 days ago

        WebSocket solves a very different problem. It may be only partially related to organizing two-way communication, but it has nothing to do with data complexity. Moreover, WS are not good enough at transmitting binary data.

        If you are using SSE and SW and you need to transfer some binary data from client to server or from server to client, the easiest solution is to use the Fetch API. `fetch()` handles binary data perfectly well without transformations or additional protocols.

        If the data in SW is large enough to require displaying the progress of the data transfer to the server, you will probably be more suited to `XMLHttpRequest`.

      • Cheezmeister 5 days ago

        Streaming SIMD Extensions?

        Server-sent events.

        • crop_rotation 5 days ago

          Streaming SIMD Extensions seems very unlikely to have any relevance in the above statement, server-sent events is the perfect fit.

  • moribvndvs 5 days ago

    You could have your job status update push an update into an in-memory or distributed cache and check that in your long poll rather than a DB lookup, but that may require adding a bunch of complexity to wire the completion of the task to updating said cache. If your database is tuned well and you don’t have any other restrictions (e.g. serverless where you pay by the IO), it may be good enough and come out in the wash.

rednafi 5 days ago

Neither Server-Sent Events nor WebSockets have replaced all use cases of long polling reliably. The connection limit of SSE comes up a lot, even if you’re using HTTP/2. WebSockets, on the other hand, are unreliable as hell in most environments. Also, WS is hard to debug, and many of our prod issues with WS couldn’t even be reproduced locally.

Detecting changes in the backend and propagating them to the right client is still an unsolved problem. Until then, long polling is surprisingly simple and a robust solution that works.

  • pas 5 days ago

    Robust WS solutions need a fallback anyway, and unless you are doing something like Discord long polling is a reasonable option.

  • infamia 5 days ago

    > The connection limit of SSE comes up a lot, even if you’re using HTTP/2.

    I'm considering using SSE for an app. I'm curious, what problems have you run into? At least the docs say you get 100 connections between the server and a client, but it seems that can be negotiated higher if needed?

    https://developer.mozilla.org/en-US/docs/Web/API/EventSource

    • rednafi 5 days ago

      SSEs are great and more reliable than websockets at a smaller scale. So I'd reach for them despite the issues. That being said, some webservers don't play well with SSE and you'll need to fiddle with them. If you control the webserver, then it's not much of a problem.

imglorp 5 days ago

Since the article mentioned Postgres by name, isn't this a case for using its asynchronous notification features? Servers can LISTEN to a channel and PG can TRIGGER and NOTIFY them when the data changes.

No polling needed, regardless of the frontend channel.
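
A rough sketch of that wiring in Node.js. It assumes a `jobs` table with `id` and `cluster_id` columns, a channel called `job_updates`, and a connected node-postgres (`pg`) client. All of those names are illustrative and not taken from the article:

```javascript
// SQL to run once (assumed schema: jobs(id, cluster_id, ...)).
// The payload carries ids only, because NOTIFY payloads are size-limited.
const TRIGGER_SQL = `
CREATE OR REPLACE FUNCTION notify_job_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify(
    'job_updates',
    json_build_object('clusterId', NEW.cluster_id, 'jobId', NEW.id)::text
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER jobs_notify
AFTER INSERT OR UPDATE ON jobs
FOR EACH ROW EXECUTE FUNCTION notify_job_change();
`;

// `client` is assumed to be a connected pg Client, or anything exposing the
// same query() method and 'notification' event. `channel` must be a trusted
// identifier because it is interpolated into the LISTEN statement.
async function listenForJobs(client, onJob, channel = "job_updates") {
  client.on("notification", (msg) => {
    if (msg.channel !== channel || !msg.payload) return;
    let event;
    try {
      event = JSON.parse(msg.payload);
    } catch {
      return; // ignore payloads that are not valid JSON
    }
    onJob(event); // e.g. wake the long poll waiting on event.clusterId
  });
  await client.query(`LISTEN ${channel}`);
}
```

Notifications only reach sessions that are connected and listening at that moment, so a server that reconnects still has to query once to catch up on anything it missed.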

  • lunarcave 5 days ago

    Yes, but the problem of detecting that changeset and delivering it to the right connection remains to be solved in the app layer.

  • cluckindan 5 days ago

    It would be easier to run Hypermode’s Dgraph as the database and use GraphQL subscriptions from the frontend. But nobody ever got fired for choosing postgres.

    • j45 5 days ago

      I have relatively recently taken steps towards Postgres because of its ability to be at the center of so much until a project outgrows it.

      In terms of not getting fired - Postgres is a lot more innovative than most databases, and more so than the IBM insinuation suggests.

      By innovative I mean uniquely putting in performance related items for the last 10-20 years.

bigbones 5 days ago

I don't know how meaningful it is any more, but with long polling with a short timeout and a gracefully ended request (i.e. chunked encoding with an eof chunk sent rather than disconnection), the browser would always end up with one spare idle connection to the server, making subsequent HTTP requests for other parts of the UI far more likely to be snappier, even if the app has been left otherwise idle for half the day

I guess at least this trick is still meaningful where HTTP/2 or QUIC aren't in use

bartvk 6 days ago

Refreshing to be reminded of a relatively simple alternative to websockets. For a short time, I worked at a now-defunct startup which had made the decision for websockets. It was an app that would often be used on holiday so testing was done on hotel and restaurant wifi. Websockets made that difficult.

  • ipnon 5 days ago

    I feel like WebSockets are already as simple as it gets. It's "just" an HTTP request with an indeterminate body. Just make an HTTP request and don't close the connection. That's a WebSocket.

    • bluepizza 5 days ago

      It's surprisingly complex.

      Connections are dropped all the time, and then your code, on both client and server, needs to account for retries (will the reconnection use a cached DNS entry? how will load balancing affect long term connections?), potentially missed events (now you need a delta between pings), DDoS protections (is this the same client connecting from 7 IPs in a row or is this a botnet), and so on.

      Regular polling greatly reduces complexity on some of these points.

      • slau 5 days ago

        Long polling has nearly all the same disadvantages. Disconnections are harder to track, DNS works exactly the same for both techniques, as does load balancing, and DDoS is specifically about different IPs trying to DoS your system, not the same IP creating multiple connections, so irrelevant to this discussion.

        Yes, WS is complex. Long polling is not much better.

        I can’t help but think that if front end connections are destroying your database, then your code is not structured correctly. You can accept both WS and long polls without touching your DB, having a single dispatcher then send the jobs to the waiting connections.

        • bluepizza 5 days ago

          My understanding is that long polling has these issues handled by assuming the connection will be regularly dropped.

          Clients using mobile phones tend to have their IPs rapidly changed in sequence.

          I didn't mention databases, so I can't comment on that point.

          • josephg 5 days ago

            Well, it’s the same in both cases. You need to handle disconnection and reconnection. You need a way to transmit missed messages, if that’s important to you.

            But websockets also guarantee in-order delivery, which is never guaranteed by long polling. And websockets play way better with intermediate proxies - since nothing in the middle will buffer the whole response before delivering it. So you get better latency and better wire efficiency. (No http header per message).

            • bluepizza 5 days ago

              That very in order guarantee is the issue. It can't know exactly where the connection died, which means that the client must inform the last time it received an update, and the server must then crawl back a log to find the pending messages and redispatch them.

              At this point, long polling seems to carry more benefits, IMHO. WebSockets seem to be excellent for stable conditions, but not quite what we need for mobile.

              • naasking 3 days ago

                > It can't know exactly where the connection died, which means that the client must inform the last time it received an update, and the server must then crawl back a log to find the pending messages and redispatch them.

                I don't see how this is meaningfully different for long polling. The client could have received some updates but never ack'd it successfully over a long poll, so either way you need to keep a log and resync on reconnection.

      • sn0wtrooper 5 days ago

        If a connection is closed, isn't it the browser's responsibility to resolve DNS when you open it again?

peheje 5 days ago

What about HTTP/2 Multiplexing, how does it hold up against long-polling and websockets?

I have only tried it briefly when we use gRPC: https://grpc.io/docs/what-is-grpc/core-concepts/#server-stre...

Here it's easy to specify that an endpoint is a "stream", and then the code-generation tool gives you pretty much all the tools to keep serving the client with multiple responses. It looks deceptively simple. We have already set up auth, logging and metrics for gRPC, so I hope it just works off of that, maybe with minor adjustments. But I'm guessing you don't need the gRPC layer to use HTTP/2 Multiplexing?

  • toast0 5 days ago

    At least in a browser context, HTTP/2 doesn't address server to client unsolicited messages. So you'd still need a polling request open from the client.

    HTTP/2 does specify a server push mechanism (PUSH_PROMISE), but afaik, browsers don't accept them and even if they did, (again afaik) there's no mechanism for a page to listen for them.

    But if you control the client and the server, you could use it.

  • yencabulator 5 days ago

    gRPC as specced to ride directly on top of HTTP/2 doesn't work from browsers, the sandboxed JS isn't allowed that level of control over the protocol. And often is too low level to implement as part of a pre-existing HTTP server, too. gRPC is a server-to-server protocol that is not part of the usual Web, but happens to repurpose HTTP/2.

    Outside of gRPC, just HTTP POST cannot at this time replace websockets because the in-browser `fetch` API doesn't support streaming request body. For now, websockets is the only thing that can natively provide an ordered stream of messages from browser to server.

  • BitPirate 5 days ago

    With RFC8441 websockets are just HTTP/2 streams.

vouwfietsman 5 days ago

The points mentioned against websockets are mostly fud, I've used websockets in production for a very heavy global data streaming application, and I would respond the following to the "upsides" of not using websockets:

> Observability Remains Unchanged

Actually it doesn't, many standard interesting metrics will break because long-polling is not a standard request either.

> Authentication Simplicity

Sure, auth is different than with http, but not more difficult. You can easily pass a token.

> Infrastructure Compatibility

I'm sure you can find firewalls out there where websockets are blocked, however for my use case I have never seen this reported. I think this is outdated, for sure you don't need "special proxy configurations or complex infrastructure setups".

> Operational Simplicity

Restarts will drop any persistent connection, state can be both or neither in WS or in LP, it doesn't matter what you use.

> Client implementation

It mentions "no special WebSocket libraries needed" and also "It works with any HTTP client". Guess what, websockets will work with any websocket client! Who knew!

Finally, in the conclusion:

> For us, staying close to the metal with a simple HTTP long polling implementation was the right choice

Calling simple HTTP long polling "close to the metal" in comparison to websockets is weird. I wouldn't be surprised if websockets scale much better and give much more control depending on the type of data, but that's beside the point. If you want to use long polling because you prefer it, go ahead. It's a great way to stick to request/response style semantics that web devs are familiar with. It's not necessary to regurgitate a bunch of random hearsay arguments that may influence people in the wrong way.

Try to actually leave the reader with some notion of when to use long polling vs when to use websockets, not a post-hoc justification of your decision based on generalized arguments that do not apply.

  • amatuer_sodapop 5 days ago

    > > Observability Remains Unchanged

    > Actually it doesn't, many standard interesting metrics will break because long-polling is not a standard request either.

    As a person who works in a large company handling millions of websockets, I fundamentally disagree with discounting the observability challenges. WebSockets completely transform your observability stack - they require different logging patterns, new debugging approaches, different connection tracking, and change how you monitor system health at scale. Observability is far more than metrics, and handwaving away these architectural differences doesn't make the implementation easier.

    • vouwfietsman 4 days ago

      > with discounting the observability challenges

      > handwaving away these architectural differences

      I am doing neither of these things. I am only saying you will have observability problems whether you do LP or WS, because you are stepping away from the request/response model that most tools work with. As such, it's weird to argue that "observability remains unchanged".

vitus 5 days ago

I would appreciate if the article spent more time actually discussing the benefits of websockets (and/or more modern approaches to pushing data from server -> browser) and why the team decided those benefits were not worth the purported downsides. I could see the same simplicity argument being applied to using unencrypted HTTP/1.1 instead of HTTP/2, or TCP Reno instead of CUBIC.

The section at the end talking about "A Case for Websockets" really only rehashes the arguments made in "Hidden Benefits of Long-Polling" stating that you need to reimplement these various mechanisms (or just use a library for it).

My experience in this space is from 2011, when websockets were just coming onto the scene. Tooling / libraries were much more nascent, websockets had much lower penetration (we still had to support IE6 in those days!), and the API was far less stable prior to IETF standardization. But we still wanted to use them when possible, since they provided much better user experience (lower latency, etc) and lower server load.

LeicaLatte 5 days ago

Long polling is my choice for simple, reliable and plug and play like interfaces. HTTP requests tend to be standard and simplify authentication as well. Systems with frequent but not constant updates are ideal. Text yes. Voice maybe not.

Personal Case Study: I built mobile apps which used Flowise assistants for RAG and found websockets completely out of line with the rest of my system and interactions. Suddenly I was fitting a round peg in a square hole. I switched to OpenAI assistants and their polling system felt completely "natural" to integrate.

feverzsj 5 days ago

Why not just use chunked encoding and get rid of extra requests.

Cort3z 5 days ago

I think they are mixing some problems here. They could probably have used their original setup with Postgres NOTIFY+triggers instead of polling, and only have one "pickup poller" to catch any missed events/jobs. In my opinion the transaction medium should not be linked to how the data is managed internally, but I know from experience that this separation is often hard to achieve in practice.

emilio1337 5 days ago

The article does discuss a lot of mixed concepts. I would prefer one process polling new jobs/state and one process handling http connections/websockets. Hence no flooding the database and completely scalable from the client side. The database process pushes everything downstream via some queue while the other process/server handles those and sends them to respective clients

sgarland 5 days ago

The full schema isn’t listed, but the indices don’t make sense to me.

(id, cluster_id) sounds like it could / should be the PK

If the jobs are cleared once they’ve succeeded, and presumably retried if they’ve failed or stalled, then the table should be quite small; so small, that a. The query planner is unlikely to use the partial index on (status) b. The bloat from the rapidity of DELETEs likely overshadows the live tuple size.

DougN7 5 days ago

I implemented a long polling solution in desktop software over 20 years ago and it’s still working great. It can even be used as a tunnel to stream RDP sessions, through which YouTube can play without a hiccup. Big fan of long polling, though I admit I didn’t get a chance to try web sockets back then.

  • jclarkcom 5 days ago

    I did the same, were you at VMware by any chance? At the time it was the only way to get compatibility with older browsers.

gloosx 5 days ago

>Our system handles hundreds of worker nodes constantly polling our PostgreSQL-backed control plane for new jobs

Does everybody poll their PostgreSQL to get new rows in real-time? This is really weird, there are trigger functions and notifications.

  • lunarcave 4 days ago

    In pg, each unique `listen(channel)` takes up server resources, and if you don't reliably clean them up, everything comes to a screeching halt.

    There's also `max_notify_queue_pages`

    >Specifies the maximum amount of allocated pages for NOTIFY / LISTEN queue. The default value is 1048576.

k__ 5 days ago

Half-OT:

What's the most resource efficient way to push data to clients over HTTP?

I can send data to a server via HTTP request, I just need a way to notify a client about a change and would like to avoid polling for it.

I heard talk about SSE, WebSockets, and now long-polling.

Is there something else?

What requires the least resources on the server?

  • mojuba 5 days ago

    I don't think any of the methods give any significant advantage since in the end you need to maintain a connection per each client. The difference between the methods boils down to complexity of implementation and reliability.

    If you want to reduce server load then you'd have to sacrifice responsiveness, e.g. you perform short polls at certain intervals, say 10s.

    • k__ 5 days ago

      Okay, thanks.

      What's the least complex to implement then?

      • mojuba 5 days ago

        For the browser and if you need only server-to-client sends, I assume SSE would be the best option.

        For other clients, such as mobile apps, I think long poll would be the simplest.

amelius 5 days ago

> Corporate firewalls blocking WebSocket connections was one of our other worries. Some of our users are behind firewalls, and we don't need the IT headache of getting them to open up WebSockets.

Don't websockets look like ordinary https connections?

  • toast0 5 days ago

    Some corporate firewalls MITM all https connections. Websocket does not look normal once you've terminated TLS.

    • amelius 5 days ago

      Can websites detect this?

      • toast0 5 days ago

        AFAIK, only by symptoms. If https fetches work and websockets don't, that's a sign. HSTS and assorted reporting can help a bit in aggregate, but not if the corporate MITM CA has been inserted into the browser's trusted CA list. I don't think there's an API to get certificate details from the browser side to compare.

        A proxy may have a different TLS handshake than a real browser would, depending on how good the MITM is, but the better they are, the more likely it is that websockets work.

  • doublerabbit 5 days ago

    It does. However DPI firewalls look at and block the upgrade handshake.

        Connection: Upgrade
        Upgrade: websocket

mojuba 5 days ago

Can someone explain why TTL = 60s is a good choice? Why not more, or less?

  • notatoad 5 days ago

    i can't speak for why the author chose it, but if you're operating behind AWS cloudfront then http requests have a maximum timeout of 60s - if you don't close the request within 60s, cloudfront will close it for you.

    i suspect other firewalls, cdns, or reverse proxy products will all do something similar. for me, this is one of the biggest benefits of websockets over long-polling: it's a standard way to communicate to proxies and firewalls "this connection is supposed to stay open, don't close it on me"

tguvot 5 days ago

Another reason: there is a patent troll suing companies over usage of websockets.

justinl33 5 days ago

yeah, authentication complexity with WebSockets is severely underappreciated. We ran into major RBAC headaches when clients needed to switch between different privilege contexts mid-session. Long polling with standard HTTP auth patterns eliminates this entire class of problems.

  • watermelon0 5 days ago

    Couldn't you just disconnect and reconnect websocket if privileges change, since the same needs to be done with the long polling?

    • josephg 5 days ago

      Yeah, and you can send cookies in the websocket connection headers. This used to be a problem in some browsers iirc - they wouldn’t send cookies properly over websocket connection requests.

      As a workaround in one project I wrote JavaScript code which manually sent cookies in the first websocket message from the client as soon as a connection opened. But I think this problem is now solved in all browsers.

sneak 5 days ago

Given long polling, I have never ever understood why websockets became a thing. I’ve never implemented them and never will, it’s a protocol extension where none is necessary.

  • ekkeke 5 days ago

    Websockets can operate outside the request/response model used in this long polling example, and allow you to stream data continuously. They're also a lot more efficient in terms of framing and connections if there are a lot of individual pieces of data to push, as you don't need to spin up a connection + request for each bit.

valenterry 5 days ago

Using websockets with graphql, I feel like a lot of the challenges are then already solved. From the post:

- Observability: WebSockets are more stateful, so you need to implement additional logging and monitoring for persistent connections: solved with graphql if the existing monitoring is already sufficient.

- Authentication: You need to implement a new authentication mechanism for incoming WebSocket connections: solved with graphql.

- Infrastructure: You need to configure your infrastructure to support WebSockets, including load balancers and firewalls: True, firewalls need to be updated.

- Operations: You need to manage WebSocket connections and reconnections, including handling connection timeouts and errors: normally already solved by the graphql library. For errors, it's basically the same though.

- Client Implementation: You need to implement a client-side WebSocket library, including handling reconnections and state management: Just have to use a graphql library that comes with websocket support (I think most of them do) and configure it accordingly.

  • anonzzzies 5 days ago

    I hope (never needed this) there are client implementations that do all this for you and pick the best transport based on what the client supports? Not sure why the transport is interesting when/if you have freedom to choose.

    • josephg 5 days ago

      Yeah there’s plenty of high quality websocket client libraries in all languages now. Support and features are excellent. And they’ve been supported in all browsers for a decade or something at this point.

      I vomit in my mouth a bit whenever people reach for socket.io or junk like that. You don’t want or need the complexity and bugs these libraries bring. They’re obsolete.

  • _heimdall 5 days ago

    Using graphql comes with its own list of challenges and issues though. It's a good solution for some situations, but it isn't so universal that you can just switch to it without a problem.

    • valenterry 5 days ago

      I didn't claim that it would be a universal solution. It was just an observation of mine, especially in the context of the mentioned comparison of their long polling approach vs ElectricSQL.

  • rednafi 5 days ago

    Where did graphql come from? It doesn’t solve any of the problems mentioned here.