Designing circumvention protocols for speed, flexibility, and robustness
In working on circumvention protocols, I have repeatedly felt the need for a piece that is missing from our current designs. This document summarizes the problems I perceive, and how I propose to solve them.
In short, I think that every circumvention transport should incorporate some kind of session/reliability protocol—even the ones built on reliable channels that seemingly don't need it. It solves all kinds of problems related to performance and robustness. By session/reliability protocol, I mean something that offers a reliable stream abstraction, with sequence numbers and retransmissions, like QUIC or SCTP. Instead of a raw unstructured data stream, the obfuscation layer will carry encoded datagrams that are provided by the session/reliability layer.
When I say that circumvention transports should incorporate something like QUIC, for example, I don't mean that QUIC UDP packets are what we should send on the wire. No—I am not proposing a new outer layer, but an additional inner layer. We take the datagrams provided by the session/reliability layer, and encode them as appropriate for whatever obfuscation layer we happen to be using. So with meek, for example, instead of sending an unstructured blob of data in each HTTP request/response, we would send a handful of QUIC packets, encoded into the HTTP body. The receiving side would decode the packets and feed them into a local QUIC engine, which would reassemble them and output the original stream. A way to think about it is that the sequencing/reliability layer is the "TCP" to the obfuscation layer's "IP". The obfuscation layer just needs to deliver chunks of data, on a best-effort basis, without getting blocked by a censor. The sequencing/reliability layer builds a reliable data stream atop that foundation.
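To make the division of labor concrete, here is a minimal sketch in Go of the contract between the two layers. The interface and package names are hypothetical, invented only for illustration; they are not taken from any existing implementation.

```go
package turbotunnel // hypothetical package name, for illustration only

import "io"

// Carrier is everything the session/reliability layer needs from the
// obfuscation layer: best-effort delivery of opaque datagrams. The carrier
// may drop, reorder, duplicate, or delay them; its only real job is to not
// get blocked.
type Carrier interface {
	SendDatagram(p []byte) error
	RecvDatagram() ([]byte, error)
}

// Session is what the application (Tor, a VPN, ...) sees: an ordinary
// reliable, ordered byte stream, rebuilt by the session/reliability layer
// from whatever datagrams the carrier manages to deliver.
type Session interface {
	io.ReadWriteCloser
}
```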
I believe this design can improve existing transports, as well as enable new transports that are not possible now, such as those built on unreliable channels. Here is a list of selected problems with existing or potential transports, and how a sequencing/reliability layer helps solve them:
- Problem: Censors can disrupt obfs4 by terminating long-lived TCP connections, as Iran did in 2013, killing connections after 60 seconds.
- This problem exists because the obfs4 session is coupled with the TCP connection. The obfs4 session begins and ends exactly when the TCP connection does. We need an additional layer of abstraction, a virtual session that exists independently of any particular TCP connection. That way, if a TCP connection is terminated, it doesn't destroy all the state of the obfs4 session—you can open another TCP connection and resume where you left off, without needing to re-bootstrap Tor or your VPN or whatever was using the channel.
- Problem: The performance of meek is limited because it is half-duplex: it never sends and receives at the same time. This is because, while the bytes in a single HTTP request arrive in order, the ordering of multiple simultaneous requests is not guaranteed. Therefore, the client sends a request, then waits for the server's response before sending another, resulting in a delay of an RTT between successive sends.
- The session/reliability layer provides sequence numbers and reordering. Both sides can send data whenever is convenient, or as needed for traffic shaping, and any unordered data will be put back in order at the other end. A client could even split its traffic over two or more CDNs, with different latency characteristics, and know that the server will buffer and reorder the encoded packets to correctly recover the data stream.
- Problem: A Snowflake client can only use one proxy at a time, and that proxy may be slow or unreliable. Finding a working proxy is slow because each non-working one must time out before the next can be tried.
- The problem exists because even though each WebRTC DataChannel is reliable (DataChannel uses SCTP internally), there's no ordering between multiple simultaneous DataChannels on separate Snowflake proxies. Furthermore, if and when a proxy goes offline, we cannot tell whether the last piece of data we sent was received by the bridge or not—the SCTP ACK information is not accessible to us higher in the stack—so even if we reconnect to the bridge through another proxy, we don't know whether we need to retransmit the last piece of data or not. All we can do is tear down the entire session and start it up again from scratch. As in the obfs4 case, this problem is solved by having an independent virtual session that persists across transient WebRTC sessions. An added bonus is the opportunity to use more than one proxy at once, to increase bandwidth or as a hedge against one of them disappearing.
- Problem: DNS over HTTPS is an unreliable channel: it is reliable TCP up to the DoH server, but after that, recursive resolutions are plain old unreliable UDP. And as with meek, the ordering of simultaneous DNS-over-HTTPS requests is not guaranteed.
- Solved by retransmission in the session layer. There's no DNS pluggable transport yet, but I think some kind of retransmission layer will be a requirement for it. Existing DNS tunnel software uses various ad-hoc sequencing/retransmission protocols. I think that a proper user-space reliability layer is the "right" way to do it.
- Problem: Shadowsocks opens a separate encrypted TCP connection for every connection made to the proxy. If a web page loads resources from 5 third parties, then the Shadowsocks client makes 5 parallel connections to the proxy.
- This problem is really about multiplexing, not session/reliability, but several candidate session/reliability protocols additionally offer multiplexing, for example streams in QUIC, streams in SCTP, or smux for KCP. Tor does not have this problem, because Tor already is a multiplexing protocol, with multiple virtual circuits and streams in one TCP/TLS connection. But every system could benefit from adding multiplexing at some level. Shadowsocks, for example, could open up one long-lived connection, and each new connection to the proxy would only open up a new stream inside the long-lived connection. And if the long-lived connection were killed, all the stream state would still exist at both endpoints and could be resumed on a new connection.
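As a rough sketch of that multiplexing idea, here is how it might look with smux (github.com/xtaci/smux, the multiplexer from the KCP project). The carrier argument stands for the one long-lived obfuscated connection; how that connection is established, and all real error handling and data proxying, are left out of this sketch.

```go
package main

import (
	"io"
	"log"

	"github.com/xtaci/smux" // multiplexer from the KCP project
)

// runClient sets up one smux session over a single long-lived obfuscated
// connection ("carrier"), then opens a separate stream for each logical
// connection to the proxy.
func runClient(carrier io.ReadWriteCloser) {
	sess, err := smux.Client(carrier, nil) // nil = default smux config
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 5; i++ { // e.g. 5 third-party resources → 5 streams, 1 connection
		stream, err := sess.OpenStream()
		if err != nil {
			log.Fatal(err)
		}
		go func(s io.ReadWriteCloser) {
			defer s.Close()
			// ... proxy one logical connection over this stream ...
		}(stream)
	}
}

// runServer accepts streams from the session and treats each one as an
// independent incoming connection.
func runServer(carrier io.ReadWriteCloser) {
	sess, err := smux.Server(carrier, nil)
	if err != nil {
		log.Fatal(err)
	}
	for {
		stream, err := sess.AcceptStream()
		if err != nil {
			return
		}
		go func(s io.ReadWriteCloser) {
			defer s.Close()
			// ... handle one logical connection ...
		}(stream)
	}
}
```

If the carrier dies, only the carrier needs to be re-established; the stream state lives in the session layer at both endpoints.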
As an illustration of what I'm proposing, here's the protocol layering of meek (which sends chunks of the Tor TLS stream inside HTTP bodies), and where the new session/reliability layer would be inserted. Tor can remain oblivious to what's happening: just as before it didn't "know" that it was being carried over HTTP, it doesn't now need to know that it is being carried over QUIC-in-HTTP (for example).
[TLS]
[HTTP]
[session/reliability layer] ⇐ 🆕
[Tor]
[application data]
I've done a little survey and identified some suitable candidate protocols that also seem to have good Go packages: QUIC, SCTP, and KCP.
I plan to evaluate at least these three candidates and develop some small proofs of concept. The overall goal of my proposal is to liberate the circumvention context from particular network connections and IP addresses.
Related work
The need for a session and sequencing layer has been felt—and dealt with—repeatedly in many different projects. It has not yet, I think, been treated systematically or recognized as a common need. Systems typically implement some form of TCP-like SEQ and ACK numbers. The ones that don't are usually built on the assumption of one long-lived TCP connection, and therefore are really using the operating system's sequencing and reliability functions behind the scenes.
Here are a few examples:
- Code Talker Tunnel (a.k.a. SkypeMorph) uses SEQ and ACK numbers and mentions selective ACK as a possible extension. I think it uses the UDP 4-tuple to distinguish sessions, but I'm not sure.
- OSS used SEQ and ACK numbers and a random ID to distinguish sessions.
- I wasted time in the early development of meek grappling with sequencing, before punting by strictly serializing requests, sacrificing performance for simplicity. meek uses an X-Session-Id HTTP header to distinguish sessions.
- DNS tunnels all tend to do their own idiosyncratic thing. dnscat2, one of the better-thought-out ones, uses explicit SEQ and ACK numbers.
My position is that SEQ/ACK schemes are subtle enough and independent enough that they should be treated as a separate layer, not as an underspecified and undertested component of some specific system.
Psiphon can use obfuscated QUIC as a transport. It's directly using QUIC UDP on the wire, except that each UDP datagram is additionally obfuscated before being sent. You can view my proposal as an extension of this design: instead of always sending QUIC packets as single UDP datagrams, we allow them to be encoded/encapsulated into a variety of carriers.
MASQUE tunnels over HTTPS and can use QUIC, but is not really an example of the kind of design I'm talking about. It leverages the multiplexing provided by HTTP/2 (over TLS/TCP) or HTTP/3 (over QUIC/UDP). In HTTP/2 mode it does not introduce its own session or reliability layer (instead using that of the underlying TCP connection); and in HTTP/3 mode it directly exposes the QUIC packets on the network as UDP datagrams, instead of encapsulating them as an inner layer. That is, it's using QUIC as a carrier for HTTP, rather than HTTP as a carrier for QUIC. The main similarity I spot in the MASQUE draft is the envisioned connection migration which frees the circumvention session from specific endpoint IP addresses.
Mike Perry wrote a detailed summary of considerations for migrating Tor to use end-to-end QUIC between the client and the exit. What Mike describes is similar to what is proposed here—especially the subtlety regarding protocol layering. The idea is not to use QUIC hop-by-hop, replacing the TLS/TCP that is used today, but to encapsulate QUIC packets end-to-end, and use some other unreliable protocol to carry them hop-by-hop between relays. Tor would not be using QUIC as the network transport, but would use features of the QUIC protocol.
Anticipated questions
- Q: Why don't VPNs like Wireguard have to worry about all this?
- A: Because they are implemented in kernel space, not user space, they are, in effect, using the operating system's own sequencing and reliability features. Wireguard just sees IP packets; it's the kernel's responsibility to notice, for example, when a TCP segment needs to be retransmitted, and retransmit it. We should do in user space what kernel-space VPNs have been doing all along!
- Q: You're proposing, in some cases, to run a reliable protocol inside another reliable protocol (e.g. QUIC-in-obfs4-in-TCP). What about the reputed inefficiency of TCP-in-TCP?
- A: Short answer: don't worry about it. I think it would be premature optimization to consider at this point. The fact that the need for a session/reliability layer has been felt for so long by so many systems indicates that we should start experimenting, at least. There's contradictory information online as to whether TCP-in-TCP is as bad as they say, and anyway there are all kinds of performance settings we may tweak if it turns out to be a problem. But again: let's avoid premature optimization and not allow imagined obstacles to prevent us from trying.
- Q: QUIC is not just a reliability protocol; it also has its own authentication and encryption based on TLS. Do we need that extra complexity if the underlying protocol (e.g. Tor) is already encrypted and authenticated independently?
- A: The transport and TLS parts of QUIC are specified separately (draft-ietf-quic-transport and draft-ietf-quic-tls), so in principle they are separable and we could just use the transport part without encryption or authentication, as if it were SCTP or some other plaintext protocol. In practice, quic-go assumes right in the API that you'll be using TLS, so separating them may be more trouble than it's worth. Let's start with simple layering that is clearly correct, and only later start breaking abstraction for better performance if we need to.
- Q: What about traffic fingerprinting? If you simply obfuscate and send each QUIC packet as it is produced, you will leak traffic features through their size and timing, especially when you consider retransmissions and ACKs.
- A: Don't do that, then. There's no essential reason why the traffic pattern of the obfuscation layer needs to match that of the sequencing/reliability layer. Part of the job of the obfuscation layer is to erase such traffic features, if doing so is a security requirement. That implies, at least, not simply sending each entire QUIC packet as soon as it is produced, but padding, splitting, and delaying as appropriate. An ideal traffic scheduler would be independent of the underlying stream—its implementation would not even know how much actual data is queued to send at any time. But it's likely we don't have to be quite that rigorous in implementation, at least at this point.
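As a toy example of that kind of decoupling, a shaper might emit one constant-size cell at a constant rate, whether or not any session-layer packets are queued. Everything here—the framing, the schedule, the function names—is invented for illustration; a real obfuscation layer would choose these to match its cover protocol.

```go
package main

import (
	"encoding/binary"
	"time"
)

// shape drains queued session-layer packets and emits one fixed-size cell per
// tick, whether or not any real data is waiting. Each cell carries at most one
// packet behind a 2-byte length prefix; a zero length means padding only.
// This is a toy: a real shaper would split oversized packets rather than drop
// them, and would vary sizes and timing according to its cover protocol.
func shape(queue <-chan []byte, send func([]byte), cellLen int, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		cell := make([]byte, cellLen)
		select {
		case p := <-queue:
			if 2+len(p) > cellLen {
				continue // toy shortcut: drop instead of splitting
			}
			binary.BigEndian.PutUint16(cell[0:2], uint16(len(p)))
			copy(cell[2:], p)
		default:
			// nothing queued: the cell goes out as pure padding
		}
		send(cell) // constant size, constant timing, independent of payload
	}
}
```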
This document is also posted at https://www.bamsoftware.com/sec/turbotunnel.html.
-------------------------------------
Here is a branch of meek that uses the Turbo Tunnel concept: inside the HTTPS tunnel, it carries serialized QUIC packets instead of undifferentiated chunks of a data stream.
Using it works pretty much the same way as always—except you need to set
--quic-tls-cert and --quic-tls-key options on the server and a quic-tls-pubkey= option on the client, because QUIC has its own TLS layer (totally separate from the outer HTTPS TLS layer). See the server man page and client man page for details.
This code is more than a prototype or a proof of concept. The underlying meek code has been in production for a long time, and the Turbo Tunnel support code has by now been through a few iterations (see #14). However, I don't want to merge this branch into the mainline just yet, because it's not compatible with the protocol that meek clients use now, and because in preliminary testing it appears that the Turbo Tunnel–based meek, at least in the implementation so far, is slower than the existing protocol for high-bandwidth streams. More on this below.
For simplicity, in the branch I tore out the existing simple tunnel transport and replaced it with the QUIC layer. Therefore it's not backward compatible: a non–Turbo Tunnel client won't be able to talk to a Turbo Tunnel server. Of course it wouldn't take much, at the protocol level, to restore backward compatibility: have the Turbo Tunnel clients set a special request header or use a reserved
/quic URL path or something. But I didn't do that, because the switch to Turbo Tunnel really required a rearchitecting of the code, basically turning the logic inside-out. Before, the server received HTTP requests, matched them to an upstream TCP connection using the session ID, and fed the entire request body to the upstream. Now, the server deserializes packets from request bodies and feeds them into a QUIC engine, not immediately taking any further action. The QUIC engine then calls back into the application with "new session" and "new stream" events, after it has received enough packets to make those events happen. I actually like the Turbo Tunnel–based architecture better: you have your traditional top-level listener (a QUIC listener) and an accept loop; then alongside that you have an HTTP server that just does packet I/O on behalf of the QUIC engine. Arguably it's the traditional meek architecture that's inside-out. At any rate, combining both techniques in a single program would require factoring out a common interface from both.
The key to implementing a Turbo Tunnel–like design in a protocol is building an interface between discrete packets and whatever your obfuscation transport happens to be—a
PacketConn. On the client side, this is done by PollingPacketConn, which puts to-be-written packets in a queue, from where they are bundled up and sent in HTTP requests by a group of polling goroutines. Incoming packets are unbundled from HTTP responses and then placed in another queue from which the QUIC engine can handle them at its leisure. On the server side, the compatibility interface is QueuePacketConn, which is similar in that it puts incoming and outgoing packets in queues, but simpler because it doesn't have to do its own polling. The server side combines QueuePacketConn with ClientMap, a data structure for persisting sessions across logically separate HTTP request–response pairs. Connection migration, essentially. While connection migration is nice to have for obfs4-like protocols, it's a necessity for meek-like protocols. That's because each "connection" is a single HTTP request–response, and there needs to be something to link a sequence of them together.
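Here is a much-simplified sketch of the QueuePacketConn idea: a net.PacketConn whose packets come from and go to in-memory queues rather than a socket, with the HTTP handlers on one side and the QUIC engine on the other. The real QueuePacketConn in the branch does more bookkeeping; apart from the net.PacketConn methods themselves, the names here are illustrative only.

```go
package turbotunnel // illustrative package name

import (
	"net"
	"time"
)

type taggedPacket struct {
	buf  []byte
	addr net.Addr
}

// QueuePacketConn is a toy net.PacketConn backed by in-memory queues.
// Close and the deadline methods are no-ops in this sketch.
type QueuePacketConn struct {
	recv  chan taggedPacket // packets unbundled from HTTP request bodies
	send  chan taggedPacket // packets waiting to be bundled into responses
	local net.Addr
}

func NewQueuePacketConn(local net.Addr) *QueuePacketConn {
	return &QueuePacketConn{
		recv:  make(chan taggedPacket, 64),
		send:  make(chan taggedPacket, 64),
		local: local,
	}
}

// QueueIncoming is called by the HTTP handler for each decoded packet.
func (c *QueuePacketConn) QueueIncoming(p []byte, addr net.Addr) {
	c.recv <- taggedPacket{append([]byte(nil), p...), addr}
}

// OutgoingQueue is drained by the HTTP handler when building a response.
func (c *QueuePacketConn) OutgoingQueue() <-chan taggedPacket { return c.send }

// ReadFrom and WriteTo are what the QUIC engine calls; it neither knows nor
// cares that the "network" underneath is a sequence of HTTP exchanges.
func (c *QueuePacketConn) ReadFrom(p []byte) (int, net.Addr, error) {
	pkt := <-c.recv
	return copy(p, pkt.buf), pkt.addr, nil
}

func (c *QueuePacketConn) WriteTo(p []byte, addr net.Addr) (int, error) {
	c.send <- taggedPacket{append([]byte(nil), p...), addr}
	return len(p), nil
}

func (c *QueuePacketConn) Close() error                       { return nil }
func (c *QueuePacketConn) LocalAddr() net.Addr                { return c.local }
func (c *QueuePacketConn) SetDeadline(t time.Time) error      { return nil }
func (c *QueuePacketConn) SetReadDeadline(t time.Time) error  { return nil }
func (c *QueuePacketConn) SetWriteDeadline(t time.Time) error { return nil }
```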
Incidentally, it's conceivable that we could eliminate the need for polling in HTTP-tunnelled protocols using Server Push, a new feature in HTTP/2 that allows the server to send data without the client asking first. Unfortunately, the Go
http2 client does not support Server Push, so at this point it would not be easy to implement.
QUIC uses TLS for its own server authentication and encryption—this TLS is entirely separate from the TLS used by the outer domain-fronted HTTPS layer. In my past prototypes here and here, to avoid dealing with the TLS complication, I just turned off certificate verification. Here, I wanted to do it right. So on the server you have to generate a TLS certificate for use by the QUIC layer. (It can be a self-signed certificate and you don't have to get it signed by a CA. Again, the QUIC TLS layer has nothing to do with the HTTP TLS layer.) If you are running from torrc, for example, it will look like this:
ServerTransportPlugin meek exec /usr/local/bin/meek-server --acme-hostnames=meek.example.com --quic-tls-cert=quic.crt --quic-tls-key=quic.key
Then, on the client, you provide a hash of the server's public key (as in HPKP). I did it this way because it allows you to distribute public keys out-of-band and not have to rely on CAs or a PKI. (Only in the inner QUIC layer. The HTTPS layer still does certificate verification like normal.) In the client torrc, it looks like this:
Bridge meek 0.0.1.0:1 url=https://meek.example.com/ quic-tls-pubkey=JWF4kDsnrJxn0pSTXwKeYR0jY8rtd/jdT9FZkN6ycvc=
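For reference, a small helper along these lines could compute the quic-tls-pubkey= value from quic.crt, assuming the pin is the usual HPKP-style base64 of the SHA-256 of the DER-encoded SubjectPublicKeyInfo; check the man pages for the authoritative definition. It prints the base64 string to paste into the Bridge line.

```go
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	// Read the same certificate file that --quic-tls-cert points to.
	pemBytes, err := os.ReadFile("quic.crt")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM block found in quic.crt")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	// RawSubjectPublicKeyInfo is the DER-encoded public key structure.
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	fmt.Println(base64.StdEncoding.EncodeToString(sum[:]))
}
```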
The nice thing about an inner layer of QUIC providing its own crypto features is that you can send plaintext through the obfuscation tunnel, and the CDN middlebox won't be able to read it. It means you're not limited to carrying protocols like Tor that are themselves encrypted and authenticated.
The pluggable transports spec supports a feature called USERADDR that allows the PT server to inform the upstream application of the IP address of the connecting client. This is where Tor per-country bridge metrics come from, for example. The Tor bridge does a geolocation of the USERADDR IP address in order to figure out where clients are connecting from. The meek server has support for USERADDR, but I had to remove it in the Turbo Tunnel implementation because it would be complicated to do. The web server part of the server knows client IP addresses, but that's not the part that needs to provide the USERADDR. It only feeds packets into the QUIC engine, which some time later decides that a QUIC connection has started. That's the point at which we need to know the client IP address, but by then it's been lost. (It's not even a well-defined concept: there are several packets making up the QUIC handshake, and conceivably they could all have come in HTTP requests from different IP addresses—all that matters is their QUIC connection ID.) It may be possible to hack in at least heuristic USERADDR support, especially if the packet library gives a little visibility into the packet-level metadata (easier with kcp-go than with quic-go), but at this point I decided it wasn't worth the trouble.
Now about performance. I was disappointed after running a few performance tests. The turbotunnel branch was almost twice as slow as mainline meek, despite meek's slow, one-direction-at-a-time transport protocol. Here are sample runs of simultaneous upload and download of 10 MB. There's no Tor involved here, just a client–server connection through the listed tunnel.
| protocol | time |
|---|---|
| direct QUIC UDP | 3.7 s |
| TCP-encapsulated QUIC | 10.6 s |
| traditional meek | 23.3 s |
| meek with encapsulated QUIC | 34.9 s |
I investigated the cause, and as best I can tell, it's due to QUIC congestion control. Running the programs with
QUIC_GO_LOG_LEVEL=DEBUG turns up many messages of the form Congestion limited: bytes in flight 50430, window 49299, and looking at the packet trace, there are clearly places where the client and server have data to send (and the HTTP channel has capacity to send it), but they are holding back. In short, it looks like it's the problem that @ewust anticipated here ("This is also complicated if the timing/packet size layer tries to play with timings, and your reliability layer has some constraints on retransmissions/ACK timings or tries to do things with RTT estimation").
Of course bandwidth isn't the whole story, and it's possible that the Turbo Tunnel code has lower latency because it enables the client to send at any time. But it's disappointing to have something "turbo" that's slower than what came before, no matter the cause, especially as meek already had most of the benefits that Turbo Tunnel is supposed to provide, performance being the only dimension liable to improve. I checked and it looks like quic-go doesn't provide knobs to control how congestion control works. One option is to try another one of the candidate protocols, such as KCP, in the inner layer—converting from one to another is not too difficult.
Here are graphs showing the differences in bandwidth and TCP segment size. Above the axis is upload; below is download. Source code and data for the graphs are at meek-turbotunnel-test.zip.