I’m opening this topic in order to explain the current situation and roadmap of the peer-to-peer connectivity between nodes. In other words, this is about how nodes talk to each other.
Once a connection is established, the nodes use an encryption protocol and a multiplexing protocol, then open substreams, and so on. But this is not what this topic is about. This topic is just about connection establishment, which on its own is surprisingly complex.
Current situation
There exist three ways (three protocols) that two nodes can use to connect to each other:
- Plain TCP. This is represented through the multiaddr /ip4/1.2.3.4/tcp/30333.
- WebSocket. This is represented through the multiaddr /ip4/1.2.3.4/tcp/30333/ws.
- Secure WebSocket. This is represented through the multiaddr /ip4/1.2.3.4/tcp/30333/wss.
(note: it is also possible to use a DNS address instead of an IP address, or even dnsaddr, but this is all off-topic)
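As an illustration of the three address shapes listed above, here is a minimal Python sketch that classifies a multiaddr by its transport. The function name `classify_multiaddr` is hypothetical, and the parsing is deliberately simplified: it only understands the `<host protocol>/<host>/tcp/<port>` shape with an optional `/ws` or `/wss` suffix and an optional trailing `/p2p/<peer id>`.

```python
def classify_multiaddr(addr: str) -> str:
    """Return 'tcp', 'ws' or 'wss' for a simplified multiaddr string."""
    parts = addr.strip("/").split("/")
    # Drop a trailing /p2p/<peer id>, which bootnode addresses usually carry.
    if len(parts) >= 2 and parts[-2] == "p2p":
        parts = parts[:-2]
    # Expected shape: <ip4|ip6|dns...>/<host>/tcp/<port>[/ws|/wss]
    if len(parts) not in (4, 5) or parts[2] != "tcp" or not parts[3].isdigit():
        raise ValueError(f"unsupported multiaddr: {addr}")
    if len(parts) == 4:
        return "tcp"
    if parts[4] in ("ws", "wss"):
        return parts[4]
    raise ValueError(f"unsupported multiaddr: {addr}")


print(classify_multiaddr("/ip4/1.2.3.4/tcp/30333"))      # plain TCP
print(classify_multiaddr("/ip4/1.2.3.4/tcp/30333/ws"))   # WebSocket
print(classify_multiaddr("/ip4/1.2.3.4/tcp/30333/wss"))  # Secure WebSocket
```

A real implementation would use a proper multiaddr library rather than string splitting; the point here is only that the transport is fully determined by the suffix of the address.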
WebSocket secure
Substrate can establish outgoing connections of all three kinds, but doesn’t support incoming Secure WebSocket (WSS) connections. In order to use WSS, you are supposed to start a node that listens for (non-secure) WebSocket connections, then add a reverse proxy in front of that node.
The reason why Substrate doesn’t support incoming WSS connections is to avoid the UX complexity of storing certificates and supporting Let’s Encrypt. We expect node operators to be familiar with, for example, how to set up an nginx reverse proxy, and if they’re not, they can find hundreds of online tutorials about how to do that. If Substrate directly supported certificates, on the other hand, we’d have to write extensive documentation about this.
Note, however, that this is just a choice, and technically speaking there’s no reason why we couldn’t support them.
Reachability from browsers
Initially, only plain TCP connections were supported. We added support for WebSocket in order to experiment with browser-embedded light clients that directly connect from the browser to the peer-to-peer network. Because browsers don’t allow web pages to establish plain TCP connections but only WebSocket, we had to add support for WebSocket on the server side.
By default, Substrate currently listens for plain TCP connections on port 30333 if you pass --validator, and for WebSocket connections on port 30333 if you don’t. The history behind this behavior is: after adding support for WebSocket, very few nodes were actually listening for WebSocket connections, and it was very difficult for browser-embedded light clients to find nodes to connect to. We switched to listening on WebSocket by default, but as a safety measure we decided to not do that for validators, in order to prevent potential DDoS vectors. In retrospect, this safety measure wasn’t justified, but so far we haven’t changed this behavior again.
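The default described above can be summarized in a few lines of Python. This is only a sketch of the rule as stated in this post, not Substrate’s actual code: the function name `default_listen_multiaddr` is hypothetical, and the `0.0.0.0` bind address is an assumption.

```python
def default_listen_multiaddr(is_validator: bool, port: int = 30333) -> str:
    """Sketch of the current default: TCP for validators, WebSocket otherwise."""
    # Assumption: the node binds on all IPv4 interfaces.
    base = f"/ip4/0.0.0.0/tcp/{port}"
    # Same port in both cases; only the protocol spoken on it differs.
    return base if is_validator else base + "/ws"


print(default_listen_multiaddr(is_validator=True))
print(default_listen_multiaddr(is_validator=False))
```

Note that the port number is identical in both cases, which is exactly why port-level tools cannot tell the two configurations apart.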
Note that browsers do not allow web pages to open non-secure WebSocket connections anymore to any IP other than localhost. Even if a node listens for WebSocket connections, it needs to add a reverse proxy in front of it in order to be reachable from web pages. This restriction doesn’t apply to browser extensions, which are free to use non-secure WebSocket connections. This is the main reason why substrate-connect provides a browser extension.
Nodes that have a reverse proxy in front of them must use the --public-addr CLI option when this proxy “modifies” the port they are listening on, as they cannot automatically detect this port modification.
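To make the reachability rules of this section concrete, here is a small Python sketch of which kind of client can dial which kind of address. The client labels (`"node"`, `"extension"`, `"web_page"`) and the function name `can_dial` are made up for this example, and the address parsing is simplified to the shapes used in this post.

```python
def can_dial(client: str, addr: str) -> bool:
    """Sketch: can this kind of client open a connection to this multiaddr?"""
    parts = addr.strip("/").split("/")
    host = parts[1]
    # No suffix after the port means plain TCP.
    transport = parts[4] if len(parts) > 4 else "tcp"
    is_local = host in ("127.0.0.1", "::1", "localhost")

    if client == "node":
        # Native nodes can dial all three kinds of addresses.
        return transport in ("tcp", "ws", "wss")
    if client == "extension":
        # Browser extensions cannot do plain TCP, but may use non-secure WebSocket.
        return transport in ("ws", "wss")
    if client == "web_page":
        # Web pages need WSS, except when talking to localhost.
        return transport == "wss" or (transport == "ws" and is_local)
    raise ValueError(f"unknown client kind: {client}")


print(can_dial("web_page", "/ip4/1.2.3.4/tcp/30333/ws"))
print(can_dial("extension", "/ip4/1.2.3.4/tcp/30333/ws"))
```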
Ports being open
The design of Substrate currently assumes that all ports a node listens on are reachable from the Internet. No attempt is made at checking whether ports are reachable. However, nodes try to determine their public-facing IP address by asking other nodes which IP address they see for a certain connection.
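One way to picture the “ask other nodes which IP they see” mechanism is a simple vote over the reported addresses. The sketch below is an assumption about how such reports could be combined, not a description of what Substrate does: the function name `guess_public_ip` and the `min_reports` threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def guess_public_ip(observed: Iterable[str], min_reports: int = 2) -> Optional[str]:
    """Pick the IP address most often reported by peers, if reported often enough."""
    counts = Counter(observed)
    if not counts:
        return None
    ip, reports = counts.most_common(1)[0]
    # Assumption: a single report is not trusted, since one peer could be wrong or lying.
    return ip if reports >= min_reports else None


print(guess_public_ip(["1.2.3.4", "1.2.3.4", "5.6.7.8"]))
```

Note that this only yields an IP address. It says nothing about whether the port behind that address is actually reachable, which is the gap described above.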
In the case of Substrate alone, not having your ports open means that you are detrimental to the network, but it is not a big deal. The Polkadot networking “extension” (i.e. networking protocols that Polkadot uses but that are not part of base Substrate), however, requires that validators establish direct connections with each other. In that situation, it is crucial that their ports are open.
Problems
Here are, in my opinion, the challenges with the current situation:
- It is very complicated. I don’t think many people in the ecosystem know all the information that I’ve explained above.
- I often see people facing problems because they have a bootnode listening on plain WebSocket connections, but the bootnode address doesn’t mention /ws, or vice-versa, and thus it doesn’t work. These situations are very difficult to understand, because all the typical Unix networking tools will tell you that the port is reachable and that a connection is being established. This is exacerbated by the fact that the behavior is different whether you pass --validator or not.
- Having a node reachable from web pages (secure WebSocket) requires a lot of infrastructure work, and is done voluntarily without gaining anything in return. Unfortunately, having a large number of nodes reachable from web pages is very important for browser-embedded light clients to eventually be adopted.
- Having a node reachable from web pages (secure WebSocket) requires getting a TLS certificate from a certificate provider, which is ideologically incompatible with Polkadot.
- Listening for both plain TCP and WebSocket at the same time requires two different ports, which makes things even more complicated.
- Our CLI options are generally confusing. For example, the --port option can be either the plain TCP or WebSocket port. We also have the --ws-port option, but it is completely unrelated to networking and is used by the JSON-RPC server.
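The bootnode mismatch mentioned in the list above is the kind of problem that a small sanity check could catch before anyone reaches for Unix networking tools. Here is a hedged Python sketch: the names `transport_of` and `check_bootnode` are hypothetical, and the rule that a `/wss` address is acceptable for a node listening on `/ws` assumes there is a reverse proxy in front of it.

```python
from typing import Optional


def transport_of(addr: str) -> str:
    """Return 'tcp', 'ws' or 'wss' for a simplified multiaddr string."""
    parts = addr.strip("/").split("/")
    # Ignore a trailing /p2p/<peer id>.
    if len(parts) >= 2 and parts[-2] == "p2p":
        parts = parts[:-2]
    return parts[4] if len(parts) > 4 else "tcp"


def check_bootnode(listen_addr: str, bootnode_addr: str) -> Optional[str]:
    """Return a warning if the advertised address can't match what the node listens on."""
    listening = transport_of(listen_addr)
    advertised = transport_of(bootnode_addr)
    # ("ws", "wss") is allowed on the assumption that a reverse proxy terminates TLS.
    compatible = {("tcp", "tcp"), ("ws", "ws"), ("ws", "wss")}
    if (listening, advertised) in compatible:
        return None
    return f"node listens on {listening} but bootnode address says {advertised}"


print(check_bootnode("/ip4/0.0.0.0/tcp/30333/ws", "/ip4/1.2.3.4/tcp/30333"))
```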
Roadmap
Here is what I suggest we do in the future. The main objectives, to me, are to simplify and clarify the way the nodes reach each other.
Let’s not do NAT traversal and routing
It is, in my opinion, not worth the effort to add systems that help nodes participate in the network despite their port not being open.
In the future, nodes would be clearly split in two categories: full nodes, running as a binary on a server administered by a technically capable person, and light clients, running on the end user’s machine and typically in a browser.
The use case of running a full node for personal reasons is, in my opinion, going to disappear.
If you run a full node, that means that you want to participate in the infrastructure of the network. And as such, it is not unreasonable to ask you to open your ports.
While having fallback solutions when ports aren’t open is not a bad idea in principle, NAT traversal techniques and especially routing are generally extremely complex, which is why I believe that the trade-off isn’t worth it.
Reinforce the idea that full node == infrastructure == ports open
I think that we should reinforce the idea that full nodes and validators are the infrastructure of a chain, and thus should have their ports open, rather than something that you use for personal access to a chain.
If you need personal access to a chain, use a light client.
This should eventually be clearly written out in documentation. Of course not before light clients are super polished and never crash, which is not completely the case right now.
WebRTC
One of the main networking features that we want to ship in the not-so-distant future is support for WebRTC.
WebRTC is a protocol supported by browsers and designed specifically for peer-to-peer communication. The fact that it is designed specifically for peer-to-peer communication doesn’t actually bring anything technically speaking, but the fact that we use a protocol the way it is intended guarantees that browsers won’t make decisions that are detrimental to us.
WebRTC is based upon UDP. It can in principle be used on top of TCP as well, but doing so is suboptimal and UDP is much preferred.
A WebRTC multiaddr would look like: /ip4/1.2.3.4/udp/30333/webrtc/certhash/uEiC0Tu8hrnOTo29K991d3bZdSGwuWlx1RRxAmwtsLdEtSw
The hash at the end is a certificate hash. WebRTC uses TLS certificates as well, but self-signed certificates, which makes it ok for our use case.
This certificate would be stored on disk by the node, similar to the networking key.
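To give an idea of what the certificate hash in the multiaddr contains, here is a Python sketch that builds and decodes such a value. It assumes the hash is a multibase string (`u` prefix, meaning unpadded base64url) wrapping a multihash (`0x12` for SHA-256, followed by the digest length and the digest), computed over the DER-encoded certificate. The function names are hypothetical.

```python
import base64
import hashlib
from typing import Tuple

SHA2_256 = 0x12  # multihash code for SHA-256


def certhash_from_der(cert_der: bytes) -> str:
    """Build a certhash component from DER-encoded certificate bytes (assumed input format)."""
    digest = hashlib.sha256(cert_der).digest()
    multihash = bytes([SHA2_256, len(digest)]) + digest
    # 'u' is the multibase prefix for base64url without padding.
    return "u" + base64.urlsafe_b64encode(multihash).rstrip(b"=").decode("ascii")


def decode_certhash(certhash: str) -> Tuple[int, bytes]:
    """Return (hash function code, digest) from a 'u'-prefixed certhash."""
    if not certhash.startswith("u"):
        raise ValueError("only the base64url ('u') multibase prefix is handled here")
    body = certhash[1:]
    raw = base64.urlsafe_b64decode(body + "=" * (-len(body) % 4))
    code, length, digest = raw[0], raw[1], raw[2:]
    if length != len(digest):
        raise ValueError("digest length does not match the multihash header")
    return code, digest


code, digest = decode_certhash("uEiC0Tu8hrnOTo29K991d3bZdSGwuWlx1RRxAmwtsLdEtSw")
print(hex(code), len(digest))
```

Decoding the example address from this post gives hash code `0x12` and a 32-byte digest, which is consistent with a SHA-256 hash of the certificate.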
I find the fact that you need to pass a certificate hash in the multiaddr very annoying in terms of UX. I had originally proposed an alternative version of the libp2p WebRTC protocol that is less optimal but doesn’t require providing a certificate hash in the multiaddr. Unfortunately, the libp2p people don’t seem to give much attention to my UX concern. We can consider implementing that alternative version later if necessary.
Changing the defaults and deprecating WebSocket support
After WebRTC is shipped and working well, I would propose to:
- Make all non-validator nodes listen for WebRTC by default, on the same port as TCP (30333 or whatever is passed with --port).
- Remove the behavior that listens for WebSocket by default, and instead only listen for TCP connections.
- Maybe in the future remove support for WebSocket altogether, for the sake of simplicity. However I don’t think this is very important.
The reason for not activating WebRTC by default on validators is that we don’t have a lot of trust in the WebRTC implementation that we use. In principle it should be activated, but we’d rather not in order to avoid someone potentially finding a panic vector in the library and crashing the entire Polkadot network.
Contrary to the WebSocket situation, I think that this conditional enabling wouldn’t bring much confusion, for two reasons:
- Contrary to the WebSocket situation, this does not disable listening for TCP connections. TCP connections always work.
- In order to obtain the multiaddr of a WebRTC node, one needs to know the certificate hash, meaning that looking at the node is necessary. If the node isn’t listening on WebRTC, the person will notice. The only possible source of confusion could come if someone adds --validator to a node later on, having already saved its WebRTC address.
Replacing WebSocket with WebRTC would solve many problems: no need to have a reverse proxy anymore, thus no need to pass --public-addr, and the same port can be used for TCP and WebRTC (UDP).
QUIC and CLI options
Another protocol which we’ve been working on is QUIC.
Contrary to everything described above, support for QUIC isn’t about connectivity but about performance. We believe that it might be possible to optimize the networking by using QUIC instead of TCP. I’m not going into details because this isn’t really relevant here.
QUIC is based upon UDP, just like WebRTC. Unfortunately, this means that QUIC and WebRTC couldn’t use the same port.
QUIC also has another interesting property: it needs to use a specific unique local port for all outgoing connections (contrary to TCP, where the operating system assigns a new separate port for each connection).
Without QUIC support, the CLI would be easy to simplify, as the --port option could refer to both the TCP port and the WebRTC/UDP port. A node operator would simply need to provide a --port, open both TCP and UDP of that port, and wouldn’t need to tinker with the --listen-addr option.
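The simplified behavior described here, leaving QUIC aside, could look like the following Python sketch. It derives the listen addresses from a single port value, following the proposal that validators would not listen for WebRTC by default. The function name `listen_addrs_from_port` and the `0.0.0.0` bind address are assumptions, and the certificate hash is left out because it is only needed in the address that other nodes dial.

```python
from typing import List


def listen_addrs_from_port(port: int = 30333, is_validator: bool = False) -> List[str]:
    """Sketch: one --port value yields a TCP address and, for non-validators, a WebRTC one."""
    # TCP is always enabled.
    addrs = [f"/ip4/0.0.0.0/tcp/{port}"]
    if not is_validator:
        # Same port number, but on UDP, so it doesn't clash with the TCP listener.
        addrs.append(f"/ip4/0.0.0.0/udp/{port}/webrtc")
    return addrs


print(listen_addrs_from_port(30333))
print(listen_addrs_from_port(30333, is_validator=True))
```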
QUIC, however, makes everything CLI-related more complicated.
I unfortunately don’t have the answer to that yet.
Conclusion
Feel free to give your opinion on this plan.
Also please note that I used to work on Substrate’s networking code, but I no longer do. I am now more or less “the light client person”, which is why browser connections interest me, but I am most likely not going to implement what I suggest here.