As you may or may not know, we commercialize a product called Talaia. Very simply put, it is a network visibility solution that receives traffic summaries from routers via a protocol called NetFlow, and from this information it builds the picture of what is happening in customer networks.
(If this were a self-promotion post, I’d keep talking about its benefits, how it can be deployed from the cloud as a SaaS solution, et cetera. And I won’t refrain from mentioning our recent blog entry about its Bitcoin mining detection capabilities, a story that was recently featured in Bitcoin Magazine.)
I’m instead posting about how easy it is, in practice, to identify which SSL-protected websites network users are accessing and, in particular, how to tell the hostnames behind IP addresses. It is actually easier to tell for HTTPS traffic than it is for HTTP! Let me elaborate.
As I mentioned, our product works with NetFlow (or equivalent protocols, such as IPFIX and sFlow). This means that, for each connection, we know the origin and destination IP addresses, protocol, ports, and other info such as the number of packets and bytes exchanged, timestamps for first and last packet, and so on. Most notably: packet contents are not available for inspection.
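To make this concrete, here is a minimal Python sketch of what such a flow record might look like and how one could pick out the servers worth identifying. The `FlowRecord` class, its field names, and the `https_server_candidates` helper are hypothetical names made up for illustration; they are not the NetFlow wire format, nor how our product stores flows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, simplified view of one exported flow (no payload at all)."""
    src_ip: str
    dst_ip: str
    protocol: int      # IP protocol number: 6 = TCP, 17 = UDP
    src_port: int
    dst_port: int
    packets: int
    octets: int        # bytes exchanged
    first_seen: float  # timestamp of the first packet
    last_seen: float   # timestamp of the last packet


def https_server_candidates(flows):
    """Return the distinct destination IPs of TCP flows to port 443, sorted."""
    return sorted({f.dst_ip for f in flows if f.protocol == 6 and f.dst_port == 443})
```

Everything the rest of this post does starts from a list of IP addresses like the one this helper returns. Assuming that port 443 over TCP means HTTPS is itself a simplification.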
An alternative to NetFlow is products based on “deep packet inspection” (DPI). These rely on installing hardware sniffers on every data link to be monitored. Such products can trivially snoop on HTTP traffic, and even perform man-in-the-middle attacks (see, for example, the recent scandal of Verizon Wireless injecting tracking UIDs into HTTP requests). That is part of the reason why the world is shifting towards HTTPS.
However, DPI products are simply not cost-effective, at least in our opinion (and our customers’), as they require extensive deployment of expensive hardware. So, cost-effective network visibility has to be based on NetFlow. And, given the limited kind of information that NetFlow provides, how can we identify the hostname behind HTTP(S) requests?
For SSL traffic, it is actually extremely easy: we can simply connect to the IP address of the server and parse the server certificate, which it presents during the handshake as soon as you connect. Easy game! We don’t even need to issue an actual HTTP request. (Even if the server has SNI support, there will typically be a default certificate.)
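The idea can be sketched with nothing but Python’s standard library. This is a rough illustration under a few assumptions, not our actual implementation: the function names `names_from_cert` and `hostnames_for_ip` are made up, and the sketch keeps certificate chain validation on, because `getpeercert()` only returns the parsed fields for a validated certificate. Servers with self-signed or otherwise untrusted certificates would raise an error here and would need a separate certificate parser.

```python
import socket
import ssl


def names_from_cert(cert):
    """Collect commonName and DNS subjectAltName entries from a getpeercert() dict."""
    names = []
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName" and value not in names:
                names.append(value)
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "DNS" and value not in names:
            names.append(value)
    return names


def hostnames_for_ip(ip, port=443, timeout=5.0):
    """Connect to ip:port without SNI and return the names in the certificate."""
    context = ssl.create_default_context()
    # We connect by IP address, so there is no hostname to match against.
    context.check_hostname = False
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        # No server_hostname is passed, so no SNI is sent and the server
        # answers with whatever certificate it serves by default.
        with context.wrap_socket(sock) as tls:
            return names_from_cert(tls.getpeercert())
```

Calling `hostnames_for_ip("203.0.113.10")` (a documentation address, used here only as a placeholder) would return the list of names found in the certificate, without a single HTTP request being sent.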
But consider: how can we identify the non-SSL, plain HTTP host behind an IP address? There we can’t simply connect and receive a straight answer from the server. Moreover, it is very common for a single IP address to host many unrelated websites (this is less common for HTTPS). So, instead, we need to combine other sources of info, such as reverse DNS resolution, WHOIS lookups, and whatever additional black magic we do that I don’t really want to discuss in this post.
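Of those sources, reverse DNS is the simplest to illustrate. The following is a minimal sketch using only Python’s standard library; `ptr_name` and `reverse_dns` are hypothetical helper names, and this covers just the reverse DNS piece, none of the black magic.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the DNS name whose PTR record is queried for this address."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_dns(ip):
    """Return the hostname the resolver reports for ip, or None if there is none."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname
```

Keep in mind that the answer is often the hosting provider’s generic name for the machine rather than any of the websites served from it, which is exactly why this source alone is not enough.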
It’s almost ironic that, at least for a NetFlow-based product, HTTPS services can be identified more easily than plain old unencrypted HTTP. Of course, that’s no reason to dismiss HTTPS. On the contrary, with HTTP you are at the mercy of DPI and attackers who can snoop on you, retrieve passwords, steal browser sessions, and perform a myriad of man-in-the-middle attacks.