We were awarded a patent for Behavior, Qadium’s second product, which I designed, architected, and built as applied research! We sold my BigQuery-based implementation for roughly a year to the tune of >$1 million in ARR – a big deal for a startup. As soon as the value was visible, we staffed it with multiple engineers, converted it to Dataflow, and cranked up a UI.

The core insight inside Behavior is the same as the core insight of our first product, Expander, and the core business of Qadium: We can help you monitor everything that your organization has connected to the internet, and we can do it without deploying any physical devices or software agents. It turns out most organizations have all kinds of unexpected and unwanted systems connected to the internet. Qadium products give you visibility into what isn’t under management and what is misconfigured, and Behavior extends that unprecedented visibility to live interaction flow data.

Challenges with using netflow data

Behavior detects otherwise-undetectable security threats by operating on netflow data. Netflow data is very basic metadata about what communications are traversing the public internet. In netflow data, we have only a few fields: the source and destination IP addresses, the source and destination ports, the layer 4 protocol, the timestamp, and any TCP flags. These aren’t much to work with. Worse, the modern internet multiplexes many domains and even organizations to shared IP addresses, so the data aren’t definitive.

Netflow data capture only a very small fraction of a percent of all traffic (on the order of 1 packet in 1 million), and the sampling is entirely by convenience. Most routers and switches retain some netflow records as the traffic passes through them, but each system has very limited memory, and quickly handling traffic is their priority. So, dropped records are unavoidable, and dropping behavior varies by characteristics like node identity (network position) and time-of-day (data volume). We also see non-negligible duplicate records, because a single client-server conversation might be observed by multiple sensors. “Chattier” types of interactions (like audiovisual communications) are over-represented. Adding more complexity, which IP is considered the “source” of the packet is essentially arbitrary for our purposes.

Additionally, because Qadium doesn’t have access to internal firewall logs, we don’t actually know what truly happens to risky flows. We don’t know whether the organization’s firewall blocked the risky connection (as it should), nor do we know which machine ultimately received or initiated the flow.

My challenge was: Can we say anything valuable about an organization’s internet security risks from such a limited dataset?

What I built

I focused on one fact. When we see a flow, that interaction happened.

First I verified that fact was indeed true. I ran an experiment. I took a set of otherwise-quiescent IP addresses, and I sent a zillion packets to a random selection of even-numbered IP addresses. I observed how fast we observed netflow records for those interactions (quite fast), and I verified no packets were hallucinated in the multi-day window. I saw no traffic, a spike of even-numbered packets for a few minutes, and then no traffic again.

To be able to use the fact that observations always reflect real traffic, I had to transform the data. I first converted its column names to be meaningful – from “source” and “destination” to “likely client” and “likely server”. I used the port numbers for this translation. (Low ports and known ports are servers; high ports are clients.) With this transformation, I could deduplicate conversations. I also kept only TCP records, and dropped all records whose flags didn’t indicate an established connection. I joined customer IPs onto the records, keeping only the flows relevant to each customer. I dropped IPs shared by multiple organizations (commonly CDNs, hosting providers, and Cloud resources). I did all of this in a computationally effective way, rather than a human-consumption-friendly way.

Then I checked for indicators that the customer IP had been compromised or was behaving in a risky way:

  • Remote IP is risky. The remote IP might be in a country on the OFAC Sanctions List, it might be a known bad IP like a command-and-control server, or it might have some other property that makes it risky to communicate with.
  • Remote service is insecure. Because Qadium regularly talks to every IPv4 address on the most important ports using the most important protocols, we know what is at the other end of each netflow record. I flagged the flows where the remote system was too juicy to be legitimate.
  • Remote IP:port is against security policy. For example, many organizations ban peer-to-peer networking; many US government agencies banned the use of popular Kaspersky anti-virus and other products[1]. Behavior is able to detect and demonstrate compliance.
  • Remote is a honeypot. Legitimate users have no need to interact with honeypot servers, but compromised servers running automated attack and reconaissance software find them very enticing.
  • Remote is not a publicized IP. Non-publicized IPs have no legitimate reason to be involved in a communication. Any communication with such an IP is an indicator that the client node is scanning the internet, which is an indicator that malware may be present on the customer system.
  • Customer service is insecure. Insecure systems sending voluminous outbound traffic to strange IPs are potentially compromised. (It’s also possible, though much less likely, that inbound traffic reflects an attacker’s reverse shell.)
  • We detect a change. By detecting changes in volume going to particular kinds of remote systems, we identified behavioral oddities for security teams to investigate.

I developed “risk filters” in all the categories, and applied them at scale to customer data. Hits were, unfortunately, common. Some connections were correctly dropped at a proxy firewall, but many others were not. I worked through many possible risk filter ideas, keeping the ones that were most effective. My final set all warranted security team investigation, unlike most security products, which have extremely high false positives. The risk filter hits created an opportunity for effective alignment conversations and uncovered misconfigured firewalls at customer locations (satellite offices are particularly tricky).

Squeezing value from data was great fun for me, and linking our static “inventory” product with dynamic “risk” data was incredibly valuable for customers.

I wrote this post with more than 5 years of hindsight, well past the point where any details are sensitive. I am backdating it to the first public release of detailed information about Behavior and our netflow work.