NeurIPS: Data space as the dual to feature space

We use “data slices” to evaluate our cybersecurity ML systems for the asset attribution task at Palo Alto Networks. For us, data slices are the dual of feature explanations. By segmenting our data into subunits with known properties, we can verify that fixes aimed at model blind spots actually work, detect model regressions, and characterize differences between models.
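To make the idea concrete, here is a minimal sketch of slice-based evaluation in Python. It is not the code from the paper: the slice names (`cloud_hosted`, `expired_cert`, `on_prem`), the labels, and the helper functions `accuracy_by_slice` and `regressed_slices` are all invented for illustration, and plain accuracy stands in for whatever metric is actually appropriate.

```python
from collections import defaultdict


def accuracy_by_slice(examples, predictions):
    """Return {slice_name: accuracy} for examples tagged with one or more slices."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        for slice_name in example["slices"]:
            total[slice_name] += 1
            correct[slice_name] += int(predicted == example["label"])
    return {name: correct[name] / total[name] for name in total}


def regressed_slices(old_scores, new_scores, tolerance=0.0):
    """Return the sorted names of slices where the new model scores lower than the old one."""
    return sorted(
        name
        for name in old_scores
        if name in new_scores and new_scores[name] < old_scores[name] - tolerance
    )


# Toy data: slice names and labels are hypothetical.
examples = [
    {"label": "org_a", "slices": ["cloud_hosted"]},
    {"label": "org_a", "slices": ["cloud_hosted", "expired_cert"]},
    {"label": "org_b", "slices": ["expired_cert"]},
    {"label": "org_b", "slices": ["on_prem"]},
]
old_predictions = ["org_a", "org_b", "org_b", "org_b"]
new_predictions = ["org_a", "org_a", "org_b", "org_a"]

old_scores = accuracy_by_slice(examples, old_predictions)
new_scores = accuracy_by_slice(examples, new_predictions)

print("old:", old_scores)
print("new:", new_scores)
print("regressed:", regressed_slices(old_scores, new_scores))
```

In this toy data both models get three of the four examples right, so overall accuracy cannot tell them apart. The per-slice view shows that the new model fixes the `cloud_hosted` and `expired_cert` slices but regresses on `on_prem`, which is the kind of difference slices are meant to surface.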

I described our approach in a paper on using data subsets to evaluate the ML internet asset attribution problem, which was accepted to the NeurIPS Data-Centric AI (DCAI) Workshop held on 14 December 2021. The DCAI workshop focused on practical tooling, best practices, and infrastructure for data management in modern ML systems. The paper discusses two themes: (1) data slices, and (2) their application to our asset attribution task in cybersecurity.

What’s this blog about?

Whatever is on my mind. The content has varied over more than a decade, but it's always been technical. In the early years I focused on improving the fabric of the internet for some niche tools. But the internet no longer needs that kind of improving, and search doesn't really work like that anymore either. These days the blog is mostly notes for my future self, shared with anyone who is interested.

Pamela Toman
