Python web services are built on each other's crazy

Recently I broke a Python web service by adding a new library to the backend. I was floored, because there was no reason that library should have affected anything in the service layer. But of course the computer wasn’t wrong. There was a reason. And figuring it out meant I got to take a deep dive into the coherent cloud of strange practices that is Python web services.


My misbehaving Python service follows all the established best practices. Its public face is an nginx server that accepts public HTTP(S) connections from clients. That nginx server talks to an internal gunicorn server. The gunicorn server is a Python WSGI server that handles many simultaneous HTTP connections; it translates from HTTP connections to Python logic. The gunicorn server loads and serves a Flask application. The Flask application encapsulates the application logic. It links particular web paths in the application to the Python code that calculates the response payloads. All good, and nothing strange here.
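
For orientation, the stack looks roughly like this – a minimal sketch with hypothetical names, not the real application:

# app.py – the Flask layer
from flask import Flask

app = Flask(__name__)

@app.route("/ping")
def ping():
    return {"status": "ok"}

# gunicorn serves this on an internal port, and nginx proxies public HTTP(S) traffic to it:
#   gunicorn --bind 127.0.0.1:8080 app:app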

The bug first manifested as POST requests coming back with empty replies on my local machine. The server log said there was a segmentation fault – signal 11, SIGSEGV. A segmentation fault means a process is accessing memory that does not belong to it:

[2022-01-20 11:18:26 -0800] [75979] [INFO] Starting gunicorn 20.1.0
[2022-01-20 11:18:26 -0800] [75979] [INFO] Listening at http://0.0.0.0:8080 (75979)
[2022-01-20 11:18:26 -0800] [75979] [INFO] Using worker: sync
[2022-01-20 11:18:26 -0800] [75987] [INFO] Booting worker with pid: 75987
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
[2022-01-20 11:18:26 -0800] [75979] [WARNING] Worker with pid 75987 was terminated due to signal 11
[2022-01-20 11:18:26 -0800] [75989] [INFO] Booting worker with pid: 75989

I grumbled to myself, thinking “What kind of ridiculous system calls fork without exec?!? Surely anything as widely used as gunicorn behaves in reasonable ways. This error message must be a red herring.”

Screenshot of fork without exec (`spawn_worker` lines 572-592)

Yes, gunicorn calls fork without exec. They call it a "pre-fork worker model".

Well, then I read the gunicorn source code. Calling fork without exec is exactly what gunicorn does. It forks, then the child initializes the core of the application. No exec to be found, which means the child inherits all the parent’s memory. The child nods to resource cleanup by closing a bunch of file descriptors – but the memory is shared with the parent. (And kind of oddly, each child separately initializes all the application state by default – rather than initializing in the parent, forking, and letting all the children share the read-only memory. You can get the second behavior, but you have to actively set preload_app=True.)
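
For reference, opting into that second behavior is a one-line change in a gunicorn config file – a hypothetical sketch:

# gunicorn.conf.py – hypothetical sketch
bind = "0.0.0.0:8080"
workers = 2
preload_app = True  # load the app once in the parent; forked workers share the read-only memory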

I verified that the fork-without-exec behavior was the root problem in a couple more ways. First I ran the application bare, as a single process, with no gunicorn. Flask alone didn’t segfault, which reinforced the idea of gunicorn’s fork behavior as the culprit. Then I checked the Mac Console (a useful tool that I learned about during this investigation!) for Crash Reports, and it also showed the fault with the message “crashed on child side of fork pre-exec”. So, yep, fork-without-exec was clearly implicated.

So then I instrumented the code. I traced the segmentation fault to the line where it occurred. The bad memory access occurred when the new library used the Google Cloud Storage client for a download operation:

self.gcs_client.download_blob_to_file( ... )

Now I finally had enough clues to start to put it together. It seems the GCS client was being created before the fork – before worker.init_process() ever ran, in fact! Then, when the child worker tried to actually use the GCS client, it segfaulted, because the GCS client was in parent memory rather than child memory. I hypothesized that the SIGSEGV occurred because Apple’s CoreFoundation OS framework disallows children using their parent’s memory as part of its ban on “fork without exec”. (I also figured that if the GCS client had been created only in the child, there would be no issue. From the child’s perspective, its process is unique and alone; apart from its pid value, the child is unaware of any multiprocessing. That said, I had never heard of CoreFoundation before this bug, so it’s quite possible CoreFoundation works a different way.)
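
A minimal sketch of that inheritance behavior (not gunicorn’s actual code, and it won’t reproduce the CoreFoundation crash – it just shows that pre-fork state is visible in the child):

import os

client = {"created_in_pid": os.getpid()}  # stand-in for the GCS client, created before the fork

pid = os.fork()
if pid == 0:
    # Child (worker): it inherited the parent's memory, so the pre-fork object is right there.
    print(f"worker {os.getpid()} is using a client created in pid {client['created_in_pid']}")
    os._exit(0)
else:
    os.waitpid(pid, 0)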

Summarizing my understanding of the situation to this point:

  1. The GCS storage client is created in gunicorn’s parent process.
  2. The GCS storage client relies on some library that is compiled specifically for Mac. (pip and other package managers seamlessly choose the right binary wheel for each system, and will compile a source distribution on the user’s local machine as needed.) Anything compiled specifically for Mac uses CoreFoundation for all the very basic operations, including URLs and stream sockets.
  3. The gunicorn server process forks – but it doesn’t exec. Fork-without-exec is a deliberate design decision by gunicorn; it seems like gunicorn wants to make bad code with memory leaks and other issues more stable.
  4. A POST request comes in. Gunicorn hands it to a worker, which is a child. As the worker handles the request, it tries to use that CoreFoundation functionality. Rather than getting copy-on-write semantics for CoreFoundation functionality in the parent, we get a SIGSEGV. (All fork-without-exec is potentially unsafe, so Apple has blocked it since Catalina. I understand the reasoning thusly: it is impossible to guarantee that the parent process has no other threads. If it does, then from the point of view of the child, all those peer threads were violently murdered, and so they will never release any held locks – including locks the child cares about, like the lock on malloc. So, better to avoid this entire issue and instead force the entire memory of the child to be replaced via exec.)
  5. The worker dies (it can’t handle the signal). The gunicorn parent manager spawns another child worker. But the new child has the same memory configuration, and it is also unable to handle requests.
  6. The application is completely broken.

(This is what I pieced together. It’s my first foray into some of this tech, though – please reach out with corrections and other explanations!)

That makes sense as far as it goes. But now I had a new mystery: why would that GCS client be created before the fork? The GCS client was used deep in the bowels of request handling within the application. But the parent? That’s gunicorn. Gunicorn is a webserver. There’s no reason for application code to be executed when gunicorn starts….

At this point, I decided to tackle the problem from the other direction. I started building a very tiny version of the application entirely from scratch. Just Flask, gunicorn, and the new library functionality? No segfault. Just Flask, gunicorn, and the application’s use of the new library functionality? No segfault. Then I tried to introduce the application’s gunicorn configuration file to the mix. The system segfaulted instantly.

When I reviewed the Python web service’s gunicorn config, the mystery finally became clear. The config file is in Python. The first few lines of the gunicorn config file looked like this:

from os import environ
from application_library.application_path import PORT

bind = ":" + str(PORT) # port to use for application

The service layer was importing the application!

When gunicorn “read” this config, it actually executed the config. (I find “execution” to be a very odd pattern for a config file, but whatever; this is how gunicorn works.) As soon as gunicorn executed the line from application_library.application_path import PORT, it also executed the module application_library.application_path, because “execute on import” shenanigans are core to how Python works: the first time Python imports a module, it executes every top-level statement in the file and binds the resulting names in the module’s namespace (check out dir(sys.modules["$moduleName"]) to see this in action).

That PORT import had massive side effects, because the Flask app object was defined in that same file. Flask’s pattern for creating applications is to include app = Flask(__name__) as a plain line in a Python file. That line isn’t wrapped in a class, or in any kind of conditional. It’s just a top-level, unindented statement. As a result, when we imported PORT from a file that also included app = Flask(__name__), the whole application immediately sprang into being. Even down to a GCS client deep in the bowels of the application.
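
Schematically, the imported file looked something like this (a hypothetical sketch – the real module contained far more):

# application_library/application_path.py – hypothetical sketch
from flask import Flask

PORT = 8080               # the constant the gunicorn config wanted

app = Flask(__name__)     # top-level statement: runs as a side effect of the import

# ...route registrations, the GCS client, and the rest of the application
# follow as more top-level statements...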

So that little throwaway PORT import? It was probably introduced to guarantee that the default ports used by a Flask server and a gunicorn server always matched (it might have even been introduced by me – I gave up git blame after tracing the refactor history for a few minutes). But that import statement also unwittingly caused the entire application to be created pre-fork, in the server – which broke the whole application on strict Macs.

A dead bug on its back

Once tracked down, the fix was thankfully simple. I replaced that too-DRY reference to PORT with more duplication. I defined the default port number independently in Flask and gunicorn so there would be no dependency between the service layer and the application layer:

bind = f":{os.environ.get('PORT', 8080)}" # port to use for application

With this change, the application does not exist until worker.init_process(). The segmentation fault is gone.

(It’s unlikely I’d have noticed this bug if I had only been testing on Linux servers. I prefer the lower overhead of local testing when possible, but this is a good reminder that the platform sometimes does still matter. Cross-platform code is hard.)


My takeaway from this debugging experience is “weirdness propagates from initial decisions”. Python, gunicorn and Flask mostly play nice together, but it’s because they’re built on each other’s crazy (and a lot of eyes).

The chains I see are:

  1. Python web app developers aren’t trusted to write applications that can run indefinitely. –> gunicorn eschews the two most common forking patterns in favor of a “fork-then-load” pattern that maximizes the ease with which an application can be reinitialized in the worker.

    • “Load-everything-in-parent-then-fork-without-exec” is very memory efficient, since the children processes all share a single copy of read-only memory. Gunicorn doesn’t use this pattern, because it would mean memory leaks and other code issues would require more complexity to fix than a quick reload in the child process. (The gunicorn documentation seems to dissuade users from loading the app before fork: “By preloading an application you can save some RAM resources as well as speed up server boot times. Although, if you defer application loading to each worker process, you can reload your application code easily by restarting workers.”)
    • “Fork-then-exec” is very safe, since the child process memory is completely replaced and you can’t accidentally get into deadlock. Gunicorn doesn’t use this pattern, because it would require spawning entirely new processes each time a worker child died, and process creation is pretty slow. (I’m still surprised by this, honestly; it’s not really that slow to create new processes, especially not compared to application lifetimes. Maybe I’m missing something.)
    • “Fork-without-exec-then-load” is what gunicorn opts to use. This approach uses more memory and it’s more dangerous, but it means reinitializing the user’s application each time something goes wrong is very lightweight.
  2. Python is a scripting language in which statements like import, class, and def are executed at runtime, rather than being compile-time declarations. –> It is possible to execute substantial amounts of code just by using the import keyword.

  3. Configuring callbacks and other complex functionality is easiest if the config file is itself Python. –> In gunicorn, the config file can have side effects. Nothing limits config to declaring parameter values (see the sketch after this list).

  4. Python and almost all its libraries are intended to run cross-platform. –> There are potentially bugs lurking in Python’s interactions with environments, because testing all code in all environment configurations is hard.
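
To make point 3 concrete, a gunicorn config file is ordinary Python: it can import modules, compute values, and register server hooks, not just declare parameters. A hypothetical sketch:

# gunicorn.conf.py – hypothetical sketch
import multiprocessing
from os import environ

bind = f":{environ.get('PORT', 8080)}"
workers = multiprocessing.cpu_count() * 2 + 1   # computed when gunicorn "reads" the config

def post_fork(server, worker):
    # a gunicorn server hook: runs in each worker child immediately after the fork
    server.log.info("worker %s forked", worker.pid)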

An image cut horizontally. Above ground there are trees. Below ground there is a dense interconnected network of white fungi and tree roots.

Everything interacts with everything else when you look deep enough.


Sherlock Holmes was my companion throughout this journey:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

I investigated a number of wrong hypotheses along the way, from which I learned a ton – but this tale is quite long just covering what really was going on. I did not expect gunicorn in particular to work quite the way it does.

NeurIPS: Data space as the dual to feature space

We use “data slices” to evaluate our cybersecurity ML systems for the asset attribution task at Palo Alto Networks. For us, data slices are the dual of feature explanations. By segmenting our data into subunits with known properties, we can verify that improvements to address model blindspots actually succeed, we can detect model regression, and we can characterize differences between models.

I described our approach in a paper on using data subsets to evaluate the ML internet asset attribution problem, which was accepted to the NeurIPS Data-Centric AI (DCAI) Workshop held on 14 December 2021. The DCAI workshop focused on practical tooling, best practices, and infrastructure for data management in modern ML systems. The paper discusses two themes: (1) data slices, and (2) their application to our asset attribution task in cybersecurity.

Introducing microservices to students in Stanford CS 110

Ryan Eberhardt invited Xpanse to give a guest lecture on the last day of a summer session of Stanford CS 110, in that gap between real coursework and the final.

CS 110 is the second course in Stanford’s systems programming sequence. I loved taking it as a student. I loved CS 110 so much that I TAed it twice, even though it’s a really tough course to TA (the students are zillions of new undergrads, there are a lot of assignments to give feedback on, and the material is pretty hard for them, so office hours consist of a never-ending queue of students with questions). My professor Jerry Cain gave me an award for my TA work, so hopefully I did okay by them.

For this re-visit to CS 110, I introduced microservices, containerization, and orchestration. I gave the orientation: why they should care and who we were. Then two sharp coworkers talked about their daily tech of port scanning and functional programming. I concluded the lecture by hinting at the problems solved by Docker and Kubernetes (and the problems created by them), and I asked leading questions that extended some of the core ideas in CS 110: decoupling of concerns, each worker doing one thing, pools of workers sharing a single point of entry, and request/response models.

Data documentation and me

When I start wrangling a new data source, I search for a lot of terms, I ask a lot of questions and I take a lot of notes. I like to publish those notes on the company wiki. Working publicly tends to improve the quality of the docs – no one has to re-invent the wheel, I sometimes get comments that improve the docs, and there is a clear stub for others to maintain.

These notes often find their way to becoming onboarding directives (“go read these pages to learn about our data”), and they’re valuable to skim when revisiting data.

When I was in a highly mature data science organization with weak engineering practices, everyone was building on everyone else’s data and data documentation was useful in tracking alternative datasets and tracing back to canonical sources. Now I’m in a maturing data science organization with strong engineering practices, and these documents are useful because they stop the data users from annoying the engineers with the same questions repeatedly.

I’ve landed on a general template while documenting many, many dozens of datasets. I want to share it in case it can be useful to someone else.


$typeOfData data documentation

The page naming convention supports searching for “X data”.

The first section is a very short orientation (ideally 1-3 sentences). It gives just enough for the reader to figure out if the data are relevant to their task. When the dataset scope is complex, this section tends to be a bit longer to explain what is (not) included.

Background

This section varies in length. It covers anything that isn’t immediately apparent that someone might need to know, such as:

  • Where do these data come from? Do we have any contractual or PII obligations?
  • How does the system that creates these data work? (for instance, how does Tor work, how does network communication work, why would a certificate be revoked, what is domain registration)

Summary of dataset

This section contains a short paragraph or small table that always gives the most basic facts:

  • Where can you find the data?
  • What time period do the data cover?
  • How often is the source refreshed? By whom?
  • Roughly how large are the data on disk (per unit of time)?
  • Which team(s) are the best POC for this data? (useful in itself and as an indicator of data maturity)

Organizations sometimes keep multiple copies of the same data. Maybe that looks like Parquet and .ndjson and Avro copies of every file. Maybe that looks like GCS and BigQuery and Databricks. Maybe that looks like something else entirely. I get opinionated here. I think it’s unreasonable to document all of those, so it’s important to decide which approach is the most “organizationally canonical” source and write data documentation that reflects only that canonical source. (And I think the source itself should have documentation on it that discusses how it is loaded.)

Fields

This section is the core of the documentation. It contains a very, very long table. Each field gets its own row. I use three columns:

  • field name
  • example values
  • notes

The examples must be edited to be notional. The company wiki never has the same controls as R&D systems do, so I always edit the examples so they do not actually appear in the dataset.

The “notes” field is the core. This is where I capture relevant facts, like whether this field is sufficient to be a primary key, whether it’s a hash of another field, whether it’s an enum masquerading as a string and how frequent each value is, whether it is actually the same as the field with the same name in that other dataset, whether it’s a dirty raw version of some other field, ….

I often stub out a row for each field, and then I fill in example values and notes as I work with each field.
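
For example, a notional row for a hypothetical dataset might read:

  • field name: src_ip
  • example values: 192.0.2.14, 198.51.100.7
  • notes: IPv4 only; never null; not unique per record – pairs with dst_ip and the timestamp to identify a conversation.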

Limitations and gotchas

This section reiterates the most frequent mistakes that someone is likely to make with this data. I often fill the entire section with a handful of very short bullet points.

It includes big-picture gotchas. What kinds of data will not be visible in this dataset? What kinds of mistakes will people make? (This is the place I underscore the implications of what is already documented – “remember to deduplicate when you JOIN on this data, or your join will explode”.)

It also includes the most important field-level gotchas (“the domains in this dataset always have a final period appended, which means all naive JOINs will falsely fail”).

Since different people read different sections, I think duplicating these warnings is worthwhile.

Usage

When the data source is canonical but opaque and there isn’t organizational interest in maintaining either a usable copy of all the data or a centralized query library, it can be useful to provide an example bit of SQL or Java code. I write a bit of code that massages the data into a useful format while correctly avoiding the gotchas. If the organizational appetite for data maturity is there, that’s even better; do that instead.

This section distinguishes these data from similar data sources. It also curates links to any other related data that someone might not even know they were looking for (e.g., WHOIS might link to DNS data as a source of FQDNs rather than only SLDs).


And that’s that. I don’t use all the sections every time, and I sometimes move them around, but I do tend to keep to this structure. I find public documents in a consistent structure to be very valuable for working with data.

Detecting live security threats without internal data: Netflow and building the Behavior product

We were awarded a patent for Behavior, Qadium’s second product, which I designed, architected, and built as applied research! We sold my BigQuery-based implementation for roughly a year to the tune of >$1 million in ARR – a big deal for a startup. As soon as the value was visible, we staffed it with multiple engineers, converted it to Dataflow, and cranked up a UI.

The core insight inside Behavior is the same as the core insight of our first product, Expander, and the core business of Qadium: We can help you monitor everything that your organization has connected to the internet, and we can do it without deploying any physical devices or software agents. It turns out most organizations have all kinds of unexpected and unwanted systems connected to the internet. Qadium products give you visibility into what isn’t under management and what is misconfigured, and Behavior extends that unprecedented visibility to live interaction flow data.

Challenges with using netflow data

Behavior detects otherwise-undetectable security threats by operating on netflow data. Netflow data is very basic metadata about what communications are traversing the public internet. In netflow data, we have only a few fields: the source and destination IP addresses, the source and destination ports, the layer 4 protocol, the timestamp, and any TCP flags. These aren’t much to work with. Worse, the modern internet multiplexes many domains and even organizations to shared IP addresses, so the data aren’t definitive.

Netflow data capture only a very small fraction of a percent of all traffic (on the order of 1 packet in 1 million), and the sampling is entirely by convenience. Most routers and switches retain some netflow records as the traffic passes through them, but each system has very limited memory, and quickly handling traffic is their priority. So, dropped records are unavoidable, and dropping behavior varies by characteristics like node identity (network position) and time-of-day (data volume). We also see non-negligible duplicate records, because a single client-server conversation might be observed by multiple sensors. “Chattier” types of interactions (like audiovisual communications) are over-represented. Adding more complexity, which IP is considered the “source” of the packet is essentially arbitrary for our purposes.

Additionally, because Qadium doesn’t have access to internal firewall logs, we don’t actually know what truly happens to risky flows. We don’t know whether the organization’s firewall blocked the risky connection (as it should), nor do we know which machine ultimately received or initiated the flow.

My challenge was: Can we say anything valuable about an organization’s internet security risks from such a limited dataset?

What I built

I focused on one fact. When we see a flow, that interaction happened.

First I verified that fact was indeed true. I ran an experiment. I took a set of otherwise-quiescent IP addresses, and I sent a zillion packets to a random selection of the even-numbered addresses. I measured how quickly netflow records for those interactions appeared (quite fast), and I verified no packets were hallucinated in the multi-day window. I saw no traffic, then a few minutes’ spike of packets to the even-numbered addresses, then no traffic again.

To be able to use the fact that observations always reflect real traffic, I had to transform the data. I first converted its column names to be meaningful – from “source” and “destination” to “likely client” and “likely server”. I used the port numbers for this translation. (Low ports and known ports are servers; high ports are clients.) With this transformation, I could deduplicate conversations. I also kept only TCP records, and dropped all records whose flags didn’t indicate an established connection. I joined customer IPs onto the records, keeping only the flows relevant to each customer. I dropped IPs shared by multiple organizations (commonly CDNs, hosting providers, and Cloud resources). I did all of this in a computationally efficient way, rather than a human-consumption-friendly way.
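
The real pipeline ran in BigQuery, but the normalization-and-deduplication step looks conceptually like this pandas sketch (hypothetical column names and notional example records):

import pandas as pd

raw = pd.DataFrame([
    {"src_ip": "203.0.113.5", "src_port": 443, "dst_ip": "198.51.100.7", "dst_port": 51514, "proto": 6},
    {"src_ip": "198.51.100.7", "src_port": 51514, "dst_ip": "203.0.113.5", "dst_port": 443, "proto": 6},
])

tcp = raw[raw["proto"] == 6]   # keep only TCP records (the real version also checks the TCP flags)

def normalize(row):
    # The side on the low / well-known port is the likely server; the other side is the likely client.
    if row["src_port"] <= 1023:
        return pd.Series({"server_ip": row["src_ip"], "client_ip": row["dst_ip"]})
    return pd.Series({"server_ip": row["dst_ip"], "client_ip": row["src_ip"]})

conversations = tcp.apply(normalize, axis=1).drop_duplicates()   # both directions collapse to one conversation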

Then I checked for indicators that the customer IP had been compromised or was behaving in a risky way:

  • Remote IP is risky. The remote IP might be in a country on the OFAC Sanctions List, it might be a known bad IP like a command-and-control server, or it might have some other property that makes it risky to communicate with.
  • Remote service is insecure. Because Qadium regularly talks to every IPv4 address on the most important ports using the most important protocols, we know what is at the other end of each netflow record. I flagged the flows where the remote system was too juicy to be legitimate.
  • Remote IP:port is against security policy. For example, many organizations ban peer-to-peer networking; many US government agencies banned the use of popular Kaspersky anti-virus and other products[1]. Behavior is able to detect and demonstrate compliance.
  • Remote is a honeypot. Legitimate users have no need to interact with honeypot servers, but compromised servers running automated attack and reconnaissance software find them very enticing.
  • Remote is not a publicized IP. Non-publicized IPs have no legitimate reason to be involved in a communication. Any communication with such an IP is an indicator that the client node is scanning the internet, which is an indicator that malware may be present on the customer system.
  • Customer service is insecure. Insecure systems sending voluminous outbound traffic to strange IPs are potentially compromised. (It’s also possible, though much less likely, that inbound traffic reflects an attacker’s reverse shell.)
  • We detect a change. By detecting changes in volume going to particular kinds of remote systems, we identified behavioral oddities for security teams to investigate.

I developed “risk filters” in all the categories, and applied them at scale to customer data. Hits were, unfortunately, common. Some connections were correctly dropped at a proxy firewall, but many others were not. I worked through many possible risk filter ideas, keeping the ones that were most effective. My final set all warranted security team investigation, unlike most security products, which have extremely high false positive rates. The risk filter hits created an opportunity for effective alignment conversations and uncovered misconfigured firewalls at customer locations (satellite offices are particularly tricky).
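
As an illustration (hypothetical, reusing the normalized columns from the sketch above), a single risk filter is conceptually just a predicate over the deduplicated conversations:

NON_PUBLICIZED_IPS = {"192.0.2.200", "192.0.2.201"}   # e.g., quiet addresses nobody should ever contact

def flag_likely_scanners(conversations):
    # Customer clients talking to unpublicized addresses are probably scanning the internet.
    return conversations[conversations["server_ip"].isin(NON_PUBLICIZED_IPS)]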

Squeezing value from data was great fun for me, and linking our static “inventory” product with dynamic “risk” data was incredibly valuable for customers.


I wrote this post with more than 5 years of hindsight, well past the point where any details are sensitive. I am backdating it to the first public release of detailed information about Behavior and our netflow work.