Your very own Star Trek Computer: Making sense of unexpected pipenv behavior with an LLM

Story time!

So, recently, I’m playing in someone else’s codebase, and I need to use their code to create a new Docker image that I can use for myself elsewhere. Their thin base image doesn’t have all my usual ML dependencies, so I need to extend the image. When I try, it blows up with an “Unknown compiler” message. With some legwork I learn that sklearn depends on scipy, and scipy needs to be compiled from source (and has for years now), and that thin base image doesn’t have the compiler. So I move to a non-default base image, and all is well.

Then I notice something I can’t explain, which I never would have noticed if I hadn’t been in this codebase. I had set Python to 3.9 using pyenv. I’m using pipenv --three to create a Python 3 Pipfile and then pipenv install to add some dependencies to it. (That’s all wrapped in a build script, because that’s how the codebase works, but that’s all the build script is doing.) But when I cat the Pipfile, the environment I just created is using Python 3.10, not 3.9! My mind is blown. I’ve never really used pipenv before… but I had a deeply rooted expectation that pipenv --three would use python3 --version for building the Pipfile. And clearly, it did not.

I read the pipenv docs and don’t see anything obviously relevant.

I do some internet searches. No answer.

I figure the explanation has to be obvious to anyone who actually has some pipenv experience – so I ask around. None of my usual suspects have pipenv experience either.

Then I have my epiphany. ChatGPT was just recently officially corporately blessed. This is the perfect question to use with ChatGPT – I can learn about pipenv, and I can use this to demonstrate how ChatGPT might help developers at the next engineering department sprint demo. (A good number of our developers are using Python professionally for the first time, in a codebase that extensively uses pipenv, and have not yet tried any LLMs – I like the odds of a sprint demo on this topic actually being useful for the audience.) So, I spend a few minutes composing a very careful message with an SSCCE and everything:

I need help making sense of my interactions with pipenv. I think I am creating a 
Python 3.9 environment and installing packages into it, and then freezing that as 
a Pipfile. But when I check the Pipfile, it shows 3.10 as my frozen environment.

Here's my shell interactions:

pipenv global 3.9
pipenv --three
python --version # output: 'Python 3.9.13'
pipenv install $pkg
cat Pipfile | grep python_version # output: 'python_version = "3.10"'

How does pipenv decide which version of Python to use? Be super succinct.

ChatGPT gives me a lovely (but still verbose) response, whose first and penultimate lines are entirely correct and ultimately all I need: “When you run pipenv --three, pipenv creates a new virtual environment using the latest version of Python that you have installed on your system. To create a virtual environment using Python 3.9, you can use pipenv --python 3.9 instead of pipenv --three.” Ah hah!

I ask it a bunch of related pipenv sense-making questions to get myself smarter on pipenv. ChatGPT gets most of my questions beautifully, perfectly, verifiably right. Then, harkening back to my “sklearn requires scipy requires a compiler” trouble from earlier, I ask a trickier (pure pip) question: “can I pip install precompiled binaries of scikit-learn and its upstreams instead of building from source?” ChatGPT’s answer is so wrong it’s painful: “Yes, there are precompiled binary versions of scikit-learn that you can install instead of building from source. You can try installing the precompiled binary version of scikit-learn by running the following command: pip install -U scikit-learn”. No, sorry, definitely not. In no Python world can you get a precompiled binary for all dependencies by upgrading a downstream user library written in Python. That’s a farcical statement.

And so I shared the story and take-aways at this sprint demo this week:

  • pyenv and pipenv work together… mostly.
  • pipenv will ignore your pyenv!
  • You get your very own Star Trek Computer now… it’s better than a rubber duck, and it’s the kindest, least judgmental coworker ever. (But it’s not your actual coworker; don’t send confidential or sensitive information.)
  • Robots are fallible too.

My stealth recruiting pitch

I have a stealth recruiting talk that I give for machine learning at Xpanse. It goes like this:

  • If you want a real mission in your work, cybersecurity can deliver.
  • My realm of cybersecurity is impossible without AI.
  • Doing this job means solving cool, hard problems.

I pretend the talk is all very objective and about teaching you stuff (which hopefully it also does). I also hint at a lot of the problems that I’ve worked on and solved the past few years. Technical people really like being shown problems and getting to chew on them (which is convenient, because I’m not comfortable publicly sharing how I solved these problems!). And I’m sure it helps that I’m earnest – the slides really are what I love most about my job. The talk works, too. We get solid applicants from it.

I was inspired by a talk I saw from Stitch Fix a number of years ago. I have really minimal interest in clothing… but after hearing them give a technical talk about the problems they were solving, I became convinced they were doing really neat modeling and would be worth considering in a job hunt. Pretty effective. So I tried to channel that insight.

The slides I’m linking here conclude with a harder sell than I usually give, as well as some cross-team Palo Alto projects, because I revised this version for an explicitly recruiting context. (One of our senior recruiters had seen me give this talk at the Lesbians Who Tech conference, and he asked me to give it again in a different context.)

Comfort, distress and dominance: Reading body language

Body language can indicate state of mind. Being familiar with body language tells can help people read a room, avoid closing past the sell on a negotiation, and become more self-aware. I wrote and delivered a short orientation to Comfort, distress and dominance: Reading body language as part of a non-technical skills development series within an established team. It is framed as three 2-3 minute topic introductions followed by 5-10 minutes of small group moderated discussion.

Python web services are built on each other's crazy

Recently I broke a Python web service by adding a new library to the backend. I was floored, because there was no reason that library should have affected anything in the service layer. But of course the computer wasn’t wrong. There was a reason. And figuring it out meant I got to take a deep dive into the coherent cloud of strange practices that is Python web services.

My misbehaving Python service follows all the established best practices. Its public face is an nginx server that accepts public HTTP(S) connections from clients. That nginx server talks to an internal gunicorn server. The gunicorn server is a Python WSGI server that handles many simultaneous HTTP connections; it translates from HTTP connections to Python logic. The gunicorn server loads and serves a Flask application. The Flask application encapsulates the application logic. It links particular web paths in the application to the Python code that calculates the response payloads. All good, and nothing strange here.

The bug first manifested as POSTs responding with empty replies to clients on my local machine. The server log said there was a segmentation fault – signal 11, SIGSEGV. A segmentation fault means a process is accessing memory that does not belong to it:

[2022-01-20 11:18:26 -0800] [75979] [INFO] Starting gunicorn 20.1.0
[2022-01-20 11:18:26 -0800] [75979] [INFO] Listening at (75979)
[2022-01-20 11:18:26 -0800] [75979] [INFO] Using worker: sync
[2022-01-20 11:18:26 -0800] [75987] [INFO] Booting worker with pid: 75987
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
[2022-01-20 11:18:26 -0800] [75979] [WARNING] Worker with pid 75987 was terminated due to signal 11
[2022-01-20 11:18:26 -0800] [75989] [INFO] Booting worker with pid: 75989

I grumbled to myself, thinking “What kind of ridiculous system calls fork without exec?!? Surely anything as widely used as gunicorn behaves in reasonable ways. This error message must be a red herring.”

Screenshot of fork without exec (`spawn_worker` lines 572-592)

Yes, gunicorn calls fork without exec. They call it a "pre-fork worker model".

Well, then I read the gunicorn source code. Calling fork without exec is exactly what gunicorn does. It forks, then the child initiates the core of the application. No exec to be found, which means the child inherits all the parent’s memory. The child nods to resource cleanup by closing a bunch of file pointers – but the memory is shared with the parent. (And kind of oddly, each child separately reinitializes all the application state by default – rather than initializing in the parent, forking, and letting all the children share all the read-only memory. You can get the second behavior, but you have to actively set preload_app=True.)

I verified that the fork-without-exec behavior was the root problem in a couple more ways. First I ran the application bare, as a single process, with no gunicorn. Flask alone didn’t segfault, which reinforced the idea of gunicorn’s fork behavior as the culprit. Then I checked the Mac Console (a useful tool that I learned about during this investigation!) for Crash Reports, and it also showed the fault with the message “crashed on child side of fork pre-exec”. So, yep, fork-without-exec clearly implicated.

So then I instrumented the code. I traced the segmentation fault to the line where it occurred. The bad memory access occurred when the new library used the Google Cloud Storage client for a download operation:

self.gcs_client.download_blob_to_file( ... )

Now I finally had enough clues to start to put it together. It seems the GCS client was being created before the fork – before worker.init_process() ever ran, in fact! Then, when the child worker tried to actually use the GCS client, it segfaulted, because the GCS client was in parent memory rather than child memory. I hypothesized that the SIGSEGV occurred because Apple’s CoreFoundation OS framework disallows children using their parent’s memory as part of its ban on “fork without exec”. (I also figured that if the GCS client had been created only in the child, there would be no issue. From the child’s perspective, its process is unique and alone; apart from its pid value, the child is unaware of any multiprocessing. That said, I had never heard of CoreFoundation before this bug, so it’s quite possible CoreFoundation works a different way.)

Summarizing my understanding of the situation to this point:

  1. The GCS storage client is created in gunicorn’s parent process.
  2. The GCS storage client relies on some library that is compiled specifically for Mac. (pip and other package managers seamlessly choose the right binary wheel for each system, and will compile on the user’s local machine for a source distribution as needed.) Anything compiled specifically for Mac uses CoreFoundation for all the very basic operations, including URLs and stream sockets.
  3. The gunicorn server process forks – but it doesn’t exec. Fork-without-exec is a deliberate design decision by gunicorn; it seems like gunicorn wants to make bad code with memory leaks and other issues more stable.
  4. A POST request comes in. Gunicorn hands it to a worker, which is a child. As the worker handles the request, it tries to use that CoreFoundation functionality. Rather than getting copy-on-write semantics for CoreFoundation functionality in the parent, we get a SIGSEGV. (All fork-without-exec is potentially unsafe, so Apple blocks it since Catalina. I understand the reasoning thusly: It is impossible to guarantee that the parent is not a thread itself. If the parent is a thread, then from the point of view of the child, all its peer threads were violently murdered, and so they will never release any held locks – including important locks for the child, like locks on malloc. So, better to avoid this entire issue and instead force the entire memory of the child to be replaced via exec.)
  5. The worker dies (it can’t handle the signal). The gunicorn parent manager spawns another child worker. But the new child has the same memory configuration, and it is also unable to handle requests.
  6. The application is completely broken.

(This is what I pieced together. It’s my first foray into some of this tech, though – please reach out with corrections and other explanations!)

That makes sense as far as it goes. But now I had a new mystery: why would that GCS client be created before the fork? The GCS client was used deep in the bowels of request handling within the application. But the parent? That’s gunicorn. Gunicorn is a webserver. There’s no reason for application code to be executed when gunicorn starts….

At this point, I decided to tackle the problem from the other direction. I started building a very tiny version of the application entirely from scratch. Just Flask, gunicorn, and the new library functionality? No segfault. Just Flask, gunicorn, and the application’s use of the new library functionality? No segfault. Then I tried to introduce the application’s gunicorn configuration file to the mix. The system segfaulted instantly.

Reviewing the Python web service’s gunicorn config, the mystery finally became clear. The config file is in Python. The first few lines of the gunicorn config file looked like this:

from os import environ
from application_library.application_path import PORT

bind = ":" + str(PORT) # port to use for application

The service layer was importing the application!

When gunicorn “read” this config, it actually executed the config. (I find “execution” to be a very odd pattern to use for a config file, but whatever; this is how gunicorn works.) As soon as gunicorn executed the line from application_library.application_path import PORT, gunicorn also executed the Python file application_library.application_path, because “execute on import” shenanigans are core to how Python works: when Python first imports a file, it executes everything in the file and attaches it to the module’s scope (check out dir(sys.modules["$moduleName"]) to see this in action).

That PORT import has massive side-effects, because the Flask app object was defined in that same file. Flask’s pattern for creating applications is to include app = Flask(__name__) as a plain line in a Python file. That line isn’t wrapped in a class, or in any kind of conditional. It’s just a top-level, unindented statement. As a result, when we imported PORT from a file that also included app = Flask(__name__), the whole application immediately sprung into being. Even down to a GCS client deep in the bowels of the application.

So that little throwaway PORT import? It was probably introduced to guarantee that the default ports used by a Flask server and a gunicorn server always matched (it might have even been introduced by me – I gave up git blame after tracing the refactor history for a few minutes). But that import statement also unwittingly caused the entire application to be created pre-fork, in the server – which broke the whole application on strict Macs.

A dead bug on its back

Once tracked down, the fix was thankfully simple. I replaced that too-DRY reference to PORT with more duplication. I defined the default port number independently in Flask and gunicorn so there would be no dependency between the service layer and the application layer:

bind = f":{os.environ.get('PORT', 8080)}" # port to use for application

With this change, the application does not exist until worker.init_process(). The segmentation fault is gone.

(It’s unlikely I’d have noticed this bug if I had only been testing on Linux servers. I prefer the lower overhead of local testing when possible, but this is a good reminder that the platform sometimes does still matter. Cross-platform code is hard.)

My takeaway from this debugging experience is “weirdness propagates from initial decisions”. Python, gunicorn and Flask mostly play nice together, but it’s because they’re built on each others’ crazy (and a lot of eyes).

The chains I see are:

  1. Python web app developers aren’t trusted to write applications that can run indefinitely. –> gunicorn eschews the two most common forking patterns in favor of a “fork-then-load” pattern that maximizes the ease with which an application can be reinitialized in the worker.

    • “Load-everything-in-parent-then-fork-without-exec” is very memory efficient, since the children processes all share a single copy of read-only memory. Gunicorn doesn’t use this pattern, because it would mean memory leaks and other code issues would require more complexity to fix than a quick reload in the child process. (The gunicorn documentation seems to dissuade users from loading the app before fork: “By preloading an application you can save some RAM resources as well as speed up server boot times. Although, if you defer application loading to each worker process, you can reload your application code easily by restarting workers.”)
    • “Fork-then-exec” is very safe, since the child process memory is completely replaced and you can’t accidentally get into deadlock. Gunicorn doesn’t use this pattern, because it would require spawning entirely new processes each time a worker child died, and process creation is pretty slow. (I’m still surprised by this, honestly; it’s not really that slow to create new processes, especially not compared to application lifetimes. Maybe I’m missing something.)
    • “Fork-without-exec-then-load” is what gunicorn opts to use. This approach uses more memory and it’s more dangerous, but it means reinitializing the user’s application each time something goes wrong is very lightweight.
  2. Python is a scripting language in which all keywords are real statements that get executed, rather than some of them being declarations. –> It is possible to execute substantial amounts of code just by using the import keyword.

  3. Configuring callbacks and other complex functionality is easiest if the config file is itself Python. –> In gunicorn, the config file can have side-effects. Nothing limits config to declaring parameter values.

  4. Python and almost all its libraries are intended to run cross-platform. –> There are potentially bugs lurking in Python’s interactions with environments, because testing all code in all environment configurations is hard.

An image cut horizontally. Above ground there are trees. Below ground there is a dense interconnected network of white fungi and tree roots.

Everything interacts with everything else when you look deep enough.

Sherlock Holmes was my companion throughout this journey:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

I investigated a number of wrong hypotheses along the way, from which I learned a ton – but this tale is quite long just covering what really was going on. I did not expect gunicorn in particular to work quite the way it does.

NeurIPS: Data space as the dual to feature space

We use “data slices” to evaluate our cybersecurity ML systems for the asset attribution task at Palo Alto Networks. For us, data slices are the dual of feature explanations. By segmenting our data into subunits with known properties, we can verify that improvements to address model blindspots actually succeed, we can detect model regression, and we can characterize differences between models.

I described our approach in a paper on using data subsets to evaluate the ML internet asset attribution problem, which was accepted to the NeurIPS Data-Centric AI (DCAI) Workshop held on 14 December 2021. The DCAI workshop focused on practical tooling, best practices, and infrastructure for data management in modern ML systems. The paper discusses two themes: (1) data slices, and (2) their application to our asset attribution task in cybersecurity.