Pamela Toman

Tag: sprint-demo

Making ReAct work in practice

I spent some time getting a ReAct LLM (Yao et al. 2022) to autonomously answer questions. Improving performance turned out to require some changes that I haven't seen discussed in my technical haunts. Through failure case analysis and targeted interventions, I was able to drive performance from on the order of 1% to north of 80% over a couple weeks.

ReAct ("Reasoning and Acting") builds on chain-of-thought. It gives the LLM more structure around thinking and acting, which reduces hallucination and error propagation. With ReAct, we use a straightforward LLM prompt that describes some actions that the agent can take to get external information. The prompt also explains that we expect a final structure of repeated Thought / Action / Action Input / Observation blocks. The LLM then constructs a structured text that follows those rules in collaboration with a backend system. The backend system orchestrates and keeps the LLM on track. In particular, the backend parses the LLM outputs, reaches out to the Action systems to produce the Observations, and writes the prompt for the LLM.

I'll give a concrete example (text taken from the langchain docs). In the example, I've colored the sources of the text. I give user input in dark orange, the lines from the coordinating backend system in blue, and the text generated by the LLM in red with grey background:


Question: Who is Leo DiCaprio's girlfriend? What is her current age raised to 
the 0.43 power?

Thought: I need to find out who Leo DiCaprio's girlfriend is and then calculate
her age raised to the 0.43 power.
Action: Search
Action Input: "Leo DiCaprio girlfriend"
Observation: Camila Morrone

Thought: I need to find out Camila Morrone's age
Action: Search
Action Input: "Camila Morrone age"
Observation: 25 years

Thought: I need to calculate 25 raised to the 0.43 power
Action: Calculator
Action Input: 25^0.43
Observation: Answer: 3.991298452658078

Thought: I now know the final answer
Final Answer: Camila Morrone is Leo DiCaprio's girlfriend and her current age
raised to the 0.43 power is 3.991298452658078.

Each time control returns to the backend, the backend parses what the LLM produced. It throws away everything that isn't immediately relevant (like any oh-so-eagerly-hallucinated Observations). Then the backend constructs a new, slightly longer prompt with a real Observation, and prompts the LLM to complete it a bit further. Eventually the LLM produces a Final Answer, which the backend parses and returns to the user.

So, that's ReAct. It didn't work very well for me out-of-the-box. The langchain implementation plus a dozen possible actions produced sub-5% performance on questions it really should have been able to answer.

So, I embarked on a ReAct performance improvement quest. What worked for me, in order from least surprising to most surprising to me, was:

Prompt engineering the action descriptions to improve dispatch. First-pass docstrings often are flawed. We all have this problem — the docstring writer has high context but the reader does not, and what is salient to the writer isn't always salient to the reader. So, I made sure each available action was described in one sentence starting with a verb. Then I re-scoped each "action" by how clearly I could write a user-facing description for the idea (rather than by how APIs are broken apart).
Making parameterization errors visible to enable recovery. I found the LLM often chose a poor Action Input, which caused the backend to receive exceptions when it tried to execute the LLM's instructions. I wanted to make the "bad parameterization" problem more tangible to the LLM. I took this on in three ways: (1) with a priori clues — I described the format of the input in the description (e.g., is it an integer, enum, string, ...), (2) with data typing (e.g., does the UUID that the LLM wants to pass refer to an object of the appropriate type for this action), (3) with a posteriori clues — I updated the backend to include meaningful error messages as the Observation for bad parameterizations, so the LLM would get another crack at it.
Removing all "early stop" actions to encourage actually trying. LangChain allows you to define "early stop" actions. If the LLM picks one of these actions, the entire reasoning chain gets aborted. I found that the LLM would often pick an early stop action as its very first action. So, I strongly encouraged it to always try to answer by removing all early stop actions. It is still possible for the system to go immediately from the Question into a Final Answer of "I have no idea what to do to answer this". But in practice the system is actually trying now, when it often wasn't before.
Dropping old Thoughts so it can't confuse its guesses for facts. I was semi-frequently seeing the LLM hypothesize a wild idea, decide on an appropriate action and input to test that wild idea, receive the right answer, and then write a Final Action that treated the wild idea as if it were a fact. This behavior is understandable, and it is also very bad. So, now the LLM doesn't get to see its previous Thoughts.
Actively seeding Thoughts to recover from unparseable LLM responses. Sometimes the LLM thinks the most likely next token sequence is nothing, and we get an empty string as the completion. Sometimes it doesn't generate text in the required format. Sometimes it goes off the rails in some other way. All of these break parsing. In the LangChain case, the backend has no effective way to get the whole agent back on track again, so it raises an exception and exits. Ideally, though, it would be able to recover. So, when the LLM gives garbage responses, I've started putting Thoughts in its head. Extending the system prompt slightly past the colon with an innocuous phrase — to something like Thought: I need to decide on an Action and Action Input — is enough to break the LLM free. This "seeding its thoughts" approach works well even when the temperature is turned down to 0.0 (with maximal determinism during text generation), because the approach ensures a slightly different prompt from the one that failed.

With these changes, I was able to raise the performance from <5% to on the order of 70-90% pretty quickly.

I suppose in addition to "keep thinking; a simpler solution will come", the wider lesson that was reinforced for me here is that popular ideas can easily suck up more than their share of oxygen in the public conversation (even good ideas like prompt engineering!). "Don't stop with what's popular; make sure everyone is looking at the real behaviors" is my takeaway for myself.

Your very own Star Trek Computer: Making sense of unexpected pipenv behavior with an LLM

Story time!

So, recently, I’m playing in someone else’s codebase, and I need to use their code to create a new Docker image that I can use for myself elsewhere. Their thin base image doesn’t have all my usual ML dependencies, so I need to extend the image. When I try, it blows up with an “Unknown compiler” message. With some legwork I learn that sklearn depends on scipy, and scipy needs to be compiled from source (and has for years now), and that thin base image doesn’t have the compiler. So I move to a non-default base image, and all is well.

Then I notice something I can’t explain, which I never would have noticed if I hadn’t been in this codebase. I had set Python to 3.9 using pyenv. I’m using pipenv --three to create a Python 3 Pipfile and then pipenv install to add some dependencies to it. (That’s all wrapped in a build script, because that’s how the codebase works, but that’s all the build script is doing.) But when I cat the Pipfile, the environment I just created is using Python 3.10, not 3.9! My mind is blown. I’ve never really used pipenv before… but I had a deeply rooted expectation that pipenv --three would use python3 --version for building the Pipfile. And clearly, it did not.

I read the pipenv docs and don’t see anything obviously relevant.

I do some internet searches. No answer.

I figure the explanation has to be obvious to anyone who actually has some pipenv experience – so I ask around. None of my usual suspects have pipenv experience either.

Then I have my epiphany. ChatGPT was just recently officially corporately blessed. This is the perfect question to use with ChatGPT – I can learn about pipenv, and I can use this to demonstrate how ChatGPT might help developers at the next engineering department sprint demo. (A good number of our developers are using Python professionally for the first time, in a codebase that extensively uses pipenv, and have not yet tried any LLMs – I like the odds of a sprint demo on this topic actually being useful for the audience.) So, I spend a few minutes composing a very careful message with an SSCCE and everything:

I need help making sense of my interactions with pipenv. I think I am creating a 
Python 3.9 environment and installing packages into it, and then freezing that as 
a Pipfile. But when I check the Pipfile, it shows 3.10 as my frozen environment.

Here's my shell interactions:

pipenv global 3.9
pipenv --three
python --version # output: 'Python 3.9.13'
pipenv install $pkg
cat Pipfile | grep python_version # output: 'python_version = "3.10"'

How does pipenv decide which version of Python to use? Be super succinct.

ChatGPT gives me a lovely (but still verbose) response, whose first and penultimate lines are entirely correct and ultimately all I need: “When you run pipenv --three, pipenv creates a new virtual environment using the latest version of Python that you have installed on your system. To create a virtual environment using Python 3.9, you can use pipenv --python 3.9 instead of pipenv --three.” Ah hah!

I ask it a bunch of related pipenv sense-making questions to get myself smarter on pipenv. ChatGPT gets most of my questions beautifully, perfectly, verifiably right. Then, harkening back to my “sklearn requires scipy requires a compiler” trouble from earlier, I ask a trickier (pure pip) question: “can I pip install precompiled binaries of scikit-learn and its upstreams instead of building from source?” ChatGPT’s answer is so wrong it’s painful: “Yes, there are precompiled binary versions of scikit-learn that you can install instead of building from source. You can try installing the precompiled binary version of scikit-learn by running the following command: pip install -U scikit-learn”. No, sorry, definitely not. In no Python world can you get a precompiled binary for all dependencies by upgrading a downstream user library written in Python. That’s a farcical statement.

And so I shared the story and take-aways at this sprint demo this week:

pyenv and pipenv work together… mostly.
pipenv will ignore your pyenv!
You get your very own Star Trek Computer now… it’s better than a rubber duck, and it’s the kindest, least judgmental coworker ever. (But it’s not your actual coworker; don’t send confidential or sensitive information.)
Robots are fallible too.

Python web services are built on each other's crazy

Recently I broke a Python web service by adding a new library to the backend. I was floored, because there was no reason that library should have affected anything in the service layer. But of course the computer wasn’t wrong. There was a reason. And figuring it out meant I got to take a deep dive into the coherent cloud of strange practices that is Python web services.

My misbehaving Python service follows all the established best practices. Its public face is an nginx server that accepts public HTTP(S) connections from clients. That nginx server talks to an internal gunicorn server. The gunicorn server is a Python WSGI server that handles many simultaneous HTTP connections; it translates from HTTP connections to Python logic. The gunicorn server loads and serves a Flask application. The Flask application encapsulates the application logic. It links particular web paths in the application to the Python code that calculates the response payloads. All good, and nothing strange here.

The bug first manifested as POSTs responding with empty replies to clients on my local machine. The server log said there was a segmentation fault – signal 11, SIGSEGV. A segmentation fault means a process is accessing memory that does not belong to it:

[2022-01-20 11:18:26 -0800] [75979] [INFO] Starting gunicorn 20.1.0
[2022-01-20 11:18:26 -0800] [75979] [INFO] Listening at http://0.0.0.0:8080 (75979)
[2022-01-20 11:18:26 -0800] [75979] [INFO] Using worker: sync
[2022-01-20 11:18:26 -0800] [75987] [INFO] Booting worker with pid: 75987
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
[2022-01-20 11:18:26 -0800] [75979] [WARNING] Worker with pid 75987 was terminated due to signal 11
[2022-01-20 11:18:26 -0800] [75989] [INFO] Booting worker with pid: 75989

I grumbled to myself, thinking “What kind of ridiculous system calls fork without exec?!? Surely anything as widely used as gunicorn behaves in reasonable ways. This error message must be a red herring.”

Screenshot of fork without exec (`spawn_worker` lines 572-592) — Yes, gunicorn calls fork without exec. They call it a "pre-fork worker model".

Well, then I read the gunicorn source code. Calling fork without exec is exactly what gunicorn does. It forks, then the child initiates the core of the application. No exec to be found, which means the child inherits all the parent’s memory. The child nods to resource cleanup by closing a bunch of file pointers – but the memory is shared with the parent. (And kind of oddly, each child separately reinitializes all the application state by default – rather than initializing in the parent, forking, and letting all the children share all the read-only memory. You can get the second behavior, but you have to actively set preload_app=True.)

I verified that the fork-without-exec behavior was the root problem in a couple more ways. First I ran the application bare, as a single process, with no gunicorn. Flask alone didn’t segfault, which reinforced the idea of gunicorn’s fork behavior as the culprit. Then I checked the Mac Console (a useful tool that I learned about during this investigation!) for Crash Reports, and it also showed the fault with the message “crashed on child side of fork pre-exec”. So, yep, fork-without-exec clearly implicated.

So then I instrumented the code. I traced the segmentation fault to the line where it occurred. The bad memory access occurred when the new library used the Google Cloud Storage client for a download operation:

self.gcs_client.download_blob_to_file( ... )

Now I finally had enough clues to start to put it together. It seems the GCS client was being created before the fork – before worker.init_process() ever ran, in fact! Then, when the child worker tried to actually use the GCS client, it segfaulted, because the GCS client was in parent memory rather than child memory. I hypothesized that the SIGSEGV occurred because Apple’s CoreFoundation OS framework disallows children using their parent’s memory as part of its ban on “fork without exec”. (I also figured that if the GCS client had been created only in the child, there would be no issue. From the child’s perspective, its process is unique and alone; apart from its pid value, the child is unaware of any multiprocessing. That said, I had never heard of CoreFoundation before this bug, so it’s quite possible CoreFoundation works a different way.)

Summarizing my understanding of the situation to this point:

The GCS storage client is created in gunicorn’s parent process.
The GCS storage client relies on some library that is compiled specifically for Mac. (pip and other package managers seamlessly choose the right binary wheel for each system, and will compile on the user’s local machine for a source distribution as needed.) Anything compiled specifically for Mac uses CoreFoundation for all the very basic operations, including URLs and stream sockets.
The gunicorn server process forks – but it doesn’t exec. Fork-without-exec is a deliberate design decision by gunicorn; it seems like gunicorn wants to make bad code with memory leaks and other issues more stable.
A POST request comes in. Gunicorn hands it to a worker, which is a child. As the worker handles the request, it tries to use that CoreFoundation functionality. Rather than getting copy-on-write semantics for CoreFoundation functionality in the parent, we get a SIGSEGV. (All fork-without-exec is potentially unsafe, so Apple blocks it since Catalina. I understand the reasoning thusly: It is impossible to guarantee that the parent is not a thread itself. If the parent is a thread, then from the point of view of the child, all its peer threads were violently murdered, and so they will never release any held locks – including important locks for the child, like locks on malloc. So, better to avoid this entire issue and instead force the entire memory of the child to be replaced via exec.)
The worker dies (it can’t handle the signal). The gunicorn parent manager spawns another child worker. But the new child has the same memory configuration, and it is also unable to handle requests.
The application is completely broken.

(This is what I pieced together. It’s my first foray into some of this tech, though – please reach out with corrections and other explanations!)

That makes sense as far as it goes. But now I had a new mystery: why would that GCS client be created before the fork? The GCS client was used deep in the bowels of request handling within the application. But the parent? That’s gunicorn. Gunicorn is a webserver. There’s no reason for application code to be executed when gunicorn starts….

At this point, I decided to tackle the problem from the other direction. I started building a very tiny version of the application entirely from scratch. Just Flask, gunicorn, and the new library functionality? No segfault. Just Flask, gunicorn, and the application’s use of the new library functionality? No segfault. Then I tried to introduce the application’s gunicorn configuration file to the mix. The system segfaulted instantly.

Reviewing the Python web service’s gunicorn config, the mystery finally became clear. The config file is in Python. The first few lines of the gunicorn config file looked like this:

from os import environ
from application_library.application_path import PORT

bind = ":" + str(PORT) # port to use for application

The service layer was importing the application!

When gunicorn “read” this config, it actually executed the config. (I find “execution” to be a very odd pattern to use for a config file, but whatever; this is how gunicorn works.) As soon as gunicorn executed the line from application_library.application_path import PORT, gunicorn also executed the Python file application_library.application_path, because “execute on import” shenanigans are core to how Python works: when Python first imports a file, it executes everything in the file and attaches it to the module’s scope (check out dir(sys.modules["$moduleName"]) to see this in action).

That PORT import has massive side-effects, because the Flask app object was defined in that same file. Flask’s pattern for creating applications is to include app = Flask(__name__) as a plain line in a Python file. That line isn’t wrapped in a class, or in any kind of conditional. It’s just a top-level, unindented statement. As a result, when we imported PORT from a file that also included app = Flask(__name__), the whole application immediately sprung into being. Even down to a GCS client deep in the bowels of the application.

So that little throwaway PORT import? It was probably introduced to guarantee that the default ports used by a Flask server and a gunicorn server always matched (it might have even been introduced by me – I gave up git blame after tracing the refactor history for a few minutes). But that import statement also unwittingly caused the entire application to be created pre-fork, in the server – which broke the whole application on strict Macs.

Once tracked down, the fix was thankfully simple. I replaced that too-DRY reference to PORT with more duplication. I defined the default port number independently in Flask and gunicorn so there would be no dependency between the service layer and the application layer:

bind = f":{os.environ.get('PORT', 8080)}" # port to use for application

With this change, the application does not exist until worker.init_process(). The segmentation fault is gone.

(It’s unlikely I’d have noticed this bug if I had only been testing on Linux servers. I prefer the lower overhead of local testing when possible, but this is a good reminder that the platform sometimes does still matter. Cross-platform code is hard.)

My takeaway from this debugging experience is “weirdness propagates from initial decisions”. Python, gunicorn and Flask mostly play nice together, but it’s because they’re built on each others’ crazy (and a lot of eyes).

The chains I see are:

Python web app developers aren’t trusted to write applications that can run indefinitely. –> gunicorn eschews the two most common forking patterns in favor of a “fork-then-load” pattern that maximizes the ease with which an application can be reinitialized in the worker.
- “Load-everything-in-parent-then-fork-without-exec” is very memory efficient, since the children processes all share a single copy of read-only memory. Gunicorn doesn’t use this pattern, because it would mean memory leaks and other code issues would require more complexity to fix than a quick reload in the child process. (The gunicorn documentation seems to dissuade users from loading the app before fork: “By preloading an application you can save some RAM resources as well as speed up server boot times. Although, if you defer application loading to each worker process, you can reload your application code easily by restarting workers.”)
- “Fork-then-exec” is very safe, since the child process memory is completely replaced and you can’t accidentally get into deadlock. Gunicorn doesn’t use this pattern, because it would require spawning entirely new processes each time a worker child died, and process creation is pretty slow. (I’m still surprised by this, honestly; it’s not really that slow to create new processes, especially not compared to application lifetimes. Maybe I’m missing something.)
- “Fork-without-exec-then-load” is what gunicorn opts to use. This approach uses more memory and it’s more dangerous, but it means reinitializing the user’s application each time something goes wrong is very lightweight.
Python is a scripting language in which all keywords are real statements that get executed, rather than some of them being declarations. –> It is possible to execute substantial amounts of code just by using the import keyword.
Configuring callbacks and other complex functionality is easiest if the config file is itself Python. –> In gunicorn, the config file can have side-effects. Nothing limits config to declaring parameter values.
Python and almost all its libraries are intended to run cross-platform. –> There are potentially bugs lurking in Python’s interactions with environments, because testing all code in all environment configurations is hard.

An image cut horizontally. Above ground there are trees. Below ground there is a dense interconnected network of white fungi and tree roots. — Everything interacts with everything else when you look deep enough.

Sherlock Holmes was my companion throughout this journey:

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

I investigated a number of wrong hypotheses along the way, from which I learned a ton – but this tale is quite long just covering what really was going on. I did not expect gunicorn in particular to work quite the way it does.

What’s this blog about?

Whatever is on my mind. The content has varied over the past more-than-decade, but it's always been technical. In the early years I focused on improving the fabric of the internet for some niche tools. But the internet no longer needs that kind of improving, and search doesn't really work like that anymore either. This blog is currently mostly about documenting notes for my future self, and sharing those notes with anyone who is interested.

Pamela Toman

Tag: sprint-demo

Making ReAct work in practice

Your very own Star Trek Computer: Making sense of unexpected pipenv behavior with an LLM

Python web services are built on each other's crazy

Yes, gunicorn calls fork without exec. They call it a "pre-fork worker model".

Everything interacts with everything else when you look deep enough.

What’s this blog about?

Recent posts

Tags