On leadership and being the second voice

Apparently I have phase shifted into the kind of officially and unofficially senior person of whom one earnestly asks listicle questions about professional topics. A few days ago I was asked about leadership. I was wearing my “Distinguished Engineer” hat, I was in front of the business unit, and I had been framed as someone with outsized tenure and impact. My response garnered massive engagement in the backchannel chat — truly shocking amounts of engagement — so apparently it was valuable, and here I write it for posterity.

I situate “leadership” in a general observation about people. If you’re out socializing with a group of friends and you need to get them to physically move, you can’t just walk away and expect your walking away to spark a change in the wider group. Groups have inertia. A single person doing their own thing probably won’t even be noticed. To cause a change, you need to get two people to walk away: one to lead, and one to follow. Once the second person follows, the rest of the group will follow as well. The second person is the key. The second person has the power.

We see this exact same pattern in work meetings. Someone has an idea or suggests some flight of fancy. For that idea to be adopted, someone else also needs to express support for the idea. The intellectual dynamic echoes the physical dynamic: most people are happy to just stick with the group and do their highly local thing. But once there are two people moving in the same new direction, the rest of the group follows. The second voice is the key. The second person has the power.

The observation that “it takes two to move a group” has two implications for management. First, management wields power by simply being willing to be a second voice. By calling out the good ideas, the boss guides the team toward the best outcomes available. Much of management work is leadership.

But anyone can be the second voice, calling out good ideas when they see them. In fact, I’m convinced organizations work best when they make decisions as low in the hierarchy as is reasonable. Decision making engenders ownership, it aligns incentives, and it allows the people with the most knowledge to make the right call. Ceding leadership to management is common, but the most effective teams often have multiple people who alternate acting as the second voice for each other’s good ideas.

Needing two voices to move a group has a second implication for management. If the manager can’t trust the team to generate some good ideas, the manager is totally stuck. You can’t lead a group when no good ideas get proposed, because your voice has nothing to get behind. You can’t lead a group just by suggesting your own ideas, because those ideas still need that crucial second voice, and the team might not supply it. And heaven forbid your team has a you-scratch-my-back-I’ll-scratch-yours voting ring whose incentives are orthogonal to those of the wider organization; you’ll never break them free with the ring intact. To be a successful managerial leader, you must have folks on the team whose ideas and judgment you trust. Ensuring you have reliable leaders in the room is key to your success in your role.

So, my take on leadership: Leaders verbalize their belief in the good ideas they encounter. They trust their own judgment. They are willing to put their necks out and be the second mover. Leaders are the people who regularly and successfully are the second voice.


The importance of the “second mover” came to me by way of common knowledge for line management and field trips in K-12 education. (I love the education literature. The whole field of classroom management is dedicated to exceedingly practical advice on getting a group of people with varying levels of interest and motivation to behave in the ways you need, while avoiding resentment and maintaining everyone’s dignity.) Reiterating ideas is also key to the “amplification” technique (archived) that women used in the Obama White House in 2016 to make sure their contributions were recognized.

Approximation through hashing

Approximation algorithms are quite possibly my favorite topic in software. I find them remarkable and beautiful.

Here’s the motivation: On large data, many kinds of exact answers are expensive to calculate. At the same time, there are many, many circumstances where imperfect answers are acceptable. For instance, when deciding whether we need to execute a task (say, because it’s the first time we’ve seen it), we don’t need to be exactly right – we’ll still see a significant speed-up from avoiding most duplicative work. When incoming data have already been sampled, we know that pursuing an “exact top 3” is a fool’s errand – our answer will never be anything but approximate. When a human decision is the outcome we’re driving, we need to deliver a gestalt – it’s vanishingly rare that anyone cares about the exact numbers.

Approximation algorithms are gorgeous. My favorites rely on hashing to cleverly and cheaply get us answers:

  • Cardinality estimation (e.g., HyperLogLog, CVM): We hash everything and we track the longest run of leading 0s that we see in the resulting hashes. If the longest run we ever see is only one or two 0s, we’ve probably only seen a little distinct data. The longer the run of 0s, the more distinct data we’re likely to have seen. See Alessandro Nadalin for more on HyperLogLog. (There are other approaches to this problem, too, like the just-introduced-two-years-ago CVM algorithm. The CVM algorithm tracks a randomly chosen subset of elements in finite memory, such that the cardinality of the elements in the buffer can be trivially converted into an estimate of the overall stream’s cardinality.)
  • Distance metric estimation (e.g., locality sensitive hashing/LSH, MinHash & SimHash): To each piece of data, we apply a set of hash functions. The hash functions are designed to put similar items in the same “bucket”. (In a machine learning context, we might hash on features, plus some random noise.) We approximate the distance between two samples as the extent to which the hash functions produce collisions for those two samples. See Tyler Neylon for more on LSH.
  • Set presence estimation (e.g., Bloom & Cuckoo filters): When we observe an item, we turn on k bits in a bitmap. We figure out which k bits to turn on by hashing the item through k distinct hash functions. Then to test whether a new item is in the set, we hash the new item the same way and check whether all k corresponding bits are already turned on. See Maciej Ceglowski for more on Bloom filters.
  • Frequency estimation (e.g., count-min sketch): When we observe an item, we hash it k times. We store the results in a grid where each hash function corresponds to a single row. For each hashed input, we increment the counter at that hash function’s row and at the column given by the hash value. Then to estimate the frequency of an item, we retrieve all k counters that correspond to that item, and we return the minimum of those values. See Vivek Bansal for more on count-min sketch. (A minimal Python sketch of this one follows the list.)
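
To make that last one concrete, here is a minimal count-min sketch. It’s a sketch of the idea rather than a production implementation – the width, depth, and salted-SHA-256 hashing are arbitrary choices of mine, not anything prescribed by the canonical algorithm:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: one row of counters per hash function."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # columns per row
        self.depth = depth    # number of hash functions, i.e. rows
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # Derive `depth` independent-ish hash functions by salting one strong hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest (over-)estimate available.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))


sketch = CountMinSketch()
for word in ["cat", "dog", "cat", "cat", "bird"]:
    sketch.add(word)
print(sketch.estimate("cat"))   # 3 (or higher, if collisions occurred)
print(sketch.estimate("fish"))  # 0 (or higher, if collisions occurred)
```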

It’s astounding to me that we have effective hash functions at all. What a wonder.

NVIDIA GTC 2024

Some takeaways from GTC 2024:

Vendors

Vendors with noticeably busy booths:

  • Weights and Biases (busy despite having multiple booths): experimentation platform, focused on managing experiments wrt hyperparameter optimization; their product and go-to-market teams seem top-notch
  • CoreWeave: a cloud vendor specializing in GPU compute
  • Twelve Labs: multimodal AI (video)
  • Unstructured: tooling to prep terrible enterprise data into JSON

Some legacy vendors are holding on and trying to pivot.

The glamor of the expo area is real (GTC this year is what everyone else’s conference aspires to be).

There were a lot of dog-sized robots in varying form factors for varying purposes – pizza delivery, lawn mowing, dancing, etc. I didn’t see applications of those robots that made much sense to me.

Best practice software

  • Zephyr: an excellent current smallish open source LLM assistant; it will probably be surpassed in a month, but it’s worth keeping an eye on the creators and papers noted by HuggingFaceH4, the group that put together Zephyr and others; this is a good place to check for papers worth discussing
  • Deita: open source project for instruction tuning; there is a scoring estimator included that we might want to investigate for quality evaluations at scale
  • Eleuther LM Eval Harness: quantifies the performance of LLMs; backs the HuggingFace Open LLM Leaderboard; you can set your own custom prompts and evaluation metrics

Our talk

The “GPU Power Play” talk with PJ went well. We filled about half the seats in a Salon (maybe on the order of 50 people); people filed into the room as we talked, and once they arrived they didn’t leave. They took lots of pictures of the slides. We had about half a dozen people queue up to talk to us after we finished.

We heard from someone interested in putting security defenses onto the compute hardware and into the storage layer. This seems extremely enticing, though I’m not sure how you get the speed while also getting the security. Perhaps the play is to satisfice on quality by relying on capability improvements over time, and to expect that people are willing enough (or regulated enough) to pay twice over for security? Having not heard the pitch, I’m not sure how you wouldn’t end up paying a direct fee for the hardware/storage, and then also again in reduced specs.

Giving talks

I went to one talk that had an extremely effective title & talk (which I won’t name because although I was impressed, my analysis could be construed as critical). I came away satisfied even though the content turned out halfway through to be marketing fluff. In essence, the speaker framed a ridiculous problem with “wouldn’t it be amazing if it worked” – and then they showed that it doesn’t work, and they got a laugh – and then they showed that it kinda does work, and they got knowing nods – and then they shifted to sales. It was brilliantly executed.

It started out looking like a practically-minded semi-academic presentation, but it gracefully turned into a product demo. The room was standing room only, and the crowd didn’t drop away, so I wasn’t the only one who was drawn in, recognized the bait-and-switch, and still appreciated it. The speaker seemed to have minimal experience with deep learning (e.g., they expressed surprise that keeping the first few layers of a DL model and dropping the other layers works better than keeping only the top few layers and dropping the layers that actually connect to the embeddings). Even without deep experience, though, they turned a technically truly silly idea into great insights for product and for marketing. They’re clearly very good in their role.

What worked well at attracting me in this and other talks:

  • Select a title that suggests a technical capability that a lot of people would love to achieve (without mentioning the solution you’re marketing).
  • Gradually (very gradually) ease into your solution. Prepare a technical quasi-academic talk, with data selection and approach and challenges well motivated. Name-drop your solution about 25-33% through, but then go right back to the “how to” technical talk. In the final third, transition to a product demo that builds on the tangible, potentially-nonsensical-but-wonderful-if-true example you’ve been fleshing out.
  • Refer to many projects and people, and do so authoritatively.
  • Orient the audience to an entire topic that might otherwise be unfamiliar, so that they do benefit technically from the talk. Perhaps the topic is your overall framing, or perhaps it’s a cross-cutting challenge that your problem lets you talk about (performing simulations can fit here), or perhaps it’s particularly useful datasets and tooling – or perhaps it’s multiple topics.
  • Pre-record the screencast of the demo – but talk over the screencast at the event. This gives you flexibility in the talk track, and it keeps the energy in the room under your live control.

LLMs for text-to-nontext

The only “cool” application I happened to see, for which I could immediately see the value unfold like the glory of Jensen Huang before me, was in autonomous vehicle simulation. NVIDIA is using an LLM to generate non-language synthetic data such as autonomous vehicle scenes (they have a programming language that represents the layers in a scene, and distinct AI models for each layer: one can produce maps of intersections, another can add objects like cars to those maps, and so on). With this text-to-scene platform, their users can chat with an LLM to generate synthetic scenes that focus on edge cases that rarely occur in real data, and they can also produce synthetic data from historic accident reports that focus on the tricky situations. Currently about 20% of their data is synthetic. NVIDIA gave a lot of talks about this idea at GTC, so they’re selling it hard – but it’s also pretty interesting. (Much more interesting than a robot with a leaf blower strapped on it, I’d say.)

Other thoughts

Our interest in foundation models for everything – language, images, autonomous vehicles, etc. – is in direct conflict with Rich Hickey’s idea of simple vs. complected systems. I suspect we as an industry may find ourselves tick-tocking back out to interpretability again in the near-to-medium future.

“Digital twin” is part of the marketing zeitgeist/cliche verbiage; we haven’t been using this phrase so often at Palo Alto Networks. I’m not recommending we start.

There were a lot of vendors focused on GPU compute and fully utilizing GPU compute and enabling AI researchers to train highly effective models – but there didn’t seem to be many people actually doing the work of training those low-loss models with massive hyperparameter experimentation. There was a lot of interest in anyone who indicated they might be successfully training interesting/useful models (but no one actually was). The companies with large research wings using a ton of compute that attended (e.g., Google) were represented by their B2B product arms and not their AI researchers.

It seemed like there were two kinds of talks and participants – (1) interesting technical work on very specific bespoke problems whose actual overall necessary-ness is questionable (e.g., optimization of GPU space so you can run huge contexts – which was a nice talk from the US Navy, but it’s not clear to me that anyone actually needs huge LLM contexts – there were a number of this kind of technically-interesting-but-questionable-value talks), (2) basic application work in a variety of domains (e.g., models that try to detect something like floods in imagery, or copilot work, or use of generative AI within a general stack). This division between “highly applied but very basic” and “technically interesting but limited usefulness” is a tension I just cannot shake.

I would be comfortable giving a talk again; I think that was a reasonable trade-off of time for value (especially given material re-use). I wouldn’t attend GTC for more than that same day unless we were connecting deliberately with people across the industry for some purpose and had done a lot of pre-work toward that end.

So overall, it was a useful conference this year to me primarily in terms of clocking AI imposter syndrome, figuring out some new stealth marketing patterns for talks, and getting a feel for where the zeitgeist is pointing in terms of vendors and open source projects.

Sources of non-determinism in LLMs

I’ve had a couple of discussions recently around non-determinism in LLMs. Measuring statistical bounds is really the only way to go with LLMs. But it’s also entirely true that thinking statistically is (a) much more expensive (both up-front in talent and dataset design, and in perpetuity to run the estimator), and (b) unintuitive-bordering-on-uncomfortable for folks who expect determinism from their machines. So I’m not surprised to have the “but why isn’t it deterministic? can we make it deterministic?” discussion over and over again.

Here is what I know about sources of non-determinism in LLMs:

  • Human perception. Humans aren’t great at actually perceiving truly identical inputs. I’ve found that I and others pretty often think two inputs are identical when they aren’t identical at all. In an LLM, any deviation in the input sequence may lead to a different completion. Even slight differences in punctuation, synonym choice, or minimal inserted context can affect the LLM output. Automated diffs help uncover perception mistakes.

  • Model versioning. Developers like versioned APIs. Whenever we hit the same API with the same fixed parameters, we expect the same behavior from the remote system. At time of writing, though, Google’s LLMs don’t support version pinning. For instance, last Tuesday I asked Google’s Bison to generate some “be mean” data, and it was pretty willing to oblige (it wouldn’t be mean to all the social groups I tried, but it spat out stereotypes for a lot of groups). On Wednesday, it politely refused. Ethically, I prefer Wednesday’s model – but entire product capabilities disappearing literally overnight, while the API and user code are exactly the same? That’s massive, unrecoverable non-determinism.

  • Temperature. “Temperature” is a physics metaphor: the higher the temperature, the more “random molecule bouncing around” you get. In an LLM world, temperature affects how random each next token is. Each selected token changes the most likely tokens for the rest of the completion. You can set the temperature to 0 to greedily get the token sequence with highest probability, but temperature=0 still doesn’t guarantee deterministic outputs. You get non-determinism if there are multiple options with the same probability representation (rare, but it occurs). You also get non-determinism if the randomness is introduced upstream of the greedy token selection, such that a different next token becomes most probable, as discussed below.

  • Floating point math. To compress the infinite space of real numbers into the finite space of computer memory, we necessarily lose precision. Floating point numbers can represent both very large and very small numbers – but the floating point numbers are not evenly spaced from each other. We always approximate into the closest representable floating point number, so some number i might have a different amount of discretization error than another number j. The discretization errors can build up when we do addition. As a result, addition in floating point isn’t guaranteed to be associative. In other words, (a+b)+c might produce a different result than a+(b+c) in floating point math. For parallelized operations (and LLMs are massively parallelized), we usually allow addition to occur in an arbitrary order. But the order can affect the final sum. And the final sum affects the most likely token, which affects the rest of the generated string. (A short sketch after this list demonstrates both temperature scaling and this order sensitivity.)

Floating point numbers (in green) are not equally spaced along the number line. (Image source: Wikimedia)

  • Peers in batch. Because the GPT models (and maybe others) appear to be Mixture of Experts (MoE) models, their outputs appear to be deterministic only at the batch level. MoE models contain many “experts”, which are often distributed across many machines. When processing each input token, rather than sending it through a complete and expensive dense network, we identify the “right” experts to handle it. The best way to identify the “right” experts is an area of active research. In current MoE models, each token activates only a sparse subset of experts, which makes it possible to get the quality benefits of significantly larger models without paying the full computational cost at inference. The MoE approach introduces non-determinism because the contents of each batch must be mapped to experts and returned as a batch, and there is a limit on how much data each expert can simultaneously handle. In other words, if your batch has many other inputs that compete with you for the experts you need, you might get a different set of experts than you would in a different batch. This competition for experts can lead to different predicted tokens depending on how your messages are batched for inference, and the effect depends on who else is using the LLM APIs at the same time as you.
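
To make the temperature and floating point points tangible, here is a tiny self-contained sketch with made-up logits (nothing here comes from a real model):

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Temperature scales the logits before normalization; as T -> 0 this
    # approaches greedy argmax, and as T grows it approaches uniform.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Temperature: the same logits, three different "amounts of randomness".
logits = [2.0, 1.0, 0.5]
print(softmax(logits, temperature=1.0))  # moderately peaked
print(softmax(logits, temperature=5.0))  # close to uniform
print(softmax(logits, temperature=0.1))  # nearly all mass on the first token

# Floating point addition is not associative: grouping changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Summing the same contributions in a different order gives a (very slightly)
# different logit. If another token's logit sits between the two sums, the
# "most likely" token flips, and the rest of the completion diverges.
contributions = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
ordered_sum = sum(contributions)
shuffled_sum = sum(random.sample(contributions, len(contributions)))
print(ordered_sum == shuffled_sum)  # usually False; they differ in the last bits
```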

In an LLM, each token is conditioned on all the tokens that preceded it. As a result, once one token diverges, the remainder of the sequence will diverge even more. It’s all back to chaos theory: “when a butterfly flaps its wings in Brazil…”. So yes, if we want to be certain in this new world, we will need experimental statistics. I spend a lot of time asking, how certain do you need to be?

Making ReAct work in practice

I spent some time getting a ReAct LLM (Yao et al. 2022) to autonomously answer questions. Improving performance turned out to require some changes that I haven't seen discussed in my technical haunts. Through failure case analysis and targeted interventions, I was able to drive performance from on the order of 1% to north of 80% over a couple weeks.


ReAct ("Reasoning and Acting") builds on chain-of-thought. It gives the LLM more structure around thinking and acting, which reduces hallucination and error propagation. With ReAct, we use a straightforward LLM prompt that describes some actions that the agent can take to get external information. The prompt also explains that we expect a final structure of repeated Thought / Action / Action Input / Observation blocks. The LLM then constructs a structured text that follows those rules in collaboration with a backend system. The backend system orchestrates and keeps the LLM on track. In particular, the backend parses the LLM outputs, reaches out to the Action systems to produce the Observations, and writes the prompt for the LLM.

I'll give a concrete example (text taken from the langchain docs). In the example, the user supplies the Question, the coordinating backend system supplies the Observation lines, and the LLM generates the Thought, Action, Action Input, and Final Answer text:

Question: Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?
Thought: I need to find out who Leo DiCaprio's girlfriend is and then calculate her age raised to the 0.43 power.
Action: Search
Action Input: "Leo DiCaprio girlfriend"
Observation: Camila Morrone
Thought: I need to find out Camila Morrone's age
Action: Search
Action Input: "Camila Morrone age"
Observation: 25 years
Thought: I need to calculate 25 raised to the 0.43 power
Action: Calculator
Action Input: 25^0.43
Observation: Answer: 3.991298452658078
Thought: I now know the final answer
Final Answer: Camila Morrone is Leo DiCaprio's girlfriend and her current age raised to the 0.43 power is 3.991298452658078.

Each time control returns to the backend, the backend parses what the LLM produced. It throws away everything that isn't immediately relevant (like any oh-so-eagerly-hallucinated Observations). Then the backend constructs a new, slightly longer prompt with a real Observation, and prompts the LLM to complete it a bit further. Eventually the LLM produces a Final Answer, which the backend parses and returns to the user.
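
To make the control flow concrete, here's a stripped-down sketch of that backend loop. This is not langchain's implementation: `run_react`, `call_llm`, and `tools` are stand-ins I've made up for whatever completion API and action implementations you have, the real prompt would also describe the available actions and the expected format, and the parsing is deliberately naive.

```python
import re

ACTION_RE = re.compile(r"Action: (.*)\nAction Input: (.*)")
FINAL_RE = re.compile(r"Final Answer: (.*)", re.DOTALL)

def run_react(question, call_llm, tools, max_steps=8):
    """Drive a ReAct loop.

    call_llm: a callable that takes a prompt string and returns the LLM's completion.
    tools:    a dict mapping Action names to plain Python callables.
    """
    prompt = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        completion = call_llm(prompt)

        match = ACTION_RE.search(completion)
        if match:
            # Keep only the text through the first Action Input; anything after
            # it (like an eagerly hallucinated Observation) gets thrown away.
            action = match.group(1).strip()
            action_input = match.group(2).strip()
            observation = tools[action](action_input)
            prompt += f"{completion[:match.end()]}\nObservation: {observation}\nThought:"
            continue

        final = FINAL_RE.search(completion)
        if final:
            return final.group(1).strip()

        raise ValueError(f"Could not parse LLM output: {completion!r}")

    return "Stopped: no Final Answer within the step budget."
```

Given a real `call_llm` and a couple of tools, the prompt this loop accumulates looks very much like the transcript above.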

So, that's ReAct. It didn't work very well for me out-of-the-box. The langchain implementation plus a dozen possible actions produced sub-5% performance on questions it really should have been able to answer.

So, I embarked on a ReAct performance improvement quest. What worked for me, in order from least surprising to most surprising to me, was:

  1. Prompt engineering the action descriptions to improve dispatch. First-pass docstrings often are flawed. We all have this problem — the docstring writer has high context but the reader does not, and what is salient to the writer isn't always salient to the reader. So, I made sure each available action was described in one sentence starting with a verb. Then I re-scoped each "action" by how clearly I could write a user-facing description for the idea (rather than by how APIs are broken apart).
  2. Making parameterization errors visible to enable recovery. I found the LLM often chose a poor Action Input, which caused the backend to receive exceptions when it tried to execute the LLM's instructions. I wanted to make the "bad parameterization" problem more tangible to the LLM. I took this on in three ways: (1) with a priori clues — I described the format of the input in the description (e.g., is it an integer, enum, string, ...), (2) with data typing (e.g., does the UUID that the LLM wants to pass refer to an object of the appropriate type for this action), (3) with a posteriori clues — I updated the backend to include meaningful error messages as the Observation for bad parameterizations, so the LLM would get another crack at it.
  3. Removing all "early stop" actions to encourage actually trying. LangChain allows you to define "early stop" actions. If the LLM picks one of these actions, the entire reasoning chain gets aborted. I found that the LLM would often pick an early stop action as its very first action. So, I strongly encouraged it to always try to answer by removing all early stop actions. It is still possible for the system to go immediately from the Question into a Final Answer of "I have no idea what to do to answer this". But in practice the system is actually trying now, when it often wasn't before.
  4. Dropping old Thoughts so it can't confuse its guesses for facts. I was semi-frequently seeing the LLM hypothesize a wild idea, decide on an appropriate action and input to test that wild idea, receive the right answer, and then write a Final Answer that treated the wild idea as if it were a fact. This behavior is understandable, and it is also very bad. So, now the LLM doesn't get to see its previous Thoughts.
  5. Actively seeding Thoughts to recover from unparseable LLM responses. Sometimes the LLM thinks the most likely next token sequence is nothing, and we get an empty string as the completion. Sometimes it doesn't generate text in the required format. Sometimes it goes off the rails in some other way. All of these break parsing. In the LangChain case, the backend has no effective way to get the whole agent back on track again, so it raises an exception and exits. Ideally, though, it would be able to recover. So, when the LLM gives garbage responses, I've started putting Thoughts in its head. Extending the system prompt slightly past the colon with an innocuous phrase — to something like Thought: I need to decide on an Action and Action Input — is enough to break the LLM free. This "seeding its thoughts" approach works well even when the temperature is turned down to 0.0 (with maximal determinism during text generation), because the approach ensures a slightly different prompt from the one that failed. (A small sketch of this recovery follows the list.)
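
For that last item, here's roughly what the recovery looks like in code. Again a sketch under assumptions: `call_llm` stands in for whatever completion call the backend makes, the prompt is assumed to end with "Thought:" as in the loop sketch earlier, and the parseability check is deliberately minimal.

```python
import re

THOUGHT_SEED = " I need to decide on an Action and Action Input"
PARSEABLE_RE = re.compile(r"Action: .*\nAction Input: .*|Final Answer: .*")

def complete_with_recovery(prompt, call_llm, max_retries=2):
    """Ask the LLM to continue the prompt; seed a Thought and retry on garbage."""
    for _ in range(max_retries + 1):
        completion = call_llm(prompt)
        if PARSEABLE_RE.search(completion):
            return prompt, completion
        # Empty string, wrong format, or otherwise off the rails: extend the
        # prompt just past the "Thought:" colon with an innocuous phrase and ask
        # again. Even at temperature 0.0 this can break the LLM free, because
        # the prompt now differs from the one that failed.
        prompt += THOUGHT_SEED
    raise ValueError("The LLM never produced a parseable completion")
```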

With these changes, I was able to raise the performance from <5% to on the order of 70-90% pretty quickly.

I suppose in addition to "keep thinking; a simpler solution will come", the wider lesson that was reinforced for me here is that popular ideas can easily suck up more than their share of oxygen in the public conversation (even good ideas like prompt engineering!). "Don't stop with what's popular; make sure everyone is looking at the real behaviors" is my takeaway for myself.