Demystifying DSPy: The Framework Redefining Your LLM Workflow
DSPy is a Python library that has seen an exponential uptick in interest lately. This article is my informed take on whether that’s justified.
The Why
Say you’re building an app that generates cover letters given a user’s resume and a job description. You go with GPT-3.5 as your LLM of choice. You start with a simple prompt, but the cover letters it generates are too verbose.
Instinctively, you adjust the prompt. You’ve been influenced by that “AI guy” on LinkedIn, so you try a suite of prompt hacks: “Take a deep breath before you begin”, “Count to 25 silently and then proceed”, etc. But you’ve also read some papers and kept up with AI news, so you throw in some tested templates like “Chain of Thought”.
Somehow you make it work. After hours of manual prompt tweaking and evaluation, you end up with cover letters that align closely with what you were expecting. Job done, off to deployment.
But wait. Claude 3 Haiku gets released with better performance and half the inference cost of GPT-3.5! That’s enough money saved to buy a ticket to <upcoming hot AI conference>! Swapping out the model is easy enough, but you notice Haiku performs poorly on your evaluation dataset.
Epiphany strikes: LLMs are very prompt sensitive. You’d have to essentially start from scratch to find the optimal prompt for the Haiku model.
And this is merely a vanilla example of an LLM workflow. Realistically, you could expect many more tasks in the pipeline, like prompt rewriting, retrieval, reranking, etc.
Many more LLMs, many more prompts, many more ways for things to go sideways.
The What
DSPy exists to solve this problem.
Feed DSPy a few examples of expected input/output for your pipeline, and its compiler can automatically optimize your prompt against an evaluation metric of your choosing (more on metrics soon).
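To make that concrete, here is a minimal sketch of what “feeding a few examples” can look like. The field names (resume, job_description, cover_letter) are my own illustrative choices for the cover-letter scenario, not anything DSPy prescribes.

```python
import dspy

# Hypothetical training examples for the cover-letter pipeline.
# The field names are chosen for this example; you pick them when
# you define your own pipeline.
trainset = [
    dspy.Example(
        resume="Data analyst, 3 years of SQL and Python experience...",
        job_description="We are hiring a junior data scientist...",
        cover_letter="Dear Hiring Manager, ...",
    ).with_inputs("resume", "job_description"),
    # ...a few more examples like this
]
```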
This naturally raises the question of how the optimization is actually done, so let’s dig deeper.
The How
To understand what the DSPy compiler does, we first need to understand the components of a prompt.
Optimizing a prompt involves modifying one or more of these components:
Prompting techniques: these are prompting strategies like “Chain-of-Thought” and “ReAct” that define the reasoning steps you want the Language Model (LM) to emulate.
Instructions: The “how” of a prompt, i.e. a string that guides the LM to produce the desired output for a given input.
Demonstrations: Input/output examples to guide the LM.
I encourage you to look at a full example of a prompt with all these components by scrolling down to the bottom of this notebook.
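To give a rough feel for how these components map onto DSPy (glossing over the signature and module abstractions I return to at the end), here is a minimal sketch; the exact field syntax can vary slightly between DSPy versions. The signature’s docstring carries the instructions, the ChainOfThought module applies the prompting technique, and demonstrations are examples like the trainset above that get attached to the prompt.

```python
import dspy

class GenerateCoverLetter(dspy.Signature):
    """Write a concise, professional cover letter tailored to the job description."""
    # ^ the docstring acts as the instruction component of the prompt
    resume = dspy.InputField()
    job_description = dspy.InputField()
    cover_letter = dspy.OutputField()

# Prompting technique: Chain-of-Thought asks the LM to reason step by step
# before producing the final cover_letter field.
generate = dspy.ChainOfThought(GenerateCoverLetter)
```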
One last thing to address before DSPy’s optimizers: metrics. How can an optimizer judge whether a prompt (or any of its components) is good or bad? You define a metric. For example, if the instruction for the LM is to answer math questions, you can create a metric that checks whether the LM’s answer matches the correct answer exactly. Note that the output of a metric function can be a continuous score, a discrete rating, a boolean, etc. View the docs here to see more examples of metrics.
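As a minimal sketch of that math-question metric (a DSPy metric receives an example, a prediction, and an optional trace; the `answer` field name is an assumption about how the signature is defined):

```python
def exact_match_metric(example, prediction, trace=None):
    # True if the LM's answer matches the gold answer exactly;
    # a metric could just as well return a float score or a rating.
    return example.answer.strip() == prediction.answer.strip()
```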
Now that we know what prompt components to tweak and what metric(s) to set up, let’s talk about some of DSPy’s optimizers and how they work.
Bootstrap Few-Shot optimizer: Uses a teacher LM to select the best demonstrations to include in the prompt from a larger set of demonstrations provided by the user. The teacher LM can be the original LM itself or a more powerful model.
COPRO optimizer: Finds the best-performing instruction for the model. It starts with a set of initial instructions, generates variations of them, evaluates each variation against your metric, and returns the one that performs best.
MIPRO optimizer: Finds the best-performing combination of instructions and demonstrations. It works similarly to COPRO, but searches over demonstrations as well as instructions.
In many of these optimizers, the temperature of the instruction- or demonstration-generating model is set high to produce more creative candidates, which are later filtered down based on their performance on the metric. A minimal sketch of compiling with the Bootstrap Few-Shot optimizer follows below.
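Here is that sketch: a Bootstrap Few-Shot compile of the cover-letter program from earlier, using a hypothetical “is it concise enough?” metric. The LM constructor differs across DSPy versions (older releases expose per-provider classes like dspy.OpenAI, newer ones a generic dspy.LM), so adjust to whatever your installed version provides.

```python
from dspy.teleprompt import BootstrapFewShot

# Point DSPy at an LM (constructor name varies by DSPy version).
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

def concise_enough(example, prediction, trace=None):
    # Hypothetical metric: the generated letter should stay under 250 words.
    return len(prediction.cover_letter.split()) <= 250

# Bootstrap Few-Shot: a teacher LM (here, the same model) generates candidate
# demonstrations from the trainset and keeps the ones that pass the metric.
optimizer = BootstrapFewShot(metric=concise_enough, max_bootstrapped_demos=4)
compiled_generate = optimizer.compile(generate, trainset=trainset)

# The compiled program now carries the selected demonstrations in its prompt.
prediction = compiled_generate(resume="...", job_description="...")
print(prediction.cover_letter)
```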
The Verdict
Here’s what I love about DSPy:
It’s not as expensive as it may seem: a run typically costs $3-10 for 100 examples considered by the optimizer, so you can do a lot of experimenting cheaply with DSPy.
Super easy migration: If you want to switch your LM of choice, for example, all you need to do is recompile to get a newly optimized prompt (see the sketch after this list).
Integrations: As a framework focused on optimization, DSPy sits comfortably alongside the rest of the LLM tooling ecosystem. This allows you to integrate components from other libraries, like LangChain for chain optimization and Phoenix for evaluations, and take your LM pipeline even further!
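To illustrate the migration point from the list above: switching to Haiku is roughly a matter of reconfiguring the LM and recompiling. The dspy.LM wrapper and the model string here are assumptions tied to newer DSPy releases (older ones use per-provider classes), so treat this as a sketch.

```python
# Swap the LM and recompile; the program, trainset, and metric are unchanged.
haiku = dspy.LM("anthropic/claude-3-haiku-20240307")
dspy.settings.configure(lm=haiku)

compiled_for_haiku = optimizer.compile(generate, trainset=trainset)
```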
That being said, here are some aspects I believe everyone should be cautious about when using DSPy:
Abstractions always come at a cost: If the domain of your LM app is general, like generating cover letters, then DSPy can work really well, since the model likely knows a lot about the domain and can generate good instructions and examples. Unsurprisingly, this will not be the case if your domain is very specialized. In that case, you would likely be a far superior prompt engineer yourself.
Still early days: DSPy is new and the codebase is changing quickly, which is important to be aware of if you plan to use it in production code.
Overall, there is a lot of promise here and I’ve already begun migrating my previous prompt logic and templates to DSPy code. Stay tuned for future articles where I walk through full DSPy code examples in more depth.
Topics I Didn’t Touch Upon
In the interest of describing the core characteristics of DSPy, I did not elaborate on the following topics:
Fine-tuning: In DSPy you can also optimize the fine-tuning data fed to your LM. Essentially, it uses a teacher model and the Bootstrap Few-Shot optimizer (defined above) to generate a set of demonstrations that serve as additional training data to fine-tune the student model.
Assertions: DSPy has functions you can call to introduce constraints on LM output, such as requiring the output to follow a specific format like JSON. View more details here.
DSPy abstractions like signatures, modules, programs, traces, etc.: They tend to distract newcomers from the core selling point of DSPy, which is pipeline optimization. To learn more about these, I’d encourage reading the original paper. There are several blogs that explain them too, and you can refer to this data dictionary I made as you explore.
Noteworthy code examples:
DSPy PII Masking Demo by Eric Ness
Writing blog posts with DSPy by Conner Shorten
Thanks for reading! Feel free to contact or connect with me on LinkedIn. Also, I’m actively looking for full-time roles in AI/ML, so do reach out if you are hiring or know someone who is :)