Eugene Yan - Building Blocks for LLM Systems & Products
What are some building blocks for integrating LLMs into production systems and customer-facing products? In this talk, we'll discuss evals, RAG, guardrails, and collecting feedback.
Thank you, everyone. I'm Eugene Yan, and today I want to share with you some building blocks for LLM systems and products. Like many of you here, I'm trying to figure out how to effectively use these LLMs in production. So a few months ago, to clarify my thinking, I wrote up some patterns for building LLM systems and products, and the community seemed to like it. There's Jason asking for this to be a seminar, so here you go, Jason. Today I'm going to focus on four of those patterns: evaluations, retrieval-augmented generation, guardrails, and collecting feedback. All the slides will be made available after this talk, so I ask you to just focus. Buckle up and hang on tight, because we'll be going really fast.
All right, let's start with evals, or what I really consider the foundation of it all. Why do we need evals? Well, evals help us understand whether our prompt engineering, our retrieval augmentation, or our fine-tuning is doing anything at all. Consider eval-driven development, where evals guide how you build your system and product. We can also think of evals as test cases, where we run these evals before deploying any new changes. It makes us feel safe. And finally, if managers at OpenAI take the time to write evals or give feedback on them, you know it's pretty important.
But building evals is hard. Here are some things I've seen folks trip up on. Firstly, we don't have a consistent approach to evals. If you think about more conventional machine learning: regression has root mean squared error; classification has precision and recall; even ranking has NDCG. All these metrics are pretty straightforward, and there's usually only one way to compute them. But what about for LLMs? Well, we have benchmarks where we write a prompt containing a multiple-choice question and evaluate the model's ability to get it right. MMLU is a widely used example that assesses LLMs on knowledge and reasoning ability: computer science questions, math, US history, et cetera. But there's no consistent way to run MMLU. Less than a week ago, Arvind and Sayash from Princeton wrote that evaluating LLMs is a minefield. They ask: are we assessing the LLM, or are we assessing prompt sensitivity, our ability to write a prompt that gets the LLM to give us what we want? On the same day, Anthropic noted that the simple multiple-choice question may not be as simple as it seems. Simple formatting changes, such as using different parentheses, lead to different accuracy numbers, and there's no consistent way to do this. As a result, it's really difficult to compare models based on these academic benchmarks.
Now, speaking of academic benchmarks, we may have outgrown some of them. Take the task of summarization. On the top you see the human evaluation scores on the reference summaries, and on the bottom you see the evaluation scores for the automated summaries. You don't have to go through all the numbers, but the point is that every number on the bottom is already higher than the numbers on top. Here's another, more recent one on the XSum (extreme summarization) dataset, where all the human evaluation scores are lower than InstructGPT's, and that's not even GPT-4. Finally, with all these benchmarks so easily available, we sometimes forget to ask ourselves: is it a fit for our task? If you think about it, does MMLU really apply to your task? Maybe, if you're building a college-level chatbot, right?
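As a concrete illustration of that prompt-sensitivity point, here is a minimal sketch of the kind of multiple-choice harness these benchmarks rely on, with the answer formatting pulled out as a parameter so you can measure how a small change (plain letters versus parenthesized letters) shifts the score. The ask_model() function is a hypothetical stand-in for whatever model call you use, not anything from the talk.

import string

def format_question(question, choices, style="letters"):
    # Two renderings of the same choices; small formatting differences
    # like this are what shift measured accuracy on MCQ benchmarks.
    lines = [question]
    for i, choice in enumerate(choices):
        label = string.ascii_uppercase[i]
        if style == "parens":
            lines.append(f"({label}) {choice}")
        else:
            lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def run_mcq_eval(dataset, ask_model, style="letters"):
    # dataset: list of dicts with "question", "choices", and "answer" (e.g. "B").
    correct = 0
    for item in dataset:
        prompt = format_question(item["question"], item["choices"], style)
        reply = ask_model(prompt).strip()
        # Naive scoring: check that the reply starts with the expected letter.
        if reply[:1].upper() == item["answer"]:
            correct += 1
    return correct / len(dataset)

Running the same eval with style="letters" and style="parens" is one way to see how sensitive a model, and your harness, is to prompt formatting.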
But here's Linus reminding us that we should be measuring our apps on our own tasks, and not just relying on academic evals.
So how do we do evals? Well, I think as an industry we're still figuring it out. Evals come up as the number one challenge out there, and with so many people talking about them, I think some tenets are emerging. Firstly, we should build evals for our specific task, and it's okay to start small. It may seem daunting, but it's okay to start small. How small? Well, here's Teknium, who releases a lot of open-source models: he starts with an eval set of 40 questions for his domain-expert task. Forty evals, that's all it takes, and it can go very far.
Second, we should try to simplify the task as much as we can. While LLMs are very flexible, I think we have a better chance if we make the task more specific. For example, if we're using an LLM for a content moderation task, we can fall back on simple precision and recall: how often is it catching toxicity? How often is it catching bias? How often is it catching hallucination? Next, if it's something broader like writing SQL or extracting JSON, you can run the SQL and see if it returns the expected result, which is very deterministic. Or you can parse the extracted JSON and check whether the keys and values match what you expect. These are still fairly easy to evaluate because we have expected answers. But if your task is more open-ended, such as dialogue, you may have to rely on a strong LLM to evaluate the output. However, this can be really expensive. Here's Jerry saying that running 60 evals with GPT-4 cost him a lot.
Finally, even if you have automated evals, I don't think we should discount the value of eyeballing the output. Here's Jonathan from Mosaic: "I don't believe that any of these evals capture what we care about." They had a prompt to generate games for a three-year-old and a seven-year-old, and it was more effective for them to just eyeball the output as the model trained through the epochs. Okay, that's it for evals.
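Before moving on, here is a minimal sketch of the deterministic check described above for a JSON-extraction task: parse the model output and compare keys and values against an expected answer. The extract_json() callable is a hypothetical stand-in for your own LLM call.

import json

def eval_json_extraction(test_cases, extract_json):
    # test_cases: list of (input_text, expected_dict) pairs.
    # extract_json: hypothetical function that calls the LLM and returns its raw string output.
    passed = 0
    for input_text, expected in test_cases:
        try:
            output = json.loads(extract_json(input_text))
        except (json.JSONDecodeError, TypeError):
            continue  # unparseable output counts as a failure
        if not isinstance(output, dict):
            continue
        # Check that every expected key is present with the expected value.
        if all(output.get(key) == value for key, value in expected.items()):
            passed += 1
    return passed / len(test_cases)

# Usage sketch, with hand-written cases and your own extractor:
# cases = [("Invoice #123 for $45 from Acme", {"invoice_id": "123", "amount": 45})]
# print(eval_json_extraction(cases, extract_json=my_llm_extractor))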
Now, retrieval-augmented generation. I don't think I have to convince you all here why we need it: it lets us add knowledge to our model as input context, so we don't have to rely solely on the model's internal knowledge. And second, it's far more practical; it's cheaper and more precise than continuously fine-tuning on new knowledge. But retrieving the right documents is really hard. Nonetheless, we have great speakers, Jerry and Anton, sharing about this topic tomorrow, so I won't go into the challenges of retrieval here. Instead, I'd like to focus on the LLM side of things and discuss some of the challenges that remain even if we have retrieval-augmented generation.
The first is that LLMs can't really see all the documents you retrieve. Here's an interesting experiment. The task is retrieval-augmented question answering, using historical Google queries and hand-annotated answers from Wikipedia. As part of the context, they provide 20 documents, each at most 100 tokens long, so 2,000 tokens maximum. One of these documents contains the answer, and the rest are simply distractors. The question they asked was this: how does the position of the document containing the answer affect question answering? Now, some of you may have seen this before; don't spoil it for the rest. If the answer is in the first retrieved document, accuracy is the highest. If it's in the last, accuracy is decent. But if it's somewhere in the middle, accuracy is actually worse than having no retrieval augmentation at all. So what does this mean? It means that even as context windows grow, we shouldn't allow our retrieval to get worse. Getting the most relevant documents to rank highly still matters, regardless of how big the context size is. And also, even if the answer is in the context and in the top position, accuracy is only 75%. So even with perfect retrieval, you can still expect some mistakes.
Now, another gotcha is that LLMs can't really tell if the retrieved context is irrelevant. Here's a simple example. Here are 20 top sci-fi movies, and you can think of these as movies that I like. And I asked the LLM if I would like Twilight. For folks not familiar with Twilight, it's romantic fantasy: girl, vampire, werewolf, something like that. I've never watched it before. But I gave a really important instruction: if it doesn't think I would like Twilight, given that I've watched all these sci-fi movies, it should reply with "not applicable." And this is pretty important in recommendations; we don't want to make bad recommendations. So here's what happened. First, it notes that Twilight is a different genre and not quite sci-fi, which is fantastic, right? But then it suggests I might like it anyway because of E.T. and its interspecies relationships. I'm not sure how I feel about that. I mean, how would you feel if you got this as a movie recommendation? The point is, these LLMs are fine-tuned to be helpful, and they're smart and try their best to give an answer, but sometimes it's really hard to get them to say that something is not relevant, especially for something fuzzy like this.
So how do we best address these limitations in RAG? Well, I think there are a lot of great ideas in the field of information retrieval. Search and recommendations have long been trying to figure out how to show the most relevant documents on top, and there's a lot we can learn from them. Second, since LLMs may not know that a retrieved document is irrelevant, I think it helps to include a threshold to exclude irrelevant documents. So in the Twilight and sci-fi movie example, I bet we could do something like measuring the distance between the two items, and if it's too far, we don't go to the next step.
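Here is a minimal sketch of that kind of relevance threshold, assuming you already have an embedding function (embed() here is a hypothetical stand-in) and using cosine similarity: if no candidate clears the threshold, skip the LLM step and return "not applicable" instead of passing irrelevant context along.

import numpy as np

def cosine_similarity(a, b):
    # Assumes non-zero vectors, which is typical for text embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_relevant(query, candidates, embed, threshold=0.7):
    # embed: hypothetical function mapping text to a vector.
    # threshold is illustrative; tune it on your own data.
    query_vec = embed(query)
    kept = []
    for text in candidates:
        score = cosine_similarity(query_vec, embed(text))
        if score >= threshold:
            kept.append((score, text))
    # Return the surviving candidates, most similar first.
    return [text for score, text in sorted(kept, reverse=True)]

# Usage sketch:
# docs = filter_relevant(user_question, retrieved_docs, embed=my_embedder)
# if not docs: answer = "not applicable"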
Next, guardrails. Guardrails are really important in production; we want to make sure what we deploy is safe. What's safe? We can look at OpenAI's moderation API: hate, harassment, self-harm, all that good stuff. But another thing I think about a lot is guardrails on factual consistency, or what we call hallucinations. I think this is really important so that you don't have trust-busting experiences. You can also think of these as evals for hallucination. Fortunately, or unfortunately, the field of summarization has been trying to tackle this for a very long time, and we can take a leaf from their playbook.
One approach is via the natural language inference (NLI) task. In a nutshell, given a premise and a hypothesis, we classify whether the hypothesis is true or false. So given the premise "John likes all fruits": the hypothesis "John likes apples" is true, so that's entailment. Because there isn't enough information to confirm whether John eats apples daily, that hypothesis is neutral. And finally, "John dislikes apples" is clearly false, therefore contradiction. Do you see how we can apply this to document summarization? The premise is the document and the hypothesis is the summary, and it just works.
Now, when doing this, it helps to apply it at the sentence level instead of the entire document level. In this example, the last sentence of the summary is incorrect. If we run the NLI task on the entire document and summary, it's going to say the whole summary is correct. But if we run it at the sentence level, it's able to tell us that the last sentence of the summary is incorrect. And they include a really nice ablation study where they vary the granularity: as it gets finer and finer, from document to paragraph to sentence, the accuracy of detecting factual inconsistency goes up. That's pretty amazing.
Another approach is sampling, and here's an example from SelfCheckGPT. Given an input document, we generate a summary multiple times, and then we check whether those summaries are similar to each other, via n-gram overlap, BERTScore, et cetera. The assumption is that if the summaries are very different, they're probably not grounded in the context document and are likely hallucinating. But if they're quite similar, we can assume they're grounded and therefore factual. And the final approach is asking a strong LLM. Conceptually it's simple: given an input document and a summary, we get the LLM to return a score. And this LLM has to be pretty strong and...
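Here is a minimal sketch of that sentence-level check, assuming a hypothetical nli_entailment_score(premise, hypothesis) that wraps whatever NLI model you use and returns the probability of entailment. The idea is simply to score each summary sentence against the source document and flag the ones that fall below a threshold.

def check_factual_consistency(document, summary_sentences, nli_entailment_score, threshold=0.5):
    # nli_entailment_score: hypothetical callable wrapping an NLI model,
    # returning P(entailment) for (premise=document, hypothesis=sentence).
    # Scoring per sentence, rather than the whole summary at once,
    # is what lets this catch a single incorrect sentence.
    flagged = []
    for sentence in summary_sentences:
        score = nli_entailment_score(document, sentence)
        if score < threshold:
            flagged.append((sentence, score))
    return flagged  # sentences that look unsupported by the document

# Usage sketch:
# issues = check_factual_consistency(doc, split_into_sentences(summary), nli_entailment_score=my_nli)
# if issues: block or regenerate the summary before showing it to users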