LLM Optimization Techniques
There are two types of challenges that can hinder the performance of LLMs. The first is a knowledge/context limitation: pre-trained LLMs are limited to the knowledge they were pre-trained on, so they tend to hallucinate responses (being confidently wrong, providing false information) whenever the answer to the user query does not exist in their pre-training knowledge. The second is reliability: pre-trained LLMs may exhibit variability in their outputs due to their inherently random nature. Here, we will discuss, at a high level, three common techniques to mitigate these challenges and improve LLM performance.
Prompt Engineering
People often have the misconception that prompt engineering is just trying out random prompts and eyeballing the results for a one-time job or task. Real prompt engineering involves creating prompts that will be used repeatedly, at high frequency and volume, as part of an LLM system, with an engineered approach: define the problem/spec, build, version, test prompts and outputs (speed, accuracy, toxicity, etc.), find edge cases, monitor at runtime, and iterate. Moreover, it can serve as your baseline performance for system evaluations before adding more complex patterns.
Prompt engineering is the process of searching through program space to find the program that empirically seems to perform best on your target task. – François Chollet
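To illustrate the engineered approach described above, here is a minimal sketch of a prompt treated as a versioned artifact with a small regression test. The version tag, template, classification task, model name, and test cases are all hypothetical, and the OpenAI Python SDK is assumed only as an example client; the same pattern applies to any chat-completion API.

```python
# Minimal sketch: a versioned prompt template plus a tiny regression test.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_VERSION = "ticket-classifier-v3"  # illustrative version tag
PROMPT_TEMPLATE = (
    "Classify the customer message into one of: billing, bug, feature_request.\n"
    "Reply with the label only.\n\nMessage: {message}"
)

# Known cases that every new prompt version must still handle correctly.
TEST_CASES = [
    {"message": "I was charged twice this month.", "expected": "billing"},
    {"message": "The app crashes when I upload a photo.", "expected": "bug"},
]

def run_prompt(message: str) -> str:
    """Send the templated prompt to the model and normalize the output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(message=message)}],
    )
    return response.choices[0].message.content.strip().lower()

def evaluate_prompt() -> float:
    """Check the current prompt version against the known test cases."""
    correct = sum(run_prompt(case["message"]) == case["expected"] for case in TEST_CASES)
    accuracy = correct / len(TEST_CASES)
    print(f"{PROMPT_VERSION}: accuracy={accuracy:.2f}")
    return accuracy
```

Running `evaluate_prompt()` after every prompt change gives a simple, repeatable signal for the build-version-test-iterate loop, and the same harness can later track latency, cost, or toxicity.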
One way we can use prompt engineering to address the context challenge is to augment the prompt with knowledge beyond the pre-training data (e.g. recent data, confidential/internal data, domain-specific data, etc.). However, this becomes a problem when we have a large knowledge base to augment, since LLMs have limited context length. Maxing out the input context with augmented knowledge is not a good idea either: LLMs tend to get lost in the middle of long contexts, and the number of input tokens balloons. Furthermore, LLMs can still hallucinate if we augment context that is unrelated to the user query. Therefore, it is crucial to augment only knowledge that is relevant to answering the user query.
To address reliability concerns, we can use a prompt engineering technique called few-shot learning: we provide input-output examples before the user prompt so the LLM can follow the pattern of the given examples (see the sketch below). Nevertheless, this can become problematic as we increase the number of examples, given the context length limitation and the growth in the number of tokens.
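Below is a minimal sketch of few-shot prompting using the OpenAI Python SDK; the model name and the sentiment-labeling task are illustrative assumptions, and any chat-completion API with a message list works the same way. The input-output examples are placed in the message history before the actual user query.

```python
# Minimal few-shot prompting sketch: provide labeled examples before the real query.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    # Input-output examples the model should imitate.
    {"role": "user", "content": "Review: The battery lasts all day, love it."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Stopped working after a week."},
    {"role": "assistant", "content": "negative"},
    # The actual user query comes last.
    {"role": "user", "content": "Review: Setup was painless and support was great."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=few_shot_messages,
)
print(response.choices[0].message.content)  # expected output: "positive"
```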
Prompt engineering is indeed a good start when building LLM systems. However, given the challenges mentioned above, we may need other approaches to overcome them and optimize our system’s performance. The following techniques present solutions to these problems, namely RAG for context limitation and Fine-tuning for reliability concerns.
RAG
The main difference between RAG and plain prompt engineering is the retrieval step: a way to retrieve only the content that is relevant to the user query. This way, we can retrieve knowledge that is both relevant and within the context length. The retrieval step helps achieve the goals of obtaining knowledge beyond the pre-training data and reducing hallucination (false information), but in a more scalable and robust way. Common use cases for RAG are question answering or summarization over large amounts of unstructured data (documents) or structured data (databases).
We can think of the retrieval step as a tool that LLMs can utilize (via function calling). The retrieval method depends on the data type of the knowledge source. For example, given a large number of documents (unstructured data), we can use embedding models to retrieve the most relevant text chunks from a vector store and augment the user query with them (a sketch follows below). Another option is to retrieve from the internet (e.g. Google search, Wikipedia, etc.) by sending API requests. For retrieval over structured data, we can use a text-to-SQL model to generate a query given the user prompt and database information, then execute the query and use the result as the retrieved context.
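Here is a minimal sketch of embedding-based retrieval followed by augmented generation, using the OpenAI Python SDK and numpy for cosine similarity. The model names, the in-memory "vector store" (a plain list), and the example chunks are illustrative assumptions; a production system would use a real vector database and a proper chunking pipeline.

```python
# Minimal RAG sketch: embed chunks, retrieve the most similar ones, augment the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 for enterprise customers.",
    "The warehouse in Austin ships orders within two business days.",
]

def embed(texts):
    """Embed a list of texts; the embedding model name is an illustrative assumption."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

chunk_vectors = embed(chunks)  # toy in-memory "vector store"

def retrieve(query, k=2):
    """Return the k chunks most similar to the query by cosine similarity."""
    query_vector = embed([query])[0]
    scores = chunk_vectors @ query_vector / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

query = "How long do I have to return an item?"
context = "\n".join(retrieve(query))

answer = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```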
Note: The retrieval step (using only embedding models) can also be used for other use cases such as semantic search or recommendations, without the augmented generation part (i.e. without an LLM).
Fine-Tuning
Given the randomness of LLMs, prompt engineering may not be enough if you need reliable outputs. Fine-tuning allows LLMs to get better at specific tasks and produce more reliable outputs. It involves providing example inputs and outputs of your specific task so the LLM can learn from them and perform accordingly. Unlike few-shot learning, where the number of examples is limited by the LLM's context length, fine-tuning lets us train on a practically unlimited number of examples. In addition, it can reduce cost and latency through lower token usage (fewer in-context examples) and, potentially, a smaller fine-tuned model. Common use cases for fine-tuning include responding in a specific style or structure, following complex instructions, performing a specific task (e.g. code interpretation), or handling edge cases.
We can think of fine-tuning as continuing the model's training process on our own dataset. First, we create a training dataset containing examples of how to perform the desired task, and optionally a validation dataset for evaluation to help detect overfitting. If using a fine-tuning API, we may also need to validate that the datasets adhere to the expected format. Once the datasets are ready, we start the fine-tuning process to train the pre-trained LLM on the training examples, experimenting with hyperparameters (batch size, learning rate, number of epochs) as needed. Finally, we can use the fine-tuned LLM for inference; it is best practice to evaluate the fine-tuned model before serving it (a sketch of this workflow follows below).
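As a rough illustration of this workflow, here is a sketch using the OpenAI fine-tuning API: write chat-formatted examples to a JSONL file, upload it, and launch a fine-tuning job. The base model name, file name, example content, and hyperparameter values are illustrative assumptions; other providers and open-source stacks follow a similar prepare-train-evaluate loop.

```python
# Minimal fine-tuning sketch: prepare a JSONL dataset, upload it, start a job.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Prepare training examples in the chat format expected by the API.
examples = [
    {"messages": [
        {"role": "system", "content": "Answer in a formal, concise support tone."},
        {"role": "user", "content": "my order hasn't arrived yet"},
        {"role": "assistant", "content": "We apologize for the delay. Could you share your order number so we can investigate?"},
    ]},
    # ... in practice, start small with a high-quality set of examples
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# 2. Upload the dataset and launch the fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",   # illustrative base model
    hyperparameters={"n_epochs": 3},  # illustrative setting
)
print(job.id)

# 3. Once the job finishes, evaluate the resulting model before serving it, e.g.:
# client.chat.completions.create(model=job.fine_tuned_model, messages=[...])
```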
Note:
- Try to maximize your efforts with prompt engineering (few-shot learning), prompt chaining, or function calling before moving to fine-tuning.
- Start small and focus on quality when preparing your training datasets for fine-tuning.