I had the opportunity to speak at the Monterey College of Law's annual symposium in January 2024 about the evolving landscape of generative AI for legal applications. At the time, Retrieval Augmented Generation (RAG) was just emerging. By now, it has taken center stage. What is it, and how can we use it?
Retrieval Augmented Generation
Retrieval-augmented generation (RAG) means retrieving information from your own in-house data sources to augment the performance of large language models (LLMs) such as ChatGPT. Sometimes, this is advertised as ‘chat with your PDF.’ Grounding responses in retrieved documents dramatically reduces the chance of model hallucination.
Example: Imagine a legal chatbot powered by RAG technology. A client asks, "What are the recent changes to copyright law in the EU?"
- Retrieval: The system searches your database of high-quality vetted material for up-to-date information on EU copyright law changes.
- Augmentation: It combines this retrieved information with its general understanding of legal concepts.
- Generation: The AI then generates a comprehensive response, explaining the recent changes in a clear, context-appropriate manner, with references to your in-house database.
This approach allows the chatbot to provide current, accurate information beyond its initial training data, making it particularly useful in fields like law, where information constantly evolves.
How RAG works
In a RAG system, chunks, prompts, and queries are interconnected components of the information retrieval and generation process. Here's how they relate:
- Chunks: These are the smaller sections of documents in your corpus (e.g., your law firm's database of cases, contracts, and legal opinions). For example, a 50-page contract might be divided into chunks of a few paragraphs each. No part of the text is discarded in this process. Every word from the original PDF or other file type is included in one or more chunks.
- Queries: These are the questions or requests a user (like a lawyer or paralegal) inputs into the system. For instance, "What are the termination clauses in our recent tech company contracts?"
- Prompts: These are the instructions given to the language model, combining the user's query with relevant retrieved information to guide the model's response.
Here's how they interact:
- When a user's query is received, the system converts it into an embedding (a vector representation of its meaning).
- This query embedding is compared to the embeddings of all the chunks in the database.
- The system retrieves the most relevant chunks based on this comparison.
- These relevant chunks, along with the original query, are used to construct a prompt for the language model, often ChatGPT. The language model can be locally deployed or accessed via the internet using an Application Programming Interface (API).
- The language model then generates a response based on this prompt.
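Before turning to a legal example, the sketch below illustrates the embedding, comparison, and retrieval steps in Python. It is a minimal sketch, assuming the open-source sentence-transformers library and a small general-purpose embedding model; the chunks and the query are invented for illustration, and a production system would typically store chunk embeddings in a vector database rather than recomputing them on every query.

```python
# Minimal retrieval sketch (assumptions: sentence-transformers is installed,
# and the example chunks and query are purely illustrative).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedding model

# Hypothetical chunks drawn from a firm's document collection.
chunks = [
    "Either party may terminate this Agreement upon thirty (30) days written notice...",
    "Force majeure: neither party shall be liable for delays caused by events beyond its control...",
    "The Licensee shall pay the fees set out in Schedule A within thirty (30) days of invoice...",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# The user's query is embedded the same way as the chunks.
query = "What are the termination clauses in our recent tech company contracts?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# On normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vectors @ query_vector
top_indices = np.argsort(scores)[::-1][:2]  # keep the two most relevant chunks
retrieved = [chunks[i] for i in top_indices]
print(retrieved)
```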
Let's use a legal example to illustrate this:
Query: "What are our standard force majeure clauses for software licensing agreements?"
Process:
- The system searches through all chunks of your firm's documents.
- It might retrieve chunks from various software licensing agreements containing force majeure clauses.
- The prompt to the language model might read: "The user asked: 'What are our standard force majeure clauses for software licensing agreements?' Based on the following excerpts from our documents [insert relevant chunks here], provide a summary of our standard force majeure clauses for software licensing agreements."
- The language model then generates a response synthesizing information from these chunks and its general knowledge of legal language and concepts.
This process allows the RAG system to provide responses grounded in your firm's specific documents and practices rather than just general legal knowledge. The chunks serve as a bridge between your firm's proprietary information and the language model's capabilities, allowing for more accurate and relevant responses to queries.
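Continuing the sketch above, the augmentation and generation steps can be as simple as assembling the prompt string and sending it to a chat model. The prompt template follows the example in this section; the client call and model name are assumptions, and an OpenAI-compatible client could just as well point at a locally served model (see the later section on open-source LLMs).

```python
# Augmentation and generation sketch (assumptions: the openai Python package is
# installed, an API key is available, and the retrieved excerpts are illustrative).
from openai import OpenAI

query = "What are our standard force majeure clauses for software licensing agreements?"
retrieved = [
    "Force majeure: neither party shall be liable for delays caused by events beyond its control...",
    "In the event of force majeure, the affected party shall notify the other within ten (10) days...",
]

# Augmentation: combine the user's query with the retrieved chunks.
context = "\n\n".join(f"Excerpt {i + 1}:\n{chunk}" for i, chunk in enumerate(retrieved))
prompt = (
    f"The user asked: '{query}'\n\n"
    f"Based on the following excerpts from our documents:\n\n{context}\n\n"
    "Provide a summary of our standard force majeure clauses for software licensing agreements."
)

# Generation: send the assembled prompt to a language model.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model could be substituted here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```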
Small Chunks or Large Chunks: Which Is Better?
Chunking in retrieval-augmented generation systems is a critical process that significantly impacts the system's effectiveness, especially in legal applications. The debate between small and large chunks isn't about finding a universally superior option but rather about understanding the trade-offs and selecting the most appropriate approach for specific use cases.
Small chunks, typically ranging from 100 to 500 tokens (words or subword units), offer granularity. They excel in pinpointing specific information, which can be crucial when searching for particular legal clauses or citations. This granularity allows for more precise retrieval and can improve the system's ability to answer highly specific queries. However, small chunks risk fragmenting complex legal concepts and may lose important context.
Large chunks, often 1000 tokens or more, preserve more context and are better suited for capturing comprehensive legal arguments or entire contract clauses. They reduce the risk of missing crucial connections between ideas, which is particularly important in legal reasoning. The downside is that large chunks may retrieve more irrelevant information and could be less efficient for answering targeted questions.
In practice, the choice between small and large chunks often depends on the nature of the legal documents and the types of queries the system is expected to handle. For instance, contract analysis might benefit from larger chunks to keep clauses intact, while case law research might require a mix of chunk sizes to balance between capturing full arguments and allowing for precise citation retrieval.
Many advanced RAG systems in legal tech are moving beyond the binary choice between small and large chunks. Instead, they are adopting more sophisticated approaches like hierarchical chunking, which preserves document structure while allowing for retrieval at different levels of detail. Some employ overlapping or sliding window techniques to maintain context across chunk boundaries, which is crucial for understanding the full scope of legal arguments.
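As a concrete illustration, here is a minimal sketch of overlapping (sliding-window) chunking. It splits on words for simplicity; real systems usually count model tokens and often respect sentence or section boundaries, and the sizes shown are arbitrary starting points.

```python
# Sliding-window chunking sketch (assumption: word-based splitting stands in
# for the token-based splitting used in production systems).
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of roughly chunk_size words, overlapping by overlap words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 50-page contract read into one string is split into overlapping
# chunks, so a clause that straddles a boundary still appears intact in one chunk.
contract_text = "This Agreement is entered into ..."  # placeholder for the full document text
print(len(chunk_text(contract_text, chunk_size=300, overlap=50)))
```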
The effectiveness of chunk size can also vary based on the embedding model used and the specific retrieval algorithms employed. Some models may perform better with smaller chunks, while others might excel at capturing the semantics of larger text segments.
Ultimately, the optimal chunking strategy often emerges through experimentation and empirical testing. Legal AI developers typically run extensive benchmarks, testing different chunk sizes and strategies against diverse sets of legal queries to find the approach that yields the best performance for their specific use case. If you deploy an open-source model on your own hardware, you control many of these parameters, including the chunk size (sometimes called snippet size) and the number of snippets retrieved per prompt.
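A rough sketch of such an experiment appears below. It reuses the chunk_text helper, the embedding model, and the contract_text placeholder from the earlier sketches, and it scores each chunk size by whether a known relevant phrase appears among the top retrieved chunks; the test queries and expected phrases are invented for illustration, and a real benchmark would use a much larger, vetted query set.

```python
# Chunk-size benchmark sketch (assumptions: chunk_text, model, and contract_text
# come from the earlier sketches; the test cases are illustrative only).
import numpy as np

test_cases = [
    ("What notice period applies to termination?", "thirty (30) days written notice"),
    ("Who bears liability during force majeure events?", "events beyond its control"),
]

def retrieval_hit_rate(document: str, chunk_size: int, top_k: int = 3) -> float:
    """Fraction of test queries whose expected phrase appears in the top retrieved chunks."""
    chunks = chunk_text(document, chunk_size=chunk_size, overlap=chunk_size // 5)
    vectors = model.encode(chunks, normalize_embeddings=True)
    hits = 0
    for query, expected_phrase in test_cases:
        query_vector = model.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(vectors @ query_vector)[::-1][:top_k]
        if any(expected_phrase in chunks[i] for i in top):
            hits += 1
    return hits / len(test_cases)

for size in (150, 300, 600, 1200):
    print(size, retrieval_hit_rate(contract_text, chunk_size=size))
```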
In the rapidly evolving field of legal AI, chunk size remains an active area of research. As embedding models and retrieval algorithms advance, new chunking strategies may emerge that better balance the needs for precision, context preservation, and computational efficiency in legal information retrieval.
Leveraging Open-Source LLMs for Secure, Cost-Effective Legal RAG Systems
Open-source LLMs that can run on mid-to-high-end consumer computers are becoming increasingly viable for RAG systems, including those used in legal applications. Numerous open-source models can be easily deployed locally using applications such as GPT4All and LM Studio. Importantly, locally deployed models do not send any data over the web.
Advantages of locally deployed LLM-RAG systems:
- Accessibility: They democratize AI technology, allowing smaller law firms or independent practitioners to implement advanced RAG systems without relying on costly commercial providers.
- Cost-effectiveness: Even purchasing new hardware with advanced processing power can be cost-effective, considering the high cost of solutions currently offered by legal information service providers.
- Optional API costs: Connecting to a hosted model such as GPT-4 via API is optional and usually less expensive than subscriptions to team- or enterprise-level commercial solutions.
- Flexible workflows: Work can be divided into sensitive and non-sensitive matters, with API calls used for non-sensitive information and fully local processing for sensitive matters. GPT4All and LM Studio allow several LLMs to be deployed in parallel, and switching is easy, for example between a model that sends queries out via API and one that processes them entirely locally (see the sketch after this list).
- Privacy: If a local model is used, all data stays on-premises, which may be desirable.
- Customizability: These models can be adapted more easily to specific needs without the restrictions often imposed by proprietary models.
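As a sketch of the sensitive/non-sensitive split described above, the snippet below routes prompts either to a hosted API or to a model served locally. It assumes an OpenAI-compatible local server of the kind LM Studio can expose on localhost; the URL, port, and model names are assumptions that depend entirely on your own setup.

```python
# Routing sketch: sensitive prompts go to a locally served model, everything
# else to a hosted API. The localhost URL and model names are assumptions.
from openai import OpenAI

hosted = OpenAI()  # hosted provider; assumes OPENAI_API_KEY is set
local = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

def answer(prompt: str, sensitive: bool) -> str:
    """Send sensitive prompts only to the locally running model."""
    client = local if sensitive else hosted
    model_name = "local-model" if sensitive else "gpt-4o"  # placeholders for your setup
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("Summarize the retrieved force majeure excerpts.", sensitive=True))
```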
In many cases, these smaller open-source LLMs can be sufficient for RAG tasks, especially when paired with effective retrieval mechanisms. They may not match the raw capabilities of the largest models like GPT-4, but for many legal applications, their performance can be more than adequate since they need only provide the language glue that ties the retrieved chunks together.
For example, in a contract review RAG system, the LLM doesn't need to have a comprehensive knowledge of all legal concepts. It primarily needs to understand the query, interpret the relevant contract clauses retrieved, and generate a coherent response. A well-implemented RAG system with a smaller LLM can handle this effectively.
However, there are trade-offs to consider:
- Generalization: These models might struggle with very broad or complex queries outside their training data.
- Consistency: They may be less reliable in maintaining long-range coherence in extended outputs.
- Resource Intensity: While they can run on consumer hardware, they still require significant computational resources, especially for larger models in the category.
- Setup and Maintenance: Implementing and maintaining these systems requires more technical expertise than using cloud-based APIs.
The sufficiency of these models often depends on the specific task and how well the RAG system is implemented. A well-designed RAG system can compensate for some limitations of the underlying LLM by providing it with highly relevant context.
In legal applications, where precision and reliability are paramount, it's crucial to thoroughly test and validate the performance of these open-source LLMs in RAG systems. While they may not always match the capabilities of the largest models, their ability to run locally, combined with the benefits of RAG architecture, makes them a compelling option for many legal AI applications.
As these open-source models continue to improve rapidly, we're likely to see their capabilities and efficiency increase, narrowing the gap with larger, proprietary models even further.
Figure 1: Launch screen of GPT4All with RAG for local docs. You can choose the LLM that fits your preferences from a list called “Find Models.” Many configurations are entirely local, with no data traffic over the internet to third-party providers. Some larger models may not run well on your current hardware.
Installation and Hardware Requirements
Downloading and installing the environments provided by GPT4All and LM Studio is very easy. None of this requires interaction with the command line (terminal). Intuitive interfaces make it clear what is required of you as a user. For example, adding a document collection requires just a few clicks to point the application to documents on your hard drive.
A detailed discussion of hardware requirements is beyond the scope of this article. In general, to run LLMs locally, your hardware setup should focus on a powerful GPU with sufficient VRAM, ample RAM, and fast storage. Gaming computers with late-model NVIDIA graphics processing units are often a reasonable and affordable choice. For more detail: Recommended Hardware for Running LLMs Locally.
Contractual Considerations for Using AI-Assisted Document Retrieval in Legal Practice
Your in-house collections of documents and data may consist of diverse materials, including documents over which you hold clear copyright, materials where the copyright belongs to third parties, and instances where the copyright status is ambiguous or indeterminate. Historically, under the principles of fair use, you may have reasonably relied on the ability to use these documents for internal purposes, including drafting memoranda, briefs, contracts, and other routine legal work. However, the introduction of artificial intelligence (AI) tools, such as retrieval-augmented generation (RAG) and other generative technologies, complicates this analysis.
Typical contract language used by legal information providers to prohibit their data from being downloaded and used to train generative AI models often includes specific clauses to protect intellectual property rights and restrict AI-related uses. These clauses may explicitly prohibit the use of legal content for AI training unless express permission is granted.
For example, some providers insert clauses stating that no rights are granted for "reproducing or otherwise using the work for purposes of training artificial intelligence technologies" without prior authorization. These clauses may also specify that the data cannot be sublicensed for training AI models unless explicit consent is obtained from the rights holder. Such terms are designed to prevent unauthorized use of proprietary information, especially in industries like publishing, where works could be repurposed for generative AI without compensating the authors or license holders.
If you are using locally stored documents on a locally installed machine to perform retrieval-augmented generation (RAG) for your own attorney work product, the contractual restrictions from legal information providers on AI training would generally not apply in the same way, assuming a few key conditions:
- No Third-Party Data Sharing: The main issue addressed by legal information providers is the sharing or usage of their proprietary data for training public AI models. If your setup is local and you are not uploading, sharing, or otherwise distributing the data to third parties or external AI platforms, you are not violating the clauses that typically prohibit the use of the data for training or public dissemination.
- No Data Mining for AI Training: If you use your legal documents merely for internal research or generation of summaries (i.e., retrieval purposes) without allowing the system to use those documents to train its models, this typically falls outside the scope of most AI-related prohibitions. The contractual concerns revolve around AI training, where the documents are used to improve or modify the AI’s underlying models.
- Permitted Use of the Content: If you own the rights to the documents or have legitimate access to them as part of your legal practice, using them in a closed, private system for RAG would likely be within your rights. You would still need to check the licensing terms of any software or information source to ensure that internal or local use is not restricted.
However, if your locally installed system inadvertently shares data with cloud-based AI systems, or if you are using third-party AI tools that access external servers, you could run into issues related to unauthorized data use for training. In such cases, ensuring strict data segregation and local processing is critical to avoid legal conflicts.
Conclusion
Retrieval Augmented Generation (RAG) systems represent a significant leap forward in AI-assisted legal work. By combining the power of large language models with targeted information retrieval from a firm's proprietary documents, RAG offers improved accuracy, context-awareness, and reduced hallucination risks for tasks such as document analysis, contract review, and complex legal queries.
The emergence of open-source LLMs and locally deployable systems has democratized access to this technology, providing cost-effective and privacy-conscious alternatives to commercial solutions. This is particularly beneficial for smaller law practices that can now leverage advanced AI tools without compromising client confidentiality or incurring prohibitive costs.
However, the implementation of RAG systems in legal practice requires careful consideration of several factors:
- Technical aspects: System performance is crucially dependent on optimal chunking strategies, choice of embedding models, appropriate retrieval algorithms, and capable hardware.
- Data privacy and security: Locally deployed systems offer enhanced control over sensitive information, a critical concern in legal practice.
- Contractual and copyright considerations: Users must navigate the complex landscape of licensing agreements and fair use principles, especially when incorporating third-party legal information into their RAG systems.
- Ethical use and compliance: Ensuring AI-assisted work aligns with professional standards and regulatory requirements remains paramount.
As RAG technology continues to evolve rapidly, it presents both opportunities and challenges for the legal profession. Law firms of all sizes now have the chance to enhance their services, improve efficiency, and provide more accurate and timely advice to clients. However, this must be balanced with the need for careful implementation, ongoing evaluation, and a clear understanding of AI's limitations and ethical implications in legal practice.
A podcast summarizing the points of this article is available here: