(Thesis) Integrating Databases with LLMs
Abstract
This research investigates the integration of Large Language Models (LLMs) with databases to enhance information extraction and query resolution accuracy. The study primarily focuses on addressing challenges related to encoding methodologies and refining attention mechanisms within LLMs. A key challenge involves the individual encoding of data elements within large databases. Observations reveal that attention sinks and position embeddings significantly influence accuracy. Leveraging insights from recent advancements, particularly adapter layers inspired by Llama-Adapter, demonstrates noteworthy improvements in the model's performance. Another critical aspect explored involves managing unlimited contextual information. Strategies to approximate and rectify position information loss during encoding are discussed. Insights into the behaviour of attention heads that retrieve and promote information from context guide the refinement of model performance. Specifically, zeroing out attention heads in the final layers has shown promising results in ensuring accurate responses. The study's key contributions lie in proposing solutions to challenges related to individual encoding and contextual information management. These findings pave the way for the integration of LLMs with databases, enabling more precise information extraction and query answering capabilities.
Introduction
This study focuses on integrating Large Language Models (LLMs) with databases, aiming to extract information and answer queries accurately. Previous methodologies, such as fine-tuning frameworks and Retrieval Augmented Generation, have proven inadequate when dealing with vast datasets containing millions of entries. To ensure absolute precision, relying solely on summaries or excerpts is insufficient. Instead, leveraging the attention framework within transformers themselves becomes essential. This thesis aims to address two primary challenges in enabling LLMs to work with databases: performing generation with a large context and encoding each data element effectively. The experiments focus on decoder-only models, specifically Llama models.
Task
We aim to observe and improve the recall ability of the model. The following subsections describe the data provided and the queries posed.
Data
We employ data from a closed-world system to ensure the model does not depend on its trained knowledge. We use a simple data format extracted from a knowledge corpus: two entities, with or without a relation linking them. Table 1 shows some examples of the data. We refer to each data element as a "fact".
| Format | Fact |
|---|---|
| Two Entities | Williamson is baking |
| | Oppenheimer is cycling |
| | Sameer is Rope Climbing |
| Two Entities and Relation | Williamson is eating with Abhishek |
| | Ashwin is affiliated to Chennai F.C. |
| | Manara city is located in Sahar |
Queries
Given one entity and a relation (if present), we expect the model to retrieve the associated entity. Table 2 shows examples of the queries based on the data presented in Table 1.
| Fact | Query |
|---|---|
| Williamson is baking | What is Williamson doing? |
| | Who is baking? |
| Oppenheimer is cycling | What is Oppenheimer doing? |
| | Who is cycling? |
| Williamson is eating with Abhishek | Who is Williamson eating with? |
| | Who is eating with Abhishek? |
| Ashwin is affiliated to Chennai F.C. | Which organization is Ashwin affiliated with? |
| | Who is affiliated with Chennai F.C.? |
Individual Encoding
A database typically comprises a large number of tokens, while transformers are limited in the context window they can handle. Recent endeavors, such as the method proposed in Peng et al., 2023, extend the window by manipulating position embeddings (RoPE: Su et al., 2023). However, despite these approaches, sequential processing of tokens still results in quadratic inference time complexity. Additionally, encoding each fact independently, without influence from other facts, is desired. Moreover, in the context of data updates, sequentially encoding facts would necessitate re-encoding all facts with each new update, which is inefficient. Thus, the aim is to explore methods for individual encoding.
Naive Approach
In this attempt, we obtain the hidden state representations of all the facts by processing them individually. During generation, we allow the model to perform attention computation over these individually encoded hidden states along with the preceding query hidden states. We observed poor accuracy: the model misassociated the entities in one fact with entities in another. Table 3 provides the responses to select queries; a toy sketch of this setup follows the table.
| Query | Response |
|---|---|
| What is Williamson doing? | baking |
| What is Oppenheimer doing? | baking |
| Who is cycling? | Oppenheimer |
| Who is baking? | Oppenheimer |
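Below is a minimal toy sketch of the naive setup; the projection matrices, dimensions, and random "encodings" are illustrative stand-ins, not the actual Llama internals. It highlights the property that causes the failure: every fact is encoded from position 0 of its own sequence, and the resulting states are simply stacked for attention.

```python
import torch
import torch.nn.functional as F

d = 64                                # toy hidden size
W_q = torch.randn(d, d) / d ** 0.5    # stand-in projection matrices
W_k = torch.randn(d, d) / d ** 0.5
W_v = torch.randn(d, d) / d ** 0.5

def encode_fact(n_tokens: int) -> torch.Tensor:
    """Stand-in for a per-fact forward pass: returns (n_tokens, d) hidden states.
    Crucially, every fact starts at position 0 of its own sequence."""
    return torch.randn(n_tokens, d)

facts = [encode_fact(5), encode_fact(5), encode_fact(6)]
context = torch.cat(facts, dim=0)     # individually encoded, then simply stacked

def attend(query_states: torch.Tensor) -> torch.Tensor:
    """Attention over the stacked per-fact states plus the query's own states."""
    kv_input = torch.cat([context, query_states], dim=0)
    q = query_states @ W_q
    scores = q @ (kv_input @ W_k).T / d ** 0.5
    return F.softmax(scores, dim=-1) @ (kv_input @ W_v)

out = attend(torch.randn(3, d))       # three query tokens
# Because each fact was encoded from position 0, nothing positionally separates
# one fact's entities from another's, producing the misassociations in Table 3.
```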
Observations and Approaches
Observations
Need for Attention Sinks
Prepending a common set of initial tokens when encoding each fact improved the accuracy. The model uses these attention sinks to determine the position encoded in the hidden states of the succeeding tokens. This was also observed in Xiao et al., 2023.
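A minimal sketch of this encoding scheme follows, assuming a Hugging Face Llama-family checkpoint and an illustrative prefix string; both are assumptions, not the thesis's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed; any Llama-family checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

SINK_PREFIX = "Facts:\n"  # illustrative shared first tokens acting as attention sinks

def encode_fact(fact: str) -> torch.Tensor:
    """Encode one fact independently, preceded by the shared sink tokens."""
    ids = tokenizer(SINK_PREFIX + fact, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # Keep only the final-layer states of the fact tokens, dropping the sink
    # prefix (assumes the prefix tokenizes identically with and without the fact).
    n_sink = tokenizer(SINK_PREFIX, return_tensors="pt").input_ids.shape[1]
    return out.hidden_states[-1][0, n_sink:]

fact_states = [encode_fact(f) for f in ("Williamson is baking", "Oppenheimer is cycling")]
```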
Behaviour of Position Embeddings
RoPE plays a major role in the retrieval process. We noticed that the misassociation of entities occurred only across facts that were encoded at the same "distance" from the attention sinks (this distance is measured in terms of the number of facts present as context during encoding). The hidden states encode this relative position information from the attention sink, provided by RoPE, and depend on it to find and retrieve associated tokens in the context.
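To make the mechanism concrete, the following minimal RoPE implementation (after Su et al., 2023) shows why facts encoded at the same offset from the sink are positionally indistinguishable; the dimensions are illustrative.

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding at absolute position `pos`.
    x: (d,) with d even; each pair (x[2i], x[2i+1]) is rotated by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2).float() / d)
    angles = pos * theta
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

v = torch.randn(8)
# The same vector at offset 3 in fact A and offset 3 in fact B is rotated
# identically; retrieval keyed on this phase cannot tell the facts apart.
assert torch.allclose(rope(v, pos=3), rope(v, pos=3))
assert not torch.allclose(rope(v, pos=3), rope(v, pos=7))
```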
Finetuning Approach
The recent Llama-Adapter paper (Gao et al., 2023) achieves multimodal capabilities in LLMs by using an adapter to convert image representations into hidden state representations. Similarly, we transform the individually encoded hidden state representations of each fact through an adapter layer and provide them for attention computation during generation. We observed an improvement in the model's answering pattern. We are currently attempting to generalize the encoding to accommodate an arbitrary number, N, of facts and to analyze how the adapter layer improved performance on the task.
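A hypothetical sketch of such an adapter, in the spirit of Llama-Adapter, is shown below: a residual bottleneck module applied to each fact's states before attention. The architecture and sizes are assumptions for illustration, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class FactAdapter(nn.Module):
    """Residual bottleneck adapter applied to per-fact hidden states."""
    def __init__(self, d_model: int = 4096, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.SiLU()

    def forward(self, fact_states: torch.Tensor) -> torch.Tensor:
        # Only the adapter is trained; the LLM itself stays frozen.
        return fact_states + self.up(self.act(self.down(fact_states)))

adapter = FactAdapter()
fact_states = torch.randn(12, 4096)   # one fact's token states (illustrative)
adapted = adapter(fact_states)        # fed to attention during generation
```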
Unlimited Context
Since there are a huge number of entries in the database, it is not possible to perform attention computation over all of them, and hence some form of selection has to be performed. This can be done before generation, as in Retrieval Augmented Generation, but doing so potentially limits or provides incorrect information, since RAG is based on simple embedding similarity. Hence, we attempt to allow the model to retrieve from context at all stages of attention computation, taking inspiration from Bertsch et al., 2023.
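A toy sketch of retrieval inside attention in this spirit: each attention query attends only over its top-k best-matching stored states rather than the whole database. The datastore scale, projections, and brute-force scoring are illustrative; a real system would use an approximate nearest-neighbor index.

```python
import torch
import torch.nn.functional as F

d = 64
W_k = torch.randn(d, d) / d ** 0.5
W_v = torch.randn(d, d) / d ** 0.5

datastore = torch.randn(100_000, d)   # stored hidden states (toy scale)
keys = datastore @ W_k                # precomputed; a real system uses an ANN index
values = datastore @ W_v

def retrieval_attention(q: torch.Tensor, k: int = 32) -> torch.Tensor:
    """q: (n_queries, d). Each query attends only over its k best-matching states."""
    scores = q @ keys.T / d ** 0.5
    top_scores, top_idx = scores.topk(k, dim=-1)
    weights = F.softmax(top_scores, dim=-1)          # softmax over the retrieved set
    return torch.einsum("nk,nkd->nd", weights, values[top_idx])

out = retrieval_attention(torch.randn(4, d))
```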
What is present in the unlimited context?
We preprocess all the corpus data, either individually or sequentially, and save all the hidden states in the database. If we were to save all the key and value vectors of the corpus data instead, the space required would be very large. However, in decoder-only models, saving only the hidden states leads to a tradeoff in the accuracy of the retrieval process: the position embedding of all the hidden states in the database is taken to be the same.
RoPE is responsible for accounting for the distance between two tokens; it is multiplied into both the query and key vectors before attention scores are computed. In the above method, since we save the hidden states rather than the key vectors, the position information is lost and not saved. We observe some queries answered incorrectly due to this approximation.
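The following toy sketch illustrates the approximation, using a plain 2-dimensional rotation as a stand-in for full RoPE; all values are illustrative.

```python
import torch

def rotate(x: torch.Tensor, pos: int) -> torch.Tensor:
    """Rotate a 2-d vector by pos radians: the core operation behind RoPE."""
    angle = torch.tensor(float(pos))
    c, s = angle.cos(), angle.sin()
    return torch.tensor([x[0] * c - x[1] * s, x[0] * s + x[1] * c])

d = 2
W_k = torch.randn(d, d) / d ** 0.5
h_a = torch.randn(d)      # stored hidden state originally at position 3
h_b = torch.randn(d)      # stored hidden state originally at position 7

SHARED_POS = 0            # every database state gets the same position at query time
k_a, k_b = rotate(h_a @ W_k, SHARED_POS), rotate(h_b @ W_k, SHARED_POS)
# k_a and k_b carry no trace of positions 3 vs 7: the query can no longer use
# relative distance to pick the right token, causing the errors described above.
```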
Correcting the Approximation
How do the attention heads work to retrieve from context?
We observe that only certain attention heads are concerned with retrieving and promoting information from the context, while the other heads are concerned with promoting information from the model's memory (trained data). This has also been observed by Yu et al., 2023. We confirm our findings by restricting all other attention heads to only the preceding query tokens; the response to any of the queries did not change. (We are attempting to use this information to detect hallucination, i.e., cases where these selected attention heads produce lower attention scores on the tokens in the context.)
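A toy sketch of this head-restriction experiment is given below; the head indices and dimensions are hypothetical stand-ins for the ones found empirically.

```python
import torch
import torch.nn.functional as F

n_heads, ctx_len, query_len, d_head = 8, 20, 5, 16
retrieval_heads = {2, 5}                      # assumed indices, found empirically

q = torch.randn(n_heads, query_len, d_head)
k = torch.randn(n_heads, ctx_len + query_len, d_head)
v = torch.randn(n_heads, ctx_len + query_len, d_head)

scores = q @ k.transpose(-1, -2) / d_head ** 0.5
for h in range(n_heads):
    if h not in retrieval_heads:
        scores[h, :, :ctx_len] = float("-inf")   # block the context tokens
out = F.softmax(scores, dim=-1) @ v
# If responses are unchanged under this mask, only `retrieval_heads` were
# actually moving information out of the context.
```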
Observing these selected attention heads, we study how they retrieve relevant tokens from the context. During encoding, the position embeddings encode into the hidden states of a token the identity of the hidden states of the tokens near it (this has also been observed and studied in Feng et al., 2023). Then, when a specific token is input, the selected heads locate similar tokens in the context and retrieve tokens that are adjacent to or associated with them.
Problem
Based on the above observation, we realized that the incorrect responses are due to heads in the final layers retrieving and promoting incorrect tokens. This has also been observed in Halawi et al., 2023. Based on this study, we zero out the attention heads in the final layers and observe accurate responses. We are currently attempting to verify these observations on different types of data.
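One way such zeroing could be implemented is with a forward pre-hook on each final layer's attention output projection; the module path, layer range, and head choices below are assumptions about a Llama-style Hugging Face model, not the thesis's exact code.

```python
import torch

def make_head_zeroing_hook(heads_to_zero, n_heads, head_dim):
    """Pre-hook for the attention output projection: zeroes chosen heads' outputs."""
    def hook(module, inputs):
        (hidden,) = inputs                            # (batch, seq, n_heads * head_dim)
        b, s, _ = hidden.shape
        hidden = hidden.view(b, s, n_heads, head_dim).clone()
        hidden[:, :, list(heads_to_zero)] = 0.0       # kill these heads' contribution
        return (hidden.view(b, s, n_heads * head_dim),)
    return hook

# Assumed usage on a Hugging Face Llama-style model (paths/indices hypothetical):
# for layer in model.model.layers[-2:]:               # e.g. the last two layers
#     layer.self_attn.o_proj.register_forward_pre_hook(
#         make_head_zeroing_hook({0, 3}, n_heads=32, head_dim=128))
```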
References
- Peng et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.
- Su et al. (2023). RoFormer: Enhanced Transformer with Rotary Position Embedding.
- Xiao et al. (2023). Efficient Streaming Language Models with Attention Sinks.
- Gao et al. (2023). LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model.
- Bertsch et al. (2023). Unlimiformer: Long-Range Transformers with Unlimited Length Input.
- Yu et al. (2023). Characterizing Mechanisms for Factual Recall in Language Models.
- Feng et al. (2023). How do Language Models Bind Entities in Context?
- Halawi et al. (2023). Overthinking the Truth: Understanding how Language Models Process False Demonstrations.