Griptape | Introducing Griptape Cloud Hybrid Knowledge Bases

Following the addition of retrievers to Griptape Cloud, we are excited to announce Griptape Cloud Hybrid Knowledge Bases. Hybrid knowledge bases simplify the experience for developers who are building LLM applications that combine structured and unstructured data, improving performance and accuracy when those applications work with structured data.

Large language models can sometimes struggle to interpret structured, or tabular, data (data organized in rows and columns) because they don’t parse tabular data in the way that humans do. As a result, it can be challenging for LLMs to apply filters on columns and to correctly associate column headers with an entire column or row of data.

Beyond these parsing challenges, structured data sets can be large. Loading an entire data set into the context or prompt every time you want to ask a question about your data can dramatically increase token counts and lead to very high costs, particularly if you are using commercial model providers. With self-hosted models, you might struggle to work with large data sets, as these models often have lower token limits for context. Griptape Cloud Hybrid Knowledge Bases allow developers to overcome these challenges and put structured data to work in their LLM-powered applications.

What are Hybrid Knowledge Bases?

Hybrid knowledge bases are a new knowledge base type that complements vector knowledge bases. Hybrid knowledge bases differ from vector knowledge bases in that they are able to store and retrieve a combination of structured and unstructured data. Vector knowledge bases are a great choice for applications that need to work with unstructured data such as PDF files and text, whether that text comes directly from text files or is generated dynamically from external sources such as web pages. Hybrid knowledge bases add the capability to reliably query structured data in your LLM-powered applications, alongside performing vector similarity searches on any unstructured data that is associated with your structured data.

Hybrid knowledge bases are useful when your applications need to access information that mixes structured and unstructured data. An example would be a recruitment candidate management system where you might have structured information about candidates, such as the city or country that they are located in, coupled with unstructured data like the contents of a resumé or LinkedIn profile. With a hybrid knowledge base, you could ask the question ‘which candidates are located in New York and have experience in data analysis with Python’. The hybrid knowledge base would retrieve the records that match the structured query of location equal to “New York” and then perform a similarity search on the unstructured data for terms similar to “data analysis with Python”, returning the candidates whose records best match those skills.
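
To make the filter-then-search idea concrete, here is a minimal Python sketch of the recruitment example. It is an illustration only and not how Griptape Cloud implements hybrid retrieval: the candidate records, the `embed`, `cosine`, and `hybrid_search` helpers, and the bag-of-words "embedding" are all hypothetical stand-ins for a real database filter and a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical records: one structured field (city) and one unstructured
# field (profile). This data is invented for the illustration.
CANDIDATES = [
    {"name": "Candidate A", "city": "New York", "profile": "Data analysis with Python and pandas"},
    {"name": "Candidate B", "city": "New York", "profile": "Front-end design and illustration"},
    {"name": "Candidate C", "city": "Boston", "profile": "Python data analysis and statistics"},
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(records, city, query, top_k=1):
    """Filter on the structured field, then rank the survivors by similarity."""
    matches = [r for r in records if r["city"] == city]
    query_vec = embed(query)
    ranked = sorted(
        matches,
        key=lambda r: cosine(query_vec, embed(r["profile"])),
        reverse=True,
    )
    return ranked[:top_k]


results = hybrid_search(CANDIDATES, "New York", "data analysis with Python")
print([r["name"] for r in results])  # prints ['Candidate A']
```

Candidate C has a closely matching profile but is excluded by the structured filter on city, which is the behavior that a purely vector-based search cannot guarantee.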

To deal with updates to structured data, which can be frequent, the Griptape Cloud scheduled refresh feature for knowledge bases simplifies the process of creating data pipelines that automate knowledge base updates, whether these are existing vector knowledge bases or new hybrid knowledge bases.

Getting Started with Hybrid Knowledge Bases

Let’s build a hybrid knowledge base and connect it to a Griptape Cloud Assistant to show how this feature works in practice.

For this walkthrough I created a sample data set that contains information in the following format. To save space, I am only showing the first 5 rows here, but my data set contains just over 300 records. This data is synthetic (I generated it using OpenAI’s GPT-4o model), so there’s no need to worry about anyone’s privacy. As they say in movie credits, any similarity to any living person is purely coincidental.

First Name	Last Name	Gender	Country	Age	Date	Fun Fact
Aiden	Abril	Female	Portugal	32	15/10/2017	I can solve a Rubik’s cube in under a minute.
Bella	Hashimoto	Female	Sweden	25	16/08/2016	I have visited more than 20 countries.
Caleb	Gent	Male	Austria	36	21/05/2015	I can play the guitar.
Dahlia	Hanner	Female	Netherlands	25	15/10/2017	I once met Tom Hanks by accident.

Our data needs to be in CSV format, so I downloaded the data from Google Sheets in that format, naming it example.csv. I then uploaded the data from my laptop to a bucket in my Griptape Cloud Data Lake, as you can see in the image below. For production use cases, you can automate this operation using the Griptape Cloud API and couple this with the scheduled refresh feature for knowledge bases to keep the data that your application uses up to date.

Before we can create our knowledge base, we need to create a data source. To do this, open the Libraries submenu in the left-hand navigation menu in the console and select Data Sources, then click the Create Data Source button in the top right of the data sources page. Select the Griptape Cloud Data Lake option on the create data source page to create a data source from the CSV file that we uploaded.

We are then prompted to enter the details for our new data source. Enter a name together with an optional description. Select the bucket that we used earlier and enter the name of the asset that we created as the asset path. Then click Add Asset Path and Create.

After a few seconds, our data source will be created. This will be indicated by the status showing ‘🟢 Ready’ on the data source detail screen. Our next step is to create our new hybrid knowledge base. To do this, open the Libraries submenu in the left-hand navigation menu in the console and select Knowledge Bases. Next, select the Griptape Cloud Hybrid Knowledge Base option, as shown below.

On the next page, enter a name and an optional description for the new hybrid knowledge base, and select the data source that we created earlier. Once you do this, the data in the CSV will be evaluated automatically and displayed as structured columns and unstructured columns at the bottom of the page. This gives you an opportunity to validate that the columns in your data have been interpreted correctly. If you wish to switch a column from structured to unstructured or vice versa, you can do this by clicking the icon to the left of the field name for that column. You can also modify the data types for each of your columns here if necessary. Once you are happy with the schema, click the Create button. In a few seconds, the status on the knowledge base details screen will change to ‘🟢 Ready’, indicating that the knowledge base is ready to use.
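
If you are wondering what kind of logic could split columns into structured and unstructured, the Python sketch below shows one plausible heuristic applied to the sample rows above. It is purely my own illustration and makes no claim about how Griptape Cloud actually evaluates a CSV: the `infer_type` and `classify` helpers, the day/month/year date format, and the three-word threshold are all assumptions.

```python
from datetime import datetime

HEADER = ["First Name", "Last Name", "Gender", "Country", "Age", "Date", "Fun Fact"]
ROWS = [
    ["Aiden", "Abril", "Female", "Portugal", "32", "15/10/2017", "I can solve a Rubik’s cube in under a minute."],
    ["Bella", "Hashimoto", "Female", "Sweden", "25", "16/08/2016", "I have visited more than 20 countries."],
    ["Caleb", "Gent", "Male", "Austria", "36", "21/05/2015", "I can play the guitar."],
    ["Dahlia", "Hanner", "Female", "Netherlands", "25", "15/10/2017", "I once met Tom Hanks by accident."],
]


def infer_type(values):
    """Return 'integer', 'date', or 'text' for a column of strings."""
    if all(v.isdigit() for v in values):
        return "integer"
    try:
        for v in values:
            datetime.strptime(v, "%d/%m/%Y")  # assumed day/month/year format
        return "date"
    except ValueError:
        return "text"


def classify(values, max_words=3):
    """Guess 'structured' for typed or short values, 'unstructured' for free text."""
    if infer_type(values) != "text":
        return "structured"
    avg_words = sum(len(v.split()) for v in values) / len(values)
    return "structured" if avg_words <= max_words else "unstructured"


# zip(*ROWS) transposes the rows into columns.
schema = {
    name: (classify(column), infer_type(column))
    for name, column in zip(HEADER, zip(*ROWS))
}
for name, (kind, data_type) in schema.items():
    print(f"{name}: {kind} ({data_type})")
```

With these four rows, only Fun Fact is guessed to be unstructured. A heuristic like this will sometimes guess wrong, which is why being able to review and override the detected schema in the console is useful.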

To test the knowledge base, let’s create an assistant, connect it to the knowledge base, and ask some test questions of our data. To do this, click Assistants in the left-hand navigation menu, then click the Create Assistant button in the top right of the assistants page. On the create assistant page, enter a name and optional description for your assistant and add your new knowledge base by selecting it in the knowledge bases dropdown. Finally, click Create to create your assistant.

As a test, I decided to pose the question “Which countries do people with fun facts related to dancing come from?”. I got the results below.

To validate these results, I applied a simple filter to the original CSV data, filtering the Fun Fact column for rows containing the word ‘dance’. This gave me the results below, which are almost a perfect match.

First Name	Last Name	Gender	Country	Age	Date	Fun Fact
Milo	Ascencio	Female	Switzerland	32	16/08/2016	I once won a dance competition.
Ronan	Cuccia	Female	Canada	46	21/05/2015	I once danced in a flash mob.
Odin	Lafollette	Female	Austria	34	15/10/2017	I once won a dance-off in a club.
Ellery	Wachtel	Female	Spain	27	16/08/2016	I once won a dance contest.
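
If you would rather script this check than filter in a spreadsheet, a keyword filter takes only a few lines of Python. The sketch below assumes a comma-separated file whose header names match the table above. It reads from an inline sample (two dance rows and two others from my data set) so that it runs on its own; in practice you would pass `open("example.csv", newline="")` instead. The `rows_mentioning` helper is a name I made up for this example.

```python
import csv
import io

# Inline stand-in for example.csv so that the sketch is self-contained.
SAMPLE_CSV = """First Name,Last Name,Gender,Country,Age,Date,Fun Fact
Milo,Ascencio,Female,Switzerland,32,16/08/2016,I once won a dance competition.
Caleb,Gent,Male,Austria,36,21/05/2015,I can play the guitar.
Ronan,Cuccia,Female,Canada,46,21/05/2015,I once danced in a flash mob.
Bella,Hashimoto,Female,Sweden,25,16/08/2016,I have visited more than 20 countries.
"""


def rows_mentioning(csv_file, column, keyword):
    """Return the rows whose `column` contains `keyword`, ignoring case."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if keyword.lower() in row[column].lower()]


dance_rows = rows_mentioning(io.StringIO(SAMPLE_CSV), "Fun Fact", "dance")
countries = sorted({row["Country"] for row in dance_rows})
print(countries)  # prints ['Canada', 'Switzerland']
```

Because this is a plain substring match, it picks up ‘danced’ and ‘dance-off’ as well as ‘dance’, but it can never surface a row that is only semantically related to dancing.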

In the results from the hybrid knowledge base, you will see that we also got a result back for Zariah Muntz from Germany, who can walk on their hands. This illustrates vector similarity search beautifully. As you might expect, acrobatic moves like walking on your hands are very close to dancing in the vector space created during embedding, so this result was returned from the vector search operation performed within the knowledge base, alongside the more conventional dance-related facts.

To validate this, I cleared the conversation memory for the assistant by clicking the X next to the thread identifier in the ‘Run Config’ to the right, and asked a slightly different question: “Which countries do people with fun facts related to dancing come from? Give only the top four results based on how dance-related the results are”. This returned the following results, which excluded our acrobatic friend Zariah.

We really hope you find Griptape Cloud Hybrid Knowledge Bases valuable in your applications. As usual, if you need any help or guidance in using this feature, or you have feedback or suggestions for improvements to this or any other feature in Griptape Cloud, please join us in the Griptape Discord. We would love to chat with you. Try creating your own Griptape Cloud Hybrid Knowledge Base today!