Embedbase Documentation

The SDK is still in alpha. If you have feedback, join our Discord 🔥

Find it on GitHub.

Embedbase

Open-source API & SDK to connect any data to ChatGPT

Before you start, you need to get an API key at app.embedbase.xyz.

Note: we're working on a fully client-side SDK. In the meantime, you can use the hosted instance of Embedbase.

Design philosophy

Simple
Open-source
Composable (integrates well with any AI providers, databases and LLM helpers)

What is it

These are the official clients for Embedbase, an open-source API & SDK to easily create, store and retrieve embeddings.

Who is it for

People who want to

plug their own data into ChatGPT or any other LLM
build recommendation systems
build search engines
build classification engines
etc.

Installation

npm install embedbase-js

Initializing

import { createClient } from 'embedbase-js'
 
// you can find the api key at https://embedbase.xyz
const apiKey = 'your api key'
// this is using the hosted instance
const url = 'https://api.embedbase.xyz'
 
const embedbase = createClient(url, apiKey)

Searching datasets

// fetching data
const data = await embedbase
  .dataset('test-amazon-product-reviews')
  .search('best hot dogs accessories', { limit: 3 })
 
console.log(data)
// [
//   {
//       "similarity": 0.810843349,
//       "data": "This nice little hot dog toaster is a great addition to our kitchen. It is easy to use and makes a great hot dog. It is also easy to clean. I would recommend this to anyone who likes hot dogs.",
//       "metadata": {
//         "path": "https://amazon.com/hotdogtoaster",
//         "source": "amazon"
//       },
//       "embedding": [0.35332, 0.23423, ...]
//   },
//   {
//       "similarity": 0.294602573,
//       "data": "200 years ago, people would never have guessed that humans in the future would communicate by silently tapping on glass",
//       "embedding": [0.76532, 0.23423, ...]
//   },
//   {
//       "similarity": 0.192932034,
//       "data": "The average car in space is nicer than the average car on Earth",
//       "embedding": [0.52342, 0.23423, ...]
//   },
// ]

You can also filter by metadata:

const data = await embedbase
  .dataset('test-amazon-product-reviews')
  .search('best hot dogs accessories')
  .where('source', '==', "amazon")
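
Each search result carries a similarity score, as in the sample output above, so low-scoring matches can be dropped before the results are used. The following is a minimal sketch, not part of the SDK: the SearchResult shape is inferred from the sample output, and filterBySimilarity and the 0.5 threshold are hypothetical choices for illustration.

```typescript
// Shape inferred from the sample output above; not an official SDK type.
interface SearchResult {
  similarity: number
  data: string
  metadata?: Record<string, unknown>
}

// Hypothetical helper: keep only results whose similarity reaches the threshold.
function filterBySimilarity(results: SearchResult[], minSimilarity: number): SearchResult[] {
  return results.filter((r) => r.similarity >= minSimilarity)
}

// Sample data shortened from the search output shown earlier.
const sample: SearchResult[] = [
  { similarity: 0.810843349, data: 'This nice little hot dog toaster...', metadata: { source: 'amazon' } },
  { similarity: 0.294602573, data: '200 years ago, people would never have guessed...' },
  { similarity: 0.192932034, data: 'The average car in space is nicer...' },
]

const relevant = filterBySimilarity(sample, 0.5)
console.log(relevant.length) // 1
```

A suitable threshold depends on the embedding model and the data, so it is worth tuning against real queries.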

Adding Data

// embeddings are extremely good for retrieving unstructured data
// in this example we store an unparsable html string
const data = await embedbase.dataset('test-amazon-product-reviews').add(`
  <div>
    <span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`)
 
console.log(data)
//
// {
//   "id": "eiew823",
//   "data": "Lightweight. Telescopic. Easy zipper case for storage.
//          Didn't put in dishwasher. Still perfect after many uses.",
//   "embedding": [0.1, 0.2, 0.3, ...]
// }

If you have many documents to add, you should use batchAdd:

embedbase.dataset(datasetId).batchAdd([{
  data: 'some text',
}])

For better performance, you can run these calls in parallel. For example, split the documents into batches and send them with Promise.all:

const batch = async <T, R>(myList: T[], fn: (chunk: T[]) => Promise<R>): Promise<R[]> => {
    const batchSize = 100;
    // split the list into groups of at most batchSize items
    const batches: T[][] = [];
    for (let i = 0; i < myList.length; i += batchSize) {
        batches.push(myList.slice(i, i + batchSize));
    }
    // run fn on every group in parallel and wait for all of them
    return Promise.all(batches.map((group) => fn(group)));
}
await batch(chunks, (chunk) => embedbase.dataset(datasetId).batchAdd(chunk))

Splitting and chunking large texts

AI models are often limited in the amount of text they can process at once. Embedbase provides a utility function to split large texts into smaller chunks. We highly recommend using this function. To split and chunk large texts, use the splitText function:

import { splitText } from 'embedbase-js/dist/main/split';
 
const text = 'some very long text...';
// ⚠️ note that the right value of maxTokens depends
// on the embedder used by Embedbase.
// With models such as OpenAI's embeddings model, you can
// use a maxTokens of 500. With other models, you may need to
// use a lower maxTokens value.
// (Embedbase cloud uses an OpenAI model at the moment) ⚠️
const maxTokens = 500
// chunkOverlap is the number of tokens that will overlap between chunks
// it is useful to have some overlap to ensure that the context is not
// cut off in the middle of a sentence.
const chunkOverlap = 200
splitText(text, { maxTokens: maxTokens, chunkOverlap: chunkOverlap }, async ({ chunk, start, end }) =>
    embedbase.dataset('some-data-set').add(chunk)
)
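
To make the interplay between maxTokens and chunkOverlap concrete, here is a rough sketch of overlapping chunking. It is not how splitText is implemented: it treats whitespace-separated words as tokens, whereas a real tokenizer counts differently, and chunkWords is a hypothetical name used only for this illustration.

```typescript
// Illustration only: words stand in for tokens; a real tokenizer differs.
function chunkWords(text: string, maxTokens: number, chunkOverlap: number): string[] {
  if (chunkOverlap >= maxTokens) {
    throw new Error('chunkOverlap must be smaller than maxTokens')
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0)
  const step = maxTokens - chunkOverlap // how far the window advances each time
  const pieces: string[] = []
  for (let start = 0; start < words.length; start += step) {
    pieces.push(words.slice(start, start + maxTokens).join(' '))
    if (start + maxTokens >= words.length) break // this window reached the end
  }
  return pieces
}

const pieces = chunkWords('a b c d e f g h', 4, 2)
console.log(pieces) // [ 'a b c d', 'c d e f', 'e f g h' ]
```

Each chunk repeats the last chunkOverlap words of the previous one, which is why a larger overlap produces more chunks for the same text.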

Check how we send our documentation to Embedbase to let you ask it questions through GPT-4.

Creating a "context"

createContext is very similar to .search, but it returns a list of strings instead of objects. This is useful if you want to feed the results directly to GPT.

// like .search, this takes a query and returns the matching documents as strings
const data = await embedbase
  .dataset('my-documentation')
  .createContext('my-context')
 
console.log(data)
// [
//  "Embedbase API allows to store unstructured data...",
//  "Embedbase API has 3 main functions a) provides a plug and play solution to store embeddings b) makes it easy to connect to get the right data into llms c)..",
//  "Embedbase API is self-hostable...",
// ]

Generating text

Under the hood, text generation uses OpenAI models. If you are interested in using other models, such as open-source ones, please contact us.

Remember that this counts toward your playground usage. For more information, head to the billing page.

const data = await embedbase
  .dataset('my-documentation')
  .createContext('my-context')
 
const question = 'How do I use Embedbase?'
const prompt =
`Based on the following context:\n${data.join('\n')}\nAnswer the user's question: ${question}`
 
for await (const res of embedbase.generate(prompt)) {
    console.log(res)
    // You, can, use, ...
}
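
Because models can only process a limited amount of text at once, it may help to cap how much context goes into the prompt. The sketch below reuses the prompt wording from the example above. buildPrompt and its character budget are hypothetical and not part of the SDK, and counting characters is only a crude stand-in for counting tokens.

```typescript
// Hypothetical helper: join context strings into a prompt, stopping before
// the context would exceed maxContextChars (a rough proxy for a token budget).
function buildPrompt(context: string[], question: string, maxContextChars: number): string {
  const kept: string[] = []
  let used = 0
  for (const piece of context) {
    if (used + piece.length > maxContextChars) break
    kept.push(piece)
    used += piece.length
  }
  return `Based on the following context:\n${kept.join('\n')}\nAnswer the user's question: ${question}`
}

// With a budget of 30 characters, only the first two passages fit.
const cappedPrompt = buildPrompt(
  ['first passage', 'second passage', 'third passage'],
  'How do I use Embedbase?',
  30
)
console.log(cappedPrompt)
```

The resulting string can be passed to generate in place of the hand-built prompt.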

You can also send the history, like in the OpenAI API:

const history = [
    {"role": "system", "content": "You are a helpful assistant that teaches how to use Embedbase"},
    {"role": "user", "content": "How can I connect my data to LLMs using Embedbase?"},
    {"role": "assistant", "content": "You can npm i embedbase-js, write two lines of code, and..."},
    {"role": "user", "content": "how can i do this with my notion pages now using their api?"},
]
 
// ...
for await (const res of embedbase.generate(prompt, {history})) {
    console.log(res)
    // You, need, to, query, Notion, API, like, so:, ...
}
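
A history grows with every turn, so long conversations may need trimming before they are sent. The following is a minimal sketch under assumptions: the Message shape mirrors the history entries above, and trimHistory is a hypothetical helper, not an SDK function. It keeps every system message plus the most recent messages.

```typescript
type Role = 'system' | 'user' | 'assistant'

interface Message {
  role: Role
  content: string
}

// Hypothetical helper: keep all system messages and the last maxTurns other messages.
function trimHistory(history: Message[], maxTurns: number): Message[] {
  const system = history.filter((m) => m.role === 'system')
  const rest = history.filter((m) => m.role !== 'system')
  // Math.max guards the case where maxTurns exceeds the number of messages.
  return [...system, ...rest.slice(Math.max(0, rest.length - maxTurns))]
}

const fullHistory: Message[] = [
  { role: 'system', content: 'You are a helpful assistant that teaches how to use Embedbase' },
  { role: 'user', content: 'How can I connect my data to LLMs using Embedbase?' },
  { role: 'assistant', content: 'You can npm i embedbase-js, write two lines of code, and...' },
  { role: 'user', content: 'how can i do this with my notion pages now using their api?' },
]

const trimmed = trimHistory(fullHistory, 2)
console.log(trimmed.length) // 3
```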

Adding metadata

Adding metadata can be useful. For example, if you are feeding an LLM like ChatGPT, a typical best practice is to add the source of the text, such as a URL, as metadata. You can then ask the AI to add links or footnotes in its output.

// metadata can be anything you want; it will appear in the search results later
const data = await embedbase.dataset('test-amazon-product-reviews').add(`
  <div>
    <span>Lightweight. Telescopic. Easy zipper case for storage. Didn't put in dishwasher. Still perfect after many uses.</span>
`, {path: 'https://www.amazon.com/dp/B00004OCNS'})
 
console.log(data)
//
// {
//   "id": "eiew823",
//   "data": "Lightweight. Telescopic. Easy zipper case for storage.
//          Didn't put in dishwasher. Still perfect after many uses.",
//   "metadata": {"path": "https://www.amazon.com/dp/B00004OCNS"},
//   "embedding": [0.1, 0.2, 0.3, ...]
// }
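
To let the model cite its sources, as suggested above, one option is to number each search result and list its path next to the text before building the prompt. This is a sketch under assumptions: the result shape is inferred from the sample outputs, and formatWithSources is a hypothetical helper, not part of the SDK.

```typescript
// Shape inferred from the sample outputs; not an official SDK type.
interface ResultWithMetadata {
  data: string
  metadata?: { path?: string }
}

// Hypothetical helper: number each result and attach its source path,
// falling back to a placeholder when no path was stored.
function formatWithSources(results: ResultWithMetadata[]): string {
  return results
    .map((r, i) => {
      const source = r.metadata?.path ?? 'unknown source'
      return `[${i + 1}] ${r.data}\nSource: ${source}`
    })
    .join('\n\n')
}

const formatted = formatWithSources([
  { data: 'Lightweight. Telescopic.', metadata: { path: 'https://www.amazon.com/dp/B00004OCNS' } },
  { data: 'No metadata here.' },
])
console.log(formatted)
// [1] Lightweight. Telescopic.
// Source: https://www.amazon.com/dp/B00004OCNS
//
// [2] No metadata here.
// Source: unknown source
```

The numbered markers give the model something stable to reference when asked for footnotes.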

Listing datasets

const data = await embedbase.datasets()
console.log(data)
// [{"datasetId": "test-amazon-product-reviews", "documentsCount": 2}]

Create a recommendation engine

Check out this tutorial.

Contributing

We welcome contributions to Embedbase.

If you have any feedback or suggestions, please open an issue or join our Discord to discuss your ideas.
