Batch process all your records to store structured outputs in KDB.AI.

The requirements are as follows.

  • A KDB.AI Cloud or server instance. Sign up for KDB.AI Cloud: Starter Edition, or set up KDB.AI Server.

  • The instance’s endpoint URL. Get the KDB.AI Cloud endpoint URL, or get the KDB.AI Server endpoint URL.

  • An API key. Create the API key.

  • The name of the target table to access. Create the table.

    KDB.AI requires the target table to have a defined schema before Unstructured can write to it. The recommended table schema for Unstructured contains the fields id, element_id, document, metadata, and embeddings. The following example uses the KDB.AI Client for Python to create a table with this recommended schema, along with a flat vector index on the embeddings column that has 3072 dimensions. Set dims to match the output size of the embedding model that you use:

    Python
    import kdbai_client as kdbai
    import os
    
    session = kdbai.Session(
        endpoint=os.getenv("KDBAI_ENDPOINT"),
        api_key=os.getenv("KDBAI_API_KEY")
    )
    
    db = session.database("default")
    
    schema = [
        {
            "name": "id",
            "type": "str"
        },
        {
            "name": "element_id",
            "type": "str"
        },
        {
            "name": "document",
            "type": "str"
        },
        {
            "name": "metadata", 
            "type": "general"
        },
        {
            "name": "embeddings",
            "type": "float32s"
        }
    ]
    
    indexes = [ 
        {
            "name": "vectorIndex",
            "type": "flat", 
            "params": {
                "dims": 3072,
                "metric": "L2"
            },
            "column": "embeddings"
        }
    ]
    
    table = db.create_table(
        table=os.getenv("KDBAI_TABLE"),
        schema=schema,
        indexes=indexes
    )
    
    print(f"The table named '{table.name}' now exists.")
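
    Before writing data, it can help to confirm that a record matches the schema above. The following is a minimal sketch, not part of the KDB.AI client or of Unstructured: the field names and the 3072-dimension value come from the example above, while the check_record helper is hypothetical.

```python
# Field names and the dimension count mirror the table-creation example.
# check_record is a hypothetical helper, not a library function.
EXPECTED_FIELDS = ("id", "element_id", "document", "metadata", "embeddings")
INDEX_DIMS = 3072


def check_record(record: dict, dims: int = INDEX_DIMS) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = [
        f"missing field: {name}" for name in EXPECTED_FIELDS if name not in record
    ]
    embeddings = record.get("embeddings")
    # The vector length must equal the "dims" value of the index.
    if embeddings is not None and len(embeddings) != dims:
        problems.append(
            f"embeddings has {len(embeddings)} dimensions, expected {dims}"
        )
    return problems


example = {
    "id": "1",
    "element_id": "e1",
    "document": "Some text.",
    "metadata": {"filename": "example.pdf"},
    "embeddings": [0.0] * 3072,
}
print(check_record(example))
```

    An empty list means the record has every expected field and an embedding vector of the expected length.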
    

The KDB.AI connector dependencies:

CLI, Python
pip install "unstructured-ingest[kdbai]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

  • KDBAI_ENDPOINT - The KDB.AI instance’s endpoint URL, represented by --endpoint (CLI) or endpoint (Python).
  • KDBAI_API_KEY - The KDB.AI API key, represented by --api-key (CLI) or api_key (Python).
  • KDBAI_TABLE - The name of the target table, represented by --table-name (CLI) or table_name (Python).
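
A quick way to catch a missing setting before running an ingest job is to check these variables up front. This is a minimal sketch that uses only the Python standard library. The variable names come from the list above, and the missing_vars helper is hypothetical.

```python
import os

# Variable names come from the list above.
REQUIRED_VARS = ("KDBAI_ENDPOINT", "KDBAI_API_KEY", "KDBAI_TABLE")


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_vars()
if missing:
    print("Set these environment variables first: " + ", ".join(missing))
else:
    print("All KDB.AI environment variables are set.")
```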

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported source connector. This example uses the local source connector.

By default, this example sends files to Unstructured API services for processing. To process files locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  kdbai \
    --endpoint $KDBAI_ENDPOINT \
    --api-key $KDBAI_API_KEY \
    --table-name $KDBAI_TABLE
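
The value passed to --additional-partition-args in the command above is a JSON object with hand-escaped quotes. As an alternative, the following sketch generates the same JSON with the Python standard library and quotes it for a POSIX shell. The keys and values are copied from the command; the variable names are illustrative only.

```python
import json
import shlex

# Same keys and values as in the CLI example.
additional_partition_args = {
    "split_pdf_page": "true",
    "split_pdf_allow_failed": "true",
    "split_pdf_concurrency_level": 15,
}

# Serialize to JSON, then quote the result so that a POSIX shell
# passes it through as a single argument.
arg_value = json.dumps(additional_partition_args)
print("--additional-partition-args=" + shlex.quote(arg_value))
```

Paste the printed text into the command in place of the hand-escaped option.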

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

  • To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

    Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
  • To send files to Unstructured API services for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

    Unstructured API services also require an Unstructured API key and API URL. Provide them by adding the following:

    • --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
    • --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
    • The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.

    Get an API key and API URL.
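
The two cases above can be summarized in a small helper that returns the Python parameters for each processing mode. This is a sketch under stated assumptions: the parameter names partition_by_api, api_key, and partition_endpoint come from the text above, while the partition_settings function is hypothetical, and how you pass its result to the Unstructured Ingest Python library is not shown here.

```python
import os


def partition_settings(use_api: bool, env=None) -> dict:
    """Return partitioning parameters for local or API-based processing."""
    env = os.environ if env is None else env
    if not use_api:
        # Local processing needs no Unstructured API key or API URL.
        return {"partition_by_api": False}
    return {
        "partition_by_api": True,
        "api_key": env.get("UNSTRUCTURED_API_KEY"),
        "partition_endpoint": env.get("UNSTRUCTURED_API_URL"),
    }


print(partition_settings(use_api=False))
```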