Concepts

Ingestion is the term that Unstructured uses for the set of activities that happen when files are input for processing. Ingestion enables multiple files to be processed as a batch.

You can perform ingestion with the following tools:

  • The Unstructured Platform, a no-code user interface with unlimited pay-as-you-go usage, which gets all of your data ready for Retrieval Augmented Generation (RAG) and model fine-tuning.
  • The Unstructured Ingest CLI, with unlimited pay-as-you-go and limited free options, which enables you to use command-line scripts to get all of your data ready for RAG and model fine-tuning.
  • The Unstructured Ingest Python library, with unlimited pay-as-you-go and limited free options, which enables you to use Python code to get all of your data ready for RAG and model fine-tuning.

The Unstructured Python SDK and Unstructured JavaScript/TypeScript SDK can process only one file at a time.

Files are ingested from an originating source location. Each batch of ingested files is processed either entirely by Unstructured or entirely locally. The processed data is sent to a target destination location. The kinds of locations you can specify vary by tool:

When you use the Unstructured Platform, the source and destination must both be in cloud storage. Local source and local destination locations are not allowed.

The Unstructured Platform enables you to connect to many kinds of sources and destinations.

If you use the Unstructured Ingest CLI or the Unstructured Ingest Python library, the source and the destination can each be a cloud storage location or a local location.

Unstructured provides many source and destination connectors.
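
The location rules above can be summarized in a few lines of Python. This is only an illustrative sketch: the tool labels ("platform", "ingest_cli", "ingest_python") and the function name are hypothetical and are not part of any Unstructured API.

```python
# Location kinds described in this section.
CLOUD = "cloud"
LOCAL = "local"

# Hypothetical labels for the three tools; the allowed location kinds
# restate the rules given in the text above.
ALLOWED_LOCATIONS = {
    "platform": {CLOUD},             # cloud storage only
    "ingest_cli": {CLOUD, LOCAL},    # cloud storage or local
    "ingest_python": {CLOUD, LOCAL}, # cloud storage or local
}


def locations_allowed(tool: str, source: str, destination: str) -> bool:
    """Return True if the tool accepts this source/destination pairing."""
    allowed = ALLOWED_LOCATIONS[tool]
    return source in allowed and destination in allowed


if __name__ == "__main__":
    print(locations_allowed("platform", LOCAL, CLOUD))    # local source on the Platform
    print(locations_allowed("ingest_cli", LOCAL, CLOUD))  # local source with the CLI
```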

Ingestion options for the Unstructured service

This is the flow for sending files to Unstructured for processing and the processed data being delivered by Unstructured:

  • This flow always happens for the Unstructured Platform. The Platform only allows sending files from cloud storage and sending processed data to cloud storage.

  • For the Unstructured Ingest CLI or the Unstructured Ingest Python library, to use this flow:

    • When using the Unstructured Ingest CLI, include the --partition-by-api option and set --api-key and --partition-endpoint to a valid, matching Unstructured API key and API URL, respectively.
    • When using the Unstructured Ingest Python library, set partition_by_api=True, and set api_key and partition_endpoint to a valid, matching Unstructured API key and API URL, respectively.
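
As a rough sketch of the Python settings described above, the following helper collects the three values into a dictionary. The environment variable names UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL and the function name are assumptions made for illustration. Only the setting names partition_by_api, api_key, and partition_endpoint come from the text above.

```python
import os
from typing import Dict, Mapping, Union


def api_partitioner_kwargs(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[bool, str]]:
    """Build the settings that select processing by Unstructured.

    The environment variable names below are assumptions for this
    sketch; use whatever names your project defines.
    """
    api_key = env.get("UNSTRUCTURED_API_KEY")
    api_url = env.get("UNSTRUCTURED_API_URL")
    if not api_key or not api_url:
        # The key and URL must both be present and must match each other.
        raise ValueError("Both the API key and the API URL must be set.")
    return {
        "partition_by_api": True,
        "api_key": api_key,
        "partition_endpoint": api_url,
    }
```

The resulting dictionary could then be unpacked into whichever partitioner configuration object your installed version of the library expects.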

Local ingestion options

This is the flow for processing files locally. No files are sent to Unstructured for processing:

  • This flow never happens for the Unstructured Platform. The Platform does not allow files to be sent from a local source to Unstructured, or processed data to be sent by Unstructured to a local destination.

  • For the Unstructured Ingest CLI or the Unstructured Ingest Python library, to use this flow:

    • When using the Unstructured Ingest CLI, omit the --partition-by-api, --api-key, and --partition-endpoint options.
    • When using the Unstructured Ingest Python library, omit partition_by_api or explicitly set partition_by_api=False. Also omit api_key and partition_endpoint.
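
For the CLI, the choice between processing by Unstructured and processing locally comes down to whether the three partition options are present. The helper below is a minimal sketch of that rule. The function name is hypothetical, and only the option names --partition-by-api, --api-key, and --partition-endpoint are taken from this documentation.

```python
from typing import List, Optional


def partition_flags(
    api_key: Optional[str] = None,
    partition_endpoint: Optional[str] = None,
) -> List[str]:
    """Return the CLI options that select the processing flow.

    With no arguments, the list is empty and files are processed locally.
    With both arguments, files are sent to Unstructured for processing.
    """
    if api_key is None and partition_endpoint is None:
        return []  # local flow: omit all three options
    if not api_key or not partition_endpoint:
        raise ValueError("Set both the API key and the API URL, or neither.")
    return [
        "--partition-by-api",
        "--api-key", api_key,
        "--partition-endpoint", partition_endpoint,
    ]
```

The returned list is meant to be appended to the rest of an unstructured-ingest command line.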

Unstructured Ingest CLI

The Unstructured Ingest CLI enables you to use command-line scripts to get all of your data ready for RAG and model fine-tuning.

One approach to using the CLI is installing Python and then running the following command to install the CLI:

pip install unstructured-ingest

This default installation option enables the ingestion of file types that do not require any extra dependencies: plain text, HTML, XML, JSON, and emails. It also enables you to specify local source and destination locations.

You might also need to install additional dependencies, depending on your needs. Learn more.

For additional installation options, see:

To display the list of available source connector commands, run the following command:

unstructured-ingest --help

To display the list of available destination connector commands, run the following command:

unstructured-ingest local --help

To display help for a specific source connector command, run the following command:

unstructured-ingest <command-name> --help

To display help for a specific destination connector command, run the following command:

unstructured-ingest local <command-name> --help
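
The four help commands above follow one pattern, which the following sketch assembles from Python. The function name is hypothetical, and the connector command name "s3" in the usage section is an assumed example; substitute a name that your installation lists.

```python
import shutil
import subprocess
from typing import List, Optional

CLI = "unstructured-ingest"


def help_command(kind: str, name: Optional[str] = None) -> List[str]:
    """Build a help command for source or destination connectors.

    Without a name, the command lists the available connector commands.
    With a name, it shows help for that specific connector command.
    """
    if kind not in ("source", "destination"):
        raise ValueError("kind must be 'source' or 'destination'")
    cmd = [CLI]
    if kind == "destination":
        # Destination commands are shown after the local source command.
        cmd.append("local")
    if name:
        cmd.append(name)
    cmd.append("--help")
    return cmd


if __name__ == "__main__":
    cmd = help_command("source", "s3")  # "s3" is an assumed example name
    print(" ".join(cmd))
    if shutil.which(CLI):  # run only if the CLI is installed
        subprocess.run(cmd, check=False)
```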

To begin using the CLI, see the CLI quickstarts.

To migrate from older, deprecated versions of the Ingest CLI that used pip install unstructured, see the migration guide.

Unstructured Ingest Python library

The Unstructured Ingest Python library enables you to use Python code to get all of your data ready for RAG and model fine-tuning.

The following 3-minute video shows how to use the Unstructured Ingest Python library to send multiple PDFs from a local directory in batches to be ingested by Unstructured API services for processing:

One approach to using the Unstructured Ingest Python library is installing Python and then running the following command to install the library and the default connectors:

pip install unstructured-ingest

This default installation option enables the ingestion of file types that do not require any extra dependencies: plain text, HTML, XML, JSON, and emails. It also enables you to specify local source and destination locations.

You might also need to install additional dependencies, depending on your needs. Learn more.

For additional installation options, see:

Some source and destination connectors provide newer v2 and older v1 implementations, while some provide only older v1 implementations. You should use the v2 implementations wherever they are available, to help ensure better forward-compatibility of your code. For the lists of available v2 and v1 connectors, see:

To begin using the Unstructured Ingest Python library, see the code examples for the source and destination connectors.

To migrate from older, deprecated versions of the Ingest Python library that used pip install unstructured, see the migration guide.

Generate Python code examples

You can connect any available source connector to any available destination connector. However, the source connector code examples in the documentation show connecting only to the local destination connector. Similarly, the destination connector code examples in the documentation show connecting only to the local source connector.

To quickly generate an Unstructured Ingest Python library code example that connects any available source connector to any available destination connector, do the following:

  1. Open the Unstructured Ingest Code Generator webpage.

  2. Select your input (source) location type from the Get unstructured documents from drop-down list.

  3. Select your output (destination) location type from the Upload RAG-ready documents to drop-down list.

  4. Select your chunking strategy from the Chunking strategy drop-down list:

    • None - Do not chunk the data elements’ content.
    • basic - Combine sequential data elements to maximally fill each chunk. However, do not mix Table and non-Table elements in the same chunk.
    • by_title - Use the basic strategy and also preserve section boundaries. Optionally preserve page boundaries as well.
    • by_page - Use the basic strategy and also preserve page boundaries.
    • by_similarity - Use the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks. This strategy is available only when calling Unstructured API services.

    To learn more, see Chunking strategies and Chunking configuration.

  5. For any chunking strategy other than None:

    • Enter your chunk size in the Chunk size (characters) box, or leave the default of 1000 characters.
    • If you need the chunks to overlap, enter the chunk overlap size in the Chunk overlap (characters) box, or leave the default of 20 characters.

    To learn more, see Chunking configuration.

  6. To generate vector embeddings, select the provider in the Embedding provider drop-down list.

    To learn more, see Embedding configuration.

  7. Click Generate code.

  8. Copy the example code from the Generated Code pane into your code project.

  9. The code example will contain one or more environment variables that you must set for the code to run correctly. To learn what to set these variables to, click the documentation links that are below the Generated Code pane.
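
Because the generated code depends on environment variables (step 9), it can help to check for them before running it. The following is a small sketch of such a check. The variable names shown are placeholders; use the names that appear in your generated code.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    required: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Placeholder names; replace them with the variables your code reads.
    required = ["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL"]
    missing = missing_env_vars(required)
    if missing:
        print("Set these environment variables first: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```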

Migration guide

The older unstructured versions of the Unstructured Ingest CLI and Unstructured Ingest Python library have been replaced and are now deprecated.

To migrate to the newer unstructured-ingest versions of the Ingest CLI and Ingest Python library, do the following:

  1. If you previously ran pip install unstructured only for the purposes of using the Ingest CLI or the Ingest Python library, upgrade to the latest versions by running the following commands:

    a. pip uninstall unstructured
    b. pip install unstructured-ingest

  2. If you previously installed an older version of a source or destination connector, for example pip install "unstructured[azure]" for the Azure Storage connector, upgrade to the latest version by running the following commands:

    a. pip uninstall "unstructured[azure]"
    b. pip install "unstructured-ingest[azure]"

  3. If you were running Python code against an older version of the Ingest Python library, update your import statements by replacing all instances of unstructured.ingest with unstructured_ingest to run against the latest version.
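
The import change in step 3 can be applied with a short script. The sketch below rewrites only import statements that begin with unstructured.ingest and leaves other unstructured imports alone. The function name and the module paths in the usage example are illustrative. The script does not account for any modules that may have been moved or renamed between versions, so review the result before committing it.

```python
import re

# Matches "from unstructured.ingest..." or "import unstructured.ingest..."
# at the start of a line, allowing for leading indentation.
_OLD_IMPORT = re.compile(
    r"^([ \t]*(?:from|import)[ \t]+)unstructured\.ingest\b",
    re.MULTILINE,
)


def migrate_imports(source: str) -> str:
    """Replace the old package path with the new one in import lines."""
    return _OLD_IMPORT.sub(r"\1unstructured_ingest", source)


if __name__ == "__main__":
    old_code = (
        "from unstructured.ingest.interfaces import ProcessorConfig\n"
        "from unstructured.partition.auto import partition\n"
    )
    print(migrate_imports(old_code))
```

In this usage example, only the first import line is rewritten, because the second line does not refer to the unstructured.ingest package.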