Ingest
The Unstructured Ingest Python library coordinates the process of pulling data from data providers, partitioning the content, and pushing the resulting content to a desired location. This documentation covers the library's features, architecture, installation, configuration, usage, API reference, troubleshooting, examples, and more.
The following 3-minute video shows how to use the Unstructured Ingest Python library to send multiple PDFs from a local directory in batches to be ingested by Unstructured API services for processing:
The following 5-minute video goes into more detail about the various components of the Unstructured Ingest Python library:
Library Documentation
- Source connectors: Connect to your favorite data storage platforms for effortless batch processing of your files.
- Destination connectors: Connect to your favorite data storage platforms to write your ingest results to.
- Configs: The configurations used when generating an ingest process.
Features
The Unstructured Ingest CLI and the Unstructured Ingest Python library offer the following key features:
- Data Ingestion: Facilitates the ingestion of data from various sources, such as databases, APIs, files, or streaming services.
- Partitioning: Efficiently partitions data to extract relevant text data.
- Customization: Allows users to define data sources, ingestion processes, and destination targets.
- Fault Tolerance: Provides mechanisms for handling errors and retries during data ingestion.
- Scalability: Scales horizontally to accommodate large volumes of data.
- Logging: Offers comprehensive logging and monitoring capabilities to track the ingestion process.
Architecture
The Unstructured Ingest Python library follows a modular architecture comprising the following components:
- Source Connectors: These components are responsible for fetching data from external sources, which can include databases, web services, file systems, or data streams.
- Partitioning Engine: This component optimally partitions the incoming data into dedicated Elements for processing and distribution.
- Reformatters: Optional steps that manipulate the partitioned content output, such as chunking and adding embeddings.
- Destination Connectors: These components send the partitioned data to the desired destination, which could be a database, data warehouse, cloud storage, or any other user-defined target.

The library’s modular architecture provides flexibility and extensibility, allowing users to integrate custom components and adapt the library to their specific needs.
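The flow between these components can be pictured with a small, self-contained sketch. It does not use the Unstructured Ingest API: `Element`, `run_pipeline`, and the stage functions below are hypothetical names chosen for illustration, and the "partitioner" simply splits text on blank lines. It is only meant to show how a source, a partitioning step, optional reformatters, and a destination might be chained.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Element:
    """One partitioned piece of a document (hypothetical, simplified shape)."""
    type: str
    text: str
    metadata: dict = field(default_factory=dict)


def run_pipeline(
    source: Callable[[], Iterable[str]],
    partition: Callable[[str], List[Element]],
    reformatters: List[Callable[[List[Element]], List[Element]]],
    destination: Callable[[List[Element]], None],
) -> int:
    """Pull each raw document, partition it, reformat it, and push it onward."""
    processed = 0
    for raw in source():
        elements = partition(raw)
        for reformat in reformatters:  # optional steps, applied in order
            elements = reformat(elements)
        destination(elements)
        processed += 1
    return processed


# Illustrative stand-ins for real connectors and processing steps.
def list_source() -> Iterable[str]:
    return ["Title\n\nFirst paragraph.", "Another doc."]


def paragraph_partitioner(raw: str) -> List[Element]:
    return [
        Element(type="NarrativeText", text=part.strip())
        for part in raw.split("\n\n")
        if part.strip()
    ]


def add_length(elements: List[Element]) -> List[Element]:
    for element in elements:
        element.metadata["length"] = len(element.text)
    return elements


collected: List[Element] = []
count = run_pipeline(list_source, paragraph_partitioner, [add_length], collected.extend)
print(f"Processed {count} documents into {len(collected)} elements")
```

Because each stage is a plain callable, any one of them can be swapped without touching the others, which is the property the modular design is after.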
Installation
To install the Unstructured Ingest CLI and the Unstructured Ingest Python library, follow these steps:
- Run pip install unstructured-ingest to install the latest version of the Ingest CLI and Ingest Python library.
- For specific connectors, run pip install "unstructured-ingest[CONNECTOR_DEPS]", where CONNECTOR_DEPS references the extra dependency label for a particular connector. For example, pip install "unstructured-ingest[s3]" installs the dependencies needed to interact with the S3 connectors. If these are not installed beforehand, an error message is printed when you run the Unstructured Ingest CLI for the first time, prompting you with the correct pip command to run.
- Once installed, you can run unstructured-ingest --help to list all the available commands.
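To confirm from Python that the package is present in the active environment, a check along the following lines may help. It relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative choice, not part of Unstructured Ingest.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "unstructured-ingest") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("unstructured-ingest is not installed; run: pip install unstructured-ingest")
else:
    print(f"unstructured-ingest {version} is installed")
```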
If you previously installed Ingest by using pip install unstructured, see the migration guide.
Configuration
The Unstructured Ingest Python library requires configuration to define data sources, ingestion processes, and destination targets. For the CLI, configuration is done through the supported CLI parameters. When the library is run from Python, the parameters exposed in the CLI map to Python config classes, which are described in more detail in the configs section.
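The relationship between CLI parameters and config classes can be illustrated with a minimal sketch. The class `ExamplePartitionConfig`, its fields, and the helper `to_cli_args` are hypothetical and exist only for this example; the library's actual config classes and their fields are documented in the configs section.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ExamplePartitionConfig:
    """Hypothetical config class; the field names are illustrative only."""
    strategy: str = "auto"
    partition_by_api: bool = False
    api_key: Optional[str] = None


def to_cli_args(config) -> List[str]:
    """Render a config object as the equivalent list of CLI-style arguments."""
    args: List[str] = []
    for f in fields(config):
        value = getattr(config, f.name)
        flag = "--" + f.name.replace("_", "-")
        if isinstance(value, bool):
            if value:  # boolean fields become bare flags when enabled
                args.append(flag)
        elif value is not None:  # unset optional fields are omitted
            args.extend([flag, str(value)])
    return args


config = ExamplePartitionConfig(strategy="hi_res", partition_by_api=True)
cli_args = to_cli_args(config)
print(" ".join(cli_args))
```

The point of the sketch is the one-to-one naming: a snake_case field on the Python side corresponds to a kebab-case parameter on the CLI side.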
Generate Python code examples
You can connect any available source connector to any available destination connector. However, the source connector code examples in the documentation show connecting only to the local destination connector. Similarly, the destination connector code examples in the documentation show connecting only to the local source connector.
To quickly generate an Unstructured Ingest Python library code example that connects any available source connector to any available destination connector, do the following:
- Open the Unstructured Ingest Code Generator webpage.
- Select your input (source) location type from the Get unstructured documents from drop-down list.
- Select your output (destination) location type from the Upload RAG-ready documents to drop-down list.
- Select your chunking strategy from the Chunking strategy drop-down list:
  - None - Do not chunk the data elements’ content.
  - basic - Combine sequential data elements to maximally fill each chunk. However, do not mix Table and non-Table elements in the same chunk.
  - by_title - Use the basic strategy and also preserve section boundaries. Optionally preserve page boundaries as well.
  - by_page - Use the basic strategy and also preserve page boundaries.
  - by_similarity - Use the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks. This strategy is available only when calling Unstructured API services.
  To learn more, see Chunking strategies and Chunking configuration.
- For any chunking strategy other than None:
  - Enter your chunk size in the Chunk size (characters) box, or leave the default of 1000 characters.
  - If you need the chunks to overlap, enter the chunk overlap size in the Chunk overlap (characters) box, or leave the default of 20 characters.
  To learn more, see Chunking configuration.
- To generate vector embeddings, select the provider in the Embedding provider drop-down list.
  To learn more, see Embedding configuration.
- Click Generate code.
- Copy the example code from the Generated Code pane into your code project.
- The code example will contain one or more environment variables that you must set for the code to run correctly. To learn what to set these variables to, click the documentation links that are below the Generated Code pane.
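To make the basic chunking strategy from the steps above more concrete, here is a rough sketch of the idea: sequential elements are packed into a chunk until the next one would exceed the chunk size, and Table elements are never combined with other elements. This is not Unstructured's implementation. The function `basic_chunks`, the `(type, text)` tuple shape, and the blank-line separator are assumptions made for illustration, and the sketch leaves out chunk overlap and the splitting of single elements that are longer than the chunk size.

```python
from typing import List, Tuple

# (element type, text): a hypothetical, simplified stand-in for a data element.
Element = Tuple[str, str]


def basic_chunks(elements: List[Element], max_characters: int = 1000) -> List[str]:
    """Greedily pack sequential non-Table elements; keep each Table on its own."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
        current, current_len = [], 0

    for kind, text in elements:
        if kind == "Table":
            flush()              # never mix Table and non-Table elements
            chunks.append(text)
            continue
        added = len(text) + (2 if current else 0)  # 2 accounts for the separator
        if current and current_len + added > max_characters:
            flush()
            added = len(text)
        current.append(text)
        current_len += added
    flush()
    return chunks


elements = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 10),
    ("Table", "<table>"),
    ("NarrativeText", "b" * 8),
    ("NarrativeText", "c" * 8),
]
chunks = basic_chunks(elements, max_characters=20)
for chunk in chunks:
    print(repr(chunk))
```

With a 20-character limit, the title and first paragraph share a chunk, the table stands alone, and the last two paragraphs share a third chunk.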