Retrieve Content

Your organization is likely to have data in many different formats, distributed across many different repositories. IDOL can help you make the most of your data, but the first step is to retrieve and enrich the data. This is called ingestion. Ingestion includes all of the processing that takes place before documents are indexed.

The ingestion process might include the following steps:

  • Connect to a data repository and extract files and metadata.
  • Extract the contents of container files such as zip files.
  • Detect the format of each file and route it through appropriate processing tasks. Most files are filtered by IDOL KeyView, which extracts any text contained within the file. However, if a connector extracts an image file, you might want to run Optical Character Recognition (OCR) to extract text from the image. If a connector extracts an audio or video file, you might want to determine whether the file contains speech and, if so, transcribe the speech into text.
  • Tag or categorize documents based on the information that they contain.
  • Discard documents that do not contain useful information.
  • Standardize metadata field names, so that documents retrieved from different data repositories have common properties such as a last modified date or author name in the same metadata fields.
  • Index the resulting information into your IDOL index.
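
To make the order of these steps concrete, the following Python sketch models a few of them (format-based routing, discarding empty documents, and standardizing metadata field names) in plain code. It is an illustration only and does not use any IDOL API. The Document class, the FIELD_ALIASES mapping, the task names, and the extension-based format check are all hypothetical stand-ins for what connectors, KeyView, and the other processing tasks do in a real pipeline.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical mapping from repository-specific field names to common names.
FIELD_ALIASES = {
    "modified": "last_modified",
    "LastModifiedTime": "last_modified",
    "creator": "author",
    "Author": "author",
}


@dataclass
class Document:
    """Hypothetical in-memory document: a file name, extracted text, metadata."""
    name: str
    text: str = ""
    metadata: dict = field(default_factory=dict)


def choose_task(doc: Document) -> str:
    """Pick a processing task from the file extension.

    Real format detection inspects file content; the extension check here
    is a simplification for the sketch.
    """
    suffix = PurePosixPath(doc.name).suffix.lower()
    if suffix in {".png", ".jpg", ".tif"}:
        return "ocr"
    if suffix in {".wav", ".mp3", ".mp4"}:
        return "speech_to_text"
    return "text_filter"


def standardize_metadata(doc: Document) -> Document:
    """Rename known aliases so every document uses the same field names."""
    doc.metadata = {FIELD_ALIASES.get(k, k): v for k, v in doc.metadata.items()}
    return doc


def ingest(docs: list[Document]) -> list[Document]:
    """Return the documents that would be sent to the index."""
    to_index = []
    for doc in docs:
        task = choose_task(doc)
        if not doc.text.strip():
            # Discard documents that do not contain useful information.
            continue
        standardize_metadata(doc)
        doc.metadata["processing_task"] = task
        to_index.append(doc)
    return to_index
```

For example, passing a word-processing document with a "creator" field, an image with no extracted text, and an audio file with a "LastModifiedTime" field to ingest() returns two documents: the image is discarded, and the remaining documents carry "author" and "last_modified" fields respectively.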

The IDOL product suite provides multiple ways to ingest information:

  • IDOL NiFi Ingest. You can create an ingestion pipeline using IDOL NiFi Ingest, a set of IDOL components for data retrieval and enrichment that run within an open-source platform called Apache NiFi. With IDOL NiFi Ingest, you can run your IDOL Connectors and data processing tasks inside Apache NiFi, and index documents into your IDOL index. IDOL NiFi Ingest is the recommended way to ingest data.
  • IDOL Connectors and Connector Framework Server. You can run IDOL Connectors as standalone servers, which retrieve information from data repositories and send it to an IDOL Connector Framework Server (CFS). CFS processes documents and then indexes them into your IDOL index.

The following topics provide more information about ingestion: