Import tasks are processing tasks that CFS performs on documents before they are indexed into IDOL Server. Import tasks enable you to manipulate and enrich the documents that CFS creates.
CFS includes import tasks that meet common processing requirements. For example, there are import tasks to filter advertisements out of HTML files, or to divide document content into shorter sections.
You can use the IdxWriter and XmlWriter tasks to write documents to disk in IDOL IDX or XML format. This lets you view the information that is being indexed into IDOL Server, so that you can check that it is indexed as you expect. If necessary, you can then use other import tasks or custom Lua scripts to manipulate and enrich the information.
The CsvWriter and JsonWriter tasks write documents to disk in CSV or JSON format. You can also use the SqlWriter task to write document metadata and content to disk in the form of SQL "insert" statements, so that you can insert the information from the documents into a database.
You can use import tasks to enrich documents, without needing to write custom scripts. For example, you can:
use the HtmlExtraction import task to extract the meaningful content from HTML, and discard advertisements, headers, and sidebars.
use the Sectioner import task to divide document content into shorter sections. Dividing a document can result in more relevant query results, because IDOL can return a specific part of a document in response to a query.
use the Eduction import task to run Eduction, which extracts entities (such as names, addresses, and phone numbers) from the document text.
use the IdolSpeech import task to extract speech from audio and video files, and write a transcription of the speech to the document content. IDOL Server can then use the transcription for retrieval, clustering, and other operations.
use the ImageServerAnalysis import task to run analysis on image files. You can run analysis tasks such as optical character recognition (OCR), object detection, and face recognition. The results of the analysis are written into the document.
You can use import tasks to reject documents that you do not want to index into IDOL Server. For example, the BadFilesFilter task rejects documents that do not contain valid content. When a document is rejected, it is not processed further and is not indexed into IDOL Server. However, you can index the document into an IDOL Server that has been configured to handle failed documents.
The Lua task runs a Lua script. Lua is an embeddable scripting language that you can use to manipulate documents and define custom processing rules. CFS includes Lua functions for manipulating documents and running other tasks. For example, you can add, modify, or remove fields and their values.
Import tasks are configured in the [ImportTasks] section of the CFS configuration file.
You can run import tasks before or after documents are processed by KeyView. Pre import tasks run before KeyView processing. Post import tasks run after KeyView processing.
Pre import tasks are often used to control processing.
Post import tasks are often used to write a document to disk (in either IDX or XML format). You then have a backup copy of the content that is indexed into your IDOL Server, and you can see how the data is sent to IDOL.
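As a rough illustration of the configuration described above, the fragment below sketches an [ImportTasks] section with one pre import task and two post import tasks. It assumes that each task is named with a Pre or Post prefix followed by a number starting at zero, and that any parameter follows the task name after a colon. Check the CFS reference for your version before relying on this. The script name and the output paths are hypothetical.

```
[ImportTasks]
// Runs before KeyView processing. The script path is a hypothetical example.
Pre0=Lua:scripts/check_document.lua
// These run after KeyView processing, in numerical order. Both output paths are hypothetical.
Post0=IdxWriter:output/documents.idx
Post1=XmlWriter:output/documents.xml
```

Both writers are shown only to illustrate the numbering. In practice one writer is usually enough to inspect what is sent to IDOL Server.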