Connectors, including the IDOL Web Connector, can send documents for ingestion that have associated HTML files.
You can send the HTML files to the KeyViewFilterDocument processor, which discards the HTML markup and extracts the text contained in the file. However, HTML pages often contain irrelevant content such as invalid HTML, headers, sidebars, advertisements, and scripts. This content carries no useful information and can pollute the IDOL index, degrading performance. KeyView does not remove this irrelevant content.
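
To make the problem concrete, the following is a minimal sketch of plain markup stripping. It uses Python's standard `html.parser` module as a stand-in and is not KeyView; the sample page and the `naive_text` helper are hypothetical and exist only for illustration. The point is that when markup is simply discarded, sidebar and advertisement text ends up alongside the real content.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects every non-empty text node and ignores all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def naive_text(html):
    """Return all text in the page, with no attempt to drop boilerplate."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: one useful paragraph surrounded by boilerplate.
SAMPLE = (
    '<div id="sidebar">Related links</div>'
    "<p>The actual article.</p>"
    '<div class="ad">Subscribe today!</div>'
)

print(naive_text(SAMPLE))
# prints: Related links The actual article. Subscribe today!
```

The sidebar and advertisement text survive extraction, which is the kind of content that would otherwise reach the index.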
The ContentFromHTML processor uses an embedded browser to process HTML in a similar way to the IDOL Web Connector. There are many reasons to use this processor over other methods of processing HTML.
NOTE: To use the ContentFromHTML processor, you must install the IDOL Web Connector on the NiFi host machine.
Name | Default Value | Description |
---|---|---|
Document Registry Service | | A DocumentRegistryService controller service that manages and updates a document registry database. This ensures that documents are indexed in the correct order. |
WKOOP Path | | The path to the WKOOP executable file. WKOOP is not included with NiFi, so you must install an IDOL Web Connector. |
Url | data:text/html, | The source URL of the HTML content. Specify a URL if you want to resolve links into absolute URLs, or if external resources are required to process the page, for example if external JavaScript must run before the page is processed. You do not need to specify the exact URL of the page being processed, as long as all URLs in the document being processed are absolute or relative to the web server. You can extract the value from a FlowFile attribute using NiFi expression language. |
Clipped | false | Specifies whether to clip web pages. Clipping removes uninteresting parts of a page such as advertisements. To clip pages, set this property to true. To specify the parts of pages to keep and remove, set the properties Clip Page Using CSS: Select and Clip Page Using CSS: Unselect. If you do not set these properties, the processor uses an algorithm to decide which parts of the page to keep. |
Clip Page Using CSS: Select | | A comma-separated list of CSS selectors to specify the parts of a page to keep when the page is clipped. The processor also keeps all descendants of these elements. |
Clip Page Using CSS: Unselect | | A comma-separated list of CSS selectors to specify the parts of a page to remove when the page is clipped. The processor also removes all descendants of these elements. The Clip Page Using CSS: Select property is applied before Clip Page Using CSS: Unselect, so you can use this property to remove unwanted descendants of elements identified by Clip Page Using CSS: Select. |
Temp Directory | temp | The path of the directory in which to store temporary files. |
Extract Links | true | Specifies whether to extract links from pages and add the links to the document metadata. |
Extract HTML Meta | true | Specifies whether to extract information from the meta tags in HTML documents. |
WKOOP Config | | The configuration to pass to WKOOP for HTML processing. Do not include a section header at the top. You can specify any configuration options that are not listed above. For information about the options that you can set, refer to the documentation for the WkoopHtmlExtraction task in the IDOL Connector Framework Server documentation. |
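
The order in which Clip Page Using CSS: Select and Clip Page Using CSS: Unselect are applied can be illustrated with a small sketch. This is not the processor's implementation, which uses an embedded browser. It assumes well-formed markup, supports only simplified selectors of the form `tag`, `.class`, or `tag.class`, and the sample page, the selectors, and the helper names (`clip`, `clipped_text`) are all hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_selectors(value):
    """Split a comma-separated selector list, ignoring empty entries."""
    return [s.strip() for s in value.split(",") if s.strip()]


def matches(el, selector):
    """Match a simplified selector: 'tag', '.class', or 'tag.class'."""
    tag, _, cls = selector.partition(".")
    if tag and el.tag != tag:
        return False
    if cls and cls not in el.get("class", "").split():
        return False
    return True


def matches_any(el, selectors):
    return any(matches(el, s) for s in selectors)


def clip(root, select, unselect):
    """Apply Select first, then remove Unselect matches from what was kept."""
    select, unselect = parse_selectors(select), parse_selectors(unselect)
    kept = []

    def collect(el):
        # Keep the outermost matching elements together with their descendants.
        if matches_any(el, select):
            kept.append(el)
        else:
            for child in el:
                collect(child)

    def prune(el):
        # Remove unselected descendants (and everything beneath them).
        for child in list(el):
            if matches_any(child, unselect):
                el.remove(child)
            else:
                prune(child)

    collect(root)
    for el in kept:
        prune(el)
    return kept


def clipped_text(page, select, unselect):
    root = ET.fromstring(page.strip())
    parts = []
    for el in clip(root, select, unselect):
        parts.extend(t.strip() for t in el.itertext() if t.strip())
    return " ".join(parts)


# Hypothetical page with navigation, an article, and an embedded advert.
PAGE = """
<html><body>
  <div class="nav">Home</div>
  <div class="article">
    <h1>Title</h1>
    <p>Body text.</p>
    <div class="advert">Buy now!</div>
  </div>
</body></html>
"""

print(clipped_text(PAGE, "div.article", "div.advert"))
# prints: Title Body text.
```

In this sketch, selecting `div.article` keeps the article and all of its descendants, including the advert; the unselect pass then removes `div.advert` from inside the kept element, which mirrors how Unselect can trim unwanted descendants of elements identified by Select.
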
Name | Description |
---|---|
success | Successfully processed FlowFiles are routed to this relationship. |
failure | FlowFiles that had an invalid or unknown format. |
extracted | Child documents extracted from an HTML document. This relationship receives documents when you set options such as ChildDocumentSelector in the configuration passed to WKOOP. |