Clip Pages
The content on most web pages includes headers, footers, navigation bars, and advertisements. Unless these are removed from pages, the text from these items could reduce the quality of the documents indexed into IDOL Server and reduce the effectiveness of operations such as categorization.
You can configure Web Connector to remove irrelevant content from pages before they are ingested.
Clip Pages Automatically
Web Connector can clip pages automatically, using one of the following algorithms to decide which parts of the page to keep and which to discard.
- To use the Mozilla readability library, set ClippingMode to READABILITY.
- To use the SmartPrint algorithm, set ClippingMode to SMARTPRINT. SmartPrint works best with common page designs, such as pages where the content is in the center and there are navigation panels to the top and left, with extra content to the right.
The SmartPrint algorithm evaluates each section of the page and decides whether to clip it based on several factors, including:
- The position of the section on the page (central content is preferred).
- The ratio of links to words (a smaller proportion of links is preferred).
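To make the link-to-word factor concrete, the following Python sketch measures what fraction of the words in an HTML fragment are link text. It is only an illustration of the idea, not the SmartPrint implementation: the names `LinkWordCounter` and `link_word_ratio` are hypothetical, and the sample fragments are invented for the example.

```python
from html.parser import HTMLParser


class LinkWordCounter(HTMLParser):
    """Counts words inside and outside <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # greater than 0 while inside an <a> element
        self.link_words = 0   # words that are part of link text
        self.total_words = 0  # all words in the fragment

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        self.total_words += words
        if self.link_depth > 0:
            self.link_words += words


def link_word_ratio(html_fragment):
    """Return the fraction of words that are link text (0.0 if there are no words)."""
    counter = LinkWordCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.total_words == 0:
        return 0.0
    return counter.link_words / counter.total_words


# Invented sample sections: a navigation bar and a piece of main content.
nav = '<nav><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></nav>'
article = '<div><p>Some content about a topic, with <a href="/ref">one link</a>.</p></div>'

print(link_word_ratio(nav))      # every word is link text
print(link_word_ratio(article))  # mostly plain text
```

A section such as the navigation bar, where nearly every word is a link, scores close to 1 and is a likely candidate for clipping, whereas body text scores much lower.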
Clip Pages using CSS Selectors
The automatic clipping algorithms have been designed to work with many different pages, but this means that automatic clipping might not give the best results for every page. For this reason, you can use CSS selectors to choose which parts of the page to keep and which to discard. To clip pages with CSS selectors, set ClippingMode=CSSCLIPPING in your task configuration, and then set one or both of the parameters ClipPageUsingCssSelect and ClipPageUsingCssUnselect.
ClipPageUsingCssSelect | A comma-separated list of CSS selectors that specify parts of the page to keep. The connector also keeps all descendants of these elements. |
ClipPageUsingCssUnselect | A comma-separated list of CSS selectors that specify parts of the page to remove. The connector also removes all descendants of these elements. |
Web Connector supports standard CSS selectors. To construct the selectors, view the source HTML of the pages that you need to clip. CSS allows you to select elements based on the structure of the page. For example, you can select elements of a certain type that are descendants of another element. Also, the designer of the page might have added classes to the relevant elements in order to style them, and you can use these same classes to clip the page.
The following example shows a simple page:
<html>
    <head>
    </head>
    <body>
        <nav>
            <!-- navigation and links -->
        </nav>
        <div class="maincontent">
            <p>Some content</p>
        </div>
        <div class="footer">
            <!-- footer -->
        </div>
    </body>
</html>
To select the main content but exclude the navigation element and the footer, you could use the following configuration:
[MyTask]
...
ClippingMode=CSSCLIPPING
ClipPageUsingCssSelect=div.maincontent
ClipPageUsingCssUnselect=nav,div.footer
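The effect of combining a "select" list with an "unselect" list can be sketched in Python. This is a minimal illustration under stated assumptions, not the connector's code: `Clipper`, `matches`, and `clip` are hypothetical names, only simple `tag`, `.class`, and `tag.class` selectors are handled, the input is assumed to be well-formed HTML, and the sketch assumes that an unselected element is removed even when it sits inside a selected one.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def matches(selector, tag, classes):
    """Match a simple 'tag', '.class', or 'tag.class' selector only."""
    sel_tag, _, sel_class = selector.partition(".")
    if sel_tag and sel_tag != tag:
        return False
    if sel_class and sel_class not in classes:
        return False
    return True


class Clipper(HTMLParser):
    """Collects the text that survives clipping (hypothetical helper)."""

    def __init__(self, select, unselect):
        super().__init__()
        self.select = select
        self.unselect = unselect
        self.stack = []  # one (selected, unselected) pair per open element
        self.kept = []   # text fragments that are kept

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        parent_sel, parent_unsel = self.stack[-1] if self.stack else (False, False)
        # Descendants inherit the state of their ancestors.
        sel = parent_sel or any(matches(s, tag, classes) for s in self.select)
        unsel = parent_unsel or any(matches(s, tag, classes) for s in self.unselect)
        self.stack.append((sel, unsel))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        sel, unsel = self.stack[-1] if self.stack else (False, False)
        # With no select list, everything is kept unless it is unselected.
        keep = (sel or not self.select) and not unsel
        if keep and data.strip():
            self.kept.append(data.strip())


def clip(html, select="", unselect=""):
    """Return the kept text; both arguments are comma-separated selector lists."""
    def to_list(value):
        return [s.strip() for s in value.split(",") if s.strip()]

    clipper = Clipper(to_list(select), to_list(unselect))
    clipper.feed(html)
    clipper.close()
    return clipper.kept


# Invented sample page, similar in shape to the example above.
page = """<html><head></head><body>
<nav><a href="/">Home</a></nav>
<h1>Page title</h1>
<div class="maincontent"><p>Some content</p></div>
<div class="footer">Copyright notice</div>
</body></html>"""

print(clip(page, select="div.maincontent", unselect="nav,div.footer"))
print(clip(page, unselect="nav,div.footer"))
```

In this sketch, supplying only the "unselect" list keeps everything except the navigation and footer, while adding the "select" list narrows the result to the main content and its descendants.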
TIP: Web Connector includes an example tool that can help you find the CSS selectors you need to clip web pages. For more information about this utility, see Find Selectors using the CSS Selector Builder Tool.