Choose the Content to Index with a Lua Script

Web Connector supports dynamic corpus functionality. This means that you can use IDOL analytics such as categorization to decide whether to ingest content. You can also use this feature to filter the links that are extracted from a page. The connector runs a Lua script to decide whether to ingest the page and which links to follow, so you can also implement a custom algorithm for deciding which pages to index.

NOTE: This feature is available only if your Web Connector license includes dynamic corpus functionality.

The script must contain a function named shouldIngestPage that returns true to ingest the page or false to ignore it. You can optionally return a list of links as a second value, to override the links that the connector extracted from the page. For example, if you want to ingest the page but not follow any of the links on the page, you can return true but specify an empty list.

The function should look like this:

function shouldIngestPage(url, contentType, contentFilename, textContentFilename, links, depth)
  -- do something to decide the return value, then use ONE of the following
  -- (in Lua, a return statement must be the last statement in its block,
  -- so the alternatives are shown commented out)

  -- to ingest the page and follow links extracted by the connector
  return true

  -- to ingest the page but not follow any links
  -- return true, {}

  -- to ignore the page
  -- return false
end

The arguments supplied to the function are:

Argument             Type             Description
url                  string           The page URL.
contentType          string           The MIME content type.
contentFilename      string           The path to the file that contains the page content.
textContentFilename  string           The path to the file that contains the text that was extracted from the page (or nil if text could not be extracted).
links                list of strings  The links that were extracted from the page.
depth                integer          The page depth (the number of links that were followed from the starting point in order to reach the page).
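The following is a minimal sketch of how these arguments might be used together, not a script supplied with the connector. It ignores pages beyond a certain depth or with a non-HTML content type, and returns a reduced list of links to follow. The names MAX_DEPTH, ALLOWED_PREFIX, and keptLinks, and the example.com URL, are hypothetical. The sketch also assumes that links is passed as a Lua array of strings that can be iterated with ipairs.

```lua
-- Hypothetical limits, chosen only for illustration.
local MAX_DEPTH = 3
local ALLOWED_PREFIX = "https://www.example.com/"

function shouldIngestPage(url, contentType, contentFilename, textContentFilename, links, depth)
  -- Ignore pages that are too far from the starting point.
  if depth > MAX_DEPTH then
    return false
  end

  -- Ignore anything that is not HTML. A plain-text search is used because
  -- the MIME type may carry a suffix such as "; charset=utf-8".
  if contentType == nil or not string.find(contentType, "text/html", 1, true) then
    return false
  end

  -- Keep only the links that start with the allowed prefix.
  -- Assumption: links is a Lua array of strings.
  local keptLinks = {}
  for _, link in ipairs(links) do
    if string.sub(link, 1, #ALLOWED_PREFIX) == ALLOWED_PREFIX then
      table.insert(keptLinks, link)
    end
  end

  -- Ingest the page and override the extracted links with the reduced list.
  return true, keptLinks
end
```

Returning the reduced list as the second value leaves the decision to ingest the page separate from the decision about which links to follow.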

An example script, FilterPages_binarycat.lua, is included with the connector. This script decides whether to ingest a page by calling the IDOL Category component and running the action BinaryCatQuery.
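If you do not need categorization, a custom algorithm can be much simpler. The following sketch is not the supplied example script and does not call any IDOL component. It reads the extracted text and ingests the page only if the text contains one of a set of keywords. The KEYWORDS list and the readFile helper are hypothetical, and the sketch assumes that the standard Lua io library is available to scripts that the connector runs.

```lua
-- Hypothetical keywords, chosen only for illustration.
local KEYWORDS = { "connector", "index" }

-- Hypothetical helper: returns the whole content of a file,
-- or nil if the file cannot be opened.
-- Assumption: the standard io library is available to the script.
local function readFile(path)
  local file = io.open(path, "r")
  if not file then
    return nil
  end
  local content = file:read("*a")
  file:close()
  return content
end

function shouldIngestPage(url, contentType, contentFilename, textContentFilename, links, depth)
  -- textContentFilename is nil when text could not be extracted.
  if textContentFilename == nil then
    return false
  end

  local text = readFile(textContentFilename)
  if text == nil then
    return false
  end

  -- Plain-text match against each keyword, ignoring ASCII case.
  text = string.lower(text)
  for _, keyword in ipairs(KEYWORDS) do
    if string.find(text, keyword, 1, true) then
      -- Ingest the page and follow the links extracted by the connector.
      return true
    end
  end

  -- No keyword matched: ignore the page.
  return false
end
```

The check for nil comes first because textContentFilename is nil for pages where text could not be extracted.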

To configure the connector to run your script, set the configuration parameter FilterPagesLuaScript.