Choose the Content to Index
When you configure a fetch task to retrieve information from the Web, you can exclude pages from being downloaded and crawled, and exclude pages from being ingested.
- The configuration parameters SpiderUrlMustHaveRegex and SpiderUrlCantHaveRegex exclude pages from being downloaded and crawled for links.
- The configuration parameters UrlMustHaveRegex and UrlCantHaveRegex exclude pages from being ingested.
There is an important difference between these two pairs of parameters. Pages can be crawled (the links on the page are followed by the connector) but not ingested.
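The difference between the two pairs can be sketched as a per-URL decision. The following Python is only an illustration, not the connector's implementation: the function names passes and decide are hypothetical, and it assumes that a regular expression must match the whole URL.

```python
import re
from typing import Optional, Tuple


def passes(url: str, must_have: Optional[str], cant_have: Optional[str]) -> bool:
    """Return True if url satisfies one must-have/cant-have regex pair.

    Assumption: a pattern has to match the entire URL (re.fullmatch).
    """
    if must_have is not None and re.fullmatch(must_have, url) is None:
        return False
    if cant_have is not None and re.fullmatch(cant_have, url) is not None:
        return False
    return True


def decide(
    url: str,
    spider_must: Optional[str] = None,   # stands in for SpiderUrlMustHaveRegex
    spider_cant: Optional[str] = None,   # stands in for SpiderUrlCantHaveRegex
    url_must: Optional[str] = None,      # stands in for UrlMustHaveRegex
    url_cant: Optional[str] = None,      # stands in for UrlCantHaveRegex
) -> Tuple[bool, bool]:
    """Return (crawled, ingested) for a URL the crawl has discovered."""
    crawled = passes(url, spider_must, spider_cant)
    # A page that is never downloaded cannot be ingested either.
    ingested = crawled and passes(url, url_must, url_cant)
    return crawled, ingested
```

In this sketch a URL rejected by the Spider pair is neither crawled nor ingested, whereas a URL rejected only by the other pair is still crawled for links.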
For example, consider the following site structure:
index.html
|- products/software.html
| |- products/software/product1.html
| |- products/software/product2.html
| |- products/software/product3.html
|
|- products/hardware.html
  |- products/hardware/hardware1.html
  |- products/hardware/hardware2.html
  |- products/hardware/hardware3.html
If you set SpiderUrlCantHaveRegex=.*software\.html, the connector does not download or crawl the page products/software.html, so the links to the pages product1.html, product2.html, and product3.html are not followed. The pages highlighted below (products/software.html and the three product pages beneath it) are therefore not ingested:
index.html
|- products/software.html                   <-- not ingested
| |- products/software/product1.html        <-- not ingested
| |- products/software/product2.html        <-- not ingested
| |- products/software/product3.html        <-- not ingested
|
|- products/hardware.html
  |- products/hardware/hardware1.html
  |- products/hardware/hardware2.html
  |- products/hardware/hardware3.html
TIP: Site structures are usually more complex than the example shown here. If the page hardware1.html contained a link to product1.html, product1.html would still be crawled and ingested.
Alternatively, if you set UrlCantHaveRegex=.*software\.html, the connector does not ingest the page products/software.html, but the page is still crawled for links. The pages product1.html, product2.html, and product3.html do not match the regular expression and so they are still ingested. Only the single page highlighted below (products/software.html) is excluded from being ingested:
index.html
|- products/software.html                   <-- not ingested
| |- products/software/product1.html
| |- products/software/product2.html
| |- products/software/product3.html
|
|- products/hardware.html
  |- products/hardware/hardware1.html
  |- products/hardware/hardware2.html
  |- products/hardware/hardware3.html
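To compare the two settings on the example site, the crawl can be simulated in a few lines of Python. This is a rough sketch rather than the connector's actual behavior: the SITE link graph and the crawl function are hypothetical, only the two CantHave parameters are modeled, and a pattern is assumed to have to match the whole URL.

```python
import re
from collections import deque

# Link graph for the example site (page -> pages it links to).
SITE = {
    "index.html": ["products/software.html", "products/hardware.html"],
    "products/software.html": [
        "products/software/product1.html",
        "products/software/product2.html",
        "products/software/product3.html",
    ],
    "products/hardware.html": [
        "products/hardware/hardware1.html",
        "products/hardware/hardware2.html",
        "products/hardware/hardware3.html",
    ],
}


def crawl(site, start, spider_cant=None, url_cant=None):
    """Breadth-first crawl from start; return the set of ingested pages."""
    ingested = set()
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if spider_cant and re.fullmatch(spider_cant, page):
            continue  # not downloaded, so its links are never discovered
        if not (url_cant and re.fullmatch(url_cant, page)):
            ingested.add(page)
        # The page was downloaded, so its links are followed either way.
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return ingested


pattern = r".*software\.html"
without_subtree = crawl(SITE, "index.html", spider_cant=pattern)
without_one_page = crawl(SITE, "index.html", url_cant=pattern)
```

With the Spider pattern, without_subtree holds five pages (index.html and the hardware pages). With the ingestion pattern, without_one_page holds eight pages, that is, everything except products/software.html. Adding a link from hardware1.html to product1.html in SITE brings product1.html back into the first result, as the TIP describes.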