Retrieve Information using a URL File

This section describes how to retrieve content from the Web by providing the connector with a list of URLs. When you provide a list of URLs (a URL file), the connector does not follow links to other pages.

Providing a list of URLs is usually impractical for a large site, but you might want to do this if you have an external process generating the URLs. You must create a text file that contains the URLs of the pages to ingest, with one URL on each line.

You can update the URL file without changing the connector's configuration. If a URL is removed from the file, the connector sends an ingest-delete for that page on the next synchronize cycle.
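As an illustration of the external process mentioned above, the following is a minimal Python sketch that writes a URL file with one URL on each line. The function name `write_url_file` is hypothetical and not part of the connector. The sketch assumes that the connector accepts a UTF-8 file with newline line endings, and that replacing the file in a single step is preferable to overwriting it in place; neither assumption comes from the connector documentation.

```python
import os
import tempfile


def write_url_file(urls, path):
    """Write one URL per line to `path`, replacing any existing file in one step."""
    # Drop blank entries and duplicates while preserving the original order.
    unique_urls = list(dict.fromkeys(u.strip() for u in urls if u.strip()))

    # Write to a temporary file in the same directory, then swap it into place,
    # so that a reader never sees a partially written list (an assumption about
    # what is desirable here, not a documented connector requirement).
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            for url in unique_urls:
                handle.write(url + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        os.unlink(tmp_path)
        raise
    return unique_urls
```

For example, calling `write_url_file(["https://www.example.com/", "https://www.example.com/about"], "my-list-of-urls.txt")` would produce a two-line file. Remember that any URL left out of a regenerated file is treated as removed.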

To create a new Fetch Task

  1. Stop the connector.
  2. Open the configuration file in a text editor.
  3. In the [FetchTasks] section of the configuration file, specify the number of fetch tasks using the Number parameter. If you are configuring the first fetch task, type Number=1. If one or more fetch tasks have already been configured, increase the value of the Number parameter by one (1). Below the Number parameter, specify the names of the fetch tasks, starting from zero (0). For example:

    [FetchTasks]
    Number=1
    0=MyTask

  4. Below the [FetchTasks] section, create a new TaskName section. The name of the section must match the name of the new fetch task. For example:

    [FetchTasks]
    Number=1
    0=MyTask
    
    [MyTask]

  5. In the new section, set the following parameters:

    SitemapFile
        The path to a plain text file that contains a list of URLs of pages to ingest. The file must contain one URL on each line.

    UrlCantHaveRegex
        (Optional) A Perl-compatible regular expression to restrict the content retrieved by the connector. If the full URL of a page matches the regular expression, the page is not ingested. You can set this parameter to filter the list of URLs.

    UrlMustHaveRegex
        (Optional) A Perl-compatible regular expression to restrict the content retrieved by the connector. The full URL of a page must match the regular expression, otherwise the page is not ingested. You can set this parameter to filter the list of URLs.

    For example:

    [MyTask]
    SitemapFile=my-list-of-urls.txt
    RemoveNoscripts=TRUE
    RemoveScripts=TRUE
    SynchronizeThreads=5
    PageDelay=5s

    For a complete list of configuration parameters that you can use, refer to the Web Connector Reference.

  6. (Optional) If the connector is installed on a machine that is behind a proxy server, see Retrieve Information through a Proxy Server.

  7. Save and close the configuration file.
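The bookkeeping in the procedure above (increase Number by one, add the next zero-based task entry, create a section with a matching name) can also be scripted. The following is a minimal sketch that uses Python's standard `configparser` module on the configuration text. The function name `add_fetch_task` is hypothetical. The sketch assumes that the configuration is simple enough for `configparser` to read, which may not hold for a real connector configuration file (for example, one with repeated parameter names). Note also that comments in the original text are not preserved when the configuration is written back.

```python
import configparser
import io


def add_fetch_task(config_text, task_name, settings):
    """Return configuration text with `task_name` registered under [FetchTasks]."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(config_text)

    if not parser.has_section("FetchTasks"):
        parser.add_section("FetchTasks")
    tasks = parser["FetchTasks"]

    # Task names are numbered from zero, so the next index equals the current count.
    count = tasks.getint("Number", fallback=0)
    tasks["Number"] = str(count + 1)
    tasks[str(count)] = task_name

    # The section name must match the task name.
    # add_section raises DuplicateSectionError if the task already exists.
    parser.add_section(task_name)
    for key, value in settings.items():
        parser[task_name][key] = value

    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

For example, `add_fetch_task("", "MyTask", {"SitemapFile": "my-list-of-urls.txt"})` returns text containing a [FetchTasks] section with Number=1 and 0=MyTask, followed by a [MyTask] section. As in the manual procedure, stop the connector before changing its configuration file.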