SitemapUrl
The URL of a sitemap that lists the pages to ingest.
If you set this parameter, only the pages on the sitemap are ingested. The connector does not crawl the site by following links. You can set further parameters, including UrlCantHaveRegex and UrlMustHaveRegex, to filter the pages contained in the sitemap.
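As a rough illustration of the filtering described above, the following configuration fragment reads a sitemap and keeps only some of the pages it lists. The task name MyTask and both regular expressions are hypothetical, and the sketch assumes that each expression is written to match the whole URL.

```
[MyTask]
SitemapUrl=http://www.mywebsite.com/sitemap.xml
// Hypothetical filters: keep pages under /products/, skip PDF files
UrlMustHaveRegex=.*/products/.*
UrlCantHaveRegex=.*\.pdf
```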
TIP: Web Connector can retrieve information in one of the following ways:
- To start from a URL and follow links to other pages, set the parameter Url.
- To retrieve the pages contained in a sitemap, set the parameter SitemapUrl. A sitemap is an XML document, used by some websites to present web crawlers with a list of pages to retrieve. If the site provides a sitemap, using it is often the best option, because the connector retrieves the pages suggested by the site administrator. This can be easier than crawling the site and choosing the pages to ingest based on their URL or content.
- To retrieve a list of URLs that are specified in a text file, set the parameter SitemapFile. You must create the file, which is not practical for large sites, but you might want to use this option if you have an external process generating the URLs.
In each case, the other parameters are ignored. SitemapUrl takes precedence, followed by SitemapFile, followed by Url.
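The three alternatives might look like this in a task section. This is only a sketch: the task name MyTask and the file name urls.txt are hypothetical, and two of the options are commented out so that only one source of pages is configured.

```
[MyTask]
// Option 1: start from a URL and follow links to other pages
// Url=http://www.mywebsite.com/
// Option 2: read a list of URLs from a text file (hypothetical file name)
// SitemapFile=urls.txt
// Option 3: ingest only the pages listed in a sitemap
SitemapUrl=http://www.mywebsite.com/sitemap.xml
```

Leaving a single option active avoids relying on the precedence order to decide which parameter takes effect.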
You can also set this parameter to the URL of a sitemap index (a list of sitemaps). If you do so, you can choose which of the sitemaps to process by setting the configuration parameters SitemapIndexUrlCantHaveRegex and SitemapIndexUrlMustHaveRegex.
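A minimal sketch of the sitemap index case follows. The index URL, the task name MyTask, and both regular expressions are hypothetical, and the sketch assumes that each expression is written to match the whole URL of a sitemap listed in the index.

```
[MyTask]
SitemapUrl=http://www.mywebsite.com/sitemap_index.xml
// Hypothetical filters applied to the sitemap URLs listed in the index
SitemapIndexUrlMustHaveRegex=.*sitemap-products.*
SitemapIndexUrlCantHaveRegex=.*archive.*
```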
Type: String
Default:
Required: You must set Url, SitemapUrl, or SitemapFile
Configuration Section: TaskName, FetchTasks, or Default
Example: SitemapUrl=http://www.mywebsite.com/sitemap.xml