Process XML Feeds

Web sites might have pages, such as RSS feeds, that contain XML rather than HTML. This section describes how to process XML pages with Web Connector.

Web Connector identifies XML pages by the MIME type contained in the Content-Type response header returned by the web server.
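As an illustration of MIME-type detection, the following Python sketch classifies a Content-Type header value as XML or not. It is not the connector's implementation. The set of media types treated as XML here (text/xml, application/xml, and any type with a +xml suffix) is an assumption, and the function names are hypothetical.

```python
def media_type(content_type: str) -> str:
    """Return the lowercase media type, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def looks_like_xml(content_type: str) -> bool:
    """Guess whether a Content-Type header value describes an XML page.

    Assumption: text/xml, application/xml, and "+xml" structured-syntax
    types (for example application/rss+xml) all count as XML.
    """
    mt = media_type(content_type)
    return mt in {"text/xml", "application/xml"} or mt.endswith("+xml")


if __name__ == "__main__":
    for header in ("application/rss+xml; charset=utf-8", "text/html"):
        print(header, "->", looks_like_xml(header))
```

Parameters such as charset are discarded before the comparison, so a header like "text/xml; charset=ISO-8859-1" is classified the same way as "text/xml".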

XML pages can include a processing instruction that tells a web browser to apply an XSL transformation to the page. In some cases the transformation converts the XML into HTML. The processing instruction looks similar to this:

<?xml-stylesheet type="text/xsl" href="transform_to_html.xsl"?>
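To show how such a processing instruction can be located, the following Python sketch parses an XML page with the standard library and returns the href of the first XSL stylesheet it declares. This is only a sketch of the idea, not how the connector does it. The function name, the sample document, and the accepted type values are assumptions.

```python
import re
from typing import Optional
from xml.dom import minidom

# Pseudo-attributes inside the processing instruction, e.g. href="x.xsl".
PSEUDO_ATTR = re.compile(r"([\w-]+)\s*=\s*(?:\"([^\"]*)\"|'([^']*)')")


def find_xsl_href(xml_text: str) -> Optional[str]:
    """Return the href of the first XSL xml-stylesheet instruction, or None."""
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        if node.nodeType != node.PROCESSING_INSTRUCTION_NODE:
            continue
        if node.target != "xml-stylesheet":
            continue
        attrs = {}
        for match in PSEUDO_ATTR.finditer(node.data):
            value = match.group(2) if match.group(2) is not None else match.group(3)
            attrs[match.group(1)] = value
        # Assumption: only these type values identify an XSL transformation.
        if attrs.get("type") in ("text/xsl", "application/xslt+xml"):
            return attrs.get("href")
    return None


SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="transform_to_html.xsl"?>\n'
    '<rss version="2.0"><channel/></rss>'
)

if __name__ == "__main__":
    print(find_xsl_href(SAMPLE))
```

A stylesheet instruction with a different type, such as text/css, is skipped, because it does not describe an XSL transformation.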

If the Content-Type header indicates that a page contains XML and an XSL transformation is provided, the connector applies the transformation and processes the page as if it were HTML. This means that the connector can follow the links in the transformed page, as it does for any other HTML page.

When the Web Connector applies an XSL transformation to an XML document and logging is configured with LogLevel=Full, you will see the following messages in the synchronize log:

WKOOP:Applying XSL transfrom
WKOOP:XSL Transform applied to XML document

For RSS pages that do not provide their own XSL transformation, you can provide the path of an XSL transformation to use. In your fetch task, set the configuration parameter RSSXSLFilePath. The XSL transformation must convert the XML into HTML. A transform named RSSTransform.xsl is supplied with the connector, for transforming XML that complies with the RSS specification.

As a result, you can use the Web Connector to process an RSS feed. The Web Connector, like the RSS Connector, retrieves the content contained in the feed, such as page titles and summaries. In addition, the Web Connector can follow the links contained in the feed and ingest the content on the associated pages.

If the Content-Type response header indicates that the page contains XML but no XSL transformation is provided, the connector ingests the page as an XML document. In this case the connector does not follow links.
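The cases described in this section can be summarized as a small decision function. The Python sketch below is a reading aid only. The function name, its parameters, and the returned labels are hypothetical, and how the connector decides that a page is an RSS page is not described here, so that decision is passed in as a flag.

```python
from typing import Optional


def choose_handling(
    is_xml: bool,
    stylesheet_href: Optional[str],
    is_rss: bool,
    rss_xsl_file_path: Optional[str],
) -> str:
    """Return a label for how a page is handled in each documented case."""
    if not is_xml:
        # Not XML: the page is processed as ordinary HTML.
        return "html"
    if stylesheet_href:
        # The page names its own XSL transformation.
        return "transform-to-html"
    if is_rss and rss_xsl_file_path:
        # No transformation in the page, but RSSXSLFilePath supplies one.
        return "transform-to-html"
    # XML with no transformation: ingested as XML, links are not followed.
    return "xml"


if __name__ == "__main__":
    print(choose_handling(True, None, True, "RSSTransform.xsl"))
    print(choose_handling(True, None, False, None))
```

Only the two "transform-to-html" outcomes lead to link following, because in those cases the page is processed as if it were HTML.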

To process an RSS feed with Web Connector

  1. In the Web Connector task configuration, specify the URL of the feed using the Url parameter.
  2. (Optional) Set Depth=1. This ensures that the connector follows the links from the RSS feed, but does not follow any links that are extracted from the associated web pages.
  3. If the RSS feed does not provide an XSL transformation, supply one by setting the RSSXSLFilePath parameter. The transformation must convert the XML into HTML. A transform named RSSTransform.xsl is supplied with the connector, for transforming XML that complies with the RSS specification. For example:

    [FetchTasks]
    Number=1
    0=RssFeed

    [RssFeed]
    Url=http://www.example-news-website.com/feed/rss.xml
    Depth=1
    RSSXSLFilePath=RSSTransform.xsl
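The effect of Depth=1 in step 2 can be pictured as a depth-limited traversal of links. The following Python sketch walks a small in-memory link graph and stops following links once the depth limit is reached. It is a simplified model, not the connector's crawler. The function name, the toy graph, and the convention that the feed itself is at depth 0 are assumptions.

```python
from collections import deque
from typing import Dict, List


def crawl(start: str, links: Dict[str, List[str]], max_depth: int) -> List[str]:
    """Return the pages visited, breadth first, within max_depth links of start."""
    seen = {start}
    visited = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            # At the limit: ingest this page but do not follow its links.
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Toy link graph: the feed links to two articles, one of which links onward.
LINKS = {
    "feed": ["article-1", "article-2"],
    "article-1": ["related-page"],
    "article-2": [],
}

if __name__ == "__main__":
    print(crawl("feed", LINKS, max_depth=1))
```

With max_depth=1 the traversal visits the feed and both articles but never reaches related-page, which mirrors the behavior described for Depth=1.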