Crawl Settings: Link Sources

To access this feature, click the Edit menu and select Default Scan Settings or Current Scan Settings. Then, in the Crawl Settings category, select Link Sources.

What is Link Parsing?

The Fortify WebInspect crawler sends a request to a start URL and recursively parses links (URLs) from the response content. These links are added to a work queue and the crawler iterates through the queue until it is empty. The techniques used to extract the link information from the HTTP responses are collectively referred to as ‘link parsing.’ There are two choices for how the crawler performs link parsing: Pattern-based and DOM-based.
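In outline, this is a standard work-queue loop. The following minimal sketch (in Python; the extract_links function is a hypothetical placeholder for whichever link-parsing technique is configured) illustrates the process:

from collections import deque
from urllib.parse import urljoin
import urllib.request

def extract_links(base_url, html):
    # Hypothetical placeholder: pattern-based or DOM-based extraction goes here.
    return []

def crawl(start_url):
    queue = deque([start_url])             # the crawler's work queue
    seen = {start_url}
    while queue:                           # iterate until the queue is empty
        url = queue.popleft()
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        for link in extract_links(url, html):
            absolute = urljoin(url, link)  # resolve relative links against the request
            if absolute not in seen:       # queue each newly discovered URL once
                seen.add(absolute)
                queue.append(absolute)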

Pattern-based Parsing

Pattern-based link parsing uses a combination of text searching and pattern matching to find URLs. These URLs include the ordinary content that is rendered by a browser, such as <A> elements, as well as invisible text that may reveal additional site structure.

This option matches the default behavior of Fortify WebInspect 10.40 and earlier versions. It is a more aggressive approach to crawling the website and can increase the time it takes to conduct a scan. This aggressive behavior can also cause the crawler to create many extra links that are not representative of actual site content. In these situations, DOM-based Parsing should expose the site's URL content with fewer false positives.
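As an illustration of the approach only (these are not Fortify WebInspect's actual patterns), a pattern-based extractor runs URL-shaped regular expressions over the raw response text. It finds links in comments, scripts, and plain text alike, including the invalid ones described above:

import re

# A deliberately broad pattern: absolute URLs anywhere in the text, plus
# relative paths that end in a common file extension.
URL_RE = re.compile(
    r"""https?://[^\s"'<>]+            # absolute URLs
      | [\w./-]+\.(?:html|js|css)      # relative paths with common extensions
    """,
    re.VERBOSE,
)

def pattern_based_links(text):
    return URL_RE.findall(text)

sample = '<!-- see http://example.com/hidden.html --> var u = "foo.html"'
print(pattern_based_links(sample))
# ['http://example.com/hidden.html', 'foo.html'] -- the comment link and the
# script variable are both picked up, whether or not they are real URLs.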

Note: All of the DOM-based Parsing techniques for finding links are used when Pattern-based Parsing is selected. Pattern-based Parsing, however, is not capable of computing the metadata for the link source. DOM-based Parsing is capable of computing this information and thus provides more intelligent parsing. DOM-based Parsing also provides more control over which parsing techniques are used.

DOM-based Parsing

The Document Object Model (DOM) is a programming concept that provides a logical structure for defining and building HTML and XML documents, navigating their structure, and editing their elements and content.

A graphical representation of an HTML page rendered as a DOM would resemble an upside-down tree: it starts with the HTML node and branches out to include the tags, sub-tags, and content. This structure is called a DOM tree.

Using DOM-based parsing, Fortify WebInspect parses HTML pages into a DOM tree and uses the detailed parsed structure to identify the sources of hyperlinks with higher fidelity and greater confidence. DOM-based parsing can reduce false positives and may also reduce the degree of ‘aggressive link discovery.’
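To illustrate the difference, the following sketch uses Python's standard html.parser as a stand-in for Fortify WebInspect's parser. Because the extractor walks parsed elements rather than raw text, it can report each link together with the element and attribute it came from, which is the link-source metadata that pattern matching cannot compute:

from html.parser import HTMLParser

class AnchorParser(HTMLParser):
    # Report each link with the element and attribute it came from.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(("a", "href", value))

p = AnchorParser()
p.feed('<a href="/next.html">next</a> <!-- /ghost.html -->')
print(p.links)   # [('a', 'href', '/next.html')] -- the commented-out link is
                 # reported only if comment parsing is enabled separately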

On some sites, the crawler iteratively requests bad links and the resulting responses echo those links back in the response content, sometimes adding extra text that compounds the problem. These repeated cycles of ‘bad links in and bad links out’ can cause scans to run for a long time or, in rare cases, forever. DOM-based parsing and careful selection of link sources provide a mechanism for limiting this runaway scan behavior. Web applications vary in structure and content, and some experimentation may be required to get optimal link source configurations.

To refine DOM-based Parsing, select the techniques you want to use for finding links. Clearing techniques that are not a concern for your site can decrease the amount of time it takes to complete the scan. For a more thorough scan, however, select all techniques or use Pattern-based Parsing. The DOM-based Parsing techniques are described below. For more information, see Limitations of Link Source Settings.

Include Comment Links (Aggressive)

Programmers may leave notes to themselves inside HTML comments, including links that are not visible on the rendered site but may be discovered by an attacker. Use this option to find links inside HTML comments. Fortify WebInspect will find more links, but these may not always be valid URLs, causing the crawler to try to access content that does not exist. Also, the same comment link can appear on every page, and such links can be relative, which can exponentially increase the URL count and lengthen the scan time.
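A minimal sketch of the technique, again using Python's html.parser: the parser delivers comment text as a separate event, and any URL-like text found there can be queued.

from html.parser import HTMLParser
import re

class CommentLinkParser(HTMLParser):
    URL_RE = re.compile(r"https?://[^\s\"'<>]+|/[\w./-]+")

    def __init__(self):
        super().__init__()
        self.comment_links = []

    def handle_comment(self, data):
        # Comments never render in a browser, but may still name real paths.
        self.comment_links.extend(self.URL_RE.findall(data))

p = CommentLinkParser()
p.feed("<!-- TODO: remove /admin/debug.html before launch -->")
print(p.comment_links)   # ['/admin/debug.html']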
Include Conditional Comment Links

A conditional comment link occurs when the HTML on the page is conditionally included or excluded depending on the user agent (browser type and version) making the request.

Regular comment example:
<!-- hidden.txt -->

Conditional comment example:
<!--[if lt IE 9]>
<script src="//www.somesite.com/static/v/all/js/html5sh.js"></script>
<link rel="stylesheet" type="text/css" href='//www.somesite.com/static/v/fn-hp/css/IE8.css'>
<![endif]-->

Fortify WebInspect emulates browser behavior when evaluating HTML code and processes the DOM differently depending on the user agent. A link that one user agent treats as a comment may be a normal HTML link for other user agents.

Use this option to find conditional links that are inside HTML comments, such as those commented out based on browser version. These conditional statements may also contain script includes that need to be executed when script parsing is enabled. Crawling these links makes the scan more thorough but can increase the scan time. Additionally, such comments may be out of date and pointless to crawl.
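As a rough sketch of the mechanism (not Fortify WebInspect's actual browser emulation), a downlevel-hidden conditional comment can be matched with a pattern and its body treated as live HTML only when the emulated user agent satisfies the condition. Only the "lt IE n" form is handled here; real conditional expressions are richer:

import re

COND_RE = re.compile(
    r"<!--\[if\s+(?P<cond>[^\]]+)\]>(?P<body>.*?)<!\[endif\]-->", re.S)

def conditional_bodies(html, emulated_ie_version):
    # Yield comment bodies whose condition matches the emulated browser.
    for m in COND_RE.finditer(html):
        lt = re.fullmatch(r"lt IE\s*(\d+)", m.group("cond").strip())
        if lt and emulated_ie_version < int(lt.group(1)):
            yield m.group("body")   # live HTML: parse links, run script includes

html = '<!--[if lt IE 9]><script src="/static/html5sh.js"></script><![endif]-->'
print(list(conditional_bodies(html, 8)))    # body is live HTML for IE 8
print(list(conditional_bodies(html, 11)))   # [] -- for IE 11 it is just a comment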

Include Plain Text Links

Plain text in a .txt file or a paragraph inside HTML code can be formatted as a URL, such as http://www.something.com/mypage.html. However, because this is only text and not a true link, the browser would not render it as a link, and the text would not be functionally part of the page. For example, the content may be part of a page that describes how to code in HTML using fake syntax that is not meant to be clicked by users. Use this option for Fortify WebInspect to parse these text links and queue them for a crawl.

Also, using smart pattern matches, Fortify WebInspect can identify common file extensions, such as .css, .js, .bmp, .png, .jpg, .html, etc., and add these files to the crawl queue. Auditing these files that are referenced in plain text can produce false positives.
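A sketch of that extension matching, assuming a fixed allowlist of extensions (the list Fortify WebInspect actually uses is internal to the product):

import re

CRAWLABLE_EXTS = {".css", ".js", ".bmp", ".png", ".jpg", ".html"}  # assumed list
TOKEN_RE = re.compile(r"[\w./-]+")

def plain_text_links(text):
    # Keep only tokens that end in a known file extension.
    hits = []
    for token in TOKEN_RE.findall(text):
        dot = token.rfind(".")
        if dot != -1 and token[dot:].lower() in CRAWLABLE_EXTS:
            hits.append(token)
    return hits

print(plain_text_links("See styles.css and notes.doc for details."))
# ['styles.css'] -- notes.doc is skipped because .doc is not in the allowlist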

Include Links in Static Script Blocks

Use this option for Fortify WebInspect to examine the content between the opening and closing script tags for text that looks like links. Valid links may be found inside these script blocks, but developers may also leave comments there that include link-like text. For example:

<script type="text/javascript">
// go to http://www.foo.com/blah.html for help
var url = "http://www.foo.com/xyz/" + path + "?help";
</script>

Additionally, JavaScript code inside these tags can be handled by the JavaScript execution engine during the scan. However, searching for static links in a line of code that sets a variable, such as the "var url" in the example above, can create problems when those partial paths are added to the queue for crawling. If the variable includes a relative link with a common extension, such as "foo.html", the crawler resolves that link against every page that includes the line of code, as the sketch below shows. This can produce unusable URLs and may create false positives.
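That multiplication is easy to reproduce with standard URL resolution. In this sketch (Python's urllib.parse.urljoin, made-up URLs), the same relative link lifted out of a shared script block resolves to a different URL on every page that includes it:

from urllib.parse import urljoin

# The same script block, and therefore the same relative link "foo.html",
# is included on many pages; each page yields a different resolved URL.
pages = ["http://www.foo.com/a/index.html", "http://www.foo.com/a/b/index.html"]
for page in pages:
    print(urljoin(page, "foo.html"))
# http://www.foo.com/a/foo.html
# http://www.foo.com/a/b/foo.html -- one bad link per page enters the crawl queue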

Parse URLs Embedded in URLs

Use this option for Fortify WebInspect to parse any text that is inside an href attribute and add it to the crawl queue. The following is an example of a URL embedded in a URL:

<a href="http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah" />

On some sites, however, file not found pages return the URL in a form action tag and append the URL to the original URL as follows:

<form action="http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah?
http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah" />

Fortify WebInspect will then request the form action, and receive another file not found response, again with the URL appended in a form action, as shown below:

<form action="http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah?
http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah?
http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah?
http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah" />

On such a site, these URLs will continue to produce file not found responses that add more URLs to the crawl queue, creating an infinite crawl loop. To avoid adding this type of URL to the crawl queue, do not use this option.
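The extraction itself is simple; the risk lies in what the site echoes back. A small sketch of the parsing side, using Python's urllib.parse (parse_qsl also decodes the percent-encoding):

from urllib.parse import urlsplit, parse_qsl

def embedded_urls(url):
    # Return URL-valued query parameters, percent-decoding included.
    query = urlsplit(url).query
    return [value for _, value in parse_qsl(query)
            if value.startswith(("http://", "https://"))]

link = "http://www.foo.com/xyz/bar.html?url=http%3A%2F%2Fwww.zzzz.com%2Fblah"
print(embedded_urls(link))   # ['http://www.zzzz.com/blah']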

Allow Un-rooted URLs (for the above items)

This option modifies the behavior of the previous five options. Some URLs do not include a scheme, such as http, or a fully qualified domain name. These URLs, which may resemble xyz.html, are considered unanchored or "un-rooted." The assumption is that an un-rooted URL is relative to the request that produced it.

For example, the link <a href='foo.html' /> does not include a scheme, so it takes the scheme of the context URL. If the page containing the link was requested over HTTPS, then https is used as the scheme of the resolved URL.

Use this option to treat un-rooted URLs as links when parsing. If this option is selected, the scan will be more thorough and more aggressive, but may take considerably longer to complete.
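Resolving an un-rooted link against its context URL is standard URL joining. A small illustration with Python's urllib.parse.urljoin and made-up URLs:

from urllib.parse import urljoin

context = "https://www.foo.com/shop/index.html"    # page that contained the link
print(urljoin(context, "foo.html"))                # https://www.foo.com/shop/foo.html
print(urljoin(context, "//cdn.foo.com/app.js"))    # https://cdn.foo.com/app.js
# Both links inherit the https scheme of the page that was requested.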

Form Actions, Script Includes, and Stylesheets

Some link types, such as form actions, script includes, and stylesheets, are special and are treated differently from other links. On some sites, it may not be necessary to crawl and parse these links. However, if you want an aggressive scan that attempts to crawl and parse everything, the following options will help accomplish this goal. For more information, see Limitations of Link Source Settings.

Note: You can also allow un-rooted URLs for each of these options. See Allow Un-rooted URLs in this topic.

Crawl Form Action Links

When Fortify WebInspect encounters HTML forms during the crawl, it creates variations on the inputs that a user can make and submits the forms as requests to solicit more site content. For example, for forms with a POST method, Fortify WebInspect can use a GET instead and possibly reveal information. In addition to this type of crawling, use this option for Fortify WebInspect to treat form targets as normal links.

Crawl Script Include Links

A script include imports JavaScript from a .js file and processes it on the current page. Use this option for Fortify WebInspect to crawl the .js file as a link.

Crawl Stylesheet Links

A stylesheet link imports the style definitions from a .css file and renders them on the current page. Use this option for Fortify WebInspect to crawl the .css file as a link.
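To make the three link sources concrete (again using Python's html.parser as a stand-in for Fortify WebInspect's parser), each option corresponds to one tag and attribute pair in the parsed document:

from html.parser import HTMLParser

class SpecialLinkParser(HTMLParser):
    # Form actions, script includes, and stylesheet links in one pass.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and a.get("action"):
            self.links.append(("form action", a["action"]))
        elif tag == "script" and a.get("src"):
            self.links.append(("script include", a["src"]))
        elif tag == "link" and a.get("rel") == "stylesheet" and a.get("href"):
            self.links.append(("stylesheet", a["href"]))

p = SpecialLinkParser()
p.feed('<form action="/search"></form>'
       '<script src="/app.js"></script>'
       '<link rel="stylesheet" href="/site.css">')
print(p.links)
# [('form action', '/search'), ('script include', '/app.js'),
#  ('stylesheet', '/site.css')]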

Miscellaneous Options

The following additional options may help improve link parsing for your site. For more information, see Limitations of Link Source Settings.

Crawl Links on FNF Pages

If you select this option, Fortify WebInspect will look for and crawl links on responses that are marked as “file not found.”

This option is selected by default when the Scan Mode is set to Crawl Only or Crawl & Audit. The option is not available when the Scan Mode is set to Audit Only.

Suppress URLs with Repeated Path Segments

Many sites have text that resembles relative paths that become unusable URLs after Fortify WebInspect parses them and appends them to the URL being crawled. These occurrences can result in a runaway scan if paths are continuously appended, such as /foo/bar/foo/bar/. This setting helps reduce such occurrences and is enabled by default.

With the setting enabled, the options are as follows (a sketch of the matching logic appears after this list):

1 – Detect a single sub-folder repeated anywhere in the URL and reject the URL if there is a match. For example, /foo/baz/bar/foo/ will match because “/foo/” is repeated. The repeat does not have to occur adjacently.

2 – Detect two (or more) pairs of adjacent sub-folders and reject the URL if there is a match. For example, /foo/bar/baz/foo/bar/ will match because “/foo/bar/” is repeated.

3 – Detect two (or more) sets of three adjacent sub-folders and reject the URL if there is a match. For example, /foo/bar/baz/foo/bar/baz/ will match because “/foo/bar/baz/” is repeated.

4 – Detect two (or more) sets of four adjacent sub-folders and reject the URL if there is a match.

5 – Detect two (or more) sets of five adjacent sub-folders and reject the URL if there is a match.

If the setting is disabled, repeating sub-folders are not detected and no URLs are rejected due to matches.
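The following minimal sketch reflects one reading of these levels, not Fortify WebInspect's implementation: a URL path matches at level n if any run of n adjacent sub-folders occurs more than once anywhere in the path.

def has_repeated_run(url_path, n):
    # True if any run of n adjacent path segments occurs more than once.
    segments = [s for s in url_path.split("/") if s]
    seen = set()
    for i in range(len(segments) - n + 1):
        run = tuple(segments[i:i + n])
        if run in seen:
            return True
        seen.add(run)
    return False

print(has_repeated_run("/foo/baz/bar/foo/", 1))      # True  -- "foo" repeats
print(has_repeated_run("/foo/bar/baz/foo/bar/", 2))  # True  -- "foo/bar" repeats
print(has_repeated_run("/foo/bar/baz/", 2))          # False -- no repeated pair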

Limitations of Link Source Settings

Clearing a link source check box prevents the crawler from processing that specific kind of link when it is found using static parsing. However, these links can be found in many other ways. For example, clearing the Crawl Stylesheet Links option does not control path truncation, nor does it suppress .css file requests made by the script engine; it only prevents static link parsing of the .css response from the server. Similarly, clearing the Crawl Script Include Links option does not suppress .js, AJAX, frameIncludes, or any other file request made by the script engine. Therefore, clearing a link source check box is not a universal filter for that type of link source.

The goal for clearing a check box is to prevent potentially large volumes of bad links from cluttering the crawl and resulting in extremely long scan times.

See Also

Crawl Settings: Link Parsing

Crawl Settings: Session Exclusions