Log On to a Web Site

Some Web sites require that you log on before you can access content. To retrieve content from these Web sites, configure the connector to log on to the site.

Basic, HTTP Digest, and NTLMv2 Authentication

To log on to a Web site that uses Basic, HTTP Digest, or NTLM version 2 authentication, specify a user name and password by setting the configuration parameters AuthUser and AuthPassword. For example:

[MyTask]
Url=http://www.example.com/
AuthUser=user
AuthPassword=pass

Alternatively, set the AuthSection parameter to the name of another section of the configuration file, and set the AuthUser and AuthPassword parameters in that section. This lets you use the same user name and password for several tasks:

[MyTask]
Url=http://www.example.com/
AuthSection=MyAuthSection

[AnotherTask]
Url=http://www.another-example.com/
AuthSection=MyAuthSection

[MyAuthSection]
AuthUser=user
AuthPassword=pass

If you create a fetch task to crawl more than one site, or you need to specify more than one set of credentials for a site, you can use multiple sections containing authentication details. The AuthSection parameter accepts multiple values, for example as numbered parameters (AuthSection0, AuthSection1, and so on). In each section, set the AuthUrlRegex parameter to a regular expression that matches the URLs to which the authentication details apply. For example:

[MyTask]
Url=http://www.example.com/
AuthSection0=LogOnAuthExample 
AuthSection1=LogOnAuthExampleSubDomain 

[LogOnAuthExample] 
AuthUrlRegex=.*www\.example\.com/.* 
AuthUser=MyUsername 
AuthPassword=MyP4ssw0rd

[LogOnAuthExampleSubDomain]
AuthUrlRegex=.*subsite\.example\.com/.* 
AuthUser=MySubsiteUsername 
AuthPassword=MySubsiteP4ssw0rd

Submit a Form

If the Web site does not use Basic, HTTP Digest, or NTLMv2 authentication, the connector might be able to log on by submitting a form.

Configure the connector to submit a form by setting the following configuration parameters:

FormUrlRegex

A regular expression to identify the page that contains the log-on form. The connector does not attempt to submit form data unless the URL of a page matches the regular expression.

InputSelector

A list of CSS selectors to identify the form fields to populate. Specify the selectors in a comma-separated list or by using numbered parameters.

InputValue

The values to use for the form fields specified by the InputSelector parameter. Specify the values in a comma-separated list or by using numbered parameters.

SubmitSelector

A CSS selector that identifies the form element to use to submit the form.

ValidateFormData

A Boolean value that specifies whether the connector attempts to validate the data supplied to complete a form. The connector can validate the data based on the types of the input elements.

You can set these parameters in the [TaskName] section of the configuration file. For example:

[MyTask]
Url=http://www.example.com/
FormUrlRegex=.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]
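
The InputSelector and InputValue parameters also accept comma-separated lists in place of numbered parameters. The following sketch rewrites the same task in that form and enables validation of the form data. It assumes that none of the selectors or values contain a comma, and that Boolean parameters such as ValidateFormData accept the value True. Check both assumptions against the parameter reference for your connector version.

[MyTask]
Url=http://www.example.com/
FormUrlRegex=.*login\.php
InputSelector=input[name=username],input[name=password]
InputValue=MyUsername,MyP4ssw0rd!
SubmitSelector=input[name=login]
ValidateFormData=True

If a selector or value contains a comma, numbered parameters are likely the safer choice, because they leave no ambiguity about where one value ends and the next begins.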

To specify the information in a separate section of the configuration file, set the FormsSection parameter:

[MyTask]
Url=http://www.example.com/
FormsSection=LogOnForm

[LogOnForm]
FormUrlRegex=.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]

To submit different forms during a single task, you can create multiple sections containing form settings. The FormsSection parameter accepts multiple values. In each section, use the FormUrlRegex parameter to identify the page that contains the form:

[MyTask]
Url=http://www.example.com/
FormsSection0=LogOnFormExample 
FormsSection1=LogOnFormExampleSubDomain 

[LogOnFormExample] 
FormUrlRegex=.*www\.example\.com/.*login\.php 
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]

[LogOnFormExampleSubDomain] 
FormUrlRegex=.*subsite\.example\.com/.*login\.php 
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MySubsiteUsername
InputValue1=MySubsiteP4ssw0rd!
SubmitSelector=input[name=login]