Manage datasets

In Fusion, a dataset defines the location of data on a source, the rules and schedule for processing, and the grammar rules to identify within the data during processing.

Once you have created at least one source, you can create as many datasets as necessary for that source. For example, suppose you want to process data from only specific directories on a given file system. You would create a source for the file system and then create a dataset for each directory that contains data you want to process. This focuses processing on the desired data and omits data known to be irrelevant.
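
The relationship between one source and several datasets can be pictured with a small data model. The following Python sketch is illustrative only: the class names Source and Dataset, their fields, and the example paths are assumptions made for this example and are not part of any Fusion API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """Hypothetical model of a dataset: a named location on a source."""
    name: str
    location: str  # directory on the source's file system (made-up paths below)


@dataclass
class Source:
    """Hypothetical model of a source that can hold any number of datasets."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def add_dataset(self, name: str, location: str) -> Dataset:
        dataset = Dataset(name=name, location=location)
        self.datasets.append(dataset)
        return dataset


# One source for the file system, one dataset per directory of interest.
file_server = Source(name="file-server")
file_server.add_dataset("finance", "/shares/finance")
file_server.add_dataset("legal", "/shares/legal")

# Directories that are not listed here are simply never processed.
locations = [d.location for d in file_server.datasets]
print(locations)
```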

On the Manage Datasets page, you can filter the list of datasets by dataset TYPE and choose whether to VIEW the list sorted by sources. The analysis information specifies the processing conducted for the dataset. The data count represents the number of emails or documents for unstructured data sources or the number of tables for structured data sources. When the count represents documents, the document count reflects the number of parent documents processed (extracted attachments are not included in the count). The size of each dataset represents the size on disk.
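
To make the counting rule concrete, the sketch below contrasts a parent-only count with a count that also includes attachments. The Document class and both function names are hypothetical and exist only for this illustration; they do not describe how Fusion computes the value internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Hypothetical document that may carry extracted attachments."""
    name: str
    attachments: List["Document"] = field(default_factory=list)


def parent_document_count(documents: List[Document]) -> int:
    """Count top-level (parent) documents only; attachments are ignored."""
    return len(documents)


def total_item_count(documents: List[Document]) -> int:
    """Count parents plus every nested attachment, for comparison."""
    return sum(1 + total_item_count(d.attachments) for d in documents)


# One email with two attachments, plus one standalone report.
email = Document("status.msg", [Document("budget.xlsx"), Document("plan.docx")])
report = Document("report.pdf")
processed = [email, report]

parents = parent_document_count(processed)  # the figure the data count reflects
everything = total_item_count(processed)    # parents and attachments together
print(parents, everything)
```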

From the Manage Datasets page, you can view additional information about each dataset.

  • View the processing options (analysis) initially selected for the dataset.

  • Click anywhere in the row for the desired dataset and then click the open detail panel icon to display dataset details. The detail panel includes the options defined for the dataset as well as key information about the documents in the dataset.

    From the detail panel, you can edit, scan, activate/deactivate, and delete the dataset. For datasets with data, you can go to the data volume chart, focused on the selected dataset. For file system and SharePoint datasets containing data, you can go to the sensitive data heat map in Analyze, focused on the dataset.

    • Click the Change link next to the schedule information to open the Edit dataset dialog to the Schedule information.

    • If the grammar sets have been updated since the dataset was created, a warning icon and a Re-analyze link display next to the Grammar Sets information. Any grammar set applied to the dataset that has since been updated (that is, is "out of sync") displays in red text.

      IMPORTANT: Re-analyzing the dataset to update the grammar sets is optional. Changes to grammar sets are not automatically applied to the dataset. Re-analyzing a dataset for which the content has not been stored requires reprocessing the full dataset, which may take additional time and increase server load.

    • On the METRICS tab, view metrics related to the selected dataset.

    • On the GRAMMARS tab, view the grammar types and grammar rules defined for the dataset.

    • On the ACTIVITY tab, view the details of the last 10 activities performed. If more than 10 activities have been performed, click the MORE link to see the full list for the dataset on the Agent Activity page.

    • On the SECURITY tab, view the security policy and associated users and groups that have been given specific access to the dataset.

Datasets are not automatically scanned when the dataset is created. If you defined a schedule, the dataset scan will run according to that schedule. If a schedule was not defined for the dataset, you must initiate a scan of the dataset to process the data. Wait for the dataset initialization to complete before initiating a scan of a newly created dataset.

If you need to process a dataset outside of the scheduled run time, you can manually start a scan of the dataset. If you request to scan a dataset and the dataset is currently processing, the scan request is not acted upon.
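
The scan behavior described above can be sketched as a small state model. This is a simplified illustration, not Fusion's implementation: the State values, the Dataset class, and request_scan are hypothetical names, and treating a request made during initialization as "not started" is an assumption made so the example stays self-contained.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INITIALIZING = "initializing"
    IDLE = "idle"
    PROCESSING = "processing"


@dataclass
class Dataset:
    """Hypothetical dataset; a newly created dataset starts out initializing."""
    name: str
    state: State = State.INITIALIZING


def request_scan(dataset: Dataset) -> bool:
    """Start a scan if the dataset is idle; otherwise do nothing.

    Returns True when a scan was started and False when the request
    was not acted upon.
    """
    if dataset.state is not State.IDLE:
        return False
    dataset.state = State.PROCESSING
    return True


dataset = Dataset("finance")
first = request_scan(dataset)    # requested while still initializing
dataset.state = State.IDLE       # initialization completes
second = request_scan(dataset)   # manual scan starts
third = request_scan(dataset)    # requested while already processing
print(first, second, third)
```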

When editing a dataset, keep the following in mind.

  • If items exist in this dataset, some options may be dimmed and cannot be edited.

  • If items exist in this dataset, you cannot change the agent cluster type from a cloud-based cluster to an on-premises cluster, or from an on-premises cluster to a cloud-based cluster.

  • If you change any of the credentials, you are required to re-enter the password for the defined user.

  • You can load the schedule, attribute, and grammar options from a template, even if the dataset is based on a different template. This action overrides any existing schedule, attribute, or grammar options and can be refined further as needed.

  • Changes to existing grammar selections trigger a reprocessing of the dataset.

    If you have datasets that include grammars applied before the introduction of grammar sets, you will see the originally applied grammars in the list of selected grammars. If you select grammar sets, the originally applied grammars are overridden.
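
Two of the editing rules above (the cluster type restriction and the reprocessing triggered by grammar changes) can be expressed as a simple check. The sketch below is an assumption-laden illustration: DatasetConfig, check_edit, and the example grammar names are hypothetical and do not correspond to a documented Fusion interface.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical subset of the options that can be edited on a dataset."""
    cluster_type: str          # "cloud" or "on-premises"
    grammars: FrozenSet[str]   # names of the selected grammars (made up here)


def check_edit(current: DatasetConfig, proposed: DatasetConfig,
               item_count: int) -> Tuple[bool, bool]:
    """Return (allowed, needs_reprocessing) for a proposed edit."""
    cluster_changed = current.cluster_type != proposed.cluster_type
    if item_count > 0 and cluster_changed:
        # Items exist, so the agent cluster type cannot be switched.
        return (False, False)
    # Any change to the grammar selections triggers reprocessing.
    return (True, current.grammars != proposed.grammars)


current = DatasetConfig("cloud", frozenset({"PII"}))

switch_with_items = check_edit(
    current, DatasetConfig("on-premises", frozenset({"PII"})), item_count=10)
grammar_change = check_edit(
    current, DatasetConfig("cloud", frozenset({"PII", "PCI"})), item_count=10)
switch_when_empty = check_edit(
    current, DatasetConfig("on-premises", frozenset({"PII"})), item_count=0)
print(switch_with_items, grammar_change, switch_when_empty)
```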

CAUTION: Do not change the dataset location (File System directory, SharePoint site URL, or Content Manager dataset) unless the location has actually changed. The physical location (or Exchange group name) must be changed before you update the location in this dataset.

Do not change an Exchange group name unless the name of the group has changed. The group name in Exchange must be changed prior to updating the group name in this dataset. Changes to group names may affect tracking of delete activities.

If you want to create a dataset with the same definitions as the dataset you are editing, create a new dataset.

You can deactivate and then activate datasets as needed. A deactivated dataset cannot be processed. If the dataset was already processing data, no additional data is processed once the dataset is deactivated. Deactivated datasets cannot be edited either. Deactivated datasets display a gray icon next to the dataset name.

You can remove the connection to a dataset ("delete" the dataset) if there are no workspaces associated with the dataset and no documents associated with the dataset are on hold. If the dataset has associated documents, you can deactivate the dataset but you cannot delete it.