How to Find Duplicates via Representative Creator Step

Reported for version 10

How to find duplicate values in a column and replace them with randomly generated keys?

Representative Creator step can be used to achieve this task. The step computes a set of representative records from source data utilizing a process where records are classified by a specified rule and grouped by an appropriate key. 
The records are sorted for each group, and the best records are selected. The new representative record is collected from a specific data record or from aggregated values of a group of records.

To complete the task, a configuration of the Representative Creator step needs to be accomplished by following the guide below:

Step-by-step guide

  1. Define the Grouping strategy. It specifies how you want to group similar/duplicate records. For example, if you want to group by key, just put the name of the key column in GROUP BY table. Please note, you can use DQC expressions if you want to use more complex rules for grouping.
  2. Specify the Rules. In this section, you can specify what you want to do with the grouped records and the process of selecting record attributes (column values). To create a rule, double-click under the Attribute Sets column and then on the edit element. You will need to define the following information for every new rule:
    1. Representative attributes – Here you define what you want to have in the representative record. This section is very useful in the case you want to select best values from different records inside the group and create one record that represents the best record. What you want to see as the best is defined inside Selection Rules section that will be described below. Since only modification of duplicate records is the wanted approach and no merging or creating of a representative is necessary, you do not need to configure this section. Otherwise, you can specify what you want to see as representants by adding columns and choosing best values for related columns. You can automatically fill in default columns from source file with best values by clicking the Fill columns button.
    2. Instance attributes –  This section allows you to specify what you want to do with records that are inside the group, e.g. with all duplicates. Here you can modify the keys of the duplicate records. First, click on Fill columns button and choose all needed columns with this. prefix. The special word this refers to an instance record that is currently being processed. After you chose needed columns you need to set what you want to do with duplicate values. In order to generate new keys in the specified range, you need to double-click on the column that you want to modify and paste the following expression into the Expression field: 

      iif(this.[YOUR_COLUMN_NAME] = best.[YOUR_COLUMN_NAME]
      this.[YOUR_COLUMN_NAME],
      random([START_OF_THE_RANGE], [END_OF_THE_RANGE]))

      Here a logical operator iif and DQC function random() is used to generate random integers. Before using the expression, you need to specify your column name and range inside the brackets. More information about the function may be found in DQC Basics > Expressions section.

    3. Selection Rules – this section defines what you want to see as the best values in the representative record. It is possible to configure various patterns, acceptance rules and the order of records to define what is the best values wished to be achieved. When no representatives are created, this section is not needed.

  3. The last step is to choose the output of the step. Representative Creator step has two outputs: 

    1. Out contains all rows from the source file with modified records without representatives.

    2. Out_merged contains only representatives of grouped data, i.e. does not contain duplicated rows. As the representatives' appearance is not specified, the output will be empty.