How to Configure the XML Reader Step

Reported for version 10

An XML Reader step reads data from an XML file, extracts user-defined information into columns, and can output to one or more endpoints.

  1. In the XML Reader properties, point the File Name path to your XML file.
  2. In the XML Reader properties, add XPath expressions under "Data Streams". In our example, the expression nm:extract/nm:extractBody/nm:client with the name "client" means that the step processes each client record and sends the data to a stream named "client".

    Namespaces

    • Though not strictly necessary, a namespace prefix shortens the namespace URI "http://www.ataccama.com" (as in our example) to nm: in XPath expressions. To configure a namespace, click Namespaces in the left-hand pane of the XML Reader properties. You can find the namespace URI in the xmlns attribute at the beginning of your XML file. In our example it is <extract xmlns="http://www.ataccama.com">.
  3. To configure what is extracted from each client, such as the address, click to add a new data stream and populate the out field. In our example, the stream is named "address" and its XPath expression, nm:basic_data/nm:address, returns basic_data/address from the client entity.
  4. The last step is to define the output columns and their order for each expression. A column is defined by its name, its type, and the XPath expression that locates the data, such as name = "src_primary_key", type = "STRING", XPath = "../../@primary_key". This expression navigates two levels up in the file structure, from <address> to <client>, and reads the value of the "primary_key" attribute of the client element.

    XPath tips

    • DQC supports XPath expressions such as @attribute_name to extract the value of an attribute, or [_] to extract data only from an element with a specific value. You can also use the [element=value] construct.
      Using nm:enum_table[nm:table_name="PROVINCE"]/nm:rows/, as in the second step of the example, processes only the table that has this structure: <enum_table> <table_name> PROVINCE </table_name> </enum_table>
    • Using position() lets you select a single value in a structure with multiple values. In an example structure of <v>A</v> <v>B</v> <v>C</v>, the expression nm:v[position()=2] returns "B".

    Note: Columns must be defined for all streams, including address and client. Additionally, the root "XPath" in data streams under "Input File Name definition" does not support XPath expressions like those above. When specifying additional parameters, do so on lower-level streams.
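
To try the namespace prefix, the [element=value] predicate, and the positional selection outside DQC, a minimal Python sketch may help. It uses the standard xml.etree.ElementTree module, which is not what DQC uses internally and supports only a subset of XPath, so position()=2 is written in its short form [2]. The sample document and the second table name (COUNTRY) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map, playing the role of the Namespaces setting in the step.
NS = {"nm": "http://www.ataccama.com"}

# Made-up sample document; only the PROVINCE table mirrors the article's example.
SAMPLE = """\
<extract xmlns="http://www.ataccama.com">
  <enum_table>
    <table_name>PROVINCE</table_name>
    <rows><v>A</v><v>B</v><v>C</v></rows>
  </enum_table>
  <enum_table>
    <table_name>COUNTRY</table_name>
    <rows><v>X</v></rows>
  </enum_table>
</extract>
"""

root = ET.fromstring(SAMPLE)

# [element='value'] predicate: keep only the table whose table_name is PROVINCE.
province_rows = root.findall("nm:enum_table[nm:table_name='PROVINCE']/nm:rows", NS)

# Positional predicate: ElementTree writes position()=2 as [2].
second = province_rows[0].find("nm:v[2]", NS)

print(len(province_rows), second.text)
```

Note that the predicate compares the complete element text, so a value padded with spaces (as in " PROVINCE ") would not match 'PROVINCE' in this sketch.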


To read a parent value in the XML Reader step with an XPath expression, keep the following in mind. Once the XPath of a data stream is defined, the stream selects all data from the current level of processing. When the only Data Stream is at the lowest level of processing, no data is available from the levels above, so a parent value cannot be selected. You must create a separate data stream for every level you want to read data from.

When you need a parent value in an XPath expression, add another Data Stream to the XML Reader step at the parent level and select the desired attribute there. You can then use the parent XPath reference on the level below. However, doing so creates another output of the step, which you need to connect to a target (for example, a Trash step).

Values from different levels of the XML file cannot be combined unless the selected data from each level is in a separate data stream.
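
The idea of one stream per level can be pictured with a small Python analogy. This is only a sketch of the concept, not how the step is implemented: the outer loop stands in for the parent data stream, the inner loop for the nested stream, and the nested row can carry the parent's attribute only because the parent level is being iterated as well. The second client and its city are made-up values.

```python
import xml.etree.ElementTree as ET

# Trimmed-down structure based on the article's sample; client 2 is hypothetical.
SAMPLE = """\
<extractBody>
  <client primary_key="1">
    <basic_data><address><city>Toronto</city></address></basic_data>
  </client>
  <client primary_key="2">
    <basic_data><address><city>Exampleville</city></address></basic_data>
  </client>
</extractBody>
"""

body = ET.fromstring(SAMPLE)
client_rows = []   # output of the parent-level stream
address_rows = []  # output of the nested stream

for client in body.findall("client"):  # parent stream: one row per client
    client_rows.append({"primary_key": client.get("primary_key")})
    for address in client.findall("basic_data/address"):  # nested stream
        address_rows.append({
            # Parent value, available because the parent level is iterated too.
            "src_primary_key": client.get("primary_key"),
            "city": address.findtext("city"),
        })

print(client_rows)
print(address_rows)
```

As in the step, the parent-level rows (client_rows here) exist as a separate output even when only the nested rows are of interest.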

The following sample configuration of the XML Reader step demonstrates reading data from the level above:

Getting a parent value in XPath expression...
<!-- (Xml Reader) --><step id="Xml Reader" className="com.ataccama.dqc.tasks.io.xml.reader.XmlReader" disabled="false" mode="NORMAL">
  <properties fileName="[your_xml_file_name]">
    <dataFormatParameters ... />
    <namespaces>
      <prefixNamespacePair prefix="[namespace_one]" namespace="[url_to_namespace_one]"/>
      <prefixNamespacePair prefix="[namespace_two]" namespace="[url_to_namespace_two]"/>
    </namespaces>
    <recordsOutputs>
      <recordsOutput rowsRootXPath="[namespace_one]:[element_one]/[namespace_two]:[element_two]/[namespace_two]:[element_two]" out="[element_two_parent]">
        <attributes>
          <attribute name="[attribute_one]" type="STRING" xmlValue="false" xpath="[@attribute_one]"/>
        </attributes>
        <recordsOutputs>
          <recordsOutput rowsRootXPath="[namespace_two]:[element_two]" out="[element_two]">
            <attributes>
              <attribute name="[attribute_two]" type="STRING" xmlValue="false" xpath="[@attribute_two]"/>
              <attribute name="[attribute_three]" type="FLOAT" xmlValue="false" xpath="[@attribute_three]"/>
              <attribute name="[attribute_one]" type="STRING" xmlValue="false" xpath="../[@attribute_one]"/>
            </attributes>
          </recordsOutput>
        </recordsOutputs>
        <shadowColumns/>
      </recordsOutput>
    </recordsOutputs>
  </properties>
  <visual-constraints bounds="192,120,48,48" layout="vertical"/>
</step>
<connection className="com.ataccama.dqc.model.elements.connections.StandardFlowConnection" disabled="false">
  <source step="Xml Reader" endpoint="[element_two_parent]"/>
  <target step="Trash" endpoint="in"/>
  <visual-constraints>
    <bendpoints/>
  </visual-constraints>
</connection>
<connection className="com.ataccama.dqc.model.elements.connections.StandardFlowConnection" disabled="false">
  <source step="Xml Reader" endpoint="[element_two]"/>
  <target step="out" endpoint="in"/>
  <visual-constraints>
    <bendpoints/>
  </visual-constraints>
</connection>

Configure XML properties

Specify Data Streams

After defining the path to the input XML file, it is necessary to define at least one Data Stream. A Data Stream is a set of user-defined attributes populated with data from the XML file's nodes and their attributes. Each Data Stream is bound to an XPath expression that limits the selection of values and attributes in this Data Stream to the specified node and its descendants.


Data Stream Definition

Specify Data in each column

Sample XML:

Data to be extracted...
<?xml version="1.0" encoding="UTF-8"?>
<extract>
    <extractHeader>
        <type_of_records>PARTY</type_of_records>
        <number_of_records>3</number_of_records>
        <date>2012-12-12</date>
    </extractHeader>
    <extractBody>
        <client primary_key="1">
            <basic_data>
                <customer_type>P</customer_type>
                <personal_data>
                    <sin>000000000</sin>
                    <name>Dr. John Smith</name>
                    <gender>M</gender>
                    <birth_date>12/16/1978</birth_date>
                    <card>88682239496</card>
                </personal_data>
                <address>
                    <street>8500 Leslie</street>
                    <city>Toronto</city>
                    <province>Ontario</province>
                    <zip>L3T7M8</zip>
                </address>
                <contact>
                    <cont_type>email</cont_type>
                    <cont_value>john.smith@.com</cont_value>
                </contact>
            </basic_data>
            <specific_data>
                <meta_last_updated>2006/12/06</meta_last_updated>
            </specific_data>
        </client>
        <client primary_key="2">
            ...
        </client>
        <client primary_key="3">
            ...
        </client>                
    </extractBody>
</extract>
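
To see how column definitions map onto this sample, the following Python sketch evaluates a set of name-to-path pairs against each client element. It is an illustration using the standard xml.etree.ElementTree module, not DQC itself. The column names in CLIENT_COLUMNS are hypothetical, and the sample is shortened to the first client because the other two are elided above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above (first client only, no namespace).
SAMPLE = """\
<extract>
  <extractBody>
    <client primary_key="1">
      <basic_data>
        <personal_data>
          <name>Dr. John Smith</name>
          <gender>M</gender>
        </personal_data>
        <address>
          <city>Toronto</city>
          <zip>L3T7M8</zip>
        </address>
      </basic_data>
    </client>
  </extractBody>
</extract>
"""

# Hypothetical column definitions: output column name -> path relative to <client>.
CLIENT_COLUMNS = {
    "name": "basic_data/personal_data/name",
    "gender": "basic_data/personal_data/gender",
    "city": "basic_data/address/city",
    "zip": "basic_data/address/zip",
}

root = ET.fromstring(SAMPLE)
rows = []
for client in root.findall("extractBody/client"):  # one output row per client
    row = {"src_primary_key": client.get("primary_key")}  # attribute value
    for column, path in CLIENT_COLUMNS.items():
        row[column] = client.findtext(path)  # element text, or None if missing
    rows.append(row)

print(rows)
```

Each dictionary in rows corresponds to one record of the stream, with the keys in the order the columns were defined.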

Tutorials and further information regarding XPath expressions can be found at http://www.w3schools.com/xml/xpath_nodes.asp