How to Read and Write Multiple Files with Reader Steps

Reported for version 11

DQC lets you read several files with the same metadata and process them serially in one plan execution. Conversely, it is possible to write the data flow to multiple files, partitioning it by an input column.

This functionality is useful when you have regular data dumps that need to be processed in batch.

How It Works

Reading and writing multiple files works by masking their names with column variables.

When reading files:

  • the mask is used to select the files to read
  • columns used as variables are populated with the values taken from file names and serve as identifiers of the origin of the data

When writing files:

  • the mask specifies columns whose values are used to partition the data flow 
  • values from the columns used as variables are used to construct output file names

Column variables are written in curly brackets {}.
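
Conceptually, the mask works like a pattern: the literal parts must match the file name exactly, and each {column} placeholder captures the corresponding piece of the name. The following Python sketch illustrates the mechanism only; it is not DQC code, and mask_to_regex and match_mask are hypothetical helper names.

```python
import re

def mask_to_regex(mask):
    """Turn a mask such as 'file_{number}.csv' into a compiled regex.

    Literal parts of the mask must match exactly; every {column}
    placeholder becomes a named capture group.
    """
    pattern = ""
    for part in re.split(r"(\{[^}]+\})", mask):
        if part.startswith("{") and part.endswith("}"):
            pattern += f"(?P<{part[1:-1]}>.+?)"  # one named group per column variable
        else:
            pattern += re.escape(part)           # literal text must match as-is
    return re.compile(pattern + r"\Z")

def match_mask(mask, file_name):
    """Return the column values extracted from a matching file name, or None."""
    m = mask_to_regex(mask).match(file_name)
    return m.groupdict() if m else None

print(match_mask("file_{number}.csv", "file_1.csv"))
# {'number': '1'}
print(match_mask("{proc_year}-{proc_month}-{proc_day}.csv", "2017-12-01.csv"))
# {'proc_year': '2017', 'proc_month': '12', 'proc_day': '01'}
print(match_mask("2017-{proc_month}-{proc_day}.csv", "2016-12-31.csv"))
# None: the file is skipped because the literal part '2017' does not match
```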

What Can Be Masked

You can mask the names of both folders and files, so it is possible to process files in a nested directory structure. Likewise, when writing data, it is possible to create directory structures from the values of columns used as variables.

Columns used as variables must exist: either use the input columns or add them as shadow columns.

The reader steps support column variables of all data types; however, the best practice is to define the variables as STRING, which, for example, preserves leading zeros in values such as a proc_day of 01.

Understanding Masks and Column Variables

Let's look at a few sets of files and the ways we can mask them for reading:

| File Set | Mask | What Will Happen |
| --- | --- | --- |
| file_1.csv, file_2.csv, file_3.csv | file_{number}.csv | The number column will be populated with 1, 2, and 3, depending on which file the data is coming from. |
| 2016-12-31.csv, 2017-12-01.csv, 2017-12-02.csv, 2017-12-03.csv | {proc_year}-{proc_month}-{proc_day}.csv | The columns will be populated with values from the source file names: proc_year (2016, 2017), proc_month (12), and proc_day (31, 01, 02, 03). |
| The same file set | 2017-{proc_month}-{proc_day}.csv | 2016-12-31.csv will not be processed because it does not match the literal 2017 prefix. |
| 11/01.csv, 11/02.csv, 11/03.csv, 12/01.csv, 12/02.csv, 12/03.csv | {proc_month}/{proc_day}.csv | In this case, the processed files are stored in a different folder per month. This example demonstrates how to mask filesystem paths. |

In the case of writing into multiple files, the folder structures and files are created according to the values of columns used as variables.
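
The writing side can be pictured in the same way: each row is routed to a file whose path is built by substituting the row's column values into the mask. Below is a minimal Python sketch of such partitioning, again only an illustration under the assumption of dict-shaped rows; write_partitioned is a hypothetical name, not the writer's actual implementation.

```python
import csv
import os

def write_partitioned(rows, mask, fieldnames):
    """Partition 'rows' (dicts) into files whose paths are built from 'mask'.

    Each {column} placeholder is replaced with the row's value for that
    column, so rows with equal values land in the same file. If the mask
    contains '/', the corresponding folder structure is created as well.
    """
    handles = {}
    try:
        for row in rows:
            path = mask.format(**row)  # e.g. '2017-12-01.csv' or '12/01.csv'
            if path not in handles:
                folder = os.path.dirname(path)
                if folder:
                    os.makedirs(folder, exist_ok=True)  # build folders from values
                f = open(path, "w", newline="")
                # The partitioning columns do not have to appear in the output.
                writer = csv.DictWriter(f, fieldnames=fieldnames,
                                        extrasaction="ignore")
                writer.writeheader()
                handles[path] = (f, writer)
            handles[path][1].writerow(row)
    finally:
        for f, _ in handles.values():
            f.close()
```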

Supported Steps

  • Text File Reader
  • Fixed Width File Reader
  • Text File Writer

Reading Multiple Files

Below is an example of a setup and configuration for reading and processing multiple files from the same folder.

Reading and processing several files from the same folder requires that the files have the same metadata (identical column names and data types).

Configuration Example

Input Data

Assume the following data coming from three files (the file name is displayed in the virtual [origin] column):

| src_name | src_gender | src_birth_date | [origin] |
| --- | --- | --- | --- |
| John Smith | M | 12/16/1978 | 2017-12-01.csv |
| Jane Smith | F | 16.12.1978 | 2017-12-02.csv |
| Mark Spencer | M | 781216 | 2017-12-03.csv |
| Edgar Poe | M | 11/16/1978 | 2017-12-03.csv |
| William Shakespeare | M | 16.11.1978 | 2017-12-03.csv |

Reader Step Configuration

Assume a Text File Reader configured with the file name mask {proc_year}-{proc_month}-{proc_day}.csv. Three shadow columns, proc_year, proc_month, and proc_day, have been added; they will be populated with values taken from the file names.
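
Using the hypothetical match_mask helper from the earlier sketch, the effect of this configuration can be pictured as matching each file name against the mask and copying the captured values into the shadow columns:

```python
mask = "{proc_year}-{proc_month}-{proc_day}.csv"
for name in ["2017-12-01.csv", "2017-12-02.csv", "2017-12-03.csv"]:
    print(name, match_mask(mask, name))  # match_mask from the sketch above
# 2017-12-01.csv {'proc_year': '2017', 'proc_month': '12', 'proc_day': '01'}
# 2017-12-02.csv {'proc_year': '2017', 'proc_month': '12', 'proc_day': '02'}
# 2017-12-03.csv {'proc_year': '2017', 'proc_month': '12', 'proc_day': '03'}
```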

Output Data

After processing, the following data flow is formed:

| src_name | src_gender | src_birth_date | proc_year | proc_month | proc_day | [origin] |
| --- | --- | --- | --- | --- | --- | --- |
| John Smith | M | 12/16/1978 | 2017 | 12 | 01 | 2017-12-01.csv |
| Jane Smith | F | 16.12.1978 | 2017 | 12 | 02 | 2017-12-02.csv |
| Mark Spencer | M | 781216 | 2017 | 12 | 03 | 2017-12-03.csv |
| Edgar Poe | M | 11/16/1978 | 2017 | 12 | 03 | 2017-12-03.csv |
| William Shakespeare | M | 16.11.1978 | 2017 | 12 | 03 | 2017-12-03.csv |

If you need to write the data in the same multi-file format, you can use the same columns to partition the data stream into multiple files when writing data.

Writing Data into Several Files

Similarly to reading multiple files, it is possible to write a data flow to several files, partitioning the flow by one or more columns. This can be useful when you need to separate the data flow based on an attribute such as year, customer type, or any other category. In another scenario, if your input consisted of multiple files, you can process the data and write it back to an identical set of files.

Configuration Example

Assuming the output data from the previous example, to process the data and write it to an identical set of files, we configure the Text File Writer to use the three column variables in the output file name mask, {proc_year}-{proc_month}-{proc_day}.csv.

It is also possible to use column variables to mask folder names both when reading and writing data.
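
Continuing the earlier hypothetical write_partitioned sketch, writing the processed rows with this mask would recreate the original file set:

```python
rows = [
    {"src_name": "John Smith", "src_gender": "M", "src_birth_date": "12/16/1978",
     "proc_year": "2017", "proc_month": "12", "proc_day": "01"},
    {"src_name": "Jane Smith", "src_gender": "F", "src_birth_date": "16.12.1978",
     "proc_year": "2017", "proc_month": "12", "proc_day": "02"},
    {"src_name": "Mark Spencer", "src_gender": "M", "src_birth_date": "781216",
     "proc_year": "2017", "proc_month": "12", "proc_day": "03"},
    {"src_name": "Edgar Poe", "src_gender": "M", "src_birth_date": "11/16/1978",
     "proc_year": "2017", "proc_month": "12", "proc_day": "03"},
    {"src_name": "William Shakespeare", "src_gender": "M",
     "src_birth_date": "16.11.1978",
     "proc_year": "2017", "proc_month": "12", "proc_day": "03"},
]
write_partitioned(rows, "{proc_year}-{proc_month}-{proc_day}.csv",
                  fieldnames=["src_name", "src_gender", "src_birth_date"])
# Creates 2017-12-01.csv, 2017-12-02.csv, and 2017-12-03.csv, routing each
# row by its proc_* values and recreating the file set the data came from.
```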

Reading and Writing Multiple Files on a Cluster

Reading multiple files into a single data flow and writing into multiple files in a cluster context differ from a local run in the following ways:

  • No support for column variables.

    Currently, Text File Reader is the only step that supports column variables in a cluster context, and only if the input files are located on Amazon S3 and the plan is processed via Spark.

  • The ability to read the whole folder by masking file names with "*". For example, /data/in/* reads the files in the in folder, and /data/*/* reads the content of the data folder and its subfolders (see the sketch after this list).
  • When writing, the data flow is divided into several files based on Hadoop's internal partitioning criteria, which the user cannot modify.
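
For comparison, the "*" masking behaves much like the glob-style paths that Spark itself accepts. A minimal PySpark sketch with a placeholder bucket name; this illustrates the underlying idea, not DQC configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-file-read").getOrCreate()

# 'data/in/*' reads every file in the 'in' folder;
# 'data/*/*' would read the content of 'data' and its subfolders.
df = spark.read.csv("s3a://example-bucket/data/in/*", header=True)
df.show()
```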