Convert XML to CSV data using NiFi

In this article, we are going to see how to read data from XML document and convert it into a CSV file using NiFi processors.

Data

This, Census population by Zip Code, data comes from the 2010 Census Profile of General Population and Housing Characteristics and in XML format.

<response>
   <row>
      <row _id="row-rxi4~97qp_yzif" _uuid="00000000-0000-0000-59F8-45EF46B4850A" _position="0" _address="https://data.lacity.org/resource/nxs9-385f/row-rxi4~97qp_yzif">
         <zip_code>91371</zip_code>
         <total_population>1</total_population>
         <total_households>1</total_households>
         <average_household_size>1.00</average_household_size>
         <median_age>73.5</median_age>
         <total_males>0</total_males>
         <total_females>1</total_females>
      </row>
      <row _id="row-pmdj_3j5i~upbu" _uuid="00000000-0000-0000-39F4-389B31B6944F" _position="0" _address="https://data.lacity.org/resource/nxs9-385f/row-pmdj_3j5i~upbu">
         <zip_code>90001</zip_code>
         <total_population>57110</total_population>
         <median_age>26.6</median_age>
         <total_males>28468</total_males>
         <total_females>28642</total_females>
         <total_households>12971</total_households>
         <average_household_size>4.40</average_household_size>
      </row>
....
....
</response>

Problem Statement

Requirement is to read the XML file and convert the data into CSV format. This is an example and in real time, we will be dealing with continuous stream of xml data.

Assumptions

Data files are present in local machine itself and NiFi has access to the files i.e. NiFi can read the contents of the file.

NiFi Processors

The flow required to solve the problem can be constructed using below processors.

  • GetFile
  • SplitXml
  • MergeRecord
  • UpdateAttribute
  • PutFile

Couple of things to note here.

1. SplitXml processor splits an xml file into multiple flow files and MergeRecord processor is used to read and combine all the flow files, using CSVRecordSetWriter controller service, into one CSV file.

2. UpdateAttribute processor is used to change the file name extension to .csv from .xml; Otherwise source xml filename is used as is to store generated csv file.

GetFile

Read the data file from specified location in the local machine, converts it into flow file and give to downstream processor. Requires source data directory path. All files in the given directory will be pulled hence file name is not required. Minimum permission required is ‘read’ otherwise NiFi will ignore files.

SplitXml

Splits an XML File into multiple separate FlowFiles, each comprising a child or descendant of the original root element. This processor has only one required property i.e. split depth to indicate the XML-nesting depth to start splitting XML fragments. A depth of 1 means split the root’s children, whereas a depth of 2 means split the root’s children’s children and so forth. In case of our example data, split depth is 2. Because we want all children of ‘row’ tag.

SplitXml processor
SplitXml processor

MergeRecord

This processor merges together multiple flow file into a single flow file. We want our output to be stored in single file instead of multiple files. Without this processor, each ‘row’ document in xml will be converted into CSV and stored in separate/dedicated file.

This processor requires two controller services; One controller service to read incoming data i.e. XML file and another controller service to write out the input data as CSV.

MergeRecord processor
MergeRecord processor

UpdateAttribute

This processor is used to update the filename attribute. Without this processor, result flow file is written as CSV file but with source file name. Using this processor, one can capture the results in file name of their choice.

UpdateAttribute processor
UpdateAttribute processor

PutFile

Writes the contents of a FlowFile to the local file system; Minimum requirement is to set valid directory path

NiFi Controller Services

MergeRecord processor requires below controller services to read XML data and convert into CSV data.

  • XMLReader
  • CSVRecordSetWriter

XMLReader

Controller service used to read XML content and create records. Our example data is simple one and does not have complex data. To know more about creating XMLReader for complex data, please visit here. Here schema text is used to define the format of input data and is given below in ‘Data Schema’ section.

XMLReader controller service
XMLReader controller service

CSVRecordSetWriter

This controller service is used to write the contents of a RecordSet as CSV data. Here schema text is used to define the format of output data and is given below in ‘Data Schema’ section.

CSVRecordSetWriter controller service
CSVRecordSetWriter controller service

One important attribute to note in this controller service is ‘Include Header Line‘; Setting this to ‘true‘ will include header line in resulting csv file. Header column value is derived from schema text.

Data Schema

Schema is used to interpret the incoming and outgoing data.

{
"type": "record",
"namespace": "forkedblog.census",
"name": "forkedblog_census",
"fields": [
{"name": "zip_code", "type":"int"},
{"name": "total_population", "type":"long"},
{"name": "median_age", "type":"float"},
{"name": "total_males", "type":"long"},
{"name": "total_females", "type":"long"},
{"name": "total_households", "type":"long"},
{"name": "average_household_size", "type":"double"}
]
}

Pre-Requisites

There is no pre-requisites except source data and source/target data path.

NiFi Flow

NiFi flow to convert XML data to CSV
NiFi flow to convert XML data to CSV

What happens when xml data is not as per schema?

There are two possible scenarios.

  1. Xml has extra attributes that are not defined in schema – In this case, extra attribute will be omitted.
  2. Xml has missing attributes but defined in schema – In this case, resulting csv will have empty value for those attribute values.

Hope this is useful and you enjoyed reading. Please let us know your thoughts/queries/feedback in comments. Thank you.

Leave a Reply

avatar
  Subscribe  
Notify of