How to load JSON data to Hive database from file

In this article, we are going to see the steps to load JSON data into Hive database table. Our previous article explained the importance of ‘LOAD DATA’ command and steps to load csv data into table. Visit our previous article to understand more about LOAD DATA command usage and syntax.

JSON SerDe?

SerDe is short form for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.

A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.

To use JSON SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.JsonSerDe in create table query.

Behaviors

  1. Data is stored as plain text file in JSON format
  2. org.apache.hadoop.hive.serde2.JsonSerDe does not support JSON data that starts with square bracket ‘[‘
  3. JsonSerDe is based on the text SerDe and each newline is considered as a new record
  4. The JsonSerDe for JSON files is available in Hive 0.12 and later
  5. Starting in Hive 3.0.0, JsonSerDe is added to Hive Serde as “org.apache.hadoop.hive.serde2.JsonSerDe” (HIVE-19211)

Valid Data Format 1

{"world_rank": 1,"country": "China","population": 1388232694,"World": 0.185},
{"world_rank": 2,"country": "India","population": 1342512706,"World": 0.179},
{"world_rank": 3,"country": "U.S.","population": 326474013,"World": 0.043},
{"world_rank": 4,"country": "Indonesia","population": 263510146,"World": 0.035}

Valid Data Format 2

{"world_rank": 1,"country": "China","population": 1388232694,"World": 0.185}
{"world_rank": 2,"country": "India","population": 1342512706,"World": 0.179}
{"world_rank": 3,"country": "U.S.","population": 326474013,"World": 0.043}
{"world_rank": 4,"country": "Indonesia","population": 263510146,"World": 0.035}

Please note that there is no comma character at the end of each documents in valid data format 2 data.

Invalid Data Format 1

[
{"world_rank": 1,"country": "China","population": 1388232694,"World": 0.185},
{"world_rank": 2,"country": "India","population": 1342512706,"World": 0.179},
{"world_rank": 3,"country": "U.S.","population": 326474013,"World": 0.043},
{"world_rank": 4,"country": "Indonesia","population": 263510146,"World": 0.035}
]

JsonSerDe does not support JSON data that starts with square bracket ‘[‘.

Invalid Data Format 2

{
    "world_rank": "1",
    "country": "China",
    "population": "1388232694",
    "World": "0.185"
  },
  {
    "world_rank": "2",
    "country": "India",
    "population": "1342512706",
    "World": "0.179"
  },
  {
    "world_rank": "3",
    "country": "U.S.",
    "population": "326474013",
    "World": "0.043"
  },
  {
    "world_rank": "4",
    "country": "Indonesia",
    "population": "263510146",
    "World": "0.035"
  }

JsonSerDe is based on the text SerDe and each newline is considered as a new record.

Load JSON Data Command Examples – Valid Format Data

Data Format 1 – Flat JSON

Data

{"world_rank": 1,"country": "China","population": 1388232694,"World": 0.185},
{"world_rank": 2,"country": "India","population": 1342512706,"World": 0.179},
{"world_rank": 3,"country": "U.S.","population": 326474013,"World": 0.043},
{"world_rank": 4,"country": "Indonesia","population": 263510146,"World": 0.035}

Create Table Statement

CREATE TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;

Load Command

load data local inpath "/home/hive/json" into table world_population;

/home/hive/json” is the data file path where the data is stored in local file system. Skip ‘local‘ keyword in load command if the data is available in hdfs location.

Data Format 2 – Complex JSON

Data

{"county":"Albany","category":"Animal","groups":{"taxonomic_group":"Amphibians","taxonomic_subgroup":"Salamanders"},"species_name":{"scientific_name":"Pseudotriton ruber","common_name":"Red Salamander"},"year_last_documented":"1990-1999","listing_status":{"ny_listing_status":"Game with no open season","federal_listing_status":"not listed"},"conservation_rank":{"state_conservation_rank":"S3S4","global_conservation_rank":"G5"},"distribution_status":"Recently Confirmed"},
{"county":"Albany","category":"Animal","groups":{"taxonomic_group":"Amphibians","taxonomic_subgroup":"Frogs and Toads"},"species_name":{"scientific_name":"Lithobates palustris","common_name":"Pickerel Frog"},"year_last_documented":"1990-1999","listing_status":{"ny_listing_status":"Game with open season","federal_listing_status":"not listed"},"conservation_rank":{"state_conservation_rank":"S5","global_conservation_rank":"G5"},"distribution_status":"Recently Confirmed"},
{"county":"Albany","category":"Animal","groups":{"taxonomic_group":"Amphibians","taxonomic_subgroup":"Frogs and Toads"},"species_name":{"scientific_name":"Lithobates clamitans","common_name":"Green Frog"},"year_last_documented":"1990-1999","listing_status":{"ny_listing_status":"Game with open season","federal_listing_status":"not listed"},"conservation_rank":{"state_conservation_rank":"S5","global_conservation_rank":"G5"},"distribution_status":"Recently Confirmed"},
{"county":"Albany","category":"Animal","groups":{"taxonomic_group":"Amphibians","taxonomic_subgroup":"Frogs and Toads"},"species_name":{"scientific_name":"Lithobates catesbeianus","common_name":"Bullfrog"},"year_last_documented":"1990-1999","listing_status":{"ny_listing_status":"Game with open season","federal_listing_status":"not listed"},"conservation_rank":{"state_conservation_rank":"S5","global_conservation_rank":"G5"},"distribution_status":"Recently Confirmed"}

Create Table Statement

CREATE TABLE biodiversity (
county string,
category string,
groups struct<taxonomic_group: string,
    taxonomic_subgroup: string>,
species_name struct<scientific_name: string,
    common_name: string>,
year_last_documented string,
listing_status struct<ny_listing_status: string,
    federal_listing_status: string>,
conservation_rank struct<state_conservation_rank: string,
    global_conservation_rank: string>,
  distribution_status string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;

STRUCT allows us to create complex data structures.

  • A complex data type, representing multiple fields of a single item.
  • Each field inside struct can be a different type
  • A field within a STRUCT can also be another STRUCT, or an ARRAY or a MAP
  • Maximum nesting depth is 100
  • Individual field is accessed by using dot notation, such as struct_column_name.field_name

Load Command

load data local inpath "/home/hive/json" into table biodiversity;

SELECT examples – For complex JSON

> select groups from biodiversity limit 2;
 +----------------------------------------------------+
 |                       groups                       |
 +----------------------------------------------------+
 | {"taxonomic_group":"Amphibians","taxonomic_subgroup":"Salamanders"} |
 | {"taxonomic_group":"Amphibians","taxonomic_subgroup":"Frogs and Toads"} |
 +----------------------------------------------------+
> select groups.taxonomic_group,groups.taxonomic_subgroup  from biodiversity limit 2;
 +------------------+---------------------+
 | taxonomic_group  | taxonomic_subgroup  |
 +------------------+---------------------+
 | Amphibians       | Salamanders         |
 | Amphibians       | Frogs and Toads     |
 +------------------+---------------------+

Data Format 3 – Complex JSON

Data

{"xaxis": 244, "yaxis": 255, "labels": ["p1", "p2"]}
{"xaxis": 256, "yaxis": 266, "labels": ["p1", "p2", "p3"]}

Create Table Statement

CREATE TABLE drawing
(
 xaxis INT,
 yaxis INT,
 labels ARRAY<STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' 
STORED AS TEXTFILE;

ARRAY – A complex data type that can represent an arbitrary number of ordered elements. The elements can be scalars such as list of integers, list of string or another complex type such as STRUCT, MAP.

Load Command

load data local inpath "/home/hive/json" into table drawing;

SELECT query examples

> select * from drawing;
--------+----------+--------------------+
| drawing.x| drawing.y| drawing.labels  |
+----------+----------+-----------------+
| 244      | 255      | ["p1","p2"]     |
| 256      | 266      | ["p1","p2","p3"]|
+----------+----------+-----------------+

Exceptions with Invalid Format Data

Invalid Format 1 – JSON not in expected format

Data

[
{"world_rank": "1","country": "China","population": "1388232694","World": "0.185"},
{"world_rank": "2","country": "India","population": "1342512706","World": "0.179"}
]

Exception

Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected (state=,code=0)

Data Json starts with ‘[‘ and hive JSON SerDe does not support this format. Load command will not report any errors while running it. The exception noted above occurs only when SELECT query is executed on table after loading data.

Invalid Format 2 – Data type mismatch

Create Table Statement

CREATE TABLE world_population (
   world_rank INT,
   country STRING,
   population BIGINT,
   world DOUBLE
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;

Data

{"world_rank": "1","country": "China","population": "1388232694","World": "0.185"},
{"world_rank": "2","country": "India","population": "1342512706","World": "0.179"},
{"world_rank": "3","country": "U.S.","population": "326474013","World": "0.043"},
{"world_rank": "4","country": "Indonesia","population": "263510146","World": "0.035"}

Note that all the values are string in actual data. But in create table statement, different data types are mentioned. This will result in ‘JsonParseException‘ while running SELECT query after loading the data.

Exception

Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
  at Source: java.io.ByteArrayInputStream@79f267c2; line: 1, column: 17

Hope this article is informative. Please let us know your thoughts in comments section. Happy reading..!

Leave a Reply

avatar
  Subscribe  
Notify of