How to Load Data into a Hive Database from a CSV File

In this article we are going to look at one of the helpful features of Hive – loading data into a database table from a file. Assume that you have a 100GB data file and need to move its contents into a database table so that you can run SQL queries against the data.

Without ‘LOAD DATA’, the only option is to read the file and insert its rows into the table using a supported programming language. This approach is time consuming: 1) it takes time to develop working code that reads the file and pushes the rows to the database, and 2) the data movement itself is slow because every row travels from the file to the database individually.

With the LOAD command, we can bulk load data into the table without writing any code beyond the ‘LOAD’ SQL statement itself. Load operations are pure copy/move operations that move data files into the locations corresponding to Hive tables. Note that ‘LOAD’ does not allow any data transformation while loading data into tables.

A pure file-move load like this is not something traditional relational databases offer; even their bulk-load commands parse and insert rows rather than simply relocating the data files.

LOAD DATA Command Syntax

LOAD DATA INPATH "/path/to/file_or_directory" INTO TABLE TABLE_NAME;

LOAD DATA LOCAL INPATH "/path/to/file_or_directory" INTO TABLE TABLE_NAME;

LOAD DATA

This is the SQL command that loads the data from a file and writes it into the table.

INPATH

INPATH "/path/to/file_or_directory"

Specifies the path where Hive will look for data files.

filepath accepts,

  • a relative path in HDFS, such as data/source
  • an absolute path in HDFS, such as /user/hive/data/source
  • a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/data/source

filepath can,

  • refer to a file in HDFS, in which case Hive will move the file into the table
  • refer to a directory in HDFS, in which case Hive will move all the files within that directory into the table. In either case, filepath addresses a set of files (see the example after this list).
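For instance, a sketch of loading every file under an HDFS directory, assuming the /user/hive/data/source directory from the example above contains only data files and a target table named world_population already exists:

LOAD DATA INPATH "/user/hive/data/source" INTO TABLE world_population;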

To learn about HDFS setup, read our article ‘Setup Hadoop 3.x 3 Node cluster’.

LOCAL

If the keyword LOCAL is specified, then the load command will look for filepath in the local file system. The properties of filepath mentioned above still apply; the one difference is that the user can specify a full URI for local files, such as file:///user/hive/data/source.
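For example, a minimal sketch of a local load, assuming the world_population table defined later in this article already exists:

-- relative or absolute path on the local file system of the machine running the Hive client
LOAD DATA LOCAL INPATH "/home/hive/csv_default" INTO TABLE world_population;

-- the same location written as a full URI
LOAD DATA LOCAL INPATH "file:///home/hive/csv_default" INTO TABLE world_population;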

The load command will try to copy all the files addressed by filepath to the target file system, which is inferred from the table's location attribute. If no location is given while creating the table, the default value from the Hive configuration is used; the user can override it with the ‘LOCATION’ keyword and a path in the CREATE TABLE statement. The copied data files are then moved into the table's location.
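You can check which location a table actually uses with DESCRIBE FORMATTED, which prints the table's Location attribute among other metadata (shown here against the world_population table created later in this article):

DESCRIBE FORMATTED world_population;
-- look for the 'Location:' row in the output to see the table's target directory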

File Format

Behind the scenes Hive uses serializers and deserializers, called SerDes, to read and write table data, so the format of the loaded file must match the table's SerDe. The Hive SerDe library is in org.apache.hadoop.hive.serde2.

Built-in SerDes

  • Avro (Hive 0.9.1 and later)
  • ORC (Hive 0.11 and later)
  • RegEx
  • Thrift
  • Parquet (Hive 0.13 and later)
  • CSV (Hive 0.14 and later)
  • JsonSerDe (Hive 0.12 and later in hcatalog-core)

This article focuses on the CSV SerDe. To use it, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde in the CREATE TABLE query.

Dataset

I am using a modified version of world population data in CSV format. The actual data, field separator, quote character and line separator will change depending on the example we are working on.

world_rank,country,population,world,extra_data
1,China,1388232694,0.19,"3.0,1.2,4.0"
2,India,1342512706,0.18,"3.0,1.2,4.0"
3,U.S.,326474013,0.04,"3.0,1.2,4.0"

The CSV SerDe class supports the following SERDEPROPERTIES. A sketch that sets all three explicitly appears after the list.

  • separatorChar (Field separator, Default: Comma)
  • quoteChar (Quote character, Default: Double quotes)
  • escapeChar (Escape character, Default: Back slash)
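As a sketch, here is a CREATE TABLE statement that sets all three properties explicitly. The values shown are simply the defaults listed above, so this behaves the same as omitting SERDEPROPERTIES entirely:

CREATE EXTERNAL TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
    "separatorChar" = ",",
    "quoteChar"     = "\"",
    "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

The escapeChar value is written as "\\" because the backslash itself has to be escaped inside the property string.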

LOAD DATA Command Used

load data local inpath "/home/hive/csv_default" into table world_population;

The rules below are common to all the CREATE TABLE queries in this article.

  • ‘ROW FORMAT SERDE’ is required to read data from CSV
  • ‘ROW FORMAT SERDE’ must come before ‘STORED AS TEXTFILE’

Example Use cases

Data Format 1

In the first example, we are using the above data as-is.

  • Field Separator: ,
  • Quote Character: "

These are all default values for CSV SerDe attributes.

world_rank,country,population,world,extra_data
1,China,1388232694,0.19,"3.0,1.2,4.0"
2,India,1342512706,0.18,"3.0,1.2,4.0"
3,U.S.,326474013,0.04,"3.0,1.2,4.0"

CREATE EXTERNAL TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION "/home/hive/csv_target";

Points to note

  • LOCATION is an HDFS location; it will be empty before running the load command, and the loaded file will be present here afterwards
  • The HDFS warehouse folder (default) will not contain the data. Instead the data is loaded in the path specified with LOCATION.
  • Both /warehouse/tablespace/managed/hive/loadtest.db and /warehouse/tablespace/external/hive/loadtest.db folders will be empty.
  • /home/hive/csv_target – the data files will be present here
  • DROP TABLE does not delete the data in the LOCATION folder
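A quick way to confirm the load worked, assuming the table and load command above, is to query the table directly:

select * from world_population limit 3;
select count(*) from world_population;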

Data Format 2

In this example, we are using the above data as-is. The difference is that we are not specifying LOCATION while creating the table.

  • Field Separator: ,
  • Quote Character: "

These are all default values for CSV SerDe attributes.

world_rank,country,population,world,extra_data
1,China,1388232694,0.19,"3.0,1.2,4.0"
2,India,1342512706,0.18,"3.0,1.2,4.0"
3,U.S.,326474013,0.04,"3.0,1.2,4.0"

CREATE EXTERNAL TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;

Points to note:

  • LOCATION is not specified, hence the data will be loaded in the default HDFS location – /warehouse/tablespace/external/hive/loadtest.db
  • The default location will be empty before running the load command, and the loaded file will be present here afterwards
  • The managed warehouse folder will not contain the data –
    • HDFS file path /warehouse/tablespace/managed/hive/loadtest.db will be empty.
    • HDFS file path /warehouse/tablespace/external/hive/loadtest.db will have data in the world_population folder.
  • DROP TABLE does not delete the data in the HDFS folder

List HDFS:

[hdfs@forkedblog]$ hdfs dfs -ls /warehouse/tablespace/external/hive/loadtest.db/world_population
Found 1 items
-rw-rw-rw-+  3 hive hadoop       5103 2018-06-17 15:32 /warehouse/tablespace/external/hive/loadtest.db/world_population/world_population.csv

After running insert query:

insert into world_population values (197,'test c1',900,0.01,'');
[hdfs@forkedblog]$ hdfs dfs -ls /warehouse/tablespace/external/hive/loadtest.db/world_population
Found 2 items
-rw-rw-rw-+  3 hive hadoop       5103 2018-06-17 15:32 /warehouse/tablespace/external/hive/loadtest.db/world_population/world_population.csv
drwxrwx---+  - hive hadoop          0 2018-06-17 15:29 /warehouse/tablespace/external/hive/loadtest.db/world_population/delta_0000001_0000001_0000

Data Format 3

In this example, we are using the above data as-is. The differences are that we are not specifying LOCATION while creating the table and we are using a managed table – note that there is no EXTERNAL keyword.

  • Field Separator: ,
  • Quote Character: "

These are all default values for CSV SerDe attributes.

world_rank,country,population,world,extra_data
1,China,1388232694,0.19,"3.0,1.2,4.0"
2,India,1342512706,0.18,"3.0,1.2,4.0"
3,U.S.,326474013,0.04,"3.0,1.2,4.0"

CREATE TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
    "separatorChar" = ",",
    "quoteChar"     = "\""
)
STORED AS TEXTFILE;

Points to note:

  • LOCATION is not specified, hence the data will be loaded in the default HDFS location – /warehouse/tablespace/managed/hive/loadtest.db
  • The default location will be empty before running the load command, and the loaded file will be present here afterwards
  • The HDFS warehouse folder will contain the data –
    • HDFS file path /warehouse/tablespace/managed/hive/loadtest.db will have data in the world_population folder.
    • HDFS file path /warehouse/tablespace/external/hive/loadtest.db will be empty.
  • DROP TABLE deletes the data in the HDFS folder.
  • SERDEPROPERTIES is specified in the CREATE TABLE query with the default values; it is not required when the values are the defaults, but it can still be used.
  • The rules are the same for any separator, quote and escape character.

List HDFS:

[hdfs@forkedblog]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/loadtest.db/world_population
Found 1 items
drwxrwx---+  - hive hadoop          0 2018-06-17 15:27 /warehouse/tablespace/managed/hive/loadtest.db/world_population/delta_0000001_0000001_0000

After running insert query:

insert into world_population values (197,'test c1',900,0.01,'');
[hdfs@forkedblog]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/loadtest.db/world_population
Found 2 items
drwxrwx---+  - hive hadoop          0 2018-06-17 15:27 /warehouse/tablespace/managed/hive/loadtest.db/world_population/delta_0000001_0000001_0000
drwxrwx---+  - hive hadoop          0 2018-06-17 15:29 /warehouse/tablespace/managed/hive/loadtest.db/world_population/delta_0000002_0000002_0000

Data Format 4

Hyphen as field separator and single quote as quote character

world_rank-country-population-world-extra_data
1-China-1388232694-0.19-'3.0-1.2-4.0'

CREATE TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
    "separatorChar" = "-",
    "quoteChar"     = "'"
)
STORED AS TEXTFILE;
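The load step itself is unchanged regardless of the separator; only the SerDe properties differ. As a sketch, assuming the hyphen-separated file was saved locally under a path such as /home/hive/csv_hyphen (a hypothetical path for this example):

load data local inpath "/home/hive/csv_hyphen" into table world_population;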

Data Format 5

‘|’ as field separator and double quote as quote character

world_rank|country|population|world|extra_data
1|China|1388232694|0.19|"3.0|1.2|4.0"

CREATE TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
    "separatorChar" = "|"
)
STORED AS TEXTFILE;

The quote character is the default, so there is no need to mention it in the SERDEPROPERTIES. This applies to all properties: if the value is the default, it does not need to be specified.

Data Format 6

Tab as field separator and single quote as quote character

world_rank    country    population    world    extra_data
1    China    1388232694    0.19    '3.0    1.2    4.0'

CREATE TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
    "separatorChar" = "\t",
    "quoteChar"     = "'"
)
STORED AS TEXTFILE;

If you already have data in the HDFS location, you don’t need to load it again with LOAD DATA, which would move the files into the default Hive location /user/hive/warehouse. All you have to do is define the table using the EXTERNAL keyword, which leaves the files in place but creates the table definition in the Hive metastore.
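As a sketch of that approach (assuming the CSV files already sit in an HDFS directory such as /home/hive/existing_csv, a hypothetical path), the table is simply defined on top of them and no LOAD DATA step is needed:

CREATE EXTERNAL TABLE world_population (
  world_rank INT,
  country STRING,
  population BIGINT,
  world DOUBLE,
  extra_data STRING
 )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION "/home/hive/existing_csv";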

Please read our follow-up articles for more information:

  • Loading data from sub directories (Partitioned Table)

Please let us know your thoughts in the comments section. Happy reading!
