How Hive Stores Schemas
I recently had to sync the schemas from a hive database to a normal MySQL database for reasons I won’t bother going into. The exercise required me to get all columns from all tables in hive for each DB though, and I found that this was not amazingly straight-forward.
The hive metastore DB is a normal MySQL/etc database with a hive schema in it. The hive schema holds the hive tables though. So, the information schema is irrelevant to hive; to get the hive table details, you have to interrogate the TBLS table, for example. To get columns, you need to interrogate COLUMNS_V2, and to get the databases themselves, you look toward the DBS table.
Missing Columns
Other posts I’ve seen would leave you here – but I found when I joined these three tables together (plus some others you have to route through), I was still missing some columns in certain tables for some reason. It turns out that partition columns are implicit in hive. So, in a file system of hive data (like HDFS), a partition column in a table is literally represented by just having the directory named with the partition value; there are no columns with the value in the data.
What this means is that partition columns don’t show up in these normal tables. You have to look to a separate partition keys table to find them with a separate query.
The Working Query
The query below finds all columns of any kind and sorts them in the order they’ll appear when you select from a table in hive/presto/etc. I hope it helps you!
select db_name, table_name, column_name from ( SELECT d.NAME as db_name, t.TBL_NAME as table_name, c.COLUMN_NAME as column_name, c.INTEGER_IDX as idx FROM DBS d JOIN TBLS t on t.DB_ID = d.DB_ID JOIN SDS s on t.SD_ID = s.SD_ID JOIN COLUMNS_V2 c on c.CD_ID = s.CD_ID WHERE d.NAME = :dbName union SELECT d.NAME as db_name, t.TBL_NAME as table_name, k.PKEY_NAME as column_name, 50000 + k.INTEGER_IDX FROM DBS d JOIN TBLS t on t.DB_ID = d.DB_ID join PARTITION_KEYS k on t.TBL_ID = k.TBL_ID where d.NAME = :dbName ) x order by db_name, table_name, idx