Fill in the Blanks
Implement and get drilled on Hive table design problems.
Practice Problem #1 - Create a simple Hive Table: Create a table named employees with four columns (id, name, age, department). The ROW FORMAT DELIMITED clause specifies how Hive should interpret the data to fit this table schema.

Solution:

CREATE TABLE employees (
  id INT,
  name STRING,
  age INT,
  department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
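To sanity-check the delimiter, it helps to see what a matching input row looks like. A minimal sketch; the sample row and query below are illustrative, not part of the original problem:

-- A line in the backing text file: 1,Alice,34,Engineering
SELECT name, department
FROM employees
WHERE age > 30;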
Practice Problem #2 - Design a Hive Table: Let's say you're given a dataset containing user activity logs with fields: timestamp, user_id, activity_type, and activity_details. Design a Hive table to store this data, partitioned by activity_type and optimized for querying by user_id.

Solution:

CREATE TABLE user_activity_logs (
  `timestamp` BIGINT,  -- backquoted: timestamp is a reserved word in recent Hive versions
  user_id INT,
  activity_details STRING
)
PARTITIONED BY (activity_type STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/path/to/user/activity/logs';
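A point lookup shows why this layout helps: the activity_type predicate prunes to one partition, and the user_id predicate narrows the scan to one bucket. The values below are illustrative:

SELECT activity_details
FROM user_activity_logs
WHERE activity_type = 'login'
  AND user_id = 1234;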
Practice Problem #3: Given a dataset of product reviews with fields: review_id, product_id, review_text, user_id, rating, and review_date (in YYYY-MM-DD format), design a Hive table to store this data, optimized for querying reviews by product and date. Think about how you would partition and store the table.

Solution:

CREATE EXTERNAL TABLE product_reviews (
  review_id INT,
  review_text STRING,
  user_id INT,
  rating INT
)
PARTITIONED BY (product_id INT, review_date STRING)
STORED AS ORC
LOCATION '/path/to/product/reviews';
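Because the table is EXTERNAL, Hive does not discover partition directories on its own. A sketch of the usual follow-up, with illustrative partition values:

-- Register all partition directories found under the table location
MSCK REPAIR TABLE product_reviews;
-- Or register one partition explicitly
ALTER TABLE product_reviews ADD PARTITION (product_id=42, review_date='2024-01-15');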
Practice Problem #4 - Daily Transaction Logs: Design a Hive table for the following scenario. Scenario: You have daily transaction logs containing transaction_id, user_id, transaction_amount, and transaction_date.

Solution:

CREATE TABLE daily_transactions (
  transaction_id INT,
  user_id INT,
  transaction_amount DECIMAL(10,2)
)
PARTITIONED BY (transaction_date DATE)
STORED AS PARQUET;
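Loading one day of logs is then a static-partition insert. A minimal sketch, assuming a hypothetical staging table staged_transactions with the same columns:

INSERT INTO TABLE daily_transactions PARTITION (transaction_date='2024-01-15')
SELECT transaction_id, user_id, transaction_amount
FROM staged_transactions
WHERE transaction_date = '2024-01-15';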
Practice Problem #5 - User Login History: Design a Hive table for the following scenario. Scenario: Track user login history with login_id, user_id, login_timestamp, and logout_timestamp, optimizing for queries on monthly login activity.

Solution:

-- Staging table creation
CREATE EXTERNAL TABLE login_history_staging (
  login_id INT,
  user_id INT,
  login_timestamp TIMESTAMP,
  logout_timestamp TIMESTAMP
)
STORED AS ORC
LOCATION '/path/to/login/history';

-- Main table creation with partitioning
CREATE TABLE login_history (
  login_id INT,
  user_id INT,
  login_timestamp TIMESTAMP,
  logout_timestamp TIMESTAMP
)
PARTITIONED BY (login_month STRING)
STORED AS ORC;

-- Data insertion from staging to main table
INSERT INTO TABLE login_history PARTITION (login_month)
SELECT
  login_id,
  user_id,
  login_timestamp,
  logout_timestamp,
  date_format(login_timestamp, 'yyyy-MM') AS login_month
FROM login_history_staging;
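The final INSERT derives login_month from the data itself, i.e. it is a dynamic-partition insert, which Hive rejects by default in strict mode. These session settings typically have to precede it:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;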
Practice Problem #6 - Product Inventory: Design a Hive table for the following scenario. Scenario: Store product inventory records including product_id, store_location, inventory_count, and last_update_date, optimized for querying inventory by location.

Solution:

CREATE EXTERNAL TABLE product_inventory (
  product_id INT,
  inventory_count INT,
  last_update_date DATE
)
PARTITIONED BY (store_location STRING)
STORED AS ORC
LOCATION '/path/to/inventory';
Practice Problem #7 - Customer Feedback Messages: Design a Hive table for the following scenario. Scenario: Manage customer feedback with feedback_id, customer_id, message, category, and received_date, optimized for reviewing feedback by category and date.

Solution:

CREATE TABLE customer_feedback (
  feedback_id INT,
  customer_id INT,
  message STRING
)
PARTITIONED BY (category STRING, received_date DATE)
STORED AS TEXTFILE;
Practice Problem #8 - Sales Records with Geography: Design a Hive table for the following scenario. Scenario: Analyze sales records with sale_id, product_id, sale_amount, sale_date, and region, needing frequent access by region and specific dates.

Solution:

CREATE TABLE sales_records (
  sale_id INT,
  product_id INT,
  sale_amount DECIMAL(10,2)
)
PARTITIONED BY (region STRING, sale_date DATE)
STORED AS ORC;
Problem #9: Financial Transactions (Parquet) Scenario: You are tasked with managing a dataset of financial transactions that includes transaction_id, account_id, amount, transaction_type, and transaction_date. You need efficient querying by account_id and transaction_date.

Solution:

CREATE TABLE financial_transactions (
  transaction_id INT,
  account_id INT,
  amount DECIMAL(10,2),
  transaction_type STRING
)
PARTITIONED BY (transaction_date DATE)
CLUSTERED BY (account_id) INTO 100 BUCKETS
STORED AS PARQUET;
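A typical query then prunes to a single date partition, and the bucketing on account_id limits how much of that partition is scanned. Values are illustrative:

SELECT transaction_id, amount, transaction_type
FROM financial_transactions
WHERE transaction_date = '2024-03-01'
  AND account_id = 1001;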
Problem #10: Customer Profiles (Avro) Scenario: You need to store customer profile data including customer_id, name, email, signup_date, and last_login. The data must support evolving schemas as new fields might be added in the future.

Solution:

CREATE EXTERNAL TABLE customer_profiles (
  customer_id INT,
  name STRING,
  email STRING,
  signup_date DATE
)
PARTITIONED BY (year INT)
STORED AS AVRO
LOCATION '/path/to/customer/profiles';
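The scenario also lists last_login, which the solution above omits; with Avro, such a field can be added later without rewriting existing data. A sketch of that evolution step (CASCADE also updates the metadata of existing partitions):

ALTER TABLE customer_profiles ADD COLUMNS (last_login TIMESTAMP) CASCADE;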
Problem #11: Event Logs (ORC) Scenario: Design a table to manage web event logs with fields: event_id, user_id, event_type, event_details, and event_date. You expect frequent complex queries involving multiple fields.

Solution:

CREATE TABLE event_logs (
  event_id INT,
  user_id INT,
  event_type STRING,
  event_details STRING
)
PARTITIONED BY (event_date DATE)
STORED AS ORC;
Problem #12: Marketing Campaign Data (JSON) Scenario: Store marketing campaign data including campaign_id, campaign_name, start_date, end_date, and budget. The data is occasionally queried by marketing analysts who prefer a readable format for ad-hoc queries.

Solution:

CREATE EXTERNAL TABLE marketing_campaigns (
  campaign_id INT,
  campaign_name STRING,
  budget DECIMAL(10,2)
)
PARTITIONED BY (start_year INT)
STORED AS JSONFILE
LOCATION '/path/to/marketing/campaigns';
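STORED AS JSONFILE exists only from Hive 4.0 onward. On earlier versions, a common equivalent is the HCatalog JSON SerDe over text files; a sketch under that assumption:

CREATE EXTERNAL TABLE marketing_campaigns (
  campaign_id INT,
  campaign_name STRING,
  budget DECIMAL(10,2)
)
PARTITIONED BY (start_year INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/path/to/marketing/campaigns';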
Problem #13: Research Data (TEXTFILE) Scenario: Store research data including record_id, researcher_id, study_field, data, and entry_date. Data is primarily textual and occasionally accessed.

Solution:

CREATE TABLE research_data (
  record_id INT,
  researcher_id INT,
  study_field STRING,
  data STRING
)
PARTITIONED BY (entry_date DATE)
STORED AS TEXTFILE;
Problem #14: Implementing Constraints Scenario: Design a table to store user information with a unique user_id and a reference to a department_id from a departments table. Note that Hive does not enforce primary or foreign keys; they are informational, and most Hive versions require the DISABLE NOVALIDATE qualifier on each constraint.

Solution:

CREATE TABLE departments (
  department_id INT,
  department_name STRING,
  CONSTRAINT pk_dept PRIMARY KEY (department_id) DISABLE NOVALIDATE
)
STORED AS ORC;

CREATE TABLE users (
  user_id INT,
  user_name STRING,
  department_id INT,
  CONSTRAINT pk_user PRIMARY KEY (user_id) DISABLE NOVALIDATE,
  CONSTRAINT fk_dept FOREIGN KEY (department_id) REFERENCES departments (department_id) DISABLE NOVALIDATE
)
STORED AS ORC;
Problem #15: Table Schema Modification Scenario: You already have a products table and need to add a new column category_id and change the data type of the existing price column.

Solution:

ALTER TABLE products ADD COLUMNS (category_id INT);
ALTER TABLE products CHANGE COLUMN price price DECIMAL(10,2);
Problem #16: Hive SQL Query Scenario: Calculate and update the average sales for each product category in a sales_summary table.

Solution:

INSERT OVERWRITE TABLE sales_summary
SELECT category_id, AVG(sales_amount)
FROM sales
GROUP BY category_id;
Problem #17: Loading Data into a Hive Table Scenario: Load data into a transactions table from a CSV file located in HDFS.

Solution:

LOAD DATA INPATH '/path/to/transactions.csv' INTO TABLE transactions;
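Note that LOAD DATA INPATH moves the file from its current HDFS location into the table's directory rather than copying it. To load from the local filesystem instead, add LOCAL (the path below is illustrative):

LOAD DATA LOCAL INPATH '/local/path/to/transactions.csv' INTO TABLE transactions;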
Problem #18: Filtering, Aggregation, and Join Scenario: Retrieve the total sales by department from a sales table and a departments table.

Solution:

SELECT d.department_name, SUM(s.amount) AS total_sales
FROM sales s
JOIN departments d ON s.department_id = d.department_id
GROUP BY d.department_name;
Problem #19: Temporary Tables Scenario: Create a temporary table to hold daily sales data for analysis within a session.

Solution:

CREATE TEMPORARY TABLE temp_daily_sales AS
SELECT transaction_date, SUM(amount) AS daily_total
FROM sales
GROUP BY transaction_date;
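The table lives only for the current session and is dropped automatically when the session ends. An illustrative follow-up query against it:

SELECT transaction_date, daily_total
FROM temp_daily_sales
ORDER BY daily_total DESC
LIMIT 10;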
Problem #20: Creating and Using Views Scenario: Create a view to simplify access to customer demographics data without exposing sensitive details like personal IDs or payment methods.

Solution:

CREATE VIEW customer_demographics AS
SELECT customer_name, age, region
FROM customers;
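Analysts then query the view like any table, without ever seeing the excluded columns. An illustrative example:

SELECT region, AVG(age) AS avg_age
FROM customer_demographics
GROUP BY region;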
Problem #21: Configuring Schema Evolution for Avro

Avro format supports schema evolution out of the box with Hive. When using Avro, the schema is stored with the data, which helps Hive manage changes seamlessly. However, to explicitly enable and manage Avro schema evolution, you can use table properties like the following:

CREATE TABLE avro_table (
  id INT,
  name STRING
)
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs://path/to/schema/file');
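Evolving the schema then amounts to publishing a new .avsc file and pointing the table at it. A sketch; the versioned path is hypothetical:

ALTER TABLE avro_table SET TBLPROPERTIES ('avro.schema.url'='hdfs://path/to/schema/file_v2');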
Problem #22: Configuring Schema Evolution for ORC

ORC supports schema evolution through its columnar format and metadata storage capabilities. To manage schema changes, you might need to adjust the following Hive configuration settings:

SET hive.exec.orc.split.strategy=ETL;
SET hive.exec.orc.schema.evolution=true;

hive.exec.orc.split.strategy: Setting this to ETL optimizes reading of ORC files that might have evolved schemas.
hive.exec.orc.schema.evolution: Enabling this allows Hive to handle changes in the ORC file schemas over time.

Additionally, when creating ORC tables, consider enabling column renaming as part of schema evolution:

CREATE TABLE orc_table (
  id INT,
  first_name STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.schema.evolution.case.sensitive'='false',
  'orc.column.renames.allowed'='true'
);
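With renames allowed, renaming a column is a metadata-only operation. An illustrative example against the table above (the new name given_name is hypothetical):

ALTER TABLE orc_table CHANGE COLUMN first_name given_name STRING;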
Problem #23: Configuring Schema Evolution for PARQUET

Parquet also supports schema evolution to a degree, especially with additions of new columns. To use Parquet effectively with schema evolution in Hive, ensure that your Hive version and settings align with Parquet's capabilities:

CREATE TABLE parquet_table (
  id INT,
  name STRING
)
STORED AS PARQUET;

For schema evolution in Parquet, the changes are mostly handled transparently by Hive, but you can ensure better management with configurations like:

SET parquet.enable.dictionary=true;
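Since Parquet tolerates appended columns, extending the table is a plain ALTER; the new column below is hypothetical:

ALTER TABLE parquet_table ADD COLUMNS (email STRING);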