Fill in the Blanks
Implement and get drilled on Hive table design problems.
Practice Problem #1 - Create a simple Hive Table: Create a table named employees with four columns (id, name, age, department). The ROW FORMAT DELIMITED clause specifies how Hive should interpret the data to fit this table schema.

Solution:

CREATE TABLE employees (
  id INT,
  name STRING,
  age INT,
  department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
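To sanity-check the delimiter, it helps to see what a matching input row looks like. A minimal sketch; the sample row and query below are illustrative, not part of the original problem:

-- A line in the backing text file: 1,Alice,34,Engineering
SELECT name, department
FROM employees
WHERE age > 30;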
Practice Problem #2 - Design a Hive Table: Let's say you're given a dataset containing user activity logs with fields: timestamp, user_id, activity_type, and activity_details. Design a Hive table to store this data, partitioned by activity_type and optimized for querying by user_id.

Solution:

CREATE TABLE user_activity_logs (
  `timestamp` BIGINT,  -- backquoted: timestamp is a reserved word in recent Hive versions
  user_id INT,
  activity_details STRING
)
PARTITIONED BY (activity_type STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/path/to/user/activity/logs';
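A point lookup shows why this layout helps: the activity_type predicate prunes to one partition, and the user_id predicate narrows the scan to one bucket. The values below are illustrative:

SELECT activity_details
FROM user_activity_logs
WHERE activity_type = 'login'
  AND user_id = 1234;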
Practice Problem #3: Given a dataset of product reviews with fields: review_id, product_id, review_text, user_id, rating, and review_date (in YYYY-MM-DD format), design a Hive table to store this data, optimized for querying reviews by product and date. Think about how you would partition and store the table.

Solution:

CREATE EXTERNAL TABLE product_reviews (
  review_id INT,
  review_text STRING,
  user_id INT,
  rating INT
)
PARTITIONED BY (product_id INT, review_date STRING)
STORED AS ORC
LOCATION '/path/to/product/reviews';
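Because the table is EXTERNAL, Hive does not discover partition directories on its own. A sketch of the usual follow-up, with illustrative partition values:

-- Register all partition directories found under the table location
MSCK REPAIR TABLE product_reviews;
-- Or register one partition explicitly
ALTER TABLE product_reviews ADD PARTITION (product_id=42, review_date='2024-01-15');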
Practice Problem #4 - Daily Transaction Logs: Design a Hive table for the following scenario. Scenario: You have daily transaction logs containing transaction_id, user_id, transaction_amount, and transaction_date.

Solution:

CREATE TABLE daily_transactions (
  transaction_id INT,
  user_id INT,
  transaction_amount DECIMAL(10,2)
)
PARTITIONED BY (transaction_date DATE)
STORED AS PARQUET;
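Loading one day of logs is then a static-partition insert. A minimal sketch, assuming a hypothetical staging table staged_transactions with the same columns:

INSERT INTO TABLE daily_transactions PARTITION (transaction_date='2024-01-15')
SELECT transaction_id, user_id, transaction_amount
FROM staged_transactions
WHERE transaction_date = '2024-01-15';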
Practice Problem #5 - User Login History: Design a Hive table for the following scenario. Scenario: Track user login history with login_id, user_id, login_timestamp, and logout_timestamp, optimizing for queries on monthly login activity.

Solution:

-- Staging table creation
CREATE EXTERNAL TABLE login_history_staging (
  login_id INT,
  user_id INT,
  login_timestamp TIMESTAMP,
  logout_timestamp TIMESTAMP
)
STORED AS ORC
LOCATION '/path/to/login/history';

-- Main table creation with partitioning
CREATE TABLE login_history (
  login_id INT,
  user_id INT,
  login_timestamp TIMESTAMP,
  logout_timestamp TIMESTAMP
)
PARTITIONED BY (login_month STRING)
STORED AS ORC;

-- Data insertion from staging to main table
INSERT INTO TABLE login_history PARTITION (login_month)
SELECT
  login_id,
  user_id,
  login_timestamp,
  logout_timestamp,
  date_format(login_timestamp, 'yyyy-MM') AS login_month
FROM login_history_staging;
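The final INSERT derives login_month from the data itself, i.e. it is a dynamic-partition insert, which Hive rejects by default in strict mode. These session settings typically have to precede it:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;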
Practice Problem #6 - Product Inventory: Design a Hive table for the following scenario. Scenario: Store product inventory records including product_id, store_location, inventory_count, and last_update_date, optimized for querying inventory by location.

Solution:

CREATE EXTERNAL TABLE product_inventory (
  product_id INT,
  inventory_count INT,
  last_update_date DATE
)
PARTITIONED BY (store_location STRING)
STORED AS ORC
LOCATION '/path/to/inventory';
Practice Problem #7 - Customer Feedback Messages: Design a Hive table for the following scenario. Scenario: Manage customer feedback with feedback_id, customer_id, message, category, and received_date, optimized for reviewing feedback by category and date.

Solution:

CREATE TABLE customer_feedback (
  feedback_id INT,
  customer_id INT,
  message STRING
)
PARTITIONED BY (category STRING, received_date DATE)
STORED AS TEXTFILE;
Practice Problem #8 - Sales Records with Geography: Design a Hive table for the following scenario. Scenario: Analyze sales records with sale_id, product_id, sale_amount, sale_date, and region, needing frequent access by region and specific dates.

Solution:

CREATE TABLE sales_records (
  sale_id INT,
  product_id INT,
  sale_amount DECIMAL(10,2)
)
PARTITIONED BY (region STRING, sale_date DATE)
STORED AS ORC;
Problem #9: Financial Transactions (Parquet) Scenario: You are tasked with managing a dataset of financial transactions that includes transaction_id, account_id, amount, transaction_type, and transaction_date. You need efficient querying by account_id and transaction_date.

Solution:

CREATE TABLE financial_transactions (
  transaction_id INT,
  account_id INT,
  amount DECIMAL(10,2),
  transaction_type STRING
)
PARTITIONED BY (transaction_date DATE)
CLUSTERED BY (account_id) INTO 100 BUCKETS
STORED AS PARQUET;
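A typical query then prunes to a single date partition, and the bucketing on account_id limits how much of that partition is scanned. Values are illustrative:

SELECT transaction_id, amount, transaction_type
FROM financial_transactions
WHERE transaction_date = '2024-03-01'
  AND account_id = 1001;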
Problem #10: Customer Profiles (Avro) Scenario: You need to store customer profile data including customer_id, name, email, signup_date, and last_login. The data must support evolving schemas as new fields might be added in the future.

Solution:

CREATE EXTERNAL TABLE customer_profiles (
  customer_id INT,
  name STRING,
  email STRING,
  signup_date DATE
)
PARTITIONED BY (year INT)
STORED AS AVRO
LOCATION '/path/to/customer/profiles';
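The scenario also lists last_login, which the solution above omits; with Avro, such a field can be added later without rewriting existing data. A sketch of that evolution step (CASCADE also updates the metadata of existing partitions):

ALTER TABLE customer_profiles ADD COLUMNS (last_login TIMESTAMP) CASCADE;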
Problem #11: Event Logs (ORC) Scenario: Design a table to manage web event logs with fields: event_id, user_id, event_type, event_details, and event_date. You expect frequent complex queries involving multiple fields.

Solution:

CREATE TABLE event_logs (
  event_id INT,
  user_id INT,
  event_type STRING,
  event_details STRING
)
PARTITIONED BY (event_date DATE)
STORED AS ORC;
Problem #12: Marketing Campaign Data (JSON) Scenario: Store marketing campaign data including campaign_id, campaign_name, start_date, end_date, and budget. The data is occasionally queried by marketing analysts who prefer a readable format for ad-hoc queries.

Solution:

CREATE EXTERNAL TABLE marketing_campaigns (
  campaign_id INT,
  campaign_name STRING,
  budget DECIMAL(10,2)
)
PARTITIONED BY (start_year INT)
STORED AS JSONFILE
LOCATION '/path/to/marketing/campaigns';
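STORED AS JSONFILE exists only from Hive 4.0 onward. On earlier versions, a common equivalent is the HCatalog JSON SerDe over text files; a sketch under that assumption:

CREATE EXTERNAL TABLE marketing_campaigns (
  campaign_id INT,
  campaign_name STRING,
  budget DECIMAL(10,2)
)
PARTITIONED BY (start_year INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/path/to/marketing/campaigns';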
Problem #13: Research Data (TEXTFILE) Scenario: Store research data including record_id, researcher_id, study_field, data, and entry_date. Data is primarily textual and occasionally accessed.

Solution:

CREATE TABLE research_data (
  record_id INT,
  researcher_id INT,
  study_field STRING,
  data STRING
)
PARTITIONED BY (entry_date DATE)
STORED AS TEXTFILE;
Problem #14: Implementing Constraints Scenario: Design a table to store user information with a unique user_id and a reference to a department_id from a departments table. Note that Hive does not enforce primary or foreign keys; they are informational, and most Hive versions require the DISABLE NOVALIDATE qualifier on each constraint.

Solution:

CREATE TABLE departments (
  department_id INT,
  department_name STRING,
  CONSTRAINT pk_dept PRIMARY KEY (department_id) DISABLE NOVALIDATE
)
STORED AS ORC;

CREATE TABLE users (
  user_id INT,
  user_name STRING,
  department_id INT,
  CONSTRAINT pk_user PRIMARY KEY (user_id) DISABLE NOVALIDATE,
  CONSTRAINT fk_dept FOREIGN KEY (department_id) REFERENCES departments (department_id) DISABLE NOVALIDATE
)
STORED AS ORC;
Problem #15: Table Schema Modification Scenario: You already have a products table and need to add a new column category_id and change the data type of the existing price column.

Solution:

ALTER TABLE products ADD COLUMNS (category_id INT);
ALTER TABLE products CHANGE COLUMN price price DECIMAL(10,2);
Problem #16: Hive SQL Query Scenario: Calculate and update the average sales for each product category in a sales_summary table.

Solution:

INSERT OVERWRITE TABLE sales_summary
SELECT category_id, AVG(sales_amount)
FROM sales
GROUP BY category_id;
Problem #17: Loading Data into a Hive Table Scenario: Load data into a transactions table from a CSV file located in HDFS.

Solution:

LOAD DATA INPATH '/path/to/transactions.csv' INTO TABLE transactions;
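Note that LOAD DATA INPATH moves the file from its current HDFS location into the table's directory rather than copying it. To load from the local filesystem instead, add LOCAL (the path below is illustrative):

LOAD DATA LOCAL INPATH '/local/path/to/transactions.csv' INTO TABLE transactions;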
Problem #18: Filtering, Aggregation, and Join Scenario: Retrieve the total sales by department from a sales table and a departments table.

Solution:

SELECT d.department_name, SUM(s.amount) AS total_sales
FROM sales s
JOIN departments d ON s.department_id = d.department_id
GROUP BY d.department_name;
Problem #19: Temporary Tables Scenario: Create a temporary table to hold daily sales data for analysis within a session.

Solution:

CREATE TEMPORARY TABLE temp_daily_sales AS
SELECT transaction_date, SUM(amount) AS daily_total
FROM sales
GROUP BY transaction_date;
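The table lives only for the current session and is dropped automatically when the session ends. An illustrative follow-up query against it:

SELECT transaction_date, daily_total
FROM temp_daily_sales
ORDER BY daily_total DESC
LIMIT 10;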
Problem #20: Creating and Using Views Scenario: Create a view to simplify access to customer demographics data without exposing sensitive details like personal IDs or payment methods.

Solution:

CREATE VIEW customer_demographics AS
SELECT customer_name, age, region
FROM customers;
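Analysts then query the view like any table, without ever seeing the excluded columns. An illustrative example:

SELECT region, AVG(age) AS avg_age
FROM customer_demographics
GROUP BY region;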
Problem #21: Configuring Schema Evolution for Avro

Avro format supports schema evolution out of the box with Hive. When using Avro, the schema is stored with the data, which helps Hive manage changes seamlessly. However, to explicitly enable and manage Avro schema evolution, you can use table properties like the following:

CREATE TABLE avro_table (
  id INT,
  name STRING
)
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs://path/to/schema/file');
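Evolving the schema then amounts to publishing a new .avsc file and pointing the table at it. A sketch; the versioned path is hypothetical:

ALTER TABLE avro_table SET TBLPROPERTIES ('avro.schema.url'='hdfs://path/to/schema/file_v2');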
Problem #22: Configuring Schema Evolution for ORC

ORC supports schema evolution through its columnar format and metadata storage capabilities. To manage schema changes, you might need to adjust the following Hive configuration settings:

SET hive.exec.orc.split.strategy=ETL;
SET hive.exec.orc.schema.evolution=true;

hive.exec.orc.split.strategy: Setting this to ETL optimizes reading of ORC files that might have evolved schemas.
hive.exec.orc.schema.evolution: Enabling this allows Hive to handle changes in the ORC file schemas over time.

Additionally, when creating ORC tables, consider enabling column renaming as part of schema evolution:

CREATE TABLE orc_table (
  id INT,
  first_name STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.schema.evolution.case.sensitive'='false',
  'orc.column.renames.allowed'='true'
);
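With renames allowed, renaming a column is a metadata-only operation. An illustrative example against the table above (the new name given_name is hypothetical):

ALTER TABLE orc_table CHANGE COLUMN first_name given_name STRING;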
Problem #23: Configuring Schema Evolution for PARQUET

Parquet also supports schema evolution to a degree, especially with additions of new columns. To use Parquet effectively with schema evolution in Hive, ensure that your Hive version and settings align with Parquet's capabilities:

CREATE TABLE parquet_table (
  id INT,
  name STRING
)
STORED AS PARQUET;

For schema evolution in Parquet, the changes are mostly handled transparently by Hive, but you can ensure better management with configurations like:

SET parquet.enable.dictionary=true;
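Since Parquet tolerates appended columns, extending the table is a plain ALTER; the new column below is hypothetical:

ALTER TABLE parquet_table ADD COLUMNS (email STRING);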