
OpenHouse with Spark & S3

In this guide, we will quickly set up a running environment and experiment with some simple SQL commands. Our environment includes the core OpenHouse services (such as the Catalog Service and House Table Service), a Spark 3.1 engine, and a MinIO S3 instance. In this walkthrough, we will create tables in OpenHouse, insert data into them, and query the data. For more information on the various Docker environments and how to set them up, please see the SETUP.md guide.

In the subsequent optional section, you can learn more about simple GRANT and REVOKE commands and how OpenHouse manages access control.

Prerequisites

Create and write to OpenHouse Tables

Get environment ready

First, clone the OpenHouse GitHub repository and run the ./gradlew build command at the root directory. After the command succeeds, you should see a BUILD SUCCESSFUL message.

openhouse$main>  ./gradlew build

Next, execute docker compose -f infra/recipes/docker-compose/oh-s3-spark/docker-compose.yml up -d --build to bring up the Docker containers for the OpenHouse services, Spark, and S3.

Run SQL commands

Let us execute some basic SQL commands to create a table, add data, and query the data.

First, log in to the driver node and start the spark-shell.

openhouse$main>  docker exec -it local.spark-master /bin/bash

openhouse@0a9ed5853291:/opt/spark$ bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18 \
--jars openhouse-spark-runtime_2.12-*-all.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000 \
--conf spark.sql.catalog.openhouse.s3.access-key-id=admin \
--conf spark.sql.catalog.openhouse.s3.secret-access-key=password \
--conf spark.sql.catalog.openhouse.s3.path-style-access=true \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) \
--conf spark.sql.catalog.openhouse.cluster=LocalS3Cluster
note

The configuration spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 points to the Docker container running the OpenHouse Catalog Service.

note

The configuration spark.sql.catalog.openhouse.io-impl is set to org.apache.iceberg.aws.s3.S3FileIO in order to enable IO operations on S3. Parameters for this connection are configured via the prefix spark.sql.catalog.openhouse.s3.*.

note

You can access the MinIO UI at http://localhost:9871 on your host machine and inspect the state of the objects created for your table. For the MinIO Docker setup, the username is admin and the password is password.

Once the spark-shell is up, we run the following command to create a simple table.

scala>  spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, data string) PARTITIONED BY (days(ts))")

Run a SHOW TABLES command to confirm the table that we just created!


scala> spark.sql("SHOW TABLES IN openhouse.db").show

+---------+---------+
|namespace|tableName|
+---------+---------+
|       db|       tb|
+---------+---------+
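
If you would like to double-check the schema and partition spec of the new table, a standard Spark SQL DESCRIBE works here as well (the exact output layout may vary slightly by Spark and Iceberg version):

scala> spark.sql("DESCRIBE TABLE openhouse.db.tb").show(false)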

Great! We have created our first table. Now, let us put some data in it and retrieve it.


scala> spark.sql("""
INSERT INTO TABLE openhouse.db.tb VALUES
(current_timestamp(), 'today'),
(date_sub(CAST(current_timestamp() as DATE), 30), 'today-30d')
""")

scala> spark.sql("SELECT * FROM openhouse.db.tb").show

+--------------------+---------+
|                  ts|     data|
+--------------------+---------+
|2024-03-22 19:39:...|    today|
| 2024-02-21 00:00:00|today-30d|
+--------------------+---------+

Looks great! We just added some data to OpenHouse and queried the data using Spark SQL.
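
Because the table is partitioned by days(ts), you can also filter on the timestamp column and let Iceberg prune partitions. Here is a small sketch; since the first row was inserted with the current timestamp, this should return a count of 1:

scala> spark.sql("SELECT count(*) FROM openhouse.db.tb WHERE ts >= date_sub(current_date(), 1)").show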

To find out more about other SQL commands that OpenHouse supports, please visit the SQL User Guide.

(Optional) Control access to Tables

We will continue with the same environment and table (i.e., db.tb) as before for this section.

You might have noticed the parameter spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) when you launched the spark-shell. This parameter sets up the client with your user token.

As you did before, start the spark-shell and run the following SQL command; it is expected to fail.

scala> spark.sql("GRANT SELECT ON TABLE openhouse.db.tb TO user_1").show

java.lang.IllegalArgumentException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":"db.tb is not a shared table","stacktrace":null,"cause":"Not Available"}

This error means the table is not shared. In OpenHouse, tables are private by default. You can make a table sharable by running the SQL command:

scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( SHARING=true )")

In order to check the ACLs for this table, run:


scala> spark.sql("SHOW GRANTS ON TABLE openhouse.db.tb2").show
+---------+---------+
|privilege|principal|
+---------+---------+
| SELECT| user_1|
+---------+---------+

You can also apply similar access control at the database level; please refer to the User Guide to learn more.
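
As a sketch, a database-level grant looks much like the table-level one; this assumes the database-level privilege names documented in the User Guide (CREATE TABLE is used here for illustration), so check the guide for the exact syntax and supported privileges:

scala> spark.sql("GRANT CREATE TABLE ON DATABASE openhouse.db TO user_1")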