
OpenHouse with Spark & S3

In this guide, we will quickly set up a running environment and experiment with some simple SQL commands. Our environment includes the core OpenHouse services (such as the Catalog Service and House Table Service), a Spark 3.1 engine, and a MinIO S3 instance. In this walkthrough, we will create tables in OpenHouse, insert data into them, and query the data. For more information on the various Docker environments and how to set them up, please see the SETUP.md guide.

In the subsequent optional section, you can learn more about simple GRANT and REVOKE commands and how OpenHouse manages access control.

Prerequisites

Create and write to OpenHouse Tables

Get environment ready

First, clone the OpenHouse GitHub repository and run the ./gradlew build command at the root directory. After the command succeeds, you should see a BUILD SUCCESSFUL message.

openhouse$main>  ./gradlew build

Next, execute docker compose -f infra/recipes/docker-compose/oh-s3-spark/docker-compose.yml up -d --build to bring up the Docker containers for the OpenHouse services, Spark, and S3.

Run SQL commands

Let us execute some basic SQL commands to create a table, add data, and query the data.

First, log in to the driver node and start the spark-shell.

openhouse$main>  docker exec -it local.spark-master /bin/bash

openhouse@0a9ed5853291:/opt/spark$ bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18 \
--jars openhouse-spark-runtime_2.12-*-all.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000 \
--conf spark.sql.catalog.openhouse.s3.access-key-id=admin \
--conf spark.sql.catalog.openhouse.s3.secret-access-key=password \
--conf spark.sql.catalog.openhouse.s3.path-style-access=true \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) \
--conf spark.sql.catalog.openhouse.cluster=LocalS3Cluster
note

The configuration spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 points to the Docker container running the OpenHouse Catalog Service.

note

The configuration spark.sql.catalog.openhouse.io-impl is set to org.apache.iceberg.aws.s3.S3FileIO in order to enable IO operations on S3. Parameters for this connection are configured via the prefix spark.sql.catalog.openhouse.s3.*.

note

You can access the MinIO UI at http://localhost:9871 on your host machine and inspect the state of the objects created for your table. For the MinIO Docker setup, the username is admin and the password is password.

Once the spark-shell is up, we run the following command to create a simple table.

scala>  spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, data string) PARTITIONED BY (days(ts))")

Run a SHOW TABLES command to confirm the table that we just created!


scala> spark.sql("SHOW TABLES IN openhouse.db").show

+---------+---------+
|namespace|tableName|
+---------+---------+
|       db|       tb|
+---------+---------+
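
If you would like to double-check the schema and partition spec of the new table, a standard Spark SQL DESCRIBE works here as well (the exact output layout may vary slightly by Spark and Iceberg version):

scala> spark.sql("DESCRIBE TABLE openhouse.db.tb").show(false)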

Great! We have created our first table. Now, let us put some data in it and retrieve it.


scala> spark.sql("""
INSERT INTO TABLE openhouse.db.tb VALUES
(current_timestamp(), 'today'),
(date_sub(CAST(current_timestamp() as DATE), 30), 'today-30d')
""")

scala> spark.sql("SELECT * FROM openhouse.db.tb").show

+--------------------+---------+
|                  ts|     data|
+--------------------+---------+
|2024-03-22 19:39:...|    today|
| 2024-02-21 00:00:00|today-30d|
+--------------------+---------+

Looks great! We just added some data to OpenHouse and queried the data using Spark SQL.
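
Because the table is partitioned by days(ts), you can also filter on the timestamp column and let Iceberg prune partitions. Here is a small sketch; since the first row was inserted with the current timestamp, this should return a count of 1:

scala> spark.sql("SELECT count(*) FROM openhouse.db.tb WHERE ts >= date_sub(current_date(), 1)").show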

To find out more about other SQL commands that OpenHouse supports, please visit the SQL User Guide.

(Optional) Control access to Tables

We will continue with the same environment and table (i.e., db.tb) as before for this section.

You might have noticed the parameter spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) when you launched the spark-shell. This parameter sets up the client with your user token.

As you did before, start the spark-shell and run the following SQL command; it is expected to fail.

scala> spark.sql("GRANT SELECT ON TABLE openhouse.db.tb TO user_1").show

java.lang.IllegalArgumentException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":"db.tb is not a shared table","stacktrace":null,"cause":"Not Available"}

This error means the table is not shared. In OpenHouse, tables are private by default. You can make a table sharable by running the SQL command:

scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( SHARING=true )")

In order to check the ACLs for this table, run:


scala> spark.sql("SHOW GRANTS ON TABLE openhouse.db.tb2").show
+---------+---------+
|privilege|principal|
+---------+---------+
| SELECT| user_1|
+---------+---------+

You can also apply similar access control at the database level; please refer to the User Guide to learn more.
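
As a sketch, a database-level grant looks much like the table-level one; this assumes the database-level privilege names documented in the User Guide (CREATE TABLE is used here for illustration), so check the guide for the exact syntax and supported privileges:

scala> spark.sql("GRANT CREATE TABLE ON DATABASE openhouse.db TO user_1")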