OpenHouse with Spark & S3
In this guide, we will quickly set up a running environment and experiment with some simple SQL commands. Our environment will include all the core OpenHouse services, such as the Catalog Service, House Table Service, and others, a Spark 3.1 engine, and a MinIO S3 instance. In this walkthrough, we will create some tables in OpenHouse, insert data into them, and query the data. For more information on the various docker environments and how to set them up, please see the SETUP.md guide.
In the optional section that follows, you can learn about some simple GRANT/REVOKE commands and how OpenHouse manages access control.
Prerequisites
Create and write to OpenHouse Tables
Get the environment ready
First, clone the OpenHouse GitHub repository and run the ./gradlew build
command at the root directory. After the command succeeds, you should see a BUILD SUCCESSFUL
message.
openhouse$main> ./gradlew build
Execute the docker compose -f infra/recipes/docker-compose/oh-s3-spark/docker-compose.yml up -d --build
command to bring up the docker containers for the OpenHouse services, Spark, and S3.
Run SQL commands
Let us execute some basic SQL commands to create a table, add data, and query the data.
First, log in to the driver node and start the spark-shell.
oh-s3-spark$main> docker exec -it local.spark-master /bin/bash
openhouse@0a9ed5853291:/opt/spark$ bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18 \
--jars openhouse-spark-runtime_2.12-*-all.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000 \
--conf spark.sql.catalog.openhouse.s3.access-key-id=admin \
--conf spark.sql.catalog.openhouse.s3.secret-access-key=password \
--conf spark.sql.catalog.openhouse.s3.path-style-access=true \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) \
--conf spark.sql.catalog.openhouse.cluster=LocalS3Cluster
The configuration spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080
points to the docker container running the OpenHouse Catalog Service.
The configuration spark.sql.catalog.openhouse.io-impl
is set to org.apache.iceberg.aws.s3.S3FileIO
in order to enable IO operations on S3. Parameters for this connection are configured via the spark.sql.catalog.openhouse.s3.*
prefix.
You can access the MinIO UI at http://localhost:9871
on your host machine and inspect the state of the objects created for your table. For the MinIO docker setup, the username is admin
and the password is password.
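With the spark-shell running, you can try the same basic SQL flow that is shown in detail in the Azure walkthrough below. A minimal sketch (the database and table names are just examples):
scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, data string) PARTITIONED BY (days(ts))")
scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (current_timestamp(), 'today')")
scala> spark.sql("SELECT * FROM openhouse.db.tb").show
After these commands, the data and metadata objects for the table should be visible in the MinIO UI mentioned above.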
OpenHouse with Spark & HDFS
In this guide, we will quickly set up a running environment and experiment with some simple SQL commands. Our environment will include all the core OpenHouse services, such as the Catalog Service, House Table Service, and others, a Spark 3.1 engine, and an HDFS namenode and datanode. In this walkthrough, we will create some tables in OpenHouse, insert data into them, and query the data. For more information on the various docker environments and how to set them up, please see the SETUP.md guide.
In the optional section that follows, you can learn about some simple GRANT/REVOKE commands and how OpenHouse manages access control.
Prerequisites
Create and write to OpenHouse Tables
Get the environment ready
First, clone the OpenHouse GitHub repository and run the ./gradlew build
command at the root directory. After the command succeeds, you should see a BUILD SUCCESSFUL
message.
openhouse$main> ./gradlew build
Execute the docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml up -d --build
command to bring up the docker containers for the OpenHouse services, Spark, and HDFS.
Run SQL commands
Let us execute some basic SQL commands to create a table, add data, and query the data.
First, log in to the driver node and start the spark-shell.
oh-hadoop-spark$main> docker exec -it local.spark-master /bin/bash
openhouse@0a9ed5853291:/opt/spark$ bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0 \
--jars openhouse-spark-runtime_2.12-*-all.jar \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
--conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
--conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
--conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
--conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token) \
--conf spark.sql.catalog.openhouse.cluster=LocalHadoopCluster
The configuration spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080
points to the docker container running the OpenHouse Catalog Service.
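Once the spark-shell is up, the same basic SQL flow shown in the Azure walkthrough below applies here as well. As a sketch, you can also confirm that the table's files land on HDFS by querying Iceberg's files metadata table (a standard Iceberg feature, assuming it is exposed through the OpenHouse catalog):
scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, data string) PARTITIONED BY (days(ts))")
scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (current_timestamp(), 'today')")
scala> spark.sql("SELECT file_path FROM openhouse.db.tb.files").show(false)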
OpenHouse with Spark & Azure
In this guide, we will set up a running sandbox environment and experiment with some simple SQL commands. Our environment includes both the Catalog Service and the House Table Service. It also uses a Spark 3.1 engine and Azure services provisioned with Terraform. In this walkthrough, we will spin up the sandbox environment and deploy the OpenHouse services in Azure. Then, we will create some tables, insert data into them, and query the data. For more information on the sandbox setup, check out the README.md file.
In the optional section that follows, you can learn about some simple GRANT/REVOKE commands and how OpenHouse manages access control.
Prerequisites
Deployment
Step 1
Clone the OpenHouse GitHub repository and run the ./gradlew build
command at the root directory. After the command succeeds, you should see a BUILD SUCCESSFUL
message.
openhouse$main> ./gradlew build
Step 2
Set up an Azure account and create or choose a subscription. Run az login
to log in to your Azure account and subscription from the terminal.
Step 3
Navigate into the container environment, and initialize and apply the Terraform configuration:
cd infra/recipes/terraform/azure/environments/container
terraform init
terraform apply
Step 4
Navigate into the sandbox environment, and initialize and apply the Terraform configuration:
cd infra/recipes/terraform/azure/environments/sandbox
terraform init
terraform apply
Step 5
Start the docker containers for the Spark shell:
docker compose -f infra/recipes/docker-compose/spark-only/docker-compose.yml up -d --build
Run SQL commands
Start the Spark shell by running
./scripts/spark-shell.sh
When prompted, type "Y".
Once the spark-shell is up, we run the following command to create a simple table.
scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, data string) PARTITIONED BY (days(ts))")
Run a SHOW TABLES
command to confirm the table that we just created!
scala> spark.sql("SHOW TABLES IN openhouse.db").show
+---------+---------+
|namespace|tableName|
+---------+---------+
|       db|       tb|
+---------+---------+
Great! We have created our first table. Now, let us put some data in it and retrieve it.
scala> spark.sql("""
INSERT INTO TABLE openhouse.db.tb VALUES
(current_timestamp(), 'today'),
(date_sub(CAST(current_timestamp() as DATE), 30), 'today-30d')
""")
scala> spark.sql("SELECT * FROM openhouse.db.tb").show
+--------------------+---------+
| ts| data|
+--------------------+---------+
|2024-03-22 19:39:...| today|
| 2024-02-21 00:00:00|today-30d|
+--------------------+---------+
Looks great! We just added some data to OpenHouse and queried the data using Spark SQL.
To find out more about other SQL commands that OH supports, please visit the SQL User Guide.
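For instance, OpenHouse lets you attach a retention policy to a time-partitioned table. The snippet below is only a sketch; the exact policy syntax and the full list of supported policies are documented in the SQL User Guide:
scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( RETENTION=30d )")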
(Optional) Control access to Tables
We will continue with the same environment and table (i.e. db.tb) as before for this section.
You might have seen the parameter spark.sql.catalog.openhouse.auth-token=$(cat /var/config/openhouse.token)
when you launched the spark-shell. This parameter sets up the client with your user token.
As you did before, start the spark-shell and run the following SQL command, which will fail.
scala> spark.sql("GRANT SELECT ON TABLE openhouse.db.tb TO user_1").show
java.lang.IllegalArgumentException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":"db.tb is not a shared table","stacktrace":null,"cause":"Not Available"}
This error means the table is not shared. In OpenHouse, tables are private by default. You can share a table by running the following SQL command:
scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( SHARING=true )")
In order to check the ACLs for this table, run:
scala> spark.sql("SHOW GRANTS ON TABLE openhouse.db.tb2").show
+---------+---------+
|privilege|principal|
+---------+---------+
| SELECT| user_1|
+---------+---------+
You can also apply similar access control to the database entity; please refer to the User Guide to learn more.
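As a sketch, granting access at the database level follows the same GRANT pattern; the privilege name below is illustrative, and the exact set of database-level privileges is listed in the User Guide:
scala> spark.sql("GRANT SELECT ON DATABASE openhouse.db TO user_1") // privilege name is illustrative; see the User Guide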