How to set up a K8s cluster for Apache Spark¶
The Charmed Apache Spark solution requires an environment with:
A Kubernetes cluster for running the services and workloads
An object storage layer to store persistent data
In the following guide, we provide details on the technologies currently supported and instructions on how these layers can be set up.
Kubernetes¶
The Charmed Apache Spark solution runs on top of several K8s distributions. We recommend using version 1.30 or later.
Earlier versions may still work, although we do not explicitly test them.
There are multiple ways that a K8s cluster can be deployed. We provide full compatibility and support for:
Canonical K8s
MicroK8s
AWS EKS
Azure AKS
The how-to guides below show you how to set up each of these for use with Charmed Apache Spark.
Canonical K8s¶
Canonical Kubernetes (also named Canonical K8s) is a performant, lightweight, secure, and opinionated distribution of Kubernetes supported by Canonical. It includes everything needed to create and manage a scalable cluster suitable for all use cases.
Canonical K8s can be deployed in multiple ways:
For development, you can also set up Canonical K8s using concierge, an opinionated utility for provisioning charm development and testing machines, and this is what we use in the following. It installs a single-node K8s deployment in the local environment using the Canonical K8s snap. For production-ready use cases, however, we recommend following the more comprehensive product documentation linked above.
First, install the concierge snap:
sudo snap install concierge --classic
Concierge can be configured using a YAML file, where the versions and configuration for the various development components can be set. Note that concierge can also take care of provisioning a Juju controller as part of its setup, e.g.
juju:
  channel: 3.6/stable
  agent-version: "3.6.9"
  model-defaults:
    logging-config: <root>=INFO; unit=DEBUG
providers:
  k8s:
    enable: true
    bootstrap: true
    channel: 1.32-classic/stable
    features:
      local-storage:
      load-balancer:
        enabled: true
        l2-mode: true
        cidrs: 10.64.0.0/16
    bootstrap-constraints:
      root-disk: 4G
Feel free to adapt the configuration YAML file above for other versions of the various components.
Once the configuration is defined, concierge will take care of provisioning the various tools/components using
sudo concierge prepare --trace
This will create a Canonical K8s cluster with a Juju controller already bootstrapped.
MicroK8s¶
MicroK8s is the mightiest tiny Kubernetes distribution around. It can be easily installed locally via snap:
sudo snap install microk8s --classic
When installing MicroK8s, it is recommended to configure it so that your user has admin rights on the cluster:
sudo snap alias microk8s.kubectl kubectl
sudo usermod -a -G microk8s ${USER}
mkdir -p ~/.kube
sudo chown -f -R ${USER} ~/.kube
To make these changes effective, you can either open a new shell (or log out and log back in) or use newgrp microk8s.
Make sure that the MicroK8s cluster is now up and running:
microk8s status --wait-ready
Export the Kubernetes config file associated with admin rights and store it in the $KUBECONFIG file, e.g. ~/.kube/config:
export KUBECONFIG=path/to/file # Usually ~/.kube/config
microk8s config | tee ${KUBECONFIG}
Enable the K8s features required by the Apache Spark Client snap:
microk8s.enable dns rbac storage hostpath-storage
The MicroK8s cluster is now ready to be used.
External load balancer¶
If you want to expose the Spark History Server UI via a Traefik ingress, you need to enable an external load balancer:
IPADDR=$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc')
microk8s enable metallb:$IPADDR-$IPADDR
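The argument passed to the metallb add-on is a dash-separated address range; using the same address for both ends yields a single-IP pool. As a quick illustration with a hypothetical address:

```shell
# Hypothetical example address; in practice IPADDR is derived via the
# route lookup shown above.
IPADDR=192.168.1.50
# Start == end gives MetalLB a one-address pool.
echo "${IPADDR}-${IPADDR}"
```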
AWS EKS¶
To deploy an EKS cluster, make sure that the following CLI tools are properly installed on your edge machine:
AWS CLI correctly set up to use a properly configured service account (see configuration and authentication guides). Once the AWS CLI is configured, make sure that it works properly by testing the authentication call through the following command:
aws sts get-caller-identity
eksctl installed and configured (refer to the README.md file for more information on how to install it)
Make sure that your service account (configured in AWS) has the right permissions to create and manage EKS clusters. In general, we recommend using profiles when working with multiple accounts.
Creating a cluster¶
An EKS cluster can be created using eksctl, the AWS Management Console, or the AWS CLI. In the following, we will use eksctl.
Create a YAML file with the following content:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-cluster
  region: <AWS_REGION_NAME>
  version: "1.27"
iam:
  withOIDC: true
addons:
- name: aws-ebs-csi-driver
  wellKnownPolicies:
    ebsCSIController: true
nodeGroups:
- name: ng-1
  minSize: 3
  maxSize: 5
  iam:
    attachPolicyARNs:
    - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
    - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
    - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    - arn:aws:iam::aws:policy/AmazonS3FullAccess
  instancesDistribution:
    maxPrice: 0.15
    instanceTypes: ["m5.xlarge", "m5.large"] # At least two instance types should be specified
    onDemandBaseCapacity: 0
    onDemandPercentageAboveBaseCapacity: 50
    spotInstancePools: 2
Feel free to replace <AWS_REGION_NAME> with the AWS region of your choice, and further customize the above YAML based on your needs and the policies in place.
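As a rough, illustrative sketch of the spot/on-demand split implied by the instancesDistribution settings above (onDemandBaseCapacity: 0, onDemandPercentageAboveBaseCapacity: 50), assuming the node group scales to four instances:

```shell
# Rough sketch of the EKS on-demand/spot arithmetic (illustrative only):
# on-demand = base + (total - base) * percentage / 100; the remainder runs on spot.
total=4
base=0              # onDemandBaseCapacity
pct=50              # onDemandPercentageAboveBaseCapacity
on_demand=$(( base + (total - base) * pct / 100 ))
spot=$(( total - on_demand ))
echo "on-demand=${on_demand} spot=${spot}"
```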
You can then create the EKS cluster via the CLI:
eksctl create cluster -f cluster.yaml
The EKS cluster creation process may take several minutes. It should also update the kubeconfig file with the new cluster information. By default, eksctl creates a user that generates a new access token on the fly via the aws CLI. However, this conflicts with the spark-client snap, which is strictly confined and does not have access to the aws command. Therefore, we recommend manually retrieving a token:
aws eks get-token --region <AWS_REGION_NAME> --cluster-name spark-cluster --output json
and paste the token in the kubeconfig file:
users:
- name: eks-created-username
  user:
    token: <AWS_TOKEN>
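The value to paste is the token field of the ExecCredential JSON that aws eks get-token prints. A minimal sketch of extracting it, using python3 for JSON parsing and a canned example document so it runs without AWS credentials (replace the canned JSON with the real command output):

```shell
# Canned ExecCredential document standing in for `aws eks get-token` output
CRED='{"kind":"ExecCredential","status":{"token":"k8s-aws-v1.example-token"}}'
# The bearer token lives under .status.token
TOKEN=$(echo "$CRED" | python3 -c 'import json,sys; print(json.load(sys.stdin)["status"]["token"])')
echo "$TOKEN"
```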
The EKS cluster is now ready to be used.
Azure AKS¶
To deploy an Azure Kubernetes Service (AKS) cluster, make sure you have the Azure CLI properly installed and authenticated.
The Azure CLI is the official cross-platform command-line tool to connect to Azure Cloud and execute administrative commands on Azure resources.
Install Azure CLI with snap:
sudo snap install azcli
Alias the command azcli.az as az, since most of the online resources – including the Azure Docs – refer to the CLI as az:
sudo snap alias azcli.az az
Authenticate with an Azure account. The easiest way to do so is by using the device login workflow as follows:
az login --use-device-code
Once you run the login command, you should see an output similar to the following in the console:
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXX to authenticate.
Browse to the page https://microsoft.com/devicelogin, enter the code displayed in the console earlier, and finally log in with your Azure account credentials. When prompted for confirmation, click “Yes”. Once the authentication is successful, you should see a success message in the browser. If you have multiple subscriptions associated with your Azure account, the CLI may ask you to choose one. Once that is complete, the Azure CLI is ready.
To confirm that the authentication went well, try listing storage accounts in your Azure Cloud:
az storage account list
echo $?
The command should list the storage accounts in your Azure cloud (if any) and the return code printed by echo $? should be 0 if the command was successful.
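The $? mechanism used here is generic shell behavior: it holds the exit status of the most recently executed command, with 0 meaning success. For instance:

```shell
# `true` always succeeds, `false` always fails
true
echo $?    # 0 on success
false
echo $?    # non-zero on failure (1 for `false`)
```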
Creating a cluster¶
Now that the Azure Cloud and CLI are working, start creating the AKS cluster resources.
This can be done in multiple ways – using the browser console, using az CLI directly, or by using Azure Terraform provider.
In this guide, we’re going to use Terraform to create the AKS cluster.
Terraform is an infrastructure as code (IaC) tool that lets you create, change and version the infrastructure (in our case, Azure cloud resources) safely and efficiently. Start by installing terraform as a snap.
sudo snap install terraform --classic
Once terraform is installed, we’re ready to write the Terraform plan for our new AKS cluster.
The Terraform scripts for the creation of a simple AKS cluster for Charmed Spark are already available in the GitHub repository. If you inspect the contents there, you will see several Terraform configuration files. These contain the specification for the resource group, virtual network, and AKS cluster we're going to create.
We can use the aforementioned Terraform module by referencing it as a module in a local Terraform file.
Create a file named setup.tf locally and reference the Terraform module:
# file: setup.tf
module "aks_cluster" {
  source = "git::https://github.com/canonical/spark-k8s-bundle//releases/3.4/yaml/docs-resources/aks-setup"
}
Now, initialize Terraform and view the plan by running the following command from the same directory that contains the setup.tf file:
terraform init
terraform plan
In the plan, you should see that an AKS cluster, along with a resource group, virtual network, subnet, NAT gateway, etc., is on the “to-add” list.
Apply the plan and wait for the resources to be created:
terraform apply -auto-approve
Once the resource creation completes, check the resource group name and AKS cluster name:
terraform output
# aks_cluster_name = "TestAKSCluster"
# resource_group_name = "TestSparkAKSRG"
Generating the kubeconfig file¶
To generate the kubeconfig file for connecting the client to the newly created cluster:
az aks get-credentials --resource-group <resource_group_name> --name <aks_cluster_name> --file ~/.kube/config
The AKS cluster is now ready to be used.
Object storage¶
Object storage persistence integration with Charmed Apache Spark is critical for:
reading and writing application data to be used in Spark jobs
storing Spark job logs to be then exposed via the Charmed Apache Spark History Server
enabling Hive-compatible JDBC/ODBC endpoints provided by Apache Kyuubi to provide data lake capabilities on top of HDFS/Hadoop/object storages
Charmed Apache Spark provides out-of-box integration with the following object storage backends:
S3-compatible object storages, such as:
Ceph with RadosGW
MinIO
AWS S3 bucket
Azure Storage
Azure Blob Storage
Azure DataLake v2 Storage
In the following, we provide guidance on how to set up different object storages and how to configure them to make sure that they seamlessly integrate with Charmed Apache Spark.
In particular, to store Apache Spark logs in a dedicated directory, you need to create the appropriate folder (named spark-events) in the storage backend. This can be done both on S3 and on Azure DataLake Gen2 Storage. Although there are multiple ways to do this, in the following we recommend using the snap clients. Alternatively, you can use Python libraries.
S3-compatible object storages¶
To connect Charmed Apache Spark with an S3-compatible object storage, the following configurations need to be specified:
access_key
secret_key
endpoint
bucket
region (optional)
Leveraging the standard S3 API, you can use the aws-cli snap client to perform operations with the
S3 service, like creating buckets, uploading new content, inspecting the structure, and removing data.
Install the AWS CLI client:
sudo snap install aws-cli --classic
The client can then be configured using the parameters above:
aws configure set aws_access_key_id <S3_ACCESS_KEY>
aws configure set aws_secret_access_key <S3_SECRET_KEY>
aws configure set endpoint_url <S3_ENDPOINT>
aws configure set default.region <S3_REGION> # Optional for AWS only
Test that the AWS CLI client is properly working:
aws s3 ls
Ceph with RadosGW¶
Ceph is a highly scalable, open-source distributed storage system designed to provide excellent performance, reliability, and flexibility for object, block, and file-level storage. MicroCeph is a lightweight way of deploying and managing a Ceph cluster, with a user-experience provided via a snap.
In the following, we will set up a single-node cluster. However, if you need a multi-node installation, please refer to the MicroCeph user documentation for more information.
First of all, install the microceph snap:
sudo snap install microceph --channel latest/stable
We generally recommend holding snap refreshes, since an unattended refresh may cause data loss or inconsistencies during upgrades:
sudo snap refresh --hold microceph
At this point, a new Ceph cluster can be bootstrapped using
sudo microceph cluster bootstrap
Once the cluster is bootstrapped, we can then add storage using three file-backed Object Storage Daemons (OSDs), which are a convenient way of creating a small test and development environment:
sudo microceph disk add loop,4G,3
In a production system, typically one would assign a physical block device to an OSD. Please refer to the product documentation for more information.
You can check the status of the Ceph cluster by using the status command, i.e.
sudo microceph status
The status output should provide information about the deployment, e.g. the hostname (IP address) of the cluster.
Enable RadosGW¶
Once the Ceph cluster is up and running, we need to enable the Ceph Object Gateway (aka RadosGW) to interact with the Ceph storage using an S3-compatible API.
To enable the RadosGW:
sudo microceph enable rgw [--port <port>]
You can use the --port option to specify a port other than the default (80) if it is already occupied.
You can again check that the RadosGW service is enabled by using the sudo microceph status command, making sure that the rgw acronym appears among the Services.
At this point, create a test user for the RadosGW API service, specifying your chosen access key and secret key:
sudo microceph.radosgw-admin user create --uid test --display-name test --access-key=<access-key> --secret-key=<secret-key>
The RadosGW API can then be reached at <hostname>:<port>, where hostname is the IP provided by the sudo microceph status command, and the port is the one chosen when enabling the RadosGW (that defaults to 80 if not otherwise specified).
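To point the AWS CLI (configured as in the S3 section above) at this gateway, assemble the endpoint URL from the host and port. A sketch with hypothetical values:

```shell
# Hypothetical host/port; substitute the values reported by `sudo microceph status`
# and the port chosen when enabling the RadosGW.
RGW_HOST=10.0.0.15
RGW_PORT=80
S3_ENDPOINT="http://${RGW_HOST}:${RGW_PORT}"
echo "$S3_ENDPOINT"
# The endpoint can then be used for the AWS CLI configuration, e.g.:
#   aws configure set endpoint_url "$S3_ENDPOINT"
```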
MicroK8s MinIO¶
If you already have a MicroK8s cluster running, you can enable the MinIO storage with the dedicated add-on:
microk8s.enable minio
Refer to the add-on documentation for more information on how to customize your MinIO MicroK8s deployment.
You can then use the following commands to obtain the access_key, the secret_key and the MinIO endpoint:
access_key:
microk8s.kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_ACCESS_KEY}' | base64 -d
secret_key:
microk8s.kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_SECRET_KEY}' | base64 -d
endpoint:
microk8s.kubectl get services -n minio-operator | grep minio | awk '{ print $3 }'
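The base64 -d in these commands is needed because Kubernetes stores Secret values base64-encoded. A self-contained illustration with a canned value (hypothetical, not a real key):

```shell
# "minio-access-key" base64-encoded, standing in for a Secret's data field
ENCODED='bWluaW8tYWNjZXNzLWtleQ=='
echo "$ENCODED" | base64 -d
```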
Configure the AWS CLI snap with these parameters, as shown above. After that, you can create a bucket using
aws s3 mb s3://<S3_BUCKET>
AWS S3¶
To use AWS S3, you need an AWS user with permission to read and write S3 resources. You can create a new user or use an existing one, as long as you grant S3 permissions, either centrally using the IAM console or from the S3 service itself.
If the service account has the AmazonS3FullAccess permission, you can create new buckets by using
aws s3api create-bucket --bucket <S3_BUCKET> --region <S3_REGION>
Note that buckets are associated with a given AWS region. Once the bucket is created, you can use
the access_key and the secret_key of your service account, also used for authenticating with the AWS CLI profile.
The endpoint of the service is https://s3.<S3_REGION>.amazonaws.com.
Setting up the object storage¶
To create a folder in an existing bucket, just place an empty path object named spark-events:
aws s3api put-object --bucket <S3_BUCKET> --key spark-events
The S3-object storage should now be ready to be used by Spark jobs to store their logs.
Azure Storage¶
Charmed Apache Spark also provides out-of-the-box support for the following Azure storage backends:
Azure Blob Storage (both WASB and WASBS)
Azure DataLake Gen2 Storage (ABFS and ABFSS)
Caution
Note that Azure DataLake Gen1 Storage is currently not supported and has been deprecated by Azure.
To connect Charmed Apache Spark with the Azure storage backends (WASB, WASBS, ABFS and ABFSS) the following configurations need to be specified:
storage_account
storage_key
container
Setting up the object storage¶
You can use the azcli snap client to perform operations with the Azure storage services,
like creating buckets, uploading new content, inspecting the structure, and removing data.
To install the azcli client, use
sudo snap install azcli
The client needs to be configured using the Azure storage account name and the associated storage key. These credentials can be retrieved from the Azure portal after creating a storage account. When creating the storage account, make sure that you enable “Hierarchical namespace” if you want to use Azure DataLake Gen2 Storage (which is assumed by default unless configured otherwise).
Once you have that information, the client can be configured by using the following environment variables:
export AZURE_STORAGE_ACCOUNT=<storage_account>
export AZURE_STORAGE_KEY=<storage_key>
Now test that the azcli client is properly working with
azcli storage container list
Once the credentials are set up, you can create a container in your namespace, either from the portal
or using the azcli with:
azcli storage container create --fail-on-exist --name <AZURE_CONTAINER>
To create a folder in an existing container, just place any file in the container under the spark-events path.
To do this, you can use the azcli client snap:
azcli storage blob upload --container-name <AZURE_CONTAINER> --name spark-events/a.tmp -f /dev/null
Use a local Python package with PySpark¶
You can configure the spark-client snap to see and access locally installed Python packages outside of the snap, which are not available in the PySpark shell by default.
For example, to do so for NumPy, follow the steps below.
Create a virtual environment and install the Python package using
python3.10 -m venv .venv
.venv/bin/pip install numpy
Caution
Ensure you use the same Python version as the one shipped in the spark-client snap. The Python version is displayed upon entering the PySpark shell.
After the installation, configure the client using the following environment variable:
export PYTHONPATH=path/to/.venv/lib/python3.10/site-packages
You may now enter the PySpark shell and import the package:
import numpy as np
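Under the hood, the PYTHONPATH export works because PYTHONPATH entries are added to the interpreter's module search path (sys.path), which is how the PySpark shell can find packages installed outside the snap. A minimal sketch with a hypothetical path:

```shell
# Hypothetical site-packages path, for illustration only
export PYTHONPATH=/tmp/example-site-packages
# Any python3 process started with this environment sees the entry on sys.path
python3 -c 'import sys; print("/tmp/example-site-packages" in sys.path)'
```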