Seamless migration of a machine learning codebase from local or EC2 to Databricks PySpark clusters (Part II: Manage all-purpose clusters via CLI and API)

Punchh Technology Blog
6 min read · Mar 29, 2020

Author: Tian Lan

Following the previous blog (Part I), in this article we show how to configure and start a remote Databricks cluster from a local machine, and how to upload and install the machine learning codebase discussed in the previous blog onto the Databricks cluster. This is an essential step toward our goal that “the user interface looks as if you were still running in your old local or EC2 environment,” so that we can “migrate the monster codebase in a seamless way.”

According to the Databricks documentation: “You use all-purpose clusters to analyze data collaboratively using interactive notebooks. You use job clusters to run fast and robust automated jobs.” Essentially, you can manually terminate and restart an all-purpose cluster, and multiple users can share such a cluster for collaborative interactive analysis. A job cluster is created ad hoc when you run a job and is terminated when the job completes; you cannot restart a job cluster.

In Part II and Part III, I will present the solution to managing the all-purpose clusters; for the job clusters, the solution is similar and will be discussed in Part IV.

To review all the sections of this topic, here is the Table of Contents for the related blogs:

  1. Write the Spark User-Defined Function and codebase adaptor (Part I link)
  2. Manage Databricks all-purpose clusters via CLI and API (Part II)
  3. Build, synchronize and submit job scripts at local (Part III link)
  4. Manage Databricks job clusters exclusively via CLI and API (Part IV link)

Part II: All-purpose clusters: uninstall, install, and manage the machine learning codebase from a local environment (laptop, AWS EC2, etc.)

Databricks CLI

According to Databricks documentation, the Databricks command-line interface (CLI) provides an easy-to-use interface to the Databricks platform. The CLI is built on top of the Databricks REST API 2.0 and is organized into command groups based on the Workspace API, Clusters API, DBFS API, Groups API, Jobs API, Libraries API, and Secrets API.
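Before any of the scripts below will work, the CLI has to be installed and authenticated against your workspace; the scripts also rely on jq to parse the JSON the CLI returns. A minimal setup might look like the following (how you obtain the personal access token is up to you; the cluster ID reported by clusters list is what the scripts reference as CLUSTERID):

# install the Databricks CLI and point it at your workspace
pip install databricks-cli

# prompts for the workspace URL and a personal access token
databricks configure --token

# find the ID of the all-purpose cluster you want to manage
databricks clusters list
export CLUSTERID="<your-cluster-id>"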

Package the Codebase

We choose wheel to package the codebase. A wheel is a ZIP-format archive with a specially formatted filename and the .whl extension. It is designed to contain all the files for a PEP 376 compatible install in a way that is very close to the on-disk format. Many packages will be properly installed with only the “Unpack” step (simply extracting the file onto sys.path), and the unpacked archive preserves enough information to “Spread” (copy data and scripts to their final locations) at any later time.

Remember that we have two codebases to package for installation later: the original machine learning codebase, and the PySpark adaptor codebase discussed in the previous blog. A minimal build sketch is shown below.
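Assuming each codebase already has a setup.py, building the wheels is one command per package; the directory names here are placeholders for illustration:

# build a wheel for each codebase (directory names are hypothetical)
cd ml_codebase && rm -rf dist && python setup.py sdist bdist_wheel && cd ..
cd pyspark_adaptor && rm -rf dist && python setup.py sdist bdist_wheel && cd ..

# each dist/ directory now contains a <modulename>-<version>-py3-none-any.whl file
ls ml_codebase/dist/*.whl pyspark_adaptor/dist/*.whl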

Uninstall the Old ML Codebase

If it is an existing all-purpose cluster, before you install a new version of the ML codebase and the corresponding PySpark adaptor codebase (we will show how to install them from the local machine shortly), you should uninstall the old ones to avoid possible conflicts.

First, if the cluster is terminated, start it and wait until it is running. We use the Databricks CLI clusters commands:

uninstall.sh

# if the cluster is terminated, start it
status=$(databricks clusters get --cluster-id ${CLUSTERID} | jq '.state')
if [[ "${status}" = *"TERMINATED"* ]]
then
    echo "The cluster is not running yet, start ..."
    databricks clusters start --cluster-id ${CLUSTERID}
fi

# wait till the cluster turns to "RUNNING"
waittime=0
while [[ "${status}" != *"RUNNING"* ]]
do
    echo "Waiting for starting the cluster...$waittime seconds"
    sleep 10
    status=$(databricks clusters get --cluster-id ${CLUSTERID} | jq '.state')
    waittime=$((waittime+10))
done

Next, check whether the old codebase is already installed on the cluster; if so, uninstall it. We use the Databricks CLI libraries commands:

# if the target lib is already "Installed" or "Installing", uninstall it
libs=$(databricks libraries cluster-status --cluster-id ${CLUSTERID} | jq '.library_statuses')
need_restart=0
for row in $(echo "${libs}" | jq -r '.[] | @base64'); do
    # decode one base64-encoded JSON entry and extract a field from it
    _jq() {
        echo "${row}" | base64 --decode | jq -r "${1}"
    }
    if [[ $(_jq '.library.whl') = *"${LIBNAME}"* ]]
    then
        echo "Uninstall:" $(_jq '.library')
        databricks libraries uninstall --cluster-id ${CLUSTERID} --whl $(_jq '.library.whl')
        echo "Uninstall finished"
        need_restart=1
    fi
done

Finally, we restart the cluster to make the uninstallation take effect:

if [ ${need_restart} -eq 1 ]
then
    echo "Restart cluster to make uninstallation effective..."
    databricks clusters restart --cluster-id ${CLUSTERID}
    status=$(databricks clusters get --cluster-id ${CLUSTERID} | jq '.state')
    waittime=0
    while [[ "${status}" != *"RUNNING"* ]]
    do
        echo "Waiting for starting the cluster...$waittime seconds"
        sleep 10
        status=$(databricks clusters get --cluster-id ${CLUSTERID} | jq '.state')
        waittime=$((waittime+10))
    done
fi
echo "Uninstall All Done"

If you run it, the workflow looks like this:

bash sh/uninstall.sh
The cluster is not running yet, start ...
Waiting for starting the cluster...0 seconds
Waiting for starting the cluster...10 seconds
Waiting for starting the cluster...20 seconds
Waiting for starting the cluster...30 seconds
Waiting for starting the cluster...40 seconds
Waiting for starting the cluster...50 seconds
Waiting for starting the cluster...60 seconds
Waiting for starting the cluster...70 seconds
Waiting for starting the cluster...80 seconds
Waiting for starting the cluster...90 seconds
Waiting for starting the cluster...100 seconds
Waiting for starting the cluster...110 seconds
Uninstall: { "whl": "dbfs:/FileStore/users/tianlan/XXXXXXXX.whl" }
WARNING: Uninstalling libraries requires a cluster restart.
databricks clusters restart --cluster-id XXXXXXXXXX
Uninstall finished
Restart cluster to make uninstallation effective...
Waiting for starting the cluster...0 seconds
Waiting for starting the cluster...10 seconds
Waiting for starting the cluster...20 seconds
Waiting for starting the cluster...30 seconds
Waiting for starting the cluster...40 seconds
Uninstall All Done

Install the New ML Codebase

Installation of a new codebase is surprisingly simple with the help of the Databricks CLI. Since we packaged our code into a wheel file, we just need to upload it to DBFS and install it on the cluster:

install.sh

# name of the wheel built below
whl_file=${modulename}-${version}-py3-none-any.whl

# rebuild the wheel from scratch
rm -rf dist
python setup.py sdist bdist_wheel

# upload the wheel to DBFS and install it on the cluster
dbfs cp --overwrite ./dist/${whl_file} ${dbfs_dir}
echo "databricks libraries install --cluster-id ${CLUSTERID} --whl ${dbfs_dir}/${whl_file}"
databricks libraries install --cluster-id ${CLUSTERID} --whl ${dbfs_dir}/${whl_file}
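After the install command returns, you can optionally check that the wheel is reported back by the cluster. This check is not part of the original script; the jq filter below is just one way to read the Libraries API response:

# optional: confirm the wheel is listed with its installation status (e.g. PENDING, INSTALLING, INSTALLED)
databricks libraries cluster-status --cluster-id ${CLUSTERID} | jq '.library_statuses[] | {whl: .library.whl, status: .status}'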

Local Command

This is as simple as calling two lines of code:

bash uninstall.sh
bash install.sh
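Both scripts assume a few variables are defined in the environment (or sourced from a shared config file). A hypothetical wrapper, with placeholder values, could look like this:

# hypothetical wrapper: export the variables the scripts expect, then run them
export CLUSTERID="0123-456789-abcde123"           # target all-purpose cluster (placeholder)
export LIBNAME="my_ml_package"                    # substring matched against installed wheels (placeholder)
export dbfs_dir="dbfs:/FileStore/users/yourname"  # DBFS folder for uploaded wheels (placeholder)
export modulename="my_ml_package"                 # module name used in the wheel filename (placeholder)
export version="1.0.0"                            # version used in the wheel filename (placeholder)

bash uninstall.sh
bash install.sh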

You can see that we have achieved Milestone #2. At this point, we have full control over the clusters from the local system (laptop, EC2, etc.). You do not need the Databricks UI at all, but if you are interested, you can use the UI as a monitor to track each status (e.g., start, uninstallation, restart, upload, and installation) along the way.

In the next blog, I will talk about how to build job scripts locally, synchronize them with the cluster, and submit the job there. The goal is that task creation, submission, and logging from your local machine to the remote cluster look as if everything were taking place locally.

About Punchh

Headquartered in San Mateo, CA, Punchh is the world leader in innovative digital marketing products for brick and mortar retailers, combining AI and machine learning technologies, mobile-first expertise, and Omni-Channel communications designed to dramatically increase lifetime customer value. Leading global chains in the restaurant, health and beauty sectors rely on Punchh to grow revenue by building customer relationships at every stage, from anonymous, to known, to brand loyalists, including more than 100 different chains representing more than $12 billion in annual spend.

About the Author

Dr. Tian Lan is Tech Lead of A.I. at Punchh, where he leads the development of large-scale and distributed machine learning for recommender systems and personalized marketing.
