Unleashing the Potential of Big Data: A Getting-Started Guide to Building an Efficient Compute Cluster on Kubernetes with Kyuubi + Spark + Celeborn

In big data processing, every enterprise faces the challenge of managing and analyzing massive data efficiently. Kyuubi and Apache Spark are widely used big data processing tools, and Celeborn serves as a remote shuffle service (RSS) for Spark, optimizing Spark shuffle. This article walks through a getting-started deployment of Kyuubi + Spark with Celeborn on Kubernetes.

Building the Celeborn Image

Download the Celeborn binary package and run the following command from the extracted directory:

docker build . -f docker/Dockerfile --tag 192.168.2.22:18082/xiaozhongcheng2022/celeborn:0.5.1

Building the Spark Image

Download the Spark binary package, extract it, and copy Celeborn's ${CELEBORN_HOME}/spark/celeborn-client-spark-3-shaded_2.12-0.5.1.jar into Spark's jars directory.
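For example, assuming SPARK_HOME points at the extracted Spark directory and CELEBORN_HOME at the Celeborn package:

cp ${CELEBORN_HOME}/spark/celeborn-client-spark-3-shaded_2.12-0.5.1.jar ${SPARK_HOME}/jars/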

Then build the image:

./bin/docker-image-tool.sh -r 192.168.2.22:18082/xiaozhongcheng2022 -t 3.5.2 -b java_image_tag=8-jdk-focal build

Building the Kyuubi Image

Download the Kyuubi binary package and run the following command, using the Spark image just built as the base:

./bin/docker-image-tool.sh -r 192.168.2.22:18082/xiaozhongcheng2022 -t v1.10.0-SNAPSHOT -b BASE_IMAGE=192.168.2.22:18082/xiaozhongcheng2022/spark:3.5.2 build
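Since the tags reference a private registry, push all three images once they are built so the Kubernetes nodes can pull them (adjust the registry address and tags to your own environment):

docker push 192.168.2.22:18082/xiaozhongcheng2022/celeborn:0.5.1
docker push 192.168.2.22:18082/xiaozhongcheng2022/spark:3.5.2
docker push 192.168.2.22:18082/xiaozhongcheng2022/kyuubi:v1.10.0-SNAPSHOT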

Deploying Celeborn on K8s

Create the Celeborn cache directories on each node that will run Celeborn pods:

mkdir /mnt/celeborn_ratis
mkdir /mnt/disk1
mkdir /mnt/disk2
mkdir /mnt/disk3
mkdir /mnt/disk4

In the Celeborn binary package, go to the charts/celeborn directory and edit values.yaml:

Set repository to: 192.168.2.22:18082/xiaozhongcheng2022/celeborn

Set tag to: 0.5.1

Set masterReplicas to: 1

Set workerReplicas to: 1

Set celeborn.master.ha.enabled to: false
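Put together, the edited part of values.yaml looks roughly like this (a sketch based on the 0.5.1 chart; key nesting may differ slightly between chart versions):

image:
  repository: 192.168.2.22:18082/xiaozhongcheng2022/celeborn
  tag: 0.5.1

masterReplicas: 1
workerReplicas: 1

celeborn:
  celeborn.master.ha.enabled: false

A single master with HA disabled is enough for a getting-started setup; a production cluster would typically run three masters with Ratis HA enabled.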

Once edited, install Celeborn:

helm install celeborn charts/celeborn

Helm prints its release notes confirming the install:

(screenshot: helm install output)
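Confirm the master and worker pods come up before moving on:

kubectl get pods | grep celeborn

You should see celeborn-master-0 and celeborn-worker-0 in Running state.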

Deploying Kyuubi on K8s

Create a ServiceAccount for the Kyuubi and Spark deployments, along with a broad Role and RoleBinding (note that the cluster-scoped resources listed here, such as nodes, only take effect when granted through a ClusterRole; the namespaced permissions are what Spark actually needs):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role

metadata:
  name: kyuubi-cluster-role

rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["*"]
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["nodes", "services", "pods", "endpoints", "configmaps", "secrets", "serviceaccounts", "events", "namespaces"]
    verbs: ["*"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "statefulsets", "deployments"]
    verbs: ["*"]
  - apiGroups: ["extensions"]
    resources: ["deployments"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["*"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies", "ingressclasses", "ingresses"]
    verbs: ["*"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["*"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiregistration.k8s.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: [ "flink.apache.org" ]
    resources: [ "*" ]
    verbs: [ "*" ]
  - apiGroups: [ "admissionregistration.k8s.io" ]
    resources: [ "*" ]
    verbs: [ "*" ]
  - apiGroups: [ "batch" ]
    resources: [ "*" ]
    verbs: [ "*" ]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
---

apiVersion: v1
kind: ServiceAccount

metadata:
  name: kyuubi-service-account
  namespace: default

---

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding

metadata:
  name: kyuubi-cluster-role-binding

roleRef:
  kind: Role
  name: kyuubi-cluster-role
  apiGroup: rbac.authorization.k8s.io

subjects:
  - kind: ServiceAccount
    name: kyuubi-service-account
    namespace: default
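Save the manifest as, say, kyuubi-rbac.yaml (any filename works) and apply it:

kubectl apply -f kyuubi-rbac.yaml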

Place your cluster's core-site.xml, hdfs-site.xml, and hive-site.xml files in a folder named hadoop-conf.

Use a Spark pod template to add the HDFS cluster's host mapping:

apiVersion: v1
kind: Pod
spec:
  hostAliases:
    - ip: 192.168.2.23
      hostnames:
      - hadoop

Save the above as spark-pod-template.yaml and put it in the hadoop-conf folder as well.

Create a ConfigMap from that folder:

kubectl create configmap hadoop-conf --from-file=hadoop-conf
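You can check that all four files landed in the ConfigMap:

kubectl describe configmap hadoop-conf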

In the Kyuubi binary package, go to the charts/kyuubi directory and edit values.yaml:

Set repository to: 192.168.2.22:18082/xiaozhongcheng2022/kyuubi

Set tag to: v1.10.0-SNAPSHOT

Set serviceAccount.create to: false

Set serviceAccount.name to: kyuubi-service-account
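In values.yaml form, that amounts to roughly (a sketch assuming the chart's usual image/serviceAccount layout):

image:
  repository: 192.168.2.22:18082/xiaozhongcheng2022/kyuubi
  tag: v1.10.0-SNAPSHOT

serviceAccount:
  create: false
  name: kyuubi-service-account

Pointing the chart at the pre-created ServiceAccount ensures the Kyuubi and Spark pods run with the permissions granted above.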

In sparkDefaults, set the following (client deploy mode runs the Spark driver inside the Kyuubi engine pod, while executors are launched as separate pods; celeborn-master-0.celeborn-master-svc.default:9097 is the master pod's DNS name behind the chart's headless service):

    spark.submit.deployMode=client
    spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
    spark.shuffle.service.enabled=false
    spark.celeborn.master.endpoints=celeborn-master-0.celeborn-master-svc.default:9097
    spark.kubernetes.container.image=192.168.2.22:18082/xiaozhongcheng2022/spark:3.5.2
    spark.kubernetes.hadoop.configMapName=hadoop-conf
    spark.kubernetes.authenticate.driver.serviceAccountName=kyuubi-service-account
    spark.kubernetes.driver.podTemplateFile=file:///hadoop-conf/spark-pod-template.yaml
    spark.kubernetes.executor.podTemplateFile=file:///hadoop-conf/spark-pod-template.yaml

Mount the hadoop-conf ConfigMap into the Kyuubi container by adding the following to values.yaml:

volumes: 
  - name: hadoop-conf
    configMap:
      name: hadoop-conf
      defaultMode: 0777
# Additional volumeMounts for Kyuubi container (templated)
volumeMounts: 
  - name: hadoop-conf
    mountPath: /hadoop-conf

In sparkEnv, set:

      export HIVE_CONF_DIR=/hadoop-conf

Finally, add the HDFS cluster's host mapping to the Kyuubi pod itself, since the Kyuubi server also needs to reach HDFS and the Hive metastore.
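One way to do this, assuming your chart version exposes pod-level settings such as hostAliases in values.yaml (otherwise patch the generated workload directly), is the same stanza used in the pod template:

hostAliases:
  - ip: 192.168.2.23
    hostnames:
      - hadoop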

Install Kyuubi:

helm install kyuubi charts/kyuubi

Again, Helm prints its release notes:

(screenshot: helm install output)
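Check that the Kyuubi pod is running and note the thrift-binary service, which we connect to next:

kubectl get pods | grep kyuubi
kubectl get svc | grep kyuubi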

Reading a Hive Table

Exec into the Kyuubi pod:

kubectl exec -it kyuubi-0 -- bash

Connect to Kyuubi:

bin/kyuubi-beeline -u "jdbc:kyuubi://kyuubi-thrift-binary.default.svc.cluster.local:10009" -n xiaozhch5

Since I have already loaded the TPC-DS dataset into the tpcds database in Hive, we can run a TPC-DS query (query 4) against it; run use tpcds; first so the table names resolve:

WITH year_total AS (
  SELECT c_customer_id AS customer_id, c_first_name AS customer_first_name, c_last_name AS customer_last_name, c_preferred_cust_flag AS customer_preferred_cust_flag, c_birth_country AS customer_birth_country
   , c_login AS customer_login, c_email_address AS customer_email_address, d_year AS dyear
   , SUM((ss_ext_list_price - ss_ext_wholesale_cost - ss_ext_discount_amt + ss_ext_sales_price) / 2) AS year_total
   , 's' AS sale_type
  FROM customer, store_sales, date_dim
  WHERE c_customer_sk = ss_customer_sk
   AND ss_sold_date_sk = d_date_sk
  GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year
  UNION ALL
  SELECT c_customer_id AS customer_id, c_first_name AS customer_first_name, c_last_name AS customer_last_name, c_preferred_cust_flag AS customer_preferred_cust_flag, c_birth_country AS customer_birth_country
   , c_login AS customer_login, c_email_address AS customer_email_address, d_year AS dyear
   , SUM((cs_ext_list_price - cs_ext_wholesale_cost - cs_ext_discount_amt + cs_ext_sales_price) / 2) AS year_total
   , 'c' AS sale_type
  FROM customer, catalog_sales, date_dim
  WHERE c_customer_sk = cs_bill_customer_sk
   AND cs_sold_date_sk = d_date_sk
  GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year
  UNION ALL
  SELECT c_customer_id AS customer_id, c_first_name AS customer_first_name, c_last_name AS customer_last_name, c_preferred_cust_flag AS customer_preferred_cust_flag, c_birth_country AS customer_birth_country
   , c_login AS customer_login, c_email_address AS customer_email_address, d_year AS dyear
   , SUM((ws_ext_list_price - ws_ext_wholesale_cost - ws_ext_discount_amt + ws_ext_sales_price) / 2) AS year_total
   , 'w' AS sale_type
  FROM customer, web_sales, date_dim
  WHERE c_customer_sk = ws_bill_customer_sk
   AND ws_sold_date_sk = d_date_sk
  GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year
 )
SELECT t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_preferred_cust_flag
FROM year_total t_s_firstyear, year_total t_s_secyear, year_total t_c_firstyear, year_total t_c_secyear, year_total t_w_firstyear, year_total t_w_secyear
WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id
 AND t_s_firstyear.customer_id = t_c_secyear.customer_id
 AND t_s_firstyear.customer_id = t_c_firstyear.customer_id
 AND t_s_firstyear.customer_id = t_w_firstyear.customer_id
 AND t_s_firstyear.customer_id = t_w_secyear.customer_id
 AND t_s_firstyear.sale_type = 's'
 AND t_c_firstyear.sale_type = 'c'
 AND t_w_firstyear.sale_type = 'w'
 AND t_s_secyear.sale_type = 's'
 AND t_c_secyear.sale_type = 'c'
 AND t_w_secyear.sale_type = 'w'
 AND t_s_firstyear.dyear = 2001
 AND t_s_secyear.dyear = 2001 + 1
 AND t_c_firstyear.dyear = 2001
 AND t_c_secyear.dyear = 2001 + 1
 AND t_w_firstyear.dyear = 2001
 AND t_w_secyear.dyear = 2001 + 1
 AND t_s_firstyear.year_total > 0
 AND t_c_firstyear.year_total > 0
 AND t_w_firstyear.year_total > 0
 AND CASE
  WHEN t_c_firstyear.year_total > 0 THEN t_c_secyear.year_total / t_c_firstyear.year_total
  ELSE NULL
 END > CASE
  WHEN t_s_firstyear.year_total > 0 THEN t_s_secyear.year_total / t_s_firstyear.year_total
  ELSE NULL
 END
 AND CASE
  WHEN t_c_firstyear.year_total > 0 THEN t_c_secyear.year_total / t_c_firstyear.year_total
  ELSE NULL
 END > CASE
  WHEN t_w_firstyear.year_total > 0 THEN t_w_secyear.year_total / t_w_firstyear.year_total
  ELSE NULL
 END
ORDER BY t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_preferred_cust_flag
LIMIT 100;

While the query runs, Celeborn-related log output appears:

(screenshot: logs showing Celeborn shuffle activity)

And the final output:

(screenshot: query results)

FAQ

Error:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: kyuubi is not allowed to impersonate xiaozhch5

Solution:

Edit core-site.xml on the Hadoop cluster and add the proxy-user properties for the kyuubi user:

<property>
    <name>hadoop.proxyuser.kyuubi.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.kyuubi.hosts</name>
    <value>*</value>
</property>
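Then restart the NameNode, or refresh the proxy-user settings in place:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration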

This is an original article by xiaozhch5 on the blog From Big Data to AI, licensed under CC 4.0 BY-SA. Please include a link to the original and this notice when reposting.

Original link: https://lrting.top/backend/14418/
