In the big-data world, every organization faces the challenge of managing and analyzing massive datasets efficiently. Kyuubi and Apache Spark are widely used data-processing tools, and Celeborn acts as a remote shuffle service (RSS) for Spark, offloading and optimizing Spark shuffle. This post walks through a first deployment of Kyuubi + Spark with Celeborn on Kubernetes.
Building the Celeborn image
Download the Celeborn binary distribution, then run:
docker build . -f docker/Dockerfile --tag 192.168.2.22:18082/xiaozhongcheng2022/celeborn:0.5.1
Building the Spark image
Download the Spark binary distribution, unpack it, and copy ${CELEBORN_HOME}/spark/celeborn-client-spark-3-shaded_2.12-0.5.1.jar from the Celeborn distribution into Spark's jars directory.
Then run the following command:
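For example, assuming CELEBORN_HOME and SPARK_HOME point at the two unpacked distributions:
cp ${CELEBORN_HOME}/spark/celeborn-client-spark-3-shaded_2.12-0.5.1.jar ${SPARK_HOME}/jars/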
./bin/docker-image-tool.sh -r 192.168.2.22:18082/xiaozhongcheng2022 -t 3.5.2 -b java_image_tag=8-jdk-focal build
Building the Kyuubi image
Download the Kyuubi binary distribution, then run:
./bin/docker-image-tool.sh -r 192.168.2.22:18082/xiaozhongcheng2022 -t v1.10.0-SNAPSHOT -b BASE_IMAGE=192.168.2.22:18082/xiaozhongcheng2022/spark:3.5.2 build
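Since the Kubernetes nodes will pull these images from the private registry at 192.168.2.22:18082, push all three after building (this assumes the registry is reachable and you are logged in):
docker push 192.168.2.22:18082/xiaozhongcheng2022/celeborn:0.5.1
docker push 192.168.2.22:18082/xiaozhongcheng2022/spark:3.5.2
docker push 192.168.2.22:18082/xiaozhongcheng2022/kyuubi:v1.10.0-SNAPSHOT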
Deploying Celeborn on k8s
Create the Celeborn cache directories:
mkdir /mnt/celeborn_ratis
mkdir /mnt/disk1
mkdir /mnt/disk2
mkdir /mnt/disk3
mkdir /mnt/disk4
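The chart's default values mount these paths as hostPath volumes, so they must exist on every node that may schedule a Celeborn pod (verify this against your values.yaml); the equivalent one-liner:
mkdir -p /mnt/celeborn_ratis /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4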
In the Celeborn binary distribution, go to the charts/celeborn directory and edit values.yaml (the resulting fragment is sketched after this list):
set repository to 192.168.2.22:18082/xiaozhongcheng2022/celeborn
set tag to 0.5.1
set masterReplicas to 1
set workerReplicas to 1
set celeborn.master.ha.enabled to false
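A minimal sketch of the edited fragment; the key layout follows the 0.5.1 chart but may differ between chart versions, so treat it as illustrative:
image:
  repository: 192.168.2.22:18082/xiaozhongcheng2022/celeborn
  tag: 0.5.1
masterReplicas: 1
workerReplicas: 1
celeborn:
  celeborn.master.ha.enabled: false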
After making these changes, install Celeborn:
helm install celeborn charts/celeborn
Helm prints its release notes, confirming that Celeborn was installed.
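A quick check that the master and worker pods came up (pod names follow the chart's default StatefulSet naming, one replica each as configured above):
kubectl get pods
# expect celeborn-master-0 and celeborn-worker-0 in Running state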
Deploying Kyuubi on k8s
Create a ServiceAccount (together with a Role and RoleBinding) for the Kyuubi and Spark deployments:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kyuubi-cluster-role
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["*"]
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["nodes", "services", "pods", "endpoints", "configmaps", "secrets", "serviceaccounts", "events", "namespaces"]
    verbs: ["*"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "statefulsets", "deployments"]
    verbs: ["*"]
  - apiGroups: ["extensions"]
    resources: ["deployments"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["*"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies", "ingressclasses", "ingresses"]
    verbs: ["*"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["*"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiregistration.k8s.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["flink.apache.org"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["batch"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kyuubi-service-account
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kyuubi-cluster-role-binding
roleRef:
  kind: Role
  name: kyuubi-cluster-role
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: kyuubi-service-account
    namespace: default
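Save the manifest to a file (the name kyuubi-rbac.yaml below is just a placeholder) and apply it:
kubectl apply -f kyuubi-rbac.yaml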
Put the core-site.xml, hdfs-site.xml, and hive-site.xml files into a folder named hadoop-conf, and use a Spark pod template to map the HDFS cluster's hostnames:
apiVersion: v1
kind: Pod
spec:
  hostAliases:
    - ip: 192.168.2.23
      hostnames:
        - hadoop
Save the above file as spark-pod-template.yaml and put it in the hadoop-conf folder as well.
Create the ConfigMap:
kubectl create configmap hadoop-conf --from-file=hadoop-conf
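To confirm that all four files (the three Hadoop/Hive XMLs plus spark-pod-template.yaml) landed in the ConfigMap:
kubectl describe configmap hadoop-conf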
In the Kyuubi binary distribution, go to the charts/kyuubi directory and edit values.yaml:
set repository to 192.168.2.22:18082/xiaozhongcheng2022/kyuubi
set tag to v1.10.0-SNAPSHOT
set serviceAccount.create to false
set serviceAccount.name to kyuubi-service-account
Set the following in sparkDefaults:
spark.submit.deployMode=client
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.shuffle.service.enabled=false
spark.celeborn.master.endpoints=celeborn-master-0.celeborn-master-svc.default:9097
spark.kubernetes.container.image=192.168.2.22:18082/xiaozhongcheng2022/spark:3.5.2
spark.kubernetes.hadoop.configMapName=hadoop-conf
spark.kubernetes.authenticate.driver.serviceAccountName=kyuubi-service-account
spark.kubernetes.driver.podTemplateFile=file:///hadoop-conf/spark-pod-template.yaml
spark.kubernetes.executor.podTemplateFile=file:///hadoop-conf/spark-pod-template.yaml
Mount the hadoop-conf ConfigMap into the Kyuubi container:
volumes:
  - name: hadoop-conf
    configMap:
      name: hadoop-conf
      defaultMode: 0777
# Additional volumeMounts for Kyuubi container (templated)
volumeMounts:
  - name: hadoop-conf
    mountPath: /hadoop-conf
Set the following in sparkEnv:
export HIVE_CONF_DIR=/hadoop-conf
Also add the HDFS cluster's hosts mapping to the Kyuubi pod itself, as sketched below.
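How this is wired in depends on your chart version; if values.yaml exposes a hostAliases key (an assumption to verify against your chart), the entry mirrors the Spark pod template:
hostAliases:
  - ip: 192.168.2.23
    hostnames:
      - hadoop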
Install Kyuubi:
helm install kyuubi charts/kyuubi
Helm prints its release notes, confirming that Kyuubi was installed.
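Before connecting, make sure the Kyuubi pod is running:
kubectl get pods
# expect kyuubi-0 in Running state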
Reading a Hive table
Open a shell in the Kyuubi container:
kubectl exec -it kyuubi-0 -- bash
Connect to Kyuubi:
bin/kyuubi-beeline -u "jdbc:kyuubi://kyuubi-thrift-binary.default.svc.cluster.local:10009" -n xiaozhch5
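Once connected, a quick smoke test confirms the metastore is reachable and switches to the TPC-DS database (the first statement also launches a Spark engine, which can take a minute):
show databases;
use tpcds;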
Since I had already loaded the TPC-DS dataset into the tpcds database in Hive, I can run TPC-DS queries against it, for example:
WITH year_total AS (
SELECT c_customer_id AS customer_id, c_first_name AS customer_first_name, c_last_name AS customer_last_name, c_preferred_cust_flag AS customer_preferred_cust_flag, c_birth_country AS customer_birth_country
, c_login AS customer_login, c_email_address AS customer_email_address, d_year AS dyear
, SUM((ss_ext_list_price - ss_ext_wholesale_cost - ss_ext_discount_amt + ss_ext_sales_price) / 2) AS year_total
, 's' AS sale_type
FROM customer, store_sales, date_dim
WHERE c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year
UNION ALL
SELECT c_customer_id AS customer_id, c_first_name AS customer_first_name, c_last_name AS customer_last_name, c_preferred_cust_flag AS customer_preferred_cust_flag, c_birth_country AS customer_birth_country
, c_login AS customer_login, c_email_address AS customer_email_address, d_year AS dyear
, SUM((cs_ext_list_price - cs_ext_wholesale_cost - cs_ext_discount_amt + cs_ext_sales_price) / 2) AS year_total
, 'c' AS sale_type
FROM customer, catalog_sales, date_dim
WHERE c_customer_sk = cs_bill_customer_sk
AND cs_sold_date_sk = d_date_sk
GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year
UNION ALL
SELECT c_customer_id AS customer_id, c_first_name AS customer_first_name, c_last_name AS customer_last_name, c_preferred_cust_flag AS customer_preferred_cust_flag, c_birth_country AS customer_birth_country
, c_login AS customer_login, c_email_address AS customer_email_address, d_year AS dyear
, SUM((ws_ext_list_price - ws_ext_wholesale_cost - ws_ext_discount_amt + ws_ext_sales_price) / 2) AS year_total
, 'w' AS sale_type
FROM customer, web_sales, date_dim
WHERE c_customer_sk = ws_bill_customer_sk
AND ws_sold_date_sk = d_date_sk
GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year
)
SELECT t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_preferred_cust_flag
FROM year_total t_s_firstyear, year_total t_s_secyear, year_total t_c_firstyear, year_total t_c_secyear, year_total t_w_firstyear, year_total t_w_secyear
WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id
AND t_s_firstyear.customer_id = t_c_secyear.customer_id
AND t_s_firstyear.customer_id = t_c_firstyear.customer_id
AND t_s_firstyear.customer_id = t_w_firstyear.customer_id
AND t_s_firstyear.customer_id = t_w_secyear.customer_id
AND t_s_firstyear.sale_type = 's'
AND t_c_firstyear.sale_type = 'c'
AND t_w_firstyear.sale_type = 'w'
AND t_s_secyear.sale_type = 's'
AND t_c_secyear.sale_type = 'c'
AND t_w_secyear.sale_type = 'w'
AND t_s_firstyear.dyear = 2001
AND t_s_secyear.dyear = 2001 + 1
AND t_c_firstyear.dyear = 2001
AND t_c_secyear.dyear = 2001 + 1
AND t_w_firstyear.dyear = 2001
AND t_w_secyear.dyear = 2001 + 1
AND t_s_firstyear.year_total > 0
AND t_c_firstyear.year_total > 0
AND t_w_firstyear.year_total > 0
AND CASE
WHEN t_c_firstyear.year_total > 0 THEN t_c_secyear.year_total / t_c_firstyear.year_total
ELSE NULL
END > CASE
WHEN t_s_firstyear.year_total > 0 THEN t_s_secyear.year_total / t_s_firstyear.year_total
ELSE NULL
END
AND CASE
WHEN t_c_firstyear.year_total > 0 THEN t_c_secyear.year_total / t_c_firstyear.year_total
ELSE NULL
END > CASE
WHEN t_w_firstyear.year_total > 0 THEN t_w_secyear.year_total / t_w_firstyear.year_total
ELSE NULL
END
ORDER BY t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.customer_preferred_cust_flag
LIMIT 100;
While the query runs, Celeborn-related log lines appear in the engine output, confirming that shuffle data is being handled by Celeborn, and the query eventually returns its result set.
FAQ
Error:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: kyuubi is not allowed to impersonate xiaozhch5
Fix:
Edit core-site.xml and add the following properties:
<property>
  <name>hadoop.proxyuser.kyuubi.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.kyuubi.hosts</name>
  <value>*</value>
</property>
This is an original article by xiaozhch5 of the blog "From Big Data to AI". It is released under the CC 4.0 BY-SA license; when reposting, please include a link to the original and this notice.
Original link: https://lrting.top/backend/14418/