Background: the company has recently been evaluating ClickHouse and exploring how to import data from HDFS into it. After learning that DataX supports HDFS-to-ClickHouse transfers, I set up a test machine and walked through the whole process. The full trial-and-error record is below for reference.
Environment:
- Linux
- Python 2.7.5
- DataX 3.0
- Java 1.8.0_73
Debugging steps
- Prepare the .json job configuration: create job.json under datax/bin from the template below.
reader: hdfsreader
writer: clickhousewriter
{
  "job": {
    "setting": {
      "speed": {
        "byte": 10485760
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "hadoopConfig": {
              "dfs.nameservices": "hadoopcluster",
              "dfs.ha.namenodes.hadoopcluster": "nm1,nm2",
              "dfs.namenode.rpc-address.hadoopcluster.nm1": "hadoop1.xxx.xxx:8020",
              "dfs.namenode.rpc-address.hadoopcluster.nm2": "hadoop2.xxx.xxx:8020",
              "dfs.client.failover.proxy.provider.hadoopcluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            },
            "defaultFS": "hdfs://xx.xx.xx.106",
            "path": "/user/xx/dw/xx/ads_music_text2ck",
            "column": [
              { "index": 0, "type": "double" },
              { "index": 1, "type": "string" },
              { "index": 2, "type": "string" },
              { "index": 3, "type": "string" },
              { "index": 4, "type": "string" },
              { "index": 5, "type": "string" },
              { "index": 6, "type": "string" },
              { "index": 7, "type": "string" },
              { "index": 8, "type": "string" },
              { "index": 9, "type": "string" },
              { "index": 10, "type": "string" }
            ],
            "fileType": "text",
            "encoding": "UTF-8",
            "fieldDelimiter": ",",
            "nullFormat": "\\N"
          }
        },
        "writer": {
          "name": "clickhousewriter",
          "parameter": {
            "username": "xxxxxx",
            "password": "xxxxxx",
            "column": ["xx_id","xx_code","xx_name","two_class_code","xx_name","xxx","xx","xx","xx","etl_time","dt"],
            "connection": [
              {
                "jdbcUrl": "jdbc:clickhouse://xx.xx.xx.xx:8123/db",
                "table": ["ads_music_text2ck"]
              }
            ]
          }
        }
      }
    ]
  }
}
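Before handing a file like this to DataX, it is worth catching JSON syntax mistakes early. A minimal sketch, assuming a python3 interpreter is on the PATH (the check here runs on a tiny stand-in file; point "$job" at your real job.json):

```shell
# Validate a DataX job file's JSON syntax before launching the job.
job=$(mktemp)
cat > "$job" <<'EOF'
{"job": {"setting": {"speed": {"byte": 10485760}}}}
EOF
if python3 -m json.tool "$job" > /dev/null 2>&1; then
  result=valid
else
  result=invalid
fi
echo "job file is $result JSON"
rm -f "$job"
```

If I recall correctly, DataX can also print a skeleton config for a reader/writer pair (`python datax.py -r hdfsreader -w clickhousewriter`), which is another way to start from a known-good template.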
How do you find the default HDFS filesystem?
[z@bi-fh-11-18 bin]$ hdfs getconf -confKey fs.defaultFS
How do you list the configured HDFS nameservices?
[z@bi-fh-11-18 bin]$ hdfs getconf -confKey dfs.nameservices
Hadoop cluster information:
Check the hdfs-site.xml file:
<property>
<name>dfs.nameservices</name>
<value>hadoopcluster</value>
<description>Namespace (nameservice) of the cluster; even with multiple clusters, the configuration file can be shared</description>
</property>
<property>
<name>dfs.ha.namenodes.hadoopcluster</name>
<value>nm1,nm2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoopcluster.nm1</name>
<value>hadoop1.xx.xxx:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoopcluster.nm2</name>
<value>hadoop2.xx.xxx:8020</value>
</property>
- Run job.json:
[z@bi-fh-11-18 bin]$ python /home/xx/datax/bin/datax.py job.json
This failed with:
DataX's analysis reported the most likely cause:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[There is an error in the configuration file you provided; please check your job configuration.] - Configuration error: the configuration file [/home/xx/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration.
Cause: DataX loaded hidden files as plugins; these files need to be deleted.
Fix:
Delete all ._xxxx hidden files under datax/plugin/reader. **Note:** match files with the "._*er" pattern; a broader pattern would also match the hidden jar files inside the plugin directories.
[z@bi-fh-11-18 bin]$ find /home/xx/datax/plugin/reader/ -type f -name "._*er" | xargs rm -rf
Likewise delete all ._xxxx hidden files under datax/plugin/writer:
[z@bi-fh-11-18 bin]$ find /home/**/datax/plugin/writer/ -type f -name "._*er" | xargs rm -rf
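To see why the narrower pattern matters, here is a small throwaway demonstration (the file names are made up for illustration; nothing outside a temp directory is touched):

```shell
# Demonstrate "._*er" vs "._*": the broad glob would also sweep up
# hidden jar-like junk files inside the plugin directories.
demo=$(mktemp -d)
mkdir -p "$demo/reader/hdfsreader/libs"
touch "$demo/reader/._hdfsreader" "$demo/reader/hdfsreader/libs/._hadoop.jar"
narrow=$(find "$demo/reader" -type f -name "._*er" | wc -l)  # only ._hdfsreader
broad=$(find "$demo/reader" -type f -name "._*" | wc -l)     # also the hidden jar
echo "narrow matches: $narrow, broad matches: $broad"
rm -rf "$demo"
```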
- Running again failed: the clickhousewriter plugin was missing.
DataX's analysis reported the most likely cause:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-12], Description:[DataX plugin initialization error. This is usually caused by a broken DataX installation; please contact your ops team.]
- Plugin loading failed; the following plugins were not loaded: [clickhousewriter, hdfsreader]
Investigation:
Checked the datax/plugin/writer directory and found no clickhousewriter there.
[z@bi-fh-11-18 bin]$ cd /datax/plugin/writer
[z@bi-fh-11-18 bin]$ ll
Fix:
Asked ops to install the clickhousewriter plugin (download the DataX source from GitHub and build/package it).
- Ran again; another error:
The ClickHouse port in the job config was 9000, which is what the colleague who installed ClickHouse provided for logging in with clickhouse-client. The JSON config uses a JDBC connection, and the ClickHouse JDBC driver speaks HTTP, which defaults to port 8123.
ERROR RetryUtil - Exception when calling callable, message: Code:[DBUtilErrorCode-10],
Description:[Failed to connect to the database. Please check the username, password, database name, IP and port, or ask a DBA for help (mind the network environment).]
- Details: java.lang.RuntimeException: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, code: 1002, host: xx.xx.xx.xx, port: 9000; Port 9000 is for clickhouse-client program
- You must use port 8123 for HTTP.
Fix:
Edit the JSON and change the ClickHouse connection port to 8123.
- After changing the port from 9000 to 8123, ran job.json again. DataX's pre-check query "select * from ads_music_text2ck where 1=2" failed. Details: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, code: 1002,
host: xx.xx.xx.xx, port: 8123; Code: 60.
DB::Exception: Table default.ads_music_text2ck doesn't exist. (UNKNOWN_TABLE) (version 22.3.2.1)
Cause:
The jdbcUrl originally configured in job.json did not include a database name, so the query went against ClickHouse's default database, while the table lives in the db database. The database must be spelled out:
"jdbcUrl": "jdbc:clickhouse://xx.xx.xx.xx:8123",
changed to:
"jdbcUrl": "jdbc:clickhouse://xx.xx.xx.xx:8123/db",
- After fixing the ClickHouse connection, ran the job again:
Description:[The path you configured is malformed]. - java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.;
Host Details : local host is: "bi-xx-xx-xx.xx.xx/127.0.0.1"; destination host is: "xx.xx.xx.xx":9000;
Investigation:
Running the following command produced the same error:
[z@bi-fh-11-18 bin]$ hadoop fs -ls hdfs://xx.xx.xxx.xxx:9000/user/xx/xx/xx/ads_music_text2ck
Dropping the port 9000 and appending the path directly to the IP listed the files, clearing this error:
[z@bi-fh-11-18 bin]$ hadoop fs -ls hdfs://xx.xx.xxx.xxx/user/xx/xx/xx/ads_music_text2ck
- Following the same idea, removed the port number from the job's defaultFS and ran the job again:
DataXException: Code:[Framework-03], Description:[DataX engine configuration error. This is usually caused by a broken DataX installation; please contact your ops team.] - With a total bps limit set, the per-channel bps value must not be empty or non-positive.
Diagnosis:
Exactly as the message says: the per-channel bps value in DataX's configuration must not be empty or non-positive.
Check datax/conf/core.json:
core -> transport -> channel -> speed -> "byte" defaults to -1.
Fix:
Edit datax/conf/core.json and set
core -> transport -> channel -> speed -> "byte": 2000000, i.e. 2 MB per channel.
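As I understand DataX's byte-based throttling, the job-level speed.byte divided by the per-channel byte limit determines how many channels the job gets, so the two values used in this post imply:

```shell
# Both values are the ones from this post's configs (adjust for your setup):
job_byte=10485760     # job.json: job.setting.speed.byte (10 MB/s total)
channel_byte=2000000  # core.json: core.transport.channel.speed.byte (2 MB/s each)
channels=$((job_byte / channel_byte))
echo "DataX should open about $channels channels"
```

This is also why the -1 default breaks things once a total bps limit exists: the division has no positive per-channel value to work with.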
- Ran again and got a warning:
The column you tried to read is out of bounds: the source row has [11] columns, but you tried to read column [12].
Cause: column indexes in the JSON are zero-based. The source table has 11 columns, so valid indexes are 0 through 10; the extra column entry had to be removed.
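The off-by-one is easy to check with a line of arithmetic:

```shell
# Zero-based column indexing: n columns -> valid indexes 0 .. n-1.
ncols=11
max_index=$((ncols - 1))
echo "valid column indexes: 0..$max_index"
```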
- After this fix the job ran through successfully!
Validation in ClickHouse: row count and contents are correct.
I also tested an ORC-format Hive table; the relevant hdfsreader parameters in the JSON are:
"fileType": "orc",
"encoding": "UTF-8",
"fieldDelimiter": "\001",
"nullFormat": "\\N"
The ORC run was fine too, at roughly 100k records/s.
Text and ORC are the mainstream storage formats; feel free to test the other formats yourselves.
Throughput samples from the run log:
2022-04-26 10:28:11.718 [job-0] INFO StandAloneJobContainerCommunicator - Total 62175872 records, 5949152214 bytes | Speed 9.71MB/s, 104240 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 508.631s | All Task WaitReaderTime 2,326.551s | Percentage 0.00%
2022-04-26 10:28:21.720 [job-0] INFO StandAloneJobContainerCommunicator - Total 63268416 records, 6047950580 bytes | Speed 9.42MB/s, 109254 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 518.396s | All Task WaitReaderTime 2,363.778s | Percentage 0.00%
2022-04-26 10:28:31.723 [job-0] INFO StandAloneJobContainerCommunicator - Total 64347872 records, 6149996451 bytes | Speed 9.73MB/s, 107945 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 526.469s | All Task WaitReaderTime 2,403.225s | Percentage 0.00%