Background: the company has recently been evaluating ClickHouse and exploring how to import data from HDFS into it. After learning that DataX supports HDFS-to-ClickHouse transfers, I set up a test machine and walked through the whole process. The full trial-and-error record is below for reference.
Environment:
- Linux
- Python 2.7.5
- DataX 3.0
- Java 1.8.0_73
Debugging steps
- Prepare the .json job configuration: create job.json under datax/bin from the template below.
reader: hdfsreader
writer: clickhousewriter
{
  "job": {
    "setting": {
      "speed": {
        "byte": 10485760
      },
      "errorLimit": {
        "record": 0,
        "percentage": 0.02
      }
    },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "hadoopConfig": {
              "dfs.nameservices": "hadoopcluster",
              "dfs.ha.namenodes.hadoopcluster": "nm1,nm2",
              "dfs.namenode.rpc-address.hadoopcluster.nm1": "hadoop1.xxx.xxx:8020",
              "dfs.namenode.rpc-address.hadoopcluster.nm2": "hadoop2.xxx.xxx:8020",
              "dfs.client.failover.proxy.provider.hadoopcluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            },
            "defaultFS": "hdfs://xx.xx.xx.106",
            "path": "/user/xx/dw/xx/ads_music_text2ck",
            "column": [
              { "index": 0, "type": "double" },
              { "index": 1, "type": "string" },
              { "index": 2, "type": "string" },
              { "index": 3, "type": "string" },
              { "index": 4, "type": "string" },
              { "index": 5, "type": "string" },
              { "index": 6, "type": "string" },
              { "index": 7, "type": "string" },
              { "index": 8, "type": "string" },
              { "index": 9, "type": "string" },
              { "index": 10, "type": "string" }
            ],
            "fileType": "text",
            "encoding": "UTF-8",
            "fieldDelimiter": ",",
            "nullFormat": "\\N"
          }
        },
        "writer": {
          "name": "clickhousewriter",
          "parameter": {
            "username": "xxxxxx",
            "password": "xxxxxx",
            "column": ["xx_id","xx_code","xx_name","two_class_code","xx_name","xxx","xx","xx","xx","etl_time","dt"],
            "connection": [
              {
                "jdbcUrl": "jdbc:clickhouse://xx.xx.xx.xx:8123/db",
                "table": ["ads_music_text2ck"]
              }
            ]
          }
        }
      }
    ]
  }
}
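Before handing a file like this to DataX, it is worth catching JSON syntax mistakes early. A minimal sketch, assuming a python3 interpreter is on the PATH (the check here runs on a tiny stand-in file; point "$job" at your real job.json):

```shell
# Validate a DataX job file's JSON syntax before launching the job.
job=$(mktemp)
cat > "$job" <<'EOF'
{"job": {"setting": {"speed": {"byte": 10485760}}}}
EOF
if python3 -m json.tool "$job" > /dev/null 2>&1; then
  result=valid
else
  result=invalid
fi
echo "job file is $result JSON"
rm -f "$job"
```

If I recall correctly, DataX can also print a skeleton config for a reader/writer pair (`python datax.py -r hdfsreader -w clickhousewriter`), which is another way to start from a known-good template.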
How do you find the default HDFS filesystem?
[z@bi-fh-11-18 bin]$ hdfs getconf -confKey fs.defaultFS
How do you list the configured HDFS nameservices?
[z@bi-fh-11-18 bin]$ hdfs getconf -confKey dfs.nameservices
Hadoop cluster information:
Check the hdfs-site.xml file:
<property>
<name>dfs.nameservices</name>
<value>hadoopcluster</value>
<description>Namespace (nameservice) of the cluster; even with multiple clusters, the configuration file can be shared</description>
</property>
<property>
<name>dfs.ha.namenodes.hadoopcluster</name>
<value>nm1,nm2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoopcluster.nm1</name>
<value>hadoop1.xx.xxx:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoopcluster.nm2</name>
<value>hadoop2.xx.xxx:8020</value>
</property>
- Run job.json:
[z@bi-fh-11-18 bin]$ python /home/xx/datax/bin/datax.py job.json
This failed with:
DataX's analysis reported the most likely cause:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[There is an error in the configuration file you provided; please check your job configuration.] - Configuration error: the configuration file [/home/xx/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration.
Cause: DataX loaded hidden files as plugins; these files need to be deleted.
Fix:
Delete all ._xxxx hidden files under datax/plugin/reader. **Note:** match files with the "._*er" pattern; a broader pattern would also match the hidden jar files inside the plugin directories.
[z@bi-fh-11-18 bin]$ find /home/xx/datax/plugin/reader/ -type f -name "._*er" | xargs rm -rf
Likewise delete all ._xxxx hidden files under datax/plugin/writer:
[z@bi-fh-11-18 bin]$ find /home/**/datax/plugin/writer/ -type f -name "._*er" | xargs rm -rf
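To see why the narrower pattern matters, here is a small throwaway demonstration (the file names are made up for illustration; nothing outside a temp directory is touched):

```shell
# Demonstrate "._*er" vs "._*": the broad glob would also sweep up
# hidden jar-like junk files inside the plugin directories.
demo=$(mktemp -d)
mkdir -p "$demo/reader/hdfsreader/libs"
touch "$demo/reader/._hdfsreader" "$demo/reader/hdfsreader/libs/._hadoop.jar"
narrow=$(find "$demo/reader" -type f -name "._*er" | wc -l)  # only ._hdfsreader
broad=$(find "$demo/reader" -type f -name "._*" | wc -l)     # also the hidden jar
echo "narrow matches: $narrow, broad matches: $broad"
rm -rf "$demo"
```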
- Running again failed: the clickhousewriter plugin was missing.
DataX's analysis reported the most likely cause:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-12], Description:[DataX plugin initialization error. This is usually caused by a broken DataX installation; please contact your ops team.]
- Plugin loading failed; the following plugins were not loaded: [clickhousewriter, hdfsreader]
Investigation:
Checked the datax/plugin/writer directory and found no clickhousewriter there.
[z@bi-fh-11-18 bin]$ cd /datax/plugin/writer
[z@bi-fh-11-18 bin]$ ll
Fix:
Asked ops to install the clickhousewriter plugin (download the DataX source from GitHub and build/package it).
- Ran again; another error:
The ClickHouse port in the job config was 9000, which is what the colleague who installed ClickHouse provided for logging in with clickhouse-client. The JSON config uses a JDBC connection, and the ClickHouse JDBC driver speaks HTTP, which defaults to port 8123.
ERROR RetryUtil - Exception when calling callable, message: Code:[DBUtilErrorCode-10],
Description:[Failed to connect to the database. Please check the username, password, database name, IP and port, or ask a DBA for help (mind the network environment).]
- Details: java.lang.RuntimeException: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, code: 1002, host: xx.xx.xx.xx, port: 9000; Port 9000 is for clickhouse-client program
- You must use port 8123 for HTTP.
Fix:
Edit the JSON and change the ClickHouse connection port to 8123.
- After changing the port from 9000 to 8123, ran job.json again. DataX's pre-check query "select * from ads_music_text2ck where 1=2" failed. Details: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, code: 1002,
host: xx.xx.xx.xx, port: 8123; Code: 60.
DB::Exception: Table default.ads_music_text2ck doesn't exist. (UNKNOWN_TABLE) (version 22.3.2.1)
Cause:
The jdbcUrl originally configured in job.json did not include a database name, so the query went against ClickHouse's default database, while the table lives in the db database. The database must be spelled out:
"jdbcUrl": "jdbc:clickhouse://xx.xx.xx.xx:8123",
changed to:
"jdbcUrl": "jdbc:clickhouse://xx.xx.xx.xx:8123/db",
- After fixing the ClickHouse connection, ran the job again:
Description:[The path you configured is malformed]. - java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.;
Host Details : local host is: "bi-xx-xx-xx.xx.xx/127.0.0.1"; destination host is: "xx.xx.xx.xx":9000;
Investigation:
Running the following command produced the same error:
[z@bi-fh-11-18 bin]$ hadoop fs -ls hdfs://xx.xx.xxx.xxx:9000/user/xx/xx/xx/ads_music_text2ck
Dropping the port 9000 and appending the path directly to the IP listed the files, clearing this error:
[z@bi-fh-11-18 bin]$ hadoop fs -ls hdfs://xx.xx.xxx.xxx/user/xx/xx/xx/ads_music_text2ck
- Following the same idea, removed the port number from the job's defaultFS and ran the job again:
DataXException: Code:[Framework-03], Description:[DataX engine configuration error. This is usually caused by a broken DataX installation; please contact your ops team.] - With a total bps limit set, the per-channel bps value must not be empty or non-positive.
Diagnosis:
Exactly as the message says: the per-channel bps value in DataX's configuration must not be empty or non-positive.
Check datax/conf/core.json:
core -> transport -> channel -> speed -> "byte" defaults to -1.
Fix:
Edit datax/conf/core.json and set
core -> transport -> channel -> speed -> "byte": 2000000, i.e. 2 MB per channel.
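As I understand DataX's byte-based throttling, the job-level speed.byte divided by the per-channel byte limit determines how many channels the job gets, so the two values used in this post imply:

```shell
# Both values are the ones from this post's configs (adjust for your setup):
job_byte=10485760     # job.json: job.setting.speed.byte (10 MB/s total)
channel_byte=2000000  # core.json: core.transport.channel.speed.byte (2 MB/s each)
channels=$((job_byte / channel_byte))
echo "DataX should open about $channels channels"
```

This is also why the -1 default breaks things once a total bps limit exists: the division has no positive per-channel value to work with.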
- Ran again and got a warning:
The column you tried to read is out of bounds: the source row has [11] columns, but you tried to read column [12].
Cause: column indexes in the JSON are zero-based. The source table has 11 columns, so valid indexes are 0 through 10; the extra column entry had to be removed.
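The off-by-one is easy to check with a line of arithmetic:

```shell
# Zero-based column indexing: n columns -> valid indexes 0 .. n-1.
ncols=11
max_index=$((ncols - 1))
echo "valid column indexes: 0..$max_index"
```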
- After this fix the job ran through successfully!
Validation in ClickHouse: row count and contents are correct.
I also tested an ORC-format Hive table; the relevant hdfsreader parameters in the JSON are:
"fileType": "orc",
"encoding": "UTF-8",
"fieldDelimiter": "\001",
"nullFormat": "\\N"
The ORC run was fine too, at roughly 100k records/s.
Text and ORC are the mainstream storage formats; feel free to test the other formats yourselves.
Throughput samples from the run log:
2022-04-26 10:28:11.718 [job-0] INFO StandAloneJobContainerCommunicator - Total 62175872 records, 5949152214 bytes | Speed 9.71MB/s, 104240 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 508.631s | All Task WaitReaderTime 2,326.551s | Percentage 0.00%
2022-04-26 10:28:21.720 [job-0] INFO StandAloneJobContainerCommunicator - Total 63268416 records, 6047950580 bytes | Speed 9.42MB/s, 109254 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 518.396s | All Task WaitReaderTime 2,363.778s | Percentage 0.00%
2022-04-26 10:28:31.723 [job-0] INFO StandAloneJobContainerCommunicator - Total 64347872 records, 6149996451 bytes | Speed 9.73MB/s, 107945 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 526.469s | All Task WaitReaderTime 2,403.225s | Percentage 0.00%