ClickHouse 实战-Ssp overall报表

1. 配置

1.1 内存使用

CLickHouse 目前默认使用的内存为 32G,对应配置为

<max_memory_usage>34359738368</max_memory_usage>

在目前的数据量下符合查询的需要

1.2 安全

ClickHouse 的密码设置为:9036011d8fa79028f99488c2de6fdfb09d66beb5e4b8ca5178ee9ff9179ed6a8,对应配置:

  <password>9036011d8fa79028f99488c2de6fdfb09d66beb5e4b8ca5178ee9ff9179ed6a8</password>

并且限制了 ip 的访问,只有 node14, node15, node16, node17, node110, node111 6 台机器可以访问,并且在 metrika.xml 中配置了两个分布式表的信息,分别是 bip_ck_cluster 和 hive_like_ck_cluster,其中 bip_ck_cluster 配置了 4 台机器作为数据的存储,应对于线上环境的查询,hive_like_ck_cluster 只配置了 node110 和 node111 两台机器作为数据存储,用于测试环境。ClickHouse 的配置文件在修改之后马上生效,无需重启,因此在修改配置时需要确认完整之后再保存。

2. 使用

2.1 部署

目前 ClickHouse 部署在了 4 台服务器上,分别是:

 
- node15
- node16
- node110
- node111

并且在 node17 上也部署了 ClickHouse 环境,但是还未使用。

2.2 登陆

在配置了密码的前提下,在集群环境中登陆 ClickHouse 的命令为:

clickhouse-client -m -h node15 --port 8000 --password 9036011d8fa79028f99488c2de6fdfb09d66beb5e4b8ca5178ee9ff9179ed6a8
clickhouse-client -m -h node16 --password 9036011d8fa79028f99488c2de6fdfb09d66beb5e4b8ca5178ee9ff9179ed6a8
clickhouse-client -m -h node110 --password 9036011d8fa79028f99488c2de6fdfb09d66beb5e4b8ca5178ee9ff9179ed6a8
clickhouse-client -m -h node111 --password 9036011d8fa79028f99488c2de6fdfb09d66beb5e4b8ca5178ee9ff9179ed6a8

其中 node15 因为 9000 端口被占用,改成了 8000 端口。–user 没有写明情况下默认为 default,对应于 users.xml 的 yandex.profile.default 字段和 yandex.users.default 字段。此外登陆无需 ssh 到专门的机器中,只需要在装有 clickhouse-client 的机器上输入命令即可登陆。

2.3 实例

以集群上的 node110 节点为例,目前使用的表为:

  • ssp_report_overall_dm
  • ssp_report_overall_dm_all
  • ssp_report_overall_dm_day
  • ssp_report_overall_dm_day_all
  • ssp_report_overall_dm_day_for_select
  • ssp_report_overall_dm_day_for_select_all
  • ssp_report_overall_dm_for_select
  • ssp_report_overall_dm_for_select_all

重新排序一下,为

  • ssp_report_overall_dm
  • ssp_report_overall_dm_all
  • ssp_report_overall_dm_for_select
  • ssp_report_overall_dm_for_select_all
  • ssp_report_overall_dm_day
  • ssp_report_overall_dm_day_all
  • ssp_report_overall_dm_day_for_select
  • ssp_report_overall_dm_day_for_select_all

以 ssp_report_overall_dm_day 为例,该表为基础表,只存在于单机中,ssp_report_overall_dm_day_all 为分布式表,不存储数据,用于连接分布在四台机器上的 ssp_report_overall_dm_day 表,用于数据上传,在更新时减少数据的波动情况。ssp_report_overall_dm_day_for_select 为查询表,由于数据分布在 4 台机器上,数据量约为总数据的 1/4,ssp_report_overall_dm_day_for_select_all 为总查询表,本身不存储数据,连接着 4 台机器上的数据,查询该表所返回的数据为完整的数据。

ssp_report_overall_dm_day 不长时间存储数据。在数据上传到该表后,通过 INSERT INTO ssp_report_overall_dm_day_for_select_all SELECT * FROM ssp_report_overall_dm_day_all 的方式直接将所有数据导入到查询表后,即刻清空自身的数据。

2.4 建表

如果要重新建立一张表用于其他业务,首先在一台机器上创建本地表和分布式表,如:

CREATE TABLE ssp_report_overall_dm_day (
  publisherid Int32 DEFAULT CAST(0 AS Int32),
  appid Int32 DEFAULT CAST(0 AS Int32),
  countryid Int32 DEFAULT CAST(0 AS Int32),
  carrierid Int32 DEFAULT CAST(0 AS Int32),
  adtype Int32 DEFAULT CAST(0 AS Int32),
  campaignid Int32 DEFAULT CAST(0 AS Int32),
  offerid Int32 DEFAULT CAST(0 AS Int32),
  imageid Int32 DEFAULT CAST(0 AS Int32),
  affsub Nullable(String),
  requestcount Int64 DEFAULT CAST(0 AS Int64),
  sendcount Int64 DEFAULT CAST(0 AS Int64),
  showcount Int64 DEFAULT CAST(0 AS Int64),
  clickcount Int64 DEFAULT CAST(0 AS Int64),
  feereportcount Int64 DEFAULT CAST(0 AS Int64),
  feesendcount Int64 DEFAULT CAST(0 AS Int64),
  feereportprice Float32 DEFAULT CAST(0. AS Float32),
  feesendprice Float32 DEFAULT CAST(0. AS Float32),
  cpcbidprice Float32 DEFAULT CAST(0. AS Float32),
  cpmbidprice Float32 DEFAULT CAST(0. AS Float32),
  conversion Int64 DEFAULT CAST(0 AS Int64),
  allconversion Int64 DEFAULT CAST(0 AS Int64),
  revenue Float32 DEFAULT CAST(0. AS Float32),
  realrevenue Float32 DEFAULT CAST(0. AS Float32),
  l_time DateTime,
  b_date Date,
  publisheramid Int32 DEFAULT CAST(0 AS Int32),
  publisheramname Nullable(String),
  advertiseramid Int32 DEFAULT CAST(0 AS Int32),
  advertiseramname Nullable(String),
  appmodeid Int32 DEFAULT CAST(0 AS Int32),
  appmodename Nullable(String),
  adcategory1id Int32 DEFAULT CAST(0 AS Int32),
  adcategory1name Nullable(String),
  campaignname Nullable(String),
  adverid Int32 DEFAULT CAST(0 AS Int32),
  advername Nullable(String),
  offeroptstatus Int32 DEFAULT CAST(0 AS Int32),
  offername Nullable(String),
  publishername Nullable(String),
  appname Nullable(String),
  iab1name Nullable(String),
  iab2name Nullable(String),
  countryname Nullable(String),
  carriername Nullable(String),
  adtypeid Int32 DEFAULT CAST(0 AS Int32),
  adtypename Nullable(String),
  versionid Int32 DEFAULT CAST(0 AS Int32),
  versionname Nullable(String),
  publisherproxyid Int32 DEFAULT CAST(0 AS Int32),
  data_type Nullable(String),
  feecpctimes Int64 DEFAULT CAST(0 AS Int64),
  feecpmtimes Int64 DEFAULT CAST(0 AS Int64),
  feecpatimes Int64 DEFAULT CAST(0 AS Int64),
  feecpasendtimes Int64 DEFAULT CAST(0 AS Int64),
  feecpcreportprice Float32 DEFAULT CAST(0. AS Float32),
  feecpmreportprice Float32 DEFAULT CAST(0. AS Float32),
  feecpareportprice Float32 DEFAULT CAST(0. AS Float32),
  feecpcsendprice Float32 DEFAULT CAST(0. AS Float32),
  feecpmsendprice Float32 DEFAULT CAST(0. AS Float32),
  feecpasendprice Float32 DEFAULT CAST(0. AS Float32),
  countrycode Nullable(String),
  packagename Nullable(String),
  domain Nullable(String),
  operatingsystem Nullable(String),
  systemlanguage Nullable(String),
  devicebrand Nullable(String),
  devicetype Nullable(String),
  browserkernel Nullable(String),
  b_time DateTime,
  respstatus Int32 DEFAULT CAST(0 AS Int32),
  winprice Float64 DEFAULT 0.,
  winnotices Int64 DEFAULT CAST(0 AS Int64),
  issecondhighpricewin Int32 DEFAULT CAST(0 AS Int32),
  companyid Int32 DEFAULT CAST(0 AS Int32),
  companyname Nullable(String),
  test Int32 DEFAULT CAST(0 AS Int32),
  ruleid Int32 DEFAULT CAST(0 AS Int32),
  smartid Int32 DEFAULT CAST(0 AS Int32),
  proxyid Int32 DEFAULT CAST(0 AS Int32),
  smartname Nullable(String),
  rulename Nullable(String),
  appcompanyid Int32 DEFAULT CAST(0 AS Int32),
  offercompanyid Int32 DEFAULT CAST(0 AS Int32),
  newcount Int64 DEFAULT CAST(0 AS Int64),
  activecount Int64 DEFAULT CAST(0 AS Int64),
  adcategory2id Int32 DEFAULT CAST(0 AS Int32),
  adcategory2name Nullable(String),
  publisherampaid Int32 DEFAULT CAST(0 AS Int32),
  publisherampaname Nullable(String),
  advertiseramaaid Int32 DEFAULT CAST(0 AS Int32),
  advertiseramaaname Nullable(String),
  eventname Nullable(String),
  recommender Int32 DEFAULT CAST(0 AS Int32),
  ratertype Int32 DEFAULT CAST(0 AS Int32),
  raterid Nullable(String)
) ENGINE = MergeTree PARTITION BY (b_date, b_time) ORDER BY (b_date, b_time) SETTINGS index_granularity = 8192;

CREATE TABLE ssp_report_overall_dm_day_all AS ssp_report_overall_dm_day ENGINE = Distributed(bip_ck_cluster, default, ssp_report_overall_dm_day, rand());
  • 其中,bip_ck_cluster 对应于 /etc/clickhouse-server/metrika.xml 配置中的 clickhouse_remote_servers.bip_ck_cluster 配置,第二个参数为默认,第三个参数为要建立的分布式的表名,第四个参数为数据导入时的数据分配函数。

  • 在单机中建好表后,在剩下的机器中都执行一遍上述的 sql,其中本地表为必须,分布式表为非必须,但是至少需要有一台机器有分布式表,否则无法查询其他机器中的数据。

  • 查询表和查询总表非必须,根据业务需求而定