One of the more powerful features of ClickHouse is its introspective capabilities. This can be easily leveraged to understand where load on our multi-tenant clickhouse servers is coming from.
Show me the queries:
The following query gives an at-a-glance overview of what is generating load on the cluster:
withtoDateTime(now()) as target_time,toIntervalDay(3) as interval,(SELECT sum(read_bytes)from clusterAllReplicas('posthog', system,query_log)where event_time >= target_time - interval and event_time <= target_time and type > 1 and is_initial_query) AS total_bytes_readselectmultiIf(http_user_agent = 'Apache-HttpClient/4.5.13 (Java/11.0.17)', 'Metabase?',http_user_agent = 'Go-http-client/1.1', 'Grafana?',http_user_agent = 'python-requests/2.28.1', 'infi_orm',http_user_agent != '', http_user_agent,query = 'SELECT version()', 'version_query',JSONExtractString(log_comment, 'kind') = 'request' and JSONExtractString(log_comment, 'route_id') IN ('api/event/?$', 'api/projects/(?P<parent_lookup_team_id>[^/.]+)/events/?$'), '/api/event',log_comment != '', JSONExtractString(log_comment, 'kind'),query_kind = 'Insert',query LIKE '%FORMAT JSON%' or (not is_initial_query and query like '%`elements_chain` FROM `posthog`.`sharded_events%') or (query like '%min(`_timestamp`) AS `min`, max(`_timestamp`)%'), 'historical-exports','unknown') as query_type,formatReadableSize(sum(read_bytes)) AS read_bytes_,sum(read_bytes) / total_bytes_read AS read_bytes_percentage,count(),sum(query_duration_ms),formatReadableSize(sum(result_bytes)) AS result_bytes_,sum(read_rows),arraySort(arrayDistinct(groupArray(getMacro('replica')))) as hostsfrom clusterAllReplicas('posthog', system, query_log)where event_time >= target_time - interval and event_time <= target_time and type > 1 and is_initial_querygroup by query_typeorder by sum(query_duration_ms) DESC
Advanced
To diagnose further, it's important to understand ClickHouse operations.
Useful dimensions to slice the data on:
query_duration_ms
- How long the query tookformatReadableSize(read_bytes)
- Total number of bytes read from all tables and table functions participated in query.formatReadableSize(memory_usage)
- Memory usage of this query.formatReadableSize(result_bytes)
- How big was the response for this query. Useful fornot is_initial_query
to determine if too much data streaming is going on.ProfileEvents['OSCPUVirtualTimeMicroseconds'])
- How much time was spent by the CPUProfileEvents['ReadCompressedBytes'])
- How much data was read from disk (compressed)ProfileEvents['NetworkSendElapsedMicroseconds'])
- How much time was spent sending data over networkProfileEvents['NetworkReceiveBytes'])
- How much time was spent reading data over networkProfileEvents['NetworkSendBytes'])
- How much data was sent over the network
Full list of all valid measurements can be found in ClickHouse source code.
Other useful expressions:
is_initial_query
- indicates whether this was a main query or pushed down from coordinator. Notelog_comment
is also forwarded.any(log_comment)
- shows the structure of the log commentgetMacro('replica')
- What replica was this on?getMacro('shard')
- What shard was this query made on?getMacro('hostClusterType')
- What cluster was this on? Online or offline?
log_comment
We make use of the log_comment column quite extensively to add metadata to queries in order to make analysis simpler.
log_comment
is set by specifying the log_comment
setting when running a query. It then populates in the column with the same name on the query log table.
Some useful things you'll find in log_comment
include:
JSONExtractString(log_comment, 'kind')
- What is the query from? Eithercelery
orrequest
JSONExtractString(log_comment, 'query_type')
- Type of query e.g.trends_total_volume
orfunnel_conversion
JSONExtractString(log_comment, 'id')
- Request path or task name (depending on kind)JSONExtractString(log_comment, 'route_id')
- what geberic route id was responsible for the query (only set for kind=request)JSONExtractInt(log_comment, 'user_id')
- user_idJSONExtractInt(log_comment, 'team_id')
- team_idJSONExtractString(log_comment, 'access_method')
- What access method was used? Blank indicates it's normal web traffic (only set for kind=request)JSONExtractString(log_comment, 'http_referer')
- HTTP referer (only set for kind=request)JSONExtractString(log_comment, 'http_user_agent')
- HTTP user agent (only set for kind=request)JSONExtractString(log_comment, 'container_hostname')
- Kubernetes pod where the query was initiated fromJSONExtractString(log_comment, 'workload')
- EitherWORKLOAD.ONLINE
orWORKLOAD.OFFLINE
. This determines which node on the shard we prefer to send the query toJSONExtractString(JSONExtractString(log_comment, 'query_settings'), '<setting_name>')
- Any settings passed along with the query can be found within thequery_settings
object. To get the value for a specific setting, you need to callJSONExtractX
twice.JSONExtractString(log_comment, 'filter')
- TheFilter
object for an insight. You'll need to use array functions andJSONExtract
functions in order to access deeply nested filter data