What exactly does m_query_time_cnt mean?

I’m a bit confused about the meaning of m_query_time_cnt:
m_query_time_cnt Float32 COMMENT 'The statement execution time in seconds was met.',

At first I thought that m_query_time_cnt and num_queries were the same, but I found out that they are not.

In this blog post, there is a SQL statement to query the average execution time:

# Average Query Execution Time for Last 6 hours 
 
select avg(m_query_time_sum/m_query_time_cnt)  from metrics where period_start>subtractHours(now(),6);

In practice, however, this result does not appear to be correct:

SELECT avg(m_query_time_sum / m_query_time_cnt)
FROM metrics
WHERE (period_start >= formatDateTime(yesterday() - 1, '%Y-%m-%d 16:00:00')) AND (period_start < formatDateTime(today() - 1, '%Y-%m-%d 16:00:00')) AND (queryid = '743A2DB74D4CE96F')

Query id: 5ce4af64-4279-4879-bc66-290a84b3c5b5

┌─avg(divide(m_query_time_sum, m_query_time_cnt))─┐
│                               75.04540252685547 │
└─────────────────────────────────────────────────┘

I think the correct SQL is:

SELECT sum(m_query_time_sum) / sum(num_queries)
FROM metrics
WHERE (period_start >= formatDateTime(yesterday() - 1, '%Y-%m-%d 16:00:00')) AND (period_start < formatDateTime(today() - 1, '%Y-%m-%d 16:00:00')) AND (queryid = '743A2DB74D4CE96F')

Query id: 29eef184-dda9-4729-ace5-6909598c3d91

┌─divide(sum(m_query_time_sum), sum(num_queries))─┐
│                              0.7504540252685546 │
└─────────────────────────────────────────────────┘

This can also be verified via qan-api:

SELECT
    num_queries,
    m_query_time_cnt,
    m_query_time_sum,
    m_query_time_min,
    m_query_time_max,
    m_query_time_p99
FROM metrics
WHERE queryid = '743A2DB74D4CE96F'
ORDER BY period_start DESC
LIMIT 1

Query id: 4fa1b6e4-9c70-4828-b2b9-9e0976e19d1b

Row 1:
──────
num_queries:      100
m_query_time_cnt: 1
m_query_time_sum: 71.1733
m_query_time_min: 0.711733
m_query_time_max: 0.711733
m_query_time_p99: 0.711733

The above query also shows that query_time_avg should be equal to m_query_time_sum / num_queries.
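To make the mismatch concrete, here is a quick arithmetic check (my own sketch; the values are copied from the row above, and log_slow_rate_limit=100 is the server setting mentioned elsewhere in this thread):

```python
# Values copied from the query result above; log_slow_rate_limit=100
# is the server's slow-log sampling setting.
num_queries = 100
m_query_time_cnt = 1
m_query_time_sum = 71.1733
m_query_time_min = 0.711733
m_query_time_max = 0.711733

# The blog's formula divides by m_query_time_cnt:
blog_avg = m_query_time_sum / m_query_time_cnt   # 71.1733, far above max

# Dividing by num_queries instead lands exactly on min == max,
# which is what a true per-query average should look like here:
avg = m_query_time_sum / num_queries             # 0.711733

assert abs(avg - m_query_time_min) < 1e-9
print(avg, blog_avg)
```

The two results differ by exactly the sampling factor of 100, which is why the blog's formula reports ~75 s instead of ~0.75 s above.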

So, I’m a bit confused about the meaning of m_query_time_cnt, what does it do and how is it calculated?

2021.11.08
It seems that m_xx_xx_cnt represents the number of times the statement appears in the slow query log within this collection window.

So, using the above result as an example, m_query_time_cnt=1 means that this query statement appeared in the slow query log once during this collection period, and since my parameter log_slow_rate_limit=100, num_queries = m_query_time_cnt * log_slow_rate_limit = 1 * 100 = 100. Is that right?

The dashboard "RED Method for MySQL Queries - Designed for PMM2" has a panel with this query:

SELECT
    num_queries,
    m_query_time_cnt,
    m_query_time_sum,
    m_query_time_min,
    m_query_time_max,
    m_query_time_p99
FROM pmm.metrics
WHERE (period_start = '2021-11-09 05:02:00') AND (period_start < '2021-11-09 16:00:00') AND (1 = 1) AND (service_type = 'mysql') AND (queryid = 'A3F4BE1B0BE03335')

Query id: 6e8a09d7-da0e-4927-ba1f-ca4d503acb0a

Row 1:
──────
num_queries:      14400
m_query_time_cnt: 144
m_query_time_sum: 2.1127
m_query_time_min: 0.000103
m_query_time_max: 0.000222
m_query_time_p99: 0.000212

1 rows in set. Elapsed: 0.008 sec. Processed 8.19 thousand rows, 457.58 KB (1.03 million rows/s., 57.63 MB/s.)

This query returns only one row.

SELECT sum(m_query_time_sum) / sum(m_query_time_cnt)
FROM pmm.metrics
WHERE (period_start = '2021-11-09 05:02:00') AND (1 = 1) AND (service_type = 'mysql') AND (queryid = 'A3F4BE1B0BE03335')

Query id: 333f32e1-dab2-4839-adaf-579c774409df

┌─divide(sum(m_query_time_sum), sum(m_query_time_cnt))─┐
│                                 0.014671527677112155 │
└──────────────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.008 sec. Processed 8.19 thousand rows, 326.51 KB (983.38 thousand rows/s., 39.20 MB/s.)

sum(m_query_time_sum) / sum(m_query_time_cnt) = 0.014671527677112155

0.014671527677112155 > m_query_time_max (0.000222). Why?
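One way to make the puzzle concrete (my own arithmetic on the row above, not an authoritative explanation): dividing the sum by num_queries gives a value that does lie between min and max, while dividing by m_query_time_cnt gives a value roughly log_slow_rate_limit times larger:

```python
# Values copied from the row above.
num_queries = 14400
m_query_time_cnt = 144
m_query_time_sum = 2.1127
m_query_time_min = 0.000103
m_query_time_max = 0.000222

per_cnt = m_query_time_sum / m_query_time_cnt   # ~0.014672, far above max
per_query = m_query_time_sum / num_queries      # ~0.000147

# Only the per-num_queries value is a plausible average:
assert m_query_time_min <= per_query <= m_query_time_max
assert per_cnt > m_query_time_max
print(per_cnt, per_query)
```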

At first I thought that m_xx_xx_cnt represents the number of times the statement appears in the slow query log within this collection window.

So, with my parameter log_slow_rate_limit=100, num_queries = m_query_time_cnt * log_slow_rate_limit = 1 * 100 = 100.

But I find it doesn't seem to work that way either:

SELECT
    queryid,
    num_queries,
    m_query_time_cnt,
    m_query_time_sum,
    m_query_time_min,
    m_query_time_max,
    m_query_time_p99
FROM pmm.metrics
WHERE (queryid = '25924BD87338F559') AND (period_start = '2021-11-05 16:00:00') AND (period_start < '2021-11-09 16:00:00')

Query id: bfccd505-e0a6-439e-a5b9-ff474abac76a

┌─queryid──────────┬─num_queries─┬─m_query_time_cnt─┬─m_query_time_sum─┬─m_query_time_min─┬─m_query_time_max─┬─m_query_time_p99─┐
│ 25924BD87338F559 │        3600 │               36 │           0.9135 │         0.000171 │         0.000753 │         0.000753 │
│ 25924BD87338F559 │        3101 │               32 │         2.234573 │         0.000219 │         1.212773 │         1.212773 │
│ 25924BD87338F559 │        6301 │               64 │          4.29576 │         0.000194 │          1.76006 │          1.76006 │
└──────────────────┴─────────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────┴──────────────────┘

3 rows in set. Elapsed: 0.007 sec. Processed 8.19 thousand rows, 237.94 KB (1.16 million rows/s., 33.57 MB/s.)

num_queries, m_query_time_cnt
3101, 32
6301, 64
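A quick check of those ratios (my own arithmetic, nothing assumed beyond the numbers shown above):

```python
# num_queries / m_query_time_cnt for the three rows above.
rows = [(3600, 36), (3101, 32), (6301, 64)]
ratios = [n / c for n, c in rows]
print(ratios)  # [100.0, 96.90625, 98.453125]
# Close to log_slow_rate_limit = 100, but not exact for two of the rows.
```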


Hi @Fan,
You are correct:

num_queries is the real number of queries (calculated with log_slow_rate_limit),
and m_query_time_cnt is “how many times the metric appeared”.
It's not exactly a 100x difference in your case because of rounding in the intervals.
We need to review the blog post and revisit the implementation and field usage to reduce the confusion.


Thank you @roma.novikov, I get it.


So, is the SQL in the RED dashboard and the SQL in the blog actually correct? And if not, has it been corrected?


Still waiting for your answer.


Using m_query_time_sum / m_query_time_cnt to calculate latency is absolutely wrong. The blog remains uncorrected.


I'm wondering how you get m_query_time_cnt=1 in your case where num_queries=100. This is the issue.

Generally, m_query_time_cnt should be equal to num_queries, as ALL queries should provide the query_time metric.

I am using the metric-specific count instead of num_queries for consistency: to compute the average value for any metric, you can divide the SUM of that metric by the number of queries for which this metric was reported.


We use Percona Server 5.7:

# SLOW LOGGING #
slow_query_log_file                                           = {{ mysql_dir }}/logs/slow.log
slow_query_log                                                = {{ slow_query_log|default('ON') }}
long_query_time                                               = {{ long_query_time|default(0) }}
log_slow_rate_limit                                           = {{ log_slow_rate_limit|default(100) }}
log_slow_rate_type                                            = {{ log_slow_rate_type|default('query') }}
log_slow_verbosity                                            = {{ log_slow_verbosity|default('full') }}
log_slow_admin_statements                                     = {{ log_slow_admin_statements|default('ON') }}
log_slow_slave_statements                                     = {{ log_slow_slave_statements|default('ON') }}
slow_query_log_always_write_time                              = {{ slow_query_log_always_write_time|default(1) }}
slow_query_log_use_global_control                             = {{ slow_query_log_use_global_control|default('all') }}
pmm-admin add mysql --disable-collectors="{{ ','.join(mysql_disable_collectors) }}" --query-source=slowlog --username={{ pmm_user }} --password={{ pmm_password }} --size-slow-logs=1GiB --environment=${env} --cluster={{ cmdb_cluster_name }} --replication-set={{ cmdb_cluster_name }} --custom-labels="region=${region},env=${env},ha_type=MGR" $HOSTNAME'_{{ mysql_port }}' 127.0.0.1:{{ mysql_port }}

It seems that num_queries = m_query_time_cnt * log_slow_rate_limit.


Right. It could be that the sampling handling got broken. I'd appreciate it if you could file a bug report!

	// How many times query_time was found.
	MQueryTimeCnt float32 `protobuf:"fixed32,19,opt,name=m_query_time_cnt,json=mQueryTimeCnt,proto3" json:"m_query_time_cnt,omitempty"`

MQueryTimeCnt is the number of times Query_time was found in the slow log.

# Time: 2022-07-19T11:06:04.333060+08:00
# User@Host: pmm[pmm] @  [127.0.0.1]  Id: 44844658
# Schema:   Last_errno: 1054  Killed: 0
# Query_time: 0.000131  Lock_time: 0.000000  Rows_sent: 0  Rows_examined: 0  Rows_affected: 0
# Bytes_sent: 68  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# QC_Hit: No  Full_scan: No  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
# No InnoDB statistics available for this query
# Log_slow_rate_type: query  Log_slow_rate_limit: 100

So m_query_time_cnt=1 is fine.

			// slowlog parser: Log_slow_rate_limit is read from the log header
			case smv[1] == "Log_slow_rate_limit":
				val, _ := strconv.ParseUint(smv[2], 10, 64)
				p.event.RateLimit = uint(val)

// aggregator: the number of queries is taken from TotalQueries...
NumQueries:           float32(v.TotalQueries),
// ...which is the sampled count scaled back up by the rate limit:
c.TotalQueries = (c.TotalQueries * rateLimit) + c.outliers
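Putting these fragments together, here is a minimal sketch of how I read the logic (a Python simplification under my own assumptions, not the actual pmm-agent code; judging from the rows earlier in this thread, the metric sums also get scaled by the rate limit, while the *_cnt fields count only the sampled events):

```python
def aggregate(events, rate_limit, outliers=0):
    """Sketch: 'events' are metric dicts parsed from sampled slow-log
    entries. The *_cnt field counts events where the metric appeared;
    totals are scaled back up by Log_slow_rate_limit to approximate
    the real load."""
    query_time_cnt = sum(1 for e in events if "Query_time" in e)
    raw_sum = sum(e.get("Query_time", 0.0) for e in events)
    # Mirrors c.TotalQueries = (c.TotalQueries * rateLimit) + c.outliers:
    num_queries = len(events) * rate_limit + outliers
    # Assumption: sums are scaled too, so they reflect all queries, not
    # just the sampled ones. cnt is NOT scaled, hence the 100x confusion.
    query_time_sum = raw_sum * rate_limit
    return num_queries, query_time_cnt, query_time_sum

# One sampled event with log_slow_rate_limit=100 reproduces the
# num_queries=100, m_query_time_cnt=1, m_query_time_sum=71.1733 row:
n, c, s = aggregate([{"Query_time": 0.711733}], rate_limit=100)
print(n, c, round(s, 4))
```

Under this reading, m_query_time_sum / num_queries is a true average, while m_query_time_sum / m_query_time_cnt is inflated by the rate limit whenever sampling is on.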

So, it looks like m_query_time_cnt and num_queries express different things, and it is normal for them not to be equal unless log_slow_rate_limit is left at its default value.

I am using the metric-specific count instead of num_queries for consistency: to compute the average value for any metric, you can divide the SUM of that metric by the number of queries for which this metric was reported.

All I can say is: maybe that is what was intended in the original design, but that is not how it's written in the code. Which behavior should prevail is something you need to decide internally, but it is clear that the query in the RED dashboard is problematic at this point.

As a user, when our company's R&D reported a problem with RED's query time, I had to go and change it to the correct formula myself, and then ask here on the forum to confirm whether it was really wrong, and if so, to suggest that someone with editing rights correct the blog.


Thank you.

We will investigate. I'm just explaining here what is by design and what is a bug.

The goal is that sampling (setting log_slow_rate_limit) should still represent the data as close as possible to the actual load on the system. This applies not only to the number of queries but also, for example, to network traffic, CPU used, etc.

It might be that it was indeed implemented differently, in which case we also need to make sure to update the documentation.

1 Like