The QAN p99 calculation seems to be wrong?

PMM2 has a dashboard called RED Method for MySQL Queries - Designed for PMM2

One of the panels contains an indicator p99Latency, the query statement is

SELECT
    $timeSeries as t,
    sum(m_query_time_sum)/sum(m_query_time_cnt) as AvgLatency,
    avg(m_query_time_p99) as p99Latency
FROM $table
WHERE $timeFilter and service_type='mysql' and labels.value=['$labels_env']
GROUP BY t
ORDER BY t

In QAN, if you click on a query statement, a clickhouse query statement will be run behind it


The actual statement running in clickhouse is easy to catch

SELECT
    SUM(num_queries) AS num_queries,
    SUM(num_queries_with_errors) AS num_queries_with_errors,
    SUM(num_queries_with_warnings) AS num_queries_with_warnings,
    SUM(m_query_time_cnt) AS m_query_time_cnt,
    SUM(m_query_time_sum) AS m_query_time_sum,
    MIN(m_query_time_min) AS m_query_time_min,
    MAX(m_query_time_max) AS m_query_time_max,
    AVG(m_query_time_p99) AS m_query_time_p99,
    ...
FROM metrics
WHERE (period_start >= 1648656000) AND (period_start <= 1648742399) AND (queryid = 'C782C3D2DA2585FF') AND (replication_set IN ('mha_rd_geass')) AND (schema IN ('broker')) AND (service_name IN ('bj1-mysql-broker-prod-01_3306'))
GROUP BY queryid
    WITH TOTALS

Query id: 4ba24d70-b315-4ef8-8cb9-7a5bb7b244e1

Row 1:
──────
num_queries:                  825001
num_queries_with_errors:      0
num_queries_with_warnings:    0
m_query_time_cnt:             8251
m_query_time_sum:             1130.888653561473
m_query_time_min:             0.000635
m_query_time_max:             3.109454
m_query_time_p99:             0.0037712189824731703
...

The above results show that QAN(and RED Method for MySQL Queries Dashboard) uses avg(m_query_time_p99) to calculate query β€œquery time P99” (0.0037712189824731703)

But is this really the right way to calculate it? I think it might not be.

Here is what I think is the correct way to calculate β€œquery time P99”:

SELECT quantileExactWeighted(0.99)(m_query_time_p99, toUInt64(total_queries)) AS p99
FROM
(
    SELECT
        m_query_time_p99,
        sum(num_queries) AS total_queries
    FROM pmm.metrics
    WHERE (period_start >= 1648656000) AND (period_start <= 1648742399) AND (queryid = 'C782C3D2DA2585FF') AND (replication_set IN ('mha_rd_geass')) AND (schema IN ('broker')) AND (service_name IN ('bj1-mysql-broker-prod-01_3306'))
    GROUP BY m_query_time_p99
)

Query id: e4ed1258-35e8-4fad-baae-45bba11d8852

β”Œβ”€β”€β”€β”€β”€β”€p99─┐
β”‚ 0.004027 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I also did some calculations in python

SELECT
    m_query_time_p99,
    sum(num_queries)
FROM pmm.metrics
WHERE (period_start >= 1648656000) AND (period_start <= 1648742399) AND (queryid = 'C782C3D2DA2585FF') AND (replication_set IN ('mha_rd_geass')) AND (schema IN ('broker')) AND (service_name IN ('bj1-mysql-broker-prod-01_3306'))
GROUP BY m_query_time_p99

Query id: 8bedf2a1-351b-424c-afe2-18fda812247f

β”Œβ”€m_query_time_p99─┬─sum(num_queries)─┐
β”‚         0.001691 β”‚             1400 β”‚
β”‚         0.001471 β”‚             2900 β”‚
β”‚         0.005632 β”‚              900 β”‚
β”‚         0.001472 β”‚             3600 β”‚
β”‚         0.001762 β”‚              800 β”‚
β”‚         0.001396 β”‚             1900 β”‚
β”‚         0.002547 β”‚             1000 β”‚
β”‚         0.001795 β”‚              900 β”‚
...
import numpy as np

s = ((0.001691, 1400)
,(0.001471, 2900)
,(0.005632, 900)
,(0.001472, 3600)
,(0.001762, 800)
,(0.001396, 1900)
,(0.002547, 1000)
,(0.001795, 900)
...
)

arr = list()
for i in s:
    query_time = i[0]
    count = i[1]
    for _ in range(count):
        arr.append(query_time)

a = np.array(arr)

print(np.percentile(a, 99))  
0.004027

I hope someone can answer the question, which calculation is the correct one?

2 Likes