The QAN p99 calculation seems to be wrong?

PMM2 has a dashboard called "RED Method for MySQL Queries - Designed for PMM2".

One of the panels contains a p99Latency indicator; the query behind it is:

SELECT
    $timeSeries as t,
    sum(m_query_time_sum)/sum(m_query_time_cnt) as AvgLatency,
    avg(m_query_time_p99) as p99Latency
FROM $table
WHERE $timeFilter and service_type='mysql' and labels.value=['$labels_env']
GROUP BY t
ORDER BY t

In QAN, when you click on a query, a ClickHouse query runs behind the scenes.

The actual statement running in ClickHouse is easy to capture:

SELECT
    SUM(num_queries) AS num_queries,
    SUM(num_queries_with_errors) AS num_queries_with_errors,
    SUM(num_queries_with_warnings) AS num_queries_with_warnings,
    SUM(m_query_time_cnt) AS m_query_time_cnt,
    SUM(m_query_time_sum) AS m_query_time_sum,
    MIN(m_query_time_min) AS m_query_time_min,
    MAX(m_query_time_max) AS m_query_time_max,
    AVG(m_query_time_p99) AS m_query_time_p99,
    ...
FROM metrics
WHERE (period_start >= 1648656000) AND (period_start <= 1648742399) AND (queryid = 'C782C3D2DA2585FF') AND (replication_set IN ('mha_rd_geass')) AND (schema IN ('broker')) AND (service_name IN ('bj1-mysql-broker-prod-01_3306'))
GROUP BY queryid
    WITH TOTALS

Query id: 4ba24d70-b315-4ef8-8cb9-7a5bb7b244e1

Row 1:
──────
num_queries:                  825001
num_queries_with_errors:      0
num_queries_with_warnings:    0
m_query_time_cnt:             8251
m_query_time_sum:             1130.888653561473
m_query_time_min:             0.000635
m_query_time_max:             3.109454
m_query_time_p99:             0.0037712189824731703
...

The above results show that QAN (and the RED Method for MySQL Queries dashboard) uses avg(m_query_time_p99) to calculate the "query time P99" (0.0037712189824731703).

But is this really the right way to calculate it? I think it might not be.

Here is what I think is the correct way to calculate "query time P99":

SELECT quantileExactWeighted(0.99)(m_query_time_p99, toUInt64(total_queries)) AS p99
FROM
(
    SELECT
        m_query_time_p99,
        sum(num_queries) AS total_queries
    FROM pmm.metrics
    WHERE (period_start >= 1648656000) AND (period_start <= 1648742399) AND (queryid = 'C782C3D2DA2585FF') AND (replication_set IN ('mha_rd_geass')) AND (schema IN ('broker')) AND (service_name IN ('bj1-mysql-broker-prod-01_3306'))
    GROUP BY m_query_time_p99
)

Query id: e4ed1258-35e8-4fad-baae-45bba11d8852

┌──────p99─┐
│ 0.004027 │
└──────────┘

I also verified the calculation in Python:

SELECT
    m_query_time_p99,
    sum(num_queries)
FROM pmm.metrics
WHERE (period_start >= 1648656000) AND (period_start <= 1648742399) AND (queryid = 'C782C3D2DA2585FF') AND (replication_set IN ('mha_rd_geass')) AND (schema IN ('broker')) AND (service_name IN ('bj1-mysql-broker-prod-01_3306'))
GROUP BY m_query_time_p99

Query id: 8bedf2a1-351b-424c-afe2-18fda812247f

┌─m_query_time_p99─┬─sum(num_queries)─┐
│         0.001691 │             1400 │
│         0.001471 │             2900 │
│         0.005632 │              900 │
│         0.001472 │             3600 │
│         0.001762 │              800 │
│         0.001396 │             1900 │
│         0.002547 │             1000 │
│         0.001795 │              900 │
...
import numpy as np

# Each distinct per-window p99 value paired with the total number of
# queries observed in the windows that reported it.
s = (
    (0.001691, 1400),
    (0.001471, 2900),
    (0.005632, 900),
    (0.001472, 3600),
    (0.001762, 800),
    (0.001396, 1900),
    (0.002547, 1000),
    (0.001795, 900),
    # ... remaining rows elided
)

# Expand each p99 value `count` times so np.percentile sees one element
# per query, then take the 99th percentile.
arr = []
for query_time, count in s:
    arr.extend([query_time] * count)

a = np.array(arr)

print(np.percentile(a, 99))
# 0.004027
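
Expanding one list element per query works, but it is wasteful when the counts are large. Here is a minimal sketch of the same weighted-percentile idea using cumulative weights, which is, as far as I understand, roughly how quantileExactWeighted behaves (the eight rows shown here are only a sample, so the printed value differs from the full-data result of 0.004027):

import numpy as np

# The same (value, weight) sample as above.
values  = np.array([0.001691, 0.001471, 0.005632, 0.001472,
                    0.001762, 0.001396, 0.002547, 0.001795])
weights = np.array([1400, 2900, 900, 3600, 800, 1900, 1000, 900])

# Sort by value, accumulate the weights, and take the first value whose
# cumulative weight reaches 99% of the total weight.
order = np.argsort(values)
cum = np.cumsum(weights[order])
p99 = values[order][np.searchsorted(cum, 0.99 * cum[-1])]
print(p99)  # 0.005632 on this truncated sample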

I hope someone can answer: which calculation is the correct one?


QAN's P99 is wrong. Here's my version; you can learn from it:

            SELECT
                env,
                dept,
                query_id,
                cluster,
                database,
                username,
                instance_list,
                fingerprint,
                example,
                num_queries,
                query_time_cnt,
                query_time_sum,
                query_time_min,
                query_time_max,
                query_time_p99,
                query_time_avg,
                lock_time_cnt,
                lock_time_sum,
                lock_time_min,
                lock_time_max,
                lock_time_p99,
                lock_time_avg,
                rows_sent_cnt,
                rows_sent_sum,
                rows_sent_min,
                rows_sent_max,
                rows_sent_p99,
                rows_sent_avg,
                bytes_sent_cnt,
                bytes_sent_sum,
                bytes_sent_min,
                bytes_sent_max,
                bytes_sent_p99,
                bytes_sent_avg,
                first_seen,
                last_seen,
                range_start,
                range_end,
                idempotent_mark,
                query_time_p99 AS m_query_time_p99,
                lock_time_p99 AS m_lock_time_p99,
                rows_sent_p99 AS m_rows_sent_p99,
                bytes_sent_p99 AS m_bytes_sent_p99
            FROM
            (
                SELECT
                    '{env}' env,
                    '{dept}' dept,
                    queryid query_id,
                    '{cluster_id}' cluster,
                    '{database}' database,
                    username,
                    groupUniqArray(service_name) AS instance_list,
                    fingerprint,
                    max(example) AS example,
                    sum(total_queries) AS num_queries,
                    sum(m_query_time_cnt) AS query_time_cnt,
                    sum(m_query_time_sum) AS query_time_sum,
                    min(m_query_time_min) AS query_time_min,
                    max(m_query_time_max) AS query_time_max,
                    quantileExactWeighted(0.99)(m_query_time_p99, toUInt64(total_queries)) AS query_time_p99,
                    sum(m_query_time_sum) / num_queries AS query_time_avg,
                    sum(m_lock_time_cnt) AS lock_time_cnt,
                    sum(m_lock_time_sum) AS lock_time_sum,
                    min(m_lock_time_min) AS lock_time_min,
                    max(m_lock_time_max) AS lock_time_max,
                    quantileExactWeighted(0.99)(m_lock_time_p99, toUInt64(total_queries)) AS lock_time_p99,
                    sum(m_lock_time_sum) / num_queries AS lock_time_avg,
                    sum(m_rows_sent_cnt) AS rows_sent_cnt,
                    sum(m_rows_sent_sum) AS rows_sent_sum,
                    min(m_rows_sent_min) AS rows_sent_min,
                    max(m_rows_sent_max) AS rows_sent_max,
                    quantileExactWeighted(0.99)(m_rows_sent_p99, toUInt64(total_queries)) AS rows_sent_p99,
                    sum(m_rows_sent_sum) / num_queries AS rows_sent_avg,
                    sum(m_bytes_sent_cnt) AS bytes_sent_cnt,
                    sum(m_bytes_sent_sum) AS bytes_sent_sum,
                    min(m_bytes_sent_min) AS bytes_sent_min,
                    max(m_bytes_sent_max) AS bytes_sent_max,
                    quantileExactWeighted(0.99)(m_bytes_sent_p99, toUInt64(total_queries)) AS bytes_sent_p99,
                    sum(m_bytes_sent_sum) / num_queries AS bytes_sent_avg,
                    min(first_seen) AS first_seen,
                    max(last_seen) AS last_seen,
                    '{self.range_start}' range_start,
                    '{self.range_end}' range_end,
                    concat(toString(toUnixTimestamp(first_seen)), '_', toString(toUnixTimestamp(last_seen))) AS idempotent_mark
                FROM
                (
                    SELECT
                        queryid,
                        username,
                        service_name,
                        fingerprint,
                        max(example) AS example,
                        sum(num_queries) AS total_queries,
                        sum(m_query_time_cnt) AS m_query_time_cnt,
                        sum(m_query_time_sum) AS m_query_time_sum,
                        min(m_query_time_min) AS m_query_time_min,
                        max(m_query_time_max) AS m_query_time_max,
                        m_query_time_p99,
                        sum(m_lock_time_cnt) AS m_lock_time_cnt,
                        sum(m_lock_time_sum) AS m_lock_time_sum,
                        min(m_lock_time_min) AS m_lock_time_min,
                        max(m_lock_time_max) AS m_lock_time_max,
                        m_lock_time_p99,
                        sum(m_rows_sent_cnt) AS m_rows_sent_cnt,
                        sum(m_rows_sent_sum) AS m_rows_sent_sum,
                        min(m_rows_sent_min) AS m_rows_sent_min,
                        max(m_rows_sent_max) AS m_rows_sent_max,
                        m_rows_sent_p99,
                        sum(m_bytes_sent_cnt) AS m_bytes_sent_cnt,
                        sum(m_bytes_sent_sum) AS m_bytes_sent_sum,
                        min(m_bytes_sent_min) AS m_bytes_sent_min,
                        max(m_bytes_sent_max) AS m_bytes_sent_max,
                        m_bytes_sent_p99,
                        min(period_start) AS first_seen,
                        max(period_start) AS last_seen
                    FROM pmm.{self.ch_table}
                    WHERE {base_predicates}
                    GROUP BY
                        queryid,
                        username,
                        service_name,
                        fingerprint,
                        m_query_time_p99,
                        m_lock_time_p99,
                        m_rows_sent_p99,
                        m_bytes_sent_p99
                )
                GROUP BY
                    queryid,
                    username,
                    fingerprint
            ) AS s WHERE 1=1 AND {threshold}

Hi @Fan, thank you very much for sharing your findings with the community! As the RED Method dashboard is authored by Percona but not officially supported in PMM, we'll update the dashboard hosted on Dashboards | Grafana Labs as soon as possible. Thanks for your help!


Hi Fan,

I think both statements are wrong, in different ways. Mine computes the AVG of the p99 across the data points, and yours, as I understand it, computes a p99 of the p99 aggregates, which is also not a correct way to do it; arguably you can't compute the real p99 from aggregates at all, as you need more information about the distribution.

My choice to use avg was because, while it is not correct, it is simple and generally provides directionally useful data.

It would be interesting, though, to know if there is some math work which allows aggregating p99s in a way that comes close to the p99 that would be computed over the actual data set.
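
A quick simulation makes this concrete (a sketch with made-up lognormal latencies, not real PMM data): neither avg(p99) nor a weighted quantile of the per-window p99s reproduces the p99 computed over the raw samples.

import numpy as np

rng = np.random.default_rng(0)

# 100 one-minute "windows" with very different query counts, all drawn
# from the same lognormal latency distribution.
windows = [rng.lognormal(mean=-6, sigma=1, size=int(n))
           for n in rng.integers(10, 10_000, size=100)]

# The p99 over the raw samples, which PMM never sees.
true_p99 = np.percentile(np.concatenate(windows), 99)

# What PMM stores per window: a p99 estimate and a query count.
p99s   = np.array([np.percentile(w, 99) for w in windows])
counts = np.array([len(w) for w in windows])

avg_of_p99 = p99s.mean()                 # the dashboard's approach
order = np.argsort(p99s)                 # weighted p99-of-p99s
cum = np.cumsum(counts[order])
weighted_p99 = p99s[order][np.searchsorted(cum, 0.99 * cum[-1])]

print(true_p99, avg_of_p99, weighted_p99)  # three different numbers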


Yes, I understand what you mean. PMM calculates P99 once per window of 60s (I'm not sure it's exactly 60s; I didn't look at the code), so that P99 is already the value after one round of aggregation. To calculate the real P99 for a SQL statement that ran 10,000 times, you would need all 10,000 rows of raw data. Although my calculation is inexact, avg is clearly even more inaccurate. As an example:

In one 60s window, the SQL runs once, taking 10s, so PMM's aggregated P99 is 10s.
In another 60s window, the SQL runs 10,000 times, and PMM's aggregated P99 is 1s.

With avg(p99), the P99 comes out as 5.5s.
The way I do it is obviously more accurate than that.
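
In code, a minimal sketch of that two-window example, using the same weighted-quantile idea as my query above:

import numpy as np

# Window 1: the query ran once and took 10s (window p99 = 10s).
# Window 2: the query ran 10,000 times with a window p99 of 1s.
p99s   = np.array([10.0, 1.0])
counts = np.array([1, 10_000])

print(p99s.mean())   # avg(p99) -> 5.5, nowhere near the real behaviour

order = np.argsort(p99s)
cum = np.cumsum(counts[order])
print(p99s[order][np.searchsorted(cum, 0.99 * cum[-1])])  # weighted -> 1.0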


This post also shows that the Average Query Execution Time calculation

# Average Query Execution Time for Last 6 hours 
 
select avg(m_query_time_sum / m_query_time_cnt) from metrics where period_start > subtractHours(now(), 6);

is wrong too
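
Here is a toy illustration (made-up numbers) of why averaging the per-window means is skewed: a window that saw one query gets the same weight as a window that saw a thousand. The overall average should instead be sum(m_query_time_sum) / sum(m_query_time_cnt), which is what the dashboard's AvgLatency expression already does.

# (m_query_time_sum, m_query_time_cnt) for two hypothetical windows
rows = [
    (10.0, 1),      # one slow query: window mean 10s
    (100.0, 1000),  # a busy window:  window mean 0.1s
]

avg_of_means = sum(s / c for s, c in rows) / len(rows)
true_mean = sum(s for s, c in rows) / sum(c for s, c in rows)

print(avg_of_means)  # (10 + 0.1) / 2 = 5.05
print(true_mean)     # 110 / 1001  ~= 0.11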
