Error in QAN page

Description:

When opening the QAN page:

Version:

2.33.0

Hi @PavelVD!
I recommend shortening the time interval and adding filters to your request. Additionally, it may be advisable to update your PMM server.

I’m curious how many databases you have added to PMM and whether you have reached the limit.

I cannot add any filters after the error appears.

About the number of services:

Hmm…not even close to the limits we’ve tested with…

I assume the error happens when you load the QAN page…the query view and the filters are each generated separately, so is the error immediate on page load, or does it appear after a few seconds and coincide with the spinner dying in the filter window?

You might get more details by looking at either /srv/logs/clickhouse-server.log or pmm-managed.log to help diagnose.
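Something along these lines should surface anything recent (just a rough sketch, assuming the default /srv/logs layout inside the PMM Server container):

# run inside the pmm-server container
tail -n 500 /srv/logs/clickhouse-server.log | grep -iE 'error|exception'
tail -n 500 /srv/logs/pmm-managed.log | grep -E 'ERRO|WARN'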

Yes, the error happens right after opening the QAN page.

One interesting record in the logs (no other errors or warnings):

WARN[2024-04-19T14:48:23.311+00:00] RPC /platform.Platform/UserStatus done in 35.88774ms with gRPC error: rpc error: code = Unauthenticated desc = Failed to get access token. Please sign in using your Percona Account. request=df0da4f5-fe5b-11ee-b40e-46572d7c83f8
INFO[2024-04-19T14:48:23.319+00:00] Round-trip time: 2.189198ms. Estimated clock drift: 23.158µs. agent_id=/agent_id/c054bfc8-3dee-42cd-83b1-d51777327d0b request=0f8faa84-fb30-11ee-b40e-46572d7c83f8
INFO[2024-04-19T14:48:23.518+00:00] Starting RPC /platform.Platform/ServerInfo … request=df32ae8a-fe5b-11ee-b40e-46572d7c83f8
ERRO[2024-04-19T14:48:23.519+00:00] failed to get SSO details: PMM Server is not connected to Portal component=platform

But we did not connect our PMM instance to Percona Platform, and the PMM server has no internet access.

Yeah, that error is benign…it’s a pre-check to see whether you’re connected to the Portal and, if so, to download the updated advisors list based on your account level.

The only other place I can think of would be grafana.log…but if something is displayed on the screen, it should be logged on the backend…

When you refresh the QAN page, can you run an ls -alrt to see which files were recently touched? Maybe it’s an nginx error (just thinking out loud on that one).

Found additional errors

WARN[2024-04-19T15:11:35.609+00:00] RPC /qan.v1beta1.Profile/GetReport done in 1.105571317s (quite long). request=1c4a3062-fe5f-11ee-9b8e-46572d7c83f8
WARN[2024-04-19T15:11:36.360+00:00] RPC /qan.v1beta1.Filters/Get done in 1.879481338s (quite long). request=1c46adf0-fe5f-11ee-9b8e-46572d7c83f8

About ls -alrt:

Nothing changed after refreshing the page:

-rwxrwxr-x 1 root root 1295 Dec 6 2022 entrypoint.sh
drwxr-xr-x 5 root root 50 Dec 6 2022 ansible
drwxr-xr-x 3 root root 17 Dec 6 2022 dbaas-tools
drwxr-xr-x 1 root root 61 Dec 6 2022 .
dr-xr-xr-x 1 root root 73 Apr 9 18:37 .

oops, sorry, I should have been more explicit:

run ls -alrt in /srv/logs/ so we can see which logs recorded something during the action.

Since no issue was found in the clickhouse or pmm-managed logs, my next suspects are nginx.log and qan-api2.log…none of the other logs make sense.
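Something like this, run right as you hit refresh, should do it (a sketch; if watch isn’t available in the container, a plain ls -alrt /srv/logs immediately after the refresh works too):

watch -n 1 'ls -alrt /srv/logs | tail -n 15'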

ls -alrt after the page refresh:

-rw-r--r--  1 root root 10485949 Apr 19 13:47 victoriametrics.log.3
-rw-r--r--  1 root root     1625 Apr 21 09:16 postgresql14.log
-rw-r--r--  1 root root    18032 Apr 21 09:16 pmm-update-perform-init.log
-rw-r--r--  1 root root    14937 Apr 21 09:17 alertmanager.log
-rw-r--r--  1 root root    30917 Apr 21 09:17 vmalert.log
-rw-r--r--  1 root root    40023 Apr 21 09:25 supervisord.log
-rw-r--r--  1 root root 10486221 Apr 21 09:49 victoriametrics.log.2
-rw-r--r--  1 root root 52428818 Apr 22 02:55 pmm-managed.log.2
-rw-r--r--  1 root root 10491994 Apr 22 04:27 qan-api2.log.3
-rw-r--r--  1 root root 10487373 Apr 22 08:50 qan-api2.log.2
-rw-r--r--  1 root root 52428973 Apr 22 10:02 pmm-managed.log.1
-rw-r--r--  1 root root 10485862 Apr 22 11:50 victoriametrics.log.1
-rw-r--r--  1 root root 52428898 Apr 22 12:53 nginx.log.2
-rw-r--r--  1 root root 10492192 Apr 22 13:12 qan-api2.log.1
-rw-r--r--  1 root root 52429044 Apr 22 14:04 nginx.log.1
drwxr-xr-x  2 pmm  pmm      4096 Apr 22 14:04 .
-rw-r--r--  1 root root  4424421 Apr 22 15:07 grafana.log
-rw-r--r--  1 root root  4689937 Apr 22 15:11 pmm-agent.log
-rw-r--r--  1 root root  1263828 Apr 22 15:11 victoriametrics.log
-rw-r--r--  1 root root   310157 Apr 22 15:11 clickhouse-server.log
-rw-r--r--  1 root root  4781062 Apr 22 15:11 qan-api2.log
-rw-r--r--  1 root root 37950760 Apr 22 15:11 pmm-managed.log
-rw-r--r--  1 root root 49877295 Apr 22 15:11 nginx.log

About the logs:

victoriametrics.log

2024-04-22T15:13:37.626Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.82.1/lib/promscrape/scraper.go:143  SIGHUP received; reloading Prometheus configs from "/etc/victoriametrics-promscrape.yml"
2024-04-22T15:13:37.628Z        info    /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.82.1/lib/promscrape/scraper.go:150  nothing changed in "/etc/victoriametrics-promscrape.yml"

pmm-managed.log
Many INFO messages (skipped) and only:

WARN[2024-04-22T15:16:52.947+00:00] RPC /platform.Platform/UserStatus done in 30.347715ms with gRPC error: rpc error: code = Unauthenticated desc = Failed to get access token. Please sign in using your Percona Account.  request=59518c7e-00bb-11ef-b731-0aba72e23446
ERRO[2024-04-22T15:16:53.199+00:00] failed to get SSO details: PMM Server is not connected to Portal  component=platform

nginx.log: many successful requests and one:

10.0.2.100 - - [22/Apr/2024:15:18:43 +0000] "POST /v0/qan/Filters/Get HTTP/2.0" 429 92 "https://pmm.int.zfx.com/graph/d/pmm-qan/pmm-query-analytics?var-environment=All&var-cluster=All&var-replication_set=All&var-database=All&var-schema=All&var-node_name=All&var-service_name=All&var-client_host=All&var-username=All&var-service_type=All&var-node_type=All&var-city=All&var-az=All&var-interval=auto&columns=%5B%22load%22,%22num_queries%22,%22query_time%22%5D&group_by=queryid&order_by=-load&from=now-12h&to=now&page_number=1&page_size=25&totals=false&details_tab=details&orgId=1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" "-"

grafana.log

logger=context userId=2 orgId=1 uname=XXX t=2024-04-22T15:23:30.96875913Z level=info msg="Request Completed" method=GET path=/api/live/ws status=-1 remote_addr=127.0.0.1 time_ms=1 duration=1.613344ms size=0 referer= handler=/api/live/ws

qan-api2.log

WARN[2024-04-22T15:25:00.096+00:00] Nothing to save - no metrics buckets.         component=data_ingestion
WARN[2024-04-22T15:25:00.101+00:00] Nothing to save - no metrics buckets.         component=data_ingestion

Take a look at qan-api2.log and grafana.log at the 22/Apr/2024:15:18:43 +0000 timestamp (when nginx returned the 429)…what was happening within a few seconds either side of that? The easier-to-search log format would be 2024-04-22T15:18.

I see the POST attempt in the nginx log, which was met with a 429, and I’d expect one of those two services (QAN API or Grafana) to have logged a corresponding error. Normally authentication is handled by Grafana, but the QAN API may be doing its own auth, or at least be “first to answer” even if it passes the auth request on to Grafana. If nothing shows up, you may have to put the app into debug mode (it will log a TON) to get more info on what’s happening.
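A rough way to pull those windows from inside the container (assuming the same /srv/logs paths; note that nginx uses a different timestamp format than the other logs):

# qan-api2 and grafana both log ISO-style timestamps
grep '2024-04-22T15:18' /srv/logs/qan-api2.log /srv/logs/grafana.log
# the nginx access log uses the Apache-style date format
grep '22/Apr/2024:15:18' /srv/logs/nginx.log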

One thing that popped into my mind: did you enable “anonymous” access, i.e. can you browse dashboards without logging in at all? I know that QAN doesn’t work well (or at least didn’t use to work well) with unauthenticated users…
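A quick way to check that (assuming the Grafana config shipped in the PMM Server container lives at the usual /etc/grafana/grafana.ini):

# anonymous access is controlled by the [auth.anonymous] section
grep -F -A 3 '[auth.anonymous]' /etc/grafana/grafana.ini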

Everything I could find in the logs after refreshing the page with the error.
I checked in a private browser tab.
We do not have anonymous access - we use strictly SSO auth in PMM.

==> /srv/logs/qan-api2.log <==
WARN[2024-04-23T19:34:37.670+00:00] RPC /qan.v1beta1.Filters/Get done in 1.722110453s (quite long).  request=846a831f-01a8-11ef-a92a-0aba72e23446

==> /srv/logs/nginx.log <==
10.0.2.100 - - [23/Apr/2024:19:34:37 +0000] "POST /v0/qan/Filters/Get HTTP/2.0" 429 92 "https://XXXX/graph/d/pmm-qan/pmm-query-analytics?var-environment=All&var-cluster=All&var-replication_set=All&var-database=All&var-schema=All&var-node_name=All&var-service_name=All&var-client_host=All&var-username=All&var-service_type=All&var-node_type=All&var-city=All&var-az=All&var-interval=auto&columns=%5B%22load%22,%22num_queries%22,%22query_time%22%5D&group_by=queryid&order_by=-load&from=now-12h&to=now&page_number=1&page_size=25&totals=false&details_tab=details&orgId=1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" "-"

Hmmm…could be a clue!

I’m going to post this thread internally so a few of the devs who know more about the security ins and outs can chime in. Given you use your own SSO (can you share which provider?), perhaps there’s a permission mapping that’s getting missed. Strange that it’d only affect the filters and not the main query panel, but I’ve seen stranger.

Hello @PavelVD,

Do you have comments parsing enabled? Is it possible that your queries have some high-cardinality data (e.g. trace IDs) in comments? I would recommend shortening the period to 1-5 minutes and checking whether the filters show up.
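If you want to check that from the ClickHouse side, something like the query below would show how many distinct values the filter dimensions carry over the problematic window. This is only a sketch: it assumes the default QAN storage (ClickHouse database pmm, table metrics) and that these column names exist in your version.

# run inside the pmm-server container; QAN data is stored in ClickHouse
clickhouse-client --database pmm --query '
SELECT
    uniq(queryid)     AS queryids,
    uniq(client_host) AS client_hosts,
    uniq(username)    AS usernames,
    uniq(schema)      AS schemas
FROM metrics
WHERE period_start > now() - INTERVAL 12 HOUR'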

Sorry for the late answer. I’ve run some tests and found that the problem occurs on any interval longer than 8 hours.