Pt-query-digest parse tcpdump and show filtered queries

Hello! I’ve read pt-query-digest documentation but didn’t find out how to do the thing I need

I have tcpdump pcap file and I want to extract all unique dataless sql queries

So I don’t need fingerprint logic, but something really similar without queries merging into subclasses but with uniqueness logic in place

In the end I’d like to receive info something like that:
(client ip) (operations over table: select, update, join etc) (table names in the query)

I guess fingerprinting would be okay but I need IPs to be grouped by table with all types of fingerprints

(Table/tables names) (ips of clients who accessed tables) (Fingerprints corresponding to tables)

I hope someone could help me and I will really appreciate any suggestions. Have a good day and thank you!

Hi @Andrei_Kiselev,
I do not think you can get exactly what you want directly from pt-query-digest. You will be able to extract the information you’re seeking with multiple commands or even write a script to achieve this.

extract all unique dataless sql queries

  • Wonder what do you mean and what parsing logic would you apply here! Do you mean DMLs having data?
  • May be greps with specific commands?

I guess fingerprinting would be okay but I need IPs to be grouped by table with all types of fingerprints

  • Default grouping is based on fingerprinting
  • You probably are looking for –group-by but if you read the documentation there is no grouping based on “ip”. I’d encourage you to try out the options and see what you get.

In the end I’d like to receive info something like that:
(client ip) (operations over table: select, update, join etc) (table names in the query)

  • Here you are looking to group all the queries that parsed from particular IP and what operation did it do
  • You might need to prepare digest from the log with filtering per IP and then further parse the output using awk/sed/grep etc

Go ahead, try out some, share us with your sample inputs and outputs. There is no single answer.

Thanks,
K

Hello, @kedarpercona! Thank you for your reply!

At the end I’ve just used tshark like that:
tshark -r data.pcap -d tcp.port==3306,mysql -Y 'mysql.query and mysql.query != ""' -T fields -e frame.time -e ip.src -e mysql.query > raw-ipqueries.txt

It allows me to gather output in this format:

Apr 13, 2025 12:37:47.331092000 GMT     127.0.0.1       UPDATE clients SET name = ? WHERE ID = ?
Apr 13, 2025 12:37:47.333148000 GMT     127.0.0.1       select id, name from clients where name in (select name from clients_helper);

Then I parse it with python by using sqlglot. At the end I get the output I needed:

{[
  "clients": [
    {
      "names": "localhost",
      "expressions": "select",
      "type": "read",
      "tables": "clients"
    },
    {
      "names": "localhost",
      "expressions": "select,select,select,subquery,subquery",
      "type": "read",
      "tables": "clients"
    }
    ]
  "clients_helper": [
    {
      "names": "localhost",
      "expressions": "select,select,subquery",
      "type": "read",
      "tables": "clients,clients_helper"
    },
    {
      "names": "localhost",
      "expressions": "select,select,subquery",
      "type": "read",
      "tables": "clients,clients_helper"
    }
    ]
  "command_expressions": [
    {
      "names": "localhost",
      "expressions": "call lookup('user2')",
      "type": "call",
      "tables": ""
    },
    {
      "names": "localhost",
      "expressions": "call lookup('user2')",
      "type": "call",
      "tables": ""
    }
    ]
]}

Haven’t figured out how to parse pt-query-digest output to match my requirements, but I really appreciate you for your help!