Nearest value joins

Hi,

I have two tables, both containing measurement values with (unix_)timestamps. What is the best way to left join one of these with the other on the nearest timestamp value?

Currently I am doing something like this:

CREATE TABLE Session
(Session_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
Session_Data INT UNSIGNED NOT NULL);

CREATE TABLE Meas_Data
(Time INT UNSIGNED NOT NULL,
Id DEC(15) UNSIGNED ZEROFILL NOT NULL,
Data DEC(15) UNSIGNED ZEROFILL NOT NULL,
Session_id BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (Time, Id),
FOREIGN KEY (Session_id) REFERENCES Session (Session_id)
ON DELETE NO ACTION ON UPDATE CASCADE);

CREATE TABLE GPS
(Session_id BIGINT UNSIGNED NOT NULL,
UTC_s INT UNSIGNED NOT NULL COMMENT 'Timestamp',
UTC_us DEC(3) UNSIGNED NOT NULL COMMENT 'microseconds',
Longitude DOUBLE NOT NULL,
Latitude DOUBLE NOT NULL,
HDOP DEC(2,1) UNSIGNED,
Altitude FLOAT,
Fix ENUM('Invalid', '2D', '3D'),
COG FLOAT,
Speed FLOAT,
Sat DEC(2) UNSIGNED,
PRIMARY KEY (Session_id, UTC_s),
FOREIGN KEY (Session_id) REFERENCES Session (Session_id)
ON DELETE CASCADE ON UPDATE NO ACTION);

DELIMITER //
CREATE FUNCTION get_gps_distance(mt_ INT UNSIGNED, si_ BIGINT UNSIGNED)
RETURNS INT UNSIGNED
READS SQL DATA
COMMENT 'Returns the time difference to the nearest GPS fix'
BEGIN
DECLARE rv_ INT UNSIGNED;
-- CAST to SIGNED so the subtraction cannot underflow when UTC_s > mt_
SELECT MIN(ABS(CAST(mt_ AS SIGNED) - CAST(UTC_s AS SIGNED))) INTO rv_ FROM GPS
WHERE Session_id = si_;
RETURN rv_;
END//
DELIMITER ;

SELECT *
FROM Meas_Data NATURAL JOIN Session
LEFT JOIN GPS
ON GPS.Session_id = Meas_Data.Session_id
AND ABS(CAST(Meas_Data.Time AS SIGNED) - CAST(GPS.UTC_s AS SIGNED))
= get_gps_distance(Meas_Data.Time, Meas_Data.Session_id);

When I added the LEFT JOIN, my tests on realistic data slowed down by a factor of ten. Is there a better way to do such joins?

Thx

ImRe

ImRe,

What does EXPLAIN show for this query?

In general it is not surprising at all, as your nearest-distance check may be pretty expensive.

I would also build the query as a plain query (without the stored function) first, as otherwise you hide the complexity.
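
For example, the function could be inlined as a correlated subquery, roughly like this (a sketch only, reusing the table and column names from the schema above):

-- Sketch: the stored function inlined as a correlated subquery
SELECT *
FROM Meas_Data NATURAL JOIN Session
LEFT JOIN GPS
ON GPS.Session_id = Meas_Data.Session_id
AND ABS(CAST(Meas_Data.Time AS SIGNED) - CAST(GPS.UTC_s AS SIGNED)) =
(SELECT MIN(ABS(CAST(Meas_Data.Time AS SIGNED) - CAST(g2.UTC_s AS SIGNED)))
FROM GPS g2
WHERE g2.Session_id = Meas_Data.Session_id);

This way EXPLAIN shows the real access pattern instead of hiding the per-row GPS lookup inside the function.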

Peter,

I rewrote the query with a subquery instead of the function. Here is the EXPLAIN output:

+----+--------------------+-----------+-------+---------------+---------+---------+--------------------------+------+--------------------------+
| id | select_type        | table     | type  | possible_keys | key     | key_len | ref                      | rows | Extra                    |
+----+--------------------+-----------+-------+---------------+---------+---------+--------------------------+------+--------------------------+
|  1 | PRIMARY            | meas_data | range | PRIMARY       | PRIMARY | 4       | NULL                     |  308 | Using where              |
|  1 | PRIMARY            | GPS       | ref   | PRIMARY       | PRIMARY | 8       | gi2.meas_data.Session_id |  498 |                          |
|  2 | DEPENDENT SUBQUERY | GPS       | ref   | PRIMARY       | PRIMARY | 8       | func                     |  498 | Using where; Using index |
+----+--------------------+-----------+-------+---------------+---------+---------+--------------------------+------+--------------------------+

It looks pretty innocent to me …

Well, the number of rows is pretty large.

300 * 500 * 500 is 75,000,000, which is quite a number. The real number may be much smaller due to the "Using where", but it can still be significant.

You can profile your query and see how many rows each part really needs to examine.
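
One simple way is with the handler status counters:

FLUSH STATUS;
-- run the query under test here
SHOW SESSION STATUS LIKE 'Handler_read%';

Handler_read_next counts rows read in index order and Handler_read_rnd_next counts rows read by table scan, so comparing the counters before and after the query shows how many rows each access path really touched.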