OiO.lk Community platform!

Oio.lk is an excellent forum for developers, providing a wide range of resources, discussions, and support for those in the developer community. Join oio.lk today to connect with like-minded professionals, share insights, and stay updated on the latest trends and technologies in the development field.
  You need to log in or register to access the solved answers to this problem.
  • You have reached the maximum number of guest views allowed
  • Please register below to remove this limitation

Athena Query - "NOT IN" slow performance

  • Thread starter Thread starter Jordan
  • Start date Start date
J

Jordan

Guest
Context

I am building a query using Athena and I noted that using "NOT IN" was having a significant impact on the runtime of the query. The query is performing a join between two tables and aims to filter the rows of the result by excluding a set of ids.

The id is NOT unique. It is possible it will be present across multiple rows. Essentially I wrote the subquery as I want to ignore ANY/ALL of the rows with a given id value if the rows meet the condition in the subquery.

In other words, for a given id "24324" which has 5 rows in the database, we want to filter out ALL of the rows if ANY of the individual rows values meet the conditions in the subquery.

My question

Is there a more efficient way to structure this query that I have not considered to reduce the runtime?

The query

Table and column names have been sanitised with placeholder/generic values.

Code:
SELECT *
FROM table_a
INNER JOIN table_b ON table_b.external_id = table_a.external_id 
                   AND table_b.client_name = table_a.client
-- These are the ids we do not want in the final result.
WHERE table_b.id NOT IN (SELECT distinct(table_b.id)
                         FROM table_b
                         WHERE table_b.decision IS NOT NULL 
                            OR table_b.submission_time IS NOT NULL)
<p><strong>Context</strong></p>
<p>I am building a query using Athena and I noted that using "NOT IN" was having a significant impact on the runtime of the query. The query is performing a join between two tables and aims to filter the rows of the result by excluding a set of ids.</p>
<p>The id is NOT unique. It is possible it will be present across multiple rows. Essentially I wrote the subquery as I want to ignore ANY/ALL of the rows with a given id value if the rows meet the condition in the subquery.</p>
<p>In other words, for a given id "24324" which has 5 rows in the database, we want to filter out ALL of the rows if ANY of the individual rows values meet the conditions in the subquery.</p>
<p><strong>My question</strong></p>
<p>Is there a more efficient way to structure this query that I have not considered to reduce the runtime?</p>
<p><strong>The query</strong></p>
<p>Table and column names have been sanitised with placeholder/generic values.</p>
<pre><code>SELECT *
FROM table_a
INNER JOIN table_b ON table_b.external_id = table_a.external_id
AND table_b.client_name = table_a.client
-- These are the ids we do not want in the final result.
WHERE table_b.id NOT IN (SELECT distinct(table_b.id)
FROM table_b
WHERE table_b.decision IS NOT NULL
OR table_b.submission_time IS NOT NULL)
</code></pre>
Continue reading...
 

Latest posts

Top