Performance differences between IN and EXISTS

Very often I hear people wondering if it’s better the EXISTS or the IN operator. Let’s start by saying that the two operators behave differently when NULL values are involved (see this post).

If you are sure that the field used to join the tables does not include NULL values then the IN operator “generally” performs better than EXISTS. I say “generally” because this is not always the case. For example consider the following query:

 
SELECT *
FROM table_a a
WHERE a.cd_field NOT IN (SELECT b.cd_field FROM table_b b);

The previous query is processed more or less as follows:

 
SELECT *
FROM table_a a, (SELECT DISTINCT cd_field FROM table_b) b
WHERE a.cd_field != b.cd_field;

The subquery is evaluated and the duplicates removed from the result. Then it is joined to the outer table. Now lets analyze the same query using NOT EXISTS:

 
SELECT *
FROM table_a a
WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE a.cd_field = b.cd_field);

The previous query is processed more or less as follows:

    
FOR x IN ( SELECT * FROM table_a ) LOOP
  IF ( NOT EXISTS ( SELECT 1 FROM table_b b WHERE x.cd_field = b.cd_field ) THEN
    --output the record
  END IF
END LOOP

This means that using NOT EXISTS always results in a full scan of the outer table whereas the query which uses NOT IN can exploit an index on table_a.cd_field. In general NOT IN performs better when the result of the subquery is small and there’s an index on the field used by the join. On the other hand, NOT EXISTS is to prefer when the subquery has many rows, the outer query is relatively small and there’s an index on the field used to join the two tables. In this latter case the cost of removing duplicates (DISTINCT) could be greater than the cost of the full scan of the outer table.

Of course this is a very general guideline, you should always conduct some test using indexes and so on to figure out which method to use.