Simplify SQL with Sub-Selects

In the Hackerrank problem titled Challenges, a strategic approach is necessary to obtain the correct solution. When framed properly, a complicated query can be reduced to only a few strategic lines of code. In this article we compare a well formed query to a poorly formed one.

Problem statement:

/*
Print: hacker_id, name, and count(challenges) created by each student. 

Order by:
total challenges descending 
then sort the result by hacker_id
then, exclude those students if the count < maximum number of challenges
*/

Two correct solutions are given below. The first is well formed, easy to understand, and takes only a few lines of code.

Query # 1

The problem statement requires discarding records for hackers whom contributed an equal number of total challenges, when the total contributed is less than the maximum.

Because of this, it will be useful to first produce a single column sub-select query that obtains count(challenges_id) all other hacker_id. Then, we can reference this list to check if a hacker’s total challenges is not in the list (unique), or is in the list (not-unique)

The other sub-select in this query is relatively simple and produces a single value for the maximum number of total challenges from all hackers.

Finally, the outer-most select statement is also very simple.

Psuedo-code for query #1:

SELECT <fields>, COUNT(challenge_id) as challenge_count
HAVING challenge_count = max_challenge_count
OR challenge_count NOT IN <list_of_challenge_counts_by_other_hackers>

Actual code for query #1:

SELECT c.hacker_id, h.name, COUNT(c.challenge_id) AS cnt
FROM Hackers AS h
JOIN Challenges AS c
ON h.hacker_id = c.hacker_id
GROUP BY c.hacker_id, h.name

HAVING cnt = (
  SELECT COUNT(c1.challenge_id)
  FROM Challenges AS c1
  GROUP BY c1.hacker_id
  ORDER BY COUNT(*) DESC LIMIT 1
)

OR cnt NOT IN (
  SELECT COUNT(c2.challenge_id)
  FROM Challenges AS c2
  GROUP BY c2.hacker_id
  HAVING c2.hacker_id <> c.hacker_id
)
ORDER BY cnt DESC, c.hacker_id;

Query # 2

The second query is a brute-force approach that does not take advantage of any strategic principles.

Psuedo-code for query #2:

SELECT <fields>, total_challenges
  FROM hackers 
  JOIN <hacker_id, total_challenges> ON hacker_id
  JOIN <total_challenges, hacker_count> ON hacker_id
  JOIN <max_challenges>
WHERE total_challenges = max_challenges
OR hacker_count = 1

The lack of strategy lead to duplicate code, shown in green and orange highlighted text below. Encapsulating the code by using other techniques will not address the larger problem here.

Actual code for query #2:

select h.hacker_id, h.name, totals.total_challenges
/*, summ_group.hacker_count, summ_group_2.max_challenges*/
from hackers h

join (
    select hacker_id as hacker_id, 
           count(challenge_id) as total_challenges
    from challenges
    group by hacker_id
) totals
on h.hacker_id = totals.hacker_id

join (
    select summ.total_challenges as total_challenges,
            count(summ.hacker_id)as hacker_count
    from(
        select  h.hacker_id as hacker_id, 
                h.name as name, 
                t.total_challenges as total_challenges
        from hackers h
        join (
            select  hacker_id as hacker_id, 
                    count(challenge_id) as total_challenges
            from challenges
            group by hacker_id
        ) t
        on h.hacker_id = t.hacker_id
    ) summ
    group by summ.total_challenges
) summ_group
on summ_group.total_challenges = totals.total_challenges

join( 
    select max(summ.total_challenges) as max_challenges
    from(
        select  h.hacker_id as hacker_id, 
                h.name as name, 
                t.total_challenges as total_challenges
        from hackers h
        join (
            select  hacker_id as hacker_id, 
                    count(challenge_id) as total_challenges
            from challenges
            group by hacker_id
        ) t
        on h.hacker_id = t.hacker_id
    ) summ
) summ_group_2

where summ_group_2.max_challenges = summ_group.total_challenges
or summ_group.hacker_count = 1


order by totals.total_challenges desc, h.hacker_id

This hackerrank problem is a great exercise for developing SQL skills and identifying opportunities for SQL query simplification! Indirect approaches can lead to vastly simplified queries including the use of single-value sub-selects and single-column sub-selects.

Please leave a comment below with your thoughts on this topic and any simplifications you have found for this hackerrank challenge problem.

First Post

So, you have made it to my blog. Here I will give periodic updates explaining some of the minor projects/experiments I am working on. Major projects will be documented separately (and with greater detail) under the ‘Projects’ section of the website. Some of the topics I will regularly visit are: Python, Amateur Radio, Machine Learning, Embedded Systems.

Other topics that may pop up here and there include: Digital Signal Processing