Written by
Dimitri Yatsenko, PhD
Founder • Chief Science & Technology Officer
January 16, 2018

Aggregation functions within restriction conditions

Dimitri Yatsenko, PhD
Founder • Chief Science & Technology Officer

Example 1: global argmax

Restricting a relation to a subset based on the average or maximum value of some attribute seems like it should be an easy operation.

For example, consider the table with heading

Student: (student*, height)

Now let’s select the tallest student.

You may attempt the following in DataJoint:

Student & 'height=max(height)'

or in SQL:

SELECT student

WHERE height=max(height)

However, both fail. Aggregation functions such as max, sum, or avg are not allowed in restrictions. Restrictions apply to each row and all their functions must operate on attributes in that row and they cannot examine other rows.

Therefore, in DataJoint (version 0.10.0+), the solution would be

Student * dj.U().aggr(

   Student, 'max_height=max(height)'

   ) & 'height=max_height'

We can break this expression up into three steps:

m = dj.U().aggr(Student, max_height='max(height)')

s = Student * m

r = s & 'height=max_height'

The first expression uses dj.U() (the relation with one row and zero attributes) to aggregate the maximum height. The resulting relation m has a single non-primary attribute `max_height` and one row with the highest value of height.

The second expression joins max_height to every row in Student. Now every row has both the height and the maximum height and we are ready for the restriction in the final step.

The literal translation into SQL would be

SELECT

   student, height

FROM

   Student NATURAL JOIN

       (

       SELECT

           max(height) as max_height

       FROM

           Student

       ) as m

WHERE

   height = max_height

However, SQL also allows using results of subqueries as scalars in WHERE clauses, allowing a simpler expression

SELECT

   student, height

FROM

   Student

WHERE

   height = (SELECT max(height) FROM Student)

Example 2: Group average

Let’s use the tables Enroll listing students in each course and Course listing all coures.

Enroll: (course*, student*)

Course: (course*, course_description)

Let’s find all students above average height in each course.

Again, a naive query might look something like

1

Enroll * Student & 'height > avg(height)'

This fails for the same reason: aggregation functions are not allowed in restriction expressions.

Again, we must first compute the averages per course and join them into the original expression before restriction:

# extend Enroll with student height

e = Enroll * Student  

# average height in each course

h = Course.aggr(e, avg_height='avg(height)')  

# extend enrollment with average height per course

s = e * h

# restrict enrollment to above-average-height students

r = s & 'height>avg_height'  

Or, more succinctly,

e = Enroll * Student  

r = e * Course.aggr(e, avg_height='avg(height)') \

   & 'height>avg_height'

An efficient equivalent expression in SQL could be:

SELECT

   course, student, height

FROM

   Enroll NATURAL JOIN Student NATURAL JOIN (

       SELECT

           course, avg(height) as avg_height

       FROM

           Enroll NATURAL JOIN Student

       GROUP BY course) as h

WHERE

   height > avg_height

Related posts

Updates Delivered *Straight to Your Inbox*

Join the mailing list for industry insights, company news, and product updates delivered monthly.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.