Aggregation functions within restriction conditions
Example 1: global argmax
Restricting a relation to a subset based on the average or maximum value of some attribute seems like it should be an easy operation.
For example, consider the table with heading
Student: (student*, height)
Now let’s select the tallest student.
You may attempt the following in DataJoint:
Student & 'height=max(height)'
or in SQL:
SELECT
student
WHERE
height=
max
(height)
However, both fail. Aggregation functions such as max, sum, or avg are not allowed in restrictions. Restrictions apply to each row and all their functions must operate on attributes in that row and they cannot examine other rows.
Therefore, in DataJoint (version 0.10.0+), the solution would be
Student
*
dj.U().aggr(
Student, 'max_height=max(height)'
) & 'height=max_height'
We can break this expression up into three steps:
m
=
dj.U().aggr(Student, max_height
=
'max(height)')
s
=
Student
*
m
r
=
s & 'height=max_height'
The first expression uses dj.U() (the relation with one row and zero attributes) to aggregate the maximum height. The resulting relation m has a single non-primary attribute `max_height` and one row with the highest value of height.
The second expression joins max_height to every row in Student. Now every row has both the height and the maximum height and we are ready for the restriction in the final step.
The literal translation into SQL would be
SELECT
student, height
FROM
Student NATURAL JOIN
(
SELECT
max
(height)
as
max_height
FROM
Student
)
as
m
WHERE
height = max_height
However, SQL also allows using results of subqueries as scalars in WHERE clauses, allowing a simpler expression
SELECT
student, height
FROM
Student
WHERE
height = (
SELECT
max
(height)
FROM
Student)
Example 2: Group average
Let’s use the tables Enroll listing students in each course and Course listing all coures.
Enroll: (course*, student*)
Course: (course*, course_description)
Let’s find all students above average height in each course.
Again, a naive query might look something like
1
Enroll
*
Student & 'height > avg(height)'
This fails for the same reason: aggregation functions are not allowed in restriction expressions.
Again, we must first compute the averages per course and join them into the original expression before restriction:
# extend Enroll with student height
e
=
Enroll
*
Student
# average height in each course
h
=
Course.aggr(e, avg_height
=
'avg(height)')
# extend enrollment with average height per course
s
=
e
*
h
# restrict enrollment to above-average-height students
r
=
s & 'height>avg_height'
Or, more succinctly,
e
=
Enroll
*
Student
r
=
e
*
Course.aggr(e, avg_height
=
'avg(height)') \
& 'height>avg_height'
An efficient equivalent expression in SQL could be:
SELECT
course, student, height
FROM
Enroll NATURAL JOIN
Student NATURAL JOIN
(
SELECT
course, avg(height)
as
avg_height
FROM
Enroll NATURAL JOIN
Student
GROUP
BY
course)
as
h
WHERE
height > avg_height
Related posts
Entrepreneurs of Insight
A Better Data Engine for Brain Science
Data needs direction: five clarifications for database design
Updates Delivered *Straight to Your Inbox*
Join the mailing list for industry insights, company news, and product updates delivered monthly.
