4  Week 3: GROUP BY/HAVING

4.1 Connecting to our database

Let’s connect to our database.

Loading required package: DBI
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
con <- DBI::dbConnect(duckdb::duckdb(), 
                      "data/GiBleed_5.3_1.1.duckdb")

4.2 GROUP BY

Say we want to count rows, or calculate totals or averages, for a particular column by a particular grouping variable. We can use a SELECT/GROUP BY pattern to do this.

There are some requirements to using SELECT/GROUP BY:

  • The grouping variable should be categorical (such as c.concept_name)
  • Any non-aggregated column in the SELECT clause must also appear in the GROUP BY clause (here, c.concept_name)

Count the number of persons by gender_source_value:

SELECT gender_source_value, COUNT(person_id) AS person_count
  FROM person
  GROUP BY gender_source_value
2 records
gender_source_value person_count
F 1373
M 1321

Here, we’re combining SELECT/GROUP BY with an INNER JOIN:

SELECT c.concept_name AS procedure, COUNT(person_id) AS person_count
  FROM procedure_occurrence AS po
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
  GROUP BY c.concept_name
  ORDER BY person_count DESC
Displaying records 1 - 10
procedure person_count
Subcutaneous immunotherapy 17520
Cognitive and behavioral therapy 2820
Suture open wound 2487
Bone immobilization 1789
Sputum examination 1659
Direct current cardioversion 1342
Pulmonary rehabilitation 908
Intramuscular injection 908
Allergy screening test 704
Bone density scan 655

We can group by multiple variables. Here is a triple join where we are counting by both gender_source_value and concept_name:

SELECT c.concept_name AS procedure, p.gender_source_value, COUNT(p.person_id) AS person_count
  FROM procedure_occurrence AS po
  INNER JOIN person AS p
  ON p.person_id = po.person_id
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
  GROUP BY c.concept_name, p.gender_source_value
  ORDER BY person_count DESC
Displaying records 1 - 10
procedure gender_source_value person_count
Subcutaneous immunotherapy F 9480
Subcutaneous immunotherapy M 8040
Cognitive and behavioral therapy M 1872
Suture open wound F 1293
Suture open wound M 1194
Cognitive and behavioral therapy F 948
Bone immobilization F 933
Intramuscular injection F 908
Bone immobilization M 856
Sputum examination M 841

4.2.1 Check on Learning

COUNT the number of concept_ids grouped by domain_id in the concept table:

SELECT domain_id, COUNT(------) AS count_domain
  FROM concept
  GROUP BY -------
  ORDER BY count_domain DESC

4.3 HAVING

We can filter on these aggregated variables, but we can’t use them in a WHERE clause. Instead, SQL provides an additional clause, HAVING:

SELECT c.concept_name AS procedure, COUNT(person_id) AS person_count
  FROM procedure_occurrence AS po
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
  GROUP BY c.concept_name
  HAVING person_count > 500
  ORDER BY person_count DESC
Displaying records 1 - 10
procedure person_count
Subcutaneous immunotherapy 17520
Cognitive and behavioral therapy 2820
Suture open wound 2487
Bone immobilization 1789
Sputum examination 1659
Direct current cardioversion 1342
Intramuscular injection 908
Pulmonary rehabilitation 908
Allergy screening test 704
Bone density scan 655

Why can’t we use WHERE? WHERE is actually evaluated before SELECT/GROUP BY, so it has no idea that the aggregated variables exist. Remember SQL clause priorities? WHERE is priority 2, and GROUP BY/HAVING are priorities 3 and 4.

In general, you need to put WHERE before GROUP BY/HAVING. Your SQL statement will not run if you put WHERE after GROUP BY/HAVING.

Here is an example of using both WHERE and HAVING:

SELECT domain_id, COUNT(concept_id) AS count_domain
  FROM concept
  WHERE domain_id != 'Drug'
  GROUP BY domain_id
  HAVING count_domain > 40
  ORDER BY count_domain DESC
3 records
domain_id count_domain
Condition 86
Measurement 59
Procedure 54
We can verify this order of operations with EXPLAIN, which shows DuckDB’s physical query plan:

sql_statement <- "EXPLAIN SELECT domain_id, COUNT(concept_id) AS count_domain
  FROM concept
  WHERE domain_id != 'Drug'
  GROUP BY domain_id
  HAVING count_domain > 40
  ORDER BY count_domain DESC"

DBI::dbGetQuery(con, sql_statement)
physical_plan
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│__internal_decompress_strin│
│           g(#0)           │
│             #1            │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          ORDER_BY         │
│    ────────────────────   │
│    count(GiBleed_5.main   │
│ .concept.concept_id) DESC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│__internal_compress_string_│
│        hugeint(#0)        │
│             #1            │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│    (count_domain > 40)    │
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│__internal_decompress_strin│
│           g(#0)           │
│             #1            │
│                           │
│          ~7 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       HASH_GROUP_BY       │
│    ────────────────────   │
│         Groups: #0        │
│                           │
│        Aggregates:        │
│        count_star()       │
│                           │
│          ~7 Rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│         domain_id         │
│                           │
│          ~88 Rows         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│__internal_compress_string_│
│        hugeint(#0)        │
│             #1            │
│                           │
│          ~88 Rows         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│    ────────────────────   │
│       Table: concept      │
│    Type: Sequence Scan    │
│                           │
│        Projections:       │
│         domain_id         │
│         concept_id        │
│                           │
│          Filters:         │
│     domain_id!='Drug'     │
│                           │
│          ~88 Rows         │
└───────────────────────────┘

Here’s what happens when you put WHERE after GROUP BY/HAVING (this query fails with a syntax error):

SELECT domain_id, COUNT(concept_id) AS count_domain
  FROM concept
  GROUP BY domain_id
  HAVING count_domain > 40
  WHERE domain_id != 'Drug'
  ORDER BY count_domain DESC

Here is WHERE/GROUP BY/HAVING combined with an INNER JOIN:

SELECT c.concept_name AS procedure, COUNT(person_id) AS person_count
  FROM procedure_occurrence AS po
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
  WHERE date_part('YEAR', po.procedure_datetime) > 2000
  GROUP BY c.concept_name
  HAVING person_count > 500
  ORDER BY person_count DESC
4 records
procedure person_count
Direct current cardioversion 1031
Subcutaneous immunotherapy 705
Suture open wound 697
Pulmonary rehabilitation 611

We can group by year by first extracting it from po.procedure_datetime and using an alias year:

SELECT date_part('YEAR', po.procedure_datetime) AS year, COUNT(po.person_id) AS procedure_count
  FROM procedure_occurrence AS po
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
  GROUP BY year
  ORDER BY procedure_count DESC
Displaying records 1 - 10
year procedure_count
1992 4717
1990 4599
1991 4506
1993 2843
1997 725
1995 690
1996 685
1994 614
1998 598
1999 455

4.4 IN/LIKE

There are a couple of twists to WHERE. We can use IN to match against multiple values. We put the values in parentheses, separated by commas:

SELECT concept_name, domain_id 
  FROM concept
  WHERE domain_id IN ('Drug', 'Condition')
Displaying records 1 - 10
concept_name domain_id
celecoxib 200 MG Oral Capsule [Celebrex] Drug
pneumococcal polysaccharide vaccine, 23 valent Drug
Alendronate Drug
Ampicillin 100 MG/ML Injectable Solution Drug
celecoxib 200 MG Oral Capsule [Celebrex] Drug
rotavirus, live, monovalent vaccine Drug
Midazolam Drug
Diclofenac Sodium 75 MG Delayed Release Oral Tablet Drug
Diclofenac Sodium 75 MG Delayed Release Oral Tablet Drug
tetanus and diphtheria toxoids, adsorbed, preservative free, for adult use Drug

We can use NOT with IN to exclude a list of values:

SELECT concept_name, domain_id 
  FROM concept
  WHERE domain_id NOT IN ('Drug', 'Condition')
Displaying records 1 - 10
concept_name domain_id
Coronary artery bypass graft Procedure
Intramuscular injection Procedure
Protein serum/plasma Measurement
Surgical manipulation of shoulder joint Procedure
Neonatal screening Procedure
Honey bee IgE Ab [Units/volume] in Serum Measurement
Erythrocyte distribution width [Ratio] Measurement
Removal of subcutaneous contraceptive Procedure
Cladosporium herbarum IgE Ab [Units/volume] in Serum Measurement
Common Ragweed IgE Ab [Units/volume] in Serum Measurement

One note: for long lists of values, it is usually faster to make a temporary table containing your values and join on that temporary table. We’ll talk more about this below.
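As a sketch of that pattern against the same concept table (wanted_domains is a made-up table name for this example):

```sql
-- Sketch: stage the values in a temp table, then join instead of using IN.
-- wanted_domains is a hypothetical name for this example.
CREATE OR REPLACE TEMP TABLE wanted_domains(domain_id VARCHAR);
INSERT INTO wanted_domains VALUES ('Drug'), ('Condition');

SELECT c.concept_name, c.domain_id
  FROM concept AS c
  INNER JOIN wanted_domains AS w
  ON c.domain_id = w.domain_id
```

This returns the same rows as the IN version, but for long value lists the join lets the database use its join machinery rather than checking each row against the list.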

LIKE is one way to do wildcard searches; the % wildcard matches any sequence of characters (including none).

SELECT concept_name, domain_id 
  FROM concept
  WHERE domain_id LIKE 'Dru%'
Displaying records 1 - 10
concept_name domain_id
celecoxib 200 MG Oral Capsule [Celebrex] Drug
pneumococcal polysaccharide vaccine, 23 valent Drug
Alendronate Drug
Ampicillin 100 MG/ML Injectable Solution Drug
celecoxib 200 MG Oral Capsule [Celebrex] Drug
rotavirus, live, monovalent vaccine Drug
Midazolam Drug
Diclofenac Sodium 75 MG Delayed Release Oral Tablet Drug
Diclofenac Sodium 75 MG Delayed Release Oral Tablet Drug
tetanus and diphtheria toxoids, adsorbed, preservative free, for adult use Drug

4.5 Creating Temporary Tables

Temporary tables can be very useful when you are trying to merge on a list of concepts, or for storing intermediate results.

Temporary tables only last for the session - they disappear after you disconnect, so don’t use them for permanent storage.

Here is the CSV (comma-separated values) file that we’re going to load in:

read_csv("data/temp_cost.csv")
Rows: 51 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): concept_name
dbl (2): procedure_concept_id, cost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 51 × 3
   concept_name                          procedure_concept_id  cost
   <chr>                                                <dbl> <dbl>
 1 Cognitive and behavioral therapy                   4043071  3300
 2 Nasal sinus endoscopy                              4010253  3000
 3 Intubation                                         4202832  4000
 4 Removal of subcutaneous contraceptive              4199276  1800
 5 Radiography of humerus                             4047491  2300
 6 Radiologic examination of knee                     4252419  4100
 7 Lung volume reduction surgery                      4323902   730
 8 Sputum examination                                 4151422  4000
 9 Percutaneous coronary intervention                 4216130  1800
10 Injection of epinephrine                           4197460  2600
# ℹ 41 more rows

We use CREATE TEMP TABLE to create a temporary table. We need to specify the data types of the columns before we can add data to it. We use CREATE OR REPLACE in the chunk below so it can be re-run without errors if the table already exists.

Then we can use DuckDB’s COPY command to load the file in:

CREATE OR REPLACE TEMP TABLE cost(
  concept_name VARCHAR,
  procedure_concept_id INT,
  cost INT
);
COPY cost FROM 'data/temp_cost.csv'

DuckDB can also infer the column names and types from the data:

CREATE OR REPLACE TEMP TABLE cost AS
  SELECT * FROM read_csv('data/temp_cost.csv')

Now our table exists in our database, and we can work with it.

SELECT * FROM cost
Displaying records 1 - 10
concept_name procedure_concept_id cost
Cognitive and behavioral therapy 4043071 3300
Nasal sinus endoscopy 4010253 3000
Intubation 4202832 4000
Removal of subcutaneous contraceptive 4199276 1800
Radiography of humerus 4047491 2300
Radiologic examination of knee 4252419 4100
Lung volume reduction surgery 4323902 730
Sputum examination 4151422 4000
Percutaneous coronary intervention 4216130 1800
Injection of epinephrine 4197460 2600
DESCRIBE cost
3 records
column_name column_type null key default extra
concept_name VARCHAR YES NA NA NA
procedure_concept_id BIGINT YES NA NA NA
cost BIGINT YES NA NA NA

Now we can join our temporary cost table with procedure_occurrence and calculate the total cost per year:

SELECT date_part('YEAR', po.procedure_datetime) AS year, SUM(cost) AS sum_cost_year
  FROM procedure_occurrence AS po
  INNER JOIN cost AS c
  ON po.procedure_concept_id = c.procedure_concept_id
  GROUP BY year
  ORDER BY year DESC
Displaying records 1 - 10
year sum_cost_year
2019 619380
2018 1271370
2017 1201430
2016 1122680
2015 1079390
2014 1103730
2013 1084430
2012 1118890
2011 1041720
2010 1015700

We’ll talk much more about subqueries and views next time, which are other options for splitting queries up.

4.5.1 Check on Learning

Modify the query below to calculate the average cost per month using AVG(cost), aliased as average_monthly_cost (you will also need to change the date_part unit from 'YEAR' to 'MONTH'):

SELECT date_part('YEAR', po.procedure_datetime) AS year, SUM(cost)
  FROM procedure_occurrence AS po
  INNER JOIN cost AS c
  ON po.procedure_concept_id = c.procedure_concept_id
  GROUP BY year
  ORDER BY year DESC
Displaying records 1 - 10
year sum("cost")
2019 619380
2018 1271370
2017 1201430
2016 1122680
2015 1079390
2014 1103730
2013 1084430
2012 1118890
2011 1041720
2010 1015700

4.6 Data Integrity

We talked a little bit last week about database constraints, such as FOREIGN KEY constraints: we can’t add a row whose foreign key value doesn’t exist in the referenced table.

These constraints exist to ensure the data integrity of a database. For example, we don’t want rows in procedure_occurrence whose procedure_concept_id values don’t exist in the concept table.

Another way to maintain data integrity is to make all operations ACID-compliant transactions. That is, each set of operations (such as inserting and removing rows) must be completed in full before the next set of transactions (which may come from another user) is applied to the database.

ACID is short for:

  • Atomicity - the operation must be all or none
  • Consistency - the operation must leave the database in a valid state, with all constraints still satisfied
  • Isolation - concurrent operations must behave as if they were run in series; one transaction cannot see another’s intermediate state
  • Durability - the operation must be robust to disruptions (like power outages). If a database is interrupted during an update, there must be a rollback mechanism to restore the previous version of the data.
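These properties map onto SQL’s transaction commands. Here is a minimal sketch against our temporary cost table (the updated value is made up): either both statements take effect, or, if we issue ROLLBACK instead of COMMIT, neither does.

```sql
BEGIN TRANSACTION;

-- Replace one row's cost: the delete and insert succeed or fail together
DELETE FROM cost WHERE procedure_concept_id = 4043071;
INSERT INTO cost VALUES ('Cognitive and behavioral therapy', 4043071, 3500);

COMMIT;  -- or ROLLBACK; to undo both statements
```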

Finally, how the tables are designed, what information they contain, and how they relate to each other also matters for data integrity. The process of deciding which columns belong in which tables is called normalization.

4.7 Database Design

Database design can be difficult because:

  1. You need to understand the requirements of the data and how it is collected
     • For example, when is procedure information collected?
     • Do patients have multiple procedures? (Cardinality)
  2. You need to group like data with like (normalization)
     • Data that is dependent on a primary key should stay together
     • For example, person should contain information about a patient such as demographics, but not individual procedure_concept_ids
  3. You need to have an automated process to add data to the database (Extract, Transform, Load, or ETL)
  4. Search processes must be optimized for common operations (indexing)
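As a sketch of the normalization idea, here is a simplified pair of tables (not the full OMOP definitions): demographics stay in person, and each procedure is its own row in procedure_occurrence, linked back by person_id.

```sql
-- Simplified sketch, not the actual OMOP DDL
CREATE TABLE person(
  person_id INT PRIMARY KEY,
  gender_source_value VARCHAR,
  year_of_birth INT
);

CREATE TABLE procedure_occurrence(
  procedure_occurrence_id INT PRIMARY KEY,
  person_id INT REFERENCES person(person_id),  -- foreign key back to person
  procedure_concept_id INT,
  procedure_datetime TIMESTAMP
);
```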

Of these, steps 1 and 2 are the most difficult and take the most time. They require the designer to interview the users of the data and the people who collect it, so that the database reflects the business processes. These two steps are called the Data Modeling steps.

These processes are essential if you are designing a transactional database that is collecting data from multiple sources (such as clinicians at time of care) and is updated multiple times a second. For example, bank databases have a rigorous design.

If you want to read more about the data model we’re using, I’ve written up a short bit here: OMOP Data Model.

4.8 Database Administration

Maintaining a database is also known as database administration. Database Admins are responsible for the following:

  1. Making sure that the data maintains its integrity
  2. Ensuring that common queries are optimized for fast loading
  3. General upkeep and optimization. Oftentimes, if multiple people are accessing the data at once, the data may be distributed among multiple machines (load balancing).
  4. Security. We don’t want the wrong people accessing the data.
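For item 2 above, one common tool is an index. A hypothetical sketch (the index name is made up; this syntax is supported by DuckDB and most relational databases):

```sql
-- Speeds up queries that filter or join on procedure_concept_id
CREATE INDEX idx_po_concept
  ON procedure_occurrence(procedure_concept_id);
```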

Being a good admin is not something you learn from scratch; you can’t be a top-tier admin straight out of school. DB admins learn a lot of fundamentals, but much of the optimization skill comes from experience managing data.

Respect your DB Admin and know that they know a lot about how to optimize your queries.

4.9 Always close the connection

When we’re done, it’s best to close the connection with dbDisconnect(). For DuckDB, passing shutdown = TRUE also shuts down the underlying database instance:

DBI::dbDisconnect(con, shutdown = TRUE)