── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Say we want to count, calculate totals, or averages for a particular column by a particular grouping variable. For example, suppose we want to group gender_source_value column in the person table and count the number of person_ids for each value of gender_source_value. We can use a SELECT/GROUP BY pattern to do this.
There are some requirements to using SELECT/GROUP BY:
Grouping variable should be categorical (such as c.concept_name)
Grouping variable must be in SELECT clause (c.concept_name)
Count the number of persons by gender_source_value:
SELECT gender_source_value, COUNT(person_id) AS person_countFROM personGROUPBY gender_source_value
2 records
gender_source_value
person_count
F
1373
M
1321
Notice that we use the AS alias to rename COUNT(person_id) to person_count in the column name.
We summarize our column in other ways besides COUNT:
MEAN
MIN
MAX
MEDIAN
For example, we can look at the minimum year_of_birth for each gender in the person table:
COUNT the number of concept_ids grouped by domain_id in the concept table:
SELECT domain_id, COUNT(------) AS count_domainFROM conceptGROUPBY-------ORDERBY count_domain DESC
You can also group by multiple variables. What happens if you group by domain_idandvocabulary_id?
4.3 GROUP BY with JOINs
Recall that table procedure_occurrence records the procedures of each person. Suppose that we do a GROUP BY on each procedure_concept_id and count the number of person_ids to understand how many people were treated for each procedure:
SELECT procedure_concept_id, COUNT(person_id) AS person_countFROM procedure_occurrenceGROUPBY procedure_concept_idORDERBY person_count DESC
Displaying records 1 - 10
procedure_concept_id
person_count
4107731
17520
4043071
2820
4125906
2487
4170947
1789
4151422
1659
4078793
1342
4035793
908
4295880
908
4191853
704
4195803
655
We wish we know what the procedure_concept_id referred to. We need to join it with concept table.
Here, we’re combining SELECT/GROUP_BY with an INNER JOIN:
SELECT c.concept_name ASprocedure, COUNT(person_id) AS person_countFROM procedure_occurrence AS poINNERJOIN concept AS cON po.procedure_concept_id = c.concept_idGROUPBY c.concept_nameORDERBY person_count DESC
Displaying records 1 - 10
procedure
person_count
Subcutaneous immunotherapy
17520
Cognitive and behavioral therapy
2820
Suture open wound
2487
Bone immobilization
1789
Sputum examination
1659
Direct current cardioversion
1342
Intramuscular injection
908
Pulmonary rehabilitation
908
Allergy screening test
704
Bone density scan
655
Even more complicated: We can group by multiple variables. Here is a triple join where we are counting by both gender_source_value and concept_name:
SELECT c.concept_name ASprocedure, p.gender_source_value, COUNT(p.person_id) AS person_countFROM procedure_occurrence AS poINNERJOIN person AS pON p.person_id = po.person_idINNERJOIN concept AS cON po.procedure_concept_id = c.concept_idGROUPBY c.concept_name, p.gender_source_valueORDERBY person_count DESC
Displaying records 1 - 10
procedure
gender_source_value
person_count
Subcutaneous immunotherapy
F
9480
Subcutaneous immunotherapy
M
8040
Cognitive and behavioral therapy
M
1872
Suture open wound
F
1293
Suture open wound
M
1194
Cognitive and behavioral therapy
F
948
Bone immobilization
F
933
Intramuscular injection
F
908
Bone immobilization
M
856
Sputum examination
M
841
4.4HAVING
We can filter by these aggregate variables. But we can’t use them in a WHERE clause. There is an additional clause HAVING:
SELECT c.concept_name ASprocedure, COUNT(person_id) AS person_countFROM procedure_occurrence AS poINNERJOIN concept AS cON po.procedure_concept_id = c.concept_idGROUPBY c.concept_nameHAVING person_count >500ORDERBY person_count DESC
Displaying records 1 - 10
procedure
person_count
Subcutaneous immunotherapy
17520
Cognitive and behavioral therapy
2820
Suture open wound
2487
Bone immobilization
1789
Sputum examination
1659
Direct current cardioversion
1342
Intramuscular injection
908
Pulmonary rehabilitation
908
Allergy screening test
704
Bone density scan
655
Why can’t we use WHERE?
Well, it turns out that SQL clauses have different priorities, which tells the engine how to order the clauses to execute as your queries become bigger. The WHERE clause has higher priority than the GROUP BY clause, which means if you had written WHERE person_count > 500, it would be evaluated before GROUP BY, thus it has no idea person_count exists and throws an error. Here is the full list of SQL clause priorities:
Priority
Clause
Purpose
1
FROM
Choose tables to query and specify how to JOIN them together
2
WHERE
Filter tables based on criteria
3
GROUP BY
Aggregates the Data
4
HAVING
Filters Aggregated Data
5
SELECT
Selects columns in table and calculate new columns
6
ORDER BY
Sorts by a database field
7
LIMIT
Limits the number of records returned
In general, you need to put WHERE to do any filtering before running GROUP BY. Then, after the data is grouped and aggregrated, you can do additional filtereing on the aggregated data via HAVING. Your SQL statement will not work if you put WHERE after GROUP BY / HAVING.
Here is an example of using both WHERE and HAVING:
Here is WHERE/GROUP BY/HAVING combined with an INNER JOIN:
SELECT c.concept_name ASprocedure, COUNT(person_id) AS person_countFROM procedure_occurrence AS poINNERJOIN concept AS cON po.procedure_concept_id = c.concept_idWHERE date_part('YEAR', po.procedure_datetime) >2000GROUPBY c.concept_nameHAVING person_count >500ORDERBY person_count DESC
4 records
procedure
person_count
Direct current cardioversion
1031
Subcutaneous immunotherapy
705
Suture open wound
697
Pulmonary rehabilitation
611
4.4.1 Check on learning
Suppose we were given this join, with the column year extracted from procedure_datatime.
SELECT date_part('YEAR', po.procedure_datetime) ASyear, person_id, procedure_occurrence_idFROM procedure_occurrence AS poINNERJOIN concept AS cON po.procedure_concept_id = c.concept_id
Displaying records 1 - 10
year
person_id
procedure_occurrence_id
1992
343
3554
1990
357
3741
1991
399
3928
1990
406
4115
1992
411
4302
1938
430
4489
1992
442
4676
1984
453
4863
1948
469
5050
1962
488
5237
Build on top of this query: Group by year, and then aggregate by the COUNT of person_id. Finally, filter it so that the year is higher than 1990. Should you be using WHERE or HAVING?
SELECT date_part('YEAR', po.procedure_datetime) ASyear, person_id FROM procedure_occurrence AS poINNERJOIN concept AS cON po.procedure_concept_id = c.concept_id
4.5IN/LIKE
A couple of twists to WHERE. We can use IN to search on multiple conditions. We put the multiple words in a () separated by commas:
SELECT concept_name, domain_id FROM conceptWHERE domain_id IN ('Drug', 'Condition')
Displaying records 1 - 10
concept_name
domain_id
Gastrointestinal hemorrhage, unspecified
Condition
celecoxib 200 MG Oral Capsule [Celebrex]
Drug
pneumococcal polysaccharide vaccine, 23 valent
Drug
Alendronate
Drug
Ampicillin 100 MG/ML Injectable Solution
Drug
celecoxib 200 MG Oral Capsule [Celebrex]
Drug
Other diseases of digestive system
Condition
rotavirus, live, monovalent vaccine
Drug
Midazolam
Drug
Escherichia coli urinary tract infection
Condition
We can use NOT with IN to exclude a list of conditions:
SELECT concept_name, domain_id FROM conceptWHERE domain_id NOTIN ('Drug', 'Condition')
Displaying records 1 - 10
concept_name
domain_id
Coronary artery bypass graft
Procedure
Intramuscular injection
Procedure
Protein serum/plasma
Measurement
Surgical manipulation of shoulder joint
Procedure
Neonatal screening
Procedure
Honey bee IgE Ab [Units/volume] in Serum
Measurement
Erythrocyte distribution width [Ratio]
Measurement
Removal of subcutaneous contraceptive
Procedure
Cladosporium herbarum IgE Ab [Units/volume] in Serum
Measurement
Common Ragweed IgE Ab [Units/volume] in Serum
Measurement
One note. It is usually faster to make a temporary table with your values and join on that temporary table. We’ll talk more about this below.
LIKE is one way to do wild card searches.
SELECT concept_name, domain_id FROM conceptWHERE domain_id LIKE'Dru%'
Rows: 51 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): concept_name
dbl (2): procedure_concept_id, cost
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 51 × 3
concept_name procedure_concept_id cost
<chr> <dbl> <dbl>
1 Cognitive and behavioral therapy 4043071 3300
2 Nasal sinus endoscopy 4010253 3000
3 Intubation 4202832 4000
4 Removal of subcutaneous contraceptive 4199276 1800
5 Radiography of humerus 4047491 2300
6 Radiologic examination of knee 4252419 4100
7 Lung volume reduction surgery 4323902 730
8 Sputum examination 4151422 4000
9 Percutaneous coronary intervention 4216130 1800
10 Injection of epinephrine 4197460 2600
# ℹ 41 more rows
Now our table exists in our database, and we can work with it.
SELECT*FROMcost
Displaying records 1 - 10
concept_name
procedure_concept_id
cost
Cognitive and behavioral therapy
4043071
3300
Nasal sinus endoscopy
4010253
3000
Intubation
4202832
4000
Removal of subcutaneous contraceptive
4199276
1800
Radiography of humerus
4047491
2300
Radiologic examination of knee
4252419
4100
Lung volume reduction surgery
4323902
730
Sputum examination
4151422
4000
Percutaneous coronary intervention
4216130
1800
Injection of epinephrine
4197460
2600
Now we can merge our temporary cost table with procedure_occurrence and calculate the sum cost per year:
SELECT date_part('YEAR', po.procedure_datetime) ASyear, SUM(cost) AS sum_cost_monthFROM procedure_occurrence AS poINNERJOINcostAS cON po.procedure_concept_id = c.procedure_concept_idGROUPBYyearORDERBYyearDESC
Displaying records 1 - 10
year
sum_cost_month
2019
619380
2018
1271370
2017
1201430
2016
1122680
2015
1079390
2014
1103730
2013
1084430
2012
1118890
2011
1041720
2010
1015700
We’ll talk more about Subqueries and Views next time, which are another options to split queries up.
4.7 Data Integrity
We talked a little bit last week about database constraints, such as FOREIGN KEY constraints, where we can’t add a row that refers to a foreign key if that foreign key doesn’t exist.
These constraints exist to ensure the data integrity of a database. For example, we don’t want to have rows in procedure_occurrence that have procedure_concept_id that don’t exist in the concept table.
Another way to keep data integrity is to have all operations be ACID compliant transactions. That is, all operations (inserting and removing rows) needs to be done in full before the next set of transactions (which could come from another user) are done to the database.
ACID is short for:
Atomicity - the operation must be all or none
Consistency - the operation must be done the same way
Isolation - the operation is not dependent on other operations, and is done in series, not parallel.
Durability - the operation must be robust to disruptions (like power outages). If a database is interrupted in an update, there must be a rollback mechanism to get the previous version of the data.
Finally, the design of the tables and what information they contain, and how they relate to each other is also important to data integrity. The process of deciding which columns belong to which tables is called normalization.
4.8 Database Design
Database design can be difficult because:
You need to understand the requirements of the data and how it is collected
For example, when is procedure information collected?
Do patients have multiple procedures? (Cardinality)
You need to group like data with like (normalization)
Data that is dependent on a primary key should stay together
For example, person should contain information of a patient such as demographics, but not individual procedure_concept_ids.
You need to have an automated process to add data to the database (Extract Transfer Load, or ETL).
Search processes must be optimized for common operations (indexing)
Of this, steps 1 and 2 are the most difficult and take the most time. They require the designer to interview users of the data and those who collect the data to reflect the business processes. These two steps are called the Data Modeling steps.
These processes are essential if you are designing a transactional database that is collecting data from multiple sources (such as clinicians at time of care) and is updated multiple times a second.
If you want to read more about the data model we’re using, I’ve written up a short bit here: OMOP Data Model.
4.9 Database Administration
Maintaining a database is also known as database administration. Database Admins are responsible for the following:
Making sure that the data maintains its integrity
Ensuring that common queries are optimized for fast loading
General upkeep and optimization. Oftentimes, if multiple people are accessing the data at once, the data may be distributed among multiple machines (load balancing).
Security. We don’t want the wrong people accessing the data.
Being a good admin does not start from scratch. You can’t be a top-tier admin straight out of school. There are a lot of things DB admins learn, but a lot of the optimization happens from experience with managing the data.
Respect your DB Admin and know that they know a lot about how to optimize your queries.
4.10 Always close the connection
When we’re done, it’s best to close the connection with dbDisconnect().