2  Week 1: DESCRIBE, SELECT, WHERE

2.1 Our Composable Database System

  • Client: R/RStudio w/ SQL
  • Database Engine: DuckDB
  • Data Storage: single file in data/ folder

2.2 Connecting to our database

To access the data, we need to create a database connection. We use dbConnect() from the DBI package to do this. The first argument specifies the Database engine (duckdb()), and the second provides the file location: "data/data/GiBleed_5.3_1.1.duckdb".

Loading required package: DBI
library(DBI)

con <- DBI::dbConnect(duckdb::duckdb(), 
                      "data/GiBleed_5.3_1.1.duckdb")

Once open, we can use con (our database connection)

Keep in Mind: SQL ignores letter case

These are the same to the database engine:

SELECT person_id FROM person;
select PERSON_ID FROM person;

And so on. Our convention is that we capitalize SQL clauses such as SELECT so you can differentiate them from other information.

2.3 Looking at the Entire Database

One of the first things we can learn is to show the contents of the entire database; we can do this with SHOW TABLES:

SHOW TABLES;
Displaying records 1 - 10
name
care_site
cdm_source
concept
concept_ancestor
concept_class
concept_relationship
concept_synonym
condition_era
condition_occurrence
cost

We can get further information about the tables within our database using DESCRIBE; This will give us more information about individual tables:

DESCRIBE;
Displaying records 1 - 10
database schema name column_names column_types temporary
GiBleed_5 main care_site care_site_id , care_site_name , place_of_service_concept_id , location_id , care_site_source_value , place_of_service_source_value INTEGER, VARCHAR, INTEGER, INTEGER, VARCHAR, VARCHAR FALSE
GiBleed_5 main cdm_source cdm_source_name , cdm_source_abbreviation , cdm_holder , source_description , source_documentation_reference, cdm_etl_reference , source_release_date , cdm_release_date , cdm_version , vocabulary_version VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, DATE , DATE , VARCHAR, VARCHAR FALSE
GiBleed_5 main concept concept_id , concept_name , domain_id , vocabulary_id , concept_class_id, standard_concept, concept_code , valid_start_date, valid_end_date , invalid_reason INTEGER, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, DATE , DATE , VARCHAR FALSE
GiBleed_5 main concept_ancestor ancestor_concept_id , descendant_concept_id , min_levels_of_separation, max_levels_of_separation INTEGER, INTEGER, INTEGER, INTEGER FALSE
GiBleed_5 main concept_class concept_class_id , concept_class_name , concept_class_concept_id VARCHAR, VARCHAR, INTEGER FALSE
GiBleed_5 main concept_relationship concept_id_1 , concept_id_2 , relationship_id , valid_start_date, valid_end_date , invalid_reason INTEGER, INTEGER, VARCHAR, DATE , DATE , VARCHAR FALSE
GiBleed_5 main concept_synonym concept_id , concept_synonym_name, language_concept_id INTEGER, VARCHAR, INTEGER FALSE
GiBleed_5 main condition_era condition_era_id , person_id , condition_concept_id , condition_era_start_date , condition_era_end_date , condition_occurrence_count INTEGER, INTEGER, INTEGER, DATE , DATE , INTEGER FALSE
GiBleed_5 main condition_occurrence condition_occurrence_id , person_id , condition_concept_id , condition_start_date , condition_start_datetime , condition_end_date , condition_end_datetime , condition_type_concept_id , condition_status_concept_id , stop_reason , provider_id , visit_occurrence_id , visit_detail_id , condition_source_value , condition_source_concept_id , condition_status_source_value INTEGER , INTEGER , INTEGER , DATE , TIMESTAMP, DATE , TIMESTAMP, INTEGER , INTEGER , VARCHAR , INTEGER , INTEGER , INTEGER , VARCHAR , INTEGER , VARCHAR FALSE
GiBleed_5 main cost cost_id , cost_event_id , cost_domain_id , cost_type_concept_id , currency_concept_id , total_charge , total_cost , total_paid , paid_by_payer , paid_by_patient , paid_patient_copay , paid_patient_coinsurance , paid_patient_deductible , paid_by_primary , paid_ingredient_cost , paid_dispensing_fee , payer_plan_period_id , amount_allowed , revenue_code_concept_id , revenue_code_source_value, drg_concept_id , drg_source_value INTEGER, INTEGER, VARCHAR, INTEGER, INTEGER, FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , INTEGER, FLOAT , INTEGER, VARCHAR, INTEGER, VARCHAR FALSE

We’ll look at a few tables in our work:

  • person - Contains personal & demographic data
  • procedure_occurrence - procedures performed on patients and when they happened
  • condition_occurrence - patient conditions (such as illnesses) and when they occurred
  • concept - contains the specific information (names of concepts) that map into all three above tables

We’ll talk much more later about the relationships between these tables.

2.4 SELECT and FROM

If we want to see the contents of a table, we can use SELECT and FROM.

SELECT *          # select all columns
  FROM person     # from the person table
  LIMIT 10;       # return only 10 rows
SELECT * FROM person LIMIT 10;
Displaying records 1 - 10
person_id gender_concept_id year_of_birth month_of_birth day_of_birth birth_datetime race_concept_id ethnicity_concept_id location_id provider_id care_site_id person_source_value gender_source_value gender_source_concept_id race_source_value race_source_concept_id ethnicity_source_value ethnicity_source_concept_id
6 8532 1963 12 31 1963-12-31 8516 0 NA NA NA 001f4a87-70d0-435c-a4b9-1425f6928d33 F 0 black 0 west_indian 0
123 8507 1950 4 12 1950-04-12 8527 0 NA NA NA 052d9254-80e8-428f-b8b6-69518b0ef3f3 M 0 white 0 italian 0
129 8507 1974 10 7 1974-10-07 8527 0 NA NA NA 054d32d5-904f-4df4-846b-8c08d165b4e9 M 0 white 0 polish 0
16 8532 1971 10 13 1971-10-13 8527 0 NA NA NA 00444703-f2c9-45c9-a247-f6317a43a929 F 0 white 0 american 0
65 8532 1967 3 31 1967-03-31 8516 0 NA NA NA 02a3dad9-f9d5-42fb-8074-c16d45b4f5c8 F 0 black 0 dominican 0
74 8532 1972 1 5 1972-01-05 8527 0 NA NA NA 02fbf1be-29b7-4da8-8bbd-14c7433f843f F 0 white 0 english 0
42 8532 1909 11 2 1909-11-02 8527 0 NA NA NA 0177d2e0-98f5-4f3d-bcfd-497b7a07b3f8 F 0 white 0 irish 0
187 8507 1945 7 23 1945-07-23 8527 0 NA NA NA 07a1e14d-73ed-4d3a-9a39-d729745773fa M 0 white 0 irish 0
18 8532 1965 11 17 1965-11-17 8527 0 NA NA NA 0084b0fe-e30f-4930-b6d1-5e1eff4b7dea F 0 white 0 english 0
111 8532 1975 5 2 1975-05-02 8527 0 NA NA NA 0478d6b3-bdb3-4574-9b93-cf448d725b84 F 0 white 0 english 0
  1. Why are there birth_datetime and the month_of_birth, day_of_birth, year_of_birth - aren’t these redundant?

2.5 Try it Out

Look at the first few rows of procedure_occurrence.

SELECT * FROM ____ LIMIT 10;
  1. Why is there a person_id column in this table as well?

2.6 SELECTing a few columns in our table

We can use the SELECT clause to grab specific columns in our data.

SELECT person_id, birth_datetime, gender_concept_id # Columns in our table
  FROM person;                                      # Our Table
SELECT person_id, birth_datetime, gender_concept_id 
  FROM person
  LIMIT 10;
Displaying records 1 - 10
person_id birth_datetime gender_concept_id
6 1963-12-31 8532
123 1950-04-12 8507
129 1974-10-07 8507
16 1971-10-13 8532
65 1967-03-31 8532
74 1972-01-05 8532
42 1909-11-02 8532
187 1945-07-23 8507
18 1965-11-17 8532
111 1975-05-02 8532

2.7 Try it Out

What happens if we ask for a column that doesn’t exist in our data?

SELECT person_id, birth_datetime, gender_concept_id, blah
  FROM person;

2.8 Check on Learning

Add race_concept_id and year_of_birth to your SELECT query:

SELECT person_id, birth_datetime, gender_concept_id, ____, ____
  FROM person;

2.9 WHERE - filtering our table

Adding WHERE to our SQL statement lets us add filtering to our query:

SELECT person_id, gender_source_value, race_source_value, year_of_birth 
  FROM person 
  WHERE year_of_birth < 1980
Displaying records 1 - 10
person_id gender_source_value race_source_value year_of_birth
6 F black 1963
123 M white 1950
129 M white 1974
16 F white 1971
65 F black 1967
74 F white 1972
42 F white 1909
187 M white 1945
18 F white 1965
111 F white 1975

One critical thing to know is that you don’t need to include the columns you’re filtering on in the SELECT part of the statement. For example, we could do the following as well, removing year_of_birth from our SELECT:

SELECT person_id, gender_source_value, race_source_value 
  FROM person 
  WHERE year_of_birth < 2000
Displaying records 1 - 10
person_id gender_source_value race_source_value
6 F black
123 M white
129 M white
16 F white
65 F black
74 F white
42 F white
187 M white
18 F white
111 F white

2.9.1 Single quotes and WHERE

SQL convention: single quotes (‘M’) refer to values, and double quotes refer to columns (“person_id”). If you try to use double quotes in the below, it will look for a column called “M”.

This will trip you up several times if you’re not used to it.

SELECT person_id, gender_source_value, race_source_value 
  FROM person 
  WHERE gender_source_value = 'M'
  LIMIT 10;
Displaying records 1 - 10
person_id gender_source_value race_source_value
123 M white
129 M white
187 M white
40 M white
53 M white
78 M white
69 M asian
248 M white
105 M white
49 M white

Reminder: use single (’’) quotes in your SQL statements to refer to values, not double quotes (“).

Quick Note

For R users, notice the similarity of select() with SELECT. We can rewrite the above in dplyr code as:

person |>
  select(person_id, gender_source_value, race_source_value)

A lot of dplyr was inspired by SQL. In fact, there is a package called dbplyr that translates dplyr statements into SQL. A lot of us use it, and it’s pretty handy.

2.10 COUNT - how many rows?

Sometimes you want to know the size of your result, not necessarily return the entire set of results. That is what COUNT is for.

SELECT COUNT(*) 
  FROM person
  WHERE year_of_birth < 2000;
1 records
count_star()
2694

Similarly, when we want to count the number of person_ids returned, we can use COUNT(person_id):

SELECT COUNT(person_id) 
  FROM person
  WHERE year_of_birth < 2000;
1 records
count(person_id)
2694

Let’s switch gears to the procedure_concept_id table. Let’s count the overall number of procedure_concept_ids in our table:

SELECT COUNT(procedure_concept_id)
  FROM procedure_occurrence;
1 records
count(procedure_concept_id)
37409

Hmmm. That’s quite a lot, but are there repeat procedure_concept_ids?

When you have repeated values in the rows, COUNT(DISTINCT ) can help you find the number of unique values in a column:

SELECT COUNT(DISTINCT procedure_concept_id)
  FROM procedure_occurrence
1 records
count(DISTINCT procedure_concept_id)
51

We can also return the actual DISTINCT values by removing COUNT:

SELECT DISTINCT procedure_concept_id
  FROM procedure_occurrence;
Displaying records 1 - 10
procedure_concept_id
4151422
4125906
44783196
4187458
4163872
4198190
4326177
4163951
40492359
4010253

2.11 Check on Learning

Count the distinct values of gender_source_value in person:

SELECT COUNT(DISTINCT --------------)
  FROM -------;

2.12 Keys: Linking tables together

One of the important properties of data in a relational database is that there are no repeat rows in the database. Each table that meets this restriction has what is called a primary key.

We can use DESCRIBE to get more information (the metadata) about a table. This gives us information about our tables.

DESCRIBE person
Displaying records 1 - 10
column_name column_type null key default extra
person_id INTEGER YES NA NA NA
gender_concept_id INTEGER YES NA NA NA
year_of_birth INTEGER YES NA NA NA
month_of_birth INTEGER YES NA NA NA
day_of_birth INTEGER YES NA NA NA
birth_datetime TIMESTAMP YES NA NA NA
race_concept_id INTEGER YES NA NA NA
ethnicity_concept_id INTEGER YES NA NA NA
location_id INTEGER YES NA NA NA
provider_id INTEGER YES NA NA NA

Scanning the rows, which field/column is the primary key for person?

Try and find the primary key for procedure_occurrence. What is it?

DESCRIBE procedure_occurrence
Displaying records 1 - 10
column_name column_type null key default extra
procedure_occurrence_id INTEGER YES NA NA NA
person_id INTEGER YES NA NA NA
procedure_concept_id INTEGER YES NA NA NA
procedure_date DATE YES NA NA NA
procedure_datetime TIMESTAMP YES NA NA NA
procedure_type_concept_id INTEGER YES NA NA NA
modifier_concept_id INTEGER YES NA NA NA
quantity INTEGER YES NA NA NA
provider_id INTEGER YES NA NA NA
visit_occurrence_id INTEGER YES NA NA NA

We’ll see that keys need to be unique (so they can map to each row). In fact, each key is a way to connect one table to another.

What column is the same in both tables? That is a hint for what we’ll cover next week: JOINing tables.

2.13 Data Types

If you look at the column_type for one of the DESCRIBE statements above, you’ll notice there are different data types:

  • INTEGER
  • TIMESTAMP
  • DATE
  • VARCHAR

Each column of a database needs to be typed. The data type of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like date arithmetic on DATETIME columns, asking the engine to calculate 5 days after the dates.

You can see all of the datatypes that are available in DuckDB here.

2.14 Always close the connection

When we’re done, it’s best to close the connection with dbDisconnect().