Source data stored in a PostgreSQL Database¶
Scenario¶
You would like to extract data directly from a PostgreSQL Database.
Note
Before deciding that you’d like to query a SQL database directly, instead, you may want to consider separately querying your database, saving the result of that query to a file, and then using that file as source for an extract config.
By querying a database directly
It’s no longer easy to audit the data used as the source for an ingest.
It’s much harder, if not impossible, for non-technical users to know what data is used for an ingest
Using a database is inherently volatile e.g. rows may be added/ deleted/ updated.
Because of the above points, it’s important to consider if your needs wouldn’t be better served by using a static result of your query as source data for an ingest instead of directly querying the SQL database.
.sql File Solution¶
A convenient way to load data from a postgres database is by saving your SQL query in a file, loading that file as source data, then connecting to the database and running your query inside of source_data_read_function.
source_data_read_func
will be used to parse the data from the
source_data_url
.
As an example, say you save your query in the file at
my-ingest-package/data/my_query.sql
:
SELECT *
FROM participant p
WHERE p.study_id = 'SD_ME0WME0W'
Your extract config to pull data from your database could then look like:
import pandas as pd
import psycopg2
source_data_url = "file://my-ingest-package/data/my_query.sql
db_url = "postgresql://username:password@my-postgres-db-url.com:5432/postgres"
source_data_read_params = {"connection_url": db_url}
def source_data_read_func(filepath, connection_url, **kwargs):
with open(filepath, "r") as f:
query = f.read().replace("\n", " ")
f.close
conn = psycopg2.connect(connection_url)
df = pd.read_sql(query, conn)
return df
The resulting dataframe would then be the resulting data that comes from your query. You can then continue on writing your extract config as you normally would.