Exporting Data

You have the choice to export to a file(Excel, CSV, JSON, etc.) or a SQL database. Exporting to Excel files and Markdown is primarily used for testing. Exporting to SQL is more commonly used. 

You can specify batch sizes by adding a --batch-size=number at the end of your command.

“Checkpoints” enable incremental exports. A checkpoint will get set after each batch of data that is fetched from the API. The next export will start at the last checkpoint. Checkpoints are only created when exporting to a SQL database. 

Export to an Excel file

Exporting to an Excel file allows you to get a full results-set without setting up a local database. When you export to Excel, there are no checkpoints. It’s all or nothing. If the query fails, the Excel file is not created.

You can run a data export and save it to an Excel file with the DET. Excel files are great for one-time data exports or if you want to use them to analyze and visualize the data. Excel is also a great option if you are not familiar with working with databases.

The downside to Excel files is that they can’t handle large datasets. You can also only do once-off export to an Excel file; there are no “checkpoints” created for incremental updates when the DET is run. 

Note: You can specify batch sizes by adding a --batch-size=number at the end of your command.

When exporting to Excel, the DET will fetch data from the API in batches, and then when the final batch is processed, the output will be written to the Excel file.

Export to a Database

Exporting to a database supports incremental updates. When you export to a database, you can have millions of rows of data, so the DET handles it in batches. 

After each successful batch, the tool writes a checkpoint to a database table that the DET maintains.

By default, each time the commcare-export command runs, it will only import results that were modified after the most recently imported form or case. This means that, for very large datasets, you do not need to export a full set of results every time. 

This can be an incredibly powerful feature for a data pipeline. (See more information at https://dimagi.atlassian.net/wiki/x/BibKfw).

You can configure the DET to pull data on a schedule to automatically keep your data up to date or run it as a one-time data dump.

Exporting to SQL allows you to clean and manipulate the data before you integrate it with data visualization tools like Power BI or Tableau.

By choosing "--output-format sql" and "--output DATABASE_URL" in your command, you can save the output to a SQL database. 

The term DATABASE_URL is a standard URL format for databases in these commands. If, for example, you have a PostgreSQL database called “registrations” that is hosted on db.example.com that you would like to save into instead, the URL might look like postgresql://db.example.com/registrations (you may need to include username and password in the URL).

Checkpoints
  • The first time you run a DET query to a database, it will get all the records that exist and then set the checkpoint at the end.

  • The next time you run the query, it will just start from the last checkpoint and get all the cases modified after that time or forms submitted after that time.

  • This checkpoint data is stored in a table called “commcare-export-runs” in the database.

  • If you change the query file, the DET will know that it’s a new query file because the MD5 hash will not match. This triggers a full refresh of all data in the domain. See https://dimagi.atlassian.net/wiki/x/OTDKfw for more information.

  • A checkpoint will get set after each batch of data that is fetched from the API. Using the --batch-size=number command-line argument will change how much data is fetched in each batch and, therefore, the frequency of the checkpoints.

 

Technical information for exporting to SQL

Note that a field named id is required for use with a SQL database.

Exporting to SQL requires installing additional python libraries. These are listed in the table below.

Supported Databases

Database name

URL format

Maximum field name length

Dependencies

MSSQL

mssql+pyodbc://<username>:<password>@<host>/<database name>?driver=ODBC+Driver+18+for+SQL+Server



128

Requires the 'pyodbc' library:

pip install pyodbc

Also requires the SQL Server drivers to be installed. 

Azure MSSQL

mssql+pyodbc:///?odbc_connect=<quoted connection string>  (see: https://dimagi.atlassian.net/wiki/x/lErKfw )

128

Same as MSSQL above

MySQL

mysql+pymysql://<username>:<password>@<host>/<database name>?charset=utf8mb4



64

Requires the 'pymysql' library:

pip install pymysql



Oracle

oracle+cx_oracle://<username>:<password>@<host>/<database name>



128

Requires the 'cx_oracle' library:

pip install cx_oracle

PostgreSQL

postgresql://<username>:<password>@<host>/<database name>



63

Requires the 'psycopg2' library:

pip install psycopg2


Data types

Data Type Detection

By default, the data export tool will choose the data type of the SQL column based on the incoming data. Strings, integers, and boolean values are supported.

You can also use the command-line option "--strict-types" which will prevent SQL column types from being changed. This may result in errors if incompatible data is received from CommCare HQ.

 

Explicit Data Types

You can also explicitly set data types by adding a column with the heading "Data Type". Allowed values for this column are:

Value

text

boolean

date

datetime

integer

json (Postgres Databases only)

If you specify a data type column, then the column will be created when the table is first created instead of dynamically based on data. Note that if you use explicit data types, you are responsible for ensuring the data going into the database is valid for that column. Specifying these conversions may be necessary for some data types - especially when converting string/text values to other types.

Data Type Conversion

To convert data of one type to another type before it goes into the database, you can provide the following functions in the Map Via column:

Function name

Description

str2bool

Convert a string value to a boolean. The following values result in a True output:

'true', 't', '1' (case doesn't matter)

If the value is already a boolean, it will be output without change. All other values result in a False output.

bool2int

This supports converting boolean values and booleans represented as strings to the integers 1 or 0.

str2num

Convert a string representation of a number to a number.

str2date

Convert a string representation of a date or timestamp to a date or timestamp.

Export to Markdown

Exporting to “Markdown” gives you the results in the command-line. Markdown is primarily used for testing.