Observable Framework View source

DuckDB data loader

Here’s a shell script data loader that uses curl to download a CSV file, then uses the DuckDB CLI to filter it, and finally outputs the result as a Apache Parquet file.


# Use the data loader cache directory to store the downloaded data.

# Download the data (if it’s not already in the cache).
if [ ! -f "$TMPDIR/$CODE.csv" ]; then
  curl "$URL" -o "$TMPDIR/$CODE.csv"

# Generate a Parquet file using DuckDB.
duckdb :memory: << EOF
  FROM read_csv('$TMPDIR/$CODE.csv')
  WHERE true
    AND TIME_PERIOD = 2019 -- a good year in terms of data quality
    AND OBS_VALUE > 0 -- filter out zeros
    AND isced11 = 'ED2' -- lower secondary education
    AND unit = 'PC' -- ignore absolute numbers, keep percentages
    AND language != 'TOTAL' -- ignore total
    AND length(geo) = 2 -- ignore groupings such as EU_27
) TO STDOUT (FORMAT 'parquet', COMPRESSION 'gzip');

To run this data loader, you’ll need to install curl and the DuckDB CLI if they are not already installed on your system.

The example data is statistics about modern foreign language education (educ_uoe_lang01) from Eurostat, the data portal of the statistical office of the European Union. To make it faster to iterate on the data — for example to change the education level or the time period — we save the downloaded data to Framework’s cache folder.

The above data loader lives in educ_uoe_lang01.parquet.sh, so we can load the data as educ_uoe_lang01.parquet. The FileAttachment.parquet method parses the file and returns a promise to an Arrow table.

const languages = FileAttachment("educ_uoe_lang01.parquet").parquet();

We can display this dataset with Inputs.table:


We can also make a quick chart of most-frequently taught modern foreign languages in Europe using Observable Plot; note that the stacked percentages go above 100% in most countries because many pupils learn at least two foreign languages. Ireland (IE) is the only exception as English is not taught as a foreign language.

For reference, here are the codes used for countries and languages:

code country
BE Belgium
BG Bulgaria
CZ Czechia
DK Denmark
DE Germany
EE Estonia
IE Ireland
EL Greece
ES Spain
FR France
HR Croatia
IT Italy
CY Cyprus
LV Latvia
LT Lithuania
LU Luxembourg
HU Hungary
MT Malta
NL Netherlands
AT Austria
PL Poland
PT Portugal
RO Romania
SI Slovenia
SK Slovakia
FI Finland
SE Sweden
IS Iceland
LI Liechtenstein
NO Norway
UK United Kingdom
BA Bosnia and Herzegovina
MK North Macedonia
AL Albania
RS Serbia
code language
BUL Bulgarian
SPA Spanish
CZE Czech
DAN Danish
GER German
EST Estonian
GRE Greek
ENG English
FRE French
GLE Irish
HRV Croatian
ITA Italian
LAV Latvian
LIT Lithuanian
HUN Hungarian
MLT Maltese
DUT Dutch; Flemish
POL Polish
POR Portuguese
RUM Romanian
SLO Slovak
SLV Slovenian
FIN Finnish
SWE Swedish
ARA Arabic
CHI Chinese
JPN Japanese
RUS Russian
OTH Other
UNK Unknown