TIL that DuckDB now supports Avro files
In the data engineering pipelines at my current company, we use two main file formats:
- Parquet files for analytical workloads (columnar storage)
- Avro files for transactional / event-based messages (row storage)
In Querying remote S3 files I wrote about how I use DuckDB to query Parquet files stored in S3. Recently, I noticed that DuckDB now supports reading Avro files in the same way.
Install extension
To install the extension open DuckDB and run:
INSTALL avro FROM community;
LOAD avro;
Read local files
Local files can be queried with the read_avro function:
SELECT * FROM read_avro('my_dataset.avro');
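Because read_avro behaves like any other table expression, the usual SQL operations apply on top of it. A small sketch (the file and the column names my_column and event_time are hypothetical):

```sql
-- Inspect the schema DuckDB infers from the Avro file
-- (column names below are made up for illustration)
DESCRIBE SELECT * FROM read_avro('my_dataset.avro');

-- Filter and project like any other table
SELECT my_column, event_time
FROM read_avro('my_dataset.avro')
WHERE event_time >= DATE '2024-01-01';
```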
Read files stored in S3
Remote files can be queried by providing the S3 URI (after logging into the AWS account):
SELECT * FROM read_avro('s3://my_bucket/my_remote_dataset.avro');
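One way to make the AWS login step explicit inside DuckDB is to create a secret that pulls credentials from the standard AWS credential chain. A sketch, assuming the bucket name is a placeholder:

```sql
-- Resolve credentials from the usual AWS sources
-- (environment variables, ~/.aws config, SSO, instance profile)
CREATE SECRET (TYPE s3, PROVIDER credential_chain);

SELECT * FROM read_avro('s3://my_bucket/my_remote_dataset.avro');
```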
A few limitations of the extension are listed in its documentation.
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.
Tags: #duck-db #data-engineering #sql