TIL that DuckDB now supports Avro files



In the data engineering pipelines of my current company we use two main file formats:

  • Parquet files for analytical workloads (columnar storage)
  • Avro files for transactional / event-based messages (row storage)

In Querying remote S3 files I wrote about how I use DuckDB to query Parquet files stored in S3. Recently, I noticed that DuckDB now also supports reading Avro files in the same way.

Install extension

To install the extension, open DuckDB and run:

INSTALL avro FROM community;
LOAD avro;

Read local files

Local files can be queried with the read_avro function:

SELECT * FROM read_avro('my_dataset.avro');
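Since read_avro behaves like an ordinary table function, the result can be filtered, aggregated, or joined like any other table. A small sketch (the column name event_type is made up for illustration; your file's schema will differ):

```sql
-- Inspect the schema DuckDB infers from the Avro file
DESCRIBE SELECT * FROM read_avro('my_dataset.avro');

-- Query it like any table (event_type is a hypothetical column)
SELECT event_type, count(*) AS n
FROM read_avro('my_dataset.avro')
GROUP BY event_type
ORDER BY n DESC;
```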

Read files stored in S3

Remote files can be queried by providing the S3 URI (after authenticating with the AWS account):

SELECT * FROM read_avro('s3://my_bucket/my_remote_dataset.avro');
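For the S3 access itself, DuckDB's secrets mechanism can provide the credentials. One way is to reuse the credential chain of the local AWS setup via the aws extension; alternatively, the credentials can be supplied explicitly. A sketch with placeholder values:

```sql
-- Option 1: reuse credentials from the local AWS config / environment
-- (requires DuckDB's aws extension)
CREATE SECRET (TYPE s3, PROVIDER credential_chain);

-- Option 2: supply the credentials explicitly (all values are placeholders)
CREATE SECRET (
    TYPE s3,
    KEY_ID 'my_key_id',
    SECRET 'my_secret_key',
    REGION 'eu-west-1'
);
```

Once a secret is in place, the read_avro query above works against the S3 URI directly.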

A few limitations of the extension are listed here.


If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

Tags:
#duck-db   #data-engineering   #sql