.. meta:: :description lang=en: xlSlim supports running DuckDB queries within Excel. DuckDB ====== .. _DuckDB: https://duckdb.org/ .. _SQL: https://duckdb.org/docs/sql/introduction.html .. _https: https://duckdb.org/docs/extensions/httpfs/overview .. _`JSON capabilities`: https://duckdb.org/docs/extensions/json .. _`Parquet capabilities`: https://duckdb.org/docs/data/parquet/overview .. _`CSV capabilities`: https://duckdb.org/docs/data/csv/overview .. _JSONPath: https://en.wikipedia.org/wiki/JSONPath .. _PIVOT: https://duckdb.org/docs/sql/statements/pivot .. _CTE: https://duckdb.org/docs/sql/query_syntax/with .. _CASE: https://duckdb.org/docs/sql/expressions/case .. _efficient: https://duckdb.org/2021/06/25/querying-parquet.html .. _S3: https://duckdb.org/docs/extensions/httpfs/s3api `DuckDB`_ is a powerful analytical database that allows you to easily import and analyse large datasets. DuckDB fully supports `SQL`_ with many improvements to make SQL friendlier. xlSlim includes a complete in-process DuckDB which allows DuckDB SQL statements to be run within either an in-memory or on-disk database. DuckDB can read data from many common file formats, see the sections below for :ref:`JSON`, :ref:`Parquet` and :ref:`CSV`. Additionally, DuckDB has functionality for :ref:`Remote`. DuckDB functionality is accessed within Excel by calling xlSlim's :func:`DuckDbQuery` function. .. tip:: When copying text such as SQL commands into Excel cells, add a ' at the start so that Excel treats the input as a string. .. image:: ../_static/duckdb_copy_text_to_excel.gif :width: 400 :alt: Copy text into Excel using '. .. warning:: DuckDB requires a premium licence. See :doc:`/reference/licensing` .. _JSON: Querying JSON files ------------------- Reading data from JSON files can be difficult, particularly if the data is nested. This JSON file contains details of Nobel prize laureates: .. image:: ../_static/duckdb_sample_json.png :width: 400 :alt: DuckDB Nobel Laureates as JSON file. The data is presented as a list of dictionaries with the *prizes* field being a further list of dictionaries. DuckDB allows the laureates names and birth countries to be extracted with a simple SQL SELECT. Notice how DuckDB allows the JSON file to be treated just like a regular database table: .. code-block:: sql SELECT firstname, surname, bornCountry FROM 'laureate.json' LIMIT 5 Passing the SQL into the :func:`DuckDbQuery` function gives the following result: .. image:: ../_static/duckdb_json_top5.png :width: 400 :alt: First 5 names and birth countries in the JSON file. We can now easily find the countries with the most Nobel prize winners: .. code-block:: sql SELECT bornCountry AS Country, COUNT(*) AS Number FROM 'laureate.json' GROUP BY bornCountry ORDER BY COUNT(*) DESC LIMIT 5 .. image:: ../_static/duckdb_json_top_countries.png :width: 400 :alt: Countries with the most prize winners in the JSON file. DuckDB supports the use of `JSONPath`_ to extract values from nested JSON fields. This allows us to extract the category and motivation of the first prize awarded to each person. .. code-block:: sql SELECT firstname, surname, prizes->>'$[0].category' AS Category, prizes->>'$[0].motivation'AS Motivation FROM 'laureate.json' LIMIT 5 .. image:: ../_static/duckdb_json_category_and_motivation.png :width: 400 :alt: Prize category and motivation in the JSON file. Full details of DuckDB's `JSON capabilities`_ are available in their documentation. .. _CSV: Querying CSV files ------------------ Data is often stored in CSV files. Surprisingly, CSV is not a standardised data format and many variations exist. DuckDB is able to handle most `CSV`_ files automatically, detecting the column delimiters, data types, and so on. This CSV file contains 10,000 sales records: .. image:: ../_static/duckdb_sample_csv.png :width: 400 :alt: DuckDB sample sales records in a CSV file. DuckDB allows the region, item type and total revenue for each sale to be extracted with a simple SQL SELECT. Notice how, just like the JSON example above, DuckDB allows the CSV file to be treated as a regular database table: .. code-block:: sql SELECT Region, "Item Type", "Total Revenue" FROM '10000SalesRecords.csv' LIMIT 5 Passing the SQL into the :func:`DuckDbQuery` function gives the following result: .. image:: ../_static/duckdb_csv_top5.png :width: 400 :alt: First 5 sales records in the CSV file. DuckDB has a user-friendly `PIVOT`_ statement that allows us to view the revenues in Asia and Europe broken down by item type: .. code-block:: sql PIVOT '10000SalesRecords.csv' ON Region IN ("Europe", "Asia") USING sum("Total Revenue") GROUP BY "Item Type" ORDER BY "Item Type" .. image:: ../_static/duckdb_csv_sales_pivot.png :width: 400 :alt: Pivot sales in the JSON file. Data from multiple sources can be combined in a single SQL query. We can combine the JSON and CSV data to show the total sales in the countries with the most Nobel laureates: .. code-block:: sql WITH CountrySales AS (SELECT CASE WHEN Country='United States of America' THEN 'USA' ELSE Country END AS Country, Sum("Total Revenue") AS Sales FROM '10000SalesRecords.csv' GROUP BY Country) SELECT cs.Country, SUM(cs.Sales) AS Sales, COUNT(*) AS "Nobel Laureates" FROM CountrySales cs INNER JOIN 'laureate.json' l ON cs.Country = l.bornCountry GROUP BY cs.Country ORDER BY "Nobel Laureates" DESC LIMIT 5 This SQL uses a common table expression (`CTE`_) to create a CountrySales result set that is joined to the JSON data. Note the use of a `CASE`_ expression ensure the country names match in both source data sets. .. image:: ../_static/duckdb_csv_sales_joined_to_json.png :width: 400 :alt: CSV sales joined with JSON data. Full details of DuckDB's `CSV capabilities`_ are available in their documentation. .. _Parquet: Querying Parquet files ---------------------- Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. DuckDB has extensive support for `efficient`_ querying of Parquet files. DuckDB allows very large Parquet datasets to be queried (even data sets that do not fit into memory), multiple files can be queried in parallel. The Parquet file, *titanic.parquet*, contains details of Titanic survivors. DuckDB allows the survival status, cabin class, sex and age of the passengers countries to be extracted with a simple SQL SELECT: .. code-block:: sql SELECT Survived, Pclass, Sex, Age FROM 'titanic.parquet' LIMIT 5 Passing the SQL into the :func:`DuckDbQuery` function gives the following result: .. image:: ../_static/duckdb_parquet_top_survivors.png :width: 400 :alt: First rows in the Parquet file. We can now contrast the ages across cabin classes for survivors versus non-survivors: .. code-block:: sql WITH Survivors AS (SELECT Pclass, Age, Sex FROM 'D:\github\xlslim-code-samples\duckdb\..\data\titanic.parquet' WHERE Survived=1) PIVOT Survivors ON Pclass USING AVG(Age) GROUP BY Sex .. image:: ../_static/duckdb_parquet_survivors.png :width: 400 :alt: Titanic survivors. .. code-block:: sql WITH Survivors AS (SELECT Pclass, Age, Sex FROM 'D:\github\xlslim-code-samples\duckdb\..\data\titanic.parquet' WHERE Survived=0) PIVOT Survivors ON Pclass USING AVG(Age) GROUP BY Sex .. image:: ../_static/duckdb_parquet_nonsurvivors.png :width: 400 :alt: Titanic survivors. Generally, younger passengers were more likely to survive, with the curious exception of first class female passengers. Full details of DuckDB's `Parquet capabilities`_ are available in their documentation. .. _Remote: Querying from remote locations ------------------------------ DuckDB has functionality to directly query data located on `https`_ and in AWS S3. As an example, the *holdings.parquet* file can be queried from https://duckdb.org: .. code-block:: sql SELECT * FROM 'https://duckdb.org/data/holdings.parquet'; .. image:: ../_static/duckdb_parquet_remote.png :width: 400 :alt: Remote Parquet query. Similarly we can attach to the DuckDB stations database in S3 and query the number of stations in each country: .. code-block:: sql ATTACH 's3://duckdb-blobs/databases/stations.duckdb' AS stations_db; SELECT country, COUNT(*) AS "Number of stations" FROM stations_db.stations GROUP BY country ORDER BY country .. image:: ../_static/duckdb_duckdb_remote.png :width: 400 :alt: Remote DuckDB query. .. tip:: Access to AWS S3 data usually requires credentials. See the DuckDB `S3`_ API documentation for details about how to use secrets to provide credentials to S3. Hopefully this guide gives a sense of the power of DuckDB! Please read the DuckDB `SQL`_ documentation for more information about DuckDB's capabilities, including how to attach to SQLite, Postgress or indeed any ODBC databases. .. note:: All the Excel and data files shown are available from github in the `xlslim-code-samples`_ repo. I highly recommend downloading the samples from github. The Excel workbooks contain many tips and tricks. .. _xlslim-code-samples: https://github.com/RusselWebber/xlslim-code-samples/archive/refs/heads/main.zip