Pandas, DuckDB, and Polars on Microsoft Fabric

In this post, we will analyze a relatively small parquet file using various tools. On Microsoft Fabric, we will use basic Pandas, Pandas with multiprocessing, Polars, and DuckDB. We will also use cuDF.pandas on Google Colab. Below is the result.

These are in-memory analytic tools. Pandas is single-threaded, while Polars and DuckDB are multi-threaded and written in Rust and C++, respectively.

Pandas with multiprocessing did surprisingly well. Implementing DuckDB was the hardest as looping is hard unless using SQL Server.

The file that we work with is a table of mortgage applications with 1.6 million rows. As this table is relatively small, using cuDF was ineffective on Google Colab A100 so we do not include it here. Perhaps in another post, I will test all tools again on a much larger file.

We filter the applicant_income_000s >= k for each k within 0 and 10000, and we compute the mean of the other two columns. We use the magic comand %%timeit to find the average time.

GRAPH

fig,ax1=plt.subplots()

ax1.plot(x_values,rates, label='pull through', color='green')
ax1.set_xlabel('Income amount in 000s')
ax1.set_ylabel('Acceptance rate')
plt.legend()
ax2=ax1.twinx()

ax2.plot(x_values,loan_avgs, label='average loan', color='purple')
ax2.set_ylabel('Average Loan amount in 000s')

plt.legend(loc='lower left')
plt.show()

STANDARD PANDAS

def pull_through_rate_panda(k, df):
    averages = df[df["applicant_income_000s"]>=k][["pull_through","loan_amount_000s"]].mean()
    return averages
averages = [pull_through_rate_panda(k,data) for k in range(10000)]

PANDAS WITH MULTIPROCESSING

pool = Pool()
x_values = range(0,10000,1)
results=pool.map(pull_through_rate_panda,x_values)
pool.close()
pool.join()

POLARS

def pull_through_rate(k):
    averages = data.filter(pl.col('applicant_income_000s') >= k).select([
        pl.col('pull_through'),
        pl.col('loan_amount_000s')
    ]).mean()
    return averages['pull_through'][0],averages['loan_amount_000s'][0]

DUCKDB

query = '''
WITH RECURSIVE k_values(k) AS (
    SELECT 0
    UNION ALL
    SELECT k+1 from k_values
    WHERE k < 9999
),
averages AS (
    SELECT
    k,
    AVG(CASE WHEN t.applicant_income_000s >= k THEN t.pull_through ELSE NULL END) AS rate,
    AVG(CASE WHEN t.applicant_income_000s >= k THEN t.loan_amount_000s ELSE NULL END) AS loan
    FROM k_values
    LEFT JOIN my_table t ON t.applicant_income_000s >= k
    GROUP BY k
)
SELECT rate, loan
FROM averages
ORDER by k;
'''
results_df = conn.execute(query).fetch_df()

Sophelen Research

Pandas, DuckDB, and Polars on Microsoft Fabric

Leave a comment Cancel reply

Pandas, DuckDB, and Polars on Microsoft Fabric

Share this:

Leave a comment Cancel reply