Summary statistics

Summary statistics are the basic tools of data analysts. This chapter will demonstrate how to do all the simple summary statistics using Polars. Run this code using cargo run -r --example 4_1_1_summary.

Important

Reminder: The “income” variable is synthetic (random value between £10,000 and £100,000). This means that the distribution will not provide real-world summary results.

Setup

First, lets connect to the partitioned parquet files and replace the region code with their name:

use polars::prelude::{pivot::pivot_stable, *};

// Connect to LazyFrame
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(PlPath::from_str("./data/large/partitioned"), args).unwrap();

// Modify var
let lf = lf
    .filter(col("keep_type").eq(lit(1))) // Usual resident
    .filter(col("income").is_not_null())
    .with_column(col("region").replace_strict(
        lit(Series::from_iter(vec![
            "E12000001",
            "E12000002",
            "E12000003",
            "E12000004",
            "E12000005",
            "E12000006",
            "E12000007",
            "E12000008",
            "E12000009",
            "W92000004",
        ])),
        lit(Series::from_iter(vec![
            "North East",
            "North West",
            "Yorkshire and The Humber",
            "East Midlands",
            "West Midlands",
            "East of England",
            "London",
            "South East",
            "South West",
            "Wales",
        ])),
        None,
        Some(DataType::String),
    ));

Simple statistics

You can do simple statistics functions like mean(), sum() or median() by limiting the LazyFrame to one variable and calling the appropreate function. Here is how you would get a median of income:

// Simple statistics (single point)
let mean_income: DataFrame = lf
    .clone()
    .select([col("income")])
    .median()
    .collect()
    .unwrap();

println!("Mean income:\n\n{mean_income}\n");

shape: (1, 1)
┌─────────┐
│ income  │
│ ---     │
│ f64     │
╞═════════╡
│ 55008.5 │
└─────────┘

You can get different statistics with other functions like min(), quantile() or max(). For the quantiles, you have to provide a QuantileMethod.

// Multiple statistics (calculated)
let income_stats = lf
    .clone()
    .select([
        (len() / lit(100_000)).alias("count (x100,000)"),
        col("income")
            .mean()
            .round(2, RoundMode::HalfAwayFromZero)
            .alias("mean"),
        col("income").min().alias("min"),
        col("income")
            .quantile(lit(0.01), QuantileMethod::Nearest)
            .alias("p01"),
        col("income")
            .quantile(lit(0.25), QuantileMethod::Nearest)
            .alias("p25"),
        col("income").median().alias("median"),
        col("income")
            .quantile(lit(0.75), QuantileMethod::Nearest)
            .alias("p75"),
        col("income")
            .quantile(lit(0.99), QuantileMethod::Nearest)
            .alias("p99"),
        col("income").max().alias("max"),
    ])
    .unpivot(UnpivotArgsDSL {
        on: Selector::Empty,
        index: Selector::Empty,
        variable_name: Some("statistic".into()),
        value_name: Some("value".into()),
    })
    .collect()
    .unwrap();

println!("Table of summary statistics about income:\n\n{income_stats}\n");

shape: (9, 2)
┌──────────────────┬──────────┐
│ statistic        ┆ value    │
│ ---              ┆ ---      │
│ str              ┆ f64      │
╞══════════════════╪══════════╡
│ count (x100,000) ┆ 294.0    │
│ mean             ┆ 55023.78 │
│ min              ┆ 10000.0  │
│ p01              ┆ 10917.0  │
│ p25              ┆ 32573.0  │
│ median           ┆ 55008.5  │
│ p75              ┆ 77492.0  │
│ p99              ┆ 99097.0  │
│ max              ┆ 99999.0  │
└──────────────────┴──────────┘

Statistics by category

To get statistics by a variable group, you have to use group_by() with agg(). This example provides the mean income by region:

// Simple statistics by category
let mean_income_by_region = lf
    .clone()
    .group_by([col("region")])
    .agg([col("income").mean().round(2, RoundMode::HalfAwayFromZero)])
    .collect()
    .unwrap();

println!("Mean income by region:\n\n{mean_income_by_region}\n");

shape: (10, 2)
┌──────────────────────────┬──────────┐
│ region                   ┆ income   │
│ ---                      ┆ ---      │
│ str                      ┆ f64      │
╞══════════════════════════╪══════════╡
│ South East               ┆ 55102.39 │
│ East of England          ┆ 54976.91 │
│ North West               ┆ 55063.74 │
│ Wales                    ┆ 55075.05 │
│ Yorkshire and The Humber ┆ 54972.74 │
│ South West               ┆ 55108.22 │
│ East Midlands            ┆ 54697.23 │
│ London                   ┆ 54858.61 │
│ North East               ┆ 55177.93 │
│ West Midlands            ┆ 55320.6  │
└──────────────────────────┴──────────┘

Weighted statistics

If you have survey weights, like the synthetic weight variable found on the 1% samples of the Census, you can create statistics by that weight, with formulas. In this example, a custom function called weighted_quantile provides quantiles that respect the survey weights. It provides an approximation of the 100%, using only the 1% sample.

// Connect to LazyFrame
let args = ScanArgsParquet::default();
let lf =
    LazyFrame::scan_parquet(PlPath::from_str("./data/parquet/census_0.parquet"), args).unwrap();

// Filter and format
let lf = lf
    .filter(col("keep_type").eq(lit(1))) // Usual resident
    .filter(col("income").is_not_null())
    .with_column(col("sex").replace_strict(
        lit(Series::from_iter(vec!["1", "2"])),
        lit(Series::from_iter(vec!["Female", "Male"])),
        None,
        Some(DataType::String),
    ))
    .with_column(col("weight").cast(DataType::Float64));

// Calculate weighted quantile
fn weighted_quantile(col: Expr, wt: Expr, percentile: Expr) -> Expr {
    col.sort_by(
        [(wt.clone().cast(DataType::Float64).cum_sum(false)
            / wt.clone().cast(DataType::Float64).sum()
            - percentile)
            .abs()],
        SortMultipleOptions::default(),
    )
    .first()
    .alias("median")
}

// Weighted statistics
let income_stats_wt = lf
    .clone()
    .sort(["income"], SortMultipleOptions::new())
    .select([
        ((col("weight").sum()) / lit(100_000))
            .round(1, RoundMode::HalfAwayFromZero)
            .alias("count (x100,000)"),
        (((col("income") * col("weight")).sum()) / col("weight").sum())
            .round(2, RoundMode::HalfAwayFromZero)
            .alias("mean"),
        col("income").min().alias("min"),
        weighted_quantile(col("income"), col("weight"), lit(0.01)).alias("p01"),
        weighted_quantile(col("income"), col("weight"), lit(0.25)).alias("p25"),
        weighted_quantile(col("income"), col("weight"), lit(0.50)).alias("median"),
        weighted_quantile(col("income"), col("weight"), lit(0.75)).alias("p75"),
        weighted_quantile(col("income"), col("weight"), lit(0.99)).alias("p99"),
        col("income").max().alias("max"),
    ])
    .unpivot(UnpivotArgsDSL {
        on: Selector::Empty,
        index: Selector::Empty,
        variable_name: Some("statistic".into()),
        value_name: Some("value".into()),
    })
    .collect()
    .unwrap();

println!("Table of weighted summary statistics about income:\n\n{income_stats_wt}\n");

shape: (9, 2)
┌──────────────────┬──────────┐
│ statistic        ┆ value    │
│ ---              ┆ ---      │
│ str              ┆ f64      │
╞══════════════════╪══════════╡
│ count (x100,000) ┆ 293.0    │
│ mean             ┆ 55015.96 │
│ min              ┆ 10000.0  │
│ p01              ┆ 10917.0  │
│ p25              ┆ 32565.0  │
│ median           ┆ 55000.0  │
│ p75              ┆ 77471.0  │
│ p99              ┆ 99093.0  │
│ max              ┆ 99999.0  │
└──────────────────┴──────────┘

It also works with the group_by() and agg() function. In this example, weighted statistics grouped by sex are computed and then pivoted for nicer printing:

// Weighted statistics (by sex)
let income_stats_wt_by_sex = lf
    .clone()
    .sort(["sex", "income"], SortMultipleOptions::new())
    .group_by(["sex"])
    .agg([
        ((col("weight").sum()) / lit(100_000))
            .round(1, RoundMode::HalfAwayFromZero)
            .alias("count (x100,000)"),
        (((col("income") * col("weight")).sum()) / col("weight").sum())
            .round(2, RoundMode::HalfAwayFromZero)
            .alias("mean"),
        col("income").min().alias("min"),
        weighted_quantile(col("income"), col("weight"), lit(0.01)).alias("p01"),
        weighted_quantile(col("income"), col("weight"), lit(0.25)).alias("p25"),
        weighted_quantile(col("income"), col("weight"), lit(0.50)).alias("median"),
        weighted_quantile(col("income"), col("weight"), lit(0.75)).alias("p75"),
        weighted_quantile(col("income"), col("weight"), lit(0.99)).alias("p99"),
        col("income").max().alias("max"),
    ])
    .collect()
    .unwrap()
    .unpivot(
        [
            "count (x100,000)",
            "mean",
            "min",
            "p01",
            "p25",
            "median",
            "p75",
            "p99",
            "max",
        ],
        ["sex"],
    )
    .unwrap();

let income_stats_wt_by_sex = pivot_stable(
    &income_stats_wt_by_sex,
    ["sex"],
    Some(["variable"]),
    Some(["value"]),
    false,
    None,
    None,
)
.unwrap();

println!("Table of weighted summary statistics about income:\n\n{income_stats_wt_by_sex}\n");

shape: (9, 3)
┌──────────────────┬──────────┬──────────┐
│ variable         ┆ Female   ┆ Male     │
│ ---              ┆ ---      ┆ ---      │
│ str              ┆ f64      ┆ f64      │
╞══════════════════╪══════════╪══════════╡
│ count (x100,000) ┆ 139.9    ┆ 153.1    │
│ mean             ┆ 54948.81 ┆ 55077.33 │
│ min              ┆ 10000.0  ┆ 10000.0  │
│ p01              ┆ 10908.0  ┆ 10921.0  │
│ p25              ┆ 32483.0  ┆ 32635.0  │
│ median           ┆ 55007.0  ┆ 54993.0  │
│ p75              ┆ 77351.0  ┆ 77569.0  │
│ p99              ┆ 99049.0  ┆ 99126.0  │
│ max              ┆ 99999.0  ┆ 99999.0  │
└──────────────────┴──────────┴──────────┘

Keyboard shortcuts

Data Analysis in Rust

Summary statistics

Setup

Simple statistics

Statistics by category

Weighted statistics