
Filter

This chapter will explore how to filter rows from your data. You can run these examples with cargo run --example 3_1_1_filter.

Filters

Simple filters

To have data to work with, let's connect to the 100% sample Census data parquet files:

use polars::prelude::*;

// Connect to LazyFrame
let args = ScanArgsParquet::default();
let lf = LazyFrame::scan_parquet(PlPath::from_str("./data/large/partitioned"), args).unwrap();

To filter this data, you can use filter() with expressions. In this example, keep_type is filtered to “usual resident”, region is filtered to the region of London (“E12000007”), age_group is filtered to 45 years old and older (>= 5) and records with null values in income are dropped:

// Filtering the data in multiple steps
let lf_filt_mult = lf
    .clone()
    .filter(col("keep_type").eq(lit(1))) // Usual resident
    .filter(col("region").eq(lit("E12000007"))) // London
    .filter(col("age_group").gt_eq(lit(5))) // Aged 45+
    .filter(col("income").is_not_null());

To reference a column in a filter, it must be called with the col() function. Values (i.e. literals) must be wrapped with the lit() function. They can be compared using comparison functions such as:

Equality        Definition
eq()            ==
eq_missing()    == & None == None
neq()           !=
neq_missing()   != (including when one side of the equation is None)
lt()            <
gt()            >
gt_eq()         >=
lt_eq()         <=
not()           !
is_null()       == None
is_not_null()   != None

Instead of multiple .filter() calls, you can use a single .filter() and chain the conditions with .and(), .or() and .xor(). The above example can be recreated with this filter:

// Filtering the data in one step
let lf_filt_one = lf.clone().filter(
    col("keep_type")
        .eq(lit(1)) // Usual resident
        .and(col("region").eq(lit("E12000007"))) // London
        .and(col("age_group").gt_eq(lit(5))) // Aged 45+
        .and(col("income").is_not_null()),
);

Complex filters

The and() and or() functions are especially important when building more complex filters. For example, this filter collects individuals aged 55 and over from the North East and individuals aged 54 and under from the North West:

// ((region == "E12000001" & age_group >= 6) | (region == "E12000002" & age_group < 6))
let expr = (col("region")
    .eq(lit("E12000001")) // North East
    .and(col("age_group").gt_eq(lit(6)))) // 55 and over
.or(col("region")
    .eq(lit("E12000002")) // North West
    .and(col("age_group").lt(lit(6)))); // 54 and under

println!("{expr}"); // You can print it

You can print the expression to see how it’s being evaluated. This is especially useful when you use an IDE that can highlight bracket and parenthesis pairs.

[([([(col("region")) == ("E12000001")]) & ([(col("age_group")) >= (dyn int: 6)])]) | ([([(col("region")) == ("E12000002")]) & ([(col("age_group")) < (dyn int: 6)])])]

You can then apply the expression with .filter():

// Apply the expression to a LazyFrame
let lf_filt_complex = lf.clone().filter(expr);

Value is in a list

With the is_in crate feature enabled, you can check whether a col() value is within a list of lit(). The right side of the expression takes a polars Series, which can be built using Series::from_iter(vec![<vals>]). In this example, industry is subset to Manufacturing (2), Construction (4), Transport and communication (6) or Public administration, education and health (8).

// Using `is_in` crate feature with literals
let lf_filt_is_in = lf
    .clone()
    .filter(col("industry").is_in(lit(Series::from_iter(vec![2, 4, 6, 8])).implode(), false));

Lazy evaluation optimization

Filtering is a perfect example to show how LazyFrames run optimized queries, especially when using partitioned parquet files, as created in the Parquet chapter. This example can be run with cargo run -r --example 3_1_2_filter_opt (release mode is important for simple benchmarking).

Note

Optimization also works when connecting to data on the Cloud.

First, let's enable verbose output to understand what Polars is doing.

unsafe {
    env::set_var("POLARS_VERBOSE", "1");
}

Note

Note that this code is wrapped in Rust’s unsafe {}. This is not uncommon in Rust, as it’s a way to explicitly force the user to acknowledge that the code they are using could potentially cause memory issues or not be thread-safe. As explained in the documentation of set_var, “This function is safe to call in a single-threaded program” and “This function is also always safe to call on Windows, in single-threaded and multi-threaded programs.” Therefore, this function is only unsafe in multi-threaded programs on Linux or macOS. This program is single-threaded, so no concerns here! I recommend exploring the unsafe Rust documentation a bit to familiarize yourself with safety in Rust.

Next, let's connect to the ./data/large/census.parquet file that contains over 60 million rows of Census data in one extremely compressed parquet file (approximately 13 MB). Let's filter this file to the region of London (region = “E12000007”), for those aged 45 to 54 (age_group = 5), and non-null values for income. Remember, this code creates an execution plan, but does not yet execute it.

// Connect to LazyFrame (one large parquet file)
let args = ScanArgsParquet::default();
let lf_one =
    LazyFrame::scan_parquet(PlPath::from_str("./data/large/census.parquet"), args).unwrap();

// Filter it
let lf_one = lf_one
    .filter(col("region").eq(lit("E12000007"))) // London
    .filter(col("age_group").eq(lit(5))) // Age 45 to 54
    .filter(col("income").is_not_null());

Second, let's connect to the ./data/large/partitioned dataset, which is partitioned by region and by age_group. Overall, the files in this partitioned dataset folder contain over 60 million rows of Census data, split among 70 extremely compressed parquet files totalling approximately 40 MB. As with the large parquet file, this data will be filtered to the region of London (region = “E12000007”), for those aged 45 to 54 (age_group = 5), and non-null values for income. Again, nothing is executed at this point.

// Connect to LazyFrame (partitioned parquet file)
let args = ScanArgsParquet::default();
let lf_part =
    LazyFrame::scan_parquet(PlPath::from_str("./data/large/partitioned"), args).unwrap();

// Filter it
let lf_part = lf_part
    .filter(col("region").eq(lit("E12000007"))) // London
    .filter(col("age_group").eq(lit(5))) // Age 45 to 54
    .filter(col("income").is_not_null());

Now, let's collect both LazyFrames into memory and measure execution time:

let before = std::time::Instant::now();
let _ = lf_one.collect().unwrap();
println!("Elapsed time: {:.2?}", before.elapsed());

let before = std::time::Instant::now();
let _ = lf_part.collect().unwrap();
println!("Elapsed time: {:.2?}", before.elapsed());

This is when the lazy query is executed and, since verbose was enabled, we can understand what steps are being taken.

For the verbose output of the ./data/large/census.parquet file you will see that there is 1 source, and the entirety of the source needs to be read:

[MultiScanTaskInit]: 1 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER | MAPPED_COLUMN_PROJECTION), n_readers_pre_init: 1, max_concurrent_scans: 1
[MultiScanTaskInit]: predicate: Some("<predicate>"), skip files mask: None, predicate to reader: Some("<predicate>")

For the verbose output of the ./data/large/partitioned files you will see that while there are 70 files, 69 of them can be skipped before being read:

[MultiScanTaskInit]: 70 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER | MAPPED_COLUMN_PROJECTION), n_readers_pre_init: 19, max_concurrent_scans: 16
[MultiScan]: Predicate pushdown allows skipping 69 / 70 files

For the large file, the entirety of the file has to be scanned (i.e. every row has to be checked against all filters), but for the partitioned dataset, only one file is scanned, and the filter is applied only to the rows in that one smaller file. Partitioning allows filters on the partition columns (here, region and age_group) to skip entire files.

This yields substantial time improvements for queries that filter on the partition columns. In the example above, collecting the data from both sources shows a significant difference: while timings vary from run to run, the partitioned dataset is typically between 2x and 5x faster. For extremely large queries (billions of rows) this can have massive advantages.