Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Cloud

Polars can connect to cloud storage solution such as AWS S3, Azure Blob Storage and Google Cloud Storage. This methods allows for lazy evaluation of cloud objects. In this example, we will show how to connect to AWS S3, set up in the optional s3 bucket section of the data chapter.

important

Reminder: make sure that the minio server is running (minio server ./data/minio) before running these examples.

Cloud options

To connect to the cloud of your choice, you have to set up the cloud options: use with_aws and AmazonS3ConfigKey for S3 buckets, use with_gcp and GoogleConfigKey for Google Cloud Storage, and use with_azure and AzureConfigKey for Azure Blob Storage. Run this code using cargo run -r --example 2_5_1_read_cloud.

For the default configuration for the minio serer, connect using with_aws:

use polars::prelude::*;

let cloud_options = cloud::CloudOptions::default().with_aws(vec![
    (cloud::AmazonS3ConfigKey::AccessKeyId, "minioadmin"),
    (cloud::AmazonS3ConfigKey::SecretAccessKey, "minioadmin"),
    (cloud::AmazonS3ConfigKey::Region, "us-east-1"),
    (cloud::AmazonS3ConfigKey::Bucket, "census"),
    (cloud::AmazonS3ConfigKey::Endpoint, "http://127.0.0.1:9000"),
]);

Reading

For .csv files, in the same way as was shown for the CSV data stored locally, you can get a LazyFrame from LazyCsvReader with data on the cloud, by passing the cloud_options created above to with_cloud_options():

// Connect to LazyFrame (no data is brought into memory)
let lf = LazyCsvReader::new(PlPath::from_str("s3://census/census.csv"))
    .with_cloud_options(Some(cloud_options.clone()))
    .finish()
    .unwrap();

println!("{:?}", println!("{}", lf.limit(5).collect().unwrap()));
shape: (5, 21)
┌─────────────────┬────────┬───────┬──────┬───┬─────┬───────────┬────────┬───────┐
│ id              ┆ social ┆ birth ┆ econ ┆ … ┆ sex ┆ keep_type ┆ income ┆ chunk │
│ ---             ┆ ---    ┆ ---   ┆ ---  ┆   ┆ --- ┆ ---       ┆ ---    ┆ ---   │
│ str             ┆ i64    ┆ i64   ┆ i64  ┆   ┆ i64 ┆ i64       ┆ i64    ┆ i64   │
╞═════════════════╪════════╪═══════╪══════╪═══╪═════╪═══════════╪════════╪═══════╡
│ PTS000000348231 ┆ 2      ┆ 1     ┆ -8   ┆ … ┆ 1   ┆ 1         ┆ 59292  ┆ 47    │
│ PTS000000059235 ┆ 1      ┆ 1     ┆ -8   ┆ … ┆ 1   ┆ 1         ┆ 25731  ┆ 47    │
│ PTS000000060206 ┆ 1      ┆ 1     ┆ -8   ┆ … ┆ 2   ┆ 1         ┆ 88277  ┆ 47    │
│ PTS000000468982 ┆ 3      ┆ 1     ┆ -8   ┆ … ┆ 2   ┆ 1         ┆ 82954  ┆ 47    │
│ PTS000000224308 ┆ 2      ┆ 1     ┆ -8   ┆ … ┆ 2   ┆ 1         ┆ 82315  ┆ 47    │
└─────────────────┴────────┴───────┴──────┴───┴─────┴───────────┴────────┴───────┘

For .parquet files, in the same way as was shown for the Parquet data stored locally, you can get a LazyFrame from scan_parquet with data on the cloud, by adding cloud_options to the ScanArgsParquet. This works for both individual parquet files or partitioned parquet files.

note

For partitioned parquet files on the cloud, the / at the end of s3://census/partitioned/ is required (unlike on local data).

// Connect to LazyFrame (no data is brought into memory)
let args = ScanArgsParquet {
    cloud_options: Some(cloud_options.clone()),
    ..Default::default()
};
let lf = LazyFrame::scan_parquet(PlPath::from_str("s3://census/partitioned/"), args).unwrap();

// Print first 5 rows
println!("{}", lf.limit(5).collect().unwrap());
shape: (5, 21)
┌─────────────────┬────────┬───────┬──────┬───┬─────┬───────────┬────────┬───────┐
│ id              ┆ social ┆ birth ┆ econ ┆ … ┆ sex ┆ keep_type ┆ income ┆ chunk │
│ ---             ┆ ---    ┆ ---   ┆ ---  ┆   ┆ --- ┆ ---       ┆ ---    ┆ ---   │
│ str             ┆ i64    ┆ i64   ┆ i64  ┆   ┆ i64 ┆ i64       ┆ i64    ┆ i64   │
╞═════════════════╪════════╪═══════╪══════╪═══╪═════╪═══════════╪════════╪═══════╡
│ PTS000000067966 ┆ 2      ┆ 1     ┆ -8   ┆ … ┆ 1   ┆ 1         ┆ 92400  ┆ 47    │
│ PTS000000068645 ┆ 3      ┆ 1     ┆ -8   ┆ … ┆ 1   ┆ 1         ┆ 30008  ┆ 47    │
│ PTS000000067503 ┆ 3      ┆ 1     ┆ -8   ┆ … ┆ 2   ┆ 1         ┆ 82058  ┆ 47    │
│ PTS000000501140 ┆ 3      ┆ 1     ┆ -8   ┆ … ┆ 2   ┆ 1         ┆ 47252  ┆ 47    │
│ PTS000000501337 ┆ 3      ┆ 1     ┆ -8   ┆ … ┆ 1   ┆ 1         ┆ 40227  ┆ 47    │
└─────────────────┴────────┴───────┴──────┴───┴─────┴───────────┴────────┴───────┘

Writing

Writing to the cloud is similar to writing to local data. Instead of providing a std::fs::File writer, you provide a CloudWriter from polars. Run this code using cargo run -r --example 2_5_1_read_cloud. To write, you must have DataFrame in memory:

use polars::prelude::*;
use tokio::runtime::Runtime;

// Read file form local
let lf = LazyCsvReader::new(PlPath::from_str("./data/csv/census_0.csv"))
    .with_has_header(true)
    .finish()
    .unwrap();

// Bring it into memory (by converting it to DataFrame)
let mut df = lf.collect().unwrap();

You can then write a .csv or a .parquet to the cloud using the CloudWriter and the cloud_options created previously:

// Write `census_0.csv`
let mut cloudfile = Runtime::new()
    .unwrap()
    .block_on(cloud::BlockingCloudWriter::new(
        "s3://census/census_0.csv",
        Some(&cloud_options),
    ))
    .unwrap();
CsvWriter::new(&mut cloudfile).finish(&mut df).unwrap();

// Write `census_0.parquet`
let mut cloudfile = Runtime::new()
    .unwrap()
    .block_on(cloud::BlockingCloudWriter::new(
        "s3://census/census_0.parquet",
        Some(&cloud_options),
    ))
    .unwrap();
ParquetWriter::new(&mut cloudfile).finish(&mut df).unwrap();

You can also write a partitioned parquet file to the cloud with write_partitioned_dataset by passing the same cloud_options:

// Write partitioned `census_0.parquet` on "region" and "age_group"
// `write_partitioned_dataset` is considered unstable
write_partitioned_dataset(
    &mut df,
    PlPath::from_str("s3://census/census_0_part/").as_ref(),
    vec!["region".into(), "age_group".into()],
    &ParquetWriteOptions::default(),
    Some(&cloud_options),
    4294967296,
)
.unwrap();