How To: Common Icechunk Operations#

This page gathers common Icechunk operations into one compact how-to guide. It is not intended as a deep explanation of how Icechunk works.

Creating and Opening Repos#

Creating or opening a repo requires a Storage object. See the Storage guide for all the details.

Create a New Repo#

import icechunk
storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)

Open an Existing Repo#

repo = icechunk.Repository.open(storage)

Specify Custom Config when Opening a Repo#

There are many configuration options available to control the behavior of the repository and the storage backend. See Configuration for all the details.

config = icechunk.RepositoryConfig.default()
config.caching = icechunk.CachingConfig(num_bytes_chunks=100_000_000)
repo = icechunk.Repository.open(storage, config=config)

Deleting a Repo#

Icechunk doesn't provide a way to delete a repo once it has been created. To delete a repo, go to the underlying storage and remove the directory (or object-store prefix) where you created it.
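
For a repo stored on a local filesystem, removal amounts to deleting its directory. The sketch below uses only the Python standard library; `delete_local_repo` and `repo_path` are hypothetical names, not part of Icechunk. For object stores such as S3, delete the corresponding prefix with the store's own tooling instead.

```python
import shutil
from pathlib import Path


def delete_local_repo(repo_path: str) -> bool:
    """Remove the directory that holds a local repo (irreversible).

    Returns True if a directory was removed, False if none existed there.
    `repo_path` is a hypothetical location chosen when the repo was created.
    """
    path = Path(repo_path)
    if not path.is_dir():
        return False
    shutil.rmtree(path)  # deletes the directory and everything beneath it
    return True
```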

Reading, Writing, and Modifying Data with Zarr#

Read and write operations occur within the context of a transaction. The general pattern is:

session = repo.writable_session(branch="main")
# interact with the repo via session.store
# ...
session.commit(message="wrote some data")

Info

The examples below show only the interaction with the store object. Keep in mind that every writable session must be concluded with a .commit() for its changes to be saved.

Alternatively, you can use the .transaction method as a context manager, which automatically commits when the context exits.

with repo.transaction(branch="main", message="wrote some data") as store:
    # interact with the repo via store
    ...
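
To make the relationship between the two patterns concrete, here is a rough hand-written equivalent of the context manager, built on `writable_session` and `commit`. It is a sketch, not Icechunk's implementation: `simple_transaction` is a hypothetical helper, and it assumes that nothing should be committed if the body raises.

```python
from contextlib import contextmanager


@contextmanager
def simple_transaction(repo, branch, message):
    """Yield a store for `branch` and commit once the block finishes cleanly.

    Hypothetical helper: `repo` is expected to provide `writable_session`,
    and the session to provide `store` and `commit`, as in the pattern above.
    """
    session = repo.writable_session(branch=branch)
    yield session.store
    # Reached only if the `with` body did not raise an exception.
    session.commit(message=message)
```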

Create a Group#

import zarr
group = zarr.create_group(session.store, path="my-group", zarr_format=3)

Create an Array#

array = group.create("my_array", shape=(10, 20), dtype='int32')

Write Data to an Array#

array[2:5, :10] = 1

Read Data from an Array#

data = array[:5, :10]

Resize an Array#

array.resize((20, 30))

Add or Modify Array / Group Attributes#

array.attrs["standard_name"] = "time"

View Array / Group Attributes#

dict(array.attrs)

Delete a Group#

del group["subgroup"]

Delete an Array#

del group["array"]

Reading and Writing Data with Xarray#

Write an in-memory Xarray Dataset#

ds.to_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)

Append to an existing dataset#

ds.to_zarr(session.store, group="my-group", append_dim='time', consolidated=False)

Write an Xarray dataset with Dask#

Writing with Dask or any other parallel execution framework requires special care. See Parallel writes and Xarray for more detail.

from icechunk.xarray import to_icechunk
to_icechunk(ds, session)

Read a dataset with Xarray#

Reading can be done with a read-only session.

import xarray as xr
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)

Transactions and Version Control#

For more depth, see Transactions and Version Control.

Create a Snapshot via a Transaction#

snapshot_id = session.commit("commit message")

Resolve a Commit Conflict#

If the concurrent commits do not actually conflict, rebase onto the latest commit and retry:

try:
    session.commit("commit message")
except icechunk.ConflictError:
    session.rebase(icechunk.ConflictDetector())
    session.commit("committed after rebasing")

If the commits do conflict and you want your changes to overwrite the others:

try:
    session.commit("commit message")
except icechunk.ConflictError:
    session.rebase(icechunk.BasicConflictSolver(on_chunk_conflict=icechunk.VersionSelection.UseOurs))
    session.commit("committed after rebasing")

Commit with Automatic Rebasing#

This automatically rebases with the given solver and retries the commit until it succeeds.

session.commit("commit message", rebase_with=icechunk.ConflictDetector())
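
When more control over retries is wanted (for example a hard cap, or logging between attempts), the try/except pattern can be wrapped in a loop by hand. The sketch below is illustrative only: `commit_with_retries` and `max_tries` are hypothetical names, and the conflict exception type is passed in as an argument so the helper stays self-contained. With Icechunk you would pass `icechunk.ConflictError` and a solver such as `icechunk.ConflictDetector()`.

```python
def commit_with_retries(session, message, solver, conflict_error, max_tries=5):
    """Commit, rebasing with `solver` after each conflict, at most `max_tries` times.

    Returns whatever `session.commit` returns. Re-raises the conflict error
    if the final attempt still conflicts.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return session.commit(message)
        except conflict_error:
            if attempt == max_tries:
                raise  # out of attempts: surface the conflict to the caller
            # Replay our changes on top of the new branch tip, then try again.
            session.rebase(solver)
```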

List Snapshots#

for snapshot in repo.ancestry(branch="main"):
    print(snapshot)
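
The ancestry can also be filtered to locate a particular snapshot. The helper below is a small sketch: `find_snapshot_ids` is a hypothetical name, and it assumes each snapshot object exposes `id` and `message` attributes.

```python
def find_snapshot_ids(snapshots, text):
    """Return the IDs of snapshots whose commit message contains `text`.

    Assumes each item has `id` and `message` attributes; the order of the
    input iterable is preserved.
    """
    return [snap.id for snap in snapshots if text in snap.message]
```

For example, `find_snapshot_ids(repo.ancestry(branch="main"), "wrote some data")` would return the matching IDs, any of which can then be checked out as a snapshot.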

Check out a Snapshot#

session = repo.readonly_session(snapshot_id=snapshot_id)

Create a Branch#

repo.create_branch("dev", snapshot_id=snapshot_id)

List all Branches#

branches = repo.list_branches()

Check out a Branch#

session = repo.writable_session("dev")

Reset a Branch to a Different Snapshot#

repo.reset_branch("dev", snapshot_id=snapshot_id)

Create a Tag#

repo.create_tag("v1.0.0", snapshot_id=snapshot_id)

List all Tags#

tags = repo.list_tags()

Check out a Tag#

session = repo.readonly_session(tag="v1.0.0")

Delete a Tag#

repo.delete_tag("v1.0.0")

Repo Maintenance#

For more depth, see Data Expiration.

Run Snapshot Expiration#

from datetime import datetime, timedelta, timezone
# Use a timezone-aware timestamp for the cutoff
expiry_time = datetime.now(timezone.utc) - timedelta(days=10)
expired = repo.expire_snapshots(older_than=expiry_time)

Run Garbage Collection#

results = repo.garbage_collect(expiry_time)

Usage in async contexts#

Most methods in Icechunk have an async counterpart, named with an _async suffix. For more info, see Async Usage.

results = await repo.garbage_collect_async(expiry_time)
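
A bare `await` only works inside a coroutine (or an async-aware REPL such as a notebook). From a regular script, the call can be wrapped in a coroutine and driven with `asyncio.run`. In this sketch `run_gc` is a hypothetical name; `repo` and `expiry_time` stand for the objects used in the examples above.

```python
import asyncio


async def run_gc(repo, expiry_time):
    """Await the async garbage-collection call and return its result."""
    return await repo.garbage_collect_async(expiry_time)


# From synchronous code (hypothetical usage):
# results = asyncio.run(run_gc(repo, expiry_time))
```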