How To: Common Icechunk Operations#
This page gathers common Icechunk operations into one compact how-to guide. It is not intended as a deep explanation of how Icechunk works.
Creating and Opening Repos#
Creating and opening repos requires creating a Storage object. See the Storage guide for all the details.
Create a New Repo#
storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)
Open an Existing Repo#
Specify Custom Config when Opening a Repo#
There are many configuration options available to control the behavior of the repository and the storage backend. See Configuration for all the details.
config = icechunk.RepositoryConfig.default()
config.caching = icechunk.CachingConfig(num_bytes_chunks=100_000_000)
repo = icechunk.Repository.open(storage, config=config)
Deleting a Repo#
Icechunk doesn't provide a way to delete a repo once it has been created. If you need to delete a repo, just go to the underlying storage and remove the directory where you created the repo.
Reading, Writing, and Modifying Data with Zarr#
Read and write operations occur within the context of a transaction. The general pattern is:
session = repo.writable_session(branch="main")
# interact with the repo via session.store
# ...
session.commit(message="wrote some data")
Info
In the examples below, we just show the interaction with the store object. Keep in mind that all sessions need to be concluded with a .commit().
Alternatively, you can also use the .transaction function as a context manager, which automatically commits when the context exits.
with repo.transaction(branch="main", message="wrote some data") as store:
    # interact with the repo via store
    ...
Create a Group#
Create an Array#
Write Data to an Array#
Read Data from an Array#
Resize an Array#
Add or Modify Array / Group Attributes#
View Array / Group Attributes#
Delete a Group#
Delete an Array#
Reading and Writing Data with Xarray#
Write an in-memory Xarray Dataset#
Append to an existing dataset#
Write an Xarray dataset with Dask#
Writing with Dask or any other parallel execution framework requires special care. See Parallel writes and Xarray for more detail.
Read a dataset with Xarray#
Reading can be done with a read-only session.
session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)
Transactions and Version Control#
For more depth, see Transactions and Version Control.
Create a Snapshot via a Transaction#
Resolve a Commit Conflict#
If the branch has moved but the changes do not actually conflict, rebase and commit again:
try:
    session.commit("commit message")
except icechunk.ConflictError:
    session.rebase(icechunk.ConflictDetector())
    session.commit("committed after rebasing")
Or, if the commits do conflict and you want your changes to overwrite the others:
try:
    session.commit("commit message")
except icechunk.ConflictError:
    session.rebase(icechunk.BasicConflictSolver(on_chunk_conflict=icechunk.VersionSelection.UseOurs))
    session.commit("committed after rebasing")
Commit with Automatic Rebasing#
This will automatically retry the commit until it succeeds.
List Snapshots#
Check out a Snapshot#
Create a Branch#
List all Branches#
Check out a Branch#
Reset a Branch to a Different Snapshot#
Create a Tag#
List all Tags#
Check out a Tag#
Delete a Tag#
Repo Maintenance#
For more depth, see Data Expiration.
Run Snapshot Expiration#
from datetime import datetime, timedelta, timezone
# use a timezone-aware (UTC) timestamp for the cutoff
expiry_time = datetime.now(timezone.utc) - timedelta(days=10)
expired = repo.expire_snapshots(older_than=expiry_time)
Run Garbage Collection#
Usage in async contexts#
Most methods in Icechunk have an async counterpart, named with an _async suffix. For more info, see Async Usage.