Kedro#
Using Kedro with uv#
Initialize the virtual environment:
uv venv
source .venv/bin/activate
Edit the Kedro project’s pyproject.toml:
######## Comment out these lines!
# dynamic = [ "dependencies", "version",]
# [tool.setuptools.dynamic.dependencies]
# file = "requirements.txt"
######## Add this below [project]
version = "1.0.0"
Add the requirements.txt dependencies to the pyproject.toml:
uv add -r requirements.txt
Debugging Kedro pipelines#
Accessing Kedro from a notebook#
One easy way to debug Kedro pipelines is via a notebook.
You can launch a notebook in your environment by using
kedro jupyter notebook
or, if like me you work in VSCode, you can do the following:
From the Kedro project root directory, run:
jupyter server
In VSCode, open a notebook and use “Connect to an existing Jupyter server” with the address http://localhost:8888.
You will find the kedro kernel running in the dropdown list of available kernels.
If you didn’t set your Jupyter server password beforehand, the URL will contain a token, which takes time to copy/paste. You can also define a permanent password for your Jupyter server with jupyter server password, so you don’t have to change the server address every time.
Running a pipeline#
The freshly instantiated Jupyter kernel has some pre-existing variables (see also: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html):
catalog
context
pipelines
session
The pipelines variable contains the pipelines declared in the pipeline registry.
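For example, you can poke at these variables directly (the dataset name below is a hypothetical placeholder, not from a real project):

# List the pipelines declared in the pipeline registry
print(list(pipelines))

# List the datasets declared in the data catalog, then load one of them
print(catalog.list())
df = catalog.load("my_input_dataset")  # hypothetical dataset name

# Inspect the resolved project parameters
print(context.params)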
In order to run a pipeline you can use:
session.run(pipeline_name)
However, if run multiple times, you will encounter this error: KedroSessionError: A run has already been completed as part of the active KedroSession.
In order to re-run the same pipeline multiple times, you need to run it in different sessions:
from kedro.framework.session import KedroSession

with KedroSession.create(project_path=my_project_path) as session:
    session.run(pipeline_name)
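As a sketch, this pattern also lets you re-run the same pipeline with different runtime parameters, one fresh session per run; the pipeline and parameter names below are illustrative assumptions, and extra_params overrides values from parameters.yml:

from pathlib import Path
from kedro.framework.session import KedroSession

project_path = Path.cwd()  # assuming the notebook runs at the project root

for threshold in (0.1, 0.5, 0.9):
    # Each iteration gets its own session, which avoids the KedroSessionError
    with KedroSession.create(project_path=project_path,
                             extra_params={"threshold": threshold}) as session:
        session.run(pipeline_name="data_processing")  # hypothetical name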
Running a chunk of a pipeline, gathering the output as Python objects#
You can have a lot of different nodes in one pipeline.
It quickly becomes tedious and time-consuming to write every node output as an intermediate dataset in the data catalog.
To avoid that, you can run the pipeline from the notebook and request specific node outputs directly:
with KedroSession.create(project_path=project_path) as session:
    output = session.run(pipeline_name, to_outputs=["min_timestamp", "max_timestamp"])
The output will be a dictionary mapping output_name to output_object.
This works as well for Spark DataFrames, meaning that you can directly use:
output["spark_dataframe"].show()
Please note that if you want the notebook to be in sync with your codebase, you will have to use the %reload_kedro magic!
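A typical edit/run loop then looks like this (a sketch; the pipeline and output names are assumptions):

# Refresh catalog, context, pipelines and session after editing the code
%reload_kedro

# The reloaded session is fresh, so run() works again
output = session.run("my_pipeline", to_outputs=["min_timestamp"])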
This debugging process is highly effective, as there is no need to restart the notebook kernel to get an up-to-date codebase. This means you can use a notebook in parallel with your development to check the impact of your code modifications on the actual data.
Working Spark 3.4/Kedro environment#
Warning: If you are on Windows, using WSL2 as the micromamba/conda source, make sure to DISABLE any antivirus during the install process (to avoid kernel heap memory corruption), and re-enable it afterwards (so the controlled folder access policy works again and you can access files in Jupyter).
micromamba env create -f environment.yml
environment.yml
name: wp
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- pip
- python==3.10.*
- pyspark==3.4.1
- h3-pyspark>=1.2.0
- shapely>=2.0
- kedro~=0.18.0
- java-jdk>=8.0.0
- black
- flake8
- jupyter
- mypy
- pip:
- kedro-datasets[spark,pandas]==1.8.*
- pytest
- pytest-cov
- pytest-mock
- databricks-cli
- kedro-viz
- hypothesis
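Once the environment is created and activated, a quick smoke test (a generic PySpark sketch, nothing Kedro-specific) confirms that PySpark picks up the JDK:

from pyspark.sql import SparkSession

# Start a local Spark session and run a trivial job; if this prints the
# Spark version and a one-row table, the JDK and pyspark install work.
spark = SparkSession.builder.master("local[*]").appName("smoke_test").getOrCreate()
print(spark.version)
spark.range(1).show()
spark.stop()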