Document workflow to incrementally create a minimal Kedro project from scratch #2512

Closed
yetudada opened this issue Apr 13, 2023 · 12 comments · Fixed by #4305
Labels: Component: Documentation 📄 (Issue/PR for markdown and API documentation); Issue: Feature Request (New feature or improvement to existing feature)

Comments

@yetudada
Contributor

Description

This issue was flagged in #410 and the user question is, "How do I use Kedro with an existing project?". This user journey assumes that you already have "a project that has a data/ directory, a src/ directory and more" and you want to turn this into a Kedro project and the question becomes, "What are the most minimal files and folders that I will require to turn my work into a Kedro project?"

Context

This user problem was raised in #410 and we've come across it in a user interview; the user was trying to convert an existing script into a Kedro project and tried to follow the Spaceflights tutorial to do this. They did not start from the default project template and had to troubleshoot why things were not working.

Possible Implementation

We propose a kedro init command to add the most minimal series of files and folders to an existing project so that you can use Kedro. I also think you would still need a workflow to guide users on how to write Kedro pipelines from a src directory. It would also be great to see this made into a video for publishing on YouTube.

@astrojuanlu
Member

astrojuanlu commented Jul 5, 2023

I'd love to see this happening, one way or another. This was one of my first desires as a Kedro newbie, and would solve some pain points around installation too. In fact, this issue is essentially the same as gh-681, and it has taken multiple forms (gh-1722, gh-2360).

As @amandakys already pointed out, this overlaps with gh-2388.

For well-formed Python libraries, the only steps needed are

However, I wouldn't assume that most of our users can produce a well-formed Python library in the first place, so more work would be needed. Tools like flit point you in the right direction:

$ mkdir test-kedro-init
$ cd test-kedro-init/
$ touch main_code.py
$ touch Untitled.ipynb
$ touch README.md
$ flit init
Module name [main_code]: 
Author [Juan Luis]: 
Author email [[email protected]]: 
Home page: 
Choose a license (see http://choosealicense.com/ for more info)
1. MIT - simple and permissive
2. Apache - explicitly grants patent rights
3. GPL - ensures that code based on this is shared with the same terms
4. Skip - choose a license later
Enter 1-4 [4]: 4

Written pyproject.toml; edit that file to add optional extra info.
$ ls
README.md       Untitled.ipynb  main_code.py    pyproject.toml

but then leave the last mile to the user:

$ pip install .
...
    flit_core.common.NoDocstringError: Flit cannot package module without docstring, or empty docstring. Please add a docstring to your module (/private/tmp/test-kedro-init/main_code.py).

@noklam
Contributor

noklam commented Aug 23, 2023

I created a minimal, non-standard Kedro project here: https://github.com/noklam/minimal_kedro/tree/main/my_weird_src/minimal_kedro

I tried to make one change per commit to show how I removed or moved things around. At any commit you should still be able to run kedro run.

The PR shows the diff between the project and the original pandas-iris starter.

Cc @amandakys

@amandakys

Thanks Nok! This is great. Just putting in some background for what we discussed.

Many parts of Kedro's project template are more configurable than we publicise. It is possible to make Kedro work with all sorts of modifications to the project template, but this might require certain workarounds, or knowledge of where to change the appropriate config. As a team, we are also not clear on this.

By creating a project that challenges all the assumptions of our project template (fewer directories, files in different places, etc.), we expose which elements of the project template the framework fundamentally depends on and which parts are more flexible. With this knowledge we can then move forward with developing a solution to help our users adopt Kedro into their projects in the least intrusive way possible. The first step might just be better documenting this flexibility and what is/isn't absolutely necessary.

Some key takeaways:

  • the conf folder and the base & local env folders can be removed by setting the conf source to "." and moving the configuration files into the root directory
  • almost all directories in the project template can be removed, and src can be renamed
  • pipeline_registry.py and settings.py need to have a specific place/name (see "More control over folder structure", #2553)
  • pyproject.toml has fields that are mandatory but whose values are not used/meaningful, e.g. project_name
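
The takeaways above can be sketched as a small stdlib-only script that drops the handful of files into an existing directory. This is an illustration under assumptions, not the contents of the linked repository: the [tool.kedro] field names follow recent Kedro templates, the version string is a placeholder, scaffold_minimal_project is a hypothetical helper name, and the thread does not show which config loader arguments (if any) are needed alongside CONF_SOURCE = ".".

```python
from pathlib import Path

# Assumption: field names follow recent Kedro project templates; the version
# value is a placeholder and should match the Kedro version actually in use.
PYPROJECT = (
    "[tool.kedro]\n"
    'package_name = "{package_name}"\n'
    'project_name = "{package_name}"\n'
    'kedro_init_version = "0.19.0"\n'
    'source_dir = "{source_dir}"\n'
)

# The thread reports that pointing the conf source at "." lets the conf/
# folder go away; further loader settings may be required (not shown there).
SETTINGS = 'CONF_SOURCE = "."\n'

# Mirrors the registry that recent project templates generate.
PIPELINE_REGISTRY = '''\
"""Project pipelines."""
from kedro.framework.project import find_pipelines


def register_pipelines():
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
'''


def scaffold_minimal_project(root, package_name, source_dir="src"):
    """Write the minimal files into root and return the paths created."""
    root = Path(root)
    package_dir = root / source_dir / package_name
    package_dir.mkdir(parents=True, exist_ok=True)

    files = {
        root / "pyproject.toml": PYPROJECT.format(
            package_name=package_name, source_dir=source_dir
        ),
        root / "catalog.yml": "# dataset definitions go here\n",
        package_dir / "__init__.py": "",
        package_dir / "settings.py": SETTINGS,
        package_dir / "pipeline_registry.py": PIPELINE_REGISTRY,
    }

    created = []
    for path, content in files.items():
        if not path.exists():  # never overwrite files the project already has
            path.write_text(content, encoding="utf-8")
            created.append(path)
    return created
```

Calling scaffold_minimal_project(".", "my_package") from the root of an existing repository would leave existing files untouched and only add the missing ones. An existing pyproject.toml is skipped rather than merged, so the [tool.kedro] table would still need to be added by hand in that case.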

Next steps:

  • further discussion on Nok's findings and whether the behaviour his test project shows is desirable/intentional
  • discussion on how to increase awareness of the possibility of a minimal Kedro project that isn't a standalone DataCatalog, but "full featured"

@noklam
Contributor

noklam commented Sep 14, 2023

Next steps:

  • further discussion on Nok's findings and whether the behaviour his test project shows is desirable/intentional
  • discussion on how to increase awareness of the possibility of a minimal Kedro project that isn't a standalone DataCatalog, but "full featured"

@amandakys Do you want to resume this discussion?

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Sep 21, 2023
@noklam
Contributor

noklam commented Sep 28, 2023

http://wiseideas.au/kedro-light/usage was mentioned today. This got me thinking about what would be included in a KedroProject class. Currently a "project" is conventions + pyproject.toml + settings.py.

This is out of scope, but thinking about the minimum needed to create a KedroProject purely from code could be a nice inspiration. Internally we have something called TestProject.

@astrojuanlu
Member

Where does the KedroProject class idea come from? I don't follow how it's related to the current discussion (but I like the idea)

@astrojuanlu
Member

astrojuanlu commented Oct 13, 2023

@astrojuanlu
Member

Discussed this again with @merelcht @yetudada @amandakys with input from @NeroOkwa and @stephkaiser:

  • Agreed that at least there should be a catalog
    • Potentially one could use Kedro without a catalog, but it feels very niche
  • "New users" = New to Kedro, but they could be beginners or intermediate & experts (further evidence in "Research summary of insights for improving Kedro's value", #2902)
  • Two journeys: kedro new (targeted towards beginners) or manual flow (targeted towards intermediate & expert users)
  • For the "manual flow": starting point can be
    • Nothing (not really, they already have something, otherwise they can use kedro new)
    • An existing notebook (docs we already wrote ✔️ + how to continue from there, hence "part 2")
    • An existing Python library or Cookiecutter template ("part 3", not written yet)
    • In fact the journey is (0) Notebook -> (1) Notebook + Kedro library components -> (2) Notebook + Kedro library components + Python package -> (3) Kedro Framework project
  • Minimal project files initial proposal: @noklam's https://github.com/noklam/minimal_kedro/ (commit history explains removals)
    • pipeline_registry.py can use kedro.framework.project.find_pipelines only if pipelines are in pipeline.py or follow the naming and structure convention (to be confirmed)
    • project_name is seemingly not used anywhere, mostly an artifact from kedro new? (to be confirmed)
    • We need to decide what to do with __main__.py, because it's required for some packaging workflows
  • The default of kedro new + <Enter> <Enter> ... should be the minimal project template (make sure the latest iteration of the new project creation flow achieves that; cc @amandakys, "Iterate on feedback on Project Creation Flow", #3054)
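
As a rough illustration of the "manual flow", the check below reports which of the files discussed in this thread are missing from an existing repository. It is a heuristic sketch under assumptions: missing_kedro_files is a hypothetical helper, the pyproject.toml is scanned with a regular expression rather than a TOML parser, and the required file set reflects what this thread reports rather than a guarantee from Kedro itself.

```python
import re
from pathlib import Path


def missing_kedro_files(root):
    """Return a list of problems; an empty list means the layout looks plausible."""
    root = Path(root)
    pyproject = root / "pyproject.toml"
    if not pyproject.is_file():
        return ["pyproject.toml not found"]

    text = pyproject.read_text(encoding="utf-8")
    if "[tool.kedro]" not in text:
        return ["pyproject.toml has no [tool.kedro] table"]

    def field(name, default=None):
        # Heuristic: matches `name = "value"` at the start of any line,
        # without checking that the line sits inside [tool.kedro].
        pattern = rf'^{re.escape(name)}\s*=\s*"([^"]*)"'
        match = re.search(pattern, text, flags=re.MULTILINE)
        return match.group(1) if match else default

    package_name = field("package_name")
    if package_name is None:
        return ["[tool.kedro] does not set package_name"]

    # The thread notes that src can be renamed; "src" is assumed as the default.
    package_dir = root / field("source_dir", "src") / package_name

    problems = []
    for filename in ("settings.py", "pipeline_registry.py"):
        if not (package_dir / filename).is_file():
            problems.append(f"{package_dir / filename} not found")
    return problems
```

Running missing_kedro_files(".") in a repository that is part-way through the conversion gives a short to-do list of the files still to be added.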

@merelcht
Member

Other than some of the open issues linked here, is there any more work to be done for this or can this issue be closed as being "as good as completed"?

@noklam
Contributor

noklam commented Mar 14, 2024

I think we can close this issue, there is a clear answer to the question: https://github.com/noklam/minimal_kedro/

We can leave the remaining discussion to the open issue, i.e. whether we should make the layout even more flexible (pipeline_registry and settings); there is no obvious demand at the moment.

@astrojuanlu
Member

Or we turn it into a docs issue in the spirit of #2512 (comment)

@astrojuanlu astrojuanlu added Component: Documentation 📄 Issue/PR for markdown and API documentation and removed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation labels Mar 15, 2024
@astrojuanlu astrojuanlu changed the title What are the most minimal files/folders that I need to convert an existing project into a Kedro one? Document workflow to incrementally create a minimal Kedro project from scratch Mar 15, 2024
@astrojuanlu
Member

Looks like some people are enjoying https://github.com/astrojuanlu/kedro-init, maybe we could just include it
