Blog post: The story of `kedro-telemetry` - from start to now #125

yetudada · 2024-01-08T11:13:46Z

Introduction

I thought it'd be cool to share a detailed story of our kedro-telemetry journey – where we started, the challenges we faced, and how far we've come. It’s been quite a ride, and I think it’s essential for all of us to understand the backstory, especially as we keep improving this tool.

What were the early days?

Remember how nervous we were about starting telemetry? We all saw those threads on Reddit and Hacker News about other open-source projects getting heat for how they handled user data. Plus, being privacy nerds ourselves, we wanted to ensure we were doing right by our users. And, of course, there was that added pressure of Kedro being an enterprise-owned open-source project back then – we didn’t want any missteps to affect our reputation.

Therefore, we had serious brainstorming sessions with our InfoSec and Legal teams to ensure we were GDPR compliant. This was a challenge because we had to interpret the law ourselves and work out how it applied to us. The legal team that we work with now is in LF AI & Data.

GDPR will always apply to us because we have users in the EU ✅

What design and architectural decisions did we make?

User consent and transparency: We emphasised an opt-in/opt-out mechanism for telemetry participation via the CLI and recorded the user's decision in a .telemetry file that was not committed to git. Users were asked to opt in to kedro-telemetry; if they said yes, the decision applied only to the project where .telemetry was present, and it covered the Kedro CLI, the Kedro-Viz CLI and the Kedro-Viz UI.
Scope of data collection: We deliberately limited data to project, user, and feature statistics and avoided personal user data. This included anonymising project and user data with hashing.
Directional insights over exact figures: Given the opt-in nature, we aimed for broad trends rather than precise user data. We learned this from the Great Expectations product team, who also struggle to derive exact insights.
Internal user identification: We developed a methodology for identifying internal users while respecting their autonomy in opting for telemetry. It used a hashmap of usernames to identify internal users only because we could hash internal usernames. This methodology is inactive now - talk to @datajoely.
Separation from Kedro Framework: We kept telemetry separate from the framework so that users could remove it without impacting their core experience with Kedro.
Documentation: Users can find detailed documentation on data collection by reading our data collection methodology.
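
To make the consent mechanism above concrete, here is a minimal sketch of recording and reading a per-project decision. It is an illustration only, not the plugin's actual code: the `consent: true/false` line format, the function names `record_consent` and `read_consent`, and the plain-text parsing are all assumptions made for this example.

```python
from pathlib import Path
from typing import Optional

# Per-project file that holds the decision; it should be listed in .gitignore.
TELEMETRY_FILE = ".telemetry"


def record_consent(project_dir: Path, consent: bool) -> None:
    """Store the user's decision in the project's .telemetry file (assumed format)."""
    line = f"consent: {'true' if consent else 'false'}\n"
    (project_dir / TELEMETRY_FILE).write_text(line, encoding="utf-8")


def read_consent(project_dir: Path) -> Optional[bool]:
    """Return True/False if a decision was recorded, or None if there is none yet."""
    path = project_dir / TELEMETRY_FILE
    if not path.exists():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        key, _, value = raw.partition(":")
        if key.strip() == "consent":
            return value.strip().lower() == "true"
    return None
```

Returning `None` when the file is missing keeps "never asked" distinct from "said no", which matters because the decision only applies to the project where the file is present.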

Opt-in/opt-out workflow of `kedro-telemetry`

What data do we collect?

kedro-telemetry has evolved to collect more data as we have had more questions about our users. It's easiest to see aspects of this as a table and describe additional collection points. When users opt in to kedro-telemetry, it collects project and user metadata, records usage of the Kedro Framework and Kedro-Viz CLI, and tracks all feature usage of the Kedro-Viz UI. Identifying project metadata (project name and package name) and user metadata (computer name and username) is hashed to meet our anonymity requirements.

| Description | Example Input | What we receive |
| --- | --- | --- |
| CLI command (masked arguments) | `kedro run --pipeline=ds --env=test` | `kedro run --pipeline *** --env ***` |
| (Hashed) Package name | my-project | 1c7cd944c28cd888904f3efc2345198507... |
| (Hashed) Project name | my_project | a6392d359362dc9827cf8688c9d634520e... |
| (Hashed) Username | my_username | ec3759e2c570d302e65ea20a7d985... |
| `kedro` project version | 0.17.6 | 0.17.6 |
| `kedro-telemetry` version | 0.1.2 | 0.1.2 |
| Python version | 3.8.10 (default, Jun 2 2021, 10:49:15) | 3.8.10 (default, Jun 2 2021, 10:49:15) |
| Operating system used | darwin | darwin |
| Number of datasets | 7 | 7 |
| Number of pipelines | 2 | 2 |
| Number of nodes | 12 | 12 |
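
The first rows of the table can be illustrated with a short sketch of argument masking and identifier hashing. This is not the plugin's implementation: the helper names `mask_cli_arguments` and `hash_identifier` are hypothetical, and SHA-512 is an assumed choice of hash function used here only for illustration.

```python
import hashlib
from typing import List


def mask_cli_arguments(argv: List[str]) -> str:
    """Keep command and option names, replace every option value with ***."""
    masked: List[str] = []
    expect_value = False  # True when the previous token was an option without "="
    for token in argv:
        if token.startswith("-"):
            name, sep, _value = token.partition("=")
            masked.append(name)
            if sep:
                # Form "--option=value": drop the value immediately.
                masked.append("***")
                expect_value = False
            else:
                # Form "--option value": the next plain token is treated as its value.
                expect_value = True
        elif expect_value:
            masked.append("***")
            expect_value = False
        else:
            masked.append(token)
    return " ".join(masked)


def hash_identifier(value: str) -> str:
    """One-way hash so the raw name is never sent (SHA-512 is an assumption)."""
    return hashlib.sha512(value.encode("utf-8")).hexdigest()


print(mask_cli_arguments("kedro run --pipeline=ds --env=test".split()))
# prints: kedro run --pipeline *** --env ***
```

One design note on this sketch: a plain token that follows a boolean flag is also masked, so the helper errs on the side of hiding too much rather than too little.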

What was the original data collection strategy for `kedro-telemetry`?

Here's what the first version of kedro-telemetry proposed doing:

What analytics tools does `kedro-telemetry` integrate with?

To facilitate in-depth data analysis, kedro-telemetry employs Heap Analytics and Snowflake databases as data stores. This integration allows us to process complex datasets and glean valuable insights into how users interact with Kedro, influencing our development strategies.

yetudada · 2024-01-08T15:58:10Z

@idanov had a great point about reading more into GDPR to understand the design of kedro-telemetry.

There's one important thing though: we follow opt-in-based consent because of GDPR. Here are the differences: https://termly.io/resources/articles/opt-in-vs-opt-out/

And here's a nice table comparing the two: https://seersco.com/articles/opt-in-vs-opt-out-consent/

Opt-in flow in the context of data collection means that the user has to explicitly give their consent before we start collecting any data.

Opt-out flow means that the user has the right to withdraw their consent at any time, but we might still start collecting data by default even without their initial consent.
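
The difference between the two flows comes down to what happens when no decision has been recorded yet. A minimal sketch, with hypothetical function names and `None` standing for "the user has not been asked or has not answered":

```python
from typing import Optional


def may_collect_opt_in(consent: Optional[bool]) -> bool:
    """Opt-in: collect only after an explicit yes."""
    return consent is True


def may_collect_opt_out(consent: Optional[bool]) -> bool:
    """Opt-out: collect by default, stop only after an explicit no."""
    return consent is not False


# The two policies disagree only when there is no recorded decision.
print(may_collect_opt_in(None), may_collect_opt_out(None))
# prints: False True
```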

GDPR requires that users be given the option to enable cookies of their own free will. Since there are various types of cookies serving different purposes, such as advertising cookies and analytics cookies, the user must have separate opt-in checkboxes for the different cookie categories. In short, the GDPR requires consent to be opt-in.

GDPR defines consent as “freely given, specific, informed and unambiguous” given by a “clear affirmative action.” It is not acceptable to assign consent through the data subject’s silence or by supplying “pre-ticked boxes.”

stichbury · 2024-01-09T10:36:08Z

I'm really pleased to see this, thanks @yetudada for the writeup! I spotted a gap for some telemetry content last year (#59) and figured we could rank strongly for it if we write something, so this is ideal. I'm reading/reviewing today 👀

stichbury · 2024-01-12T14:59:29Z

Moved this into the kedro-devrel repo so I can use it to form the basis of a blog post.

yetudada mentioned this issue Jan 8, 2024

Improving our understanding of our users with kedro-telemetry kedro-org/kedro-plugins#510

Open

stichbury transferred this issue from kedro-org/kedro-plugins Jan 12, 2024

stichbury added the Blog post creation Blog posts (ideas and execution) label Jan 12, 2024

stichbury changed the title ~~The story of kedro-telemetry - from start to now~~ Blog post: The story of kedro-telemetry - from start to now Jan 12, 2024

astrojuanlu added this to Kedro DevRel Apr 15, 2024

astrojuanlu removed this from Kedro Framework Apr 15, 2024
