Tutorial on how to write migration jobs #426

Ash-2k3 · 2024-12-22T18:44:33Z

Fix #364

Link to Google doc - https://docs.google.com/document/d/1kOTBlrrCKu436A7pL4KnzWYHGoBYTxYhVI-VG3pRMJo/edit?tab=t.0

Ash-2k3 · 2024-12-26T17:21:45Z

@seanlip PTAL at this PR, thanks!

seanlip

Thanks @Ash-2k3! Sending a partial review, will continue later.

seanlip · 2024-12-27T02:17:06Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+
+# Scenario
+
+In this tutorial, we will address an issue where the `user_bio` field in `UserSettingsModel` allows users to enter bios of unrestricted length. While this provided users with flexibility in expressing themselves, it has become necessary to enforce a length limit of 200 characters. This change ensures consistency and allows UI designers to reliably allocate space for displaying bios, improving the overall user experience.


Change the "While this ... overall user experience." part to "For the purposes of this tutorial, imagine that the technical team has decided to enforce a length limit of 200 characters for this field, in order to ensure consistency and allow UI designers to reliably allocate space for displaying bios."

Done, thanks!

seanlip · 2024-12-27T02:21:28Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+
+Our goal is to identify which storage model stores the fields shown on this page. There are multiple ways to approach this:
+
+1. **Code Exploration**: Review the relevant files (e.g., `gae_models.py`) to manually inspect the fields. Keep in mind that this approach may not be ideal for new contributors who are just beginning to familiarize themselves with the codebase. It is typically more suitable for those who have already spent significant time working with and understanding the structure of the codebase.  


You are switching suddenly from first/second-person to third-person here. It might be better to address the contributor directly, e.g. "Note that, if you're new to the codebase, ...".

Also I'm not really sure you even need the first point here. I think you can say something like "...which storage model stores the fields shown on this page. If you know which one it is already, you can inspect the fields directly. Otherwise, here is how to find out ..." (feel free to reword/simplify as needed).

Sounds Good, I have replaced it with what you suggested. Thanks!

seanlip · 2024-12-27T02:22:21Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+
+# Procedure
+
+## Section 1: Navigating and Understanding the Preferences Page


Give this a clearer title that makes it clear when the section is "done". How would the developer know that the "understanding" part has been completed?

Wdyt about "Understanding Preferences Page and Identifying Key Changes" ? I think it provides a clearer sense of when this step is "done" — once the key changes are identified.

Then have the section title be "Identify the Key Changes to Make". "Understand" is woolly.

That said, I wonder if "Identify the parts of the code that need to be changed" is better, or perhaps "Stop the problem from getting worse"? Conceptually the latter is what you are actually doing, and it complements the Beam job. Or maybe "Stop the error from occurring for new data" ... something like that.

Done, I have renamed it to "Prevent New Data Violations - Identify Code Changes for Bio Length Limit". Thanks!

seanlip · 2024-12-27T02:22:52Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+> [!IMPORTANT]
+> Practice 1: Familiarize yourself with the codebase architecture at Oppia. Understanding the structure of the codebase will help you navigate through various layers of code at Oppia more efficiently. Follow this guide: [Overview of the Oppia Codebase](https://github.com/oppia/oppia/wiki/Overview-of-the-Oppia-codebase).
+
+![Screenshot of Preference Page](images/TutorialMigrationJob/preferencePage.png)


Both in the alt text and image filename, I think it's the "Preferences" page.

Thanks for catching this, done!

seanlip · 2024-12-27T02:29:26Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+
+For this tutorial, we will choose the **truncation** approach, ensuring all user bios conform to the 200-character limit.
+
+***Note**: While this approach is simple and sufficient for the purposes of this tutorial, it may lead to a sub-optimal user experience as it truncates user input without providing feedback or allowing edits. This is not necessarily a best practice for real-world applications but serves as an illustrative example for learning.*


I think you'd want to explain that, in practice, for decisions like this that aren't clear, the developer should compile a list of the different options and their pros/cons and then discuss with the product team and technical leads what the best thing to do is (for changes that affect user experience). Then you can explain separately that, for the purposes of this tutorial, the team leads chose option 1.

Also, why is there a trailing * at the end of this paragraph? There are also three stars before "Note" and two after it.

Would it also be better to use things like https://github.com/orgs/community/discussions/16925 for notes?

I have made the changes. Regarding the formatting of notes, I think it's fine as it is right now, for the sake of consistency with the other tutorials.

But I do think what you are suggesting will look better. Should we open an issue to track this? We can first finalise the formats we want to follow for practice questions, notes, callouts, etc., and then apply them to all tutorials. Wdyt?

Please address all parts of reviewers' comments when responding. You have missed the second paragraph entirely, AFAICT.

I think you can just apply the "Note" callout if it looks better. I'm not sure we need a long discussion for that. You can file an issue to update it for the other tutorials.

Text between double stars is rendered as bold, and text between single stars is rendered as italic. If you look at the preview of this portion, you'll notice it.

I didn't skip it; I thought it was covered in my second paragraph (where I asked about the formatting of notes). I should have been clearer, though.

Regarding the notes, I would like to get the major comments (the content changes) approved first. Then I will update the format of the notes: I will do a self-review and make any changes needed to the format.

seanlip · 2024-12-27T02:31:29Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+> Practice 8: Take a notebook and try drafting a rough workflow of what our job would do, using boxes for the steps and arrows to connect different steps. 
+> 
+> Hint: 
+> - Read Everything First: Start by reading all the necessary data at the beginning of the job. This ensures that you have all the required information before performing any operations.


Make "Read Everything First." bold and put a period at the end of it, instead of a colon. Similarly for the other bullet points.

Please use sentence case: "Read everything first." "Process data in steps." "Write everything last."

This helps make clearer that it's a sentence/instruction and not a slogan.

Done, but I don't really understand why ending with a full stop is better than ending with a semicolon.

It's actually a colon, not a semicolon.

TBH, I think that specific distinction is more a matter of style -- the "slogan" comment was more for the sentence case vs capital letters. For me, having periods means that one can scan just the bold parts ("Read everything first. Process data in steps. ...") and it's still coherent, whereas a colon means that both parts should be read together. But this specific case is more of a preference and not really a hard-and-fast rule.

seanlip · 2024-12-27T02:32:07Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+4. **Reusable Patterns**: Follow established patterns and conventions for audit jobs in the Oppia codebase.
+
+> [!IMPORTANT]
+> Practice 9: Based on the above explanation and thought process. Can you write down the Audit Job for our use case.


The first sentence of this is a sentence fragment; please update.

Ah thanks for catching this, rephrased it.

Thanks. Please also change "Audit Job" to "audit job" if you don't capitalize it elsewhere in normal usage.

seanlip · 2024-12-27T02:32:30Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+2. Simulate truncation logic for these records without saving the changes.  
+3. Provide a detailed report of all affected records, ensuring we are confident in the data to be modified before running the actual migration job.
+
+#### **Thought Process for the Audit Job**


This isn't a thought process, it's more like a list of factors to consider. A process is more like an ordered series of steps, e.g. applying each of these in turn to the job you're trying to write.

Renamed the heading. Do you think it's better now?

Yes, I think so.

One other thought when reading: I think the best practice for audit jobs is to use the DATASTORE_UPDATES_ALLOWED property and have the audit/main beam jobs be forced to do exactly the same thing by subclassing -- you can see examples of this in the codebase. If the author follows that then that should make it easier to ensure that the jobs don't deviate from each other. Can you make the tutorial align with this rule? That might also help shorten this section since you just need to explain how to write the job while making appropriate use of that property. (I know there are jobs in the codebase which don't follow this, but we actually need to make them do so -- this is on my list for dev workflow but let's not exacerbate the current problem.)

Done, I have made some changes to section 2 and section 3.
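
To illustrate the pattern the reviewer describes, here is a minimal sketch in plain Python. It is not the code from this PR and it does not use Apache Beam or Oppia's `base_jobs` module: the main job class name `TruncateUserBioJob`, the `run()` method, and the dict standing in for the datastore are all assumptions made for illustration. Only the `DATASTORE_UPDATES_ALLOWED` property and the `AuditTruncateUserBioJob` name come from the discussion.

```python
from typing import Dict, List, Tuple

MAX_BIO_LENGTH = 200  # Limit taken from the tutorial scenario.


class TruncateUserBioJob:
    """Hypothetical main job; a stand-in, not the actual Oppia job class."""

    # Property name from the review comment; the rest is illustrative.
    DATASTORE_UPDATES_ALLOWED = True

    def __init__(self, datastore: Dict[str, str]) -> None:
        # A plain dict of user_id -> bio stands in for the real datastore.
        self.datastore = datastore

    def run(self) -> List[Tuple[str, int]]:
        """Returns (user_id, original_length) for each bio over the limit."""
        report = []
        updates = {}
        for user_id, bio in self.datastore.items():
            if len(bio) > MAX_BIO_LENGTH:
                report.append((user_id, len(bio)))
                updates[user_id] = bio[:MAX_BIO_LENGTH]
        # The write step is the only place where the two jobs differ.
        if self.DATASTORE_UPDATES_ALLOWED:
            self.datastore.update(updates)
        return report


class AuditTruncateUserBioJob(TruncateUserBioJob):
    """Runs exactly the same logic as the main job, but never writes."""

    DATASTORE_UPDATES_ALLOWED = False
```

Because the audit job only overrides the flag, any change to the selection or truncation logic automatically applies to both jobs, which is the property the reviewer is asking for. Bios at or under the limit, including empty ones, are left untouched by either job.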

seanlip · 2024-12-27T02:33:30Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+1. **Simulating Logic**: The audit job must simulate the exact same steps as the main Beam job to ensure consistency in logic and results.  
+2. **Read-Only Operations**: Unlike the main job, an audit job should not persist any changes to the datastore. This avoids unintended side effects during testing.  
+3. **Detailed Reporting**: The job should generate a detailed report or log indicating the records that require updates. This transparency helps validate the scope and correctness of the job.  
+4. **Reusable Patterns**: Follow established patterns and conventions for audit jobs in the Oppia codebase.


I think it would help if you gave a bit more detail on what these are.

The update is still pretty vague and doesn't actually give more detail on what these are. See previous comment about DATASTORE_UPDATES_ALLOWED which could probably help you nail this down a lot more.

Thanks, updated this section, PTAL!

seanlip · 2024-12-27T03:19:26Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+The `AuditTruncateUserBioJob` is implemented alongside the main Beam job in the `user_bio_truncation_jobs.py` file. Here’s how it can be implemented:
+
+```python
+class AuditTruncateUserBioJob(base_jobs.JobBase):


One thing we generally want is for the audit and main jobs to use the same functions (rather than duplicating code). Could you give guidance on structuring the jobs to ensure that?

Made some changes to the code for that.

seanlip · 2024-12-27T03:22:22Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+     * **Bios exceeding the limit**: Are truncated to 200 characters.  
+     * **Empty bios**: Remain unaffected.
+
+### **Conclusion**


Can you explain, somewhere in this tutorial, the "sequence of operations"? I.e. they need to do the domain-layer fix to stop new bios > 200 chars getting added. When do they make this fix, when do they run the job on the server, when are things deployed, etc. -- what is the sequence of operations?

I think this is something that probably needs to be explained at the outset of the tutorial, because fixing an issue that affects existing data is an entire process and that process needs to happen in the right order.

Sounds good, but I think mentioning it at the beginning would be better. My reasoning behind this is that folks should be aware of the entire process before even making code changes (If that makes sense).

Ash-2k3 · 2025-01-10T23:54:59Z

Just an update, I will address comments on this PR by Sunday (12 Jan)

Edit - will address by EOW.

Ash-2k3 · 2025-01-19T15:05:33Z

Just an update: I have addressed most of the comments. Two still remain; I'll have to think through how to address them.

seanlip · 2025-01-20T14:29:11Z

Thanks -- I've left replies to the ones you responded to. PTAL.

Ash-2k3 · 2025-02-02T16:02:32Z

@seanlip Thanks for the review, I have addressed your comments PTAL.

(Formatting of notes is not yet done. I am leaving it for last, once the content is finalised; I will then do a self-review and fix any formatting issues.)

seanlip

Hi @Ash-2k3 -- took a pass, on a quick skim it looks good in general!

Could you please update the notes and fix the remaining small issues, then reassign for final review? Thanks.

seanlip · 2025-02-04T17:23:04Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

  - [Section 4: Testing the Beam Job](#section-4-testing-the-beam-job)
  - [Section 5: Run and Validate the Job](#section-5-run-and-validate-the-job)
    - [**Conclusion**](#conclusion)
      - [We Value Your Feedback](#we-value-your-feedback)

 # Introduction

-In this tutorial, you will learn how to safely implement backend changes that impact server data, an essential skill for any developer working with applications that store data. We’ll cover key concepts like modifying data models, writing and testing Beam Jobs (Audit and Migration), and documenting a reliable launch process. These skills are fundamental for maintaining data integrity, ensuring data consistency, and avoiding disruptions during backend updates.
+In this tutorial, you will learn how to safely implement backend changes that impact server data, an essential skill for any developer working with data-storing applications. We’ll cover key concepts like modifying data models, writing, and testing Beam jobs (Audit and Migration). These skills are fundamental for maintaining data integrity, ensuring data consistency, and avoiding disruptions during backend updates.


The sentence starting "We'll" seems incorrect ... it says "we'll cover key concepts like (1) modifying data models, (2) writing, and (3) testing Beam jobs".

"(2) writing" is a general skill and is probably not what you mean to say, but it's what's implied by the comma.

Ah, the comma shouldn't have been here. Removed it, thanks for catching this!!

Probably more correct to say "We’ll cover key concepts like modifying data models, as well as writing and testing Beam jobs...", since otherwise the first comma needs to be an "and" since you really only have two things here.

seanlip · 2025-02-04T17:24:51Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+1. **Modify the Backend Layer to Prevent Future Violations**
+	- Update the backend validation to ensure that new or updated bios cannot exceed 200 characters.
+	- This ensures that, once we fix existing data, no new invalid entries will be introduced.
+2. **Implement the Data Migration and Audit Jobs**


For readability, consider having single lines separating each of these numbered points (i.e. add a new line above this line, and similarly below).

SG, Done, thanks!
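
As a rough illustration of step 1 in the hunk above (preventing new violations before the existing data is migrated), the check could look something like the sketch below. The function name `validate_user_bio` and the constant `MAX_BIO_LENGTH` are hypothetical and are not taken from the Oppia codebase; the tutorial's actual validation code may be structured differently.

```python
MAX_BIO_LENGTH = 200  # Limit taken from the tutorial scenario.


def validate_user_bio(user_bio: str) -> None:
    """Raises ValueError if the bio is not a string or exceeds the limit.

    Hypothetical helper for illustration; not the actual Oppia validator.
    """
    if not isinstance(user_bio, str):
        raise ValueError(
            'Expected user_bio to be a string, received %r' % (user_bio,))
    if len(user_bio) > MAX_BIO_LENGTH:
        raise ValueError(
            'User bio exceeds the maximum length of %d characters '
            '(received %d).' % (MAX_BIO_LENGTH, len(user_bio)))
```

A check like this rejects invalid input instead of silently truncating it, so the user can be told what went wrong. Truncation is then only needed for data that was stored before the check existed.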

seanlip · 2025-02-04T17:29:37Z

Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

+> Practice 8: Take a notebook and try drafting a rough workflow of what our job would do, using boxes for the steps and arrows to connect different steps. 
+> 
+> Hint: 
+> - Read Everything First: Start by reading all the necessary data at the beginning of the job. This ensures that you have all the required information before performing any operations.


It's actually a colon, not a semicolon.

TBH, I think that specific distinction is more a matter of style -- the "slogan" comment was more for the sentence case vs capital letters. For me, having periods means that one can scan just the bold parts ("Read everything first. Process data in steps. ...") and it's still coherent, whereas a colon means that both parts should be read together. But this specific case is more of a preference and not really a hard-and-fast rule.

Ash-2k3 · 2025-02-23T08:12:46Z

Thanks for the review @seanlip, I have addressed your comments and also changed the formatting of notes, PTAL, Thanks!

(And also sorry for the delay here.)

seanlip

One small note, otherwise no concerns. Thanks!

Ash-2k3 · 2025-02-23T19:21:18Z

Addressed the last comment, PTAL @seanlip, thanks!

seanlip

Thanks! LGTM.

Ash-2k3 added 4 commits December 23, 2024 00:13

Draft tutorial on backend migration

6347b1f

Add TOC

550aee4

Add images

963c8f7

Add link to the tutorial in the sidebar

9d2399a

Ash-2k3 changed the title ~~Draft tutorial on backend migration~~ Tutorial on how to write migration jobs Dec 26, 2024

Ash-2k3 marked this pull request as ready for review December 26, 2024 17:21

Ash-2k3 assigned seanlip Dec 26, 2024

seanlip requested changes Dec 27, 2024

seanlip reviewed Dec 27, 2024

seanlip assigned Ash-2k3 and unassigned seanlip Dec 27, 2024

Ash-2k3 added 2 commits January 19, 2025 20:33

Address review comments

09ec567

Address review comments

7aa94a1

Ash-2k3 added 3 commits February 2, 2025 14:02

Update Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

e854d5e

Update Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

13600ee

Update Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md

42c163e

Ash-2k3 assigned seanlip and unassigned Ash-2k3 Feb 2, 2025

seanlip requested changes Feb 4, 2025

seanlip assigned Ash-2k3 and unassigned seanlip Feb 4, 2025

Ash-2k3 added 2 commits February 23, 2025 13:30

Change formatting of Notes

e519f45

Address review comments.

e629b3d

Ash-2k3 assigned seanlip and unassigned Ash-2k3 Feb 23, 2025

seanlip reviewed Feb 23, 2025

seanlip assigned Ash-2k3 and unassigned seanlip Feb 23, 2025

Address review comments.

458b789

Ash-2k3 assigned seanlip and unassigned Ash-2k3 Feb 23, 2025

seanlip approved these changes Feb 24, 2025

seanlip merged commit 63a160e into oppia:develop Feb 24, 2025
3 checks passed

