-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tutorial on how to write migration jobs #426
Conversation
@seanlip PTAL at this PR, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ash-2k3! Sending a partial review, will continue later.
|
||
# Scenario | ||
|
||
In this tutorial, we will address an issue where the `user_bio` field in `UserSettingsModel` allows users to enter bios of unrestricted length. While this provided users with flexibility in expressing themselves, it has become necessary to enforce a length limit of 200 characters. This change ensures consistency and allows UI designers to reliably allocate space for displaying bios, improving the overall user experience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the "While this ... overall user experience." part to "For the purposes of this tutorial, imagine that the technical team has decided to enforce a length limit of 200 characters for this field, in order to ensure consistency and allow UI designers to reliably allocate space for displaying bios."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, Thanks1
|
||
Our goal is to identify which storage model stores the fields shown on this page. There are multiple ways to approach this: | ||
|
||
1. **Code Exploration**: Review the relevant files (e.g., `gae_models.py`) to manually inspect the fields. Keep in mind that this approach may not be ideal for new contributors who are just beginning to familiarize themselves with the codebase. It is typically more suitable for those who have already spent significant time working with and understanding the structure of the codebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are switching suddenly from first/second-person to third-person here. It might be better to address the contributor directly, e.g. "Note that, if you're new to the codebase, ...".
Also I'm not really sure you even need the first point here. I think you can say something like "...which storage model stores the fields shown on this page. If you know which one it is already, you can inspect the fields directly. Otherwise, here is how to find out ..." (feel free to reword/simplify as needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds Good, I have replaced it with what you suggested. Thanks!
|
||
# Procedure | ||
|
||
## Section 1: Navigating and Understanding the Preferences Page |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give this a clearer title that makes it clear when the section is "done". How would the developer know that the
"understanding" part has been completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdyt about "Understanding Preferences Page and Identifying Key Changes" ? I think it provides a clearer sense of when this step is "done" — once the key changes are identified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then have the section title be "Identify the Key Changes to Make". "Understand" is woolly.
That said, I wonder if "Identify the parts of the code that need to be changed" is better, or perhaps "Stop the problem from getting worse"? Conceptually the latter is what you are actually doing, and it complements the Beam job. Or maybe "Stop the error from occurring for new data" ... something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I have renamed it - Prevent New Data Violations - Identify Code Changes for Bio Length Limit. Thanks
> [!IMPORTANT] | ||
> Practice 1: Familiarize yourself with the codebase architecture at Oppia. Understanding the structure of the codebase will help you navigate through various layers of code at Oppia more efficiently. Follow this guide: [Overview of the Oppia Codebase](https://github.com/oppia/oppia/wiki/Overview-of-the-Oppia-codebase). | ||
|
||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both in the alt text and image filename, I think it's the "Preferences" page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this, done!
Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md
Show resolved
Hide resolved
|
||
For this tutorial, we will choose the **truncation** approach, ensuring all user bios conform to the 200-character limit. | ||
|
||
***Note**: While this approach is simple and sufficient for the purposes of this tutorial, it may lead to a sub-optimal user experience as it truncates user input without providing feedback or allowing edits. This is not necessarily a best practice for real-world applications but serves as an illustrative example for learning.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'd want to explain that, in practice, for decisions like this that aren't clear, the developer should compile a list of the different options and their pros/cons and then discuss with the product team and technical leads what the best thing to do is (for changes that affect user experience). Then you can explain separately that, for the purposes of this tutorial, the team leads chose option 1.
Also, why is there a trailing * at the end of this paragraph? There are also three stars before "Note" and two after it.
Would it also be better to use things like https://github.com/orgs/community/discussions/16925 for notes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the changes. Regarding the formatting of notes, I think it's fine as it is rn for the sake of consistency across other tutorials.
But I do think what you are suggesting will look better. Should we open an issue to track this ? We can first finalise what are the formats we want to follow for practise questions, notes, callouts etc and then apply them to all tutorials. wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address all parts of reviewers' comments when responding. You have missed the second paragraph entirely, AFAICT.
I think you can just apply the "Note" callout if it looks better. I'm not sure we need a long discussion for that. You can file an issue to update it for the other tutorials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Text between two stars is rendered as bold text and between stars it's rendered as Italic font. If you see the preview of this portion, you'll notice it.
I didn't skip it, but I thought it's covered in my second para (where I asked about formatting of notes). I should have been clear but.
Regarding the notes, I would like to get the major comments approved first (The content changes), then I will update the format of the notes (Will do a self review and also make any changes as needed to the format)
> Practice 8: Take a notebook and try drafting a rough workflow of what our job would do, using boxes for the steps and arrows to connect different steps. | ||
> | ||
> Hint: | ||
> - Read Everything First: Start by reading all the necessary data at the beginning of the job. This ensures that you have all the required information before performing any operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make "Read Everything First." bold and put a period at the end of it, instead of a colon. Similarly for the other bullet points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use sentence case: "Read everything first." "Process data in steps." "Write everything last."
This helps make clearer that it's a sentence/instruction and not a slogan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I don't really understand why ending with full stop is better than with semi column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually a colon, not a semicolon.
TBH, I think that specific distinction is more a matter of style -- the "slogan" comment was more for the sentence case vs capital letters. For me, having periods means that one can scan just the bold parts ("Read everything first. Process data in steps. ...") and it's still coherent, whereas a colon means that both parts should be read together. But this specific case is more of a preference and not really a hard-and-fast rule.
4. **Reusable Patterns**: Follow established patterns and conventions for audit jobs in the Oppia codebase. | ||
|
||
> [!IMPORTANT] | ||
> Practice 9: Based on the above explanation and thought process. Can you write down the Audit Job for our use case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first sentence of this is a sentence fragment; please update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for catching this, rephrased it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Please also change "Audit Job" to "audit job" if you don't capitalize it elsewhere in normal usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
2. Simulate truncation logic for these records without saving the changes. | ||
3. Provide a detailed report of all affected records, ensuring we are confident in the data to be modified before running the actual migration job. | ||
|
||
#### **Thought Process for the Audit Job** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a thought process, it's more like a list of factors to consider. A process is more like an ordered series of steps, e.g. applying each of these in turn to the job you're trying to write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the heading. Do you think it's better now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so.
One other thought when reading: I think the best practice for audit jobs is to use the DATASTORE_UPDATES_ALLOWED
property and have the audit/main beam jobs be forced to do exactly the same thing by subclassing -- you can see examples of this in the codebase. If the author follows that then that should make it easier to ensure that the jobs don't deviate from each other. Can you make the tutorial align with this rule? That might also help shorten this section since you just need to explain how to write the job while making appropriate use of that property. (I know there are jobs in the codebase which don't follow this, but we actually need to make them do so -- this is on my list for dev workflow but let's not exacerbate the current problem.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I have made some changes to section 2 and section 3.
1. **Simulating Logic**: The audit job must simulate the exact same steps as the main Beam job to ensure consistency in logic and results. | ||
2. **Read-Only Operations**: Unlike the main job, an audit job should not persist any changes to the datastore. This avoids unintended side effects during testing. | ||
3. **Detailed Reporting**: The job should generate a detailed report or log indicating the records that require updates. This transparency helps validate the scope and correctness of the job. | ||
4. **Reusable Patterns**: Follow established patterns and conventions for audit jobs in the Oppia codebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help if you gave a bit more detail on what these are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update is still pretty vague and doesn't actually give more detail on what these are. See previous comment about DATASTORE_UPDATES_ALLOWED which could probably help you nail this down a lot more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated this section, PTAL!
Tutorial-Learn-how-to-make-a-backend-change-that-has-an-impact-on-server-data.md
Outdated
Show resolved
Hide resolved
The `AuditTruncateUserBioJob` is implemented alongside the main Beam job in the `user_bio_truncation_jobs.py` file. Here’s how it can be implemented: | ||
|
||
```python | ||
class AuditTruncateUserBioJob(base_jobs.JobBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we generally want is for the audit and main jobs to use the same functions (rather than duplicating code). Could you give guidance on structuring the jobs to ensure that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some changes to the code for that.
* **Bios exceeding the limit**: Are truncated to 200 characters. | ||
* **Empty bios**: Remain unaffected. | ||
|
||
### **Conclusion** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain, somewhere in this tutorial, the "sequence of operations"? I.e. they need to do the domain-layer fix to stop new bios > 200 chars getting added. When do they make this fix, when do they run the job on the server, when are things deployed, etc. -- what is the sequence of operations?
I think this is something that probably needs to be explained at the outset of the tutorial, because fixing an issue that affects existing data is an entire process and that process needs to happen in the right order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, but I think mentioning it at the beginning would be better. My reasoning behind this is that folks should be aware of the entire process before even making code changes (If that makes sense).
Just an update, I will address comments on this PR by Sunday (12 Jan) Edit - will address by EOW. |
Just an update, I have addressed most of the comments. Two are yet remaining, I ll have to think through for addressing them. |
Thanks -- I've left replies to the ones you responded to. PTAL. |
…-on-server-data.md
…-on-server-data.md
…-on-server-data.md
@seanlip Thanks for the review, I have addressed your comments PTAL. (Formatting of Notes is not yet down, I am putting it for last once the content gets finalised, I will do a self review and fix any formatting issues.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Ash-2k3 -- took a pass, on a quick skim it looks good in general!
Could you please update the notes and fix the remaining small issues, then reassign for final review? Thanks.
- [Section 4: Testing the Beam Job](#section-4-testing-the-beam-job) | ||
- [Section 5: Run and Validate the Job](#section-5-run-and-validate-the-job) | ||
- [**Conclusion**](#conclusion) | ||
- [We Value Your Feedback](#we-value-your-feedback) | ||
|
||
# Introduction | ||
|
||
In this tutorial, you will learn how to safely implement backend changes that impact server data, an essential skill for any developer working with applications that store data. We’ll cover key concepts like modifying data models, writing and testing Beam Jobs (Audit and Migration), and documenting a reliable launch process. These skills are fundamental for maintaining data integrity, ensuring data consistency, and avoiding disruptions during backend updates. | ||
In this tutorial, you will learn how to safely implement backend changes that impact server data, an essential skill for any developer working with data-storing applications. We’ll cover key concepts like modifying data models, writing, and testing Beam jobs (Audit and Migration). These skills are fundamental for maintaining data integrity, ensuring data consistency, and avoiding disruptions during backend updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence starting "We'll" seems incorrect ... it says "we'll cover key concepts like (1) modifying data models, (2) writing, and (3) testing Beam jobs".
"(2) writing" is a general skill and is probably not what you mean to say, but it's what's implied by the comma.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the comma shouldn't have been here. Removed it, thanks for catching this!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably more correct to say "We’ll cover key concepts like modifying data models, as well as writing and testing Beam jobs...", since otherwise the first comma needs to be an "and" since you really only have two things here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG, Done!
1. **Modify the Backend Layer to Prevent Future Violations** | ||
- Update the backend validation to ensure that new or updated bios cannot exceed 200 characters. | ||
- This ensures that, once we fix existing data, no new invalid entries will be introduced. | ||
2. **Implement the Data Migration and Audit Jobs** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, consider having single lines separating each of these numbered points (i.e. add a new line above this line, and similarly below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG, Done, thanks!
> Practice 8: Take a notebook and try drafting a rough workflow of what our job would do, using boxes for the steps and arrows to connect different steps. | ||
> | ||
> Hint: | ||
> - Read Everything First: Start by reading all the necessary data at the beginning of the job. This ensures that you have all the required information before performing any operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually a colon, not a semicolon.
TBH, I think that specific distinction is more a matter of style -- the "slogan" comment was more for the sentence case vs capital letters. For me, having periods means that one can scan just the bold parts ("Read everything first. Process data in steps. ...") and it's still coherent, whereas a colon means that both parts should be read together. But this specific case is more of a preference and not really a hard-and-fast rule.
Thanks for the review @seanlip, I have addressed your comments and also changed the formatting of notes, PTAL, Thanks! (And also sorry for the delay here.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small note, otherwise no concerns. Thanks!
Addressed the last comment, PTAL @seanlip, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM.
Fix #364
Link to Google doc - https://docs.google.com/document/d/1kOTBlrrCKu436A7pL4KnzWYHGoBYTxYhVI-VG3pRMJo/edit?tab=t.0