Bat Team
The Bat Team focuses on removing obstacles for regular sprint work. Each week two engineers are on-call and during that week they, along with the engineering lead, form the Bat Team. Their mission is to:
- Respond to production incidents and update the team and users as necessary
- Triage product manager and support requests related to production issues (primarily through resolving support requests in the #appeals-batteam channel)
- Monitor Slack for Sentry alerts
- Improve the code base to reduce technical debt (if time permits, by picking up issues labeled `tech-improvement` from their team's sprint backlog)
When an issue arises, the Bat Team assumes initial ownership. It is their responsibility to evaluate the importance of the issue relative to the other active issues at that point in time and commit to the appropriate level of support or clearly transfer ownership to someone else (engineer on the relevant sprint team, Engineering Lead, PM, etc).
For example, a PagerDuty alert that Caseflow is down takes priority over a support ticket to remedy a data error.
For example, a work stoppage support ticket (an employee can't do any work) may take priority over a support ticket about fixing a contention on an appeal.
As you serve on Bat Team, you will improve at making these judgment calls. If there is any ambiguity, reach out to the Engineering Lead.
The name "Bat Team" refers to the "batman" military role. It is a growing software industry practice and an experiment in agile development.
The Bat Team listens in the #appeals-batteam channel. They should be the only engineers actively participating in that channel during the week, since the whole goal of the Bat Team is removing distractions from the regular sprint team work.
Daily async standup: The team does async Slack standup reports each day, listing what happened yesterday, their plans for today, and any blockers they are experiencing. The role of the engineering lead is to remove those blockers.
Friday afternoon Bat Team handoff meeting: Both the current Bat Team and the upcoming Bat Team should attend a handoff meeting on Friday afternoon (check the VA Appeals calendar) in order to reflect on the week and share recommendations and context about any carryover issues. During this time, one member should write down notes in the Bat Team Running Handoffs Doc, so future Bat Team members have easy access and trends can be tracked long-term. When you start your Bat Shift, you should read over the Handoffs doc since your last shift for important updates.
The most important priority for Bat Team members is to respond and facilitate resolution of production Caseflow incidents as they arise. A production incident is any unplanned interruption to Caseflow or a reduction in quality or a failure of a dependency that impacts Caseflow. These will most likely be detected as Datadog or PagerDuty alerts on #appeals-app-alerts, by a Support team member responding to a user request, or by users themselves on Slack or email.
If a production incident has been identified:
- Assign a severity level to the production incident (low, high or extreme) based on the postmortem guidelines.
- If the incident is low: the Bat Team should triage to the appropriate sprint team and hand-off the accountability and responsibility of resolving the issue to an engineer on the sprint team. Reach out to the Engineering Lead if there are questions.
- If the incident is high or extreme: the Bat Team must immediately notify the Engineering Lead, who will become accountable for the resolution of the issue. The Bat Team is expected to spend a minimum amount of time (~1 hour depending on capacity) investigating the issue, and after that should hand-off to the Engineering Lead, who will then assemble a SWAT team in #appeals-swat to resolve the issue.
- The Caseflow Status Page must be updated by either a Bat Team member or the Engineering Lead. The person who identified the production incident should post in #appeals-support, tag the current Bat Team members and the Engineering Lead, and receive acknowledgement before updating.
RACI chart for the team roles involved in a production incident
Note: this is provided as a guide and may not apply uniformly
Severity | Responsible | Accountable | Consulted | Informed |
---|---|---|---|---|
Low | Sprint team engineers | Sprint team engineer | PM, Caseflow Eng Team | Caseflow Support Team |
High, Extreme | SWAT Team | Engineering Lead | Caseflow Team | Caseflow Support Team |
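
The routing in the table above can be sketched as a small lookup. This is only an illustration: `IncidentRoute`, `INCIDENT_ROUTES`, and `route_incident` are hypothetical names, not part of Caseflow, and the table itself is described as a guide rather than a strict rule.

```ruby
# Hypothetical lookup mirroring the RACI table; not part of Caseflow.
IncidentRoute = Struct.new(:responsible, :accountable, :notify_eng_lead_now, keyword_init: true)

INCIDENT_ROUTES = {
  low: IncidentRoute.new(
    responsible: "Sprint team engineers",
    accountable: "Sprint team engineer",
    notify_eng_lead_now: false
  ),
  high: IncidentRoute.new(
    responsible: "SWAT Team",
    accountable: "Engineering Lead",
    notify_eng_lead_now: true
  ),
  extreme: IncidentRoute.new(
    responsible: "SWAT Team",
    accountable: "Engineering Lead",
    notify_eng_lead_now: true
  )
}.freeze

# Accepts a symbol or string severity and raises on anything unrecognized.
def route_incident(severity)
  INCIDENT_ROUTES.fetch(severity.to_s.downcase.to_sym) do
    raise ArgumentError, "unknown severity: #{severity.inspect}"
  end
end

puts route_incident("High").accountable
```

Raising on an unknown severity, instead of defaulting to low, keeps an unclassified incident from being routed quietly to a sprint team.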
After a production issue has occurred, the engineering lead will schedule a postmortem meeting for the engineering team and will document the incident in the post-mortems folder in the appeals-deployment repository.
Join the #appeals-batteam channel. Support and product folks will be in there and will bring issues to our attention. The Bat Team is expected to respond and facilitate resolution to all requests from Support on this channel. This does not mean the Bat Team must be the only ones to fix issues, but they should be the first responders and loop in others only as necessary.
If you want to acknowledge a Slack message but don't have time to look at it immediately, use the 🦇 emoji response.
If you are actively looking into a Slack message, add the 👀 emoji response.
When a Slack message has been adequately addressed (question answered, GitHub issue created, etc.), add the ✅ emoji.
Prioritize pairing during investigations and starting a Slack call in the #appeals-batteam channel. This gives everyone a chance to learn about parts of Caseflow that they might not know as well.
If a Support ticket cannot be resolved within 7 days, a GitHub issue must be created to track long-term engineering resolution following the guidelines below.
If a GitHub ticket needs to be created to follow up on a support ticket, to address a flakey test or alert (see below), or for anything else that comes up during Bat Team, please create a GitHub issue in the caseflow repository and include the appropriate labels as well as relevant links (to Slack conversations, test results, Sentry, etc.).
- type (required): `bug`, `sentry-alert`, `tech-improvement`
- sprint team: `echo`, `foxtrot`, `tango`, `delta`
- product: `caseflow-queue`, `hearings`, etc.
Do not use the `batteam` GitHub label. This label is not standardized and does not provide extra information.
These tickets will be picked up during the product team's backlog grooming process and be prioritized in sprints accordingly.
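As a rough illustration of the labeling guidance, the check below flags an issue that lacks a type label or that carries the discouraged `batteam` label. The label names come from the guidance above; the `label_problems` helper is hypothetical and is not an existing Caseflow or GitHub tool.

```ruby
# Label names come from the guidance above; the helper itself is hypothetical.
TYPE_LABELS = %w[bug sentry-alert tech-improvement].freeze
DISCOURAGED_LABELS = %w[batteam].freeze

# Returns a list of human-readable problems; an empty list means the labels look fine.
def label_problems(labels)
  problems = []
  if (labels & TYPE_LABELS).empty?
    problems << "add one type label: #{TYPE_LABELS.join(', ')}"
  end
  (labels & DISCOURAGED_LABELS).each do |label|
    problems << "remove the #{label} label"
  end
  problems
end

p label_problems(%w[bug echo caseflow-queue])
p label_problems(%w[batteam hearings])
```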
Each week, Bat Team members should divide responsibility for monitoring the team's alert Slack channels for Sentry alerts. These come from the production Caseflow system.
When a new Sentry alert appears in Slack, it should be investigated as soon as possible. If you cannot investigate it immediately, tag it with the 🦇 emoji.
Prioritize pairing during investigations. This gives everyone a chance to learn about parts of Caseflow that they might not know as well.
If a GitHub ticket already exists for the underlying issue, the Sentry alert should be ignored for a month.
If a GitHub ticket does not yet exist, create one with a link to the Sentry incident in the ticket description. Add the `sentry-alert` label to the new ticket, along with any appropriate product-specific labels.
The key evaluation is whether this incident reflects an immediate production issue, particularly affecting data integrity, or whether it can be picked up during normal sprint planning. If it's an immediate production issue, you should escalate to the tech lead for the affected feature, and consult with them about next steps. If it's an outage of some kind, we should convene folks in #appeals-swat. The Bat Team should do just enough investigation to determine further action.
Mark the Sentry alert in Slack with the green checkmark emoji when it has been triaged, and you can ignore the alert in Sentry for a month.
Drop links to PRs in the #appeals-batteam channel with the :git-pull-request: emoji. We review each others' PRs, which typically fix tests or Sentry-alert-related bugs.
A significant blocker for sprint teams is flakey tests. Right now we track them on a single GitHub issue. See Flakey Test Remedies for possible pitfalls.
We spend time-boxed effort to fix the tests, and if we can't do it in the 1 or 2 hours we allot ourselves, we skip the tests with a reason that refers to the CircleCI URL with the flake and a note about Bat Team efforts.
If you skip the test, consider adding the reason why to the Engineering Huddle agenda if you think it merits wider conversation.
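A skip reason that follows this convention might be assembled as in the sketch below. The `flaky_skip_reason` helper, the placeholder URL, and the spec title in the comment are all hypothetical; only the RSpec `skip:` metadata key is standard RSpec.

```ruby
# Hypothetical helper that builds a consistent skip reason string.
def flaky_skip_reason(circleci_url, note)
  "Flaky test, see #{circleci_url}. Bat Team: #{note}"
end

reason = flaky_skip_reason(
  "<CircleCI build URL>",
  "timeboxed to 2 hours, could not reproduce locally"
)
puts reason

# In a spec file, the reason can be passed as RSpec `skip:` metadata, for example:
#   it "assigns the task", skip: reason do
#     ...
#   end
```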
To serve on Bat Team you will need to:
- Read through this wiki fully, as updates may have been made since you last read it
- Read the First Responder Manual
- Review Bat Team Tips
Make sure you also have:
- SSM access to production environment
- A working CAG or GFE/VPN system with PIV card
With great power comes great responsibility. As the Bat Team focuses on resolving production issues, you will also need the below set up. Please be very careful whenever accessing production.
- A production Caseflow account. You must be using CAG/GFE to access. Slack or email Kate Brown ([email protected]) if you are having trouble.
- System Admin access to production Caseflow and Global Admin access. Create a PR to add yourself (certification prod should be sufficient). Example: https://github.com/department-of-veterans-affairs/appeals-deployment/pull/2351
Both Bat Team members should be active on & monitoring the following channels:
- #appeals-batteam (requests from Support)
- #appeals-app-alerts (many alerts, pay particularly close attention to production alerting from Datadog/PagerDuty)
- #appeals-support (low volume, discussion channel with Support team. Also gets StatusPage alerts)
The following Slack channels will receive automated alerts. Bat Team should divide the following list amongst themselves based on volume level (updated 2019/12/20 based on Messages Posted stats) and share the list with #appeals-batteam.
- #appeals-certification (v. quiet)
- #appeals-demo (v. quiet)
- #appeals-dispatch (v. quiet)
- #appeals-reader (1-10/day)
- #appeals-hearings (1-10/day)
- #appeals-idt (1-10/day)
- #appeals-queue-alerts (15-20/day)
- #appeals-efolder (15-20/day)
See more recent Slack stats per channel
Due to the high number of alerts we can't do anything with, we are not actively watching these channels:
- #appeals-intake-alerts
As of 12/19/19, the Caseflow Delta team will take over monitoring of #appeals-job-alerts.
Use the `#caseflow-vbms-dev` channel to post the error in question. Provide the error GUID. Example can be found here
- Take a look at VACOLS database tables
- If this question cannot be answered by the Caseflow team, create a GitHub issue in the `dsva-vacols` repo
- Look for a Sentry error in the `#appeals-dispatch` channel.
- If the error is related to `AASM::InvalidTransition`, that means the state transition is invalid.
- Find the associated dispatch task in the `dispatch_tasks` table.
- Choose the appropriate state for the task by referencing the `aasm` machine in the `Dispatch::Task` model and update the task manually using the production console.
- Example of a similar problem can be found here.
For Sentry alerts about `RedistributedCase::CannotRedistribute`, check whether the case is currently in the correct VACOLS location (i.e., not 81 "case storage"):
vacols_id=<paste the id from the sentry alert page>
pp VACOLS::Priorloc.where(lockey: vacols_id).order(:locdout).pluck(:locdout, :locstto, :locstrcv)
- locdout: date location was changed
- locstto: location code it was changed to
- locstrcv: css_id for user that changed location. If it was Caseflow, I think the value will be DSUSER
If the last location entry is assigned to a user, then no action is needed. It's typically assigned to a judge and the judge should see it in their "Cases to assign" queue.
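To make the check concrete, here is a minimal sketch that works on plain rows shaped like the `pluck` result above (`[locdout, locstto, locstrcv]`). The helper names and sample rows are hypothetical, and it assumes location codes are stored as strings such as `"81"`.

```ruby
require "date"

# Assumption: location codes are strings, with "81" meaning case storage.
CASE_STORAGE = "81"

# rows: [[locdout, locstto, locstrcv], ...], the shape returned by the pluck above.
# Returns the location code of the most recent entry, or nil if there are no rows.
def latest_location(rows)
  latest = rows.max_by { |locdout, _locstto, _locstrcv| locdout }
  latest && latest[1]
end

def in_case_storage?(rows)
  latest_location(rows) == CASE_STORAGE
end

# Hypothetical sample data for illustration only.
rows = [
  [Date.new(2020, 1, 2), "81", "DSUSER"],
  [Date.new(2020, 1, 9), "JUDGE1", "DSUSER"]
]

puts latest_location(rows)
puts in_case_storage?(rows)
```

Using `max_by` on the date means the result does not depend on the rows already being sorted.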
Also see Hunter's fix prescribed in 3b below.
If an Automatic Case Distribution job fails to run and errors with `ActiveRecord::RecordNotUnique`, how do I fix it?
1. Look for the Sentry error and identify the `case_id` (or `vacols_id`) value.
2. If it's an integer, it's a legacy (VACOLS) case.
3. Review the tasks assigned to the `LegacyAppeal` with that `vacols_id`.
appeal = LegacyAppeal.find_by(vacols_id: the_case_id)
puts appeal.structure_render(:id, :status)
3a. If it does not have any tasks where the `status` is `assigned`, it may need to be re-distributed. This should now be handled automatically using the string `-redistributed` within the `case_id`; if not, let Team Echo know. Any cases that cannot be automatically distributed will be reported as a `CannotRedistribute` error.
- a. Update the existing `DistributedCase` with that `case_id` value to append the string `-attempt1` to the value. The judge should then be able to re-run the Distribution job via the "Request more cases" button.
the_case_id = # case ID value identified above
DistributedCase.find_by(case_id: the_case_id).update!(case_id: "#{the_case_id}-attempt1")
# should return true
DistributedCase.find_by(case_id: the_case_id)
# should return nil
3b. If the `LegacyAppeal` has assigned tasks, the VACOLS location may just need to be updated, which will pull it out of the eligible-for-distribution pool.
- If the appeal has open hearing tasks, the location should be updated to 'Caseflow'.
appeal = LegacyAppeal.find_by(vacols_id: the_case_id)
AppealRepository.update_location!(appeal, LegacyAppeal::LOCATION_CODES[:caseflow])
It is always a good idea to keep track of the last VACOLS case you have handled so we don't miss any.
If you have missed a bunch overnight, here is a way to batch them all:
vacols_ids = []
appeals = vacols_ids.map { |vacols_id| LegacyAppeal.find_by(vacols_id: vacols_id) }
appeals_with_active_tasks = appeals.select do |appeal|
appeal.tasks.active.where.not(type: TrackVeteranTask.name).any?
end.map(&:vacols_id)
appeals_without_active_tasks = vacols_ids - appeals_with_active_tasks
# Confirm this by checking out task trees and location codes:
puts appeals.map { |appeal| appeal.structure_render(:status) }
appeals_with_active_tasks.map { |vacols_id| LegacyAppeal.find_by(vacols_id: vacols_id).location_code }
# Fix the appeals without active tasks
appeals_without_active_tasks.each do |vacols_id|
DistributedCase.find_by(case_id: vacols_id).update!(case_id: "#{vacols_id}-attempt1")
end
# Fix appeals with active tasks
appeals_with_active_tasks.each do |vacols_id|
AppealRepository.update_location!(LegacyAppeal.find_by(vacols_id: vacols_id), LegacyAppeal::LOCATION_CODES[:caseflow])
end
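Before running the batch fix against production, the split between the two groups can be rehearsed on plain data. The sketch below uses a hypothetical `FakeAppeal` struct in place of `LegacyAppeal` and only plans the changes; it applies the same rule as above, where tasks of type `TrackVeteranTask` do not count as active work.

```ruby
# Stand-in for LegacyAppeal with only the fields this triage needs (hypothetical).
FakeAppeal = Struct.new(:vacols_id, :active_task_types)

# Splits appeals into the two fixes described above without touching any records.
def triage(appeals)
  with_tasks, without_tasks = appeals.partition do |appeal|
    (appeal.active_task_types - ["TrackVeteranTask"]).any?
  end

  {
    # These would get their VACOLS location set to Caseflow.
    update_location: with_tasks.map(&:vacols_id),
    # These would get their DistributedCase case_id renamed (old id => new id).
    rename_distributed_case: without_tasks.to_h { |a| [a.vacols_id, "#{a.vacols_id}-attempt1"] }
  }
end

appeals = [
  FakeAppeal.new("111", ["TrackVeteranTask"]),
  FakeAppeal.new("222", ["TrackVeteranTask", "JudgeAssignTask"]),
  FakeAppeal.new("333", [])
]

plan = triage(appeals)
p plan
```

Printing the plan first gives a chance to compare it against the task trees and location codes before any `update!` call is made.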
You can skip specific exception classes from generating Slack messages via the lambda.
If a user removes the final issue on an appeal, they will trigger cancellation of the case. See this slack thread. If an appeal in this state is brought to support/batteam, this state has most likely been caused accidentally by an attorney or intentionally by another person to cancel the appeal.
To determine whether this was done accidentally (and is in need of fixing) or intentionally (and the user needs to be notified as such) we need to confirm who removed the final request issue. Relevant Slack conversation.
Example:
uuid = ""
appeal = Appeal.find_by(uuid: uuid)
puts appeal.structure_render(:status, :closed_at)
# Note the time that the AttorneyTask, JudgeDecisionReviewTask, and RootTask were all cancelled
pp appeal.request_issues.pluck(:id, :closed_status, :closed_at)
# Confirm the most recently closed issue has the same timestamp as the cancelled tasks
remover_id = RequestIssuesUpdate.where(review: appeal, after_request_issue_ids: []).order(updated_at: :desc).first.user_id
User.find(remover_id)
If the person who removed the last issue was not the attorney, this was most likely an intentional case cancellation. In this situation, we would instruct the user to confirm with LRP that this case was intentionally cancelled.
In the case that the attorney did this accidentally, the `css_id` of the attorney and the person who removed the last issue would be the same. The process to fix this is to uncancel the RootTask, JudgeDecisionReviewTask, and AttorneyTask, and instruct the user to add the corrected issue before closing the other one.
Example code:
uuid = ""
appeal = Appeal.find_by(uuid: uuid)
puts appeal.structure_render(:id,:status,:closed_at)
attorney_task_id = ""
judge_dr_task_id = ""
root_task_id= ""
Task.find(root_task_id).update!(status: Constants.TASK_STATUSES.on_hold, closed_at: nil)
Task.find(judge_dr_task_id).update!(status: Constants.TASK_STATUSES.on_hold, closed_at: nil)
Task.find(attorney_task_id).update!(status: Constants.TASK_STATUSES.assigned, closed_at: nil)
puts appeal.structure_render(:id,:status,:closed_at)
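The accidental-versus-intentional decision described above comes down to comparing two CSS IDs. A minimal sketch, with a hypothetical helper name and made-up IDs:

```ruby
# Hypothetical helper; the CSS IDs would come from the console queries above.
def cancellation_kind(attorney_css_id:, remover_css_id:)
  if attorney_css_id == remover_css_id
    # Same person: uncancel the tasks and ask the user to add the corrected issue.
    :accidental
  else
    # Someone else removed the issue: ask the user to confirm with LRP.
    :likely_intentional
  end
end

puts cancellation_kind(attorney_css_id: "ATTY1", remover_css_id: "ATTY1")
puts cancellation_kind(attorney_css_id: "ATTY1", remover_css_id: "CLERK9")
```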
This can be completed by users.
We are working on a way to better keep Caseflow in sync when VACOLS cases are deleted. Until then, you can identify them per user with:
empties = user.tasks.open.where(appeal_type: "LegacyAppeal").select { |t| t.appeal.case_record.nil? }
and then cancel the assigned tasks on each so that they disappear from the user's queue.
If a claim fails to establish in VBMS, we can see the error:
review=SupplementalClaim.find_by(uuid: "abcd") # this is just an example to find a review
review.establishment_error
The user can see this themselves at the job detail page
https://appeals.cf.ds.va.gov/asyncable_jobs/[review.class.name]/jobs/[review.id]
For example:
https://appeals.cf.ds.va.gov/asyncable_jobs/HigherLevelReview/jobs/7
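The URL pattern above can be filled in with a small helper when replying to a support request. `job_detail_url` is a hypothetical name; in a Rails console the arguments would be `review.class.name` and `review.id`.

```ruby
BASE_URL = "https://appeals.cf.ds.va.gov".freeze

# Builds the job detail page URL from a review's class name and id.
def job_detail_url(review_class_name, review_id)
  "#{BASE_URL}/asyncable_jobs/#{review_class_name}/jobs/#{review_id}"
end

puts job_detail_url("HigherLevelReview", 7)
# prints https://appeals.cf.ds.va.gov/asyncable_jobs/HigherLevelReview/jobs/7
```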
If it is failing after 24 hours, it is most likely due to a known upstream error. We are adding notes to the job details page for these issues. You can look at the job details page at the URL above, or do it in the console with:
JobNote.find_by(job_type: review.class.name, job_id: review.id)
The JobNote will likely have information about the GitHub issue associated to the upstream error.
There is no additional action required from the user at this time. However, these usually take a long time to resolve, so the claim will be stuck until it is.
Unidentified issues are what claims assistants add if they don't see the issue the veteran is requesting out of the available options (which come from rating issues from BGS, or prior Caseflow decision issues), and the issue is not a new non-rating issue.
Users are instructed to return to Caseflow to edit these. This means removing the unidentified issue, and adding the correct issue. This should be the first step for the user. They might confuse this with being able to edit the contention text, which is not currently available for unidentified issues. Allowing this is planned in current/upcoming work.
Sometimes they can't find the correct issue. If the user is a VSR, they can try backfilling the rating issue, and trying again. They should use a descriptive "decision_text" when backfilling the issue so that they can find it in Caseflow.
If an issue is still missing, this is often due to the issues being old, but there are other reasons an issue could be missing as well. We are working on implementing a new BGS service that should resolve these. Also, we have proposed to AMO to allow users to intake unidentified issues without connecting them to a rating issue, and work for that is underway.
We expect both of these efforts to be completed in February, and hopefully approved by AMO. However, right now, if they can't find an issue, they may not be able to proceed with processing that issue.
When investigating these, I usually check the veteran's ratings to see if the user just missed the correct issue:
v=review.veteran
rs=v.ratings
rs.first.issues
Occasionally, the issue is present, but not showing up because it was promulgated after the receipt date of the form. These are not currently available to be selected, but we are considering allowing them, pending some answers and approval from AMO.
- A user is trying to add a request issue on an intake, but it is ineligible because it is already in active review
- That same request issue is on another review, whose end product has been cleared
- The review doesn't have any decision issues, even though it got cleared a while ago
Users may want to cancel a claim for various reasons, for example if the veteran submitted a Supplemental Claim but didn't submit any additional evidence. The proper way to do this is to remove the review by going to the edit screen and removing the issues. However, users were instead manually clearing these claims in VBMS or Share. This may be because they do not get credit for canceled claims. We have heard that this behavior is now prevented in VBMS; I'm not sure about Share.
Whether a user cancels a claim in Caseflow or in VBMS/Share, that's okay and we detect that behavior and close the request issues. However, if a claim gets cleared, we interpret that as meaning it was fully processed and should be getting a new award generated to reflect the decision. So we start pinging VBMS and BGS for the decision. Once we get the disposition from VBMS, and if it was a rating issue, the new rating issue from BGS, we create a Caseflow decision issue.
If it was manually cleared, but should have been canceled, then we never get a decision issue. So it appears to us that the issue is still active, preventing users from adding it to a new intake, but it will also never get a decision.
# First, find the cleared claim
v=Veteran.find(1234) # find the veteran
scs=SupplementalClaim.where(veteran_file_number: v.file_number)
scs.first.end_product_establishments
# Let's say the first SC has one end product establishment, and it has a synced_status of "CLR". Then it may be what I'm looking for. I may want to check the support ticket for the specific issue, or check out other reviews too.
sc=scs.first
epe=sc.end_product_establishments.first
# Double check that there are no decision issues. If there are decision issues, this is not the right review.
sc.decision_issues.empty?
# Double check that the request issues don't have a contention, because sometimes decision issue syncing fails due to lack of a new rating issue.
epe.request_issues.first.contention_disposition.nil?
# Check if it got cleared a while ago (more than a month).
epe.last_synced_at
# If all of the above is true, there's a good chance it was manually cleared. Then you can close the issues with "no_decision", indicating we never expect to get a decision for them.
epe.request_issues.each{|ri| RequestIssueClosure.new(ri).with_no_decision!}
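The checklist in the comments above can be summarized as one predicate over plain values. This is a sketch under assumptions: the method name and keyword arguments are hypothetical, "more than a month" is approximated as 30 days, and it checks every request issue's disposition, whereas the console example inspects only the first.

```ruby
SECONDS_PER_DAY = 86_400

# Returns true when all four signals of a manually cleared claim are present.
def likely_manually_cleared?(synced_status:, decision_issue_count:,
                             contention_dispositions:, last_synced_at:,
                             now: Time.now, min_age_days: 30)
  synced_status == "CLR" &&
    decision_issue_count.zero? &&
    contention_dispositions.all?(&:nil?) &&
    (now - last_synced_at) > min_age_days * SECONDS_PER_DAY
end

puts likely_manually_cleared?(
  synced_status: "CLR",
  decision_issue_count: 0,
  contention_dispositions: [nil, nil],
  last_synced_at: Time.utc(2020, 1, 1),
  now: Time.utc(2020, 3, 1)
)
```

Even when this returns true it is only a strong hint, so it is worth checking the support ticket and the veteran's other reviews before closing issues with `with_no_decision!`.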
This may be needed if a user gets an error when removing an issue. The alert they may submit will be titled "Previous update not yet done processing", and the error class is `VBMS::CannotDeleteContention`.
Here's an example of code to use, please adapt it to your needs.
Find the request issues update:
uuid_of_review=""
review=HigherLevelReview.find_by(uuid: uuid_of_review)
# Find the request issues update causing the error
# There should theoretically only be one
riu=RequestIssuesUpdate.where(review_type: review.class.name, review_id: review.id).processable.first
# Inspect the error to find the contention id, for example: <VBMS::Responses::Contention id=\"94613799\"
riu.error
# Paste the contention id from the error message between the quotes
contention_id_of_failed_removal = ""
# First try re-processing it, just in case someone resolved the problem in VBMS
riu.establish!
# Find the removed issue causing the problem
removed_issue = riu.removed_issues.find{|ri| ri.contention_reference_id == contention_id_of_failed_removal}
# Update the removed issue to an open status
removed_issue.update!(closed_status: nil, closed_at: nil)
# Check if there were any other changes on that request issues update
# If the removed issue was the only change, then cancel the request issues update because the update validation will fail if there are no changes
# Note: when adding these notes, we had the first case (one removed issue, no other changes), so the case where there were also other changes may require additional actions. From what I've checked, I think this should cover it, but it may be good to double check.
if riu.all_updated_issues == [removed_issue]
riu.canceled!
else
# Update the request issue update to not remove that issue
riu.update!(after_request_issue_ids: riu.after_request_issue_ids.push(removed_issue.id))
# Un-memoize after_issues so that the request issues update recognizes its new state
riu.instance_variable_set(:@after_issues, nil)
# Re-process the request issues update
riu.establish!
end
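Copying the contention id out of the error message by hand is error-prone, so it could be extracted with a regular expression instead. This is a sketch: `contention_id_from_error` is a hypothetical helper, and it assumes the message contains text shaped like the example in the comments above, with or without escaped quotes. It returns the id as a string, so check the type of `contention_reference_id` before comparing.

```ruby
# Pulls the numeric contention id out of an error message, or returns nil.
# The optional backslash handles messages where the quotes appear escaped.
def contention_id_from_error(message)
  match = message.to_s.match(/VBMS::Responses::Contention id=\\?"(\d+)/)
  match && match[1]
end

escaped = '<VBMS::Responses::Contention id=\"94613799\"'
plain = '<VBMS::Responses::Contention id="94613799"'

puts contention_id_from_error(escaped)
puts contention_id_from_error(plain)
```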