Bat Team

Definition

The Bat Team focuses on removing obstacles for regular sprint work. Each week two engineers are on-call and during that week they, along with the engineering lead, form the Bat Team. Their mission is to:

Respond to production incidents and update the team and users as necessary
Triage product manager and support requests related to production issues (primarily through resolving support requests in the #appeals-batteam channel)
Monitor Slack for Sentry alerts
Improve the code base to reduce technical debt (if time permits, by picking up issues labeled tech-improvement from their team's sprint backlog)

When an issue arises, the Bat Team assumes initial ownership. It is their responsibility to evaluate the importance of the issue relative to the other active issues at that point in time and commit to the appropriate level of support or clearly transfer ownership to someone else (engineer on the relevant sprint team, Engineering Lead, PM, etc).

For example, a PagerDuty alert that Caseflow is down takes priority over a support ticket to remedy a data error.

For example, a work stoppage support ticket (an employee can't do any work) may take priority over a support ticket about fixing a contention on an appeal.

As you serve on Bat Team, you will improve at making these judgment calls. If there is any ambiguity, reach out to the Engineering Lead.

The name "Bat Team" refers to the "batman" military role. It is a growing software industry practice and an experiment in agile development.

Rituals

The Bat Team listens in the #appeals-batteam channel. They should be the only engineers actively participating in that channel during the week, since the whole goal of the Bat Team is removing distractions from the regular sprint team work.

Daily async standup: The team does async Slack standup reports each day, listing what happened yesterday, their plans for today, and any blockers they are experiencing. The role of the engineering lead is to remove those blockers.

Friday afternoon Bat Team handoff meeting: Both the current Bat Team and the upcoming Bat Team should attend a handoff meeting on Friday afternoon (check the VA Appeals calendar) in order to reflect on the week and share recommendations and context about any carryover issues. During this time, one member should write down notes in the Bat Team Running Handoffs Doc, so future Bat Team members have easy access and trends can be tracked long-term. When you start your Bat Shift, you should read over the Handoffs doc since your last shift for important updates.

Protocols

1. Respond to Production Incidents

The most important priority for Bat Team members is to respond and facilitate resolution of production Caseflow incidents as they arise. A production incident is any unplanned interruption to Caseflow or a reduction in quality or a failure of a dependency that impacts Caseflow. These will most likely be detected as Datadog or PagerDuty alerts on #appeals-app-alerts, by a Support team member responding to a user request, or by users themselves on Slack or email.

If a production incident has been identified:

Assign a severity level to the production incident (low, high or extreme) based on the postmortem guidelines.
- If the incident is low: the Bat Team should triage to the appropriate sprint team and hand off the accountability and responsibility of resolving the issue to an engineer on the sprint team. Reach out to the Engineering Lead if there are questions.
- If the incident is high or extreme: the Bat Team must immediately notify the Engineering Lead, who will become accountable for the resolution of the issue. The Bat Team is expected to spend a minimum amount of time (~1 hour depending on capacity) investigating the issue, and after that should hand off to the Engineering Lead, who will then assemble a SWAT team in #appeals-swat to resolve the issue.
The Caseflow Status Page must be updated by either a Bat Team member or the engineering lead. The person who identified the production incident should post in #appeals-support, tag the current Bat Team members and the engineering lead, and receive acknowledgement before updating.

RACI chart for the team roles involved in a production incident

Note: this is provided as a guide and may not apply uniformly

Severity	Responsible	Accountable	Consulted	Informed
Low	Sprint team engineers	Sprint team engineer	PM, Caseflow Eng Team	Caseflow Support Team
High, Extreme	SWAT Team	Engineering Lead	Caseflow Team	Caseflow Support Team
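
The routing in the table above can be restated as a small lookup. This is an illustrative sketch only, not Caseflow code: the `RACI` hash and the `route_incident` helper are hypothetical names, and the values simply repeat the table.

```ruby
# Hypothetical lookup that restates the RACI table above; not part of Caseflow.
RACI = {
  low: {
    responsible: "Sprint team engineers",
    accountable: "Sprint team engineer",
    consulted: "PM, Caseflow Eng Team",
    informed: "Caseflow Support Team"
  },
  high: {
    responsible: "SWAT Team",
    accountable: "Engineering Lead",
    consulted: "Caseflow Team",
    informed: "Caseflow Support Team"
  }
}.freeze

# Extreme incidents share the same row as high incidents in the table.
def route_incident(severity)
  key = severity.to_sym
  key = :high if key == :extreme
  RACI.fetch(key) { raise ArgumentError, "unknown severity: #{severity}" }
end

puts route_incident(:extreme)[:accountable]
```

As the note above says, this is a guide and may not apply uniformly; treat the lookup as a reminder of the default owners rather than a rule.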

After a production issue has occurred, the engineering lead will schedule a postmortem meeting for the engineering team and will document the incident in the post-mortems folder in the appeals-deployment repository.

2. Triage Requests from the Support and Product Management teams in Slack

Join the #appeals-batteam channel. Support and product folks will be in there and will bring issues to our attention. The Bat Team is expected to respond and facilitate resolution to all requests from Support on this channel. This does not mean the Bat Team must be the only ones to fix issues, but they should be the first responders and loop in others only as necessary.

If you want to acknowledge a Slack message but don't have time to look at it immediately, use the 🦇 emoji response.

If you are actively looking into a Slack message, add the 👀 emoji response.

When a Slack message has been adequately addressed (question answered, GitHub issue created, etc.), add the ✅ emoji.

Prioritize pairing during investigations and starting a Slack call in the #appeals-batteam channel. This gives everyone a chance to learn about parts of Caseflow that they might not know as well.

If a Support ticket cannot be resolved within 7 days, a GitHub issue must be created to track long-term engineering resolution following the guidelines below.

GitHub Issue Creation

If a GitHub ticket needs to be created to follow up on a support ticket, to address a flakey test or alert (see below), or for anything else that comes up during Bat Team, please create a GitHub issue in the caseflow repository and include the appropriate labels as well as relevant links (to Slack conversations, test results, Sentry, etc.).

type (required): bug, sentry-alert, tech-improvement
sprint team: echo, foxtrot, tango, delta
product: caseflow-queue, hearings, etc.

Do not use the batteam GitHub label. This label is not standardized and does not provide extra information.

These tickets will be picked up during the product team's backlog grooming process and be prioritized in sprints accordingly.
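
The labeling guidelines above can be sketched as a quick check to run through before filing an issue. This is a hypothetical helper, not something that exists in the caseflow repository; the constant and method names are made up for illustration.

```ruby
# Hypothetical label check based on the guidelines above; not part of Caseflow.
TYPE_LABELS = %w[bug sentry-alert tech-improvement].freeze
DISCOURAGED_LABELS = %w[batteam].freeze

# Returns a list of problems with the proposed labels (empty when they look fine).
def label_problems(labels)
  problems = []
  if (labels & TYPE_LABELS).empty?
    problems << "missing a type label (#{TYPE_LABELS.join(', ')})"
  end
  (labels & DISCOURAGED_LABELS).each do |label|
    problems << "do not use the #{label} label"
  end
  problems
end

p label_problems(%w[bug echo caseflow-queue])
p label_problems(%w[batteam foxtrot])
```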

3. Monitor Slack for Sentry Alerts

Each week, Bat Team members should divide responsibility for monitoring the team's alert Slack channels and monitor for Sentry alerts. These come from the production Caseflow system.

When a new Sentry alert appears in Slack, it should be investigated as soon as possible. If you cannot investigate it immediately, tag it with the 🦇 emoji.

Prioritize pairing during investigations. This gives everyone a chance to learn about parts of Caseflow that they might not know as well.

If a GitHub ticket already exists for the underlying issue, the Sentry alert should be ignored for a month.

If a GitHub ticket does not yet exist, create one with a link to the Sentry incident in the ticket description. Add the sentry-alert label to the new ticket, along with any appropriate product-specific labels.

The key evaluation is whether this incident reflects an immediate production issue, particularly affecting data integrity, or whether it can be picked up during normal sprint planning. If it's an immediate production issue, you should escalate to the tech lead for the affected feature, and consult with them about next steps. If it's an outage of some kind, we should convene folks in #appeals-swat. The Bat Team should do just enough investigation to determine further action.

Mark the Sentry alert in Slack with the green checkmark emoji when it has been triaged, and you can ignore the alert in Sentry for a month.
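
The triage steps in this section can be summarized as a short decision sketch. It is a hypothetical helper, not Caseflow code; the three keyword arguments stand for judgment calls the Bat Team member makes while investigating.

```ruby
# Hypothetical summary of the Sentry triage steps above; not part of Caseflow.
# All three keyword arguments are judgment calls made by the Bat Team member.
def sentry_triage_steps(ticket_exists:, immediate_production_issue:, outage:)
  steps = []
  unless ticket_exists
    steps << "create a GitHub issue with the sentry-alert label and a link to the Sentry incident"
  end
  steps << "escalate to the tech lead for the affected feature" if immediate_production_issue
  steps << "convene folks in #appeals-swat" if outage
  steps << "mark the Slack alert with the green checkmark emoji"
  steps << "ignore the alert in Sentry for a month"
  steps
end

puts sentry_triage_steps(ticket_exists: false, immediate_production_issue: false, outage: false)
```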

4. Improve the codebase

Pull Requests

Drop links to PRs in the #appeals-batteam channel with the :git-pull-request: emoji. We review each other's PRs, which typically fix tests or Sentry-alert-related bugs.

Flakey tests

A significant blocker for sprint teams is flakey tests. Right now we track them on a single GitHub issue. See Flakey Test Remedies for possible pitfalls.

We spend time-boxed effort to fix the tests, and if we can't do it in the 1 or 2 hours we allot ourselves, we skip the tests with a reason that refers to the CircleCI URL with the flake and a note about Bat Team efforts.

If you skip the test, consider adding the reason why to the Engineering Huddle agenda if you think it merits wider conversation.
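
As a rough sketch of what a skip reason might look like, the helper below composes a message from the CircleCI URL and the time spent. The helper name and the wording are assumptions, and the URL is a placeholder to replace with the real link to the flake.

```ruby
# Hypothetical helper for composing a skip reason; the wording is an assumption.
def flake_skip_reason(circleci_url, hours_spent)
  "Flakey test, see #{circleci_url}. Bat Team spent #{hours_spent}h without a fix."
end

puts flake_skip_reason("<CircleCI URL with the flake>", 2)

# In a spec file, a reason like this can be attached with RSpec's `skip:` metadata:
#   it "does the thing", skip: "Flakey test, see <CircleCI URL with the flake>. ..." do
#     ...
#   end
```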

Frequently Asked Questions

Preparations

To serve on Bat Team you will need to:

Read through this wiki fully, as updates may have been made since you last read it
Read the First Responder Manual
Review Bat Team Tips

Make sure you also have:

SSM access to production environment
A working CAG or GFE/VPN system with PIV card

With great power comes great responsibility. As the Bat Team focuses on resolving production issues, you will also need the below set up. Please be very careful whenever accessing production.

A production Caseflow account. You must be using CAG/GFE to access it. Slack or email Kate Brown if you are having trouble.
- System Admin access to production Caseflow and Global Admin access. Create a PR to add yourself (certification prod should be sufficient). Example: https://github.com/department-of-veterans-affairs/appeals-deployment/pull/2351

Slack channels we monitor

Both Bat Team members should be active on & monitoring the following channels:

#appeals-batteam (requests from Support)
#appeals-app-alerts (many alerts, pay particularly close attention to production alerting from Datadog/PagerDuty)
#appeals-support (low volume, discussion channel with Support team. Also gets StatusPage alerts)

The following Slack channels will receive automated alerts. Bat Team should divide the following list amongst themselves based on volume level (updated 2019/12/20 based on Messages Posted stats) and share the list with #appeals-batteam.

#appeals-certification (v. quiet)
#appeals-demo (v. quiet)
#appeals-dispatch (v. quiet)
#appeals-reader (1-10/day)
#appeals-hearings (1-10/day)
#appeals-idt (1-10/day)
#appeals-queue-alerts (15-20/day)
#appeals-efolder (15-20/day)
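
One way to divide the list is to balance the expected alert volume between the two engineers. The sketch below is hypothetical: the daily volumes are rough stand-ins for the ranges listed above ("v. quiet" treated as 0, "1-10/day" as 5, "15-20/day" as 18), and `split_channels` is a made-up helper.

```ruby
# Hypothetical split of alert channels between two Bat Team members.
# Volumes are rough stand-ins for the ranges listed above, not measured stats.
CHANNEL_VOLUMES = {
  "#appeals-certification" => 0,
  "#appeals-demo" => 0,
  "#appeals-dispatch" => 0,
  "#appeals-reader" => 5,
  "#appeals-hearings" => 5,
  "#appeals-idt" => 5,
  "#appeals-queue-alerts" => 18,
  "#appeals-efolder" => 18
}.freeze

# Greedy split: hand the busiest remaining channel to whoever has less volume so far.
def split_channels(volumes, members)
  assignments = members.to_h { |member| [member, []] }
  totals = Hash.new(0)
  volumes.sort_by { |_channel, volume| -volume }.each do |channel, volume|
    lightest = members.min_by { |member| totals[member] }
    assignments[lightest] << channel
    totals[lightest] += volume
  end
  assignments
end

split_channels(CHANNEL_VOLUMES, ["first engineer", "second engineer"]).each do |member, channels|
  puts "#{member}: #{channels.join(', ')}"
end
```

Whatever split you settle on, share it in #appeals-batteam so everyone knows who is watching which channel.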

See more recent Slack stats per channel

Ignored Sentry alerts

Slack channels we currently don't monitor

Due to the high number of alerts we can't do anything with, we are not actively watching these channels:

#appeals-intake-alerts

As of 12/19/19, the Caseflow Delta team will take over monitoring of #appeals-job-alerts.

If VBMS API is throwing an error, whom do I contact?

Use #caseflow-vbms-dev channel to post the error in question. Provide error GUID. Example can be found here

If I have a question about VACOLS.

Take a look at VACOLS database tables
If this question cannot be answered by the Caseflow team, create a GitHub issue in the dsva-vacols repo

If I receive a support issue regarding Dispatch claim being "stuck".

Look for a Sentry error in #appeals-dispatch channel.
If the error is related to AASM::InvalidTransition, that means the state transition is invalid.
Find the associated dispatch task in the dispatch_tasks table.
Choose the appropriate state for the task by referencing the aasm machine in Dispatch::Task model and update the task manually using production console.
Example of a similar problem can be found here.

RedistributedCase::CannotRedistribute Sentry alert

For Sentry alerts about RedistributedCase::CannotRedistribute, check whether the case is currently in the correct VACOLS location (i.e., not 81 "case storage"):

vacols_id = "" # paste the id from the Sentry alert page
pp VACOLS::Priorloc.where(lockey: vacols_id).order(:locdout).pluck(:locdout, :locstto, :locstrcv)

locdout: date location was changed
locstto: location code it was changed to
locstrcv: css_id for user that changed location. If it was Caseflow, I think the value will be DSUSER

If the last location entry is assigned to a user, then no action is needed. It's typically assigned to a judge and the judge should see it in their "Cases to assign" queue.
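
To make reading the query output concrete, here is a minimal sketch that works on the plucked rows outside of the Rails console. The helper names and the sample rows (including the `JUDGE1` location) are made up for illustration; only the 81 "case storage" code comes from the text above.

```ruby
require "date"

# Hypothetical helpers for reading the rows printed by the Priorloc query above.
# Each row is [locdout, locstto, locstrcv]; the sample rows below are made up.
CASE_STORAGE = "81"

# Returns the location code of the most recent location change, or nil if there are no rows.
def latest_location(rows)
  latest = rows.max_by { |locdout, _locstto, _locstrcv| locdout }
  latest && latest[1]
end

# True when the most recent location is still case storage.
def in_case_storage?(rows)
  latest_location(rows) == CASE_STORAGE
end

sample_rows = [
  [Date.new(2019, 11, 1), "81", "DSUSER"],
  [Date.new(2019, 12, 2), "JUDGE1", "DSUSER"]
]
puts latest_location(sample_rows)
puts in_case_storage?(sample_rows)
```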

Also see Hunter's fix prescribed in 3b below.

If an Automatic Case Distribution job fails to run and errors with ActiveRecord::RecordNotUnique how do I fix it?

Look for the Sentry error and identify the case_id (or vacols_id) value.
If it's an integer, it's a legacy (VACOLS) case.
Review the tasks assigned to the LegacyAppeal with that vacols_id.

appeal = LegacyAppeal.find_by(vacols_id: the_case_id)
puts appeal.structure_render(:id, :status)
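
The "integer means legacy" rule from the steps above can be sketched as a one-line check. This is a hypothetical helper with made-up ids; note that it only matches all-digit values, so an id with a suffix appended (such as -redistributed) would need the suffix stripped first.

```ruby
# Hypothetical check for the rule above: an all-digit case_id is treated as a
# legacy (VACOLS) case; anything else is assumed not to be a plain VACOLS id.
def legacy_case_id?(case_id)
  case_id.to_s.match?(/\A\d+\z/)
end

# Both ids below are made up for illustration.
p legacy_case_id?("3456789")
p legacy_case_id?("a1b2c3d4-0000-4000-8000-123456789abc")
```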

3a. If it does not have any tasks where the status is assigned, it may need to be re-distributed. This should now be handled automatically using the string -redistributed within the case_id -- if not, let Team Echo know. Any cases that cannot be automatically distributed will be reported as a CannotRedistribute error.

a. Update the existing DistributedCase with that case_id value to append the string -attempt1 to the value. The judge should then be able to re-run the Distribution job via the "Request more cases" button.

the_case_id = "" # case ID value identified above
DistributedCase.find_by(case_id: the_case_id).update!(case_id: "#{the_case_id}-attempt1")
# should return true
DistributedCase.find_by(case_id: the_case_id)
# should return nil

3b. If the LegacyAppeal has assigned tasks, the VACOLS location may just need to be updated, which will pull it out of the eligible-for-distribution pool.

If the appeal has open hearing tasks, the location should be updated to 'Caseflow'.

appeal = LegacyAppeal.find_by(vacols_id: the_case_id)
AppealRepository.update_location!(appeal, LegacyAppeal::LOCATION_CODES[:caseflow])

It is always a good idea to keep track of the last VACOLS case you have handled so we don't miss any.

If you have missed a bunch overnight, here is a way to batch them all:

vacols_ids = []
appeals = vacols_ids.map { |vacols_id| LegacyAppeal.find_by(vacols_id: vacols_id) }
appeals_with_active_tasks = appeals.select do |appeal|
  appeal.tasks.active.where.not(type: TrackVeteranTask.name).any?
end.map(&:vacols_id)
appeals_without_active_tasks = vacols_ids - appeals_with_active_tasks
# Confirm this by checking out task trees and location codes:
puts appeals.map { |appeal| appeal.structure_render(:status) }
appeals_with_active_tasks.map { |vacols_id| LegacyAppeal.find_by(vacols_id: vacols_id).location_code }
# Fix the appeals without active tasks
appeals_without_active_tasks.each do |vacols_id|
  DistributedCase.find_by(case_id: vacols_id).update!(case_id: "#{vacols_id}-attempt1")
end
# Fix appeals with active tasks
appeals_with_active_tasks.each do |vacols_id|
  AppealRepository.update_location!(LegacyAppeal.find_by(vacols_id: vacols_id), LegacyAppeal::LOCATION_CODES[:caseflow])
end

Stop Sentry from sending [Alert] on to Slack

You can skip specific exception classes from generating Slack messages via the lambda.

Uncancel an Appeal accidentally closed by Closing Last Issue

If a user removes the final issue on an appeal, they will trigger cancellation of the case. See this slack thread. If an appeal in this state is brought to support/batteam, this state has most likely been caused accidentally by an attorney or intentionally by another person to cancel the appeal.

To determine whether this was done accidentally (and is in need of fixing) or intentionally (and the user needs to be notified as such) we need to confirm who removed the final request issue. Relevant Slack conversation.

Example:

uuid = ""
appeal = Appeal.find_by(uuid: uuid)
puts appeal.structure_render(:status, :closed_at)
# Note the time that the AttorneyTask, JudgeDecisionReviewTask, and RootTask were all cancelled
pp appeal.request_issues.pluck(:id, :closed_status, :closed_at)
# Confirm the most recently closed issue has the same timestamp as the cancelled tasks
remover_id = RequestIssuesUpdate.where(review: appeal, after_request_issue_ids: []).order(updated_at: :desc).first.user_id
User.find(remover_id)

If the person who removed the last issue was not the attorney, this was most likely an intentional case cancellation. In this situation, we would instruct the user to confirm with LRP that this case was intentionally cancelled.

In the case that the attorney did this accidentally, the css_id of the attorney and the person who removed the last issue would be the same. The process to fix this is to uncancel the Root, JudgeDecisionReview, and AttorneyTask, and instruct the User to add the corrected issue before closing the other one.

Example code:

uuid = ""
appeal = Appeal.find_by(uuid: uuid)
puts appeal.structure_render(:id,:status,:closed_at)
attorney_task_id = ""
judge_dr_task_id = ""
root_task_id= ""
Task.find(root_task_id).update!(status: Constants.TASK_STATUSES.on_hold, closed_at: nil)
Task.find(judge_dr_task_id).update!(status: Constants.TASK_STATUSES.on_hold, closed_at: nil)
Task.find(attorney_task_id).update!(status: Constants.TASK_STATUSES.assigned, closed_at: nil)
puts appeal.structure_render(:id,:status,:closed_at)
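
The accidental-versus-intentional decision described above comes down to comparing two css_ids. A minimal sketch, using a plain struct as a stand-in for the Caseflow User record (the struct, method name, and sample css_ids are hypothetical):

```ruby
# Hypothetical restatement of the decision above using a plain struct; not Caseflow code.
Person = Struct.new(:css_id)

# Same css_id: the attorney removed their own last issue, so treat it as accidental.
# Different css_id: most likely an intentional cancellation to confirm with LRP.
def cancellation_kind(attorney, remover)
  attorney.css_id == remover.css_id ? :accidental : :likely_intentional
end

# The css_ids below are made up for illustration.
puts cancellation_kind(Person.new("ATTY1"), Person.new("ATTY1"))
puts cancellation_kind(Person.new("ATTY1"), Person.new("CLERK2"))
```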

Creating a new Job Alert for Slack

See this guide

Cancel an Attorney Task & Return to Judge Assign Queue

This can be completed by users.

VACOLS case deleted, Legacy Appeal record dangling

We are working on a way to better keep Caseflow in sync when VACOLS cases are deleted. Until then, you can identify them per user with:

empties = user.tasks.open.where(appeal_type: "LegacyAppeal").select { |t| t.appeal.case_record.nil? }

and then cancel the assigned tasks on each so that they disappear from the user's queue.

Merge user accounts

Read all about it

Intake

Claim not established

If a claim fails to establish in VBMS, we can see the error:

review = SupplementalClaim.find_by(uuid: "abcd") # this is just an example to find a review
review.establishment_error

The user can see this themselves at the job detail page https://appeals.cf.ds.va.gov/asyncable_jobs/[review.class.name]/jobs/[review.id] For example: https://appeals.cf.ds.va.gov/asyncable_jobs/HigherLevelReview/jobs/7
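
If you need to send the user a link, the URL pattern above can be filled in with a small helper. The helper and constant names are hypothetical; the base URL and path come from the pattern above.

```ruby
# Hypothetical helper that fills in the job detail URL pattern above.
JOB_DETAIL_BASE = "https://appeals.cf.ds.va.gov/asyncable_jobs"

def job_detail_url(review_class_name, review_id)
  "#{JOB_DETAIL_BASE}/#{review_class_name}/jobs/#{review_id}"
end

puts job_detail_url("HigherLevelReview", 7)
```

In the production console you would pass `review.class.name` and `review.id`.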

If it is failing after 24 hours, it is most likely due to a known upstream error. We are adding notes to the job details page for these issues. You can look at the job details page at the URL above, or do it in the console with: JobNote.find_by(job_type: review.class.name, job_id: review.id)

The JobNote will likely have information about the GitHub issue associated to the upstream error.

There is no additional action required from the user at this time. However, these usually take a long time to resolve, so the claim will be stuck until it is.

Unidentified or missing issues

Unidentified issues are what claims assistants add if they don't see the issue the veteran is requesting out of the available options (which come from rating issues from BGS, or prior Caseflow decision issues), and the issue is not a new non-rating issue.

Users are instructed to return to Caseflow to edit these. This means removing the unidentified issue, and adding the correct issue. This should be the first step for the user. They might confuse this with being able to edit the contention text, which is not currently available for unidentified issues. Allowing this is planned in current/upcoming work.

Sometimes they can't find the correct issue. If the user is a VSR, they can try backfilling the rating issue, and trying again. They should use a descriptive "decision_text" when backfilling the issue so that they can find it in Caseflow.

If an issue is still missing, this is often due to the issues being old, but there are other reasons an issue could be missing as well. We are working on implementing a new BGS service that should resolve these. Also, we have proposed to AMO to allow users to intake unidentified issues without connecting them to a rating issue, and work for that is underway.

We expect both of these efforts to be completed in February, and hopefully approved by AMO. However, right now, if they can't find an issue, they may not be able to proceed with processing that issue.

When investigating these, I usually check the veteran's ratings to see if the user just missed the correct issue:

v = review.veteran
rs = v.ratings
rs.first.issues

Occasionally, the issue is present, but not showing up because it was promulgated after the receipt date of the form. These are not currently available to be selected, but we are considering allowing them, pending some answers and approval from AMO.

Making a "stuck" issue available from a manually cleared claim

How to detect this

A user is trying to add a request issue on an intake, but it is ineligible because it is already in active review
That same request issue is on another review, whose end product has been cleared
The review doesn't have any decision issues, even though it got cleared a while ago

Why it's happening

Users may want to cancel a claim for various reasons, for example if the veteran submitted a Supplemental Claim, but didn't submit any additional evidence. The proper way to do this is to remove the review, by going to the edit screen and removing the issues. However, users were instead manually clearing these claims in VBMS or Share. This may be because they do not get credit for canceled claims. We have heard that this behavior is now prevented in VBMS, but I'm not sure about Share.

Whether a user cancels a claim in Caseflow or in VBMS/Share, that's okay and we detect that behavior and close the request issues. However, if a claim gets cleared, we interpret that as meaning it was fully processed and should be getting a new award generated to reflect the decision. So we start pinging VBMS and BGS for the decision. Once we get the disposition from VBMS, and if it was a rating issue, the new rating issue from BGS, we create a Caseflow decision issue.

If it was manually cleared, but should have been canceled, then we never get a decision issue. So it appears to us that the issue is still active, preventing users from adding it to a new intake, but it will also never get a decision.

Investigation and resolution

# First, find the cleared claim
v = Veteran.find(1234) # find the veteran
scs = SupplementalClaim.where(veteran_file_number: v.file_number)
scs.first.end_product_establishments

# Let's say the first SC has one end product establishment, and it has a synced_status of "CLR".  Then it may be what I'm looking for. I may want to check the support ticket for the specific issue, or check out other reviews too.
sc = scs.first
epe = sc.end_product_establishments.first

# Double check that there are no decision issues. If there are decision issues, this is not the right review.
sc.decision_issues.empty?

# Double check that the request issues don't have a contention, because sometimes decision issue syncing fails due to lack of a new rating issue.
epe.request_issues.first.contention_disposition.nil?

# Check if it got cleared a while ago (more than a month).
epe.last_synced_at

# If all of the above is true, there's a good chance it was manually cleared. Then you can close the issues with "no_decision", indicating we never expect to get a decision for them.
epe.request_issues.each{|ri| RequestIssueClosure.new(ri).with_no_decision!}
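
The checks in the console steps above can be combined into a single predicate. This is a sketch under stated assumptions, not Caseflow code: `ClearedClaimCheck` is a hypothetical struct holding the values you read off in the console, "more than a month" is taken as one calendar month, and it checks every request issue's disposition rather than only the first.

```ruby
require "date"

# Hypothetical stand-in for the values inspected in the console steps above.
ClearedClaimCheck = Struct.new(
  :synced_status, :decision_issue_count, :contention_dispositions, :last_synced_at,
  keyword_init: true
)

# True when every signal above points at a manually cleared claim.
def likely_manually_cleared?(check, today: Date.today)
  check.synced_status == "CLR" &&
    check.decision_issue_count.zero? &&
    check.contention_dispositions.all?(&:nil?) &&
    check.last_synced_at < (today << 1) # Date#<< steps back one month
end

# Sample values are made up for illustration.
check = ClearedClaimCheck.new(
  synced_status: "CLR",
  decision_issue_count: 0,
  contention_dispositions: [nil, nil],
  last_synced_at: Date.new(2019, 10, 1)
)
puts likely_manually_cleared?(check, today: Date.new(2019, 12, 20))
```

Even when the predicate holds, check the support ticket and the veteran's other reviews before closing issues with no_decision.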

Undo removing a request issue

This may be needed if a user gets an error when removing an issue. The alert they may submit will be titled "Previous update not yet done processing", and the error class is VBMS::CannotDeleteContention.

Here's an example of code to use, please adapt it to your needs.

Find the request issues update:

uuid_of_review = ""

review = HigherLevelReview.find_by(uuid: uuid_of_review)

# Find the request issues update causing the error
# There should theoretically only be one
riu = RequestIssuesUpdate.where(review_type: review.class.name, review_id: review.id).processable.first

# Get this from the error message, for example: <VBMS::Responses::Contention id=\"94613799\"
riu.error
contention_id_of_failed_removal = "" # paste the contention id from the error message

# First try re-processing it, just in case someone resolved the problem in VBMS
riu.establish!

# Find the removed issue causing the problem
removed_issue = riu.removed_issues.find { |ri| ri.contention_reference_id == contention_id_of_failed_removal }

# Update the removed issue to an open status
removed_issue.update!(closed_status: nil, closed_at: nil)

# Check if there were any other changes on that request issues update
# If the removed issue was the only change, then cancel the request issues update because the update validation will fail if there are no changes
# Note: when adding these notes, we had the first case (one removed issue, no other changes), so the case where there were also other changes may require additional actions.  From what I've checked, I think this should cover it, but it may be good to double check.

if riu.all_updated_issues == [removed_issue]
  riu.canceled!
else
  # Update the request issue update to not remove that issue
  riu.update!(after_request_issue_ids: riu.after_request_issue_ids.push(removed_issue.id))

  # Un-memoize after_issues so that the request issues update recognizes its new state
  riu.instance_variable_set(:@after_issues, nil)

  # Re-process the request issues update
  riu.establish!
end
