
Bat Team


Definition

The Bat Team focuses on removing obstacles for regular sprint work. Each week, two engineers are on call; during that week they, along with the engineering lead, form the Bat Team. Their mission is to:

  1. Respond to production issues and update the team and users as necessary
  2. Triage product manager and support requests related to production issues (primarily through resolving support requests in the #appeals-batteam channel)
  3. Monitor Slack for Sentry alerts
  4. Improve the code base to reduce technical debt (if time permits, by picking up issues labeled tech-improvement from their team's sprint backlog)

The name "Bat Team" refers to the "batman" military role. It is a growing software industry practice and an experiment in agile development.

Rituals

The Bat Team listens in the #appeals-batteam channel. They should be the only engineers actively participating in that channel during the week, since the whole goal of the Bat Team is to shield the regular sprint teams from distractions.

Daily async standup: The team does async Slack standup reports each day, listing what happened yesterday, their plans for today, and any blockers they are experiencing. The role of the engineering lead is to remove those blockers.

Friday afternoon Bat Team handoff meeting: Both the current Bat Team and the upcoming Bat Team should attend a handoff meeting on Friday afternoon (check the VA Appeals calendar) in order to reflect on the week and share recommendations and context about any carryover issues. During this time, one member should write down notes in the Bat Team Running Notes doc, so future Bat Team members have easy access and trends can be tracked long-term. When you start your Bat Shift, you should read over the Handoffs doc since your last shift for important updates.

Protocols

1. Respond to Production Issues

The most important priority for Bat Team members is to respond to and facilitate the resolution of production Caseflow issues as they arise. These will most likely be detected via Datadog or PagerDuty alerts in #appeals-app-alerts, by a Support team member responding to a user request, or by users themselves on Slack or email.

If a production issue has been identified, the Caseflow Status Page must be updated by either a Bat Team member or the engineering lead. The person who identified the production incident should post in #appeals-support and tag the current Bat Team members and the engineering lead and receive acknowledgement before updating.

After a production issue has occurred, the engineering lead will schedule a postmortem meeting for the engineering team and will document the incident in the post-mortems folder in the appeals-deployment repository.

2. Triage Requests from the Support and Product Management teams in Slack

Join the #appeals-batteam channel. Support and product folks will be in there and will bring issues to our attention. The Bat Team is expected to respond to and facilitate resolution of all requests from Support in this channel. This does not mean the Bat Team must be the only ones to fix issues, but they should be the first responders and loop in others only as necessary.

If you want to acknowledge a Slack message but don't have time to look at it immediately, use the 🦇 emoji response.

If you are actively looking into a Slack message, add the 👀 emoji response.

When a Slack message has been adequately addressed (question answered, GitHub issue created, etc.), add the ✅ emoji.

Prioritize pairing on investigations, and start a Slack call in the #appeals-batteam channel so others can follow along. This gives everyone a chance to learn about parts of Caseflow that they might not know as well.

If a Support ticket cannot be resolved within 7 days, a GitHub issue must be created to track long-term engineering resolution, following the guidelines below.

GitHub Issue Creation

If a GitHub ticket needs to be created to follow up on a support ticket, to address a flakey test or alert (see below), or for anything else that comes up during a Bat Team shift, please create a GitHub issue in the caseflow repository and include the appropriate labels as well as relevant links (to Slack conversations, test results, Sentry, etc.).

  • type (required): bug, sentry-alert, tech-improvement
  • sprint team: echo, foxtrot, tango, delta
  • product: caseflow-queue, hearings, etc.

Do not use the batteam GitHub label; it is not standardized and does not provide extra information.

These tickets will be picked up during the product team's backlog grooming process and be prioritized in sprints accordingly.

3. Monitor Slack for Sentry Alerts

Each week, Bat Team members should divide responsibility for the team's alert Slack channels and watch them for Sentry alerts, which come from the production Caseflow system.

When a new Sentry alert appears in Slack, it should be investigated as soon as possible. If you cannot investigate it immediately, react with the 🦇 emoji.

Prioritize pairing during investigations. This gives everyone a chance to learn about parts of Caseflow that they might not know as well.

If a GitHub ticket already exists for the underlying issue, the Sentry alert should be ignored for a month.

If a GitHub ticket does not yet exist, create one with a link to the Sentry incident in the ticket description. Add the sentry-alert label to the new ticket, along with any appropriate product-specific labels.

The key evaluation is whether this incident reflects an immediate production issue, particularly affecting data integrity, or whether it can be picked up during normal sprint planning. If it's an immediate production issue, you should escalate to the tech lead for the affected feature, and consult with them about next steps. If it's an outage of some kind, we should convene folks in #appeals-swat. The Bat Team should do just enough investigation to determine further action.

Once the Sentry alert has been triaged, mark it in Slack with the ✅ emoji, and you can ignore the alert in Sentry for a month.

4. Improve the codebase

Pull Requests

Drop links to PRs in the #appeals-batteam channel with the :git-pull-request: emoji. We review each other's PRs, which typically fix tests or Sentry-alert-related bugs.

Flakey tests

A significant blocker for sprint teams is flakey tests. Right now we track them on a single GitHub issue. See Flakey Test Remedies for possible pitfalls.

We spend a time-boxed effort fixing the tests, and if we can't do it in the 1 or 2 hours we allot ourselves, we skip the test with a reason that references the CircleCI URL showing the flake and a note about the Bat Team's efforts.
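
As a purely illustrative sketch (the spec description, CircleCI URL, and message below are placeholders, not a required format), a skipped flakey spec might look like this:

# Hypothetical example -- the description, URL, and note are placeholders.
it "loads the judge assign queue" do
  skip "Flakey; see https://circleci.com/<build-url> -- Bat Team timeboxed a fix, could not reproduce locally"
  # original expectations remain here unchanged
end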

If you skip the test, consider adding the reason why to the Engineering Huddle agenda if you think it merits wider conversation.

Frequently Asked Questions

Preparations

To serve on Bat Team you will need to:

  • Read through this wiki fully, as updates may have been made since you last read it
  • Read the First Responder Manual

Make sure you also have the items below set up. With great power comes great responsibility: since the Bat Team focuses on resolving production issues, please be very careful whenever accessing production.

Slack channels we monitor

Both Bat Team members should be active on & monitoring the following channels:

  • #appeals-batteam (requests from Support)
  • #appeals-app-alerts (many alerts, pay particularly close attention to production alerting from Datadog/PagerDuty)
  • #appeals-support (low volume, discussion channel with Support team. Also gets StatusPage alerts)

The following Slack channels will receive automated alerts. Bat Team should divide the following list amongst themselves based on volume level (updated 2019/12/19) and share the list with #appeals-batteam.

  • #appeals-certification (v. quiet)
  • #appeals-demo (v. quiet)
  • #appeals-dispatch (1-5/day)
  • #appeals-efolder (1-5/day)
  • #appeals-hearings (1-5/day)
  • #appeals-idt (1-5/day)
  • #appeals-queue-alerts (many)
  • #appeals-reader (many)

Ignored Sentry alerts

Slack channels we currently don't monitor

Due to the high number of alerts we can't do anything with, we are not actively watching these channels:

  • #appeals-intake-alerts

As of 12/19/19, the Caseflow Delta team will take over monitoring of #appeals-job-alerts.

If the VBMS API is throwing an error, whom do I contact?

Post the error in question in the #caseflow-vbms-dev channel and provide the error GUID. An example can be found here

If I have a question about VACOLS, what do I do?

  • Take a look at the VACOLS database tables (a console sketch follows below)
  • If the question cannot be answered by the Caseflow team, create a GitHub issue in the dsva-vacols repo
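
For quick orientation, here is a hedged console sketch. The model and column names (VACOLS::Case, bfkey, bfcurloc) are assumptions based on the other VACOLS examples on this page; confirm them against the Caseflow codebase before relying on them.

# Hedged sketch -- model and column names are assumptions; verify before use.
vacols_id = "1234567"
kase = VACOLS::Case.find_by(bfkey: vacols_id)  # the main VACOLS case record
kase&.bfcurloc                                 # current VACOLS location code
VACOLS::Priorloc.where(lockey: vacols_id).order(:locdout).pluck(:locdout, :locstto, :locstrcv)  # location history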

If I receive a support issue regarding a Dispatch claim being "stuck", what do I do?

  1. Look for a Sentry error in #appeals-dispatch channel.
  2. If the error is related to AASM::InvalidTransition, that means the state transition is invalid.
  3. Find the associated dispatch task in the dispatch_tasks table.
  4. Choose the appropriate state for the task by referencing the aasm state machine in the Dispatch::Task model, then update the task manually using the production console (a sketch follows this list).
  5. Example of a similar problem can be found here.
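
A hedged console sketch of steps 3 and 4. The task id, the aasm column name, and the target state are illustrative assumptions; confirm the state machine definition in the Dispatch::Task model before changing anything in production.

# Hypothetical sketch -- verify the aasm definition in Dispatch::Task before running in production.
task_id = 12345                          # id of the stuck dispatch task, from the support ticket
task = Dispatch::Task.find(task_id)
task.aasm.current_state                  # the state the task is stuck in
Dispatch::Task.aasm.states.map(&:name)   # states defined by the model's state machine
task.update!(aasm_state: "completed")    # column name and target state are assumptions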

RedistributedCase::CannotRedistribute Sentry alert

For Sentry alerts about RedistributedCase::CannotRedistribute, check to see if it is currently in the correct VACOLS location (i.e., not 81 "case storage"):

vacols_id = "<paste the id from the Sentry alert page>"
pp VACOLS::Priorloc.where(lockey: vacols_id).order(:locdout).pluck(:locdout, :locstto, :locstrcv)
  • locdout: date location was changed
  • locstto: location code it was changed to
  • locstrcv: css_id for user that changed location. If it was Caseflow, I think the value will be DSUSER

If the last location entry is assigned to a user, then no action is needed. It's typically assigned to a judge and the judge should see it in their "Cases to assign" queue.

If an Automatic Case Distribution job fails to run and errors with ActiveRecord::RecordNotUnique how do I fix it?

  1. Look for the Sentry error and identify the case_id (or vacols_id) value.
  2. If it's an integer, it's a legacy (VACOLS) case.
  3. Review the tasks assigned to the LegacyAppeal with that vacols_id.
appeal = LegacyAppeal.find_by(vacols_id: the_case_id)
puts appeal.structure_render(:id, :status)

3a. If it does not have any tasks where the status is assigned, it may need to be re-distributed. This should now be handled automatically using the string -redistributed within the case_id -- if not, let Team Echo know. Any cases that cannot be automatically distributed will be reported as a CannotRedistribute error.

  • Update the existing DistributedCase with that case_id value to append the string -attempt1 to the value. The judge should then be able to re-run the Distribution job via the "Request more cases" button.
the_case_id = # case ID value identified above
DistributedCase.find_by(case_id: the_case_id).update!(case_id: "#{the_case_id}-attempt1")
# should return true
DistributedCase.find_by(case_id: the_case_id)
# should return nil

3b. If the LegacyAppeal has assigned tasks, the VACOLS location may just need to be updated, which will pull it out of the eligible-for-distribution pool.

  • If the appeal has open hearing tasks, the location should be updated to 'Caseflow'.
    appeal = LegacyAppeal.find_by(vacols_id: the_case_id)
    AppealRepository.update_location!(appeal, LegacyAppeal::LOCATION_CODES[:caseflow])

It is always a good idea to keep track of the last VACOLS case you have handled so we don't miss any.

If you have missed a bunch overnight, here is a way to batch them all:

vacols_ids = [] # fill in the VACOLS ids that were missed overnight
appeals = vacols_ids.map { |vacols_id| LegacyAppeal.find_by(vacols_id: vacols_id) }
appeals_with_active_tasks = appeals.select do |appeal|
  appeal.tasks.active.where.not(type: TrackVeteranTask.name).any?
end.map(&:vacols_id)
appeals_without_active_tasks = vacols_ids - appeals_with_active_tasks
# Confirm this by checking out task trees:
task_trees = appeals.map { |appeal| appeal.structure_render(:status) }
task_trees.each{ |task_tree| puts task_tree }
# Fix the appeals without active tasks
appeals_without_active_tasks.each do |vacols_id|
  DistributedCase.find_by(case_id: vacols_id).update!(case_id: "#{vacols_id}-attempt1")
end
# Fix appeals with active tasks
appeals_with_active_tasks.each do |vacols_id|
  AppealRepository.update_location!(LegacyAppeal.find_by(vacols_id: vacols_id), LegacyAppeal::LOCATION_CODES[:caseflow])
end

Stop Sentry from sending [Alert] on to Slack

You can prevent specific exception classes from generating Slack messages via the lambda.
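
As a purely illustrative sketch (the real filter lives in Caseflow's Sentry/Slack integration, and its name, location, and signature may differ), the lambda's job is roughly to return false for exception classes that should not be forwarded to Slack:

# Hypothetical illustration only -- not the actual Caseflow code.
IGNORED_EXCEPTION_CLASSES = ["AASM::InvalidTransition"].freeze

slack_alert_filter = lambda do |exception|
  # returning false suppresses the Slack message for that exception class
  !IGNORED_EXCEPTION_CLASSES.include?(exception.class.name)
end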

Uncancel an Appeal accidentally closed by Closing Last Issue

If a user removes the final issue on an appeal, they will trigger cancellation of the case. See this Slack thread. If an appeal in this state is brought to Support/Bat Team, it was most likely cancelled accidentally by an attorney or intentionally by someone else.

To determine whether this was done accidentally (and is in need of fixing) or intentionally (and the user needs to be notified as such), we need to confirm who removed the final request issue. Relevant Slack conversation.

Example:

uuid = ""
appeal = Appeal.find_by(uuid: uuid)
puts appeal.structure_render(:status, :closed_at)
# Note the time that the AttorneyTask, JudgeDecisionReviewTask, and RootTask were all cancelled
pp appeal.request_issues.pluck(:id, :closed_status, :closed_at)
# Confirm the most recently closed issue has the same timestamp as the cancelled tasks
remover_id = RequestIssuesUpdate.where(review: appeal, after_request_issue_ids: []).order(updated_at: :desc).first.user_id
User.find(remover_id)

If the person who removed the last issue was not the attorney, this was most likely an intentional case cancellation. In this situation, we would instruct the user to confirm with LRP that the case was intentionally cancelled.

In the case that the attorney did this accidentally, the css_id of the attorney and that of the person who removed the last issue will be the same. The process to fix this is to uncancel the RootTask, JudgeDecisionReviewTask, and AttorneyTask, and instruct the user to add the corrected issue before closing the other one.

Example code:

uuid = ""
appeal = Appeal.find_by(uuid: uuid)
puts appeal.structure_render(:id,:status,:closed_at)
attorney_task_id = ""
judge_dr_task_id = ""
root_task_id= ""
Task.find(root_task_id).update!(status: Constants.TASK_STATUSES.on_hold, closed_at: nil)
Task.find(judge_dr_task_id).update!(status: Constants.TASK_STATUSES.on_hold, closed_at: nil)
Task.find(attorney_task_id).update!(status: Constants.TASK_STATUSES.assigned, closed_at: nil)
puts appeal.structure_render(:id,:status,:closed_at)

Creating a new Job Alert for Slack

See this guide

Cancel an Attorney Task & Return to Judge Assign Queue

This can be completed by users.

Merge user accounts

Read all about it

Intake

Making a "stuck" issue available from a manually cleared claim

How to detect this

  • A user is trying to add a request issue on an intake, but it is ineligible because it is already in active review
  • That same request issue is on another review, whose end product has been cleared
  • The review doesn't have any decision issues, even though it got cleared a while ago

Why it's happening

Users may want to cancel a claim for various reasons, for example if the veteran submitted a Supplemental Claim but didn't submit any additional evidence. The proper way to do this is to remove the review by going to the edit screen and removing the issues. However, users were instead manually clearing these claims in VBMS or Share, possibly because they do not get credit for canceled claims. We have heard that this behavior is now prevented in VBMS; we're not sure about Share.

Whether a user cancels a claim in Caseflow or in VBMS/Share, that's okay: we detect that behavior and close the request issues. However, if a claim gets cleared, we interpret that as meaning it was fully processed and a new award should be generated to reflect the decision, so we start pinging VBMS and BGS for the decision. Once we get the disposition from VBMS (and, if it was a rating issue, the new rating issue from BGS), we create a Caseflow decision issue.

If it was manually cleared, but should have been canceled, then we never get a decision issue. So it appears to us that the issue is still active, preventing users from adding it to a new intake, but it will also never get a decision.

Investigation and resolution

# First, find the cleared claim
v = Veteran.find(1234) # find the veteran
scs = SupplementalClaim.where(veteran_file_number: v.file_number)
scs.first.end_product_establishments

# Let's say the first SC has one end product establishment, and it has a synced_status of "CLR".
# Then it may be what I'm looking for. I may want to check the support ticket for the specific
# issue, or check out other reviews too.
sc = scs.first
epe = sc.end_product_establishments.first

# Double check that there are no decision issues. If there are decision issues, this is not the right review.
sc.decision_issues.empty?

# Double check that the request issues don't have a contention disposition, because sometimes
# decision issue syncing fails due to lack of a new rating issue.
epe.request_issues.first.contention_disposition.nil?

# Check if it got cleared a while ago (more than a month).
epe.last_synced_at

# If all of the above is true, there's a good chance it was manually cleared. Then you can close
# the issues with "no_decision", indicating we never expect to get a decision for them.
epe.request_issues.each { |ri| RequestIssueClosure.new(ri).with_no_decision! }

Undo removing a request issue

This may be needed if a user gets an error when removing an issue. The alert they may submit will be titled "Previous update not yet done processing", and the error class is VBMS::CannotDeleteContention.

Here's an example of code to use; please adapt it to your needs.

Find the request issues update:

uuid_of_review=""

review=HigherLevelReview.find_by(uuid: uuid_of_review)

# Find the request issues update causing the error
# There should theoretically only be one
riu=RequestIssuesUpdate.where(review_type: review.class.name, review_id: review.id).processable.first

# Get this from the error message, for example: <VBMS::Responses::Contention id=\"94613799\"
riu.error
contention_id_of_failed_removal =

# First try re-processing it, just in case someone resolved the problem in VBMS
riu.establish!

# Find the removed issue causing the problem
removed_issue = riu.removed_issues.find{|ri| ri.contention_reference_id == contention_id_of_failed_removal}

# Update the removed issue back to an open status
removed_issue.update!(closed_status: nil, closed_at: nil)

# Check if there were any other changes on that request issues update
# If the removed issue was the only change, then cancel the request issues update because the update validation will fail if there are no changes
# Note: when adding these notes, we had the first case (one removed issue, no other changes), so the case where there were also other changes may require additional actions.  From what I've checked, I think this should cover it, but it may be good to double check. 

if riu.all_updated_issues == [removed_issue]
  riu.canceled!
else
  # Update the request issues update so it no longer removes that issue
  riu.update!(after_request_issue_ids: riu.after_request_issue_ids.push(removed_issue.id))

  # Un-memoize after_issues so that the request issues update recognizes its new state
  riu.instance_variable_set(:@after_issues, nil)

  # Re-process the request issues update
  riu.establish!
end