Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add suspend stats related check (Bugfix) #1700

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

hanhsuan
Copy link
Contributor

@hanhsuan hanhsuan commented Jan 24, 2025

Description

WARNING: This modifies com.canonical.certification::sru-server

According to the suspend_advanced_auto is the dependency for all after suspend jobs, it would be better to do more tests for this concept to fix #1579.
Therefore, this PR is the first step to collect information by:

  1. valid_suspend_status is used to check success should be non zero
  2. any_fail_during_supend is used to check fail should be zero

If everything woks well, valid_suspend_status will be merge into suspend_advanced_auto in the future.

Resolved issues

#1579

Documentation

Tests

succ 202001-27667
fail 202412-36131

1. test case is_suspend_success will be merged into
   suspend_advanced_auto
2. is_device_supend_success is used to check any fail
Copy link

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 92.15686% with 4 lines in your changes missing coverage. Please review.

Project coverage is 49.16%. Comparing base (e5b7c6f) to head (6021828).
Report is 26 commits behind head on main.

Files with missing lines Patch % Lines
providers/base/bin/suspend_stats.py 92.15% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1700      +/-   ##
==========================================
+ Coverage   49.01%   49.16%   +0.15%     
==========================================
  Files         372      374       +2     
  Lines       40348    40481     +133     
  Branches     6817     6845      +28     
==========================================
+ Hits        19777    19903     +126     
- Misses      19849    19853       +4     
- Partials      722      725       +3     
Flag Coverage Δ
provider-base 25.38% <92.15%> (+0.58%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hanhsuan hanhsuan marked this pull request as ready for review January 24, 2025 05:38
@hanhsuan hanhsuan changed the title Add suspend stats releated check (Bugfix) Add suspend stats related check (Bugfix) Jan 24, 2025
Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, I really like what I see and I wouldn't be against merging (the half that checks we did actually suspend) in suspend advanced auto.

I have 2 big worries:

  1. Is this supported on any kernel we have to support? Even xenial? Can you check? If not, what do we do about it?
  2. Does this work from a strict snap? Can you check?

Comment on lines 69 to 74
return (
self.contents["success"] != "0"
and self.contents["failed_prepare"] == "0"
and self.contents["failed_suspend"] == "0"
and self.contents["failed_resume"] == "0"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big questoion is, why are we comparing these values as strings? Is it even possible that these are not valid integers? if that happens I suppose its a big deal, for this reason I've also suggested to remove the utf8/ignore errors above, if sysfs is broken we want to know

Another big question is what do we do if one or more of these files is missing? is it possible? if it is, when does it happen? should we have a default for those situations?

@Hook25
Copy link
Collaborator

Hook25 commented Jan 31, 2025

Adding this I dug up from the kernel docs: https://github.com/torvalds/linux/blob/69e858e0b8b2ea07759e995aa383e8780d9d140c/Documentation/ABI/testing/sysfs-power#L317C9-L317C40

If this means what I think it means, then we need a fallback mechanism

@hanhsuan
Copy link
Contributor Author

hanhsuan commented Feb 12, 2025

Adding this I dug up from the kernel docs: https://github.com/torvalds/linux/blob/69e858e0b8b2ea07759e995aa383e8780d9d140c/Documentation/ABI/testing/sysfs-power#L317C9-L317C40

If this means what I think it means, then we need a fallback mechanism

I've checked it on the 201604-21849

ubuntu@201604-21849:/sys/power$ grep -r "" *
disk:[platform] shutdown reboot suspend 
image_size:6662103040
pm_async:1
pm_freeze_timeout:20000
pm_print_times:0
pm_test:[none] core processors platform devices freezer
pm_trace:0
pm_trace_dev_match:usb
grep: pm_wakeup_irq: No data available
reserved_size:1048576
resume:0:0
state:freeze mem disk
wake_lock:
wake_unlock:
wakeup_count:0

The things might be used to check are

ubuntu@201604-21849:~$ grep "PM\:" 1604_suspend.log 
Feb 11 22:13:20 201604-21849 kernel: PM: Syncing filesystems ... done.
Feb 11 22:13:20 201604-21849 kernel: PM: Preparing system for sleep (mem)
Feb 11 22:13:51 201604-21849 kernel: PM: Suspending system (mem)
Feb 11 22:13:51 201604-21849 kernel: PM: suspend of devices complete after 1155.793 msecs
Feb 11 22:13:51 201604-21849 kernel: PM: late suspend of devices complete after 20.104 msecs
Feb 11 22:13:51 201604-21849 kernel: PM: noirq suspend of devices complete after 17.105 msecs
Feb 11 22:13:51 201604-21849 kernel: PM: Saving platform NVS memory
Feb 11 22:13:51 201604-21849 kernel: PM: Restoring platform NVS memory
Feb 11 22:13:51 201604-21849 kernel: PM: noirq resume of devices complete after 15.017 msecs
Feb 11 22:13:51 201604-21849 kernel: PM: early resume of devices complete after 17.481 msecs
Feb 11 22:13:51 201604-21849 kernel: PM: resume of devices complete after 449.473 msecs
Feb 11 22:13:51 201604-21849 kernel: PM: Finishing wakeup.

and

ubuntu@201604-21849:~$ grep "systemd-sleep" 1604_suspend.log 
Feb 11 22:13:20 201604-21849 systemd-sleep[16925]: Selected interface 'mlan0'
Feb 11 22:13:20 201604-21849 systemd-sleep[16925]: OK
Feb 11 22:13:20 201604-21849 systemd-sleep[16925]: Suspending system...
Feb 11 22:13:51 201604-21849 systemd-sleep[16925]: System resumed.
Feb 11 22:13:51 201604-21849 systemd-sleep[16925]: Selected interface 'mlan0'
Feb 11 22:13:51 201604-21849 systemd-sleep[16925]: OK

Do you think it is enough to check the output of journal -b 0 -g systemd-sleep should not include error for validating the suspend status in 16.04 and 18.04 ? The log file is uploaded to CHECKBOX-1642

@hanhsuan
Copy link
Contributor Author

hanhsuan commented Feb 17, 2025

According to this, I think the failed_* indicates one or more device failed. Therefore the kernel changes the success from 0 to 1.
The device failed log:

Feb 17 07:49:24 ubuntu-202412-36132 kernel: spd5118 2-0050: PM: dpm_run_callback(): spd5118_resume [spd5118] returns -6
Feb 17 07:49:24 ubuntu-202412-36132 kernel: spd5118 2-0050: PM: failed to resume async: error -6

Therefore, I would like to change the method is_after_suspend to

    def is_after_suspend(self) -> bool:
        """
        The system is under after suspend status or not

        :returns: return Ture while system is under after suspend status
        """
        return self.contents["success"] != "0"

and is_any_failed to

    def is_any_failed(self) -> bool:
        """
        Is any failed during suspend

        :returns: return Ture while one failed during suspend
        """
        for c, v in self.contents.items():
            if c.startswith("fail") and v != "0":
                return True
        return False

@Hook25 How do you think ?

1. is_after_suspend --> success != 0
2. is_any_failed --> fail* != 0
@hanhsuan
Copy link
Contributor Author

New check condition test result:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

suspend/suspend_advanced_auto always returns passed result
2 participants