mediatek-filogic: weird tq on wr3000 - wifi instability after few minutes #3305

Open
4 of 5 tasks
maurerle opened this issue Jul 5, 2024 · 6 comments

Comments

@maurerle
Member

maurerle commented Jul 5, 2024

General instability has been seen on mediatek filogic devices with mt7915e, especially on the WR3000 and WAX220, among others.
Note that some devices work better than others. Heavy wifi meshing seems to make the situation worse.

What is the problem?

An example of this behavior is this device:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1718344052951&to=1718403869219&viewPanel=13
image

which shows a strongly varying TQ for the device.

The latest finding is this:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1720175532350&to=1720193698710&var-select_hostname=ffac-seilpforte-wr3000&var-hostname=ffac-seilpforte-wr3000&var-saveinterval=1m&var-nodetolink=0c0e76cf5d5e&viewPanel=13
image

At 1., I restarted the wifi driver using rmmod mt7915e && modprobe mt7915e.
At 2., I added another mesh device that this device could mesh with on mesh1. This triggered the timeout issue, after which the device was not able to reload the firmware.
At 3., I restarted the device, as nothing else helped.

Afterward, the TQ can be seen changing in odd, wave-like patterns.

The current workaround includes reloading the mt7915e driver and rebooting the device once the mt7915e bug from #3154 occurs.
A package for this can be found here: https://github.com/ffac/gluon-packages/tree/main/ffac-mt7915-hotfix/files/lib/gluon/mt7915

As @nrbffs also noted on IRC, some other people have reported instability with these devices as well. Currently, reloading the wifi driver twice a day seems to help in this situation.

This issue is not about #3154 but about the weird changing TQ leading to bad mesh quality and wifi quality.

What is the expected behaviour?

Mesh and wifi quality should be stable on mediatek filogic devices such as the WR3000.

Further steps

  • check logread to find any related messages (nothing found)
  • test what happens if mesh is using HT20 instead of HE20 (decrease wifi standard from wifi6 to wifi4) - did not help
  • test gluon build from openwrt master firmware (did not recognize the radios at all...)
  • check ls /sys/kernel/debug/ieee80211/phy*/mt76 to find something
  • help needed

TX_Stats

I found that on other devices, cat /sys/kernel/debug/ieee80211/phy1/mt76/tx_stats only shows values for 1 to 4, while the affected WR3000 has values for 1 to 8:

Phy 0, Phy band 0
Length:        1 |   2 - 10 |  11 - 19 |  20 - 28 |  29 - 37 |  38 - 46 |  47 - 55 |  56 - 79 |  80 -103 | 104 -127 | 128 -151 | 152 -175 | 176 -199 | 200 -223 | 224 -247 | 
Count:      6743 |     5177 |      604 |      141 |      145 |        0 |        1 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 | 
BA miss count: 7072

Tx Beamformer applied PPDU counts: iBF: 0, eBF: 2461
Tx Beamformer Rx feedback statistics: All: 541, HE: 539, VHT: 2, HT: 0, BW20, NC: 8286, NR: 8589
Tx Beamformee successful feedback frames: 0
Tx Beamformee feedback triggered counts: 0
Tx multi-user Beamforming counts: 0
Tx multi-user MPDU counts: 0
Tx multi-user successful MPDU counts: 0
Tx single-user successful MPDU counts: 482790

Tx MSDU statistics:
AMSDU pack count of 1 MSDU in TXD:   227714 ( 99%)
AMSDU pack count of 2 MSDU in TXD:      180 (  0%)
AMSDU pack count of 3 MSDU in TXD:       86 (  0%)
AMSDU pack count of 4 MSDU in TXD:       71 (  0%)
AMSDU pack count of 5 MSDU in TXD:       39 (  0%)
AMSDU pack count of 6 MSDU in TXD:       31 (  0%)
AMSDU pack count of 7 MSDU in TXD:       20 (  0%)
AMSDU pack count of 8 MSDU in TXD:      129 (  0%)

I do not really know if this is related or not, just a finding.
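
To compare this finding across devices without reading the counters by eye, the AMSDU lines of a saved tx_stats dump can be parsed offline. The following is a minimal sketch, assuming the output was copied to a workstation with Python 3 and that the line format matches the dump above. The function names and the SAMPLE constant are made up for illustration; nothing here is part of mt76 or Gluon.

```python
import re

# Matches lines such as "AMSDU pack count of 2 MSDU in TXD:      180 (  0%)".
AMSDU_RE = re.compile(r"AMSDU pack count of\s+(\d+)\s+MSDU in TXD:\s+(\d+)")


def amsdu_histogram(tx_stats):
    """Return {pack size: count} for every AMSDU line found in the dump."""
    return {int(size): int(count) for size, count in AMSDU_RE.findall(tx_stats)}


def max_used_pack_size(hist):
    """Largest pack size with a non-zero count, or 0 if there is none."""
    return max((size for size, count in hist.items() if count > 0), default=0)


def aggregated_share(hist):
    """Fraction of TXDs that carried more than one MSDU."""
    total = sum(hist.values())
    multi = sum(count for size, count in hist.items() if size > 1)
    return multi / total if total else 0.0


# Counters copied from the WR3000 dump in this issue.
SAMPLE = """\
AMSDU pack count of 1 MSDU in TXD:   227714 ( 99%)
AMSDU pack count of 2 MSDU in TXD:      180 (  0%)
AMSDU pack count of 3 MSDU in TXD:       86 (  0%)
AMSDU pack count of 4 MSDU in TXD:       71 (  0%)
AMSDU pack count of 5 MSDU in TXD:       39 (  0%)
AMSDU pack count of 6 MSDU in TXD:       31 (  0%)
AMSDU pack count of 7 MSDU in TXD:       20 (  0%)
AMSDU pack count of 8 MSDU in TXD:      129 (  0%)
"""

if __name__ == "__main__":
    hist = amsdu_histogram(SAMPLE)
    print("largest pack size in use:", max_used_pack_size(hist))
    print("share of multi-MSDU TXDs: {:.2%}".format(aggregated_share(hist)))
```

For the dump above this reports a largest pack size of 8 and a multi-MSDU share of roughly 0.24%. Running the same script on a dump from an unaffected device would show whether the 1 to 4 versus 1 to 8 difference is more than cosmetic.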

Gluon Version:
v2023.2.3

Site Configuration:
ffac @ v2023.2.3-2

Custom patches:
see site

@maurerle maurerle changed the title mediatek-filogic: weird tq on wr3000 mediatek-filogic: weird tq on wr3000 - wifi instability after few minutes Jul 5, 2024
@blocktrron
Member

Can you check if the tx retries / tx failed counters from iw dev mesh{0,1} station dump are continuously incrementing?

@maurerle
Member Author

maurerle commented Jul 6, 2024

They are slightly increasing, but most of the time, they are constant.

tx failed

root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx failed"
	tx failed:	2228
	tx failed:	47
	tx failed:	2249
	tx failed:	127
	tx failed:	535

# after 10 minutes
root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx failed"
	tx failed:	2236
	tx failed:	85
	tx failed:	2259
	tx failed:	171
	tx failed:	535

tx retries

root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx retries"
	tx retries:	2223
	tx retries:	47
	tx retries:	2234
	tx retries:	124
	tx retries:	506

# after 10 minutes
root@ffac-seilpforte-wr3000:~# iw dev mesh0 station dump | grep "tx retries"
	tx retries:	2231
	tx retries:	85
	tx retries:	2243
	tx retries:	168
	tx retries:	506

batctl p towards a mesh partner often does not work either, with packet loss above 90%.
Does this help?
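
The per-station change between the two snapshots above can be computed rather than compared by eye. This is a minimal sketch, assuming the grep output was saved to a workstation with Python 3 and that the stations appear in the same order in both snapshots (the grepped lines carry no MAC address, so pairing by position is an assumption). The helper names are made up for illustration.

```python
import re


def counter_values(grep_output, counter):
    """Extract the integer values of lines like '<tab>tx failed:<tab>2228'."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(counter) + r":[ \t]*(\d+)[ \t]*$", re.MULTILINE
    )
    return [int(value) for value in pattern.findall(grep_output)]


def counter_deltas(before, after, counter):
    """Per-station increase of a counter, pairing stations by position."""
    old = counter_values(before, counter)
    new = counter_values(after, counter)
    if len(old) != len(new):
        raise ValueError("number of stations changed between snapshots")
    return [b - a for a, b in zip(old, new)]


# "tx failed" values copied from the two snapshots in this comment.
BEFORE = "\ttx failed:\t2228\n\ttx failed:\t47\n\ttx failed:\t2249\n\ttx failed:\t127\n\ttx failed:\t535\n"
AFTER = "\ttx failed:\t2236\n\ttx failed:\t85\n\ttx failed:\t2259\n\ttx failed:\t171\n\ttx failed:\t535\n"

if __name__ == "__main__":
    print(counter_deltas(BEFORE, AFTER, "tx failed"))
```

For the values above, the increases over ten minutes are 8, 38, 10, 44 and 0. Parsing the full station dump and keying by station MAC would be more robust if peers come and go between snapshots.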

@maurerle
Member Author

I just tested the MTK patch:
dd114b5
from @blocktrron's branch:
https://github.com/freifunk-gluon/gluon/compare/main...blocktrron:gluon:mtk-git-txs.patch

It looked good until I reloaded the driver at about 7:20.
Then we had the usual airtime and link stability problems,
until I reloaded the driver again at 08:45.
The problems then started again at 9:30.

image

The logread output still does not hint at anything useful.
So this issue is waiting for other ideas for now :)

@T-X
Contributor

T-X commented Jul 13, 2024

Some driver hiccup is probably the most likely cause. But I still wanted to ask, as it's not clear to me from these graphs alone: Has external traffic causing these losses been ruled out? Is there an airtime graph available? Does it correlate with route changes in batman-adv?

(Years ago, in a test with a specific setup that still used 802.11g, I saw funny route flapping and TQ changes/breakdowns caused by unicast traffic, due to a hidden node problem: https://www.open-mesh.org/projects/batman-adv/wiki/Bcast-hidden-node. There it would oscillate between the good two-hop route and a bad, direct 1-hop route, even with CTS/RTS enabled for unicast traffic. Traffic over the two-hop route would interfere with the batman-adv OGM broadcasts, causing a breakdown in TQ and then a switch to 1-hop. Then the TQ would improve and things would switch back to 2-hop. Rinse and repeat. Usually the hidden node problem should be quite rare though. Maybe even less likely with newer 802.11 revisions / improvements therein?)

@maurerle
Member Author

Thanks for looking into this, @T-X.
In FFAC, the WR3000 and WAX220 are our only filogic devices.
This has not yet occurred on ramips-mt7621 devices (which also use the mt7915e wifi driver).

In our test-setup we experienced the same problems with a WAX220:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=9418654360cb&from=1720773987464&to=1720797801279 (I did not see any correlation with traffic, gateway, reboot or anything else)

There seems to be another test-installation with wr3000 and wax220 which does not have such issues according to its grafana:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca09d90c&from=1720305372298&to=1721030139574
https://grafana.ffac.rocks/d/000000002/node?refresh=1m&orgId=1&var-node=941865436b4b&from=now-30d&to=now-1m
The TQ drops there come from reloading the mt7915 driver three times a day on filogic hardware because of these connectivity problems.

Remotely, I can only debug by checking the TQ to mesh partners (if there are any). The actual symptom is that the wifi is unusable for a client connected to the node while the device has 3-4% TQ to its neighbors. At the same time it might have 100% TQ to mesh-vpn, so it surely is the wifi driver.
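
The remote check described above could be automated along these lines. This is only a sketch: the function name and both thresholds are hypothetical, and it assumes TQ values are already available as percentages, as the Grafana panels show them.

```python
def wifi_mesh_suspect(wifi_tq_percent, vpn_tq_percent, low=10.0, high=90.0):
    """Flag the symptom from this issue: every wifi neighbor has a very low
    TQ while the mesh-vpn link is still good.

    wifi_tq_percent: TQ values (0-100) towards wifi mesh neighbors.
    vpn_tq_percent:  TQ value (0-100) towards mesh-vpn.
    low, high:       hypothetical thresholds, not taken from the issue.
    """
    if not wifi_tq_percent:
        # Without wifi neighbors there is nothing to judge remotely.
        return False
    return max(wifi_tq_percent) <= low and vpn_tq_percent >= high


if __name__ == "__main__":
    # Neighbors at 3-4% TQ while mesh-vpn is at 100%, as described above.
    print(wifi_mesh_suspect([3.0, 4.0], 100.0))
```

Requiring that all wifi neighbors are bad at once is a deliberate choice: a single weak link is more likely a placement or distance problem than a driver problem.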

I don't think it is a flapping route in batman, though I cannot rule this out completely.

@maurerle
Member Author

maurerle commented Sep 2, 2024

First of all: the issue is still present in the latest firmware with updated openwrt, as well as on openwrt master.

I just noticed that a few firmware iterations ago, the max TQ and min TQ were both fluctuating:

image
(see here - wr3000)
another example is this one - wr3000:

The latest v2023.2.x firmware does have a solid max TQ but still a varying min TQ, which seems broken.
AFAIK the wifi is still unstable for clients. I don't know if it got better, as I am only using the mesh quality as a remote monitoring indicator.

The solid max TQ can be seen here - wax220:
image

The same bad mesh symptoms were found on the NWA55AXE as well

nwa55axe
this is ramips-mt7621 though (link to grafana - nwa55axe)

openwrt master

I built firmware from openwrt master to test. On the WR3000 it sometimes did not load the wifi driver at all, and on the WAX220 it showed the above behavior as well.

So the problem is still not solved in openwrt master as of August 2024 (commit 5d2a008670122f3f69eb3ab4f776d9fe9b6d76dd).

3 participants