findspam.py: Add reason user unregistered or non-existent #4131

user12986714 · 2020-06-29T22:44:45Z

Spam posts often show this pattern.

makyen · 2020-06-30T00:58:17Z

Is this actually a good discriminator between spam vs. non-spam?

It appears that this will detect every single post that is by an unregistered user. What is the approximate volume of those per day across SE?

This also detects deleted accounts. SD usually scans new posts fairly quickly after they are posted. How often is a spammer going to have deleted their account prior to SD scanning?

It looks like this will detect every post where the user has been deleted. There are a lot of old posts where the user was deleted sometime between when it was posted and now. I suspect the vast majority of those are not spam. I would suggest that this detection exclude any post that is over X age, where X is quite small (maybe a day, or less?).
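
A minimal sketch of the age cutoff suggested above, not the code in this PR. The function name `should_flag_user_type`, the constant `MAX_POST_AGE_SECONDS`, and the one-day value are hypothetical. The `user_type` strings `unregistered` and `does_not_exist` are the values the SE API reports for a post owner, and `creation_date` is assumed to be a Unix timestamp in seconds.

```python
import time

# Hypothetical cutoff: only consider posts younger than one day.
MAX_POST_AGE_SECONDS = 24 * 60 * 60

# user_type values the SE API reports for unregistered and deleted owners.
SUSPECT_USER_TYPES = {"unregistered", "does_not_exist"}


def should_flag_user_type(user_type, creation_date, now=None):
    """Return True only for recent posts by unregistered or non-existent users.

    creation_date is a Unix timestamp in seconds; None means it is unknown.
    """
    if user_type not in SUSPECT_USER_TYPES or creation_date is None:
        return False
    if now is None:
        now = time.time()
    # Old posts whose owner was deleted long after posting are skipped here.
    return now - creation_date <= MAX_POST_AGE_SECONDS
```

With a cutoff like this, an old post whose author was deleted years later would not be reported, while a fresh post by an unregistered user still would be.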

makyen · 2020-06-30T11:28:58Z

At least some of the requests to the SE API don't actually ask for the user_type, so that information isn't available, unless you change the filter that's being passed to the SE API. I'd have to go through them methodically and determine which ones need to change.
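
As a rough illustration of why the filter matters: when the filter does not request `user_type`, the `owner` object in the API response simply lacks that key. The helper below is a hypothetical sketch (the name `get_user_type` is not from the codebase) that returns `None` in that case instead of raising.

```python
def get_user_type(api_item):
    """Return the owner's user_type, or None if the response did not include it.

    api_item is one decoded item (a dict) from an SE API response.
    """
    # "owner" can be missing entirely, depending on the filter used.
    owner = api_item.get("owner") or {}
    return owner.get("user_type")


# Example items shaped like SE API responses, with and without user_type.
with_type = {"owner": {"user_type": "unregistered", "display_name": "user1"}}
without_type = {"owner": {"display_name": "user2"}}
no_owner = {"title": "Some post"}
```

A caller would need to treat `None` as "unknown" rather than as "not unregistered", since the information was never requested.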

user12986714 · 2020-06-30T23:47:32Z

This PR is currently not working, for unknown reasons.

makyen · 2020-07-01T22:53:56Z

The filter used in API requests in at least two other places needs to change:

These filter parameters also need to be updated by PR #4074. Please coordinate so that both PRs use the same filters, which should include all desired properties. That way, whoever resolves the merge conflicts doesn't have to put in a third effort to reverse engineer the changes made in each PR so that the eventual merge includes all the needed properties. With a little coordination, we should be able to reduce the amount of work for determining the filter values by at least a third, if not two thirds.

In fact, we should be able to use a single value for all four of the filters, because it doesn't matter to the SE API if you specify that you want data from object types which the endpoint doesn't provide.
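
One way to picture the single-filter idea is a shared constant used when building every request. This is only a sketch under assumptions: `SHARED_FILTER` holds a placeholder, not a real filter value; `build_request_url` and `API_BASE` are hypothetical names; and the API version in the URL is assumed.

```python
from urllib.parse import urlencode

# Placeholder only: the real value has to be generated through the SE API so
# that it includes every property needed by all callers.
SHARED_FILTER = "FILTER_PLACEHOLDER"

# Assumed base URL and API version.
API_BASE = "https://api.stackexchange.com/2.2"


def build_request_url(endpoint, ids, site):
    """Build a request URL for an endpoint such as "questions" or "answers".

    Every endpoint gets the same filter; properties of object types that an
    endpoint does not return are simply not used by that endpoint.
    """
    id_list = ";".join(str(i) for i in ids)
    query = urlencode({"site": site, "filter": SHARED_FILTER})
    return "{}/{}/{}?{}".format(API_BASE, endpoint, id_list, query)
```

Keeping the value in one place would also mean that a later PR adding a property only has to regenerate and change one constant.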

Note that PR #3885 restructures the code with the filter in chatcommands.py.

iBug · 2020-07-04T05:44:09Z

Leaving it here for information:

Changes to filters in #4074: Added body_markdown for questions and answers where body is already selected.

Until #4145 has been implemented, you can redo the above filter changes in this PR and expect everything to work as usual.

teward · 2020-07-13T14:34:49Z

> Is this actually a good discriminator between spam vs. non-spam?

From what I know, not really. Anonymous edits happen all the time, and not all of them are spam. (Most anonymous edits tend to lead to unregistered accounts or registrations anyway, and most spam contributors are "Registered" to try to circumvent things.)

> It appears that this will detect every single post that is by an unregistered user. What is the approximate volume of those per day across SE?

I'm assuming this volume is very high, given the number of unregistered users attempting to make edit contributions.

> This also detects deleted accounts. SD usually scans new posts fairly quickly after they are posted. How often is a spammer going to have deleted their account prior to SD scanning?

In a few rare instances this has been the case, but usually it is NOT the case that a user is deleted prior to an SD scan. SD is usually faster at detecting than moderators are at destroying users.

> It looks like this will detect every post where the user has been deleted. There are a lot of old posts where the user was deleted sometime between when it was posted and now. I suspect the vast majority of those are not spam. I would suggest that this detection exclude any post that is over X age, where X is quite small (maybe a day, or less?).

This is important as well.

This PR hasn't had any changes in a week, and has not yet taken any of the above into account.

Further:

> This PR is currently not working, for unknown reasons.

I'm closing this because it's "not working" currently. If you fix the issues, resubmit a PR, but also keep in mind all the comments made thus far.

user12986714 added 4 commits June 29, 2020 18:43

findspam.py: Add reason user unregistered or non-existent

de59f4f

Typo fix

1d941b3

Adjust to create_rule() API

6f9b92c

Make CI happy attempt 1

10d0104

user12986714 added 4 commits June 29, 2020 22:26

_Post.py: Add creation date property

3ec2e5a

Take into consideration post creation date and user name

b4ca430

Bug fix

c02bceb

Further bug fix

1dd67ca

user12986714 added 3 commits June 30, 2020 12:44

apigetpost.py: Include more useful fields

dc2240c

apigetpost.py: Update filter - questions

756bbfd

apigetpost.py: Don't error out if user_type does not exist

740b9b7

user12986714 marked this pull request as draft June 30, 2020 23:47

apigetpost.py: Separate user_type populating process

4f0e9f7

user12986714 force-pushed the patch-42 branch from 3e753ae to 4f0e9f7 Compare July 1, 2020 19:31

makyen self-requested a review July 1, 2020 22:51

teward closed this Jul 13, 2020
