findspam.py: Add reason user unregistered or non-existent #4131

Closed
wants to merge 12 commits into from


user12986714
Contributor

Spam posts often show this pattern.

@makyen
Contributor

makyen commented Jun 30, 2020

Is this actually a good discriminator between spam vs. non-spam?

It appears that this will detect every single post that is by an unregistered user. What is the approximate volume of those per day across SE?

This also detects deleted accounts. SD usually scans new posts fairly quickly after they are posted. How often is a spammer going to have deleted their account prior to SD scanning?

It looks like this will detect every post where the user has been deleted. There are a lot of old posts where the user was deleted sometime between when it was posted and now. I suspect the vast majority of those are not spam. I would suggest that this detection exclude any post that is over X age, where X is quite small (maybe a day, or less?).
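A minimal sketch of the age cutoff suggested above, not code from this PR. It assumes the post is a dict shaped like an SE API response, with `owner.user_type` and a `creation_date` in Unix epoch seconds; the function name, the constant names, and the one-day threshold are all hypothetical.

```python
import time

# Hypothetical threshold: the comment above suggests "maybe a day, or less".
MAX_POST_AGE_SECONDS = 24 * 60 * 60

# Assumed user_type values for accounts that are unregistered or no longer exist.
SUSPECT_USER_TYPES = {"unregistered", "does_not_exist"}


def is_recent_suspect_owner(post, now=None, max_age=MAX_POST_AGE_SECONDS):
    """Return True only for recent posts whose owner is unregistered or missing."""
    now = time.time() if now is None else now
    owner = post.get("owner") or {}
    user_type = owner.get("user_type")
    creation_date = post.get("creation_date")
    # Without user_type or creation_date we cannot decide, so do not flag.
    if user_type not in SUSPECT_USER_TYPES or creation_date is None:
        return False
    return (now - creation_date) <= max_age
```

With a cutoff like this, an old post whose author was deleted long after posting would not be flagged, while a fresh post from an unregistered user still would be.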

@makyen
Contributor

makyen commented Jun 30, 2020

At least some of the requests to the SE API don't actually ask for the user_type, so that information isn't available, unless you change the filter that's being passed to the SE API. I'd have to go through them methodically and determine which ones need to change.
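As a rough illustration of what changing the filter involves, the sketch below builds a URL for the SE API's filter-creation endpoint that asks for `user_type` on the post owner. The API version in the root URL, the `shallow_user.user_type` field path, and the helper name are assumptions made for this example; the actual filters used by SmokeDetector are not shown in this thread.

```python
from urllib.parse import urlencode

# Assumed API root; the version in use by the project is not stated here.
API_ROOT = "https://api.stackexchange.com/2.2"


def build_filter_create_url(include_fields, base="default"):
    """Build a /filters/create URL that adds include_fields to a base filter."""
    params = {
        # The API takes a semicolon-delimited list of "type.field" entries.
        "include": ";".join(include_fields),
        "base": base,
        "unsafe": "false",
    }
    return f"{API_ROOT}/filters/create?{urlencode(params)}"


# Hypothetical usage: request the owner's user_type in addition to the defaults.
url = build_filter_create_url(["shallow_user.user_type"])
```

The filter name returned by that endpoint would then replace the existing `filter` value in each request that needs the extra field.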

@user12986714 user12986714 marked this pull request as draft June 30, 2020 23:47
@user12986714
Contributor Author

This PR is currently not working, for unknown reasons.

@makyen
Contributor

makyen commented Jul 1, 2020

The filter used in API requests in at least two other places needs to change:

These filter parameters also need to be updated by PR #4074. Please coordinate to have both PRs use the same filters which include all desired properties, so that when merge conflicts are being resolved a third effort doesn't have to be put in to reverse engineer whatever changes were made in each PR to have the eventual merge include all the needed properties. We should be able to reduce the amount of work for determining the filter values by at least a third, if not two thirds, with a little coordination.

In fact, we should be able to use a single value for all four of the filters, because it doesn't matter to the SE API if you specify that you want data from object types which the endpoint doesn't provide.
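One way to picture the single shared filter is to keep the fields each PR needs in one place and take their union. This is a sketch only: the constant names and the helper are hypothetical, and the `body_markdown` entries stand in for whatever the other PR actually requires.

```python
# Hypothetical per-PR field requirements, named for illustration only.
FIELDS_FOR_USER_TYPE = ["shallow_user.user_type"]
FIELDS_FOR_BODY_MARKDOWN = ["question.body_markdown", "answer.body_markdown"]


def merge_filter_fields(*field_lists):
    """Union several field lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys(f for fields in field_lists for f in fields))


# One field list that every request site could share.
SHARED_FILTER_FIELDS = merge_filter_fields(FIELDS_FOR_USER_TYPE, FIELDS_FOR_BODY_MARKDOWN)

# Semicolon-delimited form, as used when creating a filter.
SHARED_FILTER_INCLUDE = ";".join(SHARED_FILTER_FIELDS)
```

Because the API ignores requested fields for object types an endpoint does not return, the same combined value can be passed at every call site, which keeps later merge conflicts to a single constant.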

Note that PR #3885 restructures the code with the filter in chatcommands.py.

@iBug
Member

iBug commented Jul 4, 2020

Leaving it here for information:

Changes to filters in #4074: Added body_markdown for questions and answers where body is already selected.

Until #4145 has been implemented, you can redo the above filter changes in this PR and expect everything to work as usual.

@teward
Member

teward commented Jul 13, 2020

> Is this actually a good discriminator between spam vs. non-spam?

From what I know, not really. Anonymous edits happen all the time, and not all of them are spam. (Most anon edits tend to lead to unregistered accounts or registrations anyway, and most spam contributors are "Registered" to try to circumvent things.)

> It appears that this will detect every single post that is by an unregistered user. What is the approximate volume of those per day across SE?

I assume this volume is very high, because of unregistered users making edit attempts and contributions.

> This also detects deleted accounts. SD usually scans new posts fairly quickly after they are posted. How often is a spammer going to have deleted their account prior to SD scanning?

In a few rare instances this has been the case, but usually it is NOT the case that a user is deleted prior to an SD scan. SD is usually faster at detecting than moderators are at destroying users.

> It looks like this will detect every post where the user has been deleted. There are a lot of old posts where the user was deleted sometime between when it was posted and now. I suspect the vast majority of those are not spam. I would suggest that this detection exclude any post that is over X age, where X is quite small (maybe a day, or less?).

This is important as well.


This PR hasn't had any changes in a week, and does not currently take any of the feedback into account.

Further:

> This PR is currently not working, for unknown reasons.

I'm closing this because it's "not working" currently. If you fix the issues, resubmit a PR but also keep in mind all the comments made thus far.

@teward teward closed this Jul 13, 2020
4 participants