Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forbid slow in inside comprehensions #864

Open
orsinium opened this issue Oct 9, 2019 · 16 comments
Open

Forbid slow in inside comprehensions #864

orsinium opened this issue Oct 9, 2019 · 16 comments
Labels
Hacktoberfest Hactoberfest fun! help wanted Extra attention is needed level:starter Good for newcomers rule request Adding a new rule

Comments

@orsinium
Copy link
Collaborator

orsinium commented Oct 9, 2019

Rule request

Thesis

len with comprehensions

Iterators have no len, and sometimes I forgetting it.

Bad:

len(1 for el in a if el in b)

Better:

len([1 for el in a if el in b])

in inside comprehensions

Good:

sum(1 for el in a if el in b)

Twice slower, but also ok:

sum(el in b for el in a)

Reasoning

Detect runtime TypeError in advance. We could also detect len from yield-like iterators, but resolving symbols in python always is a non-trivial thing, unfortunately.

@orsinium orsinium added the rule request Adding a new rule label Oct 9, 2019
@sobolevn
Copy link
Member

sobolevn commented Oct 9, 2019

Good idea!

But, we don't assume types as the best practice. mypy catches this case perfectly:

# ex.py
len(1 for el in a if el in b)

Output:

ex.py:1: error: Argument 1 to "len" has incompatible type "Generator[int, None, None]"; expected "Sized"

But, sum is the whole new case. That's about performance and the best practice.
That's something I want to have.

We can forbid to use x in y inside value part of the comprehension. And force people to write 1 or True and if x in y.

@sobolevn sobolevn changed the title Forbid len from lazy comprehension Forbid slow in inside comprehensions Oct 9, 2019
@sobolevn
Copy link
Member

sobolevn commented Oct 9, 2019

We need to measure the speed for different types with timeit
And make a decision based on the results.

@sobolevn sobolevn added this to the Version 0.13 milestone Oct 9, 2019
@sobolevn sobolevn added Hacktoberfest Hactoberfest fun! help wanted Extra attention is needed level:starter Good for newcomers labels Oct 9, 2019
@orsinium
Copy link
Collaborator Author

orsinium commented Oct 9, 2019

From worst to best:

from random import randint 
elements = list(randint(-100000, 100000) for _ in range(1000000))  

%timeit sum(a > 0 for a in elements) 
# 130 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(True for a in elements if a > 0) 
# 92.1 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s = 0
for a in elements:
  if a > 0:
    s += 1
# 73.1 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(1 for a in elements if a > 0)  
# 71.8 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit len([1 for a in elements if a > 0])  
# 62.4 ms ± 9.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@orsinium
Copy link
Collaborator Author

orsinium commented Oct 9, 2019

The last one has a big std deviation. So, I'm not sure is it really the best or some optimization for repetitive list creation.

@ManishAradwad
Copy link
Contributor

Would like to take this! Can u plz assign this to me

@ManishAradwad
Copy link
Contributor

From worst to best:

from random import randint 
elements = list(randint(-100000, 100000) for _ in range(1000000))  

%timeit sum(a > 0 for a in elements) 
# 130 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(True for a in elements if a > 0) 
# 92.1 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
s = 0
for a in elements:
  if a > 0:
    s += 1
# 73.1 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum(1 for a in elements if a > 0)  
# 71.8 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit len([1 for a in elements if a > 0])  
# 62.4 ms ± 9.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, which one should be considered as a best practice??

@sobolevn
Copy link
Member

sobolevn commented Oct 13, 2019

sum(1 for a in elements if a in b) and len([1 for a in elements if a > 0]) win!

It means that it is better to use sum(1 for a in elements if a in b) then sum(a in b for a in elements). And this rule is perfectly valid.

@ManishAradwad
Copy link
Contributor

So, I'm done with the local setup and ready to implement the changes. But, I'm quite confused about how should I approach the issue...
I suppose I should create a new visitor and violation, or is there another simpler way to do this??
Any help is appreciated, Thanks!

@sobolevn
Copy link
Member

@ManishAradwad you need to create a new violation in best_practices, then we can create a new visitor here: https://github.com/wemake-services/wemake-python-styleguide/blob/master/wemake_python_styleguide/visitors/ast/loops.py

We need to visit:

  • ast.ListComp
  • ast.SetComp
  • ast.DictComp
  • ast.GeneratorExp

You can add a new method in the visitor: self._check_slow_in_expression(node)

Here's how our bad node ((a > 0 for a in elements)) looks like (just one example):

GeneratorExp(elt=Compare(left=Name(id='a', ctx=Load(), lineno=1, col_offset=4), ops=[Gt()], comparators=[Num(n=0, lineno=1, col_offset=8)], lineno=1, col_offset=4), generators=[comprehension(target=Name(id='a', ctx=Store(), lineno=1, col_offset=14), iter=Name(id='elements', ctx=Load(), lineno=1, col_offset=19), ifs=[], is_async=0)], lineno=1, col_offset=4)

Then you write the required logic, test it, and submit a PR. I am here to help.

@sobolevn
Copy link
Member

@ManishAradwad sorry for misleading you. This is a refactoring violation.

@ManishAradwad
Copy link
Contributor

sum(1 for a in elements if a in b) and len([1 for a in elements if a > 0]) win!

We are using both sum(1 for a in elements if a in b) and len([1 for a in elements if a > 0]) to find the length of the list, right??

Then what do u mean by

It means that it is better to use sum(1 for a in elements if a in b) then sum(a in b for a in elements). And this rule is perfectly valid.

@ManishAradwad
Copy link
Contributor

Also are you saying that I should add the function self._check_slow_in_expression(node) in the Wrong ComprehensionVisitor??

@sobolevn
Copy link
Member

sobolevn commented Oct 20, 2019

We don't check sum or len function. We check a comprehension inside them or inside any other python code: 1 for a in elements if a in b. And yes, we check it inside WrongComprehensionVisitor

@sobolevn
Copy link
Member

sobolevn commented Nov 4, 2019

Hi @ManishAradwad, how's it going? Do you need any help?

@sobolevn
Copy link
Member

This should be joined with #1008

@sobolevn sobolevn modified the milestones: Version 0.13, Version 0.15 Nov 15, 2019
@ManishAradwad ManishAradwad mentioned this issue Dec 16, 2019
4 tasks
@sobolevn sobolevn modified the milestones: Version 0.16, Version 0.15 aka New runtime Oct 20, 2020
@Dreamsorcerer
Copy link
Contributor

Dreamsorcerer commented Feb 8, 2021

We don't check sum or len function.

I think some clarification on exactly what is forbidden is needed here. I've just read through this twice, and I'm still not clear.

Are we forbidding slow sum/len calls (in which case we do need to check the functions), or something in all comprehensions?

Because some of the bad examples don't make sense to forbid outside of a len/sum call. e.g. [a > 0 for a in elements] might be used as a sequence of True/False values, there's no reason to forbid this unless you know only the True values are actually used (e.g. in sum()).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Hacktoberfest Hactoberfest fun! help wanted Extra attention is needed level:starter Good for newcomers rule request Adding a new rule
Projects
None yet
Development

No branches or pull requests

4 participants