-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy paththe-deceptive-anagram-question.html
218 lines (145 loc) · 49.2 KB
/
the-deceptive-anagram-question.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
<!DOCTYPE html>
<html> <head lang=en><meta charset=UTF-8><title>The Deceptive Anagram Question | EF</title><link href=//cdnjs.cloudflare.com/ajax/libs/font-awesome/4.2.0/css/font-awesome.min.css rel=stylesheet><link href=http://fonts.googleapis.com/css?family=Inconsolata rel=stylesheet type=text/css><link rel=stylesheet href=http://nafiulis.me/theme/css/main.css><link rel=stylesheet href=http://nafiulis.me/theme/css/pygment.css><script src=http://nafiulis.me/theme/js/jquery.min.js></script><script src=http://nafiulis.me/theme/js/main.js></script></head> <body> <!--Heading at the top saying "Engineering Fantasy"--> <div id=header_top> <div class=title> <a href=http://nafiulis.me><span id=engineering>Engineering</span><span id=fantasy>Fantasy</span></a> </div> </div> <button type=button class="js-menu-trigger sliding-menu-button button-nav"> <img src=https://raw.githubusercontent.com/thoughtbot/refills/master/source/images/menu-white.png alt="Menu Icon"> </button> <!--Navigation Bar--> <nav class="js-menu sliding-menu-content"> <span class=section-header>Pages</span> <ul> <li><a href=http://nafiulis.me>Home</a></li> <li><a href=http://nafiulis.me/tags.html>Tags</a></li> <li><a href=http://nafiulis.me/pages/about-me.html>About Me</a></li> </ul> <span class=section-header>Categories</span> <ul> <li><a href=http://nafiulis.me/category/anime.html>Anime</a></li> <li><a href=http://nafiulis.me/category/education.html>Education</a></li> <li><a href=http://nafiulis.me/category/productivity.html>Productivity</a></li> <li><a href=http://nafiulis.me/category/programming.html>programming</a></li> <li><a href=http://nafiulis.me/category/rants.html>rants</a></li> </ul> </nav> <div class="js-menu-screen menu-screen"></div> <!--Main Container--> <div class=container> <!--Top most menu--> <div id=menu> <div class=left> <a href=http://nafiulis.me/feeds/all.atom.xml><i class="fa fa-rss
icon"></i></a> <a href=https://twitter.com/gamesbrainiac><i class="fa fa-twitter icon"></i></a> </div> <div class=center> <h1 class=message>Nafiul Islam on casting spells with code</h1> </div> <div class=right> <a href=https://github.com/gamesbrainiac><i class="fa fa-github icon"></i></a> <a href=http://stackoverflow.com/users/1624921/games-brainiac><i class="fa fa-stack-overflow icon" style="padding-right: 30px"></i></a> </div> </div> <!--Main blog list container--> <div id=blogroll> <div class=article-container> <h1>The Deceptive Anagram Question</h1> <p class=time>Thursday, 16 April 2015</p> <div class=article-content> <div class="contents topic" id=table-of-contents> <p class="topic-title first">Table of Contents</p> <ul class=simple> <li><a class="reference internal" href=#the-question id=id5>The Question</a><ul> <li><a class="reference internal" href=#an-example id=id6>An Example</a></li> </ul> </li> <li><a class="reference internal" href=#quadratic-time id=id7>Quadratic Time</a><ul> <li><a class="reference internal" href=#initial-solution id=id8>Initial Solution</a></li> </ul> </li> <li><a class="reference internal" href=#linear-time id=id9>Linear Time</a><ul> <li><a class="reference internal" href=#hashing-the-right-way id=id10>Hashing The Right Way</a><ul> <li><a class="reference internal" href=#the-pythonic-version id=id11>The Pythonic Version</a></li> </ul> </li> <li><a class="reference internal" href=#ordering-done-right id=id12>Ordering Done Right</a></li> <li><a class="reference internal" href=#profiling id=id13>Profiling</a></li> </ul> </li> <li><a class="reference internal" href=#a-vector-approach-to-hashing id=id14>A Vector Approach to Hashing</a></li> <li><a class="reference internal" href=#the-final-frontier id=id15>The Final Frontier</a><ul> <li><a class="reference internal" href=#but-there-s-an-import-for-that id=id16>But There's An Import For That</a></li> </ul> </li> <li><a class="reference internal" href=#lessons-learnt id=id17>Lessons Learnt</a></li> <li><a class="reference internal" href=#acknowledgments id=id18>Acknowledgments</a></li> </ul> </div> <p>Last month, I had a programming interview. It hadn't gone as I would've liked, but I did get asked a question that I found interesting. The question is deceptively simple, but has a lot of depth to it. Since I failed to solve the problem correctly in the interview, I decided explore the ways in which I could optimize my initially O(n<sup>2</sup>) solution.</p> <p>After a few attempts at solving the problem from different angles, I've come to appreciate the importance of understanding complexity, as well as its limitations.</p> <div class=section id=the-question> <h2><a class=toc-backref href=#id5>The Question</a></h2> <p>The question was simple:</p> <blockquote> Given a list of words L, find all the anagrams in L in the order in which they appear in L.</blockquote> <div class=section id=an-example> <h3><a class=toc-backref href=#id6>An Example</a></h3> <p>Given the input</p> <div class=highlight><pre><span></span><span class=p>[</span><span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"pretty"</span><span class=p>,</span> <span class=s2>"nice"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>]</span>
</pre></div> <p>The desired output would be</p> <div class=highlight><pre><span></span><span class=p>[</span><span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>]</span>
</pre></div> <p>in that order <em>exactly</em>.</p> </div> </div> <div class=section id=quadratic-time> <h2><a class=toc-backref href=#id7>Quadratic Time</a></h2> <p>The naive solution to this will give you an O(n<sup>2</sup>) algorithm. Be warned, the following code may burn your eyes.</p> <div class=section id=initial-solution> <h3><a class=toc-backref href=#id8>Initial Solution</a></h3> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>Counter</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>_ret</span> <span class=o>=</span> <span class=nb>set</span><span class=p>()</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>c</span> <span class=o>=</span> <span class=n>Counter</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=k>for</span> <span class=n>other_word</span> <span class=ow>in</span> <span class=p>[</span><span class=n>w</span> <span class=k>for</span> <span class=n>w</span> <span class=ow>in</span> <span class=n>word_list</span> <span class=k>if</span> <span class=n>w</span> <span class=o>!=</span> <span class=n>word</span><span class=p>]:</span>
<span class=k>if</span> <span class=n>Counter</span><span class=p>(</span><span class=n>other_word</span><span class=p>)</span> <span class=o>==</span> <span class=n>c</span><span class=p>:</span>
<span class=n>_ret</span><span class=o>.</span><span class=n>add</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=n>_ret</span><span class=o>.</span><span class=n>add</span><span class=p>(</span><span class=n>other_word</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>w</span> <span class=k>for</span> <span class=n>w</span> <span class=ow>in</span> <span class=n>word_list</span> <span class=k>if</span> <span class=n>w</span> <span class=ow>in</span> <span class=n>_ret</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>print</span> <span class=n>anagram_finder</span><span class=p>([</span><span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"pretty"</span><span class=p>,</span> <span class=s2>"nice"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>])</span>
</pre></div> <p>This is <em>wrong</em> on so many levels, I don't even know <em>where</em> to begin.</p> <ul class=simple> <li>The problem solves the solution in quadratic time meaning, for more computing power we throw at this the slower it gets per computer. Imagine if we were to put an algorithm like this in the server, we would have serious scaling issues.</li> <li><code>collections.Counter</code> is expensive. It needs to create a dictionary, then it needs to add <em>each</em> character to the dictionary, that means <em>hashing</em> each character.</li> <li>It adds the original word every single time it finds an anagram, and relies on <code>set</code> not to add duplicate values.</li> </ul> <p>I'm probably missing a few points but the point remains; this is seriously bad code.</p> </div> </div> <div class=section id=linear-time> <h2><a class=toc-backref href=#id9>Linear Time</a></h2> <p>So, how can we turn this problem form an O(n<sup>2</sup>) solution into an O(n) solution? Using hash-maps correctly. Unfortunately, I did not come up with the brilliant idea of using a hash-map, but rather my interviewer told me that the way to get O(n) was to use a hash-map.</p> <div class=section id=hashing-the-right-way> <h3><a class=toc-backref href=#id10>Hashing The Right Way</a></h3> <p>In general, whenever you hear the word "hash" you think md5 or SHA. But in reality, a hash is a way to map data in a uniform way. Think of it like this, if you have a the word <em>pool</em> and <em>loop</em>, in the eyes of the anagram solver, they are the same. Why? Because <em>both</em> words use the same characters. In other words, there had to be a uniform way to converting these two words into the <em>same</em> thing. If we were to simply <em>sort</em> the characters in the word, we'd get exactly what we're looking for. Here's a demonstration:</p> <div class=highlight><pre><span></span><span class=o>>>></span> <span class=nb>sorted</span><span class=p>(</span><span class=s2>"loop"</span><span class=p>)</span>
<span class=p>[</span><span class=s1>'l'</span><span class=p>,</span> <span class=s1>'o'</span><span class=p>,</span> <span class=s1>'o'</span><span class=p>,</span> <span class=s1>'p'</span><span class=p>]</span>
<span class=o>>>></span> <span class=nb>sorted</span><span class=p>(</span><span class=s2>"pool"</span><span class=p>)</span>
<span class=p>[</span><span class=s1>'l'</span><span class=p>,</span> <span class=s1>'o'</span><span class=p>,</span> <span class=s1>'o'</span><span class=p>,</span> <span class=s1>'p'</span><span class=p>]</span>
<span class=o>>>></span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=s2>"loop"</span><span class=p>))</span>
<span class=s1>'loop'</span>
<span class=o>>>></span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=s2>"pool"</span><span class=p>))</span>
<span class=s1>'loop'</span>
</pre></div> <p>With that, I had my hashing function and with it, I had my linear solution.</p> <div class=highlight><pre><span></span><span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>w</span><span class=p>):</span> <span class=c1># 1</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=p>{}</span> <span class=c1># 2</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>h</span> <span class=o>=</span> <span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)</span> <span class=c1># 3</span>
<span class=k>if</span> <span class=n>h</span> <span class=ow>not</span> <span class=ow>in</span> <span class=n>hash_dict</span><span class=p>:</span> <span class=c1># 4</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>h</span><span class=p>]</span> <span class=o>=</span> <span class=p>[]</span> <span class=c1># 5</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>h</span><span class=p>]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span> <span class=c1># 6</span>
<span class=k>return</span> <span class=p>[</span><span class=n>anagram</span> <span class=k>for</span> <span class=n>l</span> <span class=ow>in</span> <span class=n>hash_dict</span><span class=o>.</span><span class=n>values</span><span class=p>()</span> <span class=k>for</span> <span class=n>anagram</span> <span class=ow>in</span> <span class=n>l</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>l</span><span class=p>)</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span> <span class=c1># 7</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>print</span><span class=p>(</span><span class=n>anagram_finder</span><span class=p>([</span><span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"pretty"</span><span class=p>,</span> <span class=s2>"nice"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>]))</span>
</pre></div> <p>In <code class=coderef>1</code>, we create the <code>hasher</code> function. The hash is simple, it sorts the string alphabetically using <code>sorted</code>, which returns a list, which we then use as an iterable for <code>"".join</code> to create a string. We do this because python lists are not hashable (because they are mutable).</p> <p>In <code class=coderef>2</code>, inside the <code>anagram_finder</code> function, we create a <code>hash_dict</code>, a dictionary for all our hashes. It must be pointed out that the dictionary, when adding new keys, will hash those keys as well.<a class=footnote-reference href=#howstringsarehashed id=id1>[1]</a> The worst case for <code>hasher</code> is O(n) where n is the length of the word in question, so no issue with the size of the list we're given.</p> <p>In <code class=coderef>3</code> we actually call the <code>hasher</code> to hash the string. In <code class=coderef>4</code>, we check to see if this hash exists in the keys of <code>hash_dict</code>. If not, then we create a new list so that we can append words to it in <code class=coderef>5</code>.</p> <p>In the end, we <em>always</em> append the word to the list of that key in <code class=coderef>6</code>. This means, that every key will always have <em>at least one</em> value stored in its list, and these values are the ones we don't one.</p> <p>The simplified version of <code class=coderef>7</code> is as follows:</p> <div class=highlight><pre><span></span><span class=n>_ret</span> <span class=o>=</span> <span class=p>[]</span>
<span class=k>for</span> <span class=n>l</span> <span class=ow>in</span> <span class=n>hash_dict</span><span class=o>.</span><span class=n>values</span><span class=p>():</span>
<span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>l</span><span class=p>)</span> <span class=o>></span> <span class=mi>1</span><span class=p>:</span>
<span class=n>_ret</span> <span class=o>+=</span> <span class=n>l</span>
</pre></div> <div class=section id=the-pythonic-version> <h4><a class=toc-backref href=#id11>The Pythonic Version</a></h4> <p>The above is great for explanation, but the pythonic version is much smaller:</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>defaultdict</span>
<span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>w</span><span class=p>):</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>list</span><span class=p>)</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span> <span class=c1># 1</span>
<span class=k>return</span> <span class=p>[</span><span class=n>anagram</span> <span class=k>for</span> <span class=n>l</span> <span class=ow>in</span> <span class=n>hash_dict</span><span class=o>.</span><span class=n>values</span><span class=p>()</span> <span class=k>for</span> <span class=n>anagram</span> <span class=ow>in</span> <span class=n>l</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>l</span><span class=p>)</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>print</span><span class=p>(</span><span class=n>anagram_finder</span><span class=p>([</span><span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"pretty"</span><span class=p>,</span> <span class=s2>"nice"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>]))</span>
</pre></div> <p>We've made the code significantly smaller by using a <code>defaultdict</code> in <code class=coderef>1</code>. <code>defaultdict</code> allows us to give it a factory, in this case <code>list</code>, that will automatically create a list with a key if a key does not exist. If it does exist, then it will return that list, and we can append to it.<a class=footnote-reference href=#moreondefaultdict id=id2>[3]</a></p> <p>But wait, we forgot about the ordering.</p> </div> </div> <div class=section id=ordering-done-right> <h3><a class=toc-backref href=#id12>Ordering Done Right</a></h3> <p>We have a quick fix for the ordering, and that is simply to loop through all the words in the initial list and include only those that is in the list of anagrams. The solution is <em>still</em> O(n), but remains highly inefficient. One thought might be to use the <code>collections.OrderedDict</code> class. But although that might <em>seem</em> to work, the ordering will still not match the original in the case where anagrams are not next to each other. For example, the following piece of code will return:</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>OrderedDict</span>
<span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>w</span><span class=p>):</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=n>OrderedDict</span><span class=p>()</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>hash_dict</span><span class=o>.</span><span class=n>setdefault</span><span class=p>(</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>),</span> <span class=p>[])</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>anagram</span> <span class=k>for</span> <span class=n>l</span> <span class=ow>in</span> <span class=n>hash_dict</span><span class=o>.</span><span class=n>values</span><span class=p>()</span> <span class=k>for</span> <span class=n>anagram</span> <span class=ow>in</span> <span class=n>l</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>l</span><span class=p>)</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>print</span><span class=p>(</span><span class=n>anagram_finder</span><span class=p>([</span><span class=s2>"nala"</span><span class=p>,</span> <span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"pretty"</span><span class=p>,</span> <span class=s2>"nice"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>,</span> <span class=s2>"laan"</span><span class=p>]))</span>
</pre></div> <div class=highlight><pre><span></span><span class=p>[</span><span class=s1>'nala'</span><span class=p>,</span> <span class=s1>'laan'</span><span class=p>,</span> <span class=s1>'pool'</span><span class=p>,</span> <span class=s1>'loop'</span><span class=p>,</span> <span class=s1>'loco'</span><span class=p>,</span> <span class=s1>'cool'</span><span class=p>,</span> <span class=s1>'stain'</span><span class=p>,</span> <span class=s1>'satin'</span><span class=p>]</span>
</pre></div> <p><code>nala</code> and <code>laan</code> should not be next to each other. This is because <code>collections.OrderedDict</code> remembers the order in which the <em>keys</em> were added.</p> <p>So, in the end, I stuck to the following:</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>defaultdict</span>
<span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>w</span><span class=p>):</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>list</span><span class=p>)</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>word</span> <span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)])</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>print</span><span class=p>(</span><span class=n>anagram_finder</span><span class=p>([</span><span class=s2>"nala"</span><span class=p>,</span> <span class=s2>"pool"</span><span class=p>,</span> <span class=s2>"loco"</span><span class=p>,</span> <span class=s2>"cool"</span><span class=p>,</span> <span class=s2>"stain"</span><span class=p>,</span> <span class=s2>"satin"</span><span class=p>,</span> <span class=s2>"pretty"</span><span class=p>,</span> <span class=s2>"nice"</span><span class=p>,</span> <span class=s2>"loop"</span><span class=p>,</span> <span class=s2>"laan"</span><span class=p>]))</span>
</pre></div> <p>With that, we solved the <em>ordering</em> problem and still managed to make it linear.</p> </div> <div class=section id=profiling> <h3><a class=toc-backref href=#id13>Profiling</a></h3> <p>We never know how well something will actually work unless we profile it. So I got a big list of words<a class=footnote-reference href=#biglist id=id3>[2]</a> and go to profiling.</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>defaultdict</span>
<span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>w</span><span class=p>):</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>list</span><span class=p>)</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>word</span> <span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)])</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>with</span> <span class=nb>open</span><span class=p>(</span><span class=s1>'wordsEn.txt'</span><span class=p>)</span> <span class=k>as</span> <span class=n>f</span><span class=p>:</span>
<span class=n>w_list</span> <span class=o>=</span> <span class=n>f</span><span class=o>.</span><span class=n>read</span><span class=p>()</span><span class=o>.</span><span class=n>split</span><span class=p>()</span>
<span class=n>anagram_finder</span><span class=p>(</span><span class=n>w_list</span><span class=p>)</span>
</pre></div> <p>I used a library called <a class="reference external" href=https://github.com/what-studio/profiling>profiling</a>, and it can be installed with a simple pip install. After profiling the above code, I found that my <code>hasher</code> function was actually making the the whole process a lot slower, because it was being called <em>twice</em>.</p> <img alt="Profiling without storing hash values" class=align-center src=/images/anagrams_profile_01.png> <p>So, my first attempt (like a good pythonista) was to use <code>functools.lru_cache</code>. That attempt failed spectacularly.</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>defaultdict</span>
<span class=kn>from</span> <span class=nn>functools</span> <span class=kn>import</span> <span class=n>lru_cache</span>
<span class=nd>@lru_cache</span><span class=p>()</span>
<span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>w</span><span class=p>):</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>list</span><span class=p>)</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>word</span> <span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>hash_dict</span><span class=p>[</span><span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)])</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>with</span> <span class=nb>open</span><span class=p>(</span><span class=s1>'wordsEn.txt'</span><span class=p>)</span> <span class=k>as</span> <span class=n>f</span><span class=p>:</span>
<span class=n>w_list</span> <span class=o>=</span> <span class=n>f</span><span class=o>.</span><span class=n>read</span><span class=p>()</span><span class=o>.</span><span class=n>split</span><span class=p>()</span>
<span class=n>anagram_finder</span><span class=p>(</span><span class=n>w_list</span><span class=p>)</span>
</pre></div> <p>This is because there were excessive calls being made all over the place. Even increasing the size of the cache from <code>lru_cache()</code> which has a default if <code>128</code> to <code>lru_cache(11000)</code> which is roughly the size of the word list I'm using. In fact, increasing the <code>lru_cache</code> size to such an amount slowed down the program so much that I didn't wait for it to finish, but the root problem was the same; too many calls were being made all over the place.</p> <img alt="Using LRU cache this time" class=align-center src=/images/anagrams_profile_02.png> <p>So, to compare the two programs, with lru and without lru, we can see that the program <em>without</em> lru was significantly faster than the one <em>with</em> lru (4.77 seconds to 15.31 seconds. With the failure or <code>lru_cache</code> as a feasible solution, I decided to just use a normal dictionary to store hashes. In our original attempt, <code>hasher</code> was being called twice, once to add words to the dictionary, and then in the list comprehension. Why not just use a dictionary to store the hash and the word?</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>defaultdict</span>
<span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>):</span>
<span class=k>return</span> <span class=s2>""</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>word</span><span class=p>))</span>
<span class=k>def</span> <span class=nf>anagram_finder</span><span class=p>(</span><span class=n>word_list</span><span class=p>):</span>
<span class=n>hash_dict</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>list</span><span class=p>)</span>
<span class=n>hashes</span> <span class=o>=</span> <span class=p>{}</span>
<span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span><span class=p>:</span>
<span class=n>hashes</span><span class=p>[</span><span class=n>word</span><span class=p>]</span> <span class=o>=</span> <span class=n>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=n>hash_dict</span><span class=p>[</span><span class=n>hashes</span><span class=p>[</span><span class=n>word</span><span class=p>]]</span><span class=o>.</span><span class=n>append</span><span class=p>(</span><span class=n>word</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>word</span> <span class=k>for</span> <span class=n>word</span> <span class=ow>in</span> <span class=n>word_list</span> <span class=k>if</span> <span class=nb>len</span><span class=p>(</span><span class=n>hash_dict</span><span class=p>[</span><span class=n>hashes</span><span class=p>[</span><span class=n>word</span><span class=p>]])</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
<span class=k>if</span> <span class=vm>__name__</span> <span class=o>==</span> <span class=s1>'__main__'</span><span class=p>:</span>
<span class=k>with</span> <span class=nb>open</span><span class=p>(</span><span class=s1>'wordsEn.txt'</span><span class=p>)</span> <span class=k>as</span> <span class=n>f</span><span class=p>:</span>
<span class=n>w_list</span> <span class=o>=</span> <span class=n>f</span><span class=o>.</span><span class=n>read</span><span class=p>()</span><span class=o>.</span><span class=n>split</span><span class=p>()</span>
<span class=n>anagram_finder</span><span class=p>(</span><span class=n>w_list</span><span class=p>)</span>
</pre></div> <p>We just create a new dictionary, <code>hashes</code> to store all our hash and word pairs. This resulted in a speedup.</p> <img alt="This time with the hashes dictionary" class=align-center src=/images/anagrams_profile_03.png> </div> </div> <div class=section id=a-vector-approach-to-hashing> <h2><a class=toc-backref href=#id14>A Vector Approach to Hashing</a></h2> <p>Another approach to hashing would be to use a vector to determine the number of letters that appear in a word. This is better explained through code:</p> <div class=highlight><pre><span></span><span class=k>def</span> <span class=nf>hasher</span><span class=p>(</span><span class=n>word</span><span class=p>):</span>
<span class=n>h</span> <span class=o>=</span> <span class=p>[</span><span class=mi>0</span><span class=p>]</span> <span class=o>*</span> <span class=mi>26</span> <span class=c1># 1</span>
<span class=k>for</span> <span class=n>c</span> <span class=ow>in</span> <span class=n>word</span><span class=p>:</span>
<span class=n>h</span><span class=p>[</span><span class=nb>ord</span><span class=p>(</span><span class=n>c</span><span class=p>)</span> <span class=o>-</span> <span class=nb>ord</span><span class=p>(</span><span class=s1>'a'</span><span class=p>)]</span> <span class=o>+=</span> <span class=mi>1</span> <span class=c1># 2</span>
<span class=k>return</span> <span class=nb>tuple</span><span class=p>(</span><span class=n>h</span><span class=p>)</span> <span class=c1># 3</span>
</pre></div> <p>In <code class=coderef>1</code>, we create a list of length 26, having all its values initialized to 0. In <code class=coderef>2</code>, we calculate the rank of the letter in the english alphabet using <code>ord(c) - ord(a)</code>. We use this rank as the index of our list. We then add 1 to the value of the index. In other words we are merely counting the frequency of the letters in a word; very similar to a histogram.</p> <p>The above is O(n) compared to the O(n log(n)) solution that sorting a string gives us (n is the number of characters in the word here). However, considering that the average size of a word in the english language is actually 5 and the average size of a word in my <code>wordsEn.txt</code> file is 8, O(n) actually becomes the <em>smaller</em> problem here, and the constant time that is required to create a list of 26 items is the bigger problem. In other words, its O(26) for list creation and initiation. Its O(n) for creating the rank and its O(26) for creating the tuple.</p> <p>Compare that to O(n log(n)) worst case for using <code>sorted</code>. In this particular use case since n is very low, the better option is to use <code>sorted</code>.</p> </div> <div class=section id=the-final-frontier> <h2><a class=toc-backref href=#id15>The Final Frontier</a></h2> <p>The best solution to this problem however is not one of my own making, but rather one that a friend of mine (Alexander) provided me with. My solution actually is superfluous in many cases, but the following cuts straight to the point:</p> <div class=highlight><pre><span></span><span class=kn>from</span> <span class=nn>collections</span> <span class=kn>import</span> <span class=n>defaultdict</span>
<span class=kn>from</span> <span class=nn>itertools</span> <span class=kn>import</span> <span class=n>izip</span>
<span class=k>def</span> <span class=nf>get_anagrams</span><span class=p>(</span><span class=n>words</span><span class=p>):</span>
<span class=n>normalized</span> <span class=o>=</span> <span class=p>[</span><span class=s1>''</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span> <span class=k>for</span> <span class=n>w</span> <span class=ow>in</span> <span class=n>words</span><span class=p>]</span> <span class=c1># 1</span>
<span class=n>d</span> <span class=o>=</span> <span class=n>defaultdict</span><span class=p>(</span><span class=nb>int</span><span class=p>)</span> <span class=c1># 2</span>
<span class=k>for</span> <span class=n>n</span> <span class=ow>in</span> <span class=n>normalized</span><span class=p>:</span> <span class=c1># 3</span>
<span class=n>d</span><span class=p>[</span><span class=n>n</span><span class=p>]</span> <span class=o>+=</span> <span class=mi>1</span> <span class=c1># 4</span>
<span class=k>return</span> <span class=p>[</span><span class=n>w</span> <span class=k>for</span> <span class=n>w</span><span class=p>,</span> <span class=n>n</span> <span class=ow>in</span> <span class=n>izip</span><span class=p>(</span><span class=n>words</span><span class=p>,</span> <span class=n>normalized</span><span class=p>)</span> <span class=k>if</span> <span class=n>d</span><span class=p>[</span><span class=n>n</span><span class=p>]</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span> <span class=c1># 5</span>
</pre></div> <p>The genius of this approach is to use <code>collections.defaultdict</code> with <code>int</code> as the factory. This basically means that you have a smaller memory footprint. So, in <code class=coderef>1</code>, we get a list of hashed values (which he calls normalized). In <code class=coderef>2</code>, we create a <code>defaultdict</code> of int, this means that all the keys will have an initial value of 0.</p> <p>In <code class=coderef>3</code>, we loop over all the normalized/hashed values. <code class=coderef>4</code> is where the magic happens. So, there will be duplicates in <code>normalized</code>, and we are going to count the number of times each of those normalized values appear.</p> <p>In <code class=coderef>5</code>, we zip over the initial word list, <code>words</code> and <code>normalized</code>. Remember, that <em>all</em> the words are normalized. All we do then is simply check to see if the number of times the normalized value of the word appears in <code>d</code>. If there's more than one occurrence, we add the word to the final list in this list comprehension.</p> <div class=section id=but-there-s-an-import-for-that> <h3><a class=toc-backref href=#id16>But There's An Import For That</a></h3> <p>In the very beginning, I used <code>collections.Counter</code>, to provide the <em>worst</em> possible solution to the problem. But, I think this time, we can actually use <code>collections.Counter</code> correctly, because <code class=coderef>2</code>, <code class=coderef>3</code> and <code class=coderef>4</code> can be shortened to just <code>Counter(normalized)</code>. So the <em>final final</em> (gosh, that feels javaish) version is:</p> <div class=highlight><pre><span></span><span class=k>def</span> <span class=nf>get_anagrams</span><span class=p>(</span><span class=n>words</span><span class=p>):</span>
<span class=n>normalized</span> <span class=o>=</span> <span class=p>[</span><span class=s1>''</span><span class=o>.</span><span class=n>join</span><span class=p>(</span><span class=nb>sorted</span><span class=p>(</span><span class=n>w</span><span class=p>))</span> <span class=k>for</span> <span class=n>w</span> <span class=ow>in</span> <span class=n>words</span><span class=p>]</span>
<span class=n>d</span> <span class=o>=</span> <span class=n>Counter</span><span class=p>(</span><span class=n>normalized</span><span class=p>)</span>
<span class=k>return</span> <span class=p>[</span><span class=n>w</span> <span class=k>for</span> <span class=n>w</span><span class=p>,</span> <span class=n>n</span> <span class=ow>in</span> <span class=n>izip</span><span class=p>(</span><span class=n>words</span><span class=p>,</span> <span class=n>normalized</span><span class=p>)</span> <span class=k>if</span> <span class=n>d</span><span class=p>[</span><span class=n>n</span><span class=p>]</span> <span class=o>></span> <span class=mi>1</span><span class=p>]</span>
</pre></div> </div> </div> <div class=section id=lessons-learnt> <h2><a class=toc-backref href=#id17>Lessons Learnt</a></h2> <ul class=simple> <li>Hash Maps are very important. Learn to use them properly.</li> <li>When making an algorithm don't think standard library, think of finding the best algorithmic solution.</li> <li>After you've found the algorithmic solution try optimizing it using the standard library.</li> <li>Quadratic time is worse than you can imagine.</li> <li>Complexity analysis alone is a blunt instrument so always profile.</li> <li>Just because an algorithm has better worst case complexity doesn't mean that its the best one for the job.</li> <li>Sometimes our brains stop working under pressure. Take a deep breath, stop worrying about what will happen if you <em>don't</em> solved the problem and focus on all the good things that will happen if you <em>do</em>.</li> </ul> </div> <div class=section id=acknowledgments> <h2><a class=toc-backref href=#id18>Acknowledgments</a></h2> <p><strong>Ashwin</strong> (@inspectorG4dget) for helping me answer this question and explaining the vector approach to hashing. Ashwin's help has been paramount to my understanding.</p> <p>To my interviewer for giving me a very good question, and to <strong>Ridwan</strong> (@hjr265) for pointing out flaws in my solution and for encouraging me to write a blog post on this problem.</p> <p><strong>Alexander</strong> (@metaprogrammer) for proof reading this post and providing me with the optimal solution to this problem.<a class=footnote-reference href=#whoisalexander id=id4>[4]</a></p> <hr class=docutils> <table class="docutils footnote" frame=void id=howstringsarehashed rules=none> <colgroup><col class=label><col></colgroup> <tbody valign=top> <tr><td class=label><a class=fn-backref href=#id1>[1]</a></td><td><a class="reference external" href=http://effbot.org/zone/python-hash.htm>Python Hash Algorithms</a> explains in detail how the different hash functions in Python actually work.</td></tr> </tbody> </table> <table class="docutils footnote" frame=void id=biglist rules=none> <colgroup><col class=label><col></colgroup> <tbody valign=top> <tr><td class=label><a class=fn-backref href=#id3>[2]</a></td><td><a class="reference external" href=http://www-01.sil.org/linguistics/wordlists/english/ >wordsEn.txt</a> is the list I used.</td></tr> </tbody> </table> <table class="docutils footnote" frame=void id=moreondefaultdict rules=none> <colgroup><col class=label><col></colgroup> <tbody valign=top> <tr><td class=label><a class=fn-backref href=#id2>[3]</a></td><td><a class="reference external" href=https://www.accelebrate.com/blog/using-defaultdict-python/ >Using defaultdict in Python</a> is a great writeup on how to use <code>defaultdict</code>. It explains the concept of a factory as well.</td></tr> </tbody> </table> <table class="docutils footnote" frame=void id=whoisalexander rules=none> <colgroup><col class=label><col></colgroup> <tbody valign=top> <tr><td class=label><a class=fn-backref href=#id4>[4]</a></td><td>Alexander Kozlovsky is the creator of Pony ORM.</td></tr> </tbody> </table> </div> </div> </div> <div class=post-meta><span class=meta-type>Category: </span> <span><a href=http://nafiulis.me/category/programming.html>Programming</a></span> <span class=meta-type> Tags: </span> <span> <a href=http://nafiulis.me/tag/interview.html>interview</a>, <a href=http://nafiulis.me/tag/puzzles.html>puzzles</a>, <a href=http://nafiulis.me/tag/optimization.html>optimization</a>, <a href=http://nafiulis.me/tag/python.html>python</a>, <a href=http://nafiulis.me/tag/hash-map.html>hash-map</a>, <a href=http://nafiulis.me/tag/algorithms.html>algorithms</a>, <a href=http://nafiulis.me/tag/linear-time.html>linear-time</a>, <a href=http://nafiulis.me/tag/complexity.html>complexity</a> </span> </div> <div id=disqus_thread style="margin-top: 10px; margin-left: 20px; margin-right: 20px;"></div> <script type=text/javascript>
/* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
var disqus_shortname = 'nafiulisme'; // required: replace example with your forum shortname
/* * * DON'T EDIT BELOW THIS LINE * * */
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script> <noscript>Please enable JavaScript to view the <a href=http://disqus.com/?ref_noscript>comments powered by Disqus.</a></noscript> </div> <!--Footer--> <div id=footer> <footer> Code examples licenced under MIT License <br> Copyright <i class="fa fa-copyright"></i> 2018 Quazi Nafiul Islam </footer> </div> </div> <script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-55554110-1', 'auto');
ga('send', 'pageview');
</script> </body> </html>