<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
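<title>Scraping Qiushibaike Jokes with Python Regular Expressions | Hm's Blog</title>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta name="description" content="Many people find regular expressions painful, but writing one follows a repeatable routine. After working through this article you should be able to write a regular expression that extracts content from most HTML pages. To keep the code consistent, this article assumes Python 3; running it under Python 2 may need some adjustments.">
<meta property="og:type" content="article">
<meta property="og:title" content="Scraping Qiushibaike Jokes with Python Regular Expressions">
<meta property="og:url" content="http://huangming.github.io/2016-08-06-python-zhengze-pachong-qiubai.html">
<meta property="og:site_name" content="Hm's Blog">
<meta property="og:description" content="Many people find regular expressions painful, but writing one follows a repeatable routine. After working through this article you should be able to write a regular expression that extracts content from most HTML pages. To keep the code consistent, this article assumes Python 3; running it under Python 2 may need some adjustments.">
<meta property="og:updated_time" content="2016-08-07T14:40:58.662Z">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="Scraping Qiushibaike Jokes with Python Regular Expressions">
<meta name="twitter:description" content="Many people find regular expressions painful, but writing one follows a repeatable routine. After working through this article you should be able to write a regular expression that extracts content from most HTML pages. To keep the code consistent, this article assumes Python 3; running it under Python 2 may need some adjustments.">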
<meta name="twitter:creator" content="@hmorz">
<link rel="alternative" href="/atom.xml" title="Hm's Blog" type="application/atom+xml">
<link rel="icon" href="/favicon.png">
<!-- <link href="//fonts.googleapis.com/css?family=Source+Code+Pro" rel="stylesheet" type="text/css"> -->
<link rel="stylesheet" href="/css/font-awesome.min.css">
<link href="/css/font-quigleywiggly.css" rel="Stylesheet" type="text/css" />
<link rel="stylesheet" href="/css/style.css" type="text/css">
<script src="/js/jquery-1.4.2.min.js" type="text/javascript"></script>
<script src="/js/girls.js" type="text/javascript"></script>
<script type="text/javascript" charset="utf-8" src="/js/weather.js"></script>
<link href="http://huangming.github.io/stylesheets/main.css" rel="stylesheet" media="all"/>
<link href="http://huangming.github.io/images/weather/default/julying-weather.css" rel="stylesheet" media="all"/>
<!--
-->
</head>
<body>
<div id="container">
<div id="wrap">
<header id="header">
<div id="banner"></div>
<div id="header-outer" class="outer">
<div id="header-title" class="inner">
<h1 id="logo-wrap">
<a href="/" id="logo">Hm's Blog</a>
</h1>
<h2 id="subtitle-wrap">
<a href="/" id="subtitle">I am here</a>
</h2>
</div>
<div id="header-inner" class="inner">
<nav id="main-nav">
<a id="main-nav-toggle" class="nav-icon"></a>
<a class="main-nav-link" href="/">Home</a>
<a class="main-nav-link" href="/archives">Archives</a>
<a class="main-nav-link" href="/vimwiki">Wiki</a>
</nav>
<nav id="sub-nav">
<a id="nav-rss-link" class="nav-icon" href="/atom.xml" title="RSS Feed"></a>
<a id="nav-search-btn" class="nav-icon" title="Search"></a>
<div id="search-form-wrap">
<form action="//google.com/search" method="get" accept-charset="UTF-8" class="search-form"><input type="search" name="q" results="0" class="search-form-input" placeholder="Search"><button type="submit" class="search-form-submit"></button><input type="hidden" name="sitesearch" value="http://huangming.github.io"></form>
</div>
</nav>
</div>
</div>
</header>
<div class="outer">
<section id="main"><article id="post-python-zhengze-pachong-qiubai" class="article article-type-post" itemscope itemprop="blogPost">
<!-- <div class="article-meta"> -->
<!-- <a href="/2016-08-06-python-zhengze-pachong-qiubai.html" class="article-date">
<time datetime="2016-08-06T08:25:54.000Z" itemprop="datePublished">2016-08-06</time>
</a> -->
<!--
<div class="article-category">
<a class="article-category-link" href="/categories/python/">python</a>
</div>
-->
<!-- </div> -->
<div class="article-inner">
<header class="article-header">
<h1 class="article-title" itemprop="name">
Scraping Qiushibaike Jokes with Python Regular Expressions
</h1>
<div class="article-meta">
<a href="/2016-08-06-python-zhengze-pachong-qiubai.html" class="article-date">
<time datetime="2016-08-06T08:25:54.000Z" itemprop="datePublished">2016-08-06</time>
</a>
<div class="article-category">
<a class="article-category-link" href="/categories/python/">python</a>
</div>
</div>
</header>
<div class="article-entry" itemprop="articleBody">
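<p>Many people find regular expressions painful, but writing one follows a repeatable routine. After working through this article you should be able to write a regular expression that extracts content from most HTML pages.<br>To keep the code consistent, this article assumes Python 3; running it under Python 2 may need some adjustments.</p>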
<a id="more"></a>
<h2 id="从网上获取整个html">Fetching the whole HTML page</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib.parse</span><br><span class="line"><span class="keyword">import</span> urllib.request</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">QSBK</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self)</span>:</span></span><br><span class="line">        self.page = <span class="number">1</span></span><br><span class="line">        <span class="comment"># page number to request</span></span><br><span class="line">        self.user_agent = <span class="string">'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'</span></span><br><span class="line"> </span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">get_stories</span><span class="params">(self)</span>:</span></span><br><span class="line">        url = <span class="string">"http://www.qiushibaike.com/hot/page/"</span>+str(self.page)</span><br><span class="line">        <span class="comment"># build the request URL</span></span><br><span class="line">        req = urllib.request.Request(url)</span><br><span class="line">        req.add_header(<span class="string">'User-Agent'</span>, self.user_agent)</span><br><span class="line">        response = urllib.request.urlopen(req)</span><br><span class="line">        the_page = response.read().decode(<span class="string">"utf-8"</span>)</span><br><span class="line">        <span class="keyword">return</span> the_page</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">'__main__'</span>:</span><br><span class="line">    qb = QSBK()</span><br><span class="line">    <span class="comment"># print(qb.get_stories()) raised an encoding error here, probably because a few</span></span><br><span class="line">    <span class="comment"># characters late in the HTML are encoded differently; printing only the first part works</span></span><br><span class="line">    print(qb.get_stories()[:<span class="number">5000</span>])</span><br></pre></td></tr></table></figure>
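<p>The encoding error mentioned in the comment above is often a limitation of the console rather than of the page: <code>print</code> raises <code>UnicodeEncodeError</code> when the terminal encoding cannot represent a character. That diagnosis is an assumption, since the original traceback is not shown. A minimal sketch of one workaround follows; <code>console_safe</code> is a hypothetical helper name, not part of the original code.</p>

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to UTF-8 when stdout reports no encoding (for example when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    # The snowman cannot be encoded as ASCII, so it is replaced instead of raising.
    print(console_safe("joke \u2603", encoding="ascii"))
```

<p>With a helper like this, <code>print(console_safe(qb.get_stories()))</code> would show the whole page instead of only the first 5000 characters, at the cost of a few replaced characters.</p>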
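<h2 id="编写正则表达式">Writing the regular expression</h2><p>To write the regular expression, first summarize how a joke is laid out on Qiushibaike: it starts with <code><div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag</code>, followed by a number of blank lines and HTML <code><div></code> blocks, then <code><div\sclass="content"></code>, possibly a few blank lines, then the joke text we want, and finally blank lines and a <code></div></code> that closes the joke. A lot of content sits between the opening marker and the joke. To skip over it with <code>.*</code>, the dot has to match across lines, which requires the re.S flag.</p>
<p>This article uses <code>re.findall(pattern, string[, flags])</code>. It is worth skimming its documentation first, paying particular attention to <code>flags=re.X</code> and <code>flags=re.S</code>; together these two flags make it possible to write regular expressions that are readable and easy to follow:</p>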
<ul>
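<li>re.S:DOTALL<br>Makes the "." special character match any character at all, including a newline; without this flag, "." matches anything except a newline.</li>
<li>re.X:VERBOSE<br>This flag lets you write regular expressions that are easier to read by giving you more flexible formatting. When it is set, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. It also lets you put comments in the RE that the engine ignores; a comment starts with "#", unless that character is in a character class or preceded by an unescaped backslash.</li>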
</ul>
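<p>The effect of the two flags can be checked on a tiny input. This is a small sketch on a made-up three-line string, not on the real page:</p>

```python
import re

text = "<p>\nhello\n</p>"

# Without re.S, "." stops at newlines, so nothing between the tags can match.
without_dotall = re.findall(r"<p>(.*)</p>", text)

# With re.S, "." also matches "\n", so the group spans several lines.
with_dotall = re.findall(r"<p>(.*)</p>", text, re.S)

# re.X lets the same pattern be laid out over several lines with comments.
verbose = re.compile(
    r"""
    <p>     # opening tag
    (.*)    # everything in between, newlines included thanks to re.S
    </p>    # closing tag
    """,
    re.S | re.X,
)

if __name__ == "__main__":
    print(without_dotall)         # []
    print(with_dotall)            # ['\nhello\n']
    print(verbose.findall(text))  # ['\nhello\n']
```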
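<p>Once the page has been fetched, the next step is a regular expression that matches the content. We first work out a correct expression in a separate program and then add it to the code above.</p>
<p>We start from a fragment of the HTML as a worked example.</p>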
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> re</span><br><span class="line">html = <span class="string">'''<div class="article block untagged mb15" id='qiushi_tag_117188488'></span><br><span class="line"></span><br><span class="line"><div class="author clearfix"></span><br><span class="line"><a href="/users/29552012/" target="_blank" rel="nofollow"></span><br><span class="line"><img src="http://pic.qiushibaike.com/system/avtnew/2955/29552012/medium/20150723232021.jpg" alt="invictusmaneo"/></span><br><span class="line"></a></span><br><span class="line"><a href="/users/29552012/" target="_blank" title="invictusmaneo"></span><br><span class="line"><h2>invictusmaneo</h2></span><br><span class="line"></a></span><br><span class="line"></div></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><div class="content"></span><br><span class="line"></span><br><span class="line">昨天在食堂打菜 给人碰了一下 菜汤泼裤子上了 当时愁的 心说油渍多难洗啊 吃完饭 看了眼裤子 连个印子都没了 菜里根本就没油啊!!! 
是我想太多了..</span><br><span class="line"></span><br><span class="line"></div></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><div class="stats"></span><br><span class="line"><span class="stats-vote"><i class="number">7152</i> 好笑</span></span><br><span class="line"><span class="stats-comments">'''</span></span><br><span class="line">pattern = <span class="string">'''<div class="article block untagged mb15" id='qiushi_tag_117188488'>'''</span></span><br><span class="line">myItems = re.findall(pattern,html,re.S|re.X)</span><br><span class="line">print(myItems)</span><br></pre></td></tr></table></figure>
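<h3 id="1-空格">1. Spaces</h3><p>We start with a short pattern to see whether it matches, and it does not. There is no need to panic when a pattern fails; just shrink it. <code>pattern='''<div'''</code> matches, while <code>pattern='''<div class'''</code> does not, so the space is the problem: under <code>re.X</code>, whitespace in the pattern is ignored. A quick look at any Python regex reference shows that whitespace can be matched with <code>\s</code>, so we replace every space in the line:<br><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_117188488'>'''</span></span><br></pre></td></tr></table></figure></p>
<p>Running it now produces a match. Next we look at matching across several lines.</p>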
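<p>The behaviour of spaces under <code>re.X</code> can be reproduced on a one-line input. The sketch below uses a made-up tag and raw strings; besides <code>\s</code>, a space inside a character class or a backslash-escaped space also survives verbose mode.</p>

```python
import re

html = '<div class="content">'

# Under re.X the literal space is dropped, so this pattern means "<divclass".
ignored_space = re.findall(r'<div class', html, re.X)

# Three ways to match a space that survive re.X.
with_backslash_s = re.findall(r'<div\sclass', html, re.X)  # any whitespace character
with_char_class = re.findall(r'<div[ ]class', html, re.X)  # a literal space only
with_escaped = re.findall(r'<div\ class', html, re.X)      # a literal space only

if __name__ == "__main__":
    print(ignored_space)     # []
    print(with_backslash_s)  # ['<div class']
    print(with_char_class)   # ['<div class']
    print(with_escaped)      # ['<div class']
```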
<h3 id="2-多行">2. Multiple lines</h3><p>First try to match two consecutive lines. Remember to replace the spaces with <code>\s</code> first.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="author\sclearfix"></span><br><span class="line"><a\shref="/users/27990908/"\starget="_blank"\srel="nofollow">'''</span></span><br></pre></td></tr></table></figure>
<p>This does not match, so line breaks probably need special handling as well. A quick search suggests writing <code>\n</code> in place of the line break.</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="author\sclearfix">\n<a\shref="/users/27990908/"\starget="_blank"\srel="nofollow">'''</span></span><br></pre></td></tr></table></figure>
<p>The result is empty, so that is not right either. Next, try <code>\n</code> followed by a real line break:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="author\sclearfix">\n</span><br><span class="line"><a\shref="/users/27990908/"\starget="_blank"\srel="nofollow">'''</span></span><br></pre></td></tr></table></figure>
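<p>The result is still empty, so it is the line break that fails to match: each of the two lines matches on its own, and adding the line break breaks the pattern. What else could go wrong with <code>\n</code>? The backslash, written <code>\</code>. In an ordinary (non-raw) Python string, the escape is converted to a real newline character before the regex engine sees it, and <code>re.X</code> then discards that newline as whitespace. Writing <code>\\n</code> passes a backslash followed by "n" to the regex engine, which reads it as "match a newline":</p>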
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="author\sclearfix">\\n<a\shref="/users/27990908/"\starget="_blank"\srel="nofollow">'''</span></span><br></pre></td></tr></table></figure>
<p>This one matches. Because <code>re.X</code> is in effect, we can also split the pattern across two lines:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="author\sclearfix">\\n</span><br><span class="line"> <a\shref="/users/27990908/"\starget="_blank"\srel="nofollow">'''</span></span><br></pre></td></tr></table></figure>
<p>This matches as well, which confirms that under <code>re.X</code> literal spaces and line breaks in the pattern have no effect.</p>
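<p>The three spellings of the line break can be compared directly. The sketch below uses a made-up two-tag string; the raw-string form is not used in the original article, but it produces exactly the same pattern as the doubled backslash.</p>

```python
import re

html = "<b>\n<i>"  # two tags separated by one newline character

# Ordinary literal: Python converts the escape to a newline before re sees it,
# and re.X then discards that newline as insignificant whitespace.
real_newline = "<b>\n<i>"

# Escaped backslash: re receives a backslash followed by "n",
# which the regex engine itself interprets as "match a newline".
escaped = "<b>\\n<i>"

# Raw string: the same two characters without doubling the backslash.
raw = r"<b>\n<i>"

if __name__ == "__main__":
    print(re.findall(real_newline, html, re.X))  # []
    print(re.findall(escaped, html, re.X))       # ['<b>\n<i>']
    print(escaped == raw)                        # True
```

<p>Raw strings also avoid the warning that recent Python versions emit for sequences such as <code>\s</code> in an ordinary string literal.</p>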
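<h3 id="3-分组">3. Groups</h3><p>Everything matched so far is the full text of <code>pattern</code>, but in the end we only want the joke itself, so we need a way to select part of a match. Any overview of <code>Python regular expressions</code> shows that a parenthesized group makes the result contain only the content of the group. For example:</p>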
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="author\sclearfix">\\n</span><br><span class="line"> <a\shref="/users/27990908/"\starget="_blank"\srel="(nofollow)">'''</span></span><br></pre></td></tr></table></figure>
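<p>Running this returns <code>nofollow</code>. So all we need to do is extend <code>pattern</code> up to the joke and wrap the joke text in parentheses. The blocks around each joke share the same overall structure but differ in the details, and those differences have to be handled too; otherwise a <code>pattern</code> like the one above matches only a single block, and so would return only a single joke. The obvious difference here is the number <code>27990908</code>. Replace it with <code>\d+</code>, which matches any run of digits.</p>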
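<p>How <code>re.findall</code> shapes its result depends on the number of groups in the pattern. A short sketch on two made-up anchor tags:</p>

```python
import re

html = (
    '<a href="/users/27990908/" rel="nofollow">first</a>\n'
    '<a href="/users/29552012/" rel="nofollow">second</a>'
)

# No group: findall returns the whole match.
whole = re.findall(r'/users/\d+/', html)

# One group: findall returns only the group, as plain strings.
ids = re.findall(r'/users/(\d+)/', html)

# Two groups: findall returns one tuple per match.
pairs = re.findall(r'/users/(\d+)/"\srel="(\w+)"', html)

if __name__ == "__main__":
    print(whole)  # ['/users/27990908/', '/users/29552012/']
    print(ids)    # ['27990908', '29552012']
    print(pairs)  # [('27990908', 'nofollow'), ('29552012', 'nofollow')]
```

<p>With a single group each result is a string, so indexing a result with <code>[1]</code> would return its second character rather than a second field.</p>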
<h3 id="4-贪婪、非贪婪匹配">4. Greedy and non-greedy matching</h3><p>Back to the <code>pattern</code> we started with:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>'''</span></span><br></pre></td></tr></table></figure>
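<p>This matches the opening tag, but a large block of content follows that we do not care about; we only care about where the joke starts. We try skipping it with something like <code>.*</code> or <code>\w*</code>: </p>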
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>.*<div\sclass="content">'''</span></span><br></pre></td></tr></table></figure>
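<p>With a single joke in the HTML this matches, but when the page contains many jokes the match runs straight through to the <code><div\sclass="content"></code> of the last joke! Searching for <code>python regex match as little as possible</code> quickly turns up the terms <code>greedy</code> and <code>non-greedy</code>:</p>
<figure class="highlight 1c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible;</span><br><span class="line">non-greedy quantifiers do the opposite and match as few characters as possible. Appending ? to <span class="string">"*"</span>, <span class="string">"?"</span>, <span class="string">"+"</span> or <span class="string">"{m,n}"</span> turns a greedy quantifier into a non-greedy one</span><br></pre></td></tr></table></figure>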
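<p>The difference is easy to see on a page with two entries. The markup below is a simplified stand-in for the real page, assumed for illustration only:</p>

```python
import re

html = (
    '<div id="tag_1"><div class="content">first</div>\n'
    '<div id="tag_2"><div class="content">second</div>'
)

# Greedy: ".*" runs to the end and backtracks to the LAST content block.
greedy = re.findall(
    r'<div\sid="tag_\d+">.*<div\sclass="content">(.*)</div>', html, re.S
)

# Non-greedy: ".*?" stops at the FIRST content block after each opening tag.
lazy = re.findall(
    r'<div\sid="tag_\d+">.*?<div\sclass="content">(.*?)</div>', html, re.S
)

if __name__ == "__main__":
    print(greedy)  # ['second']
    print(lazy)    # ['first', 'second']
```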
<p>That gives us a new pattern:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>.*?<div\sclass="content">\\n*(.*?)\\n*</div>'''</span></span><br></pre></td></tr></table></figure>
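<p>Running this captures most of the jokes, but some junk is captured as well. Looking at the HTML shows that the junk is the text part of jokes that come with a picture; without the picture they make little sense. So the pattern needs further work to exclude jokes with pictures, which means finding what distinguishes a text-only joke from a picture joke.</p>
<p>The difference is easy to spot: a text-only joke is followed directly by <code><div class="stats"></code>, and a few lines later by <code><div class="single-clear"></div></code>, whereas a joke with a picture has several extra HTML tags before its <code><div class="stats"></code>, one of them an <code><img></code> tag. If we extend the pattern up to the stats block, it should match only text-only jokes:</p>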
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>.*?<div\sclass="content">\\n*(.*?)\\n*</div>\\n*<div\sclass="stats">'''</span></span><br></pre></td></tr></table></figure>
<p>Running it shows that the junk is still there:</p>
<figure class="highlight stata"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">['昨天在食堂打菜 给人碰了一下 菜汤泼裤子上了 当时愁的 心说油渍多难洗啊 吃完饭 看了眼裤子 连个印子都没了 菜里根本就没油啊!!! 是我想太多了..', '12345\<span class="keyword">n</span>\<span class="keyword">n</span></div>\<span class="keyword">n</span>\<span class="keyword">n</span>\<span class="keyword">n</span>\<span class="keyword">n</span><div <span class="keyword">class</span>=<span class="string">"thumb"</span>>\<span class="keyword">n</span>\<span class="keyword">n</span><a href=<span class="string">"/article/117193665"</span> target=<span class="string">"_blank"</span>>\<span class="keyword">n</span><img src=<span class="string">"http://pic.qiushibaike.com/system/pictures/11719/117193665/medium/app117193665.jpg"</span> alt=<span class="string">"糗事#117193665"</span> />\<span class="keyword">n</span></a>', '家里有个两岁的小暖男,大热天他经常问他爸爸:冷吗?然后帮他爸爸把电风扇关了,把被子盖上,每天乐此不疲。']</span><br></pre></td></tr></table></figure>
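<p>A careful look at the pattern and the junk entry shows that the junk is a legitimate match: it satisfies the beginning, middle and end of the pattern. The point is that only the text between <code><div class="content"></code> and the first <code></div></code> is the actual joke. The joke is captured with the non-greedy <code>(.*?)</code>, but the pattern also has to satisfy the ending condition, so the non-greedy group keeps growing until the ending matches, which produces the result above. Non-greedy matching alone is not enough here; we have to enforce the limit ourselves. Since the joke ends at <code></div></code>, requiring that the joke text never contains <code></div></code> makes the pattern capture the right text.</p>
<p>Searching for <code>python regex negation</code> shows that <code>^</code> at the start of a character class negates it. Most examples online negate single characters, though, and that is all a character class can do. <code>[^(</div>)]*</code> does not mean "anything except the closing tag"; it excludes each listed character individually, that is, every parenthesis, angle bracket and slash and the letters d, i and v. It removes the junk above only because it cannot cross any tag, and it also rejects every joke that happens to contain one of those characters. Parentheses have no grouping meaning inside a character class either, so <code>[^(?:</div>)]*</code> merely adds the question mark and the colon to the excluded set. Outside a character class, <code>(?:...)</code> is a non-capturing group: it treats its content as a unit without adding it to the returned results. Combined with the negative lookahead <code>(?!...)</code>, it gives a way to say "any text that does not contain the closing tag": <code>(?:(?!</div>).)*?</code> checks before each character that the closing tag does not start there.</p>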
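<p>The gap between a negated character class and "text that does not contain the closing tag" shows up as soon as the joke contains an ordinary letter from the excluded set. The sample string below is made up for illustration:</p>

```python
import re

html = '<div class="content">a video (HD) divides opinion</div><div class="stats">'

# A character class negates single characters, not a whole string:
# the letter "v" in "video" is excluded, so the group stops there and the match fails.
char_class = re.findall(r'<div\sclass="content">([^(</div>)]*)</div>', html)

# A negative lookahead tested before every character expresses
# "any run of characters in which the closing tag never starts".
lookahead = re.findall(
    r'<div\sclass="content">((?:(?!</div>).)*)</div>', html, re.S
)

if __name__ == "__main__":
    print(char_class)  # []
    print(lookahead)   # ['a video (HD) divides opinion']
```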
<p>In the end we have a regular expression that matches the jokes we want, and the <code>re.X</code> flag lets us lay it out so that it is easier to read:</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">pattern = <span class="string">r'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>  # start of a post (raw string, so no doubled backslashes)</span><br><span class="line">            .*?                        # anything between the opening tag and the joke marker; must be non-greedy</span><br><span class="line">            <div\sclass="content">\n*  # joke start marker, plus blank lines before the joke</span><br><span class="line">            ((?:(?!</div>).)*?)        # joke text: outer parentheses capture, inner group stops before the closing tag</span><br><span class="line">            \n*</div>                  # joke end marker, plus blank lines after the joke</span><br><span class="line">            \n*</span><br><span class="line">            <div\sclass="stats">       # only text-only jokes reach the stats block directly</span><br><span class="line">'''</span></span><br></pre></td></tr></table></figure>
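<p>The laid-out pattern can be tried offline before it goes into the crawler. The sketch below runs a verbose pattern of the same shape against a small synthetic page with one text-only joke and one picture joke; the page content and the name <code>JOKE</code> are assumptions made for this example, and the joke group uses a negative lookahead to stop at the closing tag.</p>

```python
import re

# Synthetic stand-in for the real page: one text-only post, one post with a picture.
page = '''
<div class="article block untagged mb15" id='qiushi_tag_1'>
<div class="author clearfix"><h2>alice</h2></div>
<div class="content">

text-only joke

</div>

<div class="stats">
<div class="article block untagged mb15" id='qiushi_tag_2'>
<div class="content">

joke with a picture

</div>

<div class="thumb"><img src="x.jpg" /></div>
<div class="stats">
'''

JOKE = re.compile(r'''
    <div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>
    .*?                        # skip author block and anything else before the joke
    <div\sclass="content">\n*  # joke start marker and leading blank lines
    ((?:(?!</div>).)*?)        # joke text, never running past the closing tag
    \n*</div>                  # joke end marker and trailing blank lines
    \n*
    <div\sclass="stats">       # must follow directly, which rules out picture posts
''', re.S | re.X)

if __name__ == "__main__":
    print(JOKE.findall(page))  # ['text-only joke']
```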
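<h2 id="最终结果">Final result</h2><p>The Qiushibaike page layout may change. This code produced the expected result when the article was written; if it no longer does, adjust the pattern yourself.</p>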
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib.parse</span><br><span class="line"><span class="keyword">import</span> urllib.request</span><br><span class="line"><span class="keyword">import</span> re</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">QSBK</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self)</span>:</span></span><br><span class="line">        self.page = <span class="number">1</span></span><br><span class="line">        <span class="comment"># page number to request</span></span><br><span class="line">        self.user_agent = <span class="string">'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'</span></span><br><span class="line">        self.stories = []</span><br><span class="line">        <span class="comment"># jokes collected so far</span></span><br><span class="line"> </span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">get_stories</span><span class="params">(self)</span>:</span></span><br><span class="line">        url = <span class="string">"http://www.qiushibaike.com/hot/page/"</span>+str(self.page)</span><br><span class="line">        <span class="comment"># build the request URL</span></span><br><span class="line">        req = urllib.request.Request(url)</span><br><span class="line">        req.add_header(<span class="string">'User-Agent'</span>, self.user_agent)</span><br><span class="line">        response = urllib.request.urlopen(req)</span><br><span class="line">        the_page = response.read().decode(<span class="string">"utf-8"</span>)</span><br><span class="line">        <span class="comment"># print(the_page)</span></span><br><span class="line">        pattern = <span class="string">r'''<div\sclass="article\sblock\suntagged\smb15"\sid='qiushi_tag_\d+'>  # start of a post (raw string, so no doubled backslashes)</span><br><span class="line">            .*?                        # anything between the opening tag and the joke marker; must be non-greedy</span><br><span class="line">            <div\sclass="content">\n*  # joke start marker, plus blank lines before the joke</span><br><span class="line">            ((?:(?!</div>).)*?)        # joke text: outer parentheses capture, inner group stops before the closing tag</span><br><span class="line">            \n*</div>                  # joke end marker, plus blank lines after the joke</span><br><span class="line">            \n*</span><br><span class="line">            <div\sclass="stats">       # only text-only jokes reach the stats block directly</span><br><span class="line">        '''</span></span><br><span class="line">        myItems = re.findall(pattern,the_page,re.S|re.X)</span><br><span class="line">        <span class="comment"># print(len(myItems))</span></span><br><span class="line">        <span class="keyword">for</span> item <span class="keyword">in</span> myItems:</span><br><span class="line">            self.stories.append(item)</span><br><span class="line">        self.page += <span class="number">1</span></span><br><span class="line">        print(self.stories)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">haha</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="keyword">if</span> len(self.stories)<<span class="number">2</span>:</span><br><span class="line">            self.get_stories()</span><br><span class="line">        <span class="keyword">return</span> self.stories.pop()  <span class="comment"># each item is a plain string, so no indexing</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">'__main__'</span>:</span><br><span class="line">    qb = QSBK()</span><br><span class="line">    print(qb.haha())</span><br></pre></td></tr></table></figure>
<p>Once you are comfortable with the four points above, most basic regex matching on web pages should pose no problem.</p>
</div>
<footer class="article-footer">
<!-- <a data-url="http://huangming.github.io/2016-08-06-python-zhengze-pachong-qiubai.html" data-id="cixmol30y000vewfu27wemf1r" class="article-share-link">Share</a> -->
<nav id="article-nav">
<a href="/2016-08-18-oracle-auto-backup-and-restore.html" id="article-nav-newer" class="article-nav-link-wrap">
<!-- <strong class="article-nav-caption">Newer</strong> -->
<div class="article-nav-title">
«Oracle automatic backup and sync
</div>
</a>
<a href="/2016-06-25-ssh-upload-dir-to-linux-vps-by-scp.html" id="article-nav-older" class="article-nav-link-wrap">
<!-- <strong class="article-nav-caption">Older</strong> -->
<div class="article-nav-title">Uploading a folder to a Linux VPS over SSH with the scp command»</div>
</a>
</nav>
<a href="http://huangming.github.io/2016-08-06-python-zhengze-pachong-qiubai.html#disqus_thread" class="article-comment-link">Comments</a>
<ul class="article-tag-list"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/python/">python</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/正则/">正则</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/爬虫/">爬虫</a></li></ul>
</footer>
</div>
</article>
<section id="comments">
<div id="disqus_thread">
<noscript>Please enable JavaScript to view the <a href="//disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
</div>
</section>
</section>
</div>
<footer id="footer">
<aside id="sidebar" class="outer">
<div class="widget-wrap">
<h3 class="widget-title">Categories</h3>
<div class="widget">
<ul class="category-list"><li class="category-list-item"><a class="category-list-link" href="/categories/Vim/">Vim</a><span class="category-list-count">1</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/blog/">blog</a><span class="category-list-count">1</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/database/">database</a><span class="category-list-count">3</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/diary/">diary</a><span class="category-list-count">33</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/other/">other</a><span class="category-list-count">1</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/python/">python</a><span class="category-list-count">3</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/vps/">vps</a><span class="category-list-count">1</span></li><li class="category-list-item"><a class="category-list-link" href="/categories/www/">www</a><span class="category-list-count">1</span></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Tags</h3>
<div class="widget">
<ul class="tag-list"><li class="tag-list-item"><a class="tag-list-link" href="/tags/backup/">backup</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/dokuwiki/">dokuwiki</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/excel/">excel</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/expdp/">expdp</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/hexo/">hexo</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/impdp/">impdp</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/linux/">linux</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/log/">log</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/markdown/">markdown</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/nginx/">nginx</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/oracle/">oracle</a><span class="tag-list-count">4</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/oracle热备/">oracle热备</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/python/">python</a><span class="tag-list-count">3</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/ssh/">ssh</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/vps/">vps</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/wiki/">wiki</a><span class="tag-list-count">1</span></li><li 
class="tag-list-item"><a class="tag-list-link" href="/tags/云笔记/">云笔记</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/动态链接库/">动态链接库</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/正则/">正则</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/爬虫/">爬虫</a><span class="tag-list-count">1</span></li><li class="tag-list-item"><a class="tag-list-link" href="/tags/自动化/">自动化</a><span class="tag-list-count">1</span></li></ul>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Tag Cloud</h3>
<div class="widget tagcloud">
<a href="/tags/backup/" style="font-size: 10px;">backup</a> <a href="/tags/dokuwiki/" style="font-size: 10px;">dokuwiki</a> <a href="/tags/excel/" style="font-size: 10px;">excel</a> <a href="/tags/expdp/" style="font-size: 10px;">expdp</a> <a href="/tags/hexo/" style="font-size: 10px;">hexo</a> <a href="/tags/impdp/" style="font-size: 10px;">impdp</a> <a href="/tags/linux/" style="font-size: 10px;">linux</a> <a href="/tags/log/" style="font-size: 10px;">log</a> <a href="/tags/markdown/" style="font-size: 10px;">markdown</a> <a href="/tags/nginx/" style="font-size: 10px;">nginx</a> <a href="/tags/oracle/" style="font-size: 20px;">oracle</a> <a href="/tags/oracle热备/" style="font-size: 10px;">oracle热备</a> <a href="/tags/python/" style="font-size: 15px;">python</a> <a href="/tags/ssh/" style="font-size: 10px;">ssh</a> <a href="/tags/vps/" style="font-size: 10px;">vps</a> <a href="/tags/wiki/" style="font-size: 10px;">wiki</a> <a href="/tags/云笔记/" style="font-size: 10px;">云笔记</a> <a href="/tags/动态链接库/" style="font-size: 10px;">动态链接库</a> <a href="/tags/正则/" style="font-size: 10px;">正则</a> <a href="/tags/爬虫/" style="font-size: 10px;">爬虫</a> <a href="/tags/自动化/" style="font-size: 10px;">自动化</a>
</div>
</div>
<div class="widget-wrap">
<h3 class="widget-title">Recents</h3>
<div class="widget">
<ul>
<li>
<a href="/2016-12-27-oracle-ora00020.html">Oracle login failure: ORA-00020</a>
</li>
<li>
<a href="/2016-12-17-dokuwiki-nginx.html">Deploying DokuWiki with Nginx on Windows</a>
</li>
<li>
<a href="/2016-12-03-cloud-notes-2012.html">Stumbling on my old cloud notes</a>
</li>
<li>
<a href="/2016-08-18-oracle-auto-backup-and-restore.html">Oracle automatic backup and sync</a>
</li>
<li>
<a href="/2016-08-06-python-zhengze-pachong-qiubai.html">Scraping Qiushibaike Jokes with Python Regular Expressions</a>
</li>
</ul>
</div>
</div>
</aside>
<div class="outer">
<div id="footer-info" class="inner">
© 2017 Mingo<br>
Powered by <a href="http://hexo.io/" target="_blank">Hexo</a>
</div>
</div>
</footer>
</div>
<nav id="mobile-nav">
<a href="/" class="mobile-nav-link">Home</a>
<a href="/archives" class="mobile-nav-link">Archives</a>
<a href="/vimwiki" class="mobile-nav-link">Wiki</a>
</nav>
<script>
var disqus_shortname = 'huangming';
var disqus_url = 'http://huangming.github.io/2016-08-06-python-zhengze-pachong-qiubai.html';
var disqus_config = function () {
this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = '//'+disqus_shortname + '.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<!-- <script src="//ajax.googleapis.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script> -->
<script src="/js/script.js" type="text/javascript"></script>
<script type="text/javascript">
var myDate = new Date()
var m = myDate.getMinutes()
m = m % 10
document.body.style.background = '#333 url("/css/images/bgr'+m+'.jpg") top left'
document.getElementById('wrap').style.background='#333 url("/css/images/bgr'+m+'.jpg") top left'
</script>
</div>
<div id="spig" class="spig">
<div id="message">Loading...</div>
<div id="mumu" class="mumu"></div>
</div>
</body>
</html>