
add people daily publisher #444

Merged
merged 13 commits into flairNLP:master on May 6, 2024

Conversation

screw-44 (Contributor)

Adding People's Daily (人民日报)

Hi! I tried to add the People's Daily publisher (People for short) to Fundus. It is now working in my own tests, but there are still some issues. People's Daily is a Chinese publisher and would be the first Asian publisher in Fundus (if added).

  1. The most confusing problem I ran into is encoding. Because People's Daily uses GBK encoding (an older encoding for Chinese characters), the text turns into a mess in the parser. So I had to convert the encoding, but the code is not nice.

I tested it myself, and it seems to be working. The test code followed the code in the tutorial.

  2. But I can't generate the test files. The error message is shown below.
(nlpcourse) ➜  fundus git:(add-publihsers-cn-people) ✗ python -m scripts.generate_parser_test_files -p People
People:   0%|                                                                                                                               | 0/1 [00:00<?, ?it/s]2024-04-22 21:44:44,072 - basic_logger - WARNING - Warning! Couldn't reach sitemap 'https://politics.people.com.cn/news_sitemap.xml' because of HTTPSConnectionPool(host='politics.people.com.cn', port=443): Max retries exceeded with url: /news_sitemap.xml (Caused by SSLError(CertificateError("hostname 'politics.people.com.cn' doesn't match either of 'default.chinanetcenter.com', '*.i3839.com', '*.ourdvsss.com', '*.ziroom.com', '*.blued.com', 'sstatic.chunboimg.com', '*.ip138.com', 'm.bbs.3839.com', 'microgame.5054399.net', 'nitrome.com.4399.com', 's3.chunboimg.com', 'jssdk.3304399.net', '*.lof3.xyz', '*.rax0mai4.xyz', '*.4399.cn', 's0.chunboimg.com', '*.3839.com', 'www.miniclip.com.4399pk.com', '*.1zhe.com', 'ip138.com', 'd.3839app.net', 'maangh2.chinanetcenter.com', '*.9k9.cn', '*.4399.com', 's1.chunboimg.com', 'cdn.h5wan.4399sj.com', '*.service.kugou.com', 'lvs.lxdns.net', '*.wscdns.com', '*.walla-app.com', '*.bldimg.com', '*.5054399.com', '*.4399youpai.com', '*.3839app.com', '*.v.cdn20.com', 'hls.vda.v.cdn20.com', '*.cntv.cdn20.com', '*.img4399.com', 's2.chunboimg.com', '*.cntv.lxdns.com', '*.ourdvsssvip.com', '*.v.wscdns.com', '*.iwan4399.com', '4399.cn'")))
  新华社北京4月22日电 4月22日,国务委员、全国妇联主席谌贻琴在北京会见妇联系统港澳执委、特邀代表考察团并与团员座谈。
  谌贻琴指出,近年来,妇联系统港澳执委和特邀代表深入学习贯彻习近平新时代中国特色社会主义思想,主动为国家发展建言献策,支持特区政府和行政长官依法施政,不愧是爱国方针,助力推动港澳更好融入国家发展大局,热诚服务基层社区群众,更好展现巾帼担当和风采,为推进中国式现代化,保持香港、澳门长期繁荣稳定作出更大贡献。
  全国妇联副主席、书记处第一书记黄晓薇主持座谈会。
分享让更多人看到  
People:   0%|                                                                                                                               | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 143, in <module>
    main()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 137, in main
    test_data_file.write(test_data)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 103, in write
    json.dump(content, json_file, **kwargs)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 112, in default
    return obj.serialize()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in serialize
    "sections": [section.serialize() for section in self.sections],
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in <listcomp>
    "sections": [section.serialize() for section in self.sections],
AttributeError: 'str' object has no attribute 'serialize'

Anyway, I hope it is just a testing problem. Hoping to hear from you soon.

screw-44 (Contributor, Author)

I should also add that the HTML pages and the overall site structure are quite different from the websites in the tutorial, which meant I had to find a lot of workarounds. Just look at the two sources and you will see what I mean.

        sources=[Sitemap("https://www.people.cn/sitemap_index.xml"),
                 NewsMap("https://politics.people.com.cn/news_sitemap.xml")],

MaxDall (Collaborator) commented Apr 23, 2024

Hey @screw-44, thanks for adding our first Chinese news outlet 🚀 and sorry to hear that you ran into so many issues!

The problem here is that requests isn't able to detect the proper charset for this publisher and uses latin-1 as a fallback. I can think of two practical solutions here.
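As a side note on why this is recoverable at all: latin-1 maps every byte 0x00-0xFF to a codepoint, so the fallback decoding is lossless and can be reversed. A minimal sketch of the round trip (the example string here is arbitrary, not taken from this PR):

    text = "人民网"                                       # arbitrary example string
    garbled = text.encode("gbk").decode("latin-1")        # what the latin-1 fallback yields
    assert garbled.encode("latin-1").decode("gbk") == text  # the round trip is lossless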

  1. Let the parser handle the encoding, as you already implemented ;), but there is a cleaner way to do so. Fundus supports registering functions for parsers, which are then executed based on priority. You can use this functionality to overwrite attributes of the precomputed class variable.
        @function(priority=0)
        def fix_encoding(self):
            # recover the original GBK bytes from the latin-1 fallback, then re-decode and re-parse
            self.precomputed.html = self.precomputed.html.encode("latin-1").decode("gbk")
            self.precomputed.doc = lxml.html.document_fromstring(self.precomputed.html)

Now the rest can be parsed as usual. I will post the entire parser here.

import datetime
from typing import List, Optional

import lxml.html
from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute, function
from fundus.parser.utility import (
  extract_article_body_with_selector,
  generic_author_parsing,
  generic_date_parsing,
  parse_title_from_root,
)


class PeopleParser(ParserProxy):
  class V1(BaseParser):
      _paragraph_selector = CSSSelector("div.rm_txt_con > p")

      @function(priority=0)
      def fix_encoding(self):
          self.precomputed.html = self.precomputed.html.encode("latin-1").decode("gbk")
          self.precomputed.doc = lxml.html.document_fromstring(self.precomputed.html)

      @attribute
      def body(self) -> ArticleBody:
          return extract_article_body_with_selector(self.precomputed.doc, paragraph_selector=self._paragraph_selector)

      @attribute
      def title(self) -> Optional[str]:
          return parse_title_from_root(self.precomputed.doc)

      @attribute
      def authors(self) -> List[str]:
          return generic_author_parsing(self.precomputed.meta.get("author"))

      @attribute
      def publishing_date(self) -> Optional[datetime.datetime]:
          return generic_date_parsing(self.precomputed.meta.get("publishdate"))
  2. The more complicated and proper solution would be to fix the encoding within Fundus' crawling itself. While this may take a while, for now I would suggest using the parser I posted as a temporary fix :) (A rough sketch of what such a crawl-time fix could look like follows below.)
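For illustration only, a minimal sketch of charset detection during crawling. The function names here are hypothetical and this is not what the actual fix implements; it just shows the idea of preferring the charset declared in the HTML over a blanket latin-1 fallback:

import re
from typing import Optional


def detect_declared_charset(raw: bytes) -> Optional[str]:
    # search the first 2 KB of the document for a charset declaration,
    # e.g. <meta charset="gbk"> or <meta ... content="text/html; charset=gbk">
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1) if match else None


def decode_response(raw: bytes, fallback: str = "utf-8") -> str:
    # decode using the declared charset when present, otherwise the fallback
    try:
        return raw.decode(detect_declared_charset(raw) or fallback)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(fallback, errors="replace")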

MaxDall (Collaborator) commented Apr 23, 2024

@screw-44 Could you run python -m scripts.generate_parser_test_files -p People in the project root and commit the changes to your branch?

screw-44 (Contributor, Author)

@MaxDall Hi! Of course, and thanks for the help in the PR. But it still produces this error:
People:   0%|          | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 143, in <module>
    main()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 137, in main
    test_data_file.write(test_data)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 103, in write
    json.dump(content, json_file, **kwargs)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 112, in default
    return obj.serialize()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in serialize
    "sections": [section.serialize() for section in self.sections],
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in <listcomp>
    "sections": [section.serialize() for section in self.sections],
AttributeError: 'str' object has no attribute 'serialize'

MaxDall (Collaborator) commented Apr 23, 2024

@screw-44 Hmm, no idea what happened there. When I check out your branch I can run the script with no errors. I'm gonna update the branch for you.

Further, I opened PR #450 to address the encoding issue.

screw-44 (Contributor, Author)

> @screw-44 Hmm, no idea what happened there. When I check out your branch I can run the script with no errors. I'm gonna update the branch for you.

Oh thank you! You are very kind and nice!!!

MaxDall (Collaborator) commented Apr 24, 2024

@screw-44 I finished the parser for now. Can you do me a favor and check the publisher's articles to see if you can find any authors listed by name? I don't speak or read Mandarin, so it's almost impossible for me. You could start with the article used for the unit test. I just want to know if there is any author name we can scrape. Right now it's just a number.

screw-44 (Contributor, Author)

> @screw-44 I finished the parser for now. Can you do me a favor and check the publisher's articles to see if you can find any authors listed by name? I don't speak or read Mandarin, so it's almost impossible for me. You could start with the article used for the unit test. I just want to know if there is any author name we can scrape. Right now it's just a number.

Of course. The authors' names are in the HTML; I can try to add that.

screw-44 (Contributor, Author) commented Apr 24, 2024

@MaxDall But I should mention that the actual "author" of this newspaper is usually not disclosed; that is why the author shows up as only a number. However, there are people's names on each news page, labeled 责编, which means "the person responsible for editing and quality control". Given this context, do you think I should use those people's names as the author names?

The structure of a publisher and how news is made differ between the East and the West. I think it makes sense to put those people's names into the author field, but I also think it is responsible to mention this context. (A rough sketch of what this could look like follows below.)
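For illustration only, a minimal sketch of such an authors attribute for PeopleParser.V1. The 责编 pattern is an assumption about the page markup, not verified against the live site:

import re
from typing import List

from fundus.parser import BaseParser, attribute
from fundus.parser.utility import generic_author_parsing

# hypothetical pattern: would match e.g. "(责编:张三、李四)"; the live markup may differ
_EDITOR_PATTERN = re.compile(r"责编[:：]\s*([^)）]+)")


class V1(BaseParser):  # i.e. an addition to the existing PeopleParser.V1
    @attribute
    def authors(self) -> List[str]:
        # fall back to the responsible editors (责编) when no byline author is given
        if match := _EDITOR_PATTERN.search(self.precomputed.html):
            return generic_author_parsing(match.group(1))
        return []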

MaxDall (Collaborator) commented Apr 24, 2024

@screw-44 Nice, that sounds like a plan!

screw-44 closed this Apr 25, 2024
screw-44 reopened this Apr 25, 2024
MaxDall merged commit 3735af0 into flairNLP:master on May 6, 2024
10 checks passed