
add people daily publisher #444

Merged
merged 13 commits into flairNLP:master on May 6, 2024

Conversation

screw-44 (Contributor)

Adding People's Daily (人民日报)

Hi! I tried to add the People's Daily publisher (People for short) to Fundus. It is now working in my own tests, but there are still some issues. People's Daily is a Chinese publisher and would be the first Asian publisher in Fundus (if added).

  1. The most confusing problem I ran into is encoding. Because People's Daily uses GBK encoding (an older encoding for Chinese characters), the text turns into a mess in the parser. So I had to convert the encoding, but the code is not nice.

I tested it myself, and it seems to be working. The test code followed the code in the tutorial.

  2. But I can't generate the test files. The error message is shown below.
(nlpcourse) ➜  fundus git:(add-publihsers-cn-people) ✗ python -m scripts.generate_parser_test_files -p People
People:   0%|                                                                                                                               | 0/1 [00:00<?, ?it/s]2024-04-22 21:44:44,072 - basic_logger - WARNING - Warning! Couldn't reach sitemap 'https://politics.people.com.cn/news_sitemap.xml' because of HTTPSConnectionPool(host='politics.people.com.cn', port=443): Max retries exceeded with url: /news_sitemap.xml (Caused by SSLError(CertificateError("hostname 'politics.people.com.cn' doesn't match either of 'default.chinanetcenter.com', '*.i3839.com', '*.ourdvsss.com', '*.ziroom.com', '*.blued.com', 'sstatic.chunboimg.com', '*.ip138.com', 'm.bbs.3839.com', 'microgame.5054399.net', 'nitrome.com.4399.com', 's3.chunboimg.com', 'jssdk.3304399.net', '*.lof3.xyz', '*.rax0mai4.xyz', '*.4399.cn', 's0.chunboimg.com', '*.3839.com', 'www.miniclip.com.4399pk.com', '*.1zhe.com', 'ip138.com', 'd.3839app.net', 'maangh2.chinanetcenter.com', '*.9k9.cn', '*.4399.com', 's1.chunboimg.com', 'cdn.h5wan.4399sj.com', '*.service.kugou.com', 'lvs.lxdns.net', '*.wscdns.com', '*.walla-app.com', '*.bldimg.com', '*.5054399.com', '*.4399youpai.com', '*.3839app.com', '*.v.cdn20.com', 'hls.vda.v.cdn20.com', '*.cntv.cdn20.com', '*.img4399.com', 's2.chunboimg.com', '*.cntv.lxdns.com', '*.ourdvsssvip.com', '*.v.wscdns.com', '*.iwan4399.com', '4399.cn'")))
  新华社北京4月22日电 4月22日,国务委员、全国妇联主席谌贻琴在北京会见妇联系统港澳执委、特邀代表考察团并与团员座谈。
  谌贻琴指出,近年来,妇联系统港澳执委和特邀代表深入学习贯彻习近平新时代中国特色社会主义思想,主动为国家发展建言献策,支持特区政府和行政长官依法施政,不愧是爱国方针,助力推动港澳更好融入国家发展大局,热诚服务基层社区群众,更好展现巾帼担当和风采,为推进中国式现代化,保持香港、澳门长期繁荣稳定作出更大贡献。
  全国妇联副主席、书记处第一书记黄晓薇主持座谈会。
分享让更多人看到  
People:   0%|                                                                                                                               | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 143, in <module>
    main()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 137, in main
    test_data_file.write(test_data)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 103, in write
    json.dump(content, json_file, **kwargs)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 112, in default
    return obj.serialize()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in serialize
    "sections": [section.serialize() for section in self.sections],
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in <listcomp>
    "sections": [section.serialize() for section in self.sections],
AttributeError: 'str' object has no attribute 'serialize'

Anyway, I hope it is just a testing problem. Hoping to hear from you soon.

screw-44 (Contributor, Author)

I should also add that the HTML pages and the overall site structure are quite different from the websites in the tutorial, which meant I had to find a lot of workarounds. Just look at the two sources and you will see what I mean.

        sources=[Sitemap("https://www.people.cn/sitemap_index.xml"),
                 NewsMap("https://politics.people.com.cn/news_sitemap.xml")],

MaxDall (Collaborator) commented Apr 23, 2024

Hey @screw-44, thanks for adding our first Chinese news outlet 🚀 and sorry to hear that you ran into so many issues!

The problem here is that requests isn't able to detect the proper charset for this publisher and uses latin-1 as a fallback. I can think of two practical solutions here.
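As a side note on why this is recoverable at all: latin-1 maps every byte 0x00-0xFF to a codepoint, so the fallback decoding is lossless and can be reversed. A minimal sketch of the round trip (the example string here is arbitrary, not taken from this PR):

    text = "人民网"                                       # arbitrary example string
    garbled = text.encode("gbk").decode("latin-1")        # what the latin-1 fallback yields
    assert garbled.encode("latin-1").decode("gbk") == text  # the round trip is lossless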

  1. Let the parser handle the encoding, as you already implemented ;), but there is a cleaner way to do so. Fundus supports registering functions for parsers, which are then executed based on priority. You can use this functionality to overwrite attributes of the precomputed class variable.
        @function(priority=0)
        def fix_encoding(self):
            # recover the original GBK bytes from the latin-1 fallback, then re-decode and re-parse
            self.precomputed.html = self.precomputed.html.encode("latin-1").decode("gbk")
            self.precomputed.doc = lxml.html.document_fromstring(self.precomputed.html)

Now the rest can be parsed as usual. I will post the entire parser here.

import datetime
from typing import List, Optional

import lxml.html
from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute, function
from fundus.parser.utility import (
  extract_article_body_with_selector,
  generic_author_parsing,
  generic_date_parsing,
  parse_title_from_root,
)


class PeopleParser(ParserProxy):
  class V1(BaseParser):
      _paragraph_selector = CSSSelector("div.rm_txt_con > p")

      @function(priority=0)
      def fix_encoding(self):
          self.precomputed.html = self.precomputed.html.encode("latin-1").decode("gbk")
          self.precomputed.doc = lxml.html.document_fromstring(self.precomputed.html)

      @attribute
      def body(self) -> ArticleBody:
          return extract_article_body_with_selector(self.precomputed.doc, paragraph_selector=self._paragraph_selector)

      @attribute
      def title(self) -> Optional[str]:
          return parse_title_from_root(self.precomputed.doc)

      @attribute
      def authors(self) -> List[str]:
          return generic_author_parsing(self.precomputed.meta.get("author"))

      @attribute
      def publishing_date(self) -> Optional[datetime.datetime]:
          return generic_date_parsing(self.precomputed.meta.get("publishdate"))
  2. The more complicated and proper solution would be to fix the encoding within Fundus' crawling itself. While this may take a while, for now I would suggest using the parser I posted as a temporary fix :) (A rough sketch of what such a crawl-time fix could look like follows below.)
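For illustration only, a minimal sketch of charset detection during crawling. The function names here are hypothetical and this is not what the actual fix implements; it just shows the idea of preferring the charset declared in the HTML over a blanket latin-1 fallback:

import re
from typing import Optional


def detect_declared_charset(raw: bytes) -> Optional[str]:
    # search the first 2 KB of the document for a charset declaration,
    # e.g. <meta charset="gbk"> or <meta ... content="text/html; charset=gbk">
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1) if match else None


def decode_response(raw: bytes, fallback: str = "utf-8") -> str:
    # decode using the declared charset when present, otherwise the fallback
    try:
        return raw.decode(detect_declared_charset(raw) or fallback)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(fallback, errors="replace")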

MaxDall (Collaborator) commented Apr 23, 2024

@screw-44 Could you run python -m scripts.generate_parser_test_files -p People in the project root and commit the changes to your branch?

screw-44 (Contributor, Author)

@MaxDall Hi! Of course, and thanks for the help in the PR. But it still produces this error:
People:   0%|          | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 143, in <module>
    main()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/scripts/generate_parser_test_files.py", line 137, in main
    test_data_file.write(test_data)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 103, in write
    json.dump(content, json_file, **kwargs)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/hexinyu/anaconda3/envs/nlpcourse/lib/python3.9/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/fundus/tests/utility.py", line 112, in default
    return obj.serialize()
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in serialize
    "sections": [section.serialize() for section in self.sections],
  File "/Users/hexinyu/Library/CloudStorage/OneDrive-个人/002-HU/nlp-exercise-1/fundus/src/fundus/parser/data.py", line 256, in <listcomp>
    "sections": [section.serialize() for section in self.sections],
AttributeError: 'str' object has no attribute 'serialize'

MaxDall (Collaborator) commented Apr 23, 2024

@screw-44 Hmm, no idea what happened there. When I check out your branch I can run the script with no errors. I'm gonna update the branch for you.

Further, I opened PR #450 to address the encoding issue.

screw-44 (Contributor, Author)

> @screw-44 Hmm, no idea what happened there. When I check out your branch I can run the script with no errors. I'm gonna update the branch for you.

Oh thank you! You are very kind and nice!!!

MaxDall (Collaborator) commented Apr 24, 2024

@screw-44 I finished the parser for now. Can you do me a favor and check the publisher's articles to see if you can find any authors listed by name? I don't speak or read Mandarin, so it's almost impossible for me. You could start with the article used for the unit test. I just want to know if there is any author name we can scrape. Right now it's just a number.

screw-44 (Contributor, Author)

> @screw-44 I finished the parser for now. Can you do me a favor and check the publisher's articles to see if you can find any authors listed by name? I don't speak or read Mandarin, so it's almost impossible for me. You could start with the article used for the unit test. I just want to know if there is any author name we can scrape. Right now it's just a number.

Of course. The authors' names are in the HTML; I can try to add that.

screw-44 (Contributor, Author) commented Apr 24, 2024

@MaxDall But I should mention that the actual "author" of this newspaper is usually not disclosed; that is why the author shows up as only a number. However, there are people's names on each news page, labeled 责编, which means "the person responsible for editing and quality control". Given this context, do you think I should use those people's names as the author names?

The structure of a publisher and how news is made differ between the East and the West. I think it makes sense to put those people's names into the author field, but I also think it is responsible to mention this context. (A rough sketch of what this could look like follows below.)
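For illustration only, a minimal sketch of such an authors attribute for PeopleParser.V1. The 责编 pattern is an assumption about the page markup, not verified against the live site:

import re
from typing import List

from fundus.parser import BaseParser, attribute
from fundus.parser.utility import generic_author_parsing

# hypothetical pattern: would match e.g. "(责编:张三、李四)"; the live markup may differ
_EDITOR_PATTERN = re.compile(r"责编[:：]\s*([^)）]+)")


class V1(BaseParser):  # i.e. an addition to the existing PeopleParser.V1
    @attribute
    def authors(self) -> List[str]:
        # fall back to the responsible editors (责编) when no byline author is given
        if match := _EDITOR_PATTERN.search(self.precomputed.html):
            return generic_author_parsing(match.group(1))
        return []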

MaxDall (Collaborator) commented Apr 24, 2024

@screw-44 Nice, that sounds like a plan!

screw-44 closed this Apr 25, 2024
screw-44 reopened this Apr 25, 2024
MaxDall merged commit 3735af0 into flairNLP:master on May 6, 2024
10 checks passed