add people daily publisher #444
Conversation
I should also add that the HTML page and the whole structure are quite different compared to the websites in the tutorial, which means I had to find a lot of workarounds. Just look at the two sources and you will see what I mean:

```python
sources=[
    Sitemap("https://www.people.cn/sitemap_index.xml"),
    NewsMap("https://politics.people.com.cn/news_sitemap.xml"),
],
```
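For readers unfamiliar with these sources: both point at XML sitemaps, and a source like this essentially fetches the sitemap and collects its `<loc>` URL entries. A minimal standalone sketch of that idea, using only the standard library and a made-up sitemap stub (the real ones come from the URLs above):

```python
import xml.etree.ElementTree as ET

# Made-up sitemap stub for illustration; real sitemaps are fetched
# from the URLs passed to Sitemap()/NewsMap() above.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://politics.people.com.cn/n1/2023/0101/c1001-1.html</loc></url>
  <url><loc>http://politics.people.com.cn/n1/2023/0102/c1001-2.html</loc></url>
</urlset>"""

# Sitemaps live in the sitemaps.org namespace, so findall() needs it.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(urls)  # -> two article URLs
```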
Hey @screw-44, thanks for adding our first Chinese news outlet 🚀 and sorry to hear that you ran into so many issues! The problem here is the page encoding: the HTML is GBK-encoded but gets decoded as Latin-1, so it has to be re-decoded first:

```python
@function(priority=0)
def fix_encoding(self):
    self.precomputed.html = self.precomputed.html.encode("latin-1").decode("gbk")
    self.precomputed.doc = lxml.html.document_fromstring(self.precomputed.html)
```

Now the rest can be parsed as usual. I will post the entire parser here.

```python
import datetime
from typing import List, Optional

import lxml.html
from lxml.cssselect import CSSSelector

from fundus.parser import ArticleBody, BaseParser, ParserProxy, attribute, function
from fundus.parser.utility import (
    extract_article_body_with_selector,
    generic_author_parsing,
    generic_date_parsing,
    parse_title_from_root,
)


class PeopleParser(ParserProxy):
    class V1(BaseParser):
        _paragraph_selector = CSSSelector("div.rm_txt_con > p")

        @function(priority=0)
        def fix_encoding(self):
            self.precomputed.html = self.precomputed.html.encode("latin-1").decode("gbk")
            self.precomputed.doc = lxml.html.document_fromstring(self.precomputed.html)

        @attribute
        def body(self) -> ArticleBody:
            return extract_article_body_with_selector(
                self.precomputed.doc, paragraph_selector=self._paragraph_selector
            )

        @attribute
        def title(self) -> Optional[str]:
            return parse_title_from_root(self.precomputed.doc)

        @attribute
        def authors(self) -> List[str]:
            return generic_author_parsing(self.precomputed.meta.get("author"))

        @attribute
        def publishing_date(self) -> Optional[datetime.datetime]:
            return generic_date_parsing(self.precomputed.meta.get("publishdate"))
```
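For context, the reason this round-trip works is that Latin-1 maps every byte value 0x00–0xFF to a code point, so decoding GBK bytes as Latin-1 produces mojibake but loses nothing; encoding back to Latin-1 recovers the original bytes exactly. A minimal standalone sketch of the trick, independent of fundus:

```python
# Simulate what happens to a GBK page that is decoded with the
# wrong codec, then repaired the way fix_encoding does it.
raw = "人民日报".encode("gbk")       # bytes as the server sends them
mojibake = raw.decode("latin-1")     # wrong codec: garbled but lossless
fixed = mojibake.encode("latin-1").decode("gbk")  # recover bytes, decode correctly
print(fixed)  # -> 人民日报
```

Note that this only works because Latin-1 is lossless over all byte values; a codec like ASCII or UTF-8 would raise an error (or replace bytes) on the way in, destroying the information needed to recover the text.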
@screw-44 Could you run
@MaxDall Hi! Of course, thanks for the help on the PR. But it still produces this error
Oh thank you! You are very kind and nice!!!
@screw-44 I finished the parser for now. Can you do me a favor and check the publisher's articles to see whether you can find any author named? Unfortunately, I don't speak or read Mandarin, so it's almost impossible for me. You could start with the article used for the unit test. I just want to know if there is any author name we can scrape; right now it's just a number.
Of course. The authors' names are in the HTML; I can try to add that.
@MaxDall But I should mention that the actual "author" of this newspaper is usually not disclosed, which is why the author currently shows up as only a number. However, there are people's names on each news page, written as 责编 (literally "responsible editor": someone who is responsible for the editing and quality control). Given this context, do you think I should use those people's names as the authors? The structure of a publisher and how news is made differ between the East and the West. I think it makes sense to put those names into the author field, but I also think it is only responsible to mention this caveat.
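If the 责编 credit appears as plain text on the page, one plausible way to extract it is a small regex over the page text. This is a hypothetical sketch only: the credit-line format below is invented for illustration, and the real markup on people.cn may differ:

```python
import re

# Hypothetical credit line as it might appear on a People's Daily
# page; "责编：" introduces the responsible editor(s), separated
# by the Chinese enumeration comma "、".
snippet = "（责编：张三、李四）"  # invented example, not scraped

# Capture everything after 责编： up to the closing bracket.
match = re.findall(r"责编：([^）)]+)", snippet)
authors = [name.strip() for name in match[0].split("、")] if match else []
print(authors)  # -> ['张三', '李四']
```

Whether this belongs in the parser's `authors` attribute is exactly the editorial question raised above; if adopted, it would be worth a comment noting these are responsible editors rather than bylined authors.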
@screw-44 Nice, that sounds like a plan!
… are responsible for the article, which does not necessarily mean the actual author.
Adding People daily 人民日报
Hi! I tried to add the People's Daily publisher (in short, People) to Fundus. It is now working in my own tests, but there are still some issues. People's Daily is a Chinese publisher and would be the first Asian publisher (if added) in Fundus.
I tested it myself and it seems to be working. The test code followed the code in the tutorial.
Anyway, I hope it is just a testing problem. Hoping to hear from you guys soon.