
Crawler exception on non-ascii #66

Open
angrave opened this issue Nov 14, 2023 · 3 comments
angrave commented Nov 14, 2023

```
crawler | aslcore-engineering-karnaugh map-1-312769046
crawler |
crawler | Not Saved
crawler | agent.listener.TaskNames.PythonCrawler 2023-11-14 05:44:57,148 [1] PythonCrawler failed to look up for a specific term: 'latin-1' codec can't encode character '\u2019' in position 52: ordinal not in range(256)
```


angrave commented Nov 14, 2023

Looks like it threw an exception because of a character-encoding issue and then didn't download any more ASL videos from this source.
FWIW, U+2019 is Unicode character 'RIGHT SINGLE QUOTATION MARK'.
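For reference, the failure can be reproduced in isolation: latin-1 (ISO-8859-1) only covers code points 0–255, so U+2019 cannot be encoded with it. A minimal sketch (the term string here is hypothetical, not taken from the crawler's data):

```python
# Minimal reproduction of the crawler's UnicodeEncodeError: latin-1
# only covers code points 0-255, so U+2019 cannot be represented.
term = "Karnaugh map\u2019s cells"  # hypothetical term containing a curly quote

try:
    term.encode("latin-1")
except UnicodeEncodeError as e:
    print(e)  # 'latin-1' codec can't encode character '\u2019' ...

# Two common mitigations:
utf8_bytes = term.encode("utf-8")          # UTF-8 covers all of Unicode
ascii_term = term.replace("\u2019", "'")   # or normalize to an ASCII apostrophe
```
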

@lijiaxi2018 (Collaborator) commented:

I checked the code, and the error appears to be happening in this block:

```python
for i in range(num_glossary):
    try:
        glossary_single = raw_glossaries[i]
        # remove '/' from the term
        if '/' in glossary_single[2]:
            glossary_single[7] = glossary_single[7].replace('/', '')
        # Checked if this glossary has already been saved
        resp = requests.get(f'{self.target_host}/api/ASLVideo/GetASLVideosByUniqueASLIdentifier',
                            headers={'Authorization': 'Bearer %s' % self.jwt},
                            params={'uniqueASLIdentifier': glossary_single[7]})
        resp.raise_for_status()
        if len(resp.text) > 0:
            print('Already Saved')
        else:
            print('Not Saved')
            # Save to Database
            asl_id = ''
            resp = requests.post(url='%s/api/ASLVideo' % (self.target_host),
                                 headers={'Content-Type': 'application/json',
                                          'Authorization': 'Bearer %s' % self.jwt},
                                 data=json.dumps({
                                     "term": glossary_single[2],
                                     "kind": glossary_single[3],
                                     "text": glossary_single[4],
                                     "websiteURL": glossary_single[5],
                                     "downloadURL": glossary_single[6],
                                     "source": glossary_single[0],
                                     "licenseTag": "RIT/NTID",
                                     "domain": glossary_single[1],
                                     "likes": 0,
                                     "uniqueASLIdentifier": glossary_single[7]
                                 }))
            resp.raise_for_status()
            data = json.loads(resp.text)
            asl_id = data['id']
            # Download
            if glossary_single[0] == 'ASLCORE':
                vimeodownload.download_vimeo_video(glossary_single[6], glossary_single[5], ASLCORE_PATH, glossary_single[7])
            elif glossary_single[0] == 'DeafTEC':
                video_request = Request(glossary_single[6])
                video_request.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0')
                video_content = urlopen(video_request).read()
                with open(os.path.join(DEAFTEC_PATH, glossary_single[7] + '.mp4'), 'wb') as f:
                    f.write(video_content)
            # Connect ASLVideos to Glossary
            resp = requests.get(f'{self.target_host}/api/Glossary/GetGlossaryByTerm',
                                headers={'Authorization': 'Bearer %s' % self.jwt},
                                params={'term': glossary_single[2]})
            resp.raise_for_status()
            # data is a list
            data = json.loads(resp.text)
            for g in data:
                g_id = g['id']
                resp = requests.post(url='%s/api/ASLVideoGlossaryMap' % (self.target_host),
                                     headers={'Content-Type': 'application/json',
                                              'Authorization': 'Bearer %s' % self.jwt},
                                     data=json.dumps({
                                         "glossaryId": g_id,
                                         "aslVideoId": asl_id,
                                         "published": True
                                     }))
                resp.raise_for_status()
        for entry in glossary_single:
            print(entry)
        print('')
    except Exception as e:
        self.logger.error(' [%s] PythonCrawler failed to look up for a specific term: %s' % (source_id, str(e)))
        return
```
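For what it's worth, two of the obvious suspects in this block look safe by default: `json.dumps` escapes non-ASCII characters (so the POST bodies are pure ASCII), and query parameters are percent-encoded as UTF-8. A quick check with a hypothetical term value:

```python
import json
import urllib.parse

# json.dumps defaults to ensure_ascii=True, so non-ASCII characters in the
# term are escaped and the resulting request body is pure ASCII.
payload = json.dumps({"term": "Karnaugh map\u2019s"})
print(payload)  # {"term": "Karnaugh map\u2019s"} -- escaped, ASCII-safe
payload.encode("ascii")  # raises no exception

# Query parameters are percent-encoded as UTF-8 before transmission.
query = urllib.parse.urlencode({"uniqueASLIdentifier": "map\u2019s"})
print(query)  # uniqueASLIdentifier=map%E2%80%99s
```

If those paths are indeed safe, the latin-1 failure likely originates elsewhere (for example in the DeafTEC `urlopen` path or in console/log output), which extra logging should help pinpoint.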

The failing code is inside a try/except that is itself inside the for loop, so the remaining ASL videos should still be downloadable.
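To make per-term failures easier to diagnose, the except block could log the offending term with `%r`, which shows escapes such as `\u2019` in the logs, and keep going rather than stopping. A hedged sketch, where `process_entry` is a hypothetical stand-in for the body of the crawler loop:

```python
def crawl_all(raw_glossaries, process_entry, logger):
    """Process every glossary entry, collecting failures instead of aborting."""
    failures = []
    for glossary_single in raw_glossaries:
        try:
            process_entry(glossary_single)
        except Exception as e:
            # %r shows escapes such as '\u2019', making encoding bugs obvious
            logger.error('failed on term %r: %s', glossary_single[2], e)
            failures.append(glossary_single[2])  # continue with the next entry
    return failures
```
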

Could you please share more of the docker container log output? I was also wondering whether the error is happening on the production server or the staging server. Thank you very much.

Best,
Jiaxi

@lijiaxi2018 (Collaborator) commented:

Hi Professor @angrave, please review #67, which adds progress messages for the crawler pipeline.

After the new code is merged, please re-initialize the crawler pipeline so that we can look deeper into the cause of the issue. Thank you very much.
