
Crawler exception on non-ascii #66

Open
angrave opened this issue Nov 14, 2023 · 3 comments
angrave commented Nov 14, 2023

```
crawler | aslcore-engineering-karnaugh map-1-312769046
crawler |
crawler | Not Saved
crawler | agent.listener.TaskNames.PythonCrawler 2023-11-14 05:44:57,148 [1] PythonCrawler failed to look up for a specific term: 'latin-1' codec can't encode character '\u2019' in position 52: ordinal not in range(256)
```


angrave commented Nov 14, 2023

Looks like it threw an exception because of a character-encoding issue and then didn't download any more ASL videos from this source.
FWIW, U+2019 is Unicode character 'RIGHT SINGLE QUOTATION MARK'.
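For reference, the failure can be reproduced in isolation: latin-1 (ISO-8859-1) only covers code points 0–255, so U+2019 cannot be encoded with it. A minimal sketch (the term string here is hypothetical, not taken from the crawler's data):

```python
# Minimal reproduction of the crawler's UnicodeEncodeError: latin-1
# only covers code points 0-255, so U+2019 cannot be represented.
term = "Karnaugh map\u2019s cells"  # hypothetical term containing a curly quote

try:
    term.encode("latin-1")
except UnicodeEncodeError as e:
    print(e)  # 'latin-1' codec can't encode character '\u2019' ...

# Two common mitigations:
utf8_bytes = term.encode("utf-8")          # UTF-8 covers all of Unicode
ascii_term = term.replace("\u2019", "'")   # or normalize to an ASCII apostrophe
```
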

@lijiaxi2018 (Collaborator) commented:

I checked the code, and the error appears to be happening in this block:

```python
for i in range(num_glossary):
    try:
        glossary_single = raw_glossaries[i]
        # remove '/' from the term
        if '/' in glossary_single[2]:
            glossary_single[7] = glossary_single[7].replace('/', '')
        # Checked if this glossary has already been saved
        resp = requests.get(f'{self.target_host}/api/ASLVideo/GetASLVideosByUniqueASLIdentifier',
                            headers={'Authorization': 'Bearer %s' % self.jwt},
                            params={'uniqueASLIdentifier': glossary_single[7]})
        resp.raise_for_status()
        if len(resp.text) > 0:
            print('Already Saved')
        else:
            print('Not Saved')
            # Save to Database
            asl_id = ''
            resp = requests.post(url='%s/api/ASLVideo' % (self.target_host),
                                 headers={'Content-Type': 'application/json',
                                          'Authorization': 'Bearer %s' % self.jwt},
                                 data=json.dumps({
                                     "term": glossary_single[2],
                                     "kind": glossary_single[3],
                                     "text": glossary_single[4],
                                     "websiteURL": glossary_single[5],
                                     "downloadURL": glossary_single[6],
                                     "source": glossary_single[0],
                                     "licenseTag": "RIT/NTID",
                                     "domain": glossary_single[1],
                                     "likes": 0,
                                     "uniqueASLIdentifier": glossary_single[7]
                                 }))
            resp.raise_for_status()
            data = json.loads(resp.text)
            asl_id = data['id']
            # Download
            if glossary_single[0] == 'ASLCORE':
                vimeodownload.download_vimeo_video(glossary_single[6], glossary_single[5], ASLCORE_PATH, glossary_single[7])
            elif glossary_single[0] == 'DeafTEC':
                video_request = Request(glossary_single[6])
                video_request.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0')
                video_content = urlopen(video_request).read()
                with open(os.path.join(DEAFTEC_PATH, glossary_single[7] + '.mp4'), 'wb') as f:
                    f.write(video_content)
            # Connect ASLVideos to Glossary
            resp = requests.get(f'{self.target_host}/api/Glossary/GetGlossaryByTerm',
                                headers={'Authorization': 'Bearer %s' % self.jwt},
                                params={'term': glossary_single[2]})
            resp.raise_for_status()
            # data is a list
            data = json.loads(resp.text)
            for g in data:
                g_id = g['id']
                resp = requests.post(url='%s/api/ASLVideoGlossaryMap' % (self.target_host),
                                     headers={'Content-Type': 'application/json',
                                              'Authorization': 'Bearer %s' % self.jwt},
                                     data=json.dumps({
                                         "glossaryId": g_id,
                                         "aslVideoId": asl_id,
                                         "published": True
                                     }))
                resp.raise_for_status()
        for entry in glossary_single:
            print(entry)
        print('')
    except Exception as e:
        self.logger.error(' [%s] PythonCrawler failed to look up for a specific term: %s' % (source_id, str(e)))
        return
```
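For what it's worth, two of the obvious suspects in this block look safe by default: `json.dumps` escapes non-ASCII characters (so the POST bodies are pure ASCII), and query parameters are percent-encoded as UTF-8. A quick check with a hypothetical term value:

```python
import json
import urllib.parse

# json.dumps defaults to ensure_ascii=True, so non-ASCII characters in the
# term are escaped and the resulting request body is pure ASCII.
payload = json.dumps({"term": "Karnaugh map\u2019s"})
print(payload)  # {"term": "Karnaugh map\u2019s"} -- escaped, ASCII-safe
payload.encode("ascii")  # raises no exception

# Query parameters are percent-encoded as UTF-8 before transmission.
query = urllib.parse.urlencode({"uniqueASLIdentifier": "map\u2019s"})
print(query)  # uniqueASLIdentifier=map%E2%80%99s
```

If those paths are indeed safe, the latin-1 failure likely originates elsewhere (for example in the DeafTEC `urlopen` path or in console/log output), which extra logging should help pinpoint.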

The failing code is inside a try/except that is itself inside the for loop, so the remaining ASL videos should still be downloadable.
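To make per-term failures easier to diagnose, the except block could log the offending term with `%r`, which shows escapes such as `\u2019` in the logs, and keep going rather than stopping. A hedged sketch, where `process_entry` is a hypothetical stand-in for the body of the crawler loop:

```python
def crawl_all(raw_glossaries, process_entry, logger):
    """Process every glossary entry, collecting failures instead of aborting."""
    failures = []
    for glossary_single in raw_glossaries:
        try:
            process_entry(glossary_single)
        except Exception as e:
            # %r shows escapes such as '\u2019', making encoding bugs obvious
            logger.error('failed on term %r: %s', glossary_single[2], e)
            failures.append(glossary_single[2])  # continue with the next entry
    return failures
```
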

Could you please share more of the docker container log output? I was also wondering whether the error is happening on the production server or the staging server. Thank you very much.

Best,
Jiaxi

@lijiaxi2018 (Collaborator) commented:

Hi Professor @angrave, please review #67, which adds progress messages for the crawler pipeline.

After the new code is merged, please re-initialize the crawler pipeline so that we can look deeper into the cause of the issue. Thank you very much.
