edx-dl not able to download videos from edx platform #559

Closed
MATRIX30 opened this issue Oct 20, 2019 · 53 comments · Fixed by #570

Comments

@MATRIX30

🚨Please review the Troubleshooting section
before reporting any issue. Don't forget also to check the current issues to
avoid duplicates.

Subject of the issue

edx-dl fails to extract and download videos for "https://courses.edx.org/courses/course-v1:EdinburghX+PA1.1x+3T2019/course/" on www.edx.org.
It seems the videos for this course are sourced from "https://media.ed.ac.uk/" and not from YouTube.
I need help resolving this issue.

Your environment

  • Operating System (name/version): Windows 10 Professional
  • Python version: 3.7.0
  • youtube-dl version: 2019.09.28
  • edx-dl version: 0.1.10

Steps to reproduce

  • Create an account on edX.
  • Enroll in the course "https://courses.edx.org/courses/course-v1:EdinburghX+PA1.1x+3T2019/course/".
  • Type the following into CMD:
edx-dl -u username -p password -o path --ignore-errors --cache https://courses.edx.org/courses/course-v1:EdinburghX+PA1.1x+3T2019/course/

Expected behaviour

The download should start normally.

Actual behaviour

edx_dl version 0.1.10
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Introduction to Predictive Analytics [course-v1:EdinburghX+PA1.1x+3T2019/co]
Downloading 0 section(s)
loading 2329 urls from cache [edx-dl.cache]
Extracting all units information in parallel.
No downloadable video found.

@YukunXia

Having the same issue :(

@mor3dr3ad

Confirmed with different url:
https://courses.edx.org/courses/course-v1:MITx+14.750x+3T2019/course/

Output of --debug:

root[main] edx_dl version 0.1.10
root[parse_file_formats] file_formats: ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg', 'zip', 'rar', 'gz', 'mp3', 'R', 'Rmd', 'ipynb', 'py']
root[edx_get_headers] Building initial headers for future requests.
root[_get_initial_token] Getting initial CSRF token.
root[_get_initial_token] Found CSRF token.
root[edx_get_headers] Headers built: {'User-Agent': 'edX-downloader/0.01', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8', 'Referer': 'https://courses.edx.org/login_ajax', 'X-Requested-With': 'XMLHttpRequest', 'X-CSRFToken': 'PUsSLjqYvxBtMFO07I7RfYRpxPPZdHE0zWBVoJk4aqqo8AOSciOeEoSTr49FvNeH'}
root[edx_login] Logging into Open edX site: https://courses.edx.org/login_ajax
root[get_courses_info] Extracting course information from dashboard.
root[get_courses_info] Data extracted: ["lotsofcourseswhichidontwanttoshare"]
root[get_available_sections] Extracting sections for :https://courses.edx.org/courses/course-v1:MITx+14.750x+3T2019/course/
root[get_available_sections] Extracted sections: []
root[_display_selections] Downloading Political Economy and Economic Development [course-v1:MITx+14.750x+3T2019/co]
root[_display_sections] Downloading 0 section(s)
root[extract_all_units_in_sequence] Extracting all units information in sequentially.
root[extract_all_units_in_sequence] urls: []
root[parse_units] No downloadable video found.

@adizukerman

@ozhaggis

Same issue with multiple courses.

edx_dl version 0.1.10
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Data Science: Machine Learning [course-v1:HarvardX+PH125.8x+2T2019/co]
Downloading 0 section(s)
Extracting all units information in parallel.
No downloadable video found.

@EugeneLoy
Contributor

EugeneLoy commented Oct 26, 2019

Same issue. Course: https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/

It looks like edx-dl is missing most of the sections of the course. In my example, it sees only 1 section, while the edX site displays more than 5 (at the moment):

> edx-dl.py -u <username> --list-sections https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/
edx_dl version 0.1.10
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Fundamentals of Statistics [course-v1:MITx+18.6501x+3T2019/co] has 1 sections so far
 1 - Download Entrance Survey videos

@not-lucky

not-lucky commented Oct 27, 2019

Here's mine...

Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Calculus Applied! [course-v1:HarvardX+CalcAPL1x+2T2019/co]
Downloading 3 section(s)
Section 1: Optional Sections (CHOOSE 1 of 3)
Optional Sections
Section 2: Section 12: Course Wrap Up
End of Course Survey
Course Feedback Forum (Optional)
Section 3: Acknowledgements
Course Team and Special Thanks
Section 1: What Makes a Good Test Question? Mathematical Models to Measure Knowledge and Improve Learning
Section 2: Economic Applications of Calculus: Elasticity and A Tale of Two Cities
Section 3: From X-rays to CT scans: Mathematics and Medical Imaging
Section 4: What is Middle Income? Thinking about Income Distributions with Statistics and Calculus
Section 5: Population Dynamics Part I: the Evolution of Population Models and Section 6: Population Dynamics II: A Biological Puzzle OR How Fishing Affects a Predator-Prey System
Section 7: Extinction, Chaos and other Bifurcation Behavior, Section 8: Bifurcation Part II: Outbreak! Budworm Populations and Bifurcations, Section 9: Bifurcation Part III: Species in Competition: Coexistence or Exclusion
Section 10: E = mc²: Taylor Approximation and the Energy Equation
Final Assessments
Extracting all units information in parallel.
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@944fb6867b354e2cafb41415aae41415'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@2101c542ac614691acc54224d3c314a8'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@5864500159ef40f9839d66d2492fea58'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@13aed97186fd4c7588a5ea1399e096df'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@a53371a01e9c4fd28dcb1a1609614da7'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@ebf2c858d37e418583f839965631108f'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@d4e29c075ff14ad583a3750767faf698'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@6cc97f049d444c4f8470b88ad3fdbc52'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@0f7edf523c55490e8380b6e9a809df33'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@fb7c4d1c1a2649b29e472b2ef86a36ce'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@edb436fadf2c4b74b175b9b5b6334b48'
Processing 'https://courses.edx.org/courses/course-v1:HarvardX+CalcAPL1x+2T2019/jump_to/block-v1:HarvardX+CalcAPL1x+2T2019+type@vertical+block@1ccb65aca6b34beda14dedfa6bffafbc'
Removed 0 duplicated urls from 0 in total
Output directory: Downloaded

@abeckman

Same issue with multiple courses.

@lubaroli

Same issue here, edx-dl only sees the first section.

Here's the log:
root[main] edx_dl version 0.1.10
root[parse_file_formats] file_formats: ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg', 'zip', 'rar', 'gz', 'mp3', 'R', 'Rmd', 'ipynb', 'py']
Password:
root[edx_get_headers] Building initial headers for future requests.
root[_get_initial_token] Getting initial CSRF token.
root[_get_initial_token] Found CSRF token.
root[edx_get_headers] Headers built: {'User-Agent': 'edX-downloader/0.01', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8', 'Referer': 'https://courses.edx.org/login_ajax', 'X-Requested-With': 'XMLHttpRequest', 'X-CSRFToken': 'wWr0eKCgnA1uusK8rQvzPJHFK8bXmxn4i1pxyGtnuxsy0MRE8LXYh87mk8DN1eST'}
root[edx_login] Logging into Open edX site: https://courses.edx.org/login_ajax
root[get_courses_info] Extracting course information from dashboard.
root[get_courses_info] Data extracted: [Fundamentals of Statistics: https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/, TOEFL Test Preparation: The Insider’s Guide: https://courses.edx.org/courses/course-v1:ETSx+TOEFLx+3T2017/course/, Minds and Machines: https://courses.edx.org/courses/course-v1:MITx+24.09x+3T2015/course/, Practical Learning Analytics: https://courses.edx.org/courses/course-v1:MichiganX+PLAx+2T2016/course/, Embedded Systems - Shape the World: https://courses.edx.org/courses/course-v1:UTAustinX+UT.6.03x+1T2016/course/, The Science of Everyday Thinking: https://courses.edx.org/courses/course-v1:UQx+Think101x+2T2015/course/, Electronic Interfaces: https://courses.edx.org/courses/course-v1:BerkeleyX+EE40LX+2T2015/course/, Autonomous Navigation for Flying Robots: https://courses.edx.org/courses/TUMx/AUTONAVx/2T2014/course/, Next Generation Infrastructures - Part 2: https://courses.edx.org/courses/DelftX/NGI102x/3T2014/course/, Solar Energy: https://courses.edx.org/courses/DelftX/ET.3034TU/3T2014/course/, Circuits and Electronics: https://courses.edx.org/courses/MITx/6.002_4x/3T2014/course/]
root[get_available_sections] Extracting sections for :https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/course/
root[get_available_sections] Extracted sections: [<edx_dl.common.Section object at 0x1042f6110>]
root[_display_selections] Downloading Fundamentals of Statistics [course-v1:MITx+18.6501x+3T2019/co]
root[_display_sections] Downloading 1 section(s)
root[_display_sections] Section 1: Entrance Survey
root[_display_sections] 1. Entrance Survey
root[extract_all_units_in_parallel] Extracting all units information in parallel.
root[extract_all_units_in_parallel] urls: ['https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/jump_to/block-v1:MITx+18.6501x+3T2019+type@vertical+block@entrancesurvey-tab1']
root[extract_units] Processing 'https://courses.edx.org/courses/course-v1:MITx+18.6501x+3T2019/jump_to/block-v1:MITx+18.6501x+3T2019+type@vertical+block@entrancesurvey-tab1'
root[main] Removed 0 duplicated urls from 0 in total
root[download] Output directory: Downloaded

@wzhuwz

wzhuwz commented Nov 1, 2019

Looks like edx-dl is missing most of the sections of the course. My case https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/course/.

Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading FA18: Deterministic Optimization [course-v1:GTx+ISYE6669+2T2018/co]
Downloading 5 section(s)
Section 1: Getting Started
Welcome Message
Syllabus
Getting Help
Getting to Know Each Other
Section 2: Discussions and Q&A
Discussions and Q&A Forums
Section 3: Proctoring Information - Verified Learners
Section 4: Midterm Exam - Verified Learners
Section 5: Final Exam - Verified Learners
Extracting all units information in parallel.
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@b4e0e428596e4a438b61d9c44a66ff45'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@6e0eef9f7a9b4eed99ea9c1ad8e37b16'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@d827bed0374e46b5a0abe62978b7cca8'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@3247cb48d14b4f1e97bb9dd74d1ec8a2'
Processing 'https://courses.edx.org/courses/course-v1:GTx+ISYE6669+2T2018/jump_to/block-v1:GTx+ISYE6669+2T2018+type@vertical+block@c49832c367cc47be96ba15a3ce5e9d8c'
Removed 0 duplicated urls from 0 in total
Output directory: Downloaded

@dorianherle

I have the same issue:

edx_dl version 0.1.10
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Downloading Introduction to Discrete Choice Models [course-v1:EPFLx+DiscreteChoiceX+3T2017/co]
Downloading 0 section(s)
Extracting all units information in parallel.
No downloadable video found.

@mor3dr3ad

mor3dr3ad commented Nov 4, 2019

So, I've dug into the code a bit and I think I found the issue: for some courses, edX has again updated the structure of its website. The issue is with line 397 in /edx_dl/parsing.py:

    sections_soup = soup.find_all('li', class_='outline-item section')

In the new format, the sections have a different class, namely "outline-item section scored".

This should be easy to fix. I will try to hack something together, but it had better be checked by someone experienced.

@mor3dr3ad

Alright, quick fix:

replace as follows in /edx_dl/parsing.py:

Line 385: replace

    subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable')

with

    subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])

and line 397: replace

    sections_soup = soup.find_all('li', class_='outline-item section')

with

    sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])

This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.
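
To illustrate why an exact class-string lookup stops matching once an extra token such as `scored` appears, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The `ClassTokenFinder` helper and the sample markup are hypothetical and are not part of edx-dl; the idea is to treat `class` as a set of tokens so that additional tokens do not matter.

```python
from html.parser import HTMLParser


class ClassTokenFinder(HTMLParser):
    """Collect <li> tags whose class attribute contains all required tokens.

    Hypothetical helper for illustration; edx-dl itself uses BeautifulSoup.
    """

    def __init__(self, required):
        super().__init__()
        self.required = set(required.split())
        self.matches = []  # raw class strings of the matching <li> tags

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        class_value = dict(attrs).get("class") or ""
        # Subset test: extra tokens such as "scored" are ignored.
        if self.required <= set(class_value.split()):
            self.matches.append(class_value)


# Made-up markup with one old-style and one new-style section.
SAMPLE = """
<ol>
  <li class="outline-item section">Week 1</li>
  <li class="outline-item section scored">Week 2</li>
  <li class="vertical outline-item focusable">Unit</li>
</ol>
"""

finder = ClassTokenFinder("outline-item section")
finder.feed(SAMPLE)

# An exact string comparison only finds the old-style section.
exact = [c for c in finder.matches if c == "outline-item section"]
print(len(exact), len(finder.matches))
```

With BeautifulSoup itself, a CSS selector such as `soup.select('li.outline-item.section')` should give similar token-based matching, though that is a suggestion and not what the quick fix above does.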

@not-lucky

Thanks a lot.
It's working now.

@malawadd

malawadd commented Nov 4, 2019

Thank you, it works now.

@malawadd

malawadd commented Nov 4, 2019

> Alright, quick fix:
>
> replace as follows in /edx_dl/parsing.py:
>
> Line 385:
> subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])
>
> and line 397:
>
> sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])
>
> This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.

This partially works, but it still misses some weeks and modules. I tried it on this course:

https://courses.edx.org/courses/course-v1:CurtinX+MKT1x+1T2019/course/

and the entire module 3 didn't download.

@mor3dr3ad

@malawadd
Can you please share error messages/debug info? Do the sections just not download, or does it exit with a message?

@malawadd

malawadd commented Nov 4, 2019

@mor3dr3ad

It downloads an empty folder but skips all the content, then proceeds to download the following module and all its content. There are no error messages or anything.

@mor3dr3ad

I just ran the course you mentioned and it seems to be working for me. I will do some more testing this week. In the meantime, maybe download the missing videos manually.

@malawadd

malawadd commented Nov 4, 2019

@mor3dr3ad
Do you mind telling me more about the testing you plan to run? I would like to try to fix this, but I am not sure where to start or what exactly I should look for.

@mor3dr3ad

@malawadd
Well, for starters you could help by providing some more debugging info: use the --debug flag when running edx-dl with the course you mentioned and post the output.

For me, my fix is working, even with your course. So without being able to reproduce your error I can only assume there is a different issue (maybe using a different version of edx-dl?)

@rbrito
Member

rbrito commented Nov 5, 2019

If something fixes a program, why don't you submit your changes as a pull request to fix things (or get things slightly improved) for other users?

@mor3dr3ad

mor3dr3ad commented Nov 5, 2019 via email

@rbrito
Member

rbrito commented Nov 5, 2019

Thanks, please do and I can do a round of code review and merge everything. That will be awesome!

@maxshatskiy

maxshatskiy commented Nov 6, 2019

Hello,

> Alright, quick fix:
>
> replace as follows in /edx_dl/parsing.py:
>
> Line 385:
> subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])
>
> and line 397:
>
> sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])
>
> This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.

This solution works for many courses, but now old courses are not supported:
https://courses.edx.org/courses/course-v1:KTHx+DTS02.1x+1T2018/course/

@adizukerman

adizukerman commented Nov 6, 2019

For class https://courses.edx.org/courses/course-v1:MITx+2.830.2x+3T2019/course/ it worked partially. Not all videos and attachments were downloaded.

By the way, thank you to everyone who is working on this. This tool is so helpful as a time saver to allow working on classes offline.

@WajdiBenSaad

> Alright, quick fix:
>
> replace as follows in /edx_dl/parsing.py:
>
> Line 385:
> subsections_soup = section_soup.find_all('li', class_='vertical outline-item focusable') with subsections_soup = section_soup.find_all('li', class_=['vertical outline-item focusable', 'vertical outline-item focusable scored'])
>
> and line 397:
>
> sections_soup = soup.find_all('li', class_='outline-item section') with sections_soup = soup.find_all('li', class_=['outline-item section', 'outline-item section scored'])
>
> This should work for both the 'old' and new format. Will try to run some tests and create a merge request sometime this week.

This should be integrated into a new release. edX has changed its website structure, and this change breaks all download operations with edx-dl.

@antoniosereno

Thanks everyone! I'm facing the same issue and unfortunately the solution provided does not work with this course:
https://courses.edx.org/courses/course-v1:EdinburghX+CCSx+3T2019/course/
any hint?

@EugeneLoy
Contributor

@malawadd I've checked the course you are having a problem with, and it looks like some of the videos are no longer available:

[download] https://www.youtube.com/watch?v=N9SFeRNAfEA => Downloaded\Digital_Branding_and_Engagement\02-Module_1-_The_Digital_Consumer\02-%(title)s-%(id)s.%(ext)s
Downloading video with URL https://www.youtube.com/watch?v=N9SFeRNAfEA from YouTube.
[youtube] N9SFeRNAfEA: Downloading webpage
[youtube] N9SFeRNAfEA: Downloading video info webpage
WARNING: Unable to extract video title
WARNING: unable to extract description; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
ERROR: This video is no longer available because the YouTube account associated with this video has been terminated.
Sorry about that.

It is likely that your specific problem was caused by the deletion of the video from YouTube itself, not by a bug in edx-dl.

@antoniosereno

Hi @EugeneLoy , thank you for your help!
May I ask if you were able to download this course?
https://courses.edx.org/courses/course-v1:EdinburghX+CCSx+3T2019/course/
I'm having trouble with it but not with others

@EugeneLoy
Contributor

@antoniosereno yes, I've been able to download that course.

@antoniosereno

OK, I've downloaded edx-dl-cummulative and did everything you suggested, and now it gives me an HTTP Error 400: Bad Request.

Yesterday I was able to access the course list; now I can't anymore.

Is there anything I'm missing?

@EugeneLoy
Contributor

EugeneLoy commented Dec 6, 2019

@antoniosereno are you sure you are running the code from the cummulative branch of the repo and not the version installed globally on your system?

The error you are getting looks like the one that should be fixed by #569 .

One way to run the code from the repo is to cd into the repo root and point python to the .py file directly, like this:

python edx-dl.py -u <user> <course_url>

If this doesn't help, please post the full debug output so I can figure out what went wrong.
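
A quick way to check which copy of a package Python would import is to ask `importlib` where the module lives. This is only a sketch: `module_origin` and `looks_globally_installed` are hypothetical helper names, and `json` stands in for `edx_dl` so that the snippet runs anywhere.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def looks_globally_installed(path):
    """Heuristic: a path under site-packages suggests an installed copy."""
    normalized = (path or "").replace("\\", "/").lower()
    return "/site-packages/" in normalized


# "json" is used only because it is always available;
# replace it with "edx_dl" when running from the repo root.
print(module_origin("json"))
```

If the printed path for `edx_dl` points into `site-packages`, the globally installed version is being picked up rather than the checkout.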

@naefl

naefl commented Dec 8, 2019

Hi @EugeneLoy,

It doesn't work on my end either.

From your fork root dir:

In:

python edx-dl.py -u <name>@gmail.com https://courses.edx.org/courses/course-v1:DavidsonX+D001x+3T2018/course/

Out:

rses.edx.org/courses/course-v1:DavidsonX+D001x+3T2018/course/ --debug
root[main] edx_dl version 0.1.10
root[parse_file_formats] file_formats: ['e?ps', 'pdf', 'txt', 'doc', 'xls', 'ppt', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp', 'odg', 'zip', 'rar', 'gz', 'mp3', 'R', 'Rmd', 'ipynb', 'py']
Password:
root[edx_get_headers] Building initial headers for future requests.
root[_get_initial_token] Getting initial CSRF token.
Traceback (most recent call last):
  File "edx-dl.py", line 6, in <module>
    edx_dl.main()
  File "/root/workspace/edx-dl/edx_dl/edx_dl.py", line 1000, in main
    headers = edx_get_headers()
  File "/root/workspace/edx-dl/edx_dl/edx_dl.py", line 425, in edx_get_headers
    'X-CSRFToken': _get_initial_token(EDX_HOMEPAGE),
  File "/root/workspace/edx-dl/edx_dl/edx_dl.py", line 167, in _get_initial_token
    opener.open(url)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
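
For anyone hitting the same traceback, here is a sketch of how such an error can be caught so that the status and the start of the response body get printed; the body often says why the request was rejected. `describe_http_error` is a hypothetical helper, not part of edx-dl, and the error below is built by hand instead of coming from a real request.

```python
import io
import urllib.error
from email.message import Message


def describe_http_error(err):
    """Summarize an HTTPError, including the start of the response body."""
    body = err.read().decode("utf-8", errors="replace")
    return "HTTP {} {} for {}: {}".format(err.code, err.reason, err.url, body[:200])


# Hand-built error standing in for the one raised by opener.open(url).
fake = urllib.error.HTTPError(
    "https://courses.edx.org/",  # URL chosen for illustration only
    400,
    "Bad Request",
    Message(),
    io.BytesIO(b"example body"),
)

try:
    raise fake
except urllib.error.HTTPError as err:
    summary = describe_http_error(err)
    print(summary)
```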

@EugeneLoy
Contributor

@naefl @antoniosereno I think I know what the problem is. However, I'll need a bit more cooperation from you to make sure, since I cannot reproduce this in my environment.

I've added a commit with a test fix and some debug output to the cummulative branch. Please grab it and let me know if this works for you now.

If this doesn't fix the issue, please post the full debug output as before, as well as the output of the following:

curl -v https://courses.edx.org/user_api/v1/account/login_session/

@antoniosereno

Thank you Eugene..
This is my output when I try to list courses:

(base) C:\edx-dl-cummulative\edx-dl-cummulative>edx-dl -u [email protected] --list-courses
edx_dl version 0.1.10
Password:
Building initial headers for future requests.
Getting initial CSRF token.
Traceback (most recent call last):
  File "c:\users\anton\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\anton\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\anton\Anaconda3\Scripts\edx-dl.exe\__main__.py", line 9, in <module>
  File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 1000, in main
    headers = edx_get_headers()
  File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 425, in edx_get_headers
    'X-CSRFToken': _get_initial_token(EDX_HOMEPAGE),
  File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 167, in _get_initial_token
    opener.open(url)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "c:\users\anton\anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

And this is the output of the command you asked us to run:

`(base) C:\edx-dl-cummulative\edx-dl-cummulative>curl -v https://courses.edx.org/user_api/v1/account/login_session/

  • Trying 54.85.51.136:443...
  • TCP_NODELAY set
  • Connected to courses.edx.org (54.85.51.136) port 443 (#0)

GET /user_api/v1/account/login_session/ HTTP/1.1
Host: courses.edx.org
User-Agent: curl/7.65.3
Accept: */*

  • schannel: failed to decrypt data, need more data
  • Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Allow: GET, POST, HEAD, OPTIONS
    < Cache-control: no-cache="set-cookie"
    < Content-Language: en
    < Content-Type: application/json
    < Date: Sun, 08 Dec 2019 12:09:58 GMT
    < P3P: CP="edX does not have a P3P policy. Review our privacy policy at https://edx.org/privacy"
    < Server: nginx
    < Set-Cookie: csrftoken=5657a4q6CepadqTkeWzFuSVnvpVqaJlrFmdBbyGDtSQZsdL7uRjpUGCCMPSWJVw1; expires=Sun, 06-Dec-2020 12:09:58 GMT; Max-Age=31449600; Path=/; secure
    < Set-Cookie: prod-edx-sessionid="1|jnhlbvh7w39f44dwj782otpg3042k98f|3QLYtGo6h2Dw|IjVjZWI5MjkwZjkxZjA4OTg5Y2MwMmFiZTI2Y2JlY2E1NDZiNTNiYjFmMjIyZTEyM2I4NDJhYTE0OGExNDI1MDki:1idvNu:2YNry_Y95HcfEbPNwxVfnjrbwtE"; Domain=.edx.org; expires=Sun, 22-Dec-2019 12:09:58 GMT; httponly; Max-Age=1209600; Path=/; secure
    < Set-Cookie: AWSELB=D1EF6B6510E347E5B895826CD53CF4FD55E0CFA9A951F8E39A00AC86C5195B42EB656E552F728A68C9A3299E8F6AFF2A1A23123006583EAE591F65FD084E6693F1009EDC31;PATH=/;MAX-AGE=120
    < Strict-Transport-Security: max-age=3600; includeSubDomains
    < Vary: Accept-Encoding
    < Vary: Cookie, Accept-Language
    < X-Content-Type-Options: nosniff
    < X-Frame-Options: DENY
    < Content-Length: 650
    < Connection: keep-alive
    <
    {"submit_url": "/user_api/v1/account/login_session/", "fields": [{"errorMessages": {}, "supplementalLink": "", "placeholder": "[email protected]", "instructions": "The email address you used to register with edX", "restrictions": {"min_length": 3, "max_length": 254}, "name": "email", "defaultValue": "", "required": true, "label": "Email", "supplementalText": "", "type": "email"}, {"errorMessages": {}, "supplementalLink": "", "placeholder": "", "instructions": "", "restrictions": {"max_length": 5000}, "name": "password", "defaultValue": "", "required": true, "label": "Password", "supplementalText": "", "type": "password"}], "method": "post"}* Connection #0 to host courses.edx.org left intact`
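
The `csrftoken` value in the `Set-Cookie` header above appears to be what the client sends back as `X-CSRFToken`. Here is a minimal sketch of pulling it out with the standard library. The header value is shortened to a placeholder token, and `extract_csrf_token` is a hypothetical name, not edx-dl's own function.

```python
from http.cookies import SimpleCookie


def extract_csrf_token(set_cookie_value):
    """Return the csrftoken value from one Set-Cookie header value, or None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_value)
    morsel = cookie.get("csrftoken")
    return morsel.value if morsel is not None else None


# Placeholder token; the attributes mirror the shape of the header in the log.
header = "csrftoken=PLACEHOLDER123; Max-Age=31449600; Path=/; secure"
print(extract_csrf_token(header))
```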

@EugeneLoy
Contributor

EugeneLoy commented Dec 8, 2019

@antoniosereno Thanks, but from your debug output I can say for sure that the edx-dl installed in your environment is being used, as indicated by this part of the stack trace:

File "c:\users\anton\anaconda3\lib\site-packages\edx_dl\edx_dl.py", line 167, in _get_initial_token opener.open(url)

Please point your Python directly to the edx-dl.py from the repo to avoid using the version that is installed on your system.

Looking at your post, the command should look something like this:

C:\edx-dl-cummulative\edx-dl-cummulative>python edx-dl.py -u [email protected] --list-courses

@adizukerman

@EugeneLoy , works great with https://courses.edx.org/courses/course-v1:MITx+2.830.2x+3T2019/course/ , thank you so much for the time and effort! I hope it gets integrated into the master build soon.

@antoniosereno

It worked! I was able to download all the videos in the course! Thank you !
May I ask if there's a command to download not only media (video and PDF) but also the written content?

@EugeneLoy
Contributor

As far as I know, if a file is "attached" to a course page, it will be treated as a resource by edx-dl and will be downloaded. At least this has been my experience so far.

Sometimes, however, there is extra content present inline on the page (like errata, tables, extra recitations, and text explanations). As far as I understand, this is what you are interested in.

Now, it just so happens that lately I've been working on a tool that saves this kind of content :)

It is also helpful if you want to save exercises and homework (with explanations), or any other type of content that is displayed on the course pages.

This tool is meant to complement edx-dl and is called edx-archive and can be found here: https://github.com/EugeneLoy/edx-archive

I only released it recently, so if you guys check it out that would be great!

@antoniosereno

Wow, I'll take a look at it! I was initially thinking of doing it manually, but that would be a lot of work! Thank you Eugene!

@naefl

naefl commented Dec 8, 2019

@EugeneLoy that worked, thanks for troubleshooting!

@balta2ar
Member

balta2ar commented Dec 8, 2019

@EugeneLoy from your tool's page

-c, --concurrency number of pages to save in parallel (default: 4)

I don't know what's the current state of their implementation on the backend now, but my impression was that hammering edx servers is generally not a good idea. FWIW, couple of years ago they blocked me by IP for several months after me flooding their servers with requests (debugging this edx-dl, by the way). It's not that the ban could not be surmounted, but the message was clear. So if you ask me, it's more of a courtesy to not put extra pressure on them by default. If you're still not convinced, please take your time to read this thread: #377
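
One common way to stay polite regardless of the concurrency setting is to enforce a minimum gap between requests. Here is a minimal sketch. The `Throttle` class is hypothetical and is not part of edx-dl or edx-archive; its clock and sleep functions are injectable so it can be exercised without actually waiting.

```python
import threading
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds across threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next request slot is free; return the delay used."""
        with self._lock:
            now = self._clock()
            if self._next_allowed is None or now >= self._next_allowed:
                delay = 0.0
                start = now
            else:
                delay = self._next_allowed - now
                start = self._next_allowed
            self._next_allowed = start + self.min_interval
            if delay > 0:
                self._sleep(delay)
            return delay


throttle = Throttle(min_interval=0.01)
delays = [throttle.wait() for _ in range(3)]
print(delays[0])
```

Each worker would call `throttle.wait()` right before issuing a request, so the overall request rate stays bounded however many workers are running.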

@EugeneLoy
Contributor

@balta2ar Thanks, I will take the time to read through #377. However, the motivation behind adding concurrency to the tool is not to speed things up at the expense of edx servers, but to shave off some of the time wasted on page rendering.

The tool takes a snapshot of the page once it has fully rendered (including math processing). Since edx pages can be pretty bloated (I saw pages taking more than a minute to render), a lot of time is wasted waiting for rendering, with no network activity.

The actual workload in terms of average request rate is not high and should not cause any issues with the default settings. In fact, I have used a much higher concurrency factor, and I can say that memory is a much more likely bottleneck than request rate.
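
To illustrate the idea of a concurrency cap where most of the time is spent waiting on rendering, here is a minimal Python sketch. It is not how edx-archive is implemented (that tool runs on node); `render_page`, `save_all`, and the simulated render delay are hypothetical stand-ins, and no network or browser is involved.

```python
import asyncio


async def render_page(page_id, limit, stats, render_seconds=0.01):
    """Hypothetical stand-in for a slow page render (no real browser or network)."""
    async with limit:  # at most `concurrency` renders run at the same time
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(render_seconds)  # simulated time spent waiting for render
        stats["active"] -= 1
    return page_id


async def save_all(page_ids, concurrency=4):
    """Render every page while keeping the number of parallel renders bounded."""
    limit = asyncio.Semaphore(concurrency)
    stats = {"active": 0, "peak": 0}
    saved = await asyncio.gather(*(render_page(p, limit, stats) for p in page_ids))
    return saved, stats["peak"]


def run_demo(page_count=10, concurrency=4):
    """Run the sketch and return (saved page ids, peak parallel renders)."""
    return asyncio.run(save_all(range(page_count), concurrency))


if __name__ == "__main__":
    saved, peak = run_demo()
    print(f"saved {len(saved)} pages, peak parallel renders: {peak}")
```

The semaphore only limits how many renders overlap; it does not change how many requests each page makes, which is why the request rate stays modest when rendering dominates the wall-clock time.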

@antoniosereno

Sorry for the late answer.
Could you please describe the entire procedure for running edx-archive-master? I'm not able to install it: the Anaconda prompt says that npm is not recognised as an internal or external command.

@EugeneLoy
Contributor

@antoniosereno Hi.

npm is "node package manager". It is distributed along with node.

If I am not mistaken, you can get node through conda by installing the nodejs package. Otherwise, you can get it from here.

Once you get npm on your system, install edx-archive:

npm install edx-archive -g

I'll update the README shortly to clarify the npm part.
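
As a quick way to confirm whether npm is visible from the current prompt before installing, a small Python check along these lines may help. This is only a sketch: `find_tool` and `npm_status` are hypothetical helper names, and the check only looks at the PATH of the environment it runs in.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


def npm_status():
    """Build a short message saying whether npm can be found on PATH."""
    path = find_tool("npm")
    if path is None:
        return "npm not found on PATH; install node first"
    return f"npm found at {path}; run: npm install edx-archive -g"


if __name__ == "__main__":
    print(npm_status())
```

If the message says npm is missing right after installing node, opening a new prompt usually picks up the updated PATH.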

@antoniosereno

It works perfectly, @EugeneLoy! Thanks a lot, you saved me a lot of time!

@gaber86

gaber86 commented Feb 5, 2020

Still getting empty folders; it is not working with https://courses.edx.org/courses/course-v1:UCSanDiegoX+DSE230x+3T2019a/course/

@Navid-Alipour-96

I get empty folders. I tried the commands above, but they don't work.
https://courses.edx.org/courses/course-v1:CurtinX+IOT4x+3T2019/course/

@ghost

ghost commented Apr 29, 2020

Is there a way to download a particular video and not the whole course?

@sasidhar22

edx_dl version 0.1.13
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "c:\users\asus\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\asus\appdata\local\programs\python\python38\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 1020, in main
    all_selections = {selected_course:
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 1021, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 184, in get_available_sections
    page = get_page_contents(url, headers)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\utils.py", line 58, in get_page_contents
    result = urlopen(Request(url, None, headers))
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

@MuradShafiyev

edx_dl version 0.1.13
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "c:\users\asus\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\asus\appdata\local\programs\python\python38\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 1020, in main
    all_selections = {selected_course:
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 1021, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 184, in get_available_sections
    page = get_page_contents(url, headers)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\utils.py", line 58, in get_page_contents
    result = urlopen(Request(url, None, headers))
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Same issue :(

@AshMp

AshMp commented Sep 11, 2020

Greetings,
Please kindly assist with the problem shown below. I am failing to download courses from edx. I have followed everything given on the edx-dl GitHub page, but I am stuck at the point shown below. The courses on edx are of great help, and I don't want the knowledge they offer to pass me by. Thank you.

edx_dl version 0.1.13
Building initial headers for future requests.
Getting initial CSRF token.
Found CSRF token.
Logging into Open edX site: https://courses.edx.org/login_ajax
Extracting course information from dashboard.
Traceback (most recent call last):
  File "c:\users\asus\appdata\local\programs\python\python38\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\asus\appdata\local\programs\python\python38\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\Scripts\edx-dl.exe\__main__.py", line 7, in <module>
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 1020, in main
    all_selections = {selected_course:
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 1021, in <dictcomp>
    get_available_sections(selected_course.url.replace('info', 'course'),
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\edx_dl.py", line 184, in get_available_sections
    page = get_page_contents(url, headers)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\site-packages\edx_dl\utils.py", line 58, in get_page_contents
    result = urlopen(Request(url, None, headers))
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "c:\users\asus\appdata\local\programs\python\python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
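
The traceback shows the 403 surfacing as an uncaught `HTTPError` from the `urlopen(Request(url, None, headers))` call. A minimal sketch of how such a call could be wrapped to report the failing URL instead of a raw traceback is below. This is not edx-dl's actual code: `fetch_page` and `describe_http_error` are hypothetical names, and the sketch does not address why the server refuses the request, which this thread does not establish.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def describe_http_error(code, url):
    """Turn an HTTP status code and URL into a short, readable message."""
    if code == 403:
        return f"HTTP 403 Forbidden: the server refused the request for {url}"
    return f"HTTP {code}: request for {url} failed"


def fetch_page(url, headers):
    """Hypothetical wrapper around the urlopen call seen in the traceback."""
    try:
        with urlopen(Request(url, None, headers)) as response:
            return response.read().decode("utf-8")
    except HTTPError as err:
        # Re-raise with the URL included so the report says which page failed.
        raise RuntimeError(describe_http_error(err.code, url)) from err
```

Including the URL in the message makes it easier to tell from a bug report whether the dashboard, the course page, or some other request was the one rejected.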
