Can this download actual media files? #216

Open
billbeans opened this issue Sep 21, 2024 · 1 comment

billbeans commented Sep 21, 2024

Maybe I'm a bit confused about what this software does, but can it actually grab a user's uploaded media (jpg, mp4) from their tweets and download them? I ran user_media on a profile, and I just got a bunch of stdout in my terminal. I saved that output to a text file and had a hell of a time grepping the links out of it to make wget work, and even then, it didn't grab all of the media from the profile I wanted scraped

vladkens (Owner) commented Oct 6, 2024

@billbeans user_media is an API call to Twitter that returns a list of media, that is, links to photos and videos. That is why you see so much output in the terminal.

twscrape has no built-in media download right now, because nobody has requested it before.

For now, you can download media with this simple script:

import asyncio
import os

import httpx

from twscrape import API


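async def download_file(client: httpx.AsyncClient, url: str, outdir: str):
    # Use the last path segment, without the query string, as the file name
    filename = url.split("/")[-1].split("?")[0]
    outpath = os.path.join(outdir, filename)

    async with client.stream("GET", url) as resp:
        if resp.status_code != 200:
            # Skip failed requests instead of saving an error body as media
            print(f"skip {url}: HTTP {resp.status_code}")
            return
        with open(outpath, "wb") as f:
            async for chunk in resp.aiter_bytes():
                f.write(chunk)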


async def load_user_media(api: API, user_id: int, outdir: str):
    os.makedirs(outdir, exist_ok=True)
    all_photos = []
    all_videos = []

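    async for doc in api.user_media(user_id):
        all_photos.extend([x.url for x in doc.media.photos])
        for video in doc.media.videos:
            if not video.variants:
                continue  # no downloadable variant for this video
            # Keep only the variant with the highest bitrate
            variant = max(video.variants, key=lambda x: x.bitrate)
            all_videos.append(variant.url)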

    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            *[download_file(client, url, outdir) for url in all_photos],
            *[download_file(client, url, outdir) for url in all_videos],
        )


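async def main():
    # Assumes accounts were already added to the default accounts.db
    api = API()
    # 2244994945 is an example user id; "output" is the target directory
    await load_user_media(api, 2244994945, "output")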


if __name__ == "__main__":
    asyncio.run(main())
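
The script above starts every download at once through asyncio.gather, which may be too aggressive for a profile with a lot of media. A minimal sketch of one way to cap the number of simultaneous downloads follows. The helper name gather_limited and the default limit of 8 are assumptions made for illustration; they are not part of twscrape or httpx.

```python
import asyncio


async def gather_limited(coros, limit: int = 8):
    """Run coroutines concurrently, but at most `limit` at a time.

    Hypothetical helper, not part of twscrape or httpx.
    """
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        # Each coroutine waits here until a slot is free
        async with sem:
            return await coro

    # return_exceptions=True returns failures as results (in input order)
    # instead of raising on the first failed coroutine
    return await asyncio.gather(*(run(c) for c in coros), return_exceptions=True)
```

To use it in load_user_media, the asyncio.gather(...) call could be replaced with something like await gather_limited([download_file(client, url, outdir) for url in all_photos + all_videos]). Any entry in the returned list that is an exception marks a download that failed.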
