A GPT-2 trainer using Twitter posts
- Clone the repo.
- Run the following command to download all dependencies:

```
pip3 install pyyaml python-twitter pandas gpt-2-simple
```
- You need to create a file called `/twitter/settings.yaml` with the following information:

```yaml
twitter_consumer_key: your twitter consumer key
twitter_consumer_secret: your twitter consumer secret
twitter_access_token_key: your twitter access token
twitter_access_token_secret: your twitter token secret
handles:
  - KeetPotato
  - david8hughes
  - Shen_the_Bird
```
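For reference, here is a minimal sketch of how the script might load this configuration with PyYAML; the repo's actual loading code may differ.

```python
# Hypothetical example: read settings.yaml with PyYAML.
import yaml

with open("settings.yaml") as f:
    settings = yaml.safe_load(f)

consumer_key = settings["twitter_consumer_key"]
handles = settings["handles"]  # e.g. ["KeetPotato", "david8hughes", "Shen_the_Bird"]
```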
- Please take a look at the Twitter API page for information on obtaining keys.
- Please check the function `remove_unwated` to review the current filters, or to add any new filters you need for your tweets (see the sketch after this list).
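As a rough illustration only, a tweet-cleaning filter along these lines might look like the sketch below; the actual filters in `remove_unwated` may differ.

```python
import re

# Hypothetical sketch of a tweet filter; not the repo's actual implementation.
def remove_unwanted(tweets):
    """Keep only plain tweet text suitable for training."""
    cleaned = []
    for text in tweets:
        if text.startswith("RT @") or text.startswith("@"):
            continue  # skip retweets and replies
        text = re.sub(r"https?://\S+", "", text)  # strip links
        text = re.sub(r"[@#]\w+", "", text)       # strip mentions and hashtags
        text = " ".join(text.split())             # normalize whitespace
        if text:
            cleaned.append(text)
    return cleaned
```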
- To create a tweet dataset, populate `settings.yaml` with the handles you want to fetch the tweets for.
- Run `twitter/tgpt_twitter.py` with the command below (a sketch of the fetching step follows this list):

```
python3 tgpt_twitter.py --csv_file_name=my_file_name
```

- Once done, the script will save the dataset as a `my_file_name.csv` file in the `./csv` folder.
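For context, fetching a handle's timeline with the python-twitter package typically looks like the following sketch; the script's actual structure may differ.

```python
import twitter  # the python-twitter package

# Credentials below are placeholders; in the repo they come from settings.yaml.
api = twitter.Api(
    consumer_key="...",
    consumer_secret="...",
    access_token_key="...",
    access_token_secret="...",
    tweet_mode="extended",  # return full, untruncated tweet text
)

# Fetch up to 200 of a handle's most recent tweets.
statuses = api.GetUserTimeline(screen_name="KeetPotato", count=200)
texts = [s.full_text for s in statuses]
```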
- To start training, run the file `./gpt/train.py` with the command below (a sketch of the underlying gpt-2-simple calls follows this list):

```
python3 train.py --model_name=124M --csv_file=../csv/my_file_name.csv --steps=1000 --run_name=myrun
```

- If the model has not been downloaded yet, the script will download it and save it in the `./gpt/models` directory, unless changed with the `--models_dir` parameter.
- The training results will be saved in the `./gpt/checkpoint` directory, unless changed with the `--checkpoint_dir` parameter.
- To see all the possible parameters for training, run `python3 train.py -h`.
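Presumably `train.py` wraps the gpt-2-simple finetuning API; here is a minimal sketch of that flow, assuming the standard `gpt_2_simple` calls:

```python
import os
import gpt_2_simple as gpt2

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
    gpt2.download_gpt2(model_name=model_name)  # fetch the pretrained weights

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="../csv/my_file_name.csv",  # gpt-2-simple accepts CSV datasets
    model_name=model_name,
    steps=1000,
    run_name="myrun",
)
```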
- Once done training, we can generate text with our model by running `./gpt/generate.py` (a sketch of the underlying calls follows this list):

```
python3 generate.py --run_name=myrun --model_name=124M
```

- If you want to save the generated text to a file, use the `--destination_path` parameter.
- To see all the possible parameters for generation, run `python3 generate.py -h`.
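Again assuming `generate.py` wraps gpt-2-simple, the generation step likely resembles this sketch:

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="myrun")  # load the finetuned checkpoint

gpt2.generate(
    sess,
    run_name="myrun",
    prefix="<|startoftext|>",  # token that starts each training sample
    truncate="<|endoftext|>",  # cut each sample at the end token
    include_prefix=False,
    nsamples=5,
)
```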
The dataset was created using the following tech news accounts:
- observer
- mashable
- TechCrunch
- thenextweb
- WIRED
- verge
- DigitalTrends
- arstechnica
- CNET
- androidcentral
- engadget
- ForbesTech
- Gizmodo
- BBCTech
- cnntech
- HuffPostTech
- guardiantech
- WiredUK
- techreview
- WIREDScience
- gadgetlab
- Recode
- Techmeme
- slashdot
- WSJTech
- technology
- fttechnews
- ZDNet
- ReutersTech
- usatodaytech
Download the dataset here
The pretrained 124M model was finetuned on the dataset using an NVIDIA Tesla V100 GPU.
The trained model zip can be found here
The command used to generate the text:

```
python3 generator.py --run_name=tech124M --model_name=124M --return_as_list=True --truncate="<|endoftext|>" --prefix="<|startoftext|>" --nsamples=10 --batch_size=10 --include_prefix=False --temperature=1.6
```
Trained for 60,000 steps with an average loss of 0.08. Sample outputs:
Mathematicians have been searching, but the answer lies in physics
Former LEGO designer Ryan C Smith is creating some select pieces for mix-and-match amputees
Weed edibles aren’t as green
Swami Releases Sunny Mar setThanks to Hong Kong movements
successfully started device jailbreaking, raises US public profile
Oakland must Faces $25 Million Class-Action Lawsuit Over Police Trespassing Face-Collection
project involve suing writers before they turn over #oncology