GPT2 – Reddit title generator – IML FINAL – Andrew Huang

Project: GPT2 model trained on Reddit titles – Andrew Huang

Motivation

After working primarily with image data this semester, I thought it would be interesting to work with text data for my final project. Initially I wanted to present my thesis on vocal sentiment analysis, but I was not sure IMA policy would allow it, and its contents may be outside the scope of this course: less in the spirit of IMA and more in the spirit of a CS course. After seeing a post on Hackernoon about GPT2 and how good it is at generating coherent titles from its own site, I thought about training my own model so that I could generate my own titles. Because it is near graduation time, I have been spending a lot of my time on r/cscareerquestions looking at my (not so) hopeful future. Most of the threads share a common theme, and I wondered: what if a machine learning system could generalize from them and help me write my own titles? I have also tried a model called char-rnn in the past, so I have seen language models do decently at text generation. After reading online about how good OpenAI's GPT-2 model is at generating text with strong context awareness, I decided to train my own version to generate computer-science-oriented titles.

Process

To start, I extracted several thousand subreddit titles using a tool called praw. I did not have much trouble getting this code working, and the API was very clear to use.
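
To give a sense of the scraping step, here is a minimal sketch assuming a script-type Reddit app; the credentials, subreddit choice, and output file name are placeholders rather than my exact code.

```python
import praw

# Placeholder credentials for a "script"-type Reddit app.
reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="title-scraper by u/your_username")

subreddit = reddit.subreddit("cscareerquestions")
titles = set()

# Pull up to 1000 posts from each listing; Reddit caps listings at about 1000 items.
for listing in (subreddit.top, subreddit.hot, subreddit.controversial):
    for submission in listing(limit=1000):
        titles.add(submission.title)

# Save one title per line, the plain-text format the fine-tuning step expects.
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(titles)))
```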

With this I made a user application and was able to pull the top 1000 posts from each of the listing categories (top, controversial, etc.). Once I had a sufficient dataset, I initially loaded the fine-tuning code into a local Jupyter notebook. I then discovered that the original author of that code (Max Woolf) had published it as a Google Colab notebook. The compute performance in these notebook environments is very strong, and I would not have to deal with awful dependency issues across different systems (NYUSH HPC vs. the Intel cluster), so I decided to train there, where the speed would be much faster than on either the HPC or the Intel cluster. I trained on the 117M-parameter version of the model, as it would take much less time than the larger versions.

After 1000 training steps and about an hour and a half, the model was done. The notebooks Google offers come with an NVIDIA Tesla T4 GPU built for machine learning, so models trained there finish very quickly.
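
For reference, the fine-tuning itself boils down to a few calls. This is a minimal sketch assuming the gpt-2-simple package behind the Colab notebook; the file name and step count mirror what I described above but are not copied from my notebook.

```python
import gpt_2_simple as gpt2

# Download OpenAI's released 117M-parameter checkpoint.
gpt2.download_gpt2(model_name="117M")

# Fine-tune on the scraped titles (one per line). 1000 steps took roughly
# an hour and a half on the Colab GPU in my run.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              "titles.txt",
              model_name="117M",
              steps=1000,
              sample_every=200,   # print sample generations during training
              save_every=500)     # write checkpoints periodically

# Generate a handful of titles from the fine-tuned model.
gpt2.generate(sess, length=40, temperature=0.7, nsamples=5)
```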

Results

The results from my experiments were decent. I may have needed to pull titles from a more diverse pool of subreddits, but because I wanted all of the generated titles to share one central theme, I did not explore other options. As with all machine learning tasks, there is always the issue of overfitting, and I believe I ran into it: I googled a lot of the titles generated by the model, and many of them were direct copies of titles in the original training set, which is an indication of overfitting. However, the model did a good job of producing coherent output, since none of the samples had cut-off or malformed words.
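
Instead of googling titles by hand, a small script could quantify this kind of memorization. This is a hypothetical check, not something I ran; the file names are placeholders.

```python
# Hypothetical memorization check: how many generated titles appear verbatim
# in the training set? File names are placeholders.
def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

training = set(load_lines("titles.txt"))
generated = load_lines("generated_titles.txt")

copied = [title for title in generated if title in training]
print(f"{len(copied)} of {len(generated)} generated titles are verbatim copies")
```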

Conclusion

The GPT2 Python APIs give users a high-level way to train an advanced language model on any dataset with pretty good results. In the future I would like to understand how GPT2 works at a deeper level. While working on my capstone project I looked into attention mechanisms and Transformer models, and I know GPT2 is a Transformer variant with a very large number of parameters; the implications of such a model are significant. I see this as a good starting point for building a chatbot when you have a large corpus of human-to-human interactions, for example in call centers, where humans commonly resolve tasks they find extremely repetitive. Those tasks could be automated away in favor of more productive work.

Perhaps my GPT2 model could also be trained on the bodies of the posts. I know GPT2 generation can be "seeded", so a lazy user could input their own title and have a relevant post generated from it, containing the key advice they may wish to know, instead of using the search bar, which may just do keyword matching and link to irrelevant information. If I had more time, I would definitely build a frontend and backend around these weights and let users enter their own prompts (consider it future work). In this use case, text generation could also serve as a kind of search engine. Overall, this project was a success for me and taught me a lot about IML, about how helpful interactive text generation can be for all sorts of use cases, and about how adaptive and robust these models can be.
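
As a sketch of that seeding idea, here is roughly what prompt-seeded generation looks like, again assuming gpt-2-simple; the prefix string is just an example prompt, not a real query.

```python
import gpt_2_simple as gpt2

# Load the fine-tuned checkpoint saved during training (default run name).
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="run1")

# Seed generation with a user-supplied title fragment; the model continues it.
samples = gpt2.generate(sess,
                        prefix="Should I accept a return offer from",
                        length=60,
                        temperature=0.7,
                        nsamples=3,
                        return_as_list=True)
for sample in samples:
    print(sample)
    print("---")
```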

For those who are interested, the code is attached below:

https://colab.research.google.com/drive/1SvnOatUCJAhVU1OcUB1N3su3CViL0iam

IML | Week12 – Training CycleGAN – Andrew Huang

Introduction

This week's assignment was to train CycleGAN. Since I had already explored this model during my midterm, a lot of it was not new to me. I decided to train on the Van Gogh-to-photo dataset.

Process

As expected, the training task was frustrating because of the walltime limit. I actually tried training this on the NYUSH HPC servers, but because of odd issues with insufficient space on the compute nodes I could not get the requirements installed, so I never got the chance to train the model on the GTX 1080s. Additionally, because my capstone model was also training, I could not get the compute quota I needed, which was very troubling. I also realized I did not save the images output during the training process, so I can only show images from inference.

Results

I did not train for many epochs because of the walltime limit, so the results are not good. There are baseline photos of this model online, which I will share.

Conclusion

For most models it is clearly much better to train on a GPU. I wanted to get better results on the NYU HPC, but I am still unsure why installing the requirements filled my disk quota; perhaps I will try again in the future if I have time.

Week11 – Final Topic Proposal

Final Project Proposal – GPT2 Model

Link to Proposal

For my final project I was not sure what I wanted to do, because I was split among three ideas for the proposal. First, I wanted to train the GPT2 model released by OpenAI, and then create an interface where a user feeds in a prompt and the model continues it with a paragraph or two about that content. This project is the most feasible and most achievable of the three by the due date, so I have decided to do it for the final. The second idea was GST-Tacotron, a Google model for training voice models that can learn a human voice and make it say whatever you want via text input. Its results are very impressive, but I found out that it is particularly difficult to train and requires around three hours of spoken audio from the target speaker. The third idea was to present my own senior thesis; since my thesis is very machine learning oriented and would be the culmination of a semester's worth of work, I thought it was worth presenting. However, I could not get permission from Professor Moon and Aven, and I most likely could not have finished that project by the deadline, so ultimately I abandoned the idea in favor of GPT2.

Tools

I will mostly be using Keras for training the model. I will most likely use the school's cluster or the Intel server for compute power, and I will be experimenting with a variety of models.

Datasets

I have not decided which dataset I want to use yet, but I will most likely build my own corpus instead of using one of the commonly reused ones, like the Shakespeare or movie-script corpora often used to train these models. Comments from forums or social media are particularly interesting to me, so I want to see if I can generate a realistic comment or discussion.

Goal

My main goal for this project is to create a way for users to train a realistic text generation model and accurately generate something from a prompt. I have always been interested in text generation, and after using earlier works like char-rnn, I was very excited to see GPT2 work so well. Perhaps by having the model feed itself prompts, as sketched below, it could generate enough novel content for an entire website. I hope to add more interaction in the spirit of IMA: for example, letting it start its own dialogue, comment on that, and then continue off those comments. It will be very interesting to see the kind of output it can make. At this point I am not fully aware of how GPT2 even works; I know it is a variant of the Transformer family of models built on attention, which is rather complicated. I hope making this project will give me further insight into text generation and machine learning models.
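
To make the self-prompting idea concrete, here is a rough sketch assuming a GPT2 checkpoint fine-tuned with gpt-2-simple; the seed prompt and loop length are arbitrary choices, not part of any working system yet.

```python
import gpt_2_simple as gpt2

# Assumes a GPT2 checkpoint already fine-tuned with gpt-2-simple.
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="run1")

# Start from one seed prompt, then keep feeding the model the tail of its own
# output so it "comments on" and continues its previous text.
prompt = "What do you think about remote work?"
for round_number in range(3):
    text = gpt2.generate(sess,
                         prefix=prompt,
                         length=120,
                         temperature=0.8,
                         return_as_list=True)[0]
    print(f"--- round {round_number} ---")
    print(text)
    prompt = text[-200:]   # reuse the last ~200 characters as the next prompt
```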

Week10 – DeepDream

Training deep dream

Author: Andrew Huang

Deep dream has always been fascinating to me, as the illustrations it creates resemble the visuals one sees on psychedelics such as LSD, but I never really knew how it worked. After reading some articles online, I found that the network slightly perturbs the image so that the activations of chosen layers increase; in other words, it runs gradient ascent on the image itself rather than gradient descent on the weights. I think this is similar to how GANs train, because both slightly perturb something (in the GAN case it is the latent vector). It is like doing back-propagation, but instead of changing the weights with respect to the loss, it changes the image itself and freezes the weights. I noticed that in all the deep dream images generated there appear to be spirals everywhere, and I am still not sure why these get generated. I am also unsure why the algorithm needs to work at different scales of the image and scale back up to compare with the original, so in that regard I need to study more to learn how the algorithm really works. That being said, the code worked smoothly for me: I simply ran conda activate with my midterm environment and scp'd over the pictures I wanted to transform. I transformed two city landscapes and one anime-style picture. The distortions I got from this model are not as good as the ones I see online; perhaps more iterations are needed to get the proper distortion. In particular, when a class in the image is present in the training set of the original model (ImageNet), the results are especially good, because I feel the network will disturb the image more to better fit the objective. Even though deep dream has a pretty limited use case and is relatively old now in terms of machine learning models and techniques, it is still interesting to see the algorithm work and how far we have come.
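
To make sure I understood the core idea, here is a minimal sketch of that gradient-ascent step using Keras' pretrained InceptionV3 as the frozen network; the layer names, step size, and loop count are illustrative choices, not the exact code I ran.

```python
import tensorflow as tf

# Frozen pretrained network: we only ever change the input image, never the weights.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
layer_outputs = [base.get_layer(name).output for name in ("mixed3", "mixed5")]
dream_model = tf.keras.Model(inputs=base.input, outputs=layer_outputs)

def dream_step(img, step_size=0.01):
    """One step of gradient ascent on the image itself."""
    img = tf.convert_to_tensor(img, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(img)
        activations = dream_model(img)
        # The objective is the mean activation of the chosen layers; we move
        # the image to make these activations larger, not smaller.
        objective = tf.add_n([tf.reduce_mean(act) for act in activations])
    grad = tape.gradient(objective, img)
    grad /= tf.math.reduce_std(grad) + 1e-8            # normalize the gradient
    return tf.clip_by_value(img + step_size * grad,    # ascend, keep pixels in range
                            -1.0, 1.0)

# Usage sketch: img should be a float32 tensor of shape (1, height, width, 3)
# scaled to [-1, 1], InceptionV3's expected input range. The full algorithm also
# repeats this loop at several image scales ("octaves"), which is the
# multi-scale step I was unsure about.
# for _ in range(100):
#     img = dream_step(img)
```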

Week09 – Training Style transfer model

Author: Andrew Huang

For this week's assignment we had to retrain the fast style transfer model so that we can use it with ml5.js. Training the model was pretty straightforward, as I mostly followed the instructions and results given by the professor. One thing I did have to do was run the weight-export Python function manually afterwards: because of the 24-hour walltime, the training script did not actually finish 2 epochs over the very large dataset. Luckily it checkpointed throughout, so I just called the export function afterwards to get the model saved. I did not really get the result I wanted, because I think I did not use a style image with enough contrast. I used a picture of the desert with very little contrast and minimal sky in the background.

The desert picture I used

However, the results were not really to my satisfaction, though that was to be expected. The webcam image pretty much just looks like it has a yellow-and-white sand filter over it, not really a convincing example of style transfer; at least the texture seemed right, though, so in that regard the fast style transfer did work. Perhaps the model needed more than 24 hours of training. Compared to my midterm project, I think the results of this model are not as good as CycleGAN's, which transfers not just style but also content and context. Fast style transfer seems to transfer only the texture and color of the style photo and does not really understand the deeper meaning of objects in the photo. Also, this model trains off only one style image, while CycleGAN and other GAN architectures need an entire dataset to be effective. Training with only one photo is faster, but it does not accurately represent an entire domain.

Screen Recording 2019-04-13 at 5.58.23 AM

Conclusion

Next time, I will train a style transfer model on a style image with more contrast, patterns, and textures, because I find those results to be the most prominent and most visually pleasing. It really amazes me that fast style transfer can work so well in real time, while traditional style transfer methods require a very long optimization for each image and other prerequisites.