What does the BooksCorpus dataset look like?

Chris Staff asked 4 weeks ago
1 Answer
Chris Staff answered 4 weeks ago

Unfortunately, the BooksCorpus dataset is no longer distributed (according to https://github.com/soskek/bookcorpus).
 
However, that GitHub repository provides scripts you can use to assemble the dataset yourself.
 
The dataset consists of the full text of books. According to Radford et al. (2018): “It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information.”
 
An excerpt found at https://twitter.com/theshawwn/status/1301852133319294976:

April Johnson had been crammed inside an apartment in San Francisco for two years, as the owners of the building refurbished it, where they took a large three story prewar home and turned it into units small enough where she felt a dog’s kennel felt larger than where she was living and it would be a step up. And with the walls so thin, all she could do was listen to the latest developments of her new neighbors. Their latest and only developments were the sex they appeared to be having late at night on the sofa, on the kitchen table, on the floor, and in the shower. But tonight the recent development occurred in the bed. If she had her way she would have preferred that they didn’t use the bed for sex because for some reason it was next to the paper thin wall which separated her apartment from theirs.
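
If you assemble the corpus yourself, a quick way to see what you ended up with is to count the books and print the opening of one of them. The snippet below is only a minimal sketch: it assumes the output is a folder with one UTF-8 plain-text `.txt` file per book, and both the function name `summarize_corpus` and the folder name `out_txts` are hypothetical placeholders, not something taken from the repository.

```python
from pathlib import Path


def summarize_corpus(corpus_dir, preview_chars=300):
    """Return (number of books, total word count, preview of the first book)."""
    # Assumption: one UTF-8 plain-text file per book, with a .txt extension.
    book_paths = sorted(Path(corpus_dir).glob("*.txt"))
    total_words = 0
    preview = ""
    for index, path in enumerate(book_paths):
        # Replace undecodable bytes instead of failing on a badly encoded file.
        text = path.read_text(encoding="utf-8", errors="replace")
        # Rough word count: split on any whitespace.
        total_words += len(text.split())
        if index == 0:
            preview = text[:preview_chars]
    return len(book_paths), total_words, preview


if __name__ == "__main__":
    # "out_txts" is a hypothetical folder name; point this at your own output folder.
    n_books, n_words, preview = summarize_corpus("out_txts")
    print(f"{n_books} books, {n_words} whitespace-separated words")
    print(preview)
```

The preview comes from the first file in sorted order, so it should read like the excerpt above: a long, uninterrupted stretch of narrative text.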

 
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
