This is the code for our CVPR 2019 paper, *Good News, Everyone! Context Driven Entity-Aware Captioning for News Images*. Enjoy!
Model preview:
Huge thanks go to the New York Times API for providing such a service for FREE!
Thanks also to @ruotianluo for providing the captioning code.
## Dependencies/Requirements

```
pytorch==1.0.0
spacy==2.0.11
h5py==2.7.0
bs4==4.5.3
joblib==0.12.2
nltk==3.2.3
tqdm==4.19.5
urllib2==2.7
goose==1.0.25
urlparse
unidecode
```

## Introduction
We took the first steps to move captioning systems toward interpretation (see the paper for more detail). To this end, we used the New York Times API to retrieve the articles, images, and captions.
The structure of this repo is as follows:
1. Getting the data
2. Cleaning and formatting the data
3. How to train models

## Get the Data
You have three options to get the data.
### Images only
If you want to download only the images and directly start working on the same dataset as ours, download the cleaned version of the dataset without images, `article+caption.json`, and put it in the `data/` folder; then download `img_urls.json` and put it in the `get_data/get_images_only/` folder.
Then run
```
python get_images.py --num_thread 16
```
This will download the images. After that, move to the Clean and Format the Data section.
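For reference, here is a minimal Python 3 sketch of what a threaded downloader along the lines of `get_images.py` could look like; the real script may differ, and the assumption that `img_urls.json` maps image IDs to URLs is mine:

```python
# Hypothetical sketch of a parallel image downloader (not the repo's code).
# Assumes img_urls.json maps image IDs to image URLs.
import json
import os
import urllib.request

from joblib import Parallel, delayed

OUT_DIR = "../../data/images"  # hypothetical output location

def download(img_id, url):
    path = os.path.join(OUT_DIR, img_id + ".jpg")
    if os.path.exists(path):   # skip images we already have
        return
    try:
        urllib.request.urlretrieve(url, path)
    except Exception as e:     # some URLs are dead or broken (see the PS below)
        print("failed:", img_id, e)

if __name__ == "__main__":
    with open("img_urls.json") as f:
        urls = json.load(f)
    os.makedirs(OUT_DIR, exist_ok=True)
    Parallel(n_jobs=16)(delayed(download)(i, u) for i, u in urls.items())
```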
PS: I have received numerous emails about images that are missing or broken in `img_urls.json`, which is why I decided to put the images on a drive for download, in the name of open science: Download all images
### Images + articles
If you would like to get the raw version of the articles and captions to do your own cleaning and processing, no worries! First download the `article_urls`, then go to the `get_data/with_article_urls/` folder and run
```
python get_data_with_urls.py --num_thread 16
python combine_dataset.py
```
This will get you the raw version of the captions, the articles, and the images. After that, move to the Clean and Format the Data section.
### I want more!
As you know, the New York Times is huge; their articles go back to 1881 (it is crazy!) and run up to today. So in case you want to get ALL the data, or expand the data to more years, the first step is to go to the New York Times API page and get an API key. All you have to do is sign up for the key.
Once you have the key, go to the `get_data/with_api/` folder and run
```
python retrieve_all_urls.py --api-key XXXX --start_year XXX --end_year XXX
```
This retrieves the article URLs and saves them in month-year format. Once you have all the URLs from the API, run
```
python get_data_api.py
python combine_dataset.py
```
`get_data_api.py` retrieves the articles, captions, and images; `combine_dataset.py` combines the yearly data into one file after removing data points with corrupt images, empty articles, or empty captions. After that, move to the Clean and Format the Data section.
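For orientation, here is a hedged sketch of the kind of month-by-month retrieval `retrieve_all_urls.py` performs; the NYT Archive API endpoint and response fields used here are my assumptions, so check the script for the real calls:

```python
# Sketch (not the repo's code): pull article URLs month by month from the
# NYT Archive API and save them as month-year files. Endpoint and response
# structure are assumptions.
import json
import urllib.request

def fetch_month(api_key, year, month):
    url = ("https://api.nytimes.com/svc/archive/v1/"
           f"{year}/{month}.json?api-key={api_key}")
    with urllib.request.urlopen(url) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [d["web_url"] for d in docs]

def retrieve_all_urls(api_key, start_year, end_year):
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            with open(f"{month}-{year}.json", "w") as f:  # month-year naming
                json.dump(fetch_month(api_key, year, month), f)
```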
### Small note
I also provide the data splits (train, val, test) for the images. Even though I always use a fixed random seed to decide the split, just in case the GODS meddle with the random seed, here is the link to a JSON where you can find each image and its split: `img_splits.json`
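If you need the splits programmatically, a minimal sketch (assuming the JSON simply maps each image ID to "train"/"val"/"test"):

```python
# Sketch: group image IDs by their split; the exact JSON layout is assumed.
import json

with open("img_splits.json") as f:
    splits = json.load(f)

test_ids = [img_id for img_id, split in splits.items() if split == "test"]
print(len(test_ids), "images in the test split")
```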
## Clean and Format the Data
Now that we have the data, it is time to clean, preprocess, and format it.
### Preprocess
When you reach this part, you must have `captioning_dataset.json` in your `data/` folder. You can also download `captioning_dataset.json`, as well as `news_dataset.json`.
#### Captions
This part cleans the captions (tokenizing, removing non-ASCII characters, etc.), splits them into train, val, and test, and creates anonymized captions.
In other words, we change the caption "Albert Einstein taught in Princeton in 1926" to "PERSON_ taught in ORGANIZATION_ in DATE_". Move to the `preprocess/` folder and run
```
python clean_captions.py
```
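To make the anonymization step concrete, here is a minimal sketch using spaCy's NER; `clean_captions.py` itself may tokenize and map entity labels differently (for instance, spaCy tags organizations as `ORG` rather than `ORGANIZATION`):

```python
# Sketch: replace named entities in a caption with TYPE_ placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize(caption):
    doc = nlp(caption)
    out = caption
    # Replace entities from the end so character offsets stay valid.
    for ent in reversed(doc.ents):
        out = out[:ent.start_char] + ent.label_ + "_" + out[ent.end_char:]
    return out

# e.g. "Albert Einstein taught in Princeton in 1926"
#   -> something like "PERSON_ taught in ORG_ in DATE_"
print(anonymize("Albert Einstein taught in Princeton in 1926"))
```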
#### Resize Images
To resize the images to 256x256:
```
python resize.py --root XXXX --img_size 256
```
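Under the hood this amounts to something like the following Pillow sketch (`resize.py` additionally handles the command-line flags; treat the paths and details as placeholders):

```python
# Sketch: resize every image under `root` to img_size x img_size with Pillow.
import os
from PIL import Image

def resize_all(root, out_dir, img_size=256):
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(root):
        try:
            img = Image.open(os.path.join(root, name)).convert("RGB")
            img.resize((img_size, img_size), Image.LANCZOS).save(
                os.path.join(out_dir, name))
        except OSError:  # skip corrupt or non-image files
            pass
```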
#### Articles
Get the article format needed for the encoding methods by running `create_article_set.py`:
```
python create_article_set.py
```
### Format
Now, to create the H5 files for the captions, images, and articles, go to the `scripts/` folder and run, in order:
```
python prepro_labels.py --max_length 31 --word_count_threshold 4
python prepro_images.py
```
We proposed three different article encoding methods. You can download the encoded articles for each method: `articles_full_avg_`, `articles_full_wavg`, `articles_full_TBB`.
Or you can use the code to obtain them:
```
python prepro_articles_avg.py
python prepro_articles_wavg.py
python prepro_articles_tbb.py
```
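To give a feel for the simplest variant, the average encoding boils down to mean-pooling word vectors over the article, as in the sketch below; the embeddings used here are an assumption, and the weighted-average and TBB variants differ as described in the paper:

```python
# Sketch: encode an article as the average of its word vectors.
# spaCy vectors are used for brevity; the repo's scripts may use others.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # a model that ships with word vectors

def encode_article_avg(text):
    doc = nlp(text)
    vecs = [t.vector for t in doc if t.has_vector and not t.is_punct]
    if not vecs:
        return np.zeros(nlp.vocab.vectors_length)
    return np.mean(vecs, axis=0)

print(encode_article_avg("The mayor opened the new bridge on Monday.").shape)
```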
## Train
Finally, we are ready to train. The magic words are:
```
python train.py --cnn_weight [YOUR HOME DIRECTORY]/.torch/resnet152-b121ed2d.pth
```
You can check `opt.py` to change many of the options, such as dimension sizes, different models, hyperparameters, etc.
## Evaluate
After you train your models, you can score them with the commonly used metrics: BLEU, CIDEr, SPICE, ROUGE, and METEOR. Be sure to specify `model_path`, `cnn_model_path`, `infos_path`, and `sen_embed_path` when running `eval.py`. `eval.py` is normally used during training, but you also need to run it to get the template captions for insertion.
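For example (an illustrative invocation; the paths are placeholders and the exact flag spellings should be checked against `eval.py`):

```
python eval.py --model_path [MODEL_PATH] --cnn_model_path [CNN_MODEL_PATH] \
               --infos_path [INFOS_PATH] --sen_embed_path [SEN_EMBED_PATH]
```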
## Insertion
Last but not least, `insert.py`. After you run `eval.py`, it produces a JSON file with the IDs and their template captions. To fill in the correct named entities, run `insert.py`:
```
python insert.py --output [XXX] --dump [True/False] --insertion_method ['ctx', 'att', 'rand']
```
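As a toy illustration of the simplest strategy, `rand` fills each placeholder with a random article entity of the matching type, while `ctx` and `att` rank the candidates by context and attention instead; the data structures below are hypothetical:

```python
# Sketch of the 'rand' strategy: fill TYPE_ placeholders in a template
# caption with article entities of the same type. Toy data structures.
import random
import re

def insert_random(template, article_entities):
    def fill(match):
        candidates = article_entities.get(match.group(1), [])  # e.g. "PERSON"
        return random.choice(candidates) if candidates else match.group(0)
    return re.sub(r"([A-Z]+)_", fill, template)

entities = {"PERSON": ["Albert Einstein"],
            "ORGANIZATION": ["Princeton"],
            "DATE": ["1926"]}
print(insert_random("PERSON_ taught in ORGANIZATION_ in DATE_", entities))
```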
PS: I have been asked to provide the models' output, so I thought it would be best to share it with everyone.

## Model Output
In this folder, you have:

- `test.json`: the test set with the raw and template versions of the captions.
- `article.json`: the article sentences, which are needed by `insert.py`.
- `w/o article` folder: the outputs of all models on template captions, without articles.
- `with article` folder: the outputs of our models from the paper, with sentence attention (`sen_att`) and image attention (`vis_att`), provided in the JSON. Hope this is helpful to many of you.
## Conclusion
Thank you, and sorry for the bugs!