A Guide to Querying the New York Times API with Python
A versatile function to make searching simple and scalable
A few months ago I worked on a project that combined information from the Yelp API with various data sources from NYC Open Data and the New York Department of Health to analyze the factors that helped restaurants survive the dramatic drop in business during the COVID-19 pandemic.
At the time I was able to find a list of around 200 restaurants in NYC that had closed since March, but because I was relying on sites like Eater and Grubhub to report these closures, plenty of businesses were slipping through the cracks. If you live in Brooklyn or Queens, it shouldn’t be surprising that your local Crown Fried Chicken or bodega-style restaurant might close, abruptly change names, or move across the street, and these changes are incredibly difficult to track. For the most part, food journalists just don’t track them.
Knowing that the data I was working with was biased toward ‘news-worthy’ restaurants (big openings from restaurant groups debuting unique concepts, for the most part), I decided to reframe my thinking a bit, work within the data I had, and consider what effect press coverage had on restaurants over the long term.
There’s a perception in the food industry that a positive review from the New York Times can make or break a restaurant, but over the years I’ve seen plenty of places like Pearl and Ash or Alder open to acclaim only to shutter a short time later.
Of course, anecdotes like that aren’t particularly helpful for understanding the big-picture impact of a Times review on a restaurant, so I decided to pull all of the Pete Wells reviews from the Times (Pete has been the food critic for the Times for about 10 years now) and compare each restaurant’s longevity against its star rating. To do that I’d query the Times API first to find the names and ratings of the restaurants that had been reviewed, then make second calls to find closure dates, scraping the web for additional information or using the Yelp API to confirm closures where necessary.
To access the New York Times API you need a key, which you can register for at the NYT Developer portal (developer.nytimes.com). From there, it’s easy to make your first query using a simple base URL along with the requests and json libraries.
import requests
import json
import time

api_key = 'your api key here'      # replace with your own key
your_query = 'your search terms'

url = f'https://api.nytimes.com/svc/search/v2/articlesearch.json?q={your_query}&api-key={api_key}'
query = requests.get(url)
data = query.json()
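The response comes back as nested JSON: the articles themselves live under ['response']['docs'], alongside metadata like the total hit count. As a quick sketch, here’s how you might pull a few of the standard fields (headline, web_url, pub_date) out of a result:

# The articles live under ['response']['docs']; ['response']['meta'] holds metadata
docs = data['response']['docs']
print(data['response']['meta']['hits'])  # total number of matching articles

for doc in docs[:3]:
    print(doc['headline']['main'], doc['web_url'], doc['pub_date'])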
Of course, if you only wanted to search for a single term, you could simply use the search function on the Times website and gather the data manually.
Quite a few additional terms can be added to a query; the fields are documented in some depth on the Developer site. In the interest of gathering information quickly and automatically, though, I wrote a base function to hold the keyword arguments and then used it for larger sets of queries.
The basic anatomy of a Times Article Search API call is a query, filters, facets, and the API key. The query is your search term. Filters remove irrelevant results by restricting the search to geographic locations, news desks, or types of material (i.e. ‘Article’, ‘Video’, ‘Review’, etc.), or by limiting the date range. Facets return counts of results within a given category, i.e. the most common days of the week or news desks reporting on a specific search term.
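To make that concrete, here’s a rough sketch of a single hand-built call combining those pieces. The parameter names (q, fq, begin_date, end_date, facet, facet_fields, facet_filter) come from the Article Search documentation; the specific filter and facet values are just illustrative, and I’m letting requests handle the URL encoding via its params argument.

# Hypothetical query: restaurant articles from the Food desk about New York City,
# limited to 2012-2020, with a facet count of results by day of the week
params = {
    'q': 'restaurant',
    'fq': 'news_desk:("Food") AND glocations:("New York City")',
    'begin_date': '20120101',
    'end_date': '20201231',
    'facet': 'true',
    'facet_fields': 'day_of_week',
    'facet_filter': 'true',
    'api-key': api_key,
}
response = requests.get('https://api.nytimes.com/svc/search/v2/articlesearch.json', params=params)
data = response.json()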
The function below doesn’t take in facet parameters, but you could easily edit it to include them. It takes in the other parameters (listed in fuller detail on the Developer site mentioned above) and returns the first 10 entries. It’s a good base function to use as a helper in more specific queries.
def nytimes_query(api_key, query, glocations=None, headline=None, news_desk=None,
                  organizations=None, persons=None, byline=None, subject=None,
                  news_type=None, type_of_material=None, begin_date=None,
                  end_date=None, n_page=0):
    # Set the base url for the query
    base_url = f'https://api.nytimes.com/svc/search/v2/articlesearch.json?q={query}'

    # Collect whichever filter parameters were passed in
    filter_queries = {}
    if glocations: filter_queries['glocations'] = glocations
    if headline: filter_queries['headline'] = headline
    if news_desk: filter_queries['news_desk'] = news_desk
    if organizations: filter_queries['organizations'] = organizations
    if persons: filter_queries['persons'] = persons
    if subject: filter_queries['subject'] = subject
    if byline: filter_queries['byline'] = byline
    if news_type: filter_queries['news_type'] = news_type
    if type_of_material: filter_queries['type_of_material'] = type_of_material

    # Collect whichever date parameters were passed in
    dates = {}
    if begin_date: dates['begin_date'] = begin_date
    if end_date: dates['end_date'] = end_date

    # If filters are present, build the fq parameter, joining multiple filters with AND
    if filter_queries:
        fq = ' AND '.join(f'{field}:({value})' for field, value in filter_queries.items())
        base_url += f'&fq={fq}'

    # Add begin_date and/or end_date if present
    for field, value in dates.items():
        base_url += f'&{field}={value}'

    # Concatenate the page number and api key and make the request
    base_url += f'&page={n_page}'
    base_url += f'&api-key={api_key}'
    r = requests.get(base_url)

    # Returns a truncated JSON indexed past the metadata.
    # If you want the full JSON, simply remove ['response']['docs'] from the return line.
    return r.json()['response']['docs']
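As a usage sketch, a call along these lines returns the first ten matching documents; the filter values here (a byline and material type aimed at Pete Wells’s reviews) are just an example of how the keyword arguments combine, not the only way to build this query.

# Hypothetical example: the first page of Pete Wells restaurant reviews
docs = nytimes_query(
    api_key,
    query='restaurant',
    byline='Pete Wells',
    type_of_material='Review',
    n_page=0,
)

for doc in docs:
    print(doc['pub_date'][:10], doc['web_url'])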
With the keyword arguments defined, the subsequent functions can be much simpler. The next function grabs the URL for each review. Unfortunately, the star count doesn’t appear in the API data, so I parsed the name of the restaurant from the URL and used it later to scrape the star count.
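To show what that parsing looks like on a single, made-up review URL before it gets wrapped into the function below:

# Hypothetical review URL, just to illustrate the string manipulation
sample_url = 'https://www.nytimes.com/2016/10/26/dining/le-coucou-review.html'

# Keep everything after 'dining/', drop the '.html' suffix, swap hyphens for
# spaces, and strip any 'reviews/' prefix to approximate the restaurant name
name = sample_url.split('dining/')[1][:-5].replace('-', ' ').replace('reviews/', '')
print(name)  # 'le coucou review'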
One thing to note about the Times API is that it has a limit of 10 queries per minute and 4,000 per day. To stay under that limit while still getting the data I needed, I added a 6-second sleep timer to the function, which keeps requests at the maximum of 10 queries per minute.
Each page returns 10 entries, up to a maximum of 100 pages, or 1,000 entries. I wrote this function with min and max page parameters to allow subqueries over ranges of pages, say, if you wanted to call in blocks of 20 to avoid going over your daily limit.
def review_url_names(api_key, query, n_pages_min, n_pages_max, **kwargs):
    # Empty lists for URLs and restaurant names
    urls = []
    names = []

    # Iterate through the specified range of pages (min and max
    # to allow multiple subqueries to build a dataset)
    for i in range(n_pages_min, n_pages_max):
        # Run the query helper function; it passes kwargs through to nytimes_query
        qref = nytimes_query(api_key, query, n_page=i, **kwargs)

        # Strip each URL down to the restaurant name and add it to the list
        names += [j['web_url'].split('dining/')[1][:-5].replace('-', ' ').replace('reviews/', '') for j in qref]

        # Append the full URLs to the list
        urls += [k['web_url'] for k in qref]

        # Wait 6 seconds before restarting the loop to respect the rate limit
        time.sleep(6)

    # Both lists are returned
    return urls, names
These functions allowed me to start building a DataFrame with URLs, restaurant names, and publication dates. In the next step I’ll scrape the star counts, find closure dates, and perform the analysis.
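As a rough sketch of that step (the filter values are illustrative, and this assumes pandas is installed), the publication dates can be read back out of the standard /YYYY/MM/DD/ segment of each URL:

import pandas as pd

# Hypothetical usage: pull the first 20 pages (about 200 results) of
# Pete Wells reviews and assemble them into a DataFrame
urls, names = review_url_names(
    api_key,
    query='restaurant',
    n_pages_min=0,
    n_pages_max=20,
    byline='Pete Wells',
    type_of_material='Review',
)

reviews_df = pd.DataFrame({'name': names, 'url': urls})

# Recover publication dates from the /YYYY/MM/DD/ portion of each URL
# (assumes the standard article URL format shown earlier)
reviews_df['pub_date'] = reviews_df['url'].apply(
    lambda u: '-'.join(u.split('nytimes.com/')[1].split('/')[:3])
)
print(reviews_df.head())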
I hope you find this helpful as a quick start to querying the Times’ amazing database.