Bulk data downloads enable users to download scholarly and patent bulk data files via the API or UI. The API allows users to track the availability of new bulk data files from Lens and automate the process of downloading and keeping data up-to-date. To use the bulk data downloads API, you will need an access token, which can be generated in the Subscriptions tab of your Lens account.
Note: The Lens allows the same access token to be used across all API products.
File Format
Bulk data download files are structured in JSON Lines format: each line contains a separate JSON document representing an individual record. Files are compressed in .gz format. The data schema for records is the same as the Search API response schema. Sample bulk data files are available for download.
Using the API
The access token can be submitted using the token request parameter, or alternatively in the Authorization field of the HTTP request header (see Getting Started > API Access in the API documentation for details).
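Both auth styles can be captured in a small helper that prepares either the query parameter or the header; a sketch (the function name is illustrative, not part of the API):

```python
def auth_request_parts(token, use_header=True):
    """Return (params, headers) for a Lens API call, using one of the two auth styles."""
    if use_header:
        return {}, {"Authorization": "Bearer " + token}
    return {"token": token}, {}

# e.g. with requests:
# params, headers = auth_request_parts("YOUR-TOKEN")
# requests.get("https://api.lens.org/bulk/scholarly/release", params=params, headers=headers)
```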
Latest Data Release Endpoint
These endpoints return the latest bulk data file listing and associated metadata. Since bulk data files are released fortnightly, they can be used in a scheduler to automate the availability check for new files.
https://api.lens.org/bulk/scholarly/release
https://api.lens.org/bulk/patent/release
The fields and data schema used in the Release Endpoint are listed below.

| Field | Description | Example |
|---|---|---|
| bulkDataProduct | Product associated with the download file | SCHOLARLY, PATENT |
| dataVersion | Bulk data version (usually YEAR-WEEK format) | 202336 |
| records | Number of records in the file | 264329940 |
| md5sum | MD5 hash of the compressed file | de9acdacff6e099d5e6c7aa22d0eaeac |
| fileName | Name of the file (usually product-week-UUID.jsonl.gz) | scholarly-202336-0bd17e86-8bed-4c9b-a863-9f3928e4c71b.jsonl.gz |
| firstRecordId | LensId of the first record in the file | 043-486-543-492-05X |
| lastRecordId | LensId of the last record in the file | 145-007-429-245-212 |
| dateProcessed | Date the file was processed by Lens | 2023-09-16 |
| downloadAccessKey | Access key to download the file using the Download Endpoint | hN74L3upIqYNU4qK-sdU7-g1YwakLw1Hu3edL6wa7YP3cS... |
| size | Compressed size in bytes | 204149739824 |
| rawSize | Uncompressed size of the file in bytes | 716653744007 |
Example Request:
[GET] https://api.lens.org/bulk/patent/release?token={access_token}
Example Response:
{
"bulkDataProduct": "PATENT",
"dataVersion": "202338",
"records": 150757805,
"md5sum": "c1fe56f4dd327823f3be9294efc6b2a8",
"firstRecordId": "054-211-414-158-561",
"lastRecordId": "168-549-513-683-443",
"dateProcessed": "2023-09-29",
"downloadAccessKey": "5jJnJVadp-wAFweuaGN_XRMVinSNA5nP4TM1cOjpUUrIMvTaqb68...",
"size": 458139434490
}
Download Endpoint
Bulk data files can be downloaded using either the API or UI. Downloads via the API and UI are both subject to rate-limiting based on your subscription plan.
Download using API
To download bulk data files using the API, you will need the file download access key and your API access token. The download access key (downloadAccessKey) is available from the Release Endpoint listed above. The download API endpoint is:
https://api.lens.org/bulk/download/{downloadAccessKey}
The integrity of the downloaded file can be verified using the md5sum available from the Release Endpoint. Similarly, the number of records and first/last record LensId can be used to verify the file after extraction.
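The md5sum check can be done with a short Python sketch that streams the file in chunks, so even very large downloads fit in memory (the file name in the comment is a placeholder):

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Stream the file through MD5 in chunks; returns the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the md5sum field from the Release Endpoint:
# assert md5_of_file("downloaded-file.jsonl.gz") == release["md5sum"]
```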
Note: Take note of the uncompressed file size from the Release Endpoint to ensure you have enough disk space if you are extracting the file. Also, your HTTP client will need to be able to follow 302 redirects. For example:
wget 'https://api.lens.org/bulk/download/{downloadAccessKey}?token={access_token}' -O filename.gz
Where {downloadAccessKey} is from the Release Endpoint and {access_token} is your Lens subscription access token.
Example Scripts
Download Automation
The below Python script can be used to automate downloading the latest bulk data file to a location. The script lets you store the downloaded file information from the Latest Data Release Endpoint, periodically check for newer files, and download any new file to a specified location.
import requests
from datetime import datetime
import os
api_host = 'https://api.lens.org'
api_token = 'YOUR TOKEN'
output_location = os.getcwd()
class ReleaseFileInfo:
def __init__(self, product, data_version, num_records, md5sum, file_name, first_record_id, last_record_id,
date_processed, download_access_key, compressed_size, uncompressed_size):
self.product = product
self.data_version = data_version
self.num_records = num_records
self.md5sum = md5sum
self.file_name = file_name
self.first_record_id = first_record_id
self.last_record_id = last_record_id
self.date_processed = date_processed
self.download_access_key = download_access_key
self.compressed_size = compressed_size
self.uncompressed_size = uncompressed_size
def __get_current_release(data_type) -> ReleaseFileInfo:
release_url = api_host + '/bulk/%s/release' % data_type
print(release_url)
headers = {'Authorization': 'Bearer ' + api_token, 'Content-Type': 'application/json'}
release = requests.get(release_url, headers=headers).json()
return ReleaseFileInfo(release['bulkDataProduct'], release['dataVersion'], release['records'], release['md5sum'],
release['fileName'], release['firstRecordId'], release['lastRecordId'],
datetime.strptime(release['dateProcessed'], '%Y-%m-%d'), release['downloadAccessKey'],
release['size'], release['rawSize'])
# Check if the provided download has been already processed
def __download_already_exists(release_info):
raise NotImplementedError('Implement release update file check')
# Persist the latest parse info in your store. Use the same to check against `__download_already_exists`
def __persist_latest_parse_info(release_info):
raise NotImplementedError('Implement persistence of downloaded file info')
# Download and check the file integrity
def __download_file(release_info):
download_url = api_host + '/bulk/download/' + release_info.download_access_key
headers = {'Authorization': 'Bearer ' + api_token}
output_filename = output_location + '/' + release_info.product + '/' + release_info.data_version + '/' + release_info.file_name
os.makedirs(os.path.dirname(output_filename), exist_ok=True)
with requests.get(download_url, headers=headers, stream=True) as r:
if r.status_code == requests.codes.too_many_requests:
print('Download is rate limited. Please check your usage')
return
downloaded = 0
with open(output_filename, 'wb') as f:
for chunk in r.iter_content(chunk_size=1024*8):
f.write(chunk)
downloaded += len(chunk)
if downloaded != release_info.compressed_size:
raise ValueError('Mismatched download file size')
print('Finished downloading file: %s' % output_filename)
def start_updator(data_type):
release_info = __get_current_release(data_type)
if __download_already_exists(release_info):
print('This release has been already processed: %s(%s)' % (release_info.product, release_info.data_version))
else:
__download_file(release_info)
__persist_latest_parse_info(release_info)
# Usage
# start_updator('scholarly')
# start_updator('patent')
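The two NotImplementedError hooks in the script above can be filled in with a minimal local JSON store. This is one possible sketch, not the only approach; the STATE_FILE path and the stored fields are assumptions, and the functions take the ReleaseFileInfo object defined above:

```python
import json
import os

STATE_FILE = 'bulk_state.json'  # hypothetical local store path

def download_already_exists(release_info, state_file=STATE_FILE):
    # A release is identified by product + dataVersion
    if not os.path.exists(state_file):
        return False
    with open(state_file) as f:
        state = json.load(f)
    return '%s-%s' % (release_info.product, release_info.data_version) in state

def persist_latest_parse_info(release_info, state_file=STATE_FILE):
    state = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)
    key = '%s-%s' % (release_info.product, release_info.data_version)
    state[key] = {'file_name': release_info.file_name, 'md5sum': release_info.md5sum}
    with open(state_file, 'w') as f:
        json.dump(state, f)
```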
Data Import/Processing
This example Python script utilises a single producer with multiple consumers to make the data import process more performant. It is a simple processor implementation with a single producer that reads the downloaded file and three consumers that process the records in parallel. You can tune performance by increasing the number of consumer threads and the size of the queue.
import gzip
import json
import queue
from threading import Thread
from queue import Empty
from time import sleep
def consume(data_queue, consumer_id):
while True:
try:
record = data_queue.get(block=False)
except Empty:
# wait for data to be available
sleep(0.5)
continue
if record is None:
break
# Do something with the data
print('%s: processing record > %s' % (consumer_id, json.loads(record)['lens_id']))
print('Completed: ' + consumer_id)
def produce(location, data_queue, consumers_count):
with gzip.open(location, 'rt') as f:
for record in f:
data_queue.put(record)
# poison pill to stop the consumers
for i in range(consumers_count):
data_queue.put(None)
num_consumers = 3
file_location = 'scholarly-202346-6f64a086-94a7-47bc-b21d-5cec945e3705.jsonl.gz'
record_queue = queue.Queue(maxsize=100)
# Start multiple consumers
consumers = [Thread(target=consume, args=(record_queue, 'consumer-%s' % n,)) for n in range(num_consumers)]
for consumer in consumers:
consumer.start()
# Single producer to read the file and push record into the queue
producer = Thread(target=produce, args=(file_location, record_queue, len(consumers),))
producer.start()
# Finish the processing pipeline
producer.join()
for consumer in consumers:
consumer.join()
Batch Download and Filtering
The below example script uses shell utilities to download the bulk data file, split it on the fly into smaller chunks of approximately 1 gigabyte each, and compress them using zstd. You can add a line filter anywhere before the split (using a regexp, grep or a Python script) to filter out whatever you need, reducing the local storage requirements. This assumes you're using a device/environment where GNU split is available.
curl -L {download-url-with-token} | gunzip | split --line-bytes=1G --numeric-suffixes --suffix-length=6 --filter='zstd -5 > $FILE.jsonl.zst' - blk-
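For example, a grep filter can be slotted in between gunzip and split to keep only records containing a given string; the filter expression here is a hypothetical illustration, adjust it to your needs:

```shell
curl -L '{download-url-with-token}' | gunzip \
  | grep -F '"publication_type"' \
  | split --line-bytes=1G --numeric-suffixes --suffix-length=6 \
      --filter='zstd -5 > $FILE.jsonl.zst' - blk-
```

Because grep works line by line and each line is a complete JSON record, the surviving chunks remain valid JSON Lines files.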
Credit: National Science Foundation | Dawid Weiss
Download using UI
Bulk data files can also be downloaded manually from the Subscriptions tab in your Lens account. Active
bulk data plans include details of the latest bulk data file and a Download Bulk Data
button for downloading the latest bulk data file.

Usage Endpoint
Please Note: Bulk data downloads via the Download API endpoint and UI are rate-limited. The number of allowed and remaining download requests per file can be viewed from the subscription usage endpoints. When downloading bulk data, you will receive status code 429 if the number of download requests has been exceeded. The usage endpoints are:
https://api.lens.org/subscriptions/patent_bulk/usage
https://api.lens.org/subscriptions/scholarly_bulk/usage
Example response:
[
{
"remaining": 3,
"allowed": 3,
"frequency": "1 MONTH",
"type": "RESOURCE",
"resetDate": "2023-12-27T04:41:33.701Z",
"note": "The allowed quota is applicable to unique resource access."
}
]
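A small helper can decide from this response whether another download attempt is allowed; a sketch, assuming the response shape shown above (the endpoint returns a list, which may contain multiple usage entries):

```python
def can_download(usage_entries):
    """True if every usage entry in the response still reports remaining quota."""
    return all(entry["remaining"] > 0 for entry in usage_entries)

usage = [{"remaining": 3, "allowed": 3, "frequency": "1 MONTH", "type": "RESOURCE"}]
print(can_download(usage))  # True
```

Checking this endpoint before calling the Download Endpoint avoids burning a request only to receive a 429 response.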