
Bulk Data Downloads

Bulk data downloads enable users to download scholarly and patent bulk data files via the API or UI. The API allows users to track the availability of new bulk data files from Lens and automate the process of downloading and keeping data up-to-date. To use the bulk data downloads API, you will need an access token, which can be generated in the Subscriptions tab of your Lens account.

Note: The Lens allows the same access token to be used across all API products.

File Format

Bulk data download files are structured in JSON Lines format, where each line contains a separate JSON document representing an individual record. Files are compressed in .gz format. The data schema for records is the same as the search API response schema. Sample bulk data files are also available for download.
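Because each line is a self-contained JSON document, a compressed bulk data file can be processed line by line without loading it all into memory. A minimal sketch in Python (the helper name iter_records is illustrative):

```python
import gzip
import json

def iter_records(path):
    """Yield one parsed record per line of a .jsonl.gz bulk data file."""
    with gzip.open(path, 'rt') as f:
        for line in f:
            yield json.loads(line)
```

For example, iter_records('scholarly-202336-0bd17e86-8bed-4c9b-a863-9f3928e4c71b.jsonl.gz') would yield each record as a Python dict.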

Using the API

The access token can be submitted using the token request parameter, or alternatively in the Authorization field of the HTTP request header (see Getting Started > API Access in the API documentation for details).
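Both styles can be sketched with Python's requests library; preparing the requests (rather than sending them) shows how the token is attached in each case:

```python
import requests

API_TOKEN = 'YOUR-TOKEN'  # generated in the Subscriptions tab of your Lens account
RELEASE_URL = 'https://api.lens.org/bulk/scholarly/release'

# Option 1: pass the token as the `token` request parameter
req_param = requests.Request('GET', RELEASE_URL,
                             params={'token': API_TOKEN}).prepare()

# Option 2: pass the token in the Authorization header (Bearer scheme)
req_header = requests.Request('GET', RELEASE_URL,
                              headers={'Authorization': 'Bearer ' + API_TOKEN}).prepare()

# Either prepared request can then be sent with requests.Session().send(...)
```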

Latest Data Release Endpoint

This endpoint returns the latest bulk data file listing and associated metadata. Since bulk data files are released fortnightly, these endpoints can be called from a scheduler to automate checking for new files.

  • https://api.lens.org/bulk/scholarly/release
  • https://api.lens.org/bulk/patent/release

The fields and data schema used in the Release Endpoint are listed below.

  • bulkDataProduct — Product associated with the download file, e.g. SCHOLARLY, PATENT
  • dataVersion — Bulk data version, usually in YEAR-WEEK format, e.g. 202336
  • records — Number of records in the file, e.g. 264329940
  • md5sum — MD5 hash of the compressed file, e.g. de9acdacff6e099d5e6c7aa22d0eaeac
  • fileName — Name of the file, usually product-week-UUID.jsonl.gz, e.g. scholarly-202336-0bd17e86-8bed-4c9b-a863-9f3928e4c71b.jsonl.gz
  • firstRecordId — LensId of the first record in the file, e.g. 043-486-543-492-05X
  • lastRecordId — LensId of the last record in the file, e.g. 145-007-429-245-212
  • dateProcessed — Date the file was processed by Lens, e.g. 2023-09-16
  • downloadAccessKey — Access key to download the file using the Download Endpoint, e.g. hN74L3upIqYNU4qK-sdU7-g1YwakLw1Hu3edL6wa7YP3cS...
  • size — Compressed file size in bytes, e.g. 204149739824
  • rawSize — Uncompressed file size in bytes, e.g. 716653744007

Example Request:

[GET] https://api.lens.org/bulk/patent/release?token={access_token}

Example Response:


{
    "bulkDataProduct": "PATENT",
    "dataVersion": "202338",
    "records": 150757805,
    "md5sum": "c1fe56f4dd327823f3be9294efc6b2a8",
    "firstRecordId": "054-211-414-158-561",
    "lastRecordId": "168-549-513-683-443",
    "dateProcessed": "2023-09-29",
    "downloadAccessKey": "5jJnJVadp-wAFweuaGN_XRMVinSNA5nP4TM1cOjpUUrIMvTaqb68...",
    "size": 458139434490
}

Download Endpoint

Bulk data files can be downloaded using either the API or UI. Downloads via the API and UI are both subject to rate-limiting based on your subscription plan.

Download using API

To download bulk data files using the API, you will need the file download access key and your API access token. The download access key (downloadAccessKey) is available from the Release Endpoint listed above. The download API endpoint is:

  • https://api.lens.org/bulk/download/{downloadAccessKey}

The integrity of the downloaded file can be verified using the md5sum available from the Release Endpoint. Similarly, the number of records and first/last record LensId can be used to verify the file after extraction.
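The MD5 check can be sketched in Python, hashing the file in chunks so large downloads do not need to fit in memory (the helper name verify_md5 is illustrative):

```python
import hashlib

def verify_md5(path, expected_md5, chunk_size=1024 * 1024):
    """Compare the MD5 hash of a downloaded .gz file with the md5sum
    reported by the Release Endpoint."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

A mismatch indicates a corrupted or incomplete download, and the file should be fetched again.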

Note: Check the uncompressed file size (rawSize) from the Release Endpoint to make sure you have enough disk space before extracting the file. Also, whatever HTTP client you use must be able to follow 302 redirects. For example:

  • wget 'https://api.lens.org/bulk/download/{downloadAccessKey}?token={access_token}' -O filename.gz

Where {downloadAccessKey} is from the Release Endpoint and {access_token} is your Lens subscription access token.

Example Scripts
Download Automation

The Python script below can be used to automate downloading the latest bulk data file to a location. It fetches the file metadata from the Latest Data Release Endpoint, checks whether that release has already been downloaded, and saves new files to the specified location. Persisting the downloaded file information (and checking against it on the next run) is left for you to implement.


import requests
from datetime import datetime
import os

api_host = 'https://api.lens.org'
api_token = 'YOUR TOKEN'
output_location = os.getcwd()

class ReleaseFileInfo:
    def __init__(self, product, data_version, num_records, md5sum, file_name, first_record_id, last_record_id,
                 date_processed, download_access_key, compressed_size, uncompressed_size):
        self.product = product
        self.data_version = data_version
        self.num_records = num_records
        self.md5sum = md5sum
        self.file_name = file_name
        self.first_record_id = first_record_id
        self.last_record_id = last_record_id
        self.date_processed = date_processed
        self.download_access_key = download_access_key
        self.compressed_size = compressed_size
        self.uncompressed_size = uncompressed_size

def __get_current_release(data_type) -> ReleaseFileInfo:
    release_url = api_host + '/bulk/%s/release' % data_type
    print(release_url)
    headers = {'Authorization': 'Bearer ' + api_token, 'Content-Type': 'application/json'}
    release = requests.get(release_url, headers=headers).json()
    return ReleaseFileInfo(release['bulkDataProduct'], release['dataVersion'], release['records'], release['md5sum'],
                           release['fileName'], release['firstRecordId'], release['lastRecordId'],
                           datetime.strptime(release['dateProcessed'], '%Y-%m-%d'), release['downloadAccessKey'],
                           release['size'], release['rawSize'])

# Check if the provided download has been already processed
def __download_already_exists(release_info):
    raise NotImplementedError('Implement release update file check')

# Persist the latest parse info in your store. Use the same to check against `__download_already_exists`
def __persist_latest_parse_info(release_info):
    raise NotImplementedError('Implement persistence of downloaded file info')

# Download and check the file integrity
def __download_file(release_info):
    download_url = api_host + '/bulk/download/' + release_info.download_access_key
    headers = {'Authorization': 'Bearer ' + api_token}
    output_filename = output_location + '/' + release_info.product + '/' + release_info.data_version + '/' + release_info.file_name
    os.makedirs(os.path.dirname(output_filename), exist_ok=True)
    with requests.get(download_url, headers=headers, stream=True) as r:
        if r.status_code == requests.codes.too_many_requests:
            print('Download is rate limited. Please check your usage')
            return
        # Fail fast on any other error status instead of writing the error body to disk
        r.raise_for_status()
        downloaded = 0
        with open(output_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024*8):
                f.write(chunk)
                downloaded += len(chunk)
        if downloaded != release_info.compressed_size:
            raise ValueError('Mismatched download file size')
    print('Finished downloading file: %s' % output_filename)

def start_updator(data_type):
    release_info = __get_current_release(data_type)
    if __download_already_exists(release_info):
        print('This release has been already processed: %s(%s)' % (release_info.product, release_info.data_version))
    else:
        __download_file(release_info)
        __persist_latest_parse_info(release_info)

# Usage
# start_updator('scholarly')
# start_updator('patent')

Data Import/Processing

This example Python script utilises a single producer with multiple consumers to make the data import process more performant. It is a simple processor implementation with one producer thread reading the downloaded file and three consumer threads processing the records in parallel. You can tune performance by increasing the number of consumer threads and the size of the queue.


import gzip
import json
import queue
from threading import Thread
from queue import Empty
from time import sleep


def consume(data_queue, consumer_id):
    while True:
        try:
            record = data_queue.get(block=False)
        except Empty:
            # wait for data to be available
            sleep(0.5)
            continue
        if record is None:
            break
        # Do something with the data
        print('%s: processing record > %s' % (consumer_id, json.loads(record)['lens_id']))
    print('Completed: ' + consumer_id)


def produce(location, data_queue, consumers_count):
    with gzip.open(location, 'rt') as f:
        for record in f:
            data_queue.put(record)
        # poison pill to stop the consumers
        for i in range(consumers_count):
            data_queue.put(None)


num_consumers = 3
file_location = 'scholarly-202346-6f64a086-94a7-47bc-b21d-5cec945e3705.jsonl.gz'
record_queue = queue.Queue(maxsize=100)

# Start multiple consumers
consumers = [Thread(target=consume, args=(record_queue, 'consumer-%s' % n,)) for n in range(num_consumers)]
for consumer in consumers:
    consumer.start()

# Single producer to read the file and push record into the queue
producer = Thread(target=produce, args=(file_location, record_queue, len(consumers),))
producer.start()

# Finish the processing pipeline
producer.join()
for consumer in consumers:
    consumer.join()

Batch Download and Filtering

The example script below uses shell utilities to download the bulk data file, split it on the fly into smaller chunks of approximately 1 GB each, and compress them using zstd. You can add a line filter anywhere before the split (using a regexp, grep, or a Python script) to keep only the records you need, reducing local storage requirements. This assumes you're using a device/environment where GNU split is available.

  • curl -L {download-url-with-token} | gunzip | split --line-bytes=1G --numeric-suffixes --suffix-length=6 --filter='zstd -5 > $FILE.jsonl.zst' - blk-

Credit: National Science Foundation | Dawid Weiss
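A JSON-aware line filter for that pipeline could be sketched in Python as below; the script name filter.py, the keep helper, and the FIELD/VALUE arguments are illustrative, and the stage would sit between gunzip and split (e.g. ... | gunzip | python3 filter.py lens_id 043-486-543-492-05X | split ...):

```python
# filter.py - keep only records whose top-level FIELD equals VALUE.
import json
import sys

def keep(line, field, value):
    """Return True if the JSON record on this line has field == value."""
    try:
        return json.loads(line).get(field) == value
    except json.JSONDecodeError:
        return False

if __name__ == '__main__':
    field, value = sys.argv[1], sys.argv[2]
    for line in sys.stdin:
        if keep(line, field, value):
            sys.stdout.write(line)
```

Unlike grep, this matches against the parsed field value, so substrings elsewhere in the record do not produce false positives.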

Download using UI

Bulk data files can also be downloaded manually from the Subscriptions tab in your Lens account. Active bulk data plans include details of the latest bulk data file and a Download Bulk Data button for downloading the latest bulk data file.

Usage Endpoint

Please Note: Bulk data downloads via the Download API endpoint and UI are rate-limited. The number of allowed and remaining download requests per file can be viewed from the subscription usage endpoints. When downloading bulk data, you will receive status code 429 if the number of allowed download requests has been exceeded. The usage endpoints are:

  • https://api.lens.org/subscriptions/patent_bulk/usage
  • https://api.lens.org/subscriptions/scholarly_bulk/usage

Example response:


[
    {
        "remaining": 3,
        "allowed": 3,
        "frequency": "1 MONTH",
        "type": "RESOURCE",
        "resetDate": "2023-12-27T04:41:33.701Z",
        "note": "The allowed quota is applicable to unique resource access."
    }
]
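A download script can consult this response before requesting a file to avoid 429 errors. A minimal sketch (the helper name downloads_remaining is illustrative):

```python
def downloads_remaining(usage_response):
    """Return the smallest remaining download count across the usage entries.

    `usage_response` is the parsed JSON list returned by the
    subscription usage endpoints shown above.
    """
    return min(entry['remaining'] for entry in usage_response)

# Sketch: fetch the usage data first, e.g. with requests:
#   usage = requests.get('https://api.lens.org/subscriptions/patent_bulk/usage',
#                        headers={'Authorization': 'Bearer ' + api_token}).json()
#   if downloads_remaining(usage) == 0:
#       print('Download quota exhausted until %s' % usage[0]['resetDate'])
```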

Updated on November 29, 2023