PatSeq Bulk Download

This page provides a description of the PatSeq bulk download file formats offered and instructions for accessing and downloading the bulk data using the PatSeq Data* app or download API.

We provide the entire collection of biological sequences disclosed in patents for download in FASTA and Rich file formats. To request access, please register and/or sign in to your Lens account and go to the PatSeq tab on the API & Data page, select the option that best suits your needs and complete the PatSeq Bulk Download request form.

*Note: Only FASTA files are available for download via the PatSeq Data app.


File Formats

FASTA File Format

Sequence data files in FASTA format are provided both for individual jurisdictions and also for the entire sequence database. Each set consists of multiple files, grouped by:

  • sequence type (nucleotides, peptides),
  • document type (grants, applications), and
  • sequence location in the patent document (in claims, all locations – please note: “in claims” is a subset of the “all” dataset). For example, “Grants: Nucleotides (all)” refers to nucleotide sequences disclosed in granted patent documents regardless of where they are referenced in the documents, whereas “Grants: Nucleotides (in Claims)” refers to nucleotide sequences referenced in the claims of the granted patent documents.

FASTA File Example

>gnl|patseq|US_7510834_B2-23062 Sequence 23062 from Patent US_7510834_B2
TCTCAAGTACTCAGTGATCCAGGAGAGCAAGGACATGTGAGGTCAATGGACCTCTATGTGAGGATATTGGCTGAGAAAACAAAACAAAACAAAACAAAACAAAACAAAACAAAACAAAAACTCCTATGAAGGATTTTCTCTTAACCGGCCTAATGCAGACATAAGCTATACAAACACATTGCACCAAGATTATTTGGGGCACAGGGCATGAAATAGTGAGATGGGAATAAGAAGGGCATAAAAATGATTCTTAAATACTCCATGTTTCAGTAACAGCTCCTAACA

Here you can download some sample data (1,000 sequences, FASTA format, 920KB).
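
As an illustration of how the FASTA files might be consumed, here is a minimal Python sketch that splits each header into its patent publication key and sequence number. The header layout (gnl|patseq|<patent key>-<sequence number> followed by a description) is inferred only from the single example above, and the names PatSeqRecord and read_patseq_fasta are hypothetical, not part of any Lens tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class PatSeqRecord(NamedTuple):
    patent_key: str   # e.g. "US_7510834_B2"
    seq_number: int   # e.g. 23062
    sequence: str


# Assumed header layout, inferred from the single example above:
#   gnl|patseq|<patent key>-<sequence number> <description>
# The leading ">" of standard FASTA is accepted but not required.
HEADER_RE = re.compile(r"^>?gnl\|patseq\|(?P<key>\S+)-(?P<num>\d+)(?:\s|$)")


def read_patseq_fasta(lines: Iterable[str]) -> Iterator[PatSeqRecord]:
    """Yield one record per header line, joining wrapped sequence lines."""
    key, num, chunks = None, None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            # A new header closes the previous record, if there was one.
            if key is not None:
                yield PatSeqRecord(key, num, "".join(chunks))
            key, num, chunks = match["key"], int(match["num"]), []
        elif line.startswith(">"):
            # A FASTA header that does not follow the assumed layout.
            raise ValueError(f"Unrecognised header: {line!r}")
        elif key is not None:
            chunks.append(line)
    if key is not None:
        yield PatSeqRecord(key, num, "".join(chunks))
```

Because the function accepts any iterable of text lines, a downloaded .fa.gz file can be streamed without unpacking it first, for example by passing the handle returned by gzip.open(path, "rt").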

Rich File Format

Sequence data is also provided in a Rich data format, which is a custom annotated flat file format based on the European Molecular Biology Laboratory (EMBL) flat file format, but modified to accommodate rich patent metadata. Rich data files are also provided both for individual jurisdictions and the entire sequence database. Sequence and patent metadata fields available in the Rich data format include:

| Field | Description | Example |
|-------|-------------|---------|
| ID | Identifiers | US_2002_0040130_A1_17; 162025 BP |
| AC | Accession number | US_2002_0040130_A1_17 |
| PN | Patent publication key | US_2002_0040130_A1 |
| PD | Patent publication date | 04-APR-2002 |
| PF | Patent filing number and date | US_83470001_A; 12-APR-2001 |
| PS | Simple family size | 17 |
| PX | Patent priority with date | US_21725100; 10-JUL-2000 |
| PT | Patent title | Polymorphic kinase anchor proteins and nucleic acids encoding the same |
| PL | Patent sequence document location (currently only claims are included as a location, with the claim number if available) | Claim 47 |
| PB | Patent applicants | SEQUENOM INC |
| PI | Patent inventors | BRAUN ANDREAS |
| PO | Patent assignees/owners | SEQUENOM INC |
| OS | Declared organism/species | Homo sapiens |
| DR | Database references (Lens ID and patent office publication key) | lens.org; 158-517-731-452-986 |
|    |  | US; US_2002_0040130_A1 |
| SQ | Sequence | Sequence 162025 BP; |
|    |  | GAATTCCTAT TTCAAAAGAA ACAAATGGGC CAAGTATGGT GGCTCATACC TGTAATCCCA 60 |
|    |  | GCACTTTGGG AGGCCGAGGT GAGTGGGTCA CTTGAGGTCA GGAGTTCCAG GCCAGTCTGG 120 |
| XX | Field terminator |  |

Rich File Example

ID   US_7510834_B2_11947; DNA; 465 BP.
XX
AC   US_7510834_B2_11947;
XX
PN   US_7510834_B2
PD   31-MAR-2009
PF   US_67412403_A; 26-SEP-2003
XX
PS   2
XX
PX   JP_2000112699; 13-APR-2000.
PX   JP_0007621; 30-OCT-2000.
PX   JP_2002327516; 28-SEP-2002.
PX   JP_2002383869; 09-DEC-2002.
PX   US_25751103; 07-MAR-2003.
PX   US_67412403; 26-SEP-2003.
XX
PT   Gene mapping method using microsatellite genetic polymorphism markers
XX
PL   Claim 1;
PL   Claim 3;
PL   Claim 5;
XX
PB   INOKO HIDETOSHI
PB   TAMIYA GEN
XX
PI   INOKO HIDETOSHI
PI   TAMIYA GEN
XX
PO   INOKO HIDETOSHI
PO   TOKAI UNIVERSITY
XX
OS   Homo sapiens
XX
DR   lens.org; 071-967-192-244-830.
DR   US; US_7510834_B2.
XX
SQ   Sequence 465 BP; 124 A; 118 C; 82 G; 141 T; 0 U; 0 other;
actgtagcca tgcactcaca taatgctaat attgcctaat catataatct taaagacttc        60

Here you can download some sample data (1,000 sequences, Rich format, 2.48MB).

Note: Rich files are only available for download via the PatSeq Bulk Download API.
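
To make the Rich layout more concrete, here is a minimal Python sketch that reads one record into a dictionary of field codes plus the sequence string. It relies only on what the example above shows: a two-letter field code followed by spaces and a value, XX lines as separators, and sequence lines after SQ that end in a running position count. Stopping at a "//" line is an assumption borrowed from the EMBL flat file format and is not shown on this page. The function name parse_rich_record is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_rich_record(lines: Iterable[str]) -> Tuple[Dict[str, List[str]], str]:
    """Parse the lines of a single Rich record.

    Returns (fields, sequence), where fields maps each two-letter code to
    the list of raw values seen for it (PX, PL, PB, ... can repeat).
    """
    fields: Dict[str, List[str]] = {}
    residues: List[str] = []
    in_sequence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # Assumption: EMBL-style record terminator, not shown in the example.
            break
        if in_sequence:
            # Sequence lines: blocks of residues plus a trailing position count.
            residues.extend(tok for tok in line.split() if not tok.isdigit())
            continue
        code, _, value = line.partition(" ")
        if code == "XX":
            continue  # field separator, carries no data
        fields.setdefault(code, []).append(value.strip())
        if code == "SQ":
            in_sequence = True  # everything after the SQ header is sequence
    return fields, "".join(residues)
```

Values are kept exactly as they appear in the file, including trailing ";" or "." characters, since the page does not specify which punctuation is part of the value.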


How to download

  1. Before you are able to download, you will need to register and/or sign in to request access to PatSeq Bulk Downloads.
  2. Once you are signed in, you can request access by going to the PatSeq Data tab on the API & Data page and selecting the option that best suits your needs, as shown below.
  3. In the PatSeq Data tab, you can review the pricing and options. Access plans and pricing are structured based on:
    • Product type (FASTA Human Genome dataset, FASTA full dataset or Rich full dataset),
    • Download frequency (monthly or one-off),
    • Use type (academic vs commercial), and
    • Commercial use tier, which is based on the organisation’s size and license requirements (e.g. whether the data will be used internally or as part of a commercial product).

Please see the PatSeq Bulk Download Terms of Use for license details and definitions of academic and commercial use.

  4. Read the PatSeq Bulk Download Terms of Use and the general Lens Terms of Use.
  5. Complete and submit the request form. You will receive an email confirming your request.
  6. Once your request is approved, an invoice from the Lens team (if applicable) will be sent. For payment details, please see the PatSeq Bulk Download Terms of Use.
  7. Once bulk download access is granted, you will receive an email confirming your access and it will be enabled in your Lens account. You can then download the sequence files using your web browser and the PatSeq Data app, or programmatically using the download API.

Instructions for both download options are provided below.


PatSeq Data App

The PatSeq Data app allows users to download bulk data* through the user interface, with no programming required. Within the PatSeq Data app, you can use the “Sequence download” buttons to download data files for individual jurisdictions or for the entire PatSeq database.

*Note: Only FASTA files are available for download via the PatSeq Data app.


Bulk Download files in “PatSeq Data” app

The data is provided in multiple files, grouped by:

  • document type (grants, applications),
  • sequence type (nucleotides, peptides), and
  • document location (in claims, all locations – please note: “in claims” is a subset of the “all” dataset). For example, “Grants: Nucleotides (all)” refers to nucleotide sequences disclosed in granted patent documents regardless of where they are referenced in the documents, whereas “Grants: Nucleotides (in Claims)” refers to nucleotide sequences referenced in the claims of the granted patent documents.

Download API

An API is also provided for downloading PatSeq bulk data programmatically (e.g. in automated scheduled scripts). To use the API, you will need to create an API access token to authenticate your application/client and access the download files.

Create your access token

You can create and manage your API access tokens in the Active Access tab of the API & Data page. Click on the “Create Token” button to create a new token. You can generate up to 5 tokens and label each one individually to tell them apart.

Note: When generating a new access token, ensure you copy it somewhere safe. For security reasons, it won’t be displayed again. To revoke an API access token, delete it.


Creating an API access token in the Active Access tab.

Using the API

The API endpoint is https://www.lens.org/lens/bio/psd/api.

We also provide technical documentation with direct access to the API through a Swagger interface, available at https://support.lens.org/patseq-bulkdownload-apidoc/

The Bulk Download API provides two endpoints:

  • /files – to retrieve a list of available files
    Method: GET
    URL: https://www.lens.org/lens/bio/psd/api/files
    Request Parameters:
      • access_token : (String) your API access token
  • /download – to download specific files
    Method: GET
    URL: https://www.lens.org/lens/bio/psd/api/download
    Request Parameters:
      • access_token : (String) your API access token
      • type : (String) the file type of your access plan (hg_fasta, fasta or rich)
      • file : (String) relative file path as returned by the /files API call

Note: All request parameters are required. The access token can be submitted either in the HTTP Authorization header or as the access_token URI request parameter. See RFC 6750 (Bearer Token Usage) for more details.
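
The two ways of passing the token can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names build_request and list_files are hypothetical, and since the structure of the JSON returned by /files is not described here, list_files simply returns the decoded JSON as-is.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://www.lens.org/lens/bio/psd/api"


def build_request(endpoint: str, token: str, use_header: bool = True,
                  **params: str) -> urllib.request.Request:
    """Build a GET request for /files or /download.

    With use_header=True the token goes into the Authorization header;
    otherwise it is appended as the access_token query parameter.
    """
    query = dict(params)
    headers = {}
    if use_header:
        headers["Authorization"] = f"Bearer {token}"
    else:
        query["access_token"] = token
    url = f"{API_BASE}/{endpoint.lstrip('/')}"
    if query:
        # safe="/" keeps relative file paths readable in the URL.
        url += "?" + urllib.parse.urlencode(query, safe="/")
    return urllib.request.Request(url, headers=headers, method="GET")


def list_files(token: str):
    """Call /files and return the decoded JSON (structure not assumed)."""
    with urllib.request.urlopen(build_request("files", token)) as resp:
        return json.load(resp)
```

Sending the token in the Authorization header keeps it out of URLs, which tend to end up in shell history and server logs; the query parameter form is shown only because the API accepts both.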

API Examples

Get Files

To return a list of all the available download files, you should use the following URL, replacing {your_personal_token} with your API access token:

https://www.lens.org/lens/bio/psd/api/files?access_token={your_personal_token}

A list of download files will be returned in JSON format (see the example below).

PatSeq Bulk Downloads files JSON endpoint

Get Download

To download a specific file, use the following URL and parameters:

https://www.lens.org/lens/bio/psd/api/download?access_token={your_personal_token}&type={file_type}&file={file}

Here, {your_personal_token} is your API access token, {file_type} is the file type for your access plan and {file} is the relative file path as returned by the /files API call. For example, to download all the latest nucleotide sequences extracted from US grants and referenced in the claims in FASTA file format, use the following link:

https://www.lens.org/lens/bio/psd/api/download?access_token={your_personal_token}&type=fasta&file=us/grant/na-claims.fa.gz

The same download using wget:

wget "https://www.lens.org/lens/bio/psd/api/download?access_token={your_personal_token}&type=fasta&file=us/grant/na-claims.fa.gz" -O us-grants-na-claims.fa.gz

Or you can specify the access token in the HTTP Authorization header:

  • HTTP Authorization header: "Authorization: Bearer {your_personal_token}"
  • Request URL: "https://www.lens.org/lens/bio/psd/api/download?type=fasta&file=us/grant/na-claims.fa.gz"

Please note, whatever HTTP client you use will need to be able to follow 302 redirects.
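
For scripted downloads, the wget call above could be approximated in Python as follows. This is a minimal sketch using only the standard library, whose default opener (urllib.request.urlopen) follows 302 redirects automatically. The helper names download_request, local_name and download_file are hypothetical, and the local file naming is just one possible convention.

```python
import shutil
import urllib.parse
import urllib.request

DOWNLOAD_URL = "https://www.lens.org/lens/bio/psd/api/download"


def download_request(token: str, file_type: str, file_path: str) -> urllib.request.Request:
    """Build the /download request with the token in the Authorization header."""
    query = urllib.parse.urlencode({"type": file_type, "file": file_path}, safe="/")
    return urllib.request.Request(
        f"{DOWNLOAD_URL}?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def local_name(file_path: str) -> str:
    """Flatten a relative API path into a single local file name."""
    return file_path.strip("/").replace("/", "-")


def download_file(token: str, file_type: str, file_path: str) -> str:
    """Stream one file to the current directory and return its local name."""
    dest = local_name(file_path)
    request = download_request(token, file_type, file_path)
    # urlopen follows the 302 redirect; copyfileobj streams in chunks,
    # so the whole archive is never held in memory.
    with urllib.request.urlopen(request) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

A call such as download_file(token, "fasta", "us/grant/na-claims.fa.gz") would then write us-grant-na-claims.fa.gz to the working directory, provided the token has access to the fasta file type.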