PatSeq Data

Patent sequence data is released to the public from three major sources: 1) national patent offices, 2) collaborating third-party public sequence listing repositories, and 3) regional or global intellectual property organizations.

In the app, users will be able to request permission to bulk download the data. Access is free for non-commercial use and subject to a fee for commercial use. For additional information, please see the Terms of Use for bulk sequence download.

In addition, users can select a specific jurisdiction and view statistical details about the sequence listings available for that jurisdiction from the three sources, along with their corresponding patent documents (grants and applications). They can also compare the patent sequence data that a patent office officially declared as available in the Cambia 2011 survey (where available) with the data available in the Lens PatSeq databases, and download the data, provided access has been granted.

Additional statistics on sequence type and location in the patent document are provided. Users will also find information on the patentability requirements for nucleotide or peptide sequences in that jurisdiction, and on some major legal changes introduced to accommodate patenting activities for biological inventions.

Sources of sequence listings

Below are the primary sources of various sequence data, together with additional bibliographic and full-text information from published application or issued patent documents.

  1. National Patent offices
  2. Collaborating 3rd party public sequence listings repositories
  3. Regional or global IP organisations

Each of the sources provides a unique but complementary set of data. Cambia is working towards complete coverage of sequence data by drawing on these various sources from patent offices around the globe. Below are some examples of the three sources. For more details, see the PatSeq Data page.

1. National patent offices
1.a. United States Patent and Trademark Office (USPTO)

1992 was the first year in which U.S. patent grants contained separate sequence listings. Before then (1976-1991), sequences could appear anywhere in a patent rather than being broken out separately, and are therefore harder to parse out. Their format is ASCII text.

Since approximately 1994 or 1995, the USPTO has been forwarding sequence listings for granted patents to NCBI, which makes the sequences available in GenBank flatfile format.
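As an illustration of what working with GenBank flatfile data can look like, the following is a minimal Python sketch that pulls the accession and sequence out of each record. It relies only on the general flatfile layout (an ACCESSION line, an ORIGIN section holding numbered sequence lines, and "//" closing each record). The function name and the sample record are hypothetical and are not taken from real patent data or from any Lens tooling.

```python
def parse_genbank_records(text):
    """Yield (accession, sequence) pairs from GenBank flatfile text.

    Minimal sketch: only the ACCESSION line, the ORIGIN section and the
    "//" record terminator are interpreted; everything else is ignored.
    """
    accession = None
    in_origin = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("//"):
            # "//" closes the current record
            if accession is not None:
                yield accession, "".join(chunks)
            accession, in_origin, chunks = None, False, []
        elif line.startswith("ACCESSION"):
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else None
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif in_origin:
            # Sequence lines hold a position number followed by blocks
            # of residues; keep only the alphabetic blocks.
            chunks.extend(tok for tok in line.split() if tok.isalpha())


# Hypothetical record, made up for illustration only
sample = """\
LOCUS       EXAMPLE1                  24 bp    DNA     linear   PAT 01-JAN-2000
DEFINITION  Hypothetical sequence 1 from a hypothetical patent.
ACCESSION   EXAMPLE1
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""

records = list(parse_genbank_records(sample))
```

For real files, an established parser such as the one in Biopython is likely a better choice than a hand-written one; the sketch is only meant to show the record structure.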

Since 2001, the USPTO has provided full-text patent sequences in XML format (ST.25) for both granted patents and published applications. The sequences are either located within the full-text documents or split out into bulk sequence listing files at the Publication Site for Issued and Published Sequences (PSIPS).

Current USPTO practice: the office processes sequence listings from granted patents and published applications on a weekly basis, using a tool provided by NCBI. NCBI chooses what to publish in its GenBank patent database subdivision.

1.b. Japan’s patent office (JPO)

The JPO makes sequence listings available per document through http://www.ipdl.inpit.go.jp/homepg_e.ipdl

Full data access: “we do not offer download service or feed the data by FTP, for all our product lines.”

1.c. The Chinese Intellectual Property Office (SIPO)

WAITING FOR REPLY

1.d. The German Patent and Trade Mark Office (DPMA)

The DPMA provides megafiles with sequence listings for documents from 2001 onwards:
https://datenabgabe.dpma.de/mega/en/mega.jsp

2. Collaborating 3rd party sequence listings repositories

2.a. NCBI

NCBI does not enter any patent data itself. It receives only grants, in the form of weekly updates of newly issued US patents that contain sequence data.

2.b. EMBL-EBI

Website states:

“Provide various levels of merged sequence datasets for both aa and nt as EMBL flatfile format which are updated quarterly. Looks like the protein dataset only includes patent applications (http://www.ebi.ac.uk/patentdata/proteins).”

Patent sequence data is released to EBI quarterly. EBI does not filter the data; it only converts it into EMBL flatfile format. The non-redundant (NR) database is filtered to remove redundancy.
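To make the idea of removing redundancy concrete, here is a small Python sketch that collapses entries with identical sequences into one group. It assumes that "redundant" means an exact, case-insensitive match; that is an assumption made for illustration, and the criteria actually used for the NR database may differ. The function name and the entry identifiers are hypothetical.

```python
def collapse_identical(entries):
    """Group entry IDs by identical sequence (case-insensitive).

    `entries` is an iterable of (entry_id, sequence) pairs. The result
    maps each distinct upper-cased sequence to the list of entry IDs
    that share it, in the order they were first seen.
    Assumption: exact-match redundancy only, no similarity clustering.
    """
    groups = {}
    for entry_id, sequence in entries:
        groups.setdefault(sequence.upper(), []).append(entry_id)
    return groups


# Hypothetical entries, made up for illustration only
entries = [("A1", "atgc"), ("B2", "ATGC"), ("C3", "ggcc")]
non_redundant = collapse_identical(entries)
```

Keeping the list of contributing entry IDs, rather than discarding duplicates, preserves the link from each unique sequence back to every patent document that discloses it.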

2.c. DDBJ

Website states:

“Provide sequences in DDBJ flat file format for KIPO and JPO including patents since 1997. Available for ftp download since 2010.”

DDBJ receives nucleotide (nt) and amino acid (aa) data from the JPO once or twice a month. “Therefore we distribute the sequence of only published patent publication data, not patent application data.” KIPO resumed submitting data to DDBJ through Kobic.

http://www.ddbj.nig.ac.jp/ddbjnew/mag/column/patent-e.html

3. Regional IP organizations

3.a. EPO

The EPO has sequence listings available for EP applications after 2000 that were lodged electronically: http://www.epo.org/searching/free/publication-server/sequence-listings.html

The EPO forwards application sequences to EBI on a weekly basis, covering filings since the early 1980s.

3.b. WIPO

All patent applications that contain sequences are published in the PCT sequence listings at PATENTSCOPE, which has sequence listings from 1999 onwards. When the WIPO listings were downloaded, 30% were full-text files; the rest are PDF or image files that need to be processed with OCR to make them machine-searchable.

International Searching Authorities (the national Offices of Australia, Austria, Canada, China, Finland, Japan, the Republic of Korea, the Russian Federation, Spain, Sweden and the United States of America, and the European Patent Office) submit sequence data directly to external database partners (NCBI, EBI, DDBJ), so WO sequences may occasionally appear in these databases.

FAQ

How often is the PatSeq data updated?

We aim to update the data regularly, at least monthly for now, and we hope to move to bi-weekly updates if additional resources become available.

Can I download the whole PatSeq data?

The PatSeq data is downloadable under a Creative Commons 3.0 BY-NC-SA license. For any commercial use, please contact Osmat@cambia.org. The data is downloadable based on document type (whether the sequence is derived from granted patents, “Grants”, or from published patent applications, “Applications”) and based on the location of the sequence in the patent document. For example, if you are interested in the sequences that are referenced in the claims of granted patents, you may want to download the sequences from grants “in claims”. If you want all the sequences that are disclosed in granted patents, regardless of whether they are referenced in the claims section, you may download “all” the datasets from Grants.
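The choice between the “in claims” and “all” datasets can be pictured as a simple filter over sequence records. The Python sketch below is illustrative only: the record fields, the dataset labels and the function name are hypothetical, and they do not describe the actual layout of the PatSeq download files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    publication: str  # hypothetical publication number
    doc_type: str     # "grant" or "application" (hypothetical labels)
    in_claims: bool   # True if the sequence is referenced in the claims


def select_dataset(records, doc_type, location="all"):
    """Return the records for one document type and sequence location.

    location="all" keeps every sequence disclosed in the chosen document
    type; location="in_claims" keeps only those referenced in the claims.
    """
    if location not in ("all", "in_claims"):
        raise ValueError(f"unknown location: {location!r}")
    return [
        r for r in records
        if r.doc_type == doc_type and (location == "all" or r.in_claims)
    ]


# Hypothetical records, made up for illustration only
records = [
    SequenceRecord("DOC-0001", "grant", True),
    SequenceRecord("DOC-0001", "grant", False),
    SequenceRecord("DOC-0002", "application", True),
]

grants_in_claims = select_dataset(records, "grant", "in_claims")
grants_all = select_dataset(records, "grant")
```

In this picture the “in claims” set is always a subset of the “all” set for the same document type, which matches the description above.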

How can I find the information on all the sequences in a given patent or publication number in the Lens?

In the Lens, patent documents containing biological sequences are mixed with other patents. The patent number search for sequences has therefore been merged into the normal patent search: simply enter the patent number you are interested in into the broad search field at lens.org. The search results indicate sequence listing availability with the double helix icon, which also links directly to the sequences of the patent.

 

As an example:

Enter a patent number into the search field

Examine the results for relevant hits (NB: the double helix indicates that a sequence listing is available)

The double helix provides a direct link to the sequence listing

Browse through the sequence listing

Access is free for regular sequence downloads from the sequence tab or through PatSeq Explorer and Analyzer tools.