PatSeq Data

PatSeq Data enables access to patents disclosing genetic sequences and bulk downloads of disclosed sequence data based on jurisdiction, document type, and either sequence type or sequence location. It also serves as a global, open repository for national systems to enable public sharing of sequence data associated with patents.

Patent Sequence Data is released to the public from 3 major sources: 1) national patent offices, 2) collaborating 3rd party public sequence listings repositories, and 3) regional or global intellectual property organizations.

In the app, users can request permission to bulk download the data. Access is free for non-commercial use; commercial use requires a fee. For additional information, please see the Terms of use of Bulk sequence download.

In addition, users can select a specific jurisdiction and then: view statistics on the sequence listings available for that jurisdiction from the three sources, and on their corresponding patent documents (grants and applications); compare the patent sequence data that a patent office officially declared as available in the Cambia 2011 survey (where available) with the data available in the Lens PatSeq databases; and download the data, provided access has been granted.

Additional statistics on sequence type and sequence location in the patent document are provided. Users will also find information on the patentability requirements for nucleotide or peptide sequences in that jurisdiction, and on some major legal changes introduced to accommodate patenting activities for biological inventions.

Sources of sequence listings

Below are the primary sources of various sequence data, together with additional bibliographic and full-text information from published application or issued patent documents.

  1. National patent offices
  2. Collaborating 3rd party public sequence listings repositories
  3. Regional or global IP organizations

Each source provides a unique but complementary set of data. Cambia aims for complete coverage of sequence data by drawing on these sources from patent offices around the globe. Below are some examples of the three sources. For more details, see the PatSeq Data page.

1. National patent offices

1.a. United States Patent and Trademark Office (USPTO)

1992 was the first year in which U.S. patent grants contained separate sequence listings. Before then (1976-1991), sequences could appear anywhere in a patent rather than being broken out separately, which makes them harder to parse. Their format is ASCII text.

Since about 1994 or 1995, the USPTO has been forwarding sequence listings for granted patents to NCBI, which makes the sequences available in GenBank flatfile format.
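For readers who want to work with such files, the following is a minimal sketch of pulling the accession, definition and sequence out of a GenBank flatfile record. It assumes only the general flatfile layout (keyword lines, an ORIGIN block with numbered sequence lines, and a `//` terminator). The sample record and the function name `parse_genbank` are made up for illustration and do not come from NCBI or the Lens.

```python
from typing import Dict, Iterator, List

# Hypothetical record written for illustration; not real GenBank data.
SAMPLE = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   PAT 01-JAN-2000
DEFINITION  Sequence 1 from Patent US 0000000.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggcgtacg ttagcatcga tcga
//
"""


def parse_genbank(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per record with accession, definition and sequence."""
    record: Dict[str, str] = {}
    seq_parts: List[str] = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record: join the collected sequence blocks and emit.
            record["sequence"] = "".join(seq_parts)
            yield record
            record, seq_parts, in_origin = {}, [], False
        elif in_origin:
            # Sequence lines start with a position number, then blocks of bases.
            seq_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()


records = list(parse_genbank(SAMPLE))
for rec in records:
    print(rec["accession"], len(rec["sequence"]), rec["definition"])
```

The sketch ignores multi-line DEFINITION fields and feature tables; a maintained parser such as the one in Biopython is likely a better choice for production use.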

Since 2001, the USPTO has provided full-text patent sequences in XML format (ST.25) for both granted patents and published applications. The sequences are either located within the full-text documents or split out into bulk sequence listing files at the Publication Site for Issued and Published Sequences (PSIPS). In current practice, the USPTO processes sequence listings from granted patents and published applications on a weekly basis with a tool provided by NCBI. NCBI chooses what to publish in its GenBank Patent database subdivision.

1.b. Japan Patent Office (JPO)

The JPO makes sequence listings available per document through https://www.inpit.go.jp/english/index.html

Full data access: “we do not offer download service or feed the data by FTP, for all our product lines.”

1.c. The Chinese Intellectual Property Office (SIPO)

Waiting for reply.

1.d. The German Patent and Trade Mark Office (DPMA)

The DPMA provides megafiles with sequence listings for documents from 2001 onwards.
https://www.dpma.de/english/index.html

2. Collaborating 3rd party sequence listings repositories

2.a. NCBI

NCBI does not enter any patent data itself. It receives only grants, as weekly updates of newly issued US patents that contain sequence data.

2.b. EMBL-EBI

Website states:

“Provide various levels of merged sequence datasets for both aa and nt as EMBL flatfile format which are updated quarterly. Looks like the protein dataset only includes patent applications (http://www.ebi.ac.uk/patentdata/proteins).”

Patent sequence data is released to EBI quarterly. EBI does not filter the data; it only converts it into EMBL flatfile format. The NR (non-redundant) database is filtered to remove redundancy.
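As a rough illustration of what removing redundancy can mean, the sketch below collapses entries whose sequences are identical after normalising case and whitespace, and keeps track of which documents disclose each one. This is an assumption made for illustration only: the criteria EBI actually uses are not described here, and the function name `collapse_redundant` and the document identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collapse_redundant(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the document identifiers that disclose it.

    Only exact matches (after removing whitespace and upper-casing) are
    treated as redundant; this is an illustrative assumption.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, sequence in entries:
        key = "".join(sequence.split()).upper()
        grouped[key].append(doc_id)
    return dict(grouped)


# Hypothetical entries: (document identifier, sequence as it appears in a file).
entries = [
    ("DOC-A", "atgc atgc"),
    ("DOC-B", "ATGCATGC"),
    ("DOC-C", "ttgaca"),
]
non_redundant = collapse_redundant(entries)
for sequence, doc_ids in non_redundant.items():
    print(sequence, doc_ids)
```

Keeping the list of source documents alongside each distinct sequence preserves the link back to the patents, which a plain set of sequences would lose.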

2.c. DDBJ

Website states:

“Provide sequences in DDBJ flat file format for KIPO and JPO including patents since 1997. Available for ftp download since 2010.”

DDBJ receives nt & aa data from JPO once or twice a month. “Therefore we distribute the sequence of only published patent publication data, not patent application data.” KIPO resumed submitting data to DDBJ through KOBIC.

https://www.ddbj.nig.ac.jp/ddbj/patent-data-e.html

3. Regional IP organizations

3.a. EPO

The EPO has sequence listings available for EP applications after 2000 that were lodged electronically. https://www.epo.org/searching-for-patents/data/bulk-data-sets/sequence-listing.html

The EPO forwards application sequences to EBI on a weekly basis, covering filings since the early 1980s.

3.b. WIPO

All patent applications that contain sequences are published in the PCT sequence listings at PATENTSCOPE, which has sequence listings since 1999. When the WIPO listings were downloaded, 30% were full-text files; the rest were PDF or image files, which need to be OCRed to make them machine searchable.
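The split between full-text files and files that need OCR can be sketched as a simple triage by file extension. The extension sets, the function names `triage` and `text_share`, and the file names below are assumptions chosen for illustration; the actual file types in the WIPO downloads are not listed here.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Assumed extension sets; adjust to the file types actually encountered.
TEXT_SUFFIXES = {".txt", ".xml"}
OCR_SUFFIXES = {".pdf", ".tif", ".tiff"}


def triage(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Sort file names into full text, OCR candidates and unknown types."""
    buckets: Dict[str, List[str]] = {"text": [], "needs_ocr": [], "unknown": []}
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in TEXT_SUFFIXES:
            buckets["text"].append(name)
        elif suffix in OCR_SUFFIXES:
            buckets["needs_ocr"].append(name)
        else:
            buckets["unknown"].append(name)
    return buckets


def text_share(buckets: Dict[str, List[str]]) -> float:
    """Fraction of all files that are already machine-readable text."""
    total = sum(len(names) for names in buckets.values())
    return len(buckets["text"]) / total if total else 0.0


# Hypothetical file names for illustration.
files = ["a.txt", "b.PDF", "c.tif", "d.xml", "e.dat"]
buckets = triage(files)
share = text_share(buckets)
print(buckets, share)
```

A PDF can also contain embedded text, so an extension check is only a first pass; inspecting file content would be needed to decide reliably whether OCR is required.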

International Searching Authorities (the national Offices of Australia, Austria, Canada, China, Finland, Japan, the Republic of Korea, the Russian Federation, Spain, Sweden and the United States of America, and the European Patent Office) submit sequence data directly to external database partners (NCBI, EBI, DDBJ), so WO sequences may occasionally appear in these databases.

Updated on May 29, 2023
