The PatSeq Finder is a sequence similarity search tool based on BLAST that lets you search the Lens patent sequence (PatSeq) databases for matches to a sequence of interest. The tool is unique in that it enables sequence-based searches across more than 250 million patent sequences, which we serve in either a nucleotide-based or a protein-based database.
How to Use
The PatSeq Finder Search page is simple: you can either upload a sequence (as a FASTA file) or paste a sequence of interest.
At the top of the page is the sequence input section, where you can enter a sequence in FASTA format, a standard format for biological sequences. You can also upload a sequence file and optionally choose to search with only part of the sequence. In this upgraded version you can name and save your searches, but you need to log in to the Lens before you start your search.
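To make the FASTA format concrete, here is a minimal Python sketch that splits pasted text into header and sequence pairs and guesses the sequence type. It is an illustration only and not the parser PatSeq Finder uses; the names `parse_fasta` and `guess_sequence_type` are hypothetical, and the type guess is a rough heuristic (a protein made up only of the letters A, C, G, T, U and N would be misread as a nucleotide sequence).

```python
from typing import Iterator, Tuple

NUCLEOTIDE_LETTERS = set("ACGTUN")  # plain bases plus the N wildcard


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                header = ""  # bare sequence pasted without a header line
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def guess_sequence_type(sequence: str) -> str:
    """Guess 'nucleotide' or 'protein' from the letters used (heuristic)."""
    if sequence and set(sequence) <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    return "protein"


example = """>my_query example primer
ATGGCGTACG
TTAGC
"""
records = list(parse_fasta(example))
for name, sequence in records:
    print(name, len(sequence), guess_sequence_type(sequence))
```

A sequence may span several lines after its header line; the lines are simply joined, which is why the two sequence lines in the example become a single 15-letter query.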
Next, select the database to search, based on the type of sequence you entered above. If you entered a DNA or RNA sequence, select the nucleotide PatSeq database; if you entered a protein sequence, select the amino acid PatSeq database. If you enter PatSeq Finder from the sequence tab, the appropriate database is selected automatically.
In BLAST options you set the basic type of BLAST search you wish to make. You can select from:
- blastn – nucleotide query vs nucleotide database
- blastp – protein query vs protein database
- blastx – translated nucleotide query vs protein database
- tblastx – translated nucleotide query vs translated nucleotide database
- tblastn – protein query vs translated nucleotide database
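The combinations in the list above can be summarised as a small lookup table. The Python sketch below is only an illustration of that list; `PROGRAMS` and `programs_for` are hypothetical names and not part of PatSeq Finder or NCBI BLAST.

```python
from typing import Dict, List, Tuple

# (query type, database type) -> BLAST programs that accept that combination
PROGRAMS: Dict[Tuple[str, str], List[str]] = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],  # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],   # query is translated
    ("protein", "nucleotide"): ["tblastn"],  # database is translated
}


def programs_for(query_type: str, database_type: str) -> List[str]:
    """Return the BLAST programs matching a query/database type pair."""
    key = (query_type, database_type)
    if key not in PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return PROGRAMS[key]


print(programs_for("protein", "nucleotide"))
```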
You can also optimise your search for how similar you expect the query and the matching sequences to be. When searching for near-identical sequences, keep the default “Highly Similar Sequences (>95%) (megablast)”.
Standard NCBI BLAST scoring parameters are set as defaults, and you do not need to worry about them in a general search. If you use very short sequences, we adjust the BLAST parameters automatically so that such searches still work. For the recommended parameters for short sequences, see the Frequently Asked Questions below.
The default general parameters show 500 hits. You can choose up to 20,000 hits by changing “Maximum number of hits to Show” in the advanced options menu, where you can also choose the Substitution Matrix.
Once you have checked your search parameters and labeled your search, you can click the Submit Search button to be taken to the PatSeq Finder Result Page. Please note that this process can take between 30 seconds and 20 minutes, or longer for very large sequences, and that during high load your query may fail to complete. Check that you have selected the appropriate database to search in; if you experience delays, please come back and try your query later, and send us some feedback. We are currently working on improving our server to accommodate more users and a higher rate of searches at any one time.
Once you are sure that the right database and sequence type are selected, you can click the search button. If your sequence is short, a small tooltip will appear asking whether you would like to optimise the search.
The Results page displays the BLAST search results, i.e. the patent sequences aligned to your query sequence, with and without their corresponding patent documents. At the top of the page you can check the number of PatSeq Finder results (hits) returned by the query and learn more from the “Search summary details” in the header. If there is a star next to the hits, you may want to search for a larger number of hits. To do so, click on “modify your search parameters” and increase that number in the advanced options menu on the search form page until the star next to the hits disappears.
Regions of similarity between your query sequence (nucleotide or protein) and patent sequences referenced in claims of patent documents or simply disclosed in patent documents (grants and applications) are displayed in an alignment browser. Patent sequences are identified by their sequence identification number (SEQ ID No) and in the detailed view, depicted along with their corresponding patent document number, E-value, Similarity and Coverage percentage to the query sequence.
Each row represents a patent sequence. By clicking on the sequence bar, you can view more details on the sequence and corresponding patent attributes in a pop-up window.
You can also filter, sort and select aligned patent sequences based on the location of the sequence in patent documents. For example, the red category represents patent sequences referenced in the claims of granted patents, and the number of sequences aligned with the query sequence is shown in parentheses.
Our parsers have not yet been tuned to extract the exact position in the full text where a SEQ ID NO is referenced, so in those cases we assign the location code “WDESC”, depending on the jurisdiction and the full-text formats available to us. The CLAIM location is not affected by this and is extracted separately.
When granular full text information is available we assign:
- DDESC : the SEQ ID NO has been referenced in the “detailed description” section of the full-text
- BSUMM: the SEQ ID NO has been referenced in the “brief summary” section of the full-text
- BDRAW: the SEQ ID NO has been referenced in the “brief description of drawings section” of the full-text
If no granular full-text location information is available we assign:
- WDESC : the SEQ ID NO has been referenced in the full-text, but the exact section is unknown (includes all three above)
When you look at the corresponding sequence listings for each patent, you can see more user-friendly location descriptions.
For example a sequence listing with WDESC locations (called “Full-text”):
Sequence listing with DDESC locations:
If more than one segment of a particular sequence is aligned to the query sequence, the additional segments are also shown in the lower part of the row. Please note that the segments are not necessarily located in the mapped order or orientation within the patent sequence.
If you want to filter the BLAST results further, you can use the filter box on the left side to select, for example, another E-value, or a different coverage or similarity percentage.
You can download results in multiple formats using the download button, and you can even embed the sequence results on your own site using the embed button in the bottom left of the page. You can also export results as an Excel or Word file.
Patent and Sequence details
Click on “VIEW DOCUMENT” to go to the full-text information about that patent.
If you are interested in the pairwise alignment view for that particular SEQ ID No, you can view the actual alignment in a new tab “Alignments” of the pop up window.
Frequently Asked Questions
1. Is there a minimum length of sequence under which PatSeq Finder will not perform a BLAST search?
We do not impose any length limit and use the default NCBI BLAST parameters. For short sequences we recommend:
- deactivate the “low complexity filter”
- use the blastn algorithm (3rd option): the default megablast algorithm has a word size of 28, which is too “coarse” for short sequences.
- increase the E-value to 100 or higher. (“The lower the E-value, or the closer it is to zero, the more “significant” the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.”)
- reduce the size of the initial search seed (word size): the default for blastn is 11, and you could use 7 instead to increase sensitivity for your short sequences.
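To see why short queries call for a higher E-value cutoff, the sketch below applies the standard BLAST relation E = m × n × 2^(−S′), where m is the query length, n the total database length and S′ the bit score. It is a rough illustration only: it ignores BLAST's effective-length correction, and both `BITS_PER_MATCH` (about 2 bits per identical base) and `DATABASE_LENGTH` are assumed, illustrative values rather than PatSeq figures.

```python
def expected_hits(query_length: int, database_length: int, bit_score: float) -> float:
    """E = m * n * 2**(-S'), without BLAST's effective-length correction."""
    return query_length * database_length * 2.0 ** (-bit_score)


BITS_PER_MATCH = 2.0        # assumption: roughly 2 bits per identical base
DATABASE_LENGTH = 10 ** 11  # assumption: illustrative total number of bases

for length in (18, 25, 50):
    # A perfect match can score at most about BITS_PER_MATCH * length bits.
    e_value = expected_hits(length, DATABASE_LENGTH, BITS_PER_MATCH * length)
    print(f"perfect {length}-mer match: E = {e_value:.3g}")
```

Under these assumed numbers, even a perfect 18-mer match ends up with an E-value well above 1, while a perfect 50-mer match is many orders of magnitude below it. A strict E-value cutoff would therefore hide genuine short matches.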
2. What are low complexity regions? Why should low complexity regions be filtered?
Regions of a sequence containing few kinds of elements are called low complexity regions and including them may lead to misleading results. These regions should be filtered out to allow the program to find the significant and related sequences in the database. For more information on low complexity regions see:
3. How should I set general parameters in a BLAST search?
For guidelines on BLAST search parameters, see:
(see section B: BLAST search parameters)
4. How should I set scoring parameters?
“Reward” and “penalty” form a scoring system for nucleotide searches: a “reward” is added for each match and a “penalty” is applied for each mismatch. For more information see: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Reward-penalty
The substitution matrix is an element of the calculation of the bit score. The bit score indicates how good the alignment of the query is to the hit. The BLOSUM62 matrix is the default matrix for BLAST programs, except for blastn and MegaBLAST. The latter two perform nucleotide-nucleotide comparisons and do not use protein-specific matrices. For more information see: http://www.ncbi.nlm.nih.gov/books/NBK21097/#A611
5. How can I save a BLAST search? How can I embed a BLAST search result?
Once you have performed a BLAST search, the URL used to view the results can be saved and bookmarked to view at a later time. Alternatively, you can download the search results in plain text, HTML and XML formats and save the results. Using the embed link on the PatSeq Finder Results page, you can easily embed your BLAST search results.
6. How may I combine the search results of two sequences?
We currently do not support combining the results of multiple sequence searches, but it is on our development roadmap. In the meantime, we recommend using the Excel export feature for bulk processing and then combining the exported files:
- Apply filters on PatSeq Finder results (optional)
- Click on “Export results…” on the left-hand side
- Select the fields of interest (“Patent publication number”)
- Click on “Export as Excel” which should download the Excel file.
- Repeat the four steps above for all remaining queries that you want to combine
- Determine the common publication numbers in the exported files.
There are multiple ways of doing that, for example: filter-common-values-from-three-columns-in-excel, or programmatically using command-line tools like sort/join.
If you want to create a Lens patent collection with the common publications, you can use the “import” feature in your Lens work area and just paste the list of “patent publication numbers” from the exported files above.
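As one programmatic option, the common publication numbers can be found with a set intersection. The Python sketch below assumes each export has been saved as CSV and that the relevant column is headed “Patent publication number”; the real header in your export may differ, and `read_publication_numbers` and `common_publications` are hypothetical helper names. The publication numbers in the example are made-up placeholders.

```python
import csv
from typing import Iterable, Set

PUBLICATION_COLUMN = "Patent publication number"  # assumed column header


def read_publication_numbers(csv_path: str, column: str = PUBLICATION_COLUMN) -> Set[str]:
    """Collect the publication numbers from one exported file saved as CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[column].strip() for row in csv.DictReader(handle) if row.get(column)}


def common_publications(exports: Iterable[Set[str]]) -> Set[str]:
    """Return the publication numbers present in every export."""
    sets = list(exports)
    return set.intersection(*sets) if sets else set()


# Made-up placeholder numbers standing in for two exported result sets.
query_a = {"XX 0000001 A1", "XX 0000002 B2", "XX 0000003 A1"}
query_b = {"XX 0000002 B2", "XX 0000003 A1", "XX 0000004 A1"}
shared = common_publications([query_a, query_b])
print(sorted(shared))
```

With real exports you would build each set with `read_publication_numbers("query_a.csv")` (a hypothetical file name) and pass the sets to `common_publications` in the same way.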
7. PatSeq Finder is not working, help!!
If you face an issue with PatSeq Finder, please send the full search URL that is displaying the error message, e.g. https://www.lens.org/lens/bio/patseqfinder#results/23d58224-54ad-4b56-9e49-7f185aec74e0, to firstname.lastname@example.org.
Be aware that we use NCBI BLAST for our sequence searches. This tool requires valid FASTA-formatted input sequences, i.e. a single-letter amino acid or nucleotide query sequence. You can find more details at wikipedia/FASTA format or NCBI Blast.
8. I am trying to find out when sequences from certain patents were added to the database of sequences searched at lens.org using the PatSeq Finder function. Is that information available?
We currently do not capture the date when a sequence listing was added to our database. We provide the date a sequence listing was made publicly available in our data sources. You can find this information in the tooltip at the top right of the sequence listing page. We pull sequence data from our feeds on a monthly basis, so for recent patent publications the date that a sequence became available in the Lens will not be far off the publication date in the source.
9. Is there a way to find out which patents’ sequences are indexed by PatSeq Finder?
Doing an empty search in PatSeq Text will list the patents that contain sequence listings: https://www.lens.org/lens/search?q=&sat=N%2CP
All these sequences are searchable in PatSeq Finder.
10. When I search for a sequence in PatSeq Finder, the results do not return a specific patent matching this sequence, even though we know from a manual check that the patent exists in lens.org and that it discloses and claims this sequence. Could you explain why, and whether this is an isolated problem due to some bug, or is your database generally incomplete and/or unreliable?
Sometimes a particular sequence may not have been extracted into a public patent database, or the patent office SEQL product may not include the sequence listing. The most likely explanation is that the sequence has not yet been provided in a standardised format for the PatSeq team to include in the database. However, you may want to check whether the sequence can be found in a family member: when searching for it in PatSeq Finder, filter the results by “Grants, in claims” and check some of the family members in the results.
11. Is there a way to search for sequences with less than 90% similarity? How does the filter showing minimum similarity work please?
On the PatSeq Finder results page, the minimum %Similarity bar shows the range found in your results. For example, in the screenshot below the range found was between 100% and 50%. To view the lower end of the range, all you need to do is sort by %Similarity. Sorting by %Similarity in ascending order lets you screen the sequences with the lowest % similarity. You can also sort by coverage, organism, and other attributes. Enjoy exploring sequence similarity searches in the Lens.
12. When searching for a primer in the PatSeq database with PatSeq Finder, do I need to search the reverse complement of that primer too?
In PatSeq Finder, the Lens uses the BLAST algorithm, and for a primer search, if you use blastn, both forward and reverse strands will automatically be aligned. Thus there is no need to perform another search with the reverse complement.
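For reference, the reverse complement that blastn covers for you is the sequence read backwards with each base swapped for its pairing partner. A minimal Python sketch follows; `reverse_complement` is a hypothetical helper name, and the output is written as DNA, so U is treated like T.

```python
# Map each base to its complement; N stays N, and U is treated like T.
COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


primer = "GATTACA"
print(primer, "->", reverse_complement(primer))
```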
13. Is there a way you can search for motifs within a peptide sequence? Do you provide wildcard characters for amino acids and nucleotide searches?
PatSeq Finder does allow for wildcards as per the IUB/IUPAC alphabet: X for amino acids and N for nucleotide searches (e.g. https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation, https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp). Note that lowercase letters can have a special meaning depending on your “Mask lower case letters” query parameter, which is disabled by default.
This PatSeq Finder search provides an example using amino acid wildcards: https://www.lens.org/lens/bio/patseqfinder#results/044bb5b9-7e64-496b-85e0-008a237069c2. N.B. BLAST treats non-explicit bases/residues as mismatches, so they are not truly wildcards that would receive a “reward” score. You can view the resulting alignments with X in your PatSeq Finder result.
14. In the PatSeq Finder advanced search parameters (for nucleotide sequences), what are the default values for “Reward and Penalty for (mis-)matching bases”?
The Reward and Penalty BLAST parameters vary depending on the search strategy (e.g. megablast or blastn). The default reward for a nucleotide match is 1 in megablast and 2 in blastn, or 1 when optimizing for short queries. The default penalty for a nucleotide mismatch is -2 in megablast and -3 in blastn. More details can be found on the NCBI BLAST help pages: https://www.ncbi.nlm.nih.gov/books/NBK279684/?report=reader&%2F%3Freport=reader#!po=12.5000; for blastn see Table C2.
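As a worked example of these defaults, the raw score of an ungapped alignment is the number of matches times the reward plus the number of mismatches times the penalty. The Python sketch below is an illustration only: gap costs are ignored, and `DEFAULT_SCORING` and `raw_score` are hypothetical names.

```python
from typing import Dict, Tuple

# (reward, penalty) defaults as described above
DEFAULT_SCORING: Dict[str, Tuple[int, int]] = {
    "megablast": (1, -2),
    "blastn": (2, -3),
    "blastn (short queries)": (1, -3),
}


def raw_score(matches: int, mismatches: int, strategy: str) -> int:
    """Raw score of an ungapped alignment under the given search strategy."""
    reward, penalty = DEFAULT_SCORING[strategy]
    return matches * reward + mismatches * penalty


# The same 20-base alignment with 2 mismatches, scored under each strategy.
for strategy in DEFAULT_SCORING:
    print(strategy, raw_score(matches=18, mismatches=2, strategy=strategy))
```

Raw scores from different strategies are not directly comparable; BLAST converts them to bit scores before calculating E-values.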
15. What is the best way to conduct an exact protein search?
There are no initial search parameters that can be adjusted to search only for exact matches of a protein; however, the best matching, ungapped alignments always get a higher score and are included in the results. After running the search, you can use the Minimum Coverage and Minimum Similarity filters in the filter sidebar to show only those results with 100% minimum coverage and 100% minimum similarity.
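If you work with downloaded results instead, the same filter can be applied programmatically. The Python sketch below assumes each hit has been loaded into a dictionary with `coverage` and `similarity` percentages; these field names, the `exact_matches` helper and the example hits are all hypothetical, and the real export columns may be named differently.

```python
from typing import List


def exact_matches(hits: List[dict],
                  min_coverage: float = 100.0,
                  min_similarity: float = 100.0) -> List[dict]:
    """Keep only hits that meet both the coverage and similarity thresholds."""
    return [
        hit for hit in hits
        if hit["coverage"] >= min_coverage and hit["similarity"] >= min_similarity
    ]


# Made-up example hits; percentages refer to the query sequence.
hits = [
    {"seq_id": "hit-1", "coverage": 100.0, "similarity": 100.0},
    {"seq_id": "hit-2", "coverage": 100.0, "similarity": 98.5},
    {"seq_id": "hit-3", "coverage": 80.0, "similarity": 100.0},
]
exact = exact_matches(hits)
print([hit["seq_id"] for hit in exact])
```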