Refining Patent Searches via Machine Learning Models

Google Develops Phrase Database for Patent Search Models Enhancement

Refining Artificial Intelligence Algorithms for Patent Research

In the realm of patent search, a comprehensive and accurately labelled dataset is crucial for effective model training. Google has addressed this with its Patent Phrase Similarity dataset, a publicly released collection of labelled phrase pairs designed specifically for training patent search models. For bulk raw patent text, however, practitioners still need to assemble their own corpora.

For those seeking bulk patent data, a common approach is to use third-party tools or open-source scripts, such as those found on GitHub, which automate the downloading of multiple patent documents from Google Patents or databases like Espacenet. These scripts mimic user actions with browser automation tools such as Selenium with ChromeDriver to download the associated patent PDFs in batches.
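The batch-download workflow just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a production scraper: the `patent_urls` and `download_batch` helpers are hypothetical names, Selenium and a matching ChromeDriver must be installed separately, and the site-specific step of locating the PDF link on each page is deliberately left as a placeholder because Google Patents' page structure can change.

```python
from typing import List


def patent_urls(pub_numbers: List[str]) -> List[str]:
    """Build Google Patents page URLs from publication numbers."""
    return [f"https://patents.google.com/patent/{num}/en" for num in pub_numbers]


def download_batch(urls: List[str], download_dir: str) -> None:
    """Visit each patent page with a headless browser so attached PDFs
    can be saved. Requires `pip install selenium` plus a ChromeDriver."""
    from selenium import webdriver  # imported lazily; optional dependency

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    # Route downloads to a known directory instead of the browser default.
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )

    driver = webdriver.Chrome(options=options)
    try:
        for url in urls:
            driver.get(url)
            # Site-specific logic to locate and click the PDF link goes here.
    finally:
        driver.quit()
```

For example, `patent_urls(["US9876543B2"])` produces the single page URL for that publication, which `download_batch` would then visit.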

While these approaches don't provide a "dataset of phrases," they do allow for the collection of large volumes of patent text, essential for further processing. For instance, the European Patent Office's Espacenet offers free access to over 130 million worldwide patent documents, providing a substantial resource for those in the patent search field.

To create a dataset of phrases or queries, the typical approach is to collect patent full texts in bulk, extract phrases or terms of interest using natural language processing (NLP) techniques, and curate your own dataset from the raw patent texts.
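As a concrete sketch of the phrase-extraction step, the snippet below pulls frequent multi-word candidate terms out of raw patent text using only the Python standard library. It is an assumption-laden toy: a real pipeline would use a proper NLP toolkit, and the stopword list here is a small illustrative sample, not a complete one.

```python
import re
from collections import Counter
from typing import List, Tuple

# Small illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "for", "is", "said"}


def extract_candidate_phrases(
    text: str, n: int = 2, top_k: int = 10
) -> List[Tuple[str, int]]:
    """Return the top_k most frequent n-word phrases with no stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Slide an n-token window over the text and count stopword-free grams.
    counts = Counter(
        " ".join(gram)
        for gram in zip(*(tokens[i:] for i in range(n)))
        if not any(tok in STOPWORDS for tok in gram)
    )
    return counts.most_common(top_k)
```

Run on a claim-like sentence such as "a spherical recreation device comprising a spherical recreation device", it surfaces "spherical recreation" and "recreation device" as the most frequent candidate terms.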

It's important to note that many patent owners use non-standard language to describe their inventions, which can produce widely varied and impractical search results. To counter this, Google's dataset contains approximately 50,000 phrase-to-phrase pairs, each labelled to denote how the phrases relate to one another, with relationship labels including exact match, synonym, and unrelated.
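To make the pair structure concrete, the snippet below models a few labelled phrase-to-phrase pairs and a lookup helper of the kind a training loop might use. The example pairs and the exact label strings are illustrative assumptions in the spirit of the scheme described above, not actual rows from the dataset.

```python
from typing import Dict, List, Tuple

# Illustrative labelled pairs following the label scheme described above
# (exact match, synonym, unrelated); these are not real dataset rows.
PHRASE_PAIRS: List[Tuple[str, str, str]] = [
    ("soccer ball", "spherical recreation device", "synonym"),
    ("soccer ball", "soccer ball", "exact"),
    ("soccer ball", "semiconductor wafer", "unrelated"),
]


def build_index(pairs: List[Tuple[str, str, str]]) -> Dict[Tuple[str, str], str]:
    """Index pairs by (anchor, target) so the relationship label for any
    known phrase pair can be looked up in constant time."""
    return {(anchor, target): label for anchor, target, label in pairs}


index = build_index(PHRASE_PAIRS)
```

With this index, `index[("soccer ball", "spherical recreation device")]` returns `"synonym"`, which is exactly the supervision signal a phrase-similarity model trains against.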

The image used in this article is a user-submitted photo from Flickr, courtesy of Nick Normal; it is not part of the patent search dataset. It simply illustrates the diverse and creative language found in patents, such as describing a soccer ball as a "spherical recreation device."

In conclusion, Google's labelled phrase dataset gives model builders a ready starting point, but the practical approach for building broader training datasets still involves collecting patent texts in bulk, extracting phrases of interest, and curating your own dataset. Combined with large patent databases like Espacenet, this provides a valuable resource for anyone looking to improve the accuracy of patent search models.

In practice, constructing a dataset for patent search models means first using data and cloud-computing tools, such as third-party services, open-source scripts, or Selenium with ChromeDriver, to collect patent full texts in bulk. Natural language processing (NLP) techniques can then be applied to extract phrases of interest and curate a custom dataset.

With such NLP techniques and large-scale patent sources like Espacenet, it becomes possible to build a dataset of roughly 50,000 labelled phrase-to-phrase pairs, improving the accuracy of patent search models.
