Added value through indexing
Experts assume that the knowledge of mankind doubles every five to six years and predict continued exponential growth in the years to come. This knowledge is described and presented in scientific publications, patents, compilations of facts, or pure research data and stored in print or electronic form. To make this knowledge retrievable, it is necessary to extract and index information and to make it available through appropriate information services.
What does “indexing” mean?
Indexing means preparing information so that it can be retrieved and used. The process comprises three steps:
- Traditional indexing of information according to formal guidelines by systematically transferring existing information such as bibliographic data. In addition, human editors add keywords, classifications and, in some cases, abstracts.
- Extended indexing through automatic standardization of information such as patent numbers or chemical structures.
- Indexing of knowledge by systematic, automatic extraction and interpretation of information through mathematical algorithms and semantic methods (extraction of mathematical formulae, identification of chemical names in full texts, etc.).
What information objects do we index?
We work in various disciplines and index many different types of information objects:
- Patent literature from all technological disciplines, mainly from full texts provided by patent authorities worldwide (STN)
- Scientific literature for our mathematics databases (zbMATH, MathEduc) and for the International Nuclear Information System (INIS), which covers the civil use of nuclear energy
- Inorganic crystal structures from scientific publications (ICSD)
- Mathematical software from mathematics publications (swMATH)
What do we index, and how?
Traditional indexing according to formal guidelines comprises recording information on scientific publications, or patent information in the case of patent documents, followed by content verification, relevance and plausibility checks, and finally the correction of errors.
Some of the abstracts provided by the authors are included directly; mathematical publications are also accompanied by stand-alone reviews written by researchers.
In addition, keywords from a controlled vocabulary or thesaurus are assigned and the objects are classified by subject.
During the extended indexing process, a large amount of information has to be standardized. Normally, existing standards can be adopted, but in some cases we need to define our own. For example, we developed our own standard for patent numbers so that they can be searched across various STN patent databases.
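To illustrate what standardizing patent numbers can involve, the following is a minimal sketch in Python. The canonical layout used here (country code, serial number without leading zeros, optional kind code) and the function name normalize_patent_number are assumptions chosen for illustration; they do not reproduce the actual STN standard, which is not described in this text.

```python
import re

# Hypothetical canonical form: country code + serial digits + optional kind code,
# e.g. "US7654321B2". This layout is an assumption for illustration only.
_PATTERN = re.compile(r"^([A-Z]{2})(\d+)([A-Z]\d?)?$")


def normalize_patent_number(raw: str) -> str:
    """Return a canonical patent number, or raise ValueError if unparseable."""
    # Drop separators that vary between sources: whitespace, commas, hyphens,
    # slashes and dots. Then compare in upper case.
    compact = re.sub(r"[\s,\-/.]", "", raw).upper()
    match = _PATTERN.match(compact)
    if match is None:
        raise ValueError(f"unrecognized patent number: {raw!r}")
    country, serial, kind = match.groups()
    # Strip leading zeros so that "EP 0123456" and "EP123456" compare equal.
    return f"{country}{int(serial)}{kind or ''}"


if __name__ == "__main__":
    for raw in ["US 7,654,321 B2", "us7654321b2", "EP-0123456-A1"]:
        print(raw, "->", normalize_patent_number(raw))
```

Once every database stores the same canonical string, a single query term can match the same document regardless of how the original source wrote the number.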
Each producer of structure databases uses its own system of recording chemical structures. These different systems have to be standardized as much as possible to enable the users to search various structure databases using just one search structure.
In certain cases, automatic indexing methods using special algorithms can be applied. In an ambitious project for its zbMATH database, FIZ Karlsruhe has unambiguously assigned the names of the authors to the corresponding mathematical publications by means of algorithmic methods and, in some cases, also by subsequent processing by human editors.
As a result, author profiles with co-authors, classifications and the corresponding journals can be generated. References were also added to these profiles so that unambiguous mathematical citation networks can be established. These citation profiles are presently available as a beta version.
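The first stage of such an author assignment can be sketched as follows. This is not the algorithm used for zbMATH, which is not described here; it only shows a common preliminary technique, grouping name variants under a shared key ("blocking"), under the assumption that names arrive as "Surname, Given". The function names and sample records are hypothetical.

```python
import unicodedata
from collections import defaultdict


def name_key(name: str) -> str:
    """Reduce 'Surname, Given' to an accent-free 'surname g' blocking key."""
    # Decompose accented characters and drop the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    surname, _, given = ascii_name.partition(",")
    return f"{surname.strip().lower()} {given.strip()[:1].lower()}".strip()


def group_by_author(records):
    """Group (publication_id, author_name) pairs by their blocking key."""
    groups = defaultdict(list)
    for pub_id, author in records:
        groups[name_key(author)].append(pub_id)
    return dict(groups)


if __name__ == "__main__":
    records = [
        ("p1", "Erdős, Paul"),
        ("p2", "Erdos, P."),
        ("p3", "Gauss, Carl Friedrich"),
    ]
    print(group_by_author(records))
```

A key like this only produces candidate groups: two different people can share a surname and initial. Further evidence, such as co-authors or subject classifications, and, where necessary, a human editor would be needed to split or confirm each group.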
By means of mathematical algorithms and semantic methods it is possible to extract complex entities from full texts and to classify and standardize them for searching (scientific indexing). In a joint project with Jacobs-Universität Bremen, mathematical formulae were extracted from publications and made available via a special search engine.
A new project with the company InfoChem is dedicated to the indexing of chemical information from patent documents. The chemical names are extracted and transformed into searchable chemical structures. In both projects, ambiguous text information is transformed into standardized and classified entities. This means that even very complex information objects can be found precisely.
Dynamic integration of information objects
An important topic in information indexing is the dynamic integration of information objects. Similar documents form networks of related units, e.g. patent families or author networks.
Definitions of how to merge the patent applications and granted patents for an invention differ widely. The patents for one invention are often distributed across different patent offices around the globe and can therefore have different legal statuses.
To merge the different patent families into a homogeneous concept, FIZ Karlsruhe has, together with its partner Chemical Abstracts Service, developed the Patent Family Index (PFI). The PFI makes it possible to identify the different patent families to which a patent document belongs and to access the corresponding documents in the databases of different producers.
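How documents can be grouped into families is sketched below. The sketch assumes one possible family definition, namely that two documents belong together if they are linked, directly or indirectly, through a shared priority number. It is not the PFI's actual logic, which is not described here; the function build_families and the document numbers are hypothetical.

```python
from collections import defaultdict


def build_families(priorities):
    """Group documents into families: connected components over shared priorities.

    `priorities` maps a document number to the set of priority numbers it claims.
    """
    parent = {doc: doc for doc in priorities}

    def find(doc):
        # Follow parent links to the representative, compressing the path.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    first_doc_for_priority = {}
    for doc, claimed in priorities.items():
        for prio in claimed:
            if prio in first_doc_for_priority:
                # Same priority seen before: merge the two groups.
                parent[find(doc)] = find(first_doc_for_priority[prio])
            else:
                first_doc_for_priority[prio] = doc

    families = defaultdict(set)
    for doc in priorities:
        families[find(doc)].add(doc)
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    example = {
        "US1": {"P1"},
        "EP1": {"P1", "P2"},
        "JP1": {"P2"},
        "DE9": {"P9"},
    }
    print(build_families(example))
```

In this example, US1 and JP1 share no priority with each other, but both are linked through EP1 and therefore fall into the same family. A stricter definition, such as requiring identical priority sets, would split them, which is why a common index across producers is useful.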
For sophisticated information searches it is important that the results are complete (recall) and that as many unwanted results as possible are eliminated (precision).
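Both measures have standard definitions: recall is the share of all relevant documents that the search actually found, and precision is the share of the retrieved documents that are relevant. A minimal sketch with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    retrieved = {"d1", "d2", "d3", "d4"}  # what the search returned
    relevant = {"d1", "d2", "d5"}         # what it should have returned
    print(precision_recall(retrieved, relevant))
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67). Standardized index terms tend to raise recall, because variant spellings match the same entry, while classification and controlled vocabulary help to raise precision.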
FIZ Karlsruhe’s information indexing improves both recall and precision, so that searches become broader and at the same time more targeted.
Based on the standardized information, the results can be scientifically analyzed, visualized in diagrams, and then used for research or for business-critical decisions.