Scientometrics is a science that provides quantitative measures for the evaluation of scientific output through the analysis of bibliographic information. It is most commonly used in studies of the impact and reach of scientific works. In economics, its most prominent uses are in policy evaluation and research on innovation. Such studies make use of the fact that bibliographic data is often linked to other economic phenomena. One example of such a relation is the citation of scientific publications in patent publications. Figure 1 shows an excerpt from a patent publication.
Fig 1. Scientific citations in a patent publication do not allow scientometric measures
Scientometrics is a way to measure how science, technology, and the economy are intrinsically connected with each other. Understanding how they are connected undeniably constitutes valuable economic knowledge. However, in order to understand the connections between those phenomena, a suitable methodological environment needs to be created.
The following example represents a classical use case of investigation in scientometrics.
PATSTAT is a product of the European Patent Office (EPO, https://www.epo.org). It is a periodical snapshot of patent-related information organized in a relational database model. Records on patent applications, their applicants and publications are available. A table with the code name “tls214_npl_publn” (often referred to as TLS214) stores information on bibliographic references like the one shown in Figure 1. This table contains more than 30 million records. The records, however, are often duplicated or inaccurate. Moreover, a full bibliographic reference is stored in only one attribute. This makes it problematic to query the table for relevant information, for example, to retrieve an author’s name or the date of a specific publication.
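To illustrate why a single free-text attribute is hard to query, the following minimal Python sketch tries to extract an author name from two reference strings that are quoted in the example cluster later in this text. The pattern `SURNAME_FIRST` and the helper `parse_author` are hypothetical names introduced here for illustration only; they are not part of PATSTAT, and the assumed "Surname, Initials" layout is just one of many conventions found in the data.

```python
import re

# Hypothetical pattern: assumes the reference starts with "Surname, Initials".
SURNAME_FIRST = re.compile(
    r"^(?P<surname>[A-Za-z]+),\s*(?P<initials>(?:[A-Z]\.\s?)+)"
)


def parse_author(npl_biblio):
    """Return (surname, initials), or None when the assumed layout does not apply."""
    match = SURNAME_FIRST.match(npl_biblio)
    if match is None:
        return None
    return match.group("surname"), match.group("initials").strip()


# Two variants of the same reference, copied from the example cluster.
RECORD_SURNAME_FIRST = (
    "Codd, E.F., A Relational Model of Data for Large Shared Data Banks, "
    "Communications of the ACM, 13(6):377-387 (1970)."
)
RECORD_INITIAL_FIRST = (
    "E. Codd, A Relational Model of Data for Large Shared Data Banks, "
    "Communications of the ACM,vol. 13, No. 6, Jun. 1970, pp. 377-387."
)

print(parse_author(RECORD_SURNAME_FIRST))  # the assumed layout applies
print(parse_author(RECORD_INITIAL_FIRST))  # the assumed layout does not apply
```

The second call finds no author because the initial precedes the surname, so a rule written for one transcription convention silently misses records written in another.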
Disambiguation in the context of data management refers to the identification of unique entities within a dataset. Each such entity is identified by a unique identifier that can be assigned to many database records which effectively describe the same bibliographic entity. The problem of data duplication and ambiguity arises due to, among other reasons:
- Lack of consistent input (transcription) convention;
- Variable level of input (detail) accuracy;
- Missing data;
- Different order of transcription of the same information;
- Typos.
The following table excerpt illustrates the problem:
Example of 18 out of 56 records found by a simple search on the exact title match. Thus, even more records referring to the same entity may exist in the database.
All of the records shown above refer to the same entity – the paper by E.F. Codd on the relational database model (Codd, 1970). However, the references to the same entity are given in different ways or are simply duplicated. For example, record 7 does not contain Codd’s name initial or month information, while record 8 contains full transcription of the abbreviated name “ACM” at the end of the string. All the records describe the same bibliographic entity but are treated as distinct entities by the primary key of the TLS214 relation – the “npl_publn_id” attribute.
Such a design makes it very difficult to use the information in the table correctly. For example, say a researcher is interested in the relation between science and technology. She assesses that a scientific discovery is well proxied by the publication of a scientific paper, while a piece of technology can be modelled as a patent publication. However, due to the unsupervised procedure by which citations are added to the PATSTAT database, it is difficult for her to specify a query that takes into account all the possible variation in records that describe the same bibliographic entity. As a result, the researcher is unable to properly count all of the scientific references to the same bibliographic entity. This results in incorrect patent statistics. While it may be possible to identify a single researcher and his body of work (like E.F. Codd), studies on a population of researchers are impossible without prior cleansing and de-duplication of the PATSTAT database.
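The counting problem can be sketched with a small in-memory SQLite table. This is only an illustration under stated assumptions: the table below is a reduced stand-in for TLS214 that keeps just the “npl_publn_id” key and a single free-text attribute (called `npl_biblio` here, as in the example cluster later in this text), and it is loaded with the five reference strings quoted in that cluster. The real PATSTAT table is far larger and is not necessarily served from SQLite.

```python
import sqlite3

# Reduced, assumed stand-in for TLS214: one key and one free-text attribute.
ROWS = [
    (2219025, "Codd, E. F., A Relational Model of Data for Large Shared Data Banks, "
              "Communications of the Association for Computing Machinery, "
              "Association for Computing Machinery 13: Jun. 6, 1970, pp. 377-387, XP002219025."),
    (950805382, "CODD, E.F.: A Relational Model of Data for Large Shared Data Banks. "
                "In: Comm. of the ACM, Vol. 13, Nr. 6, Juni 1970, S. 377-387"),
    (953756074, "Codd, E.F., A Relational Model of Data for Large Shared Data Banks, "
                "Communications of the ACM, 13(6):377-387 (1970)."),
    (955210884, "E. Codd, A Relational Model of Data for Large Shared Data Banks, "
                "Communications of the ACM,vol. 13, No. 6, Jun. 1970, pp. 377-387."),
    (955405309, "Codd, E.F., A Relational Model of Data for Large Shared Data Banks, "
                "Jun. 1970, Communications of the ACM, vol. 13, No. 6, pp. 377-387."),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tls214_npl_publn (npl_publn_id INTEGER PRIMARY KEY, npl_biblio TEXT)"
)
conn.executemany("INSERT INTO tls214_npl_publn VALUES (?, ?)", ROWS)

TITLE = "A Relational Model of Data for Large Shared Data Banks"
pattern = f"%{TITLE}%"

# Simple search on the exact title: every variant is returned under its own key.
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT npl_publn_id FROM tls214_npl_publn "
        "WHERE npl_biblio LIKE ? ORDER BY npl_publn_id",
        (pattern,),
    )
]

# Counting distinct reference strings does not collapse the variants either.
distinct_strings = conn.execute(
    "SELECT COUNT(DISTINCT npl_biblio) FROM tls214_npl_publn WHERE npl_biblio LIKE ?",
    (pattern,),
).fetchone()[0]

print(len(matching_ids), distinct_strings)
```

Both counts equal the number of rows, although all rows describe one paper. A title search of this kind also only works when the title is known in advance and transcribed without typos, which is why it does not scale to a population of researchers.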
Scientometrics: Problem of disambiguating scientific references
The objective of an automated cleaning method is the disambiguation of all scientific references in the original table of the PATSTAT database, proposing techniques for advancing the Scientometrics research field. “Scientific references” refers to the types of records in the table that describe entities that can be classified as publications. Notice that not all records in the table are scientific references; some are references to other patents. The final result of the procedure is a table with clusters of name variants (records) for each unique scientific entity. The next table presents an example cluster with label 231 for the paper by E.F. Codd.
cluster_id npl_publn_id npl_biblio
231 2219025 Codd, E. F., A Relational Model of Data for Large Shared Data Banks, Communications of the Association for Computing Machinery, Association for Computing Machinery 13: Jun. 6, 1970, pp. 377-387, XP002219025.
231 950805382 CODD, E.F.: A Relational Model of Data for Large Shared Data Banks. In: Comm. of the ACM, Vol. 13, Nr. 6, Juni 1970, S. 377-387
231 953756074 Codd, E.F., A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, 13(6):377-387 (1970).
231 955210884 E. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM,vol. 13, No. 6, Jun. 1970, pp. 377-387.
231 955405309 Codd, E.F., A Relational Model of Data for Large Shared Data Banks, Jun. 1970, Communications of the ACM, vol. 13, No. 6, pp. 377-387.
… … …
… … …
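As a rough illustration of how such clusters could be formed, the sketch below groups reference strings whose token sets overlap strongly. It is not the method of the thesis proposal: token-level Jaccard similarity, the 0.5 threshold, single-link merging with a union-find structure, and the function names are all assumptions chosen for brevity. The record with id -1 is a hypothetical, made-up reference added only to show that unrelated strings stay separate; the other five strings are copied from the table above. Comparing all pairs is also only feasible for small samples, not for a table with more than 30 million records.

```python
import re
from itertools import combinations

RECORDS = {
    2219025: "Codd, E. F., A Relational Model of Data for Large Shared Data Banks, "
             "Communications of the Association for Computing Machinery, "
             "Association for Computing Machinery 13: Jun. 6, 1970, pp. 377-387, XP002219025.",
    950805382: "CODD, E.F.: A Relational Model of Data for Large Shared Data Banks. "
               "In: Comm. of the ACM, Vol. 13, Nr. 6, Juni 1970, S. 377-387",
    953756074: "Codd, E.F., A Relational Model of Data for Large Shared Data Banks, "
               "Communications of the ACM, 13(6):377-387 (1970).",
    955210884: "E. Codd, A Relational Model of Data for Large Shared Data Banks, "
               "Communications of the ACM,vol. 13, No. 6, Jun. 1970, pp. 377-387.",
    955405309: "Codd, E.F., A Relational Model of Data for Large Shared Data Banks, "
               "Jun. 1970, Communications of the ACM, vol. 13, No. 6, pp. 377-387.",
    # Hypothetical record, invented for this illustration only.
    -1: "Hypothetical Author, An Unrelated Paper on Patent Statistics, Some Journal, 1999.",
}


def tokens(reference):
    """Lower-case the string and keep the set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", reference.lower()))


def jaccard(a, b):
    """Share of tokens that two token sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(records, threshold=0.5):
    """Map each record id to a cluster representative (single-link merging)."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    token_sets = {rid: tokens(text) for rid, text in records.items()}
    for a, b in combinations(records, 2):
        if jaccard(token_sets[a], token_sets[b]) >= threshold:
            parent[find(a)] = find(b)
    return {rid: find(rid) for rid in records}


labels = cluster(RECORDS)
for rid, label in labels.items():
    print(label, rid)
```

With these inputs the five Codd variants share one representative and the hypothetical record keeps its own. The representative plays the role of `cluster_id` in the table above, although the actual label values (such as 231) would come from whatever numbering the real procedure uses.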
This introduction should serve as a framework for understanding the following thesis proposal in the area of scientometrics: Human Enhanced Machine Driven-Categorization