Bibliometrics
Bibliometrics is the application of statistical methods to the study of bibliographic data, especially in scientific and library and information science contexts. It is closely associated with scientometrics to the point that both fields largely overlap.
Bibliometric studies first appeared in the late 19th century. They developed significantly after the Second World War, in a context of "periodical crisis" and new technical opportunities offered by computing tools. In the early 1960s, the Science Citation Index of Eugene Garfield and the citation network analysis of Derek John de Solla Price laid the foundations of a structured research program on bibliometrics.
Citation analysis is a commonly used bibliometric method based on constructing the citation graph, a network or graph representation of the citations shared by documents. Many research fields use bibliometric methods to explore the impact of their field, the impact of a set of researchers, the impact of a particular paper, or to identify particularly impactful papers within a specific field of research. Bibliometric tools have also been commonly integrated into descriptive linguistics, the development of thesauri, and the evaluation of reader usage. Beyond specialized scientific use, popular web search engines have been largely shaped by bibliometric methods and concepts, most notably through the PageRank algorithm implemented by Google.
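As a minimal illustration of citation analysis (not a description of any specific bibliometric product), the sketch below builds a toy citation graph in Python using the networkx library and compares raw citation counts with PageRank scores; the paper names and edges are invented for the example.

```python
import networkx as nx

# Toy data: each edge (a, b) means "paper a cites paper b".
citations = [
    ("paper_A", "paper_C"), ("paper_B", "paper_C"),
    ("paper_B", "paper_D"), ("paper_C", "paper_D"),
    ("paper_E", "paper_D"), ("paper_E", "paper_C"),
]
graph = nx.DiGraph(citations)

# Raw impact: how many times each paper is cited (in-degree of the graph).
citation_counts = {paper: graph.in_degree(paper) for paper in graph}

# PageRank-style impact: citations from well-cited papers weigh more.
pagerank_scores = nx.pagerank(graph)

for paper in sorted(graph, key=pagerank_scores.get, reverse=True):
    print(paper, citation_counts[paper], round(pagerank_scores[paper], 3))
```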
The emergence of the Web and the open science movement has gradually transformed the definition and the purpose of "bibliometrics." In the 2010s, historical proprietary infrastructures for citation data, such as the Web of Science or Scopus, were challenged by new initiatives in favor of open citation data. The Leiden Manifesto for Research Metrics opened a wide debate on the use and transparency of metrics.
Definition
The term bibliométrie was first used by Paul Otlet in 1934, and defined as "the measurement of all aspects related to the publication and reading of books and documents." The anglicized version bibliometrics was first used by Alan Pritchard in a paper published in 1969, titled "Statistical Bibliography or Bibliometrics?" He defined the term as "the application of mathematics and statistical methods to books and other media of communication." Bibliometrics was conceived as a replacement for statistical bibliography, the main label used by publications in the field until then: for Pritchard, statistical bibliography was too "clumsy" and did not make clear what the main object of study was.

The concept of bibliometrics "stresses the material aspect of the undertaking: counting books, articles, publications, citations". In theory, bibliometrics is a distinct field from scientometrics, which relies on the analysis of non-bibliographic indicators of scientific activity. In practice, bibliometric and scientometric studies tend to use similar data sources and methods, as citation data became the leading standard of quantitative scientific evaluation during the mid-20th century: "insofar as bibliometric techniques are applied to scientific and technical literature, the two areas of scientometrics and bibliometrics overlap to a considerable degree." The development of the web and the expansion of the bibliometric approach to non-scientific production entailed the introduction of broader labels in the 1990s and the 2000s: informetrics, webometrics or cybermetrics. These terms have not been extensively adopted, as they partly overlap with pre-existing research practices, such as information retrieval.
History
Depending on the definition, scientific works, studies and research of a bibliometric character can be identified as early as the 12th century, in the form of Jewish indexes.
Early experiments (1880–1914)
Bibliometric analysis appeared at the turn of the 19th and the 20th century. These developments predate the first occurrence of the concept of bibliometrics by several decades. Alternative labels were commonly used: "bibliography statistics" became especially prevalent after 1920 and remained in use until the end of the 1960s. Early statistical studies of scientific metadata were motivated by the significant expansion of scientific output and the parallel development of the indexing services and databases that made this information more accessible in the first place. Citation indexes were first applied to case law in the 1860s, and their most famous example, Shepard's Citations, would serve as a direct inspiration for the Science Citation Index one century later.

The emergence of the social sciences inspired new speculative research on the science of science and the possibility of studying science itself as a scientific object: "The belief that social activities, including science, could be reduced to quantitative laws, just as the trajectory of a cannonball and the revolutions of the heavenly bodies, traces back to the positivist sociology of Auguste Comte, William Ogburn, and Herbert Spencer." Bibliometric analysis was not conceived as a separate body of studies but as one of the available methods for the quantitative analysis of scientific activity in different fields of research: the history of science, bibliography, or the sociology of science.
Early bibliometric and scientometric work was not simply descriptive but expressed normative views of what science should be and how it could progress. The measurement of the performance of individual researchers, scientific institutions or entire countries was a major objective. The statistical analyses of James McKeen Cattell acted as preparatory work for a large-scale evaluation of American researchers with eugenicist undertones: American Men of Science, "with its astoundingly simplistic rating system of asterisks attached to individual entries in proportion to the estimated eminence of the starred scholar."
Development of bibliography statistics (1910–1945)
After 1910, the bibliometric approach increasingly became the main focus of several studies of scientific performance, rather than one quantitative method among others. In 1917, Francis Joseph Cole and Nellie B. Eales argued in favor of the primary statistical value of publications, as a publication "is an isolated and definite piece of work, it is permanent, accessible, and may be judged, and in most cases it is not difficult to ascertain when, where, and by whom it was done, and to plot the results on squared paper." Five years later, Edward Wyndham Hulme expanded this argument to the point that publications could be considered the standard measure of an entire civilization: "If civilization is but the product of the human mind operating upon a shifting platform of its environment, we may claim for bibliography that it is not only a pillar in the structure of the edifice, but that it can function as a measure of the varying forces to which this structure is continuously subjected." This shift toward publications had a limited impact: well into the 1970s, national and international evaluations of scientific activity "disdained bibliometric indicators", which were deemed too simplistic, in favor of sociological and economic measures.

Both the enhanced value attached to scientific publications as a measure of knowledge and the difficulties libraries faced in managing the growing flow of academic periodicals entailed the development of the first citation indexes. In 1927, P. Gross and E. M. Gross compiled the 3,633 references quoted by the Journal of the American Chemical Society during the year 1926 and ranked journals according to their level of citation. The two authors created a set of tools and methods still commonly used by academic search engines, including attributing a bonus to recent citations, since "the present trend rather than the past performance of a journal should be considered first." Yet the academic environment being measured was markedly different: German rather than English was by far the leading language of chemistry, accounting for more than 50% of all references.
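The Grosses' exact weighting is not reproduced here; the following hedged sketch merely illustrates, with invented references and an assumed recency bonus, how journals can be ranked by citation counts while giving extra weight to recent citations.

```python
from collections import Counter

# Hypothetical reference list: (cited_journal, year_of_cited_paper).
references = [
    ("Journal of the American Chemical Society", 1925),
    ("Berichte der deutschen chemischen Gesellschaft", 1913),
    ("Berichte der deutschen chemischen Gesellschaft", 1924),
    ("Journal of the Chemical Society", 1910),
    ("Journal of the American Chemical Society", 1922),
]

CURRENT_YEAR = 1926
RECENCY_WINDOW = 5   # assumption: citations from the last 5 years count double

scores = Counter()
for journal, year in references:
    weight = 2.0 if CURRENT_YEAR - year <= RECENCY_WINDOW else 1.0
    scores[journal] += weight

# Print journals from highest to lowest weighted citation score.
for journal, score in scores.most_common():
    print(f"{score:4.1f}  {journal}")
```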
In the same period, fundamental algorithms, metrics and methods of bibliometrics were first identified in several unrelated projects, most of them related to the structural inequalities of scientific production. In 1926, Alfred Lotka introduced his law of productivity from an analysis of the authored publications in the Chemical Abstracts and the Geschichtstafeln der Physik: the number of authors producing n contributions is roughly 1/n^2 times the number of authors that produced only one publication. In 1934, the chief librarian of the London Science Museum, Samuel Bradford, derived a law of scattering from his experience in bibliographic indexing: there are exponentially diminishing returns in searching for references in science journals, as more and more journals need to be consulted to find relevant work. Both Lotka's and Bradford's laws have been criticized as far from universal, uncovering rough power-law relationships rendered by deceptively precise equations.
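Taken at face value, Lotka's law lends itself to a simple worked example; the figure of 1,000 single-publication authors below is purely illustrative.

```python
import math

# Lotka's law in its idealized form: the number of authors with n
# publications is roughly authors_with_one / n**2.
authors_with_one = 1000  # illustrative value, not empirical data

for n in (1, 2, 3, 4, 10):
    print(f"authors with {n:>2} publications: ~{authors_with_one / n**2:.0f}")

# Summing 1/n**2 over all n gives pi**2 / 6, so under the idealized law
# single-publication authors account for about 61% of all authors.
total_authors = authors_with_one * math.pi**2 / 6
print(f"expected total authors: ~{total_authors:.0f}")
```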
Periodical crisis, digitization and citation index (1945–1960)
After the Second World War, the growing challenge of managing and accessing scientific publications turned into a full-fledged "periodical crisis": existing journals could not keep up with the rapidly increasing scientific output spurred by big science projects. The issue became politically relevant after the successful launch of Sputnik in 1957: "The Sputnik crisis turned the librarians' problem of bibliographic control into a national information crisis." In a context of rapid and dramatic change, the emerging field of bibliometrics was linked to large-scale reforms of academic publishing and nearly utopian visions of the future of science. In 1934, Paul Otlet had introduced under the concept of bibliométrie or bibliology an ambitious project of measuring the impact of texts on society. In contrast with the bounded definition of bibliometrics that would become prevalent after the 1960s, Otlet's vision was not limited to scientific publications, nor in fact to the publication as a fundamental unit: it aimed at "the resolution of texts into atomic elements, or ideas, which he located in the single paragraphs composing a book." In 1939, John Desmond Bernal envisioned a network of scientific archives, which was briefly considered by the Royal Society in 1948: "The scientific paper sent to the central publication office, upon approval by an editorial board of referees, would be microfilmed, and a sort of print-on-demand system set in action thereafter." While not using the concept of bibliometrics, Bernal had a formative influence on leading figures of the field such as Derek John de Solla Price.

The emerging computing technologies were immediately considered as a potential solution to make a larger amount of scientific output readable and searchable. During the 1950s and 1960s, an uncoordinated wave of experiments in indexing technologies resulted in the rapid development of key concepts of computerized information retrieval. In 1957, IBM engineer Hans Peter Luhn introduced an influential paradigm of statistical analysis of word frequencies, as "communication of ideas by means of words is carried out on the basis of statistical probability." Automated translation of non-English scientific work also contributed significantly to fundamental research on the natural language processing of bibliographic references, as in this period a significant share of scientific publications was still not available in English, especially those coming from the Soviet bloc. Influential members of the National Science Foundation like Joshua Lederberg advocated for the creation of a "centralized information system", SCITEL, partly influenced by the ideas of John Desmond Bernal. This system would at first coexist with printed journals and gradually replace them altogether on account of its efficiency. In the plan laid out by Lederberg to Eugene Garfield in November 1961, a centralized deposit would index as many as 1,000,000 scientific articles per year. Beyond full-text searching, the infrastructure would also ensure the indexing of citations and other metadata, as well as the automated translation of foreign-language articles.
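Luhn's word-frequency paradigm mentioned above can be loosely sketched as follows. This is not his actual algorithm (which also filtered out the most and least frequent words and weighed sentence-level significance); it is only a minimal frequency-based keyword extraction, with an assumed stop-word list and an invented sample text.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "by", "on", "for"}

def significant_words(text, top_n=5):
    """Rank content words by raw frequency, ignoring common function words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)

sample = ("Citation indexes for science: a new dimension in documentation "
          "through association of ideas. A citation index makes the citation "
          "network of science searchable.")
print(significant_words(sample))
```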
The first working prototype of an online retrieval system, developed in 1963 by Doug Engelbart and Charles Bourne at the Stanford Research Institute, proved the feasibility of these theoretical assumptions, although it was heavily constrained by memory issues: no more than 10,000 words from a few documents could be indexed. The early scientific computing infrastructures were focused on more specific research areas, such as MEDLINE for medicine, NASA/RECON for space engineering or OCLC Worldcat for library search: "most of the earliest online retrieval system provided access to a bibliographic database and the rest used a file containing another sort of information—encyclopedia articles, inventory data, or chemical compounds." An exclusive focus on text analysis proved limiting as the digitized collections expanded: a query could yield a large number of results, and it was difficult to evaluate their relevance and accuracy.
The periodical crisis and the limitations of index retrieval technologies motivated the development of bibliometric tools and large citation indexes like the Science Citation Index of Eugene Garfield. Garfield's work was initially primarily concerned with the automated analysis of texts. In contrast with ongoing work largely focused on internal semantic relationships, Garfield highlighted "the importance of metatext in discourse analysis", such as introductory sentences and bibliographic references. Secondary forms of scientific production like literature reviews and bibliographic notes became central to Garfield's vision, as they had already been to John Desmond Bernal's vision of scientific archives. By 1953, Garfield's attention had permanently shifted to citation analysis: in a private letter to William C. Adair, the vice-president of the publisher of the Shepard's Citations index, "he suggested a well tried solution to the problem of automatic indexing, namely to "shepardize" biomedical literature, to untangle the skein of its content by following the thread of citation links in the same way the legal citator did with court sentences." In 1955, Garfield published his seminal article "Citation Indexes for Science", which both laid out the outline of the Science Citation Index and had a large influence on the future development of bibliometrics. The general citation index envisioned by Garfield was originally one of the building blocks of the ambitious plan of Joshua Lederberg to computerize scientific literature. Due to lack of funding, the plan was never realized. In 1963, Eugene Garfield created the Institute for Scientific Information, which aimed to transform the projects initially envisioned with Lederberg into a profitable business.