Scientometrics Analysis of RStudio applications in PubMed Database

By Reza AA Khoei

October 16, 2022

Introduction

Evaluating and interpreting scientific output helps identify prominent authors, active departments, and hot topics in a specific field. This kind of analysis is known as scientometrics or bibliometrics. Scientometrics refers to “all quantitative aspects [of] science and scientific research” (Sengupta 1992), whereas bibliometrics refers to “the application of mathematics and statistical methods to books and other forms of written communication” (Pritchard 1969). Published documents can be analyzed statistically and visualized with the R package bibliometrix, which was created and is developed by [Massimo Aria](https://masimoaria.com) and [Corrado Cuccurullo](https://www.corradococcurullo.com).

Our aim is to examine how RStudio is used in published scientific papers indexed in the PubMed database. To do so, we used the bibliometrix package in RStudio.

Call required packages

First, we install and load all the packages required for the analysis.

install.packages("bibliometrix")
install.packages("kableExtra")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("reshape2")
install.packages("pubmedR")
options(scipen = 999)
library(bibliometrix)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(reshape2)
library(pubmedR)

Search strategy

The search is based on a predefined query string.

An API key allows better and faster searches. Nevertheless, NULL can be passed in place of a key.

Because RStudio and the R programming language are often used interchangeably, both terms are included in the search strategy. The search returned 649 documents, including articles, book chapters, conference papers, and other document types.
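The query itself is not shown in this post. As a minimal sketch, a PubMed query covering both terms could be assembled as follows; the terms and the `[Title/Abstract]` field tag are assumptions for illustration, not the query actually used here.

```r
# Hypothetical search terms; the original query of this analysis is not shown.
terms <- c("RStudio", "R programming language")

# Quote each term, restrict it to title/abstract, and join the parts with OR.
query <- paste0('"', terms, '"[Title/Abstract]', collapse = " OR ")
print(query)

# With pubmedR loaded, the number of matching records could then be requested
# (needs an internet connection, so it is left commented out):
# res <- pmQueryTotalCount(query = query, api_key = NULL)
# res$total_count
```

The resulting `query` and `res` objects are the ones the download command expects.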

The downloaded records must now be converted to a data frame for statistical analysis, using the following commands.

D <- pmApiRequest(query = query, limit = res$total_count, api_key = NULL)
## Documents  200  of  649 
## Documents  400  of  649 
## Documents  600  of  649 
## Documents  649  of  649
M <- pmApi2df(D)
## ================================================================================
M <- convert2df(D, dbsource = "pubmed", format = "api")
## 
## Converting your pubmed collection into a bibliographic dataframe
## 
## ================================================================================
## Done!

The following command gives an overview of the data. The summary can be gathered in a table suitable for HTML output; attributes such as cell position and alignment are set through the arguments of the kable function.

results <- biblioAnalysis(M)

Sometimes researchers prefer to restrict the analysis to a specific document type, such as journal articles only. In addition, since RStudio was first released in 2011, the search is limited to documents published from 2011 onward.

M <- filter(M, M$DT == "JOURNAL ARTICLE" & M$PY >= 2011)
results <- biblioAnalysis(M)
a <- summary(results)
knitr::kable(a$MainInformationDF, caption = "Main information of articles",align = "llccl", format = "html") %>% 
    kable_classic(full_width = F, position = "center")
Table 1: Main information of articles

| Description | Results |
| --- | --- |
| MAIN INFORMATION ABOUT DATA | |
| Timespan | 2011:2022 |
| Sources (Journals, Books, etc) | 410 |
| Documents | 606 |
| Annual Growth Rate % | 36.77 |
| Document Average Age | 2.53 |
| Average citations per doc | 0 |
| Average citations per year per doc | 0 |
| References | 1 |
| DOCUMENT TYPES | |
| journal article | 606 |
| DOCUMENT CONTENTS | |
| Keywords Plus (ID) | 1420 |
| Author’s Keywords (DE) | 1933 |
| AUTHORS | |
| Authors | 3156 |
| Author Appearances | 3596 |
| Authors of single-authored docs | 22 |
| AUTHORS COLLABORATION | |
| Single-authored docs | 23 |
| Documents per Author | 0.192 |
| Co-Authors per Doc | 5.93 |
| International co-authorships % | 0 |
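
Two of the collaboration indicators in Table 1 can be checked by hand from the counts in the same table. The sketch below assumes they are simple ratios; this is an assumption about the definitions, though the results agree with the reported values.

```r
# Counts copied from Table 1.
documents          <- 606
authors            <- 3156
author_appearances <- 3596

# Assumed definitions: plain ratios of the counts above.
docs_per_author   <- documents / authors             # Table 1 reports 0.192
coauthors_per_doc <- author_appearances / documents  # Table 1 reports 5.93

print(round(c(docs_per_author = docs_per_author,
              coauthors_per_doc = coauthors_per_doc), 3))
```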

General information about scientific documents

Here we look at some tables and plots extracted from the data set returned by our search strategy.

Scientific documents production year by year

knitr::kable(a$AnnualProduction, caption = 
           "Annual Production of Scientific Documents",
           align = "cc", format = "html") %>%
           kable_classic(full_width = F, position = "center")

Table 2: Annual Production of Scientific Documents

| Year | Articles |
| --- | --- |
| 2011 | 3 |
| 2012 | 6 |
| 2013 | 4 |
| 2014 | 21 |
| 2015 | 19 |
| 2016 | 24 |
| 2017 | 35 |
| 2018 | 34 |
| 2019 | 72 |
| 2020 | 139 |
| 2021 | 155 |
| 2022 | 94 |

Based on this table, the number of published documents has risen sharply in recent years, possibly because of the COVID-19 pandemic. The 2022 count covers only part of the year.
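
The "Annual Growth Rate %" in Table 1 can be checked against the yearly counts in Table 2. This is a minimal sketch, assuming the figure is a compound annual growth rate between the first and last year. That formula is an assumption rather than something taken from the bibliometrix source, but it gives about 36.77, in line with the reported value.

```r
# Yearly article counts copied from Table 2.
years    <- 2011:2022
articles <- c(3, 6, 4, 21, 19, 24, 35, 34, 72, 139, 155, 94)

# Compound annual growth rate between the first and last year
# (assumed formula, used here only to check the reported figure).
n_periods   <- length(years) - 1
growth_rate <- ((articles[length(articles)] / articles[1])^(1 / n_periods) - 1) * 100

print(round(growth_rate, 2))
```

Because the formula uses only the first and last counts, the incomplete 2022 count pulls the rate down.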

Top 10 authors of PubMed papers analyzed with RStudio

Let's take a look at the top 10 authors and two indicators: the number of articles and the fractionalized number of articles.

knitr::kable(a$MostProdAuthors, caption = "Top 10 Authors", align = "lclc", format = "html") %>% kable_classic(full_width = F, position = "center")

Table 3: Top 10 Authors

| Authors | Articles | Authors | Articles Fractionalized |
| --- | --- | --- | --- |
| WANG Z | 10 | OH KK | 3.25 |
| LIU Y | 9 | ADNAN M | 2.25 |
| WANG Y | 9 | CHO DH | 2.25 |
| OH KK | 8 | HU K | 2.25 |
| ZHANG Y | 8 | TENAN MS | 1.50 |
| ADNAN M | 7 | WANG Y | 1.47 |
| CHO DH | 7 | LIU Y | 1.35 |
| WANG C | 7 | YANG J | 1.17 |
| WANG H | 7 | WANG Z | 1.16 |
| XU Y | 7 | LI H | 1.10 |
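
The two rankings in Table 3 differ because fractionalized counting gives each of the k authors of a paper a credit of 1/k instead of 1. A toy sketch with made-up papers and author labels (not data from this analysis) shows the difference:

```r
# Hypothetical author lists of three papers (illustration only).
papers <- list(
  c("A", "B"),
  c("A"),
  c("A", "B", "C", "D")
)

authors <- unlist(papers)
# Each author of a paper with k authors receives a weight of 1/k.
weights <- rep(1 / lengths(papers), times = lengths(papers))

full_count <- table(authors)                 # one credit per paper
frac_count <- tapply(weights, authors, sum)  # shared credit per paper

print(full_count)
print(frac_count)
```

This is why WANG Z has the most articles (10) but a fractionalized count of only 1.16: those papers have many co-authors.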

Top 10 most cited papers among PubMed papers analyzed with RStudio

knitr::kable(a$MostCitedPapers[,1:2], caption = "10 Most Cited Papers",  
             align = "ll",format = "html") %>%  
    kable_classic(full_width = F, position = "center")

Table 4: 10 Most Cited Papers

| Paper | DOI |
| --- | --- |
| LIU Y, 2022, FRONT CELL DEV BIOL | 10.3389/fcell.2022.946363 |
| CAI Z, 2022, ENVIRON HEALTH PREV MED | 10.1265/ehpm.22-00023 |
| ALKHAYYAT S, 2022, MEDICINE (BALTIMORE) | 10.1097/MD.0000000000030576 |
| WANG Z, 2022, J ONCOL | 10.1155/2022/5300523 |
| DA SILVA TORRES MK, 2022, FRONT CELL INFECT MICROBIOL | 10.3389/fcimb.2022.932563 |
| PRIMATIKA RA, 2022, VET WORLD | 10.14202/vetworld.2022.1814-1820 |
| CUI QQ, 2022, MEDICINE (BALTIMORE) | 10.1097/MD.0000000000030728 |
| PADAR C, 2022, CUREUS | 10.7759/cureus.28414 |
| HUANG X, 2022, HEMATOLOGY | 10.1080/16078454.2022.2127462 |
| YENEW C, 2021, ITAL J FOOD SAF | 10.4081/ijfs.2022.10221 |

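Table 4 lists the ten papers returned as most cited. Since Table 1 reports an average of 0 citations per document, the records evidently carry no citation counts, so the order should not be read as a citation ranking.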

Top 10 journals of PubMed papers analyzed with RStudio

Here we see the top 10 journals by number of published documents. PLOS ONE and Bioinformatics are the most prominent in this collection.

knitr::kable(a$MostRelSources, caption = "Top 10 Journals", align = "lc", format = "html") %>%
        kable_classic(full_width = F, position = "center")

Table 5: Top 10 Journals

| Sources | Articles |
| --- | --- |
| PLOS ONE | 21 |
| BIOINFORMATICS (OXFORD ENGLAND) | 20 |
| METHODS IN MOLECULAR BIOLOGY (CLIFTON N.J.) | 11 |
| BMC BIOINFORMATICS | 8 |
| F1000RESEARCH | 8 |
| BIOMED RESEARCH INTERNATIONAL | 7 |
| CUREUS | 7 |
| INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH | 7 |
| MEDICINE | 7 |
| DATA IN BRIEF | 6 |

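Top 10 keywords: DE and ID of PubMed papers analyzed with RStudio

Here we see the top 10 keywords by number of documents in which they appear. Note that there are two types of keywords, which we examine separately. DE keywords are the authors' own keywords, taken from the articles. ID keywords (Keywords Plus) are described in the footnote as extracted from the references of the articles; for this PubMed collection, the ID values shown in the table (HUMANS, FEMALE, MALE and so on) look like indexing terms assigned to the articles.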

knitr::kable(a$MostRelKeywords, caption = "Top 10 Keywords", align = "lclc", format = "html") %>%
     add_footnote(c("DE: Keywords Extracted from Articles",
                    "ID: Keywords Extracted from References of Articles"), notation = "alphabet") %>%
     kable_classic(position = "center")

Table 6: Top 10 Keywords

| Author Keywords (DE) | Articles | Keywords-Plus (ID) | Articles |
| --- | --- | --- | --- |
| COVID-19 | 20 | HUMANS | 320 |
| META-ANALYSIS | 19 | FEMALE | 91 |
| R PROGRAMMING LANGUAGE | 18 | MALE | 85 |
| R | 17 | SOFTWARE | 82 |
| RSTUDIO | 16 | ADULT | 54 |
| PROGNOSIS | 14 | MIDDLE AGED | 47 |
| BIOINFORMATICS | 11 | COMPUTATIONAL BIOLOGY | 43 |
| MACHINE LEARNING | 11 | AGED | 41 |
| CANCER | 10 | ANIMALS | 37 |
| SARS-COV-2 | 9 | RETROSPECTIVE STUDIES | 37 |

a DE: Keywords Extracted from Articles
b ID: Keywords Extracted from References of Articles

Top 10 authors and their timeline production

res <- authorProdOverTime(M, k=10)

knitr::kable(res$dfAU[1:3], caption = "Top 10 authors and their annual production", format = "html") %>% kable_classic(position = "center")

Table 7: Top 10 authors and their annual production

| Author | year | freq |
| --- | --- | --- |
| ADNAN M | 2020 | 4 |
| ADNAN M | 2021 | 1 |
| ADNAN M | 2022 | 2 |
| CHO DH | 2020 | 4 |
| CHO DH | 2021 | 1 |
| CHO DH | 2022 | 2 |
| LIU Y | 2017 | 1 |
| LIU Y | 2020 | 5 |
| LIU Y | 2021 | 2 |
| LIU Y | 2022 | 1 |
| OH KK | 2020 | 4 |
| OH KK | 2021 | 2 |
| OH KK | 2022 | 2 |
| WANG C | 2015 | 1 |
| WANG C | 2019 | 2 |
| WANG C | 2020 | 1 |
| WANG C | 2021 | 3 |
| WANG H | 2014 | 1 |
| WANG H | 2019 | 1 |
| WANG H | 2020 | 1 |
| WANG H | 2021 | 3 |
| WANG H | 2022 | 1 |
| WANG Y | 2017 | 1 |
| WANG Y | 2019 | 1 |
| WANG Y | 2020 | 3 |
| WANG Y | 2021 | 2 |
| WANG Y | 2022 | 2 |
| WANG Z | 2019 | 1 |
| WANG Z | 2020 | 2 |
| WANG Z | 2021 | 4 |
| WANG Z | 2022 | 3 |
| XU Y | 2016 | 1 |
| XU Y | 2017 | 1 |
| XU Y | 2019 | 1 |
| XU Y | 2020 | 2 |
| XU Y | 2021 | 2 |
| ZHANG Y | 2016 | 1 |
| ZHANG Y | 2020 | 3 |
| ZHANG Y | 2022 | 4 |

Annual production of the top journals of PubMed papers analyzed with RStudio

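# Yearly number of articles in the most productive journal
topSO = sourceGrowth(M, top=1, cdf=FALSE)
DF = melt(topSO, id='Year')
ggplot(DF,aes(Year,value, group=variable, color=variable))+geom_line()

# Same data for the three most productive journals
topSO = sourceGrowth(M, top=3, cdf=FALSE)
DF = melt(topSO, id='Year')
# Plot call repeated from above; the original listing stops before it
ggplot(DF,aes(Year,value, group=variable, color=variable))+geom_line()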

Some Information about Top 10 Authors

DF = dominance(results)
knitr::kable(DF, caption = "Some Information about Top 10 Authors", digits = 3, align = "lccccccc", format = "html") %>%
    kable_classic(position = "center")       
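
Table 8: Some Information about Top 10 Authors

| Author | Dominance Factor | Tot Articles | Single-Authored | Multi-Authored | First-Authored | Rank by Articles | Rank by DF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OH KK | 1.000 | 8 | 1 | 7 | 7 | 4 | 1 |
| WANG X | 0.333 | 6 | 0 | 6 | 2 | 9 | 2 |
| WANG C | 0.286 | 7 | 0 | 7 | 2 | 5 | 3 |
| WANG H | 0.286 | 7 | 0 | 7 | 2 | 5 | 3 |
| LIU Y | 0.222 | 9 | 0 | 9 | 2 | 2 | 5 |
| WANG Z | 0.200 | 10 | 0 | 10 | 2 | 1 | 6 |
| LI H | 0.167 | 6 | 0 | 6 | 1 | 9 | 7 |
| XU Y | 0.143 | 7 | 0 | 7 | 1 | 5 | 8 |
| ZHANG X | 0.143 | 7 | 0 | 7 | 1 | 5 | 8 |
| WANG Y | 0.111 | 9 | 0 | 9 | 1 | 2 | 10 |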

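The dominance factor is the share of an author's multi-authored articles on which that author is listed first. OH KK, for example, has 7 multi-authored articles and 7 first-authored ones, giving a factor of 1.000.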
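
The dominance factors in Table 8 can be reproduced from its own columns. The sketch below assumes the factor is the first-authored count divided by the multi-authored count, which is consistent with the rows shown but is not taken from the package source.

```r
# A few rows copied from Table 8.
dominance_tbl <- data.frame(
  author         = c("OH KK", "WANG X", "WANG C", "LIU Y"),
  multi_authored = c(7, 6, 7, 9),
  first_authored = c(7, 2, 2, 2)
)

# Assumed definition: first-authored papers / multi-authored papers.
dominance_tbl$dominance_factor <-
  round(dominance_tbl$first_authored / dominance_tbl$multi_authored, 3)

print(dominance_tbl)
```

Note that OH KK reaches 1.000 only because the denominator is the 7 multi-authored articles, not all 8 articles.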

Top countries based on frequency of publications in their journals

knitr::kable(head(sort(table(M$SO_CO), decreasing = TRUE), 10),
        caption = "Top Countries based on Frequency of published articles in Journals",
        col.names = c("Country", "Frequency"), align = "lc", format = "html") %>%
        kable_classic(full_width = F, position = "center")

Table 9: Top Countries based on Frequency of published articles in Journals

| Country | Frequency |
| --- | --- |
| UNITED STATES | 199 |
| ENGLAND | 168 |
| SWITZERLAND | 81 |
| NETHERLANDS | 43 |
| GERMANY | 14 |
| CANADA | 13 |
| CHINA | 10 |
| BRAZIL | 7 |
| GREECE | 7 |
| NEW ZEALAND | 7 |

As can be seen, journals based in the United States and England published the most articles in this collection.

What does Lotka’s law tell us about this data set?

L=lotka(results)
lotkaTable=cbind(L$AuthorProd[,1],L$AuthorProd[,2],L$AuthorProd[,3],L$fitted)
knitr::kable(lotkaTable, caption = "Frequency Of Authors Based on Lotka's Law", digits = 3, align = "cccc", format = "html",col.names = c("Number of article", "Number of authors", "Frequency based on data", "Frequency based on Lotka's law")) %>%
    kable_classic(full_width = F, position = "center")

Table 10: Frequency Of Authors Based on Lotka’s Law

| Number of article | Number of authors | Frequency based on data | Frequency based on Lotka’s law |
| --- | --- | --- | --- |
| 1 | 2893 | 0.917 | 0.626 |
| 2 | 180 | 0.057 | 0.064 |
| 3 | 48 | 0.015 | 0.017 |
| 4 | 12 | 0.004 | 0.007 |
| 5 | 7 | 0.002 | 0.003 |
| 6 | 5 | 0.002 | 0.002 |
| 7 | 6 | 0.002 | 0.001 |
| 8 | 2 | 0.001 | 0.001 |
| 9 | 2 | 0.001 | 0.000 |
| 10 | 1 | 0.000 | 0.000 |

The p-value of the two-sample Kolmogorov-Smirnov test comparing the observed frequencies with those predicted by Lotka’s law is 0.0149. At a significance level of 0.05, this indicates that our data do not follow Lotka’s law.
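
Lotka's law says that the share of authors with n articles falls off roughly as C / n^beta. The sketch below fits that form to the counts in Table 10 with an ordinary least-squares regression on the log-log scale and then compares observed and fitted frequencies with a two-sample Kolmogorov-Smirnov test. It is an illustration under these assumptions, not necessarily the exact procedure inside lotka(), so the estimates and p-value may differ from the values reported above.

```r
# Author productivity copied from Table 10.
n_articles <- 1:10
n_authors  <- c(2893, 180, 48, 12, 7, 5, 6, 2, 2, 1)

# Observed share of authors with n articles.
observed <- n_authors / sum(n_authors)

# Fit log10(share) = log10(C) - beta * log10(n) by least squares.
fit  <- lm(log10(observed) ~ log10(n_articles))
beta <- -unname(coef(fit)[2])
C    <- 10^unname(coef(fit)[1])

# Shares predicted by the fitted law.
fitted_freq <- C / n_articles^beta

# Two-sample Kolmogorov-Smirnov test between observed and fitted shares
# (tied values trigger a warning about exact p-values, silenced here).
ks <- suppressWarnings(ks.test(observed, fitted_freq))

print(c(beta = beta, C = C, p_value = ks$p.value))
```

A small p-value means the observed and fitted frequencies are unlikely to come from the same distribution.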

Collaboration networks for authors

The collaboration network of authors is plotted below. The same kind of network can also be plotted for keywords, universities, and countries.

NetMatrix <- biblioNetwork(M, analysis = "collaboration", 
                         network = "authors", sep = ";")
net <- networkPlot(NetMatrix, n = 10, type = "auto", Title = "collaboration Network",labelsize=1, halo = TRUE) 
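
Conceptually, a collaboration network is built from a document-by-author matrix: two authors are linked once for every paper they share. A toy sketch with made-up papers and author labels follows; it illustrates the idea and is not the internal code of biblioNetwork.

```r
# Hypothetical author lists of three papers (illustration only).
papers <- list(
  p1 = c("A", "B"),
  p2 = c("A", "B", "C"),
  p3 = c("C", "D")
)

authors <- sort(unique(unlist(papers)))

# Document x author incidence matrix: 1 if the author signed the paper.
incidence <- t(sapply(papers, function(au) as.integer(authors %in% au)))
colnames(incidence) <- authors

# Author x author matrix: off-diagonal cells count shared papers,
# the diagonal counts each author's papers.
collab <- crossprod(incidence)

print(collab)
```

Plotting functions such as networkPlot then draw an edge for every non-zero off-diagonal cell.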

Thematic map

The thematic map is plotted from the author keywords (DE) as follows:

remove.terms.1word = c("aged","map","allergy","demand","rest","workflow","data collection","r","rstudio","data analysis","conservation","review","functional",
    "clinical","identification","data","analysis","network","systematic","r programming","r package","maternal","reproducibility","r language","methods","treatment","r programming language","sars-cov-2","retention","calcium","statistics","open source","quality","methodology","complications","statistical analysis","prognosis","algorithms","software")

synonyms1 <- c("covid-19;coronavirus","gene; genes", "prediction; predicting", "modeling; modelling; resting","emotion; emotional", "adhd; hyperactivity",
      "differentially expressed genes;differentially expressed")
tm1 = thematicMap(M, field = "DE",n.labels = 2, ngrams = 1, remove.terms = remove.terms.1word,synonyms = synonyms1)
plot(tm1$map)

A thematic map is a plot divided into four quadrants: niche themes, motor themes, basic themes, and emerging or declining themes. For more details, see (Zhang et al. 2022).

Motor Themes: Quadrant I, the upper-right quadrant, contains themes that are well developed and form important pillars that shape the field of research.

Niche Themes: Quadrant II, the upper-left quadrant, contains highly developed but isolated themes.

Emerging or declining Themes: Quadrant III, the lower-left quadrant, contains weakly developed and marginal themes.

Basic Themes: Quadrant IV, the lower-right quadrant, contains themes that are less developed but important to the field of study.
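
The quadrants follow from two scores per theme: centrality (horizontal axis, how strongly a theme connects to other themes) and density (vertical axis, how developed the theme is internally). A toy sketch of the classification follows. The scores are made up and the axes are split at the mean of each score; both choices are assumptions for illustration, not values from this analysis.

```r
# Hypothetical themes with made-up centrality and density scores.
themes <- data.frame(
  theme      = c("t1", "t2", "t3", "t4"),
  centrality = c(0.9, 0.2, 0.1, 0.8),
  density    = c(0.8, 0.9, 0.2, 0.1)
)

# Assumed split points: the mean of each score.
cx <- mean(themes$centrality)
dy <- mean(themes$density)

themes$quadrant <- with(themes,
  ifelse(centrality >= cx & density >= dy, "Motor",
  ifelse(centrality <  cx & density >= dy, "Niche",
  ifelse(centrality <  cx & density <  dy, "Emerging or declining",
                                           "Basic"))))

print(themes)
```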

Some diseases among the motor themes, such as obesity, COVID-19, schizophrenia, cancer, and tuberculosis, are well discussed, highly developed, and analyzed with RStudio. Statistical and analytical topics such as machine learning, PCA (principal component analysis), bibliometrics, and bioinformatic analysis also appear among the motor themes.

Some topics among the basic themes, such as type 2 diabetes mellitus, stroke, and differentially expressed genes, deserve more analysis with RStudio than they currently receive. Meta-analysis, systematic review, network analysis, and computational analysis are methods recommended for this purpose.

Based on this map, some themes in the PubMed database have been discussed extensively but remain isolated (the niche themes quadrant), for example natural language processing, text mining, pan-cancer, and behavioral science. Themes in quadrant III, such as visualization and shiny, are among the emerging or declining themes.

Note: terms that we do not want to appear in the map, as well as groups of synonyms, are defined beforehand (remove.terms.1word and synonyms1 above).

Associations among our information

Here we look at the associations among authors, author keywords (DE), and journals.

threeFieldsPlot(M)

This plot shows how keywords, authors and journals are related to each other.

Thematic Evolution Plot

Here we trace the evolution of topics in the field of RStudio applications, based on author keywords (DE) and titles (TI).

These plots show how themes have evolved over the years.

years=c(2019)

nexus <- thematicEvolution(M,field="DE",years=years,n=100,
          minFreq=3, ngrams = 1,remove.terms = remove.terms.1word,
          synonyms = synonyms1)
plotThematicEvolution(nexus$Nodes,nexus$Edges)
nexus <- thematicEvolution(M,field="TI",years=years,n=100,
          minFreq=3, ngrams = 2,remove.terms = remove.terms.1word,
          synonyms = synonyms1)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [779].
plotThematicEvolution(nexus$Nodes,nexus$Edges)
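
A single cut point in `years` splits the collection into two periods. This small sketch uses the yearly counts from Table 2 and assumes the cut year itself belongs to the first period, which is an assumption about how thematicEvolution slices the data.

```r
# Yearly article counts copied from Table 2.
years    <- 2011:2022
articles <- c(3, 6, 4, 21, 19, 24, 35, 34, 72, 139, 155, 94)

# Assumed slicing rule: the cut year is the last year of the first period.
cut_year <- 2019
period   <- ifelse(years <= cut_year, "2011-2019", "2020-2022")

# Number of articles falling in each period.
articles_per_period <- tapply(articles, period, sum)
print(articles_per_period)
```

Under this assumption the second period, although only three years long, holds well over half of the 606 articles.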

References:

Pritchard, A. 1969. “Statistical Bibliography or Bibliometrics?” Journal of Documentation 25 (4): 348–49.
Sengupta, I. N. 1992. “Bibliometrics, Informetrics, Scientometrics and Librametrics: An Overview.” Libri 42 (2): 75–98. https://doi.org/10.1515/libr.1992.42.2.75.
Zhang, Mingjie, Xiaoxue Wang, Xueting Chen, Zixuan Song, Yuting Wang, Yangzi Zhou, and Dandan Zhang. 2022. “A Scientometric Analysis and Visualization Discovery of Enhanced Recovery After Surgery.” Frontiers in Surgery 9. https://www.frontiersin.org/articles/10.3389/fsurg.2022.894083.