Chapter 23 Literature databases

Databases of scientific literature have been around since the 1950s, when the number of journals and papers started increasing so much that scholars simply couldn’t keep up (Garfield, 1955). The first database (from 1962), and one that continues to be one of the largest, was the Science Citation Index (now renamed Web of Science) that sought to record not only the contents of selected journals (titles, abstracts, authors and other metadata), but also the citations that each paper used.

Many more literature databases have emerged over the years. Perhaps most notably PubMed contains a database of biomedical literature and has been free for public use since 1996 and is funded by the US National Institutes of Health (NIH). The Science Citation Index was acquired by the Thompson Corporation in 1992, and this scholarly information part of the business was sold to private for-profit owners, Clarivate Analytics, in 2016. A rival literature database, Scopus was founded by publishing company Elsevier in 2004, claiming to be the largest citation database (Schotten et al., 2017). Similarly, Dimensions, was launched in 2018 as a project owned by Holtzbrinck Publishing Group, owners of Springer-Nature (Hook, Porter & Herzog, 2018; Van Noorden, 2018). These initiatives of major science publishing houses are thought to be linked to their attempts to capture the scientific work-flow (Brembs et al., 2021): and see Part I.Google launched Google Scholar in 2004, and while this database is free to use, the owners are a for-profit company and the data behind the database remains frustratingly closed.

In 2022, a new Open and free database, OpenAlex, successfully aggregated data from ORCID, ROR, DOAJ, Unpaywall, Pubmed, Pubmed Central and the ISSN International Centre among others (Singh Chawla, 2022; Priem, Piwowar & Orr, 2022). OpenAlex is remarkable for two reasons: (1) by aggregating works from so many different Open Source databases, it has amassed 243 million works - nearly 3 times that of Scopus or Web of Science collections. (2) OpenAlex is a free index (specifically it is CC0), it is Open Source and the data is free to use. Contrast this with the other for-profit databases, and you will understand that we should support this new initiative.

As a postgraduate student, you should use one or more of these databases to look for relevant literature to read. As we have seen in the last chapter, citations record the way in which people find the literature useful, and are therefore, a highly effective way of finding related and pertinent literature in your field.

Before you start using a literature database, you will need to have some key literature with which you are familiar and that relates directly to your project. If you don’t have this, then check with your advisor. Sometimes, this key literature is cited in the advert for the PhD, the grant application that funded it, or if this is a topic that your advisor has already conducted some research on, then their own publications may include part of the key literature. If you are not familiar with your advisor’s (including co-advisor’s where appropriate) papers, then a great first step is to search for them on a database and read through titles and abstracts that are relevant to your research topic. Once you have a set of key literature (this might only be 5 to 10 papers), then it is time to go to the literature databases to do your search.

23.1 Searching the literature using a database

There are two typical ways in which you can search for literature about your area of interest. The first is to use keywords and the second is to use citations of key references. I suggest that you use both of these tactics when doing your initial literature search. Further, I’d suggest that it’s worth your time keeping a record of what you have searched for, how and when and then download the results. This is called a systematic literature review. Using this tactic from the outset of your interactions with literature databases will give you a good idea of what proportion of the literature you are reading, and you might then be able to convert your finding into an objective literature review - making a great first introductory chapter for your thesis.

23.2 Keywords

While searching for keywords sounds easy, picking the keywords that you should use is more difficult and requires a lot of practice with the databases and your key references. Remember that keywords are often not single words, but also include short phrases. Look at all of the key references and the keywords that they contain. Any keywords that they share are going to be important for you. Make a note of all relevant keywords for your searches, as ultimately, this will help you when you need to determine the keywords for your own chapters.

Most literature databases have the option of using just keywords, or keywords and words in the title and abstract. I would start with the broader search term and see how well they work. However, these searches alone may return such a large amount of literature that you can’t possibly read it all. In this case, you will need to use more than a single keyword. The trick is to find the correct combination of keywords that will deliver all (or most) of your key references, while not delivering returns of thousands of papers that are not relevant to your study. Ideally, you want a search term that will deliver around one hundred papers at most, where the majority have relevance to your study area. Luckily, literature databases have some tricks for you to add to combine your keywords in ways that you can include or exclude certain topics.

Struggling to define your own keywords? Help is at hand with a new workshop designed specifically to enable you to determine your own keywords specific to your study, and set up a citation alert from your favourite database (click here).

23.2.1 Boolean terms or operators

There are three principle Boolean operators: AND, OR and NOT. ### Example of Shiny App for Boolean Logic

Below is a Shiny app that visualizes the Boolean logic between sets. Use this to better understand how the AND, OR, and NOT operators work in practice (see below):

You should use these to combine a string of keywords. They are relatively straightforward to use:

Use AND when you want both terms included. For example, performance AND lizards. In this example, all papers that mention performance AND lizards in the title, abstract or keywords are returned.
Use OR when you want to include either term. For example, performance AND lizards OR geckos. This example should return more literature than the previous search as it will include, all papers that mention performance AND lizards OR geckos in the title, abstract or keywords are returned. You will note that geckos are lizards, but people who write about geckos might not mention that they are lizards in the title, abstract or keywords, so you may need to use these extra search terms in order to pick up all of the relevant literature on lizards. If your research requires you to cover all lizards, then you might need a lot of search terms in the OR string in order to pick up all of the literature that covers lizard performance.
Use NOT when you specifically want to exclude a keyword. Following our example above, the search term performance AND reptiles NOT snakes should give you all performance papers on lizards, but you may want to exclude turtles and crocodilians as well if they have a lot of hits. NOT is a powerful operator and needs to be used with caution, as it could exclude some relevant hits. In our example here, an item about interactions between snakes and lizards would be excluded.

23.2.2 Proximity operators and parentheses

You can also use the term NEAR to indicate that two terms need to occur within a specified number of words. For example, performance NEAR/5 reptile will capture items where these terms occur within 5 words of each other, but no greater. This might be useful to exclude some papers that mention performance and reptiles in the abstract but are not about reptile performance.

Once you start putting together search strings you will need to use brackets to make sure that the correct terms are grouped. For example, performance AND (lizards OR geckos) will get what we described above, whereas (performance AND lizards) OR geckos will give you all papers on lizards and performance plus all papers on geckos (not what we wanted).

Explore the advanced search options of the specific database that you are using to see the correct semantics needed for a search.

23.2.3 Wildcards

Most literature databases allow you to search using different wildcards. These are *, $ and ?. These can be placed at any point within a search term and help overcome differences in spelling, or families of words that you may want to capture or exclude in your search.

Use * when you want to replace a set of letters at the end of a word. For example, _reptil*_ will return all terms including Reptilia, reptile and reptiles.
Use ? when you want to search for any single letter at that place in a word. For example, sterili?e will capture words with sterilize and sterilise.
Use $ when there is more than a single letter that might change within a word, or the omission of a letter. For example, colo$r will return both US and UK spelling, color and colour, respectively. Note that Scopus doesn’t use the $ symbol as a wildcard: for this example you can use ? instead.

23.2.4 Combining searches

Lastly, there is the option of combining searches, so that you can add search strings together or keep them apart. You will need to use the advanced search options in your database to do this.

You will need to carefully check your database for their precise semantics when using these Boolean operators in a search string. They are all slightly different, often using quotes around the keywords, and your search won’t work as you intend unless you are using the Boolean operators in the way the database dictates.

Your specific search term may end up having a long list of terms. Remember to save this whenever you make a search, especially if you refine it during your studies. Most literature databases will provide the search string that you used together with the results. Check your search string carefully for possible spelling errors that could influence your search.

23.3 Moving items into your reference manager

An invaluable way in which databases have become extremely useful is the ability to move items that you have found during your search directly into your reference manager. Beware that these regularly come with errors, and that the amount of errors is related to your database source (Scopus and WoS are much better than Google Scholar). This means that after importing the references, you will need to go back to your reference manager and edit them or they will come out incorrectly when you use them. Of particular note to biologists is the need to have all species names in italics, and you’ll need to do this yourself.

23.4 Citations

In addition to using keywords, you should search citations of your key references. Typically, when you start reviewing the literature in your specific area, you will have a few key references that have been given to you by your advisor. Having read these, you will already know that they cite other works of interest, and you should follow up to look at all that seem relevant. But that’s not all, these key publications are likely to have also been cited themselves by newer papers. Tracking citations of key papers can help you keep abreast of what is happening in your field. Indeed, given the prolific ever increasing nature of scientific publications, it might be the only way to keep on top of what is happening. Some literature databases allow you to receive alerts if a paper is cited (e.g. Google Scholar), and you should set this up for all of your key citations, and perhaps for some of the most prominent authors in your field.

You can’t search every paper for all the references that it cites, and its own citations. Although this would give you a solid systematic basis for your literature search, it is likely to be too much work. Getting good search terms for your keywords, and finding key references to search for citing articles is a better strategy as it makes your literature searches manageable. If you refine and save your search terms it also makes it repeatable. This will provide you with a solid basis for moving forward with your PhD, and with appropriate alerts make sure that you keep up with the latest literature in your field.

The above section is a brief guide, and your library is likely to have in-depth guides on how to use the specific databases that they house. They may also provide training sessions, or online videos to watch. It is worth becoming familiar with exactly how the databases that you will use work, so this is time well spent at the beginning of your studies.

23.4.1 What else are literature databases used for?

The Science Citation Index used to be published as a single volume that emerged each year and was available in the reference section of the library. By the time that I started at university as an undergraduate in the 1980s, it was several volumes large and my first task for my tutor (one Dr Dave J Thompson) was to look up all of his citations (thankfully not too many at that time). Within a couple of years, the database was digitised and available on CD, and eventually became accessible through the internet.

Citations provide an idea of how the literature is used. This can be used to assess not simply the productivity of a scientist (i.e. number of publications), but their usefulness to others (i.e. number of citations).

Clearly, the Science Citation Index cost some money to put together, and universities paid annually to receive the books, and then the CDs and finally to access the online versions. They were a not for profit organisation, but eventually they were bought out.

23.5 For-profit databases vs Open Source

Why should we be concerned about for-profit databases? We rely on four central tenets of science: rigour, independence, transparency and reproducibility. The for-profit takeover of popular databases has meant that using them compromises the independence of our research - especially when the for-profit owners are also the owners of publishing companies. There is the potential for these large corporates to promote the content of their own journals and suppress the content of rival companies. There is also the potential for them to distort the metrics that they espouse, and evidence of this exists already.

We need to be careful about the independence of our databases, in just the same way that we need to be confident with the independence of our own data. The encouraging start of an Open Source literature database, OpenAlex, should be the start to a new era in independent literature databases.

23.6 Administrators want metrics

But it wasn’t just scientists that were interested in the databases of literature. Administrators became interested in the metrics that the databases produced on academics so that they could use them as a measure of performance, or potential in the case of hiring academics. The obsession with simplifying metrics by administrators who don’t understand the work that their academics produce, has driven the importance and proliferations of literature databases. However, it is widely acknowledged that the way in which journal metrics are used is unhealthy, and this has led to the San Francisco Declaration on Research Assessment.

In the life sciences, there are three prominent databases used: Google Scholar, Web of Science and Scopus. But new players are entering the market all the time, and two of the newest are Dimensions and Microsoft Academic.

23.7 Searching by scientist’s name

Once you have found a scientist who works within your field and produces influential work that you are keen to read, then you may want to search for all of their other papers to determine whether there is more literature that you might have missed. The databases make such searches easy, but you will need the (correct) spelling of your scientist’s name, and also to be aware that they may have used different iterations of their name throughout their career.

23.7.1 Overcoming the difficulties of common names

While Dave Thompson could always add his middle initial to specify his work, as time went by there were more and more scientists publishing as D J Thompson. This made it more difficult for administrators and scientists alike. This has begun to be a bigger global problem, such that the databases have begun to use profiles of particular scientists that individuals can ‘own’ and curate. Each database has its own way of doing this, and it can become rather a chore so that not many scientists have their own profiles, which makes it a problem, particularly for those with common names.

The ORCID number has been erected to provide a common platform for authors to curate, and now some journals and funding authorities won’t allow you to submit or apply without one.

23.8 Google Scholar, Web of Science, Dimensions, Scopus or OpenAlex?

Have you ever wondered why Google Scholar (GS) scores are so inflated compared to other citation databases like Web of Science (WoS) or Scopus? I’ve always noticed that Scopus has better coverage than WoS, and that GS is bigger than both (and a lot messier with lots of weird duplicates and poorly entered stuff), but is there anything more to it than that?

According to Martín-Martín et al. (2018), who analysed some 2.5 million citations, the GS score is likely to be inflated - 1.90 for GS/WoS and 1.45 for GS/Scopus. If you deviate from this with a higher score, you can give yourself a pat on the back for having work that’s reaching more people in more parts of the world. Google Scholar results reflect a more inclusive citation index (Martin-Martin et al., 2018). While WoS and Scopus aren’t exclusively English or journal publications, they are mostly. But that extra third that GS gives you allows you to show the extra scope that your work is getting outside that English journal mainstream. Is your GS score more than a third higher than your WoS or Scopus score? If yes, then your work is having a greater impact elsewhere in the world, and there’s nothing wrong with that. Perhaps GS is the better one to cite as it’s a more inclusive index: more inclusive of different document types and different languages.

However, it is worth noting that of the three indices, Google Scholar seems to be the easiest to manipulate. Delgado López‐Cózar et al (2014) show how it is possible to manipulate citations and therefore metrics of individual researchers by depositing false documents into institutional repositories (note that they did this as an experiment, and you may find that restrictions on doing this at your own institution are prohibitive or lax). The authors also note that a consistent problem with Google Scholar is the lack of transparency, and therefore verification, of how the bibliometrics are calculated. This will make a difference to what papers are found, as Google Scholar tends to report those with the greatest citations first on the first page of outputs. Hence once a paper is initially boosted by manipulation, it will be more likely to be found and increasingly cited.

To directly compare your citation metrics from different platforms, and also to get more unbiased metrics than these platforms provide, consider using the software Publish or Perish (Harzing, 2007). This also provides a great platform on which to harvest data from Google Scholar into a database, or put it directly into your own database, like EndNote, Zotero or Mendeley.

We anticipate future studies that will elucidate how Google Scholar, Web of Science, and Scopus differ from Dimensions and OpenAlex. We already know that larger databases should have potential for larger citation numbers. Because OpenAlex is Open Source, this means that you have the potential to build your own API to generate any metric that you are interested in completely free of charge. The prestigious French Sorbonne has been the first university to abandon their subscription to for-profit literature databases (Jack, 2023), and it will be interesting to see how OpenAlex will transform the scene of literature databases in the future.

References

Brembs B, Huneman P, Schönbrodt F, Nilsonne G, Susi T, Siems R, Perakakis P, Trachana V, Ma L, Rodriguez-Cuadrado S. 2021. Replacing academic journals. Zenodo. DOI: 10.5281/zenodo.5526635.

Delgado Lopez-Cozar E, Robinson-Garcia N, Torres-Salinas D. 2014. The Google scholar experiment: How to index false papers and manipulate bibliometric indicators. Journal of the Association for Information Science and Technology 65:446–454. DOI: 10.1002/asi.23056.

Garfield E. 1955. Citation Indexes for Science. Science 122:108–111. DOI: 10.1126/science.122.3159.108.

Harzing AW. 2007. Publish or Perish.

Hook DW, Porter SJ, Herzog C. 2018. Dimensions: Building Context for Search and Evaluation. Frontiers in Research Metrics and Analytics 3.

Jack A. 2023. Sorbonne’s embrace of free research platform shakes up academic publishing. Financial Times.

Martin-Martin A, Orduna-Malea E, Thelwall M, Lopez-Cozar ED. 2018. Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of informetrics 12:1160–1177. DOI: 10.1016/j.joi.2018.09.002.

Priem J, Piwowar H, Orr R. 2022.OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. Available at http://arxiv.org/abs/2205.01833 (accessed December 28, 2023). DOI: 10.48550/arXiv.2205.01833.

Schotten M, El Aisati M, Meester WJN, Steiginga S, Ross CA. 2017. A brief history of Scopus: The world’s largest abstract and citation database of scientific literature. In: Research Analytics. CRC Press, 31–58. DOI: 10.1201/9781315155890.

Singh Chawla D. 2022. Massive open index of scholarly papers launches. Nature. DOI: 10.1038/d41586-022-00138-y.

Van Noorden R. 2018. Science search engine links papers to grants and patents. Nature. DOI: 10.1038/d41586-018-00688-0.