Opening the Black-Box of Peer Review

This paper investigates the impact of referee behaviour on the quality and efficiency of peer review. We focused on the importance of reciprocity motives in ensuring cooperation between all involved parties. We modelled peer review as a process based on knowledge asymmetries and subject to evaluation bias. We built various simulation scenarios in which we tested different interaction conditions and author and referee behaviour. We found that reciprocity does not per se have a positive effect on the quality of peer review, as it may increase evaluation bias. It has a positive effect only when reciprocity motives are inspired by disinterested standards of fairness.

Peer review is a cornerstone of science. The process allows scientists to experimentally pursue new lines of research through a continuous, decentralised and socially shared process of trial and error and ensures the quality of knowledge produced. Whether directly or indirectly, peer review determines how the resources of the science system (including funding, positions, and reputation) are allocated. Despite its importance, peer review remains dramatically under-investigated (e.g., Campanario 1998a, 1998b; Godlee and Jefferson 2003; Kassirer and Campion 1994). Certain authors have argued, with little supporting evidence, that it is nothing but a "black-box" (Horrobin 2001) and that it has no "experimental base" (Smith 2006).

1.2 One of the main challenges is to understand referee behaviour and how to increase commitment and reliability for all parties (e.g., Squazzoni 2010; Squazzoni and Takács 2011). While journal editors and submitting authors can benefit from reputational rewards, understanding the incentives and motivations of referees is more difficult. This is not a trivial matter, either, as it has been recently acknowledged that referees are dramatically overexploited, a fact which could undermine their commitment to the process (e.g., Neff and Olden 2006; Ware 2007). A recent survey estimated that more than 1 million journal articles per year are subjected to peer review, not to mention the innumerable conference proceedings, research proposals, fellowships and university-, department-, and institute-wide productivity evaluations (Björk, Roos and Lauri 2009). Serious doubt about the possibility of peer review continuing in its present form has even appeared in the influential columns of Science (e.g., Alberts, Hanson and Kelner 2008).

1.3
Recent cases of misconduct and fraud have contributed to calls for a reconsideration of the rigour and reliability of the peer review process. In 1997, the editors of the British Medical Journal asked referees to spot eight errors intentionally inserted in a submission. Out of 221 referees, the median number spotted was two (Couzin 2006).
Another example was the stem cell scandal, which can be traced back to a group of scientists from South Korea who published an article in Science in 2005 that was based on falsified data. The myopic attitudes of certain editors, influenced by "aggressively seeking firsts", and of nine referees dazzled by the novelties of the paper, meant that review time was dramatically shortened: the referees took just 58 days to recommend publication of the article, compared with the average of 81 days typical for this influential journal (Couzin 2006). More recently, the Stapel scandal, in which data for numerous studies conducted over a period of 15-20 years and published in many top journals in the field of psychology were found to have been fabricated (Crocker and Crooper 2011), gained public notoriety in newspapers and on social media. It is worth noting, first, that these cases have caused a misallocation of reputational credit in the science-publishing ecosystem, with negative consequences for competition and resource allocation. Second, such cases also carry serious consequences for the credibility of science in the perceptions of external stakeholders.

1.4
It is worth mentioning that these problems have recently been addressed in Science, where Alberts, Hanson and Kelner (2008) have suggested the need to subject the peer review process itself to a serious review in order to improve its efficiency and guarantee its sustainability. All current attempts at reform, however, which have insisted on the importance of referee reliability and the need for measures to improve said reliability in particular, have followed a trial and error approach which is unsupported by experimental investigation. Although some 'field experiments' concerning peer review have been performed by certain journals or funding agencies (e.g., Jayasinghe, Marsh and Bond 2006; Peters and Ceci 1982), it is widely acknowledged that sound experimental knowledge would be needed concerning essential peer review mechanisms; such knowledge would lend much needed support to any prescribed policy measures (e.g., Bornmann 2011).

1.5
A select few studies in behavioural sciences have attempted to understand the behaviour of the figures involved and the consequences of said behaviour for the quality and efficiency of the evaluation process. One of the few topics studied in depth has been the reviewing rate. Engers and Gans (1998) have suggested, for instance, a standard economic analytic model which examined the interaction of editors and referees. They aimed to understand why referees accepted the responsibility of ensuring sound quality in reviewing without receiving any material incentives and whether improving this latter point would act to increase the reviewing rate. They showed that payment could potentially motivate more referees to review author submissions, although raising the review rate could lead referees to underestimate the negative impact of their refusal, as they could come to believe that other referees had readily accepted a given work. This could in turn motivate journals to increase the payment given to reviewers in an effort to compensate for this effect while reducing the need for referees to incur private costs in order to enhance the quality of reviewed works. Finally, this could contribute to an escalation of compensation which would eventually prove unsustainable for journals.

1.6
Chang and Lai (2001) also studied reviewing rates and arrived at different conclusions. They suggested that, when reciprocity is present as a motive influencing the relationship between journal editors and referees, by providing room for reputation building for referees, the referee recruitment rate could significantly increase. They showed that this effect could significantly improve the review quality if accompanied by material incentives. This finding was confirmed by Azar (2008), who studied the response time of journals. He suggested that shorter response times of journals in specific communities were attributable to the strength of social norms, especially the mutual respect of good standards of evaluation, to which referees were extremely sensitive (see also Ellison 2002).

1.7
The importance of social norms in peer review has been confirmed by recent experimental findings (Squazzoni, Bravo and Takács 2013). These results have shown that indirect reciprocity motives, as opposed to material incentives, can increase the commitment and level of reliability of referees. By manipulating incentives in a repeated investment game which had been modified to mirror peer review mechanisms, the authors found that allocating material incentives to referees undermined pro-social motivations without generating higher evaluation standards. These results are in line with game theory-oriented experimental behavioural studies which have acknowledged the importance of reciprocity, both direct and indirect, in facilitating cooperation in situations of information asymmetries and potential cheating temptations, which could represent a typical situation of the peer review process (e.g., Bowles and Gintis 2011; Gintis 2009).

1.8
It is therefore possible to argue that referees would cooperate with journal editors in ensuring the quality of evaluation, as they are invested in protecting the prestige of the journal as a means of protecting their own impact; this is especially so in cases where the reviewer has previously published work in the target journal. On the other hand, reviewers could also be motivated to cooperate with authors (cooperation here meaning the providing of fair evaluation and constructive feedback), as they are interested in establishing good standards of reviewing as a potential benefit when they are themselves subject to the reviewing process as authors. In considering peer review as a cooperation problem, referees would pay a significant cost (i.e., the time and effort needed to conduct a review) to generate a considerable benefit to authors (i.e., publications, citations, and a higher academic reputation) in order to protect the quality of peer review as a public good, from which they expect to benefit themselves in the future.

1.9 Our paper is an attempt to contribute on this point by proposing a modelling approach (e.g., Martins 2010; Roebber and Schultz 2011; Thurner and Hanel 2011; Allesina 2012). Empirical research encounters serious problems when attempting to consider essential aspects of peer review and when investigating complex mechanisms of interaction. Following Squazzoni and Gandelli (2012), we have modelled a population of agents interacting as authors and referees in a competitive and selective science system. We extended the previous model to understand the impact of various agent strategies on the quality and efficiency of peer review and to test the influence of reciprocity between authors and referees.

1.10 The structure of this paper is as follows. In the second section, we introduce the model. In the third section, we present various simulation scenarios and the simulation parameters. In the fourth, we illustrate our simulation results, while in the concluding section we present a summary of results and highlight certain implications germane to the debate surrounding peer review.
The Model [1]

2.1 We assumed a population of N scientists (N = 200) and randomly assigned each to one of two roles: author or referee. The task of an author was to submit an article with the goal of having it accepted for publication. The task of a referee was to evaluate the quality of author submissions. As informed by the referees' opinion, only the best submissions were published (i.e., those exceeding the publication rate p). [2]

2.2 We gave each agent a set of resources which were initially homogeneous (R_α(0)). Resources were a proxy of academic status, position, experience, and scientific achievement. The guiding principle was that the more scientists published, the more resources they had access to, and thus the higher their academic status and position.
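The role assignment described in 2.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the even split between authors and referees is an assumption (suggested by the 1-to-1 matching the model uses), and the function name `assign_roles` is hypothetical.

```python
import random


def assign_roles(n_agents: int, rng: random.Random) -> list[tuple[int, int]]:
    """Return (author_id, referee_id) pairs that use every agent exactly once.

    Assumption: the population is split evenly, so each author is matched
    with exactly one referee.
    """
    if n_agents % 2:
        raise ValueError("an even population is assumed so everyone can be paired")
    ids = list(range(n_agents))
    rng.shuffle(ids)                      # random role assignment
    half = n_agents // 2
    authors, referees = ids[:half], ids[half:]
    return list(zip(authors, referees))   # random 1-to-1 matching


pairs = assign_roles(200, random.Random(42))
print(len(pairs))  # 100
```

Re-running the assignment at each simulation step with the same `rng` gives a fresh random split, so an agent can be an author in one step and a referee in the next.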

2.3
We assumed that resources were needed both to submit and review an article. With each simulation step, agents were endowed with a fixed amount of resources F, equal for all (e.g., common access to research infrastructure and internal funds, availability of PhD students, etc.). They then accumulated resources according to their publication score.

2.4
We assumed that the quality of submissions μ varied and was dependent on agent resources. Each agent had resources R_α ∈ ℕ, from which we derived an expected submission quality as follows:

$$\mu = \frac{v R_\alpha}{v R_\alpha + 1} \qquad (1)$$

where v indicated the velocity at which the quality of the submission increased with the increase of author resources. For instance, this means that for v = 0.1 each agent needed R_α = 10 to reach a medium-sized quality submission (μ = 0.5).
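As a small numerical check of the relation between resources and expected quality, the sketch below assumes a saturating form, μ = vR/(vR + 1). This form is an assumption here; it is chosen because it reproduces the worked example in the text (v = 0.1 and R = 10 give μ = 0.5) and keeps μ below 1. The function name is hypothetical.

```python
def expected_quality(resources: float, v: float = 0.1) -> float:
    """Expected submission quality mu for a given resource level.

    Assumed form: mu = v*R / (v*R + 1), which rises with resources
    and saturates below 1.
    """
    if resources < 0:
        raise ValueError("resources must be non-negative")
    x = v * resources
    return x / (x + 1.0)


print(expected_quality(10))  # 0.5
```

Under this form, quality gains flatten as resources grow: doubling resources from 10 to 20 raises μ from 0.5 to about 0.67, not to 1.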

2.5
We assumed that authors varied in terms of the quality of their output depending on their resources. More specifically, the quality of submissions by authors followed a normal distribution N(μ, σ), with a standard deviation σ which varied in proportion to agent resources. This means that, with some probability, top scientists could write average or low quality submissions, and average scientists had some chance to write good submissions.
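The stochastic step can be illustrated with a short sketch. The value of σ is passed in as a parameter because its exact proportionality to resources is not reproduced here, and clipping the draw to [0, 1] is an assumption made only to keep quality on the same scale as μ.

```python
import random


def draw_quality(mu: float, sigma: float, rng: random.Random) -> float:
    """Draw the realised quality of one submission from N(mu, sigma).

    Assumption: the draw is clipped to [0, 1] so that it stays on the
    same scale as the expected quality mu.
    """
    return min(1.0, max(0.0, rng.gauss(mu, sigma)))


rng = random.Random(0)
top = [draw_quality(0.8, 0.15, rng) for _ in range(5)]      # a "top" scientist
average = [draw_quality(0.5, 0.15, rng) for _ in range(5)]  # an "average" one
print(top, average)
```

With overlapping distributions like these, some draws for the average scientist will exceed some draws for the top scientist, which is the effect the paragraph describes.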

2.6
We assumed that successful publication multiplied author resources by a value M, which varied between 1.5 for less productive published authors and 1 for more productive published authors. We assigned a heterogeneous value of M after various explorations of the parameter space. This was seen as mimicking reality, where publication is crucial in explaining differences in scientists' performance, but is more important for scientists at the initial stages of their academic careers and cannot infinitely increase for top scientists.

2.7
If not published, following the "winner takes all" rule characterizing science, we assumed that authors lost all resources invested prior to submitting. This meant that, at the present stage, we did not consider the presence of a stratified market for publication, where rejected submissions could be submitted elsewhere, as happens in reality (e.g., Weller 2001).

2.8
The chance of being published was determined by evaluation scores assigned by referees. The value of author submissions was therefore not objectively determined (i.e., it did not perfectly mirror the real quality of submissions), but was instead dependent on the referees' opinion. We assumed that reviewing was a resource-intensive activity and that agent resources determined both the agent's reviewing quality and the cost to the reviewer (i.e., time lost for publishing their own work). The total expense S for any referee was calculated as follows:

$$S = \frac{1}{2} R_r \left[ 1 + (Q_\alpha - \mu_r) \right] \qquad (2)$$

where R_r was the referee's resources, Q_α was the real quality of the author's submission and μ_r was the referee's expected quality. This last was calculated as in equation (1). It is worth noting that, when selected as referees, agents not only needed to allocate resources toward reviewing but also potentially lost additional resources as a result of not being able to publish their own work in the meantime.
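A hedged sketch of the reviewing expense follows. It assumes a linear form in which a referee spends half of their resources when the submission's real quality equals the referee's own expected quality, and proportionally more or less as the submission is better or worse. The function and parameter names are illustrative.

```python
def reviewing_expense(referee_resources: float,
                      submission_quality: float,
                      referee_expected_quality: float) -> float:
    """Resources a referee spends on one review.

    Assumed linear form: S = 0.5 * R_r * (1 + (Q_a - mu_r)).
    With both qualities in [0, 1], S stays between 0 and R_r.
    """
    gap = submission_quality - referee_expected_quality
    return 0.5 * referee_resources * (1.0 + gap)


# Same quality as the referee's own expected output: half of the resources.
print(reviewing_expense(20.0, 0.5, 0.5))  # 10.0
```

Because the expense scales with R_r, a resource-rich referee loses more in absolute terms than a resource-poor one reviewing the same submission.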

2.9
We assumed that authors and referees were randomly matched 1 to 1 so that multiple submissions and reviews were not possible and the reviewing effort was equally distributed among the population. We assumed that reviewing expenses grew linearly with the quality of authors' submissions. We assumed that, if referees were matched with a submission of a quality close to a potential submission of their own, they allocated 50% of their available resources toward reviewing. They spent fewer resources when matched with lower quality submissions, more when matched with higher quality submissions. Reviewing expenses, however, were proportionally dependent on agent resources, meaning that top scientists would be expected to spend less time reviewing in general, as they have more experience and are better able to evaluate sound science than are average scientists. They would nevertheless lose more resources than average scientists, because their time is more valuable.

http://jasss.soc.surrey.ac.uk/16/2/3.html

2.10
We assumed two types of referee behaviour, namely reliable and unreliable. Reliability was here taken to connote the ability of referees to provide a consistent and unequivocal opinion which truly reflected the quality of the submission. In the case of reliability, referees did the best they could to provide an accurate evaluation and spent all needed resources for reviewing. In this case, we assumed a normal distribution of the referees' expected quality and a narrow standard deviation of their evaluation score from the real value of the submission (σ = R_α/100). This meant that the evaluation scores by reliable referees were likely to approximate the real value of author submissions.

2.11
We also assumed, however, that in the case of referee reliability there was a chance for some evaluation bias (b = 0.1), and that b increased in proportion to the difference between referees' expected quality and author submission quality. This step was undertaken to represent the knowledge and information asymmetries between authors and referees which characterize peer review in science. To measure the quality of peer review, we measured the percentage of errors made by referees by calculating the optimal situation, in which submissions were published according to their real value, and by measuring the discrepancy with the actual situation in each simulation step (see evaluation bias in Tables 2, 3 and 4).
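The error measure can be sketched as a comparison between two rankings: the submissions that would be published if real quality were observable, and those actually published on the basis of referee scores. The normalisation used below (share of published slots filled by submissions outside the optimal set) is an assumption, as are the function and variable names.

```python
def evaluation_bias(true_quality: list[float],
                    scores: list[float],
                    publication_rate: float) -> float:
    """Percentage of published submissions that the optimal ranking would reject."""
    n = len(true_quality)
    if len(scores) != n:
        raise ValueError("one score per submission is required")
    k = round(n * publication_rate)       # number of publication slots
    if k == 0:
        return 0.0

    def top_k(values: list[float]) -> set[int]:
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        return set(order[:k])

    optimal = top_k(true_quality)         # published by real value
    actual = top_k(scores)                # published by referee opinion
    return 100.0 * len(actual - optimal) / k


quality = [0.9, 0.8, 0.2, 0.1]
scores = [0.9, 0.1, 0.8, 0.2]             # second underrated, third overrated
print(evaluation_bias(quality, scores, 0.5))  # 50.0
```

In the example, one of the two published submissions displaces a better one, so half of the publication decisions are errors.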
2.12 In the case of unreliability, referees fell into type I and type II errors: recommending submissions of low quality to be published or recommending against the publishing of submissions which should have been published (e.g., Laband and Piette 1994). More specifically, unreliable referees spent fewer resources than did reliable referees (s = 0.5), and under- or over-estimated author submissions (see the parameters u and o, respectively, in Table 1). To avoid the possibility that referees assigned the real value to submissions by chance we assumed that, when they underrated a submission, the evaluation score deviated by approximately -90% from the real quality of the submission (u = 0.1). The opposite sign was assigned in the case of overrating (i.e., +90%, or o = 1.9).

2.13 It is worth noting that certain empirical studies have shown that these types of errors are more frequent than expected, especially in grant applications (e.g., Bornmann and Daniel 2007; van den Besselaar and Leydesdorff 2007). Bornmann, Mutz and Daniel (2008), for instance, examined EMBO selection decisions and found that 26-48 percent of grant decisions showed such errors, with underrating being more frequent (occurring in 2/3 of cases). A general estimate of the percentage of errors, for peer review in journals in particular, which could have been used to calibrate the model, was unfortunately not available.

2.14 Finally, all simulation parameters are shown in Table 1. At the beginning of the simulation agent resources were set to 0 for all (R_α(0) = 0). At the first tick, 50% of agents were published randomly. Subsequently, everyone had a fixed amount of resources F for each tick. When selected as authors, agents invested all available resources in conducting research and producing a good submission (i = 1) (see the next section for some manipulation of this parameter). If accepted for publication, author agents had their resources multiplied by M ∈ [1, 1.5], as explained in equation (3), and so their resources grew accordingly. This meant that the quality of their subsequent submission was presumably higher.
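The two referee behaviours can be contrasted in a short sketch. Reliable referees return the real quality plus narrow Gaussian noise; unreliable referees multiply it by u = 0.1 or o = 1.9. The probability of underrating (`p_under`), the fixed `sigma` and the omission of the bias term b are simplifying assumptions, not values from Table 1.

```python
import random


def referee_score(quality: float, reliable: bool, rng: random.Random,
                  sigma: float = 0.05, u: float = 0.1, o: float = 1.9,
                  p_under: float = 0.5) -> float:
    """Evaluation score assigned to a submission of the given real quality.

    Assumptions: sigma is a fixed illustrative value, and an unreliable
    referee underrates with probability p_under, otherwise overrates.
    """
    if reliable:
        return rng.gauss(quality, sigma)          # close to the real value
    factor = u if rng.random() < p_under else o   # about -90% or +90%
    return quality * factor


rng = random.Random(3)
print(referee_score(0.5, True, rng))
print(referee_score(0.5, False, rng))
```

Because u and o are far from 1, an unreliable score can never coincide with the real value by chance, which is the purpose of the ±90% choice described above.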

3.1
We built various simulation scenarios to test the impact of referee behaviour on the quality and efficiency of the peer review process. By quality, we meant the ability of peer review to ensure that only the best submissions were eventually published (e.g., Casati et al. 2009). This was a restrictive definition of the various functions that peer review fulfils in the sciences. Here we considered only the screening function. Neither the role of peer review in helping authors add value to their submission via referee feedback (e.g., Laband 1990) nor its role in deciding the reputation of journals and their respective position in the market were considered here (e.g., Bornmann 2011). By efficiency, we meant the ability of peer review to achieve quality by minimizing the resources lost by authors and the expenses incurred by referees.

3.2
In the first scenario, called "no reciprocity", we assumed that agents had a random probability of behaving unreliably when selected as referees; this probability was constant over time and was not influenced by past experiences. When selected as authors, agents invested all available resources in publication (i = 1), irrespective of positive or negative past experiences with the submission and review process. In this case, there was no room for reciprocity strategies between authors and referees. In the second scenario, called "indirect reciprocity", we assumed that agents were influenced by their past experiences as authors when selected as referees. If their previous submission had been accepted for publication, they reciprocated by providing reliable evaluations when selected as referees. Note that in this case, agents were self-interested and did not consider the pertinence of the referee evaluation, only the publication success or failure of their previous submission. This meant that they reciprocated negatively if they experienced rejection and positively when they had been successfully published, even if they knew that their submission was not worthy of publication.

3.3
In the third scenario, called "fairness", author agents formulated a pertinent judgment of the referee evaluation of their submission. They measured the fairness of the referee's opinion by comparing the real quality of their submission and the evaluation score received from the referee. If the referee evaluation approximated the real value of their submission (i.e., ≥ -10%), they concluded that the referee was reliable and had done a good job. In this case, when selected as referees, agents reciprocated positively irrespective of their past publication or rejection history. This meant that indirect reciprocity was now not based on the pure self-interest of agents but on normative standards of conduct.
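The three referee rules described so far can be summarised in one decision function. This is a sketch under stated assumptions: the `Experience` record and all names are hypothetical; agents without any history are assumed to behave randomly; `p_unreliable` is an illustrative value; and the fairness threshold reads "≥ -10%" literally, as a score no more than 10% below the real quality (a symmetric tolerance would be an equally plausible reading).

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experience:
    """What an agent remembers from its last submission as an author."""
    published: bool
    quality: float   # real quality of the submission
    score: float     # evaluation score received from the referee


def referee_is_reliable(scenario: str, experience: Optional[Experience],
                        rng: random.Random, p_unreliable: float = 0.5,
                        tolerance: float = 0.10) -> bool:
    """Decide whether an agent reviews reliably in the given scenario."""
    if scenario == "no_reciprocity" or experience is None:
        return rng.random() >= p_unreliable            # history is ignored
    if scenario == "indirect_reciprocity":
        return experience.published                    # reward acceptance only
    if scenario == "fairness":
        # Reliable if the past evaluation was judged pertinent,
        # whatever the publication outcome.
        return experience.score >= experience.quality * (1.0 - tolerance)
    raise ValueError(f"unknown scenario: {scenario}")


rng = random.Random(0)
rejected_but_fairly_judged = Experience(published=False, quality=0.5, score=0.48)
print(referee_is_reliable("indirect_reciprocity", rejected_but_fairly_judged, rng))  # False
print(referee_is_reliable("fairness", rejected_but_fairly_judged, rng))              # True
```

The example agent shows where the two reciprocity rules diverge: a fairly judged rejection leads to unreliable reviewing under indirect reciprocity but to reliable reviewing under fairness.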

3.4
The final two scenarios, "self-interested authors" and "fair authors", extended the previous two scenarios by examining author behaviour in conjunction with reviewer behaviour. In the "self-interested authors" scenario, we assumed that authors reacted positively and continued to invest all available resources into their next submission when published (i = 1). In the case of rejection, they reacted negatively and invested fewer resources in subsequent attempts at publication (i = 0.1). This reaction was independent from the pertinence of the referee evaluation. In the "fair authors" scenario, in cases in which the agent had received a pertinent referee evaluation when themselves an author, they reinforced their confidence in the quality of the evaluation process and continued to invest heavily in producing quality submissions irrespective of the fate of their submission. In the case of non-pertinent evaluation (see above), they invested less in the subsequent attempt at publication (i = 0.1) and accumulated resources for the subsequent round irrespective of their previous publication. In this case, agents therefore inferred the overall situation of peer review standards through their experiences as authors and consequently acted in a strategic manner.

Results

4.1
Table 2 shows the impact of referee behaviour on the quality and efficiency of peer review under various conditions of the publication rate (25%, 50%, and 75% of published submissions). Data were averaged over 200 simulation runs in each parameter condition. First, results showed that the reciprocity motives of referees did not have per se a positive effect on the quality and efficiency of peer review when the publication rate was more competitive. Second, when the publication rate was higher the quality of the peer review process improved only minimally, but at the expense of referees' resources. Increased competitiveness in general implied increasing evaluation bias, but "fairness" implied lower bias and fewer resources lost by authors, although reviewing expenses were generally higher. Furthermore, it ensured greater robustness to the changes in competition pressures. On the other hand, indirect reciprocity without fairness by authors implied higher evaluation bias and higher resource loss when the publication rate diminished.

4.2
It is worth noting that, in order to calculate the resource loss, we calculated the amount of resources wasted by (unpublished) authors compared with the optimal solution, i.e., where only the best authors were published. To calculate the reviewing expenses we measured the resources spent by agents for reviewing compared with the resources invested by submitting authors.
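Both efficiency measures can be sketched as simple ratios. The exact normalisation is not reproduced here, so the code below is one possible reading: resource loss is taken as the share of all invested resources that belonged to authors who were rejected although the optimal solution would have published them, and reviewing expenses as referee spending relative to author investment. All names are illustrative.

```python
def resource_loss_pct(invested: list[float],
                      published: list[bool],
                      optimal: list[bool]) -> float:
    """Percentage of invested resources wasted by wrongly rejected authors."""
    total = sum(invested)
    if total == 0:
        return 0.0
    wasted = sum(r for r, pub, opt in zip(invested, published, optimal)
                 if opt and not pub)
    return 100.0 * wasted / total


def reviewing_expenses_pct(review_spent: list[float],
                           author_invested: list[float]) -> float:
    """Resources spent on reviewing as a percentage of author investment."""
    total = sum(author_invested)
    if total == 0:
        return 0.0
    return 100.0 * sum(review_spent) / total


invested = [10.0, 20.0, 30.0, 40.0]
published = [True, False, True, False]
optimal = [False, False, True, True]      # the fourth author deserved publication
print(resource_loss_pct(invested, published, optimal))            # 40.0
print(reviewing_expenses_pct([5.0, 10.0, 15.0, 20.0], invested))  # 50.0
```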

4.3
Table 3 shows the impact of the reciprocal behaviour of authors as the publication rate varies. Results showed that the reciprocity of authors improved peer review only when associated with fair criteria applied to the judgment of their submission. When authors reacted to referee evaluation only in relation to their self-interest (i.e., eventually being published), the quality and efficiency of peer review drastically declined. Moreover, in the case of authors' fairness, peer review dynamics improved even with increased competition.

Figure 1. The impact of agent behaviour on system resource accumulation in weakly selective environments (75% of published submissions). The x-axis shows the number of the simulation run.

based on a highly abstracted model and thus every conclusion should be considered with caution. For instance, peer review in reality is not equally distributed among the population, and editors are of course also important in providing room for reputation building and reciprocity motives in referees. These are certainly points which can inform future research efforts.

5.4
A crucial challenge facing any future work, however, will be an effort to address the gap between theory and empirical observation (e.g., Watts and Gilbert 2011). Although it is difficult to obtain empirical data which point to agent behaviour affecting peer review, especially at the scale needed to examine general aspects of the process, one possible means of development could be to empirically test referee behaviour in highly representative journals.

5.5
Certain empirical measures which have already been developed could be applied to test our findings. Laband (1990), for instance, examined referee reliability by measuring the lines of the report text sent to submitting authors, assuming that the longer the text, the higher the quality of the referee comments and the more reliable the final score assigned to the submissions. Although it is difficult to measure peer review effectiveness (e.g., Jefferson, Wager and Davidoff 2002), this is a brilliant idea for building an ex-ante measure which could complement the most common ex-post measures of peer review validity, such as citation indices or the fate of rejected submissions (e.g., Weller 2001). An alternative would be to exploit, where available, the ratings of referees assigned by journal editors as a proxy of the quality of reviews or to indirectly derive such measures by considering the number of reports in which a given referee was involved, assuming that the more often a referee was involved in the peer review process by a journal editor the higher the quality of his/her reviewing work.

5.6
Let us suppose that we can select a set of representative journals, possibly drawn from different scientific communities and audiences, and that we have access to both the list of referees and authors and to the referee reports. Let us suppose we apply one of the measures mentioned above to assess the ex-ante validity of peer review and measure the ex-post validity, e.g., by collecting data on subsequent citations of published articles or by analysing the fate of rejected submissions. This would allow us to build a statistical measure of the reliability of referee evaluation. By measuring the link between referees and authors in these journals and looking at the fate of past referee submissions to the journal we could thereby test whether space was given for reciprocity and fairness which might have influenced the quality of the evaluation.

Table 2: The impact of referee behaviour on the quality and efficiency of peer review in various selective environments (values expressed as percentage).

Table 3: The impact of author reciprocal behaviour on the quality and efficiency of peer review in various selective environments (values expressed as percentage).