The consistency of superior face recognition skills in police officers

In recent years, there has been increasing interest in people with superior face recognition skills. Yet identification of these individuals has mostly relied on criterion performance on a single attempt at a single measure of face memory. The current investigation aimed to examine the consistency of superior face recognition skills in 30 police officers, both across tests that tap into the same process and between tests that tap into different components of face processing. Overall indices of performance across related measures were found to identify different superior performers to isolated test scores. Further, different top performers emerged for target-present versus target-absent indices, suggesting that signal detection measures are the most useful indicators of performance. Finally, a dissociation was observed between superior memory and matching performance. Super-recognizer screening programmes should therefore include overall indices summarizing multiple attempts at related tests, allowing for individuals to rank highly on different (and sometimes very specific) tasks.

criterion based on a somewhat arbitrary statistical cut-off is problematic. Although some individuals may simply reach criterion by chance, others, who are genuinely excellent at face recognition, may be "missed." The latter may occur because of fatigue, illness, lifestyle influences, or simple misunderstanding of instructions, all factors that may be overcome by repeated assessment. A similar scenario has been noted at the other end of the face recognition spectrum, where McKone et al. (2011) carried out a second screening session to clarify the diagnoses of six individuals who reported severe everyday difficulties with face recognition. Although these people only achieved borderline impaired scores in an initial assessment, they did fulfil the criteria for prosopagnosia in a second attempt at the test using novel stimuli. In another study, Bindemann, Avetisyan, and Rakow (2012) examined performance consistency in typical participants who completed the same face matching task on three subsequent days. They found that individual participants varied in their overall accuracy scores on each day, giving different responses to the same stimuli across the three attempts. Thus, repeated assessment of performance on the same task may be required to (a) interpret borderline cases and (b) detect not only the most proficient but also the most reliable performers.
Much existing evidence also suggests that an individual's genuine level of performance may differ across face recognition tasks that tap into different subprocesses. For instance, some people may be very good at discriminating between simultaneously presented faces, yet only have average face memory skills. Evidence supporting this possibility comes from the developmental prosopagnosia literature, where dissociations between subcomponents of face recognition have been observed. Although impaired face memory is the hallmark symptom of the condition (Murray, Hills, Bennetts, & Bate, 2018), earlier processes involving the perception of faces can be selectively spared (Bate, Haslam, Jansari, & Hodgson, 2009; Lee, Duchaine, Wilson, & Nakayama, 2010; McKone et al., 2011) or impaired (Bate et al., 2009; Duchaine, Germine, & Nakayama, 2007; for a review, see Bate & Bennetts, 2015).
Interestingly, some small-scale investigations into super recognition have found that facial identity perception (typically assessed via face matching tasks that place no demands on memory) is not always facilitated in individuals with superior face memory skills (Bobak, Bennetts, et al., 2016; Bobak, Dowsett, & Bate, 2016), although it is unclear whether the reverse pattern can be found (i.e., facilitated face matching skills in the context of typical face memory skills). This is because performance on a face memory task (the CFMT+) is typically the sole screening measure for theoretical investigations, and face perception skills have only been reliably assessed in individuals who have passed the initial inclusion criterion.
Importantly, screening procedures that use the CFMT+ alone also ignore another fundamental indicator of face recognition performance: the ability to decide when a target face is absent from an array.
Yet face recognition in the real world, and particularly within policing settings, involves not only the recognition of a target face when it is present within a set of faces but also, importantly, the successful acknowledgement that a particular face is absent. Although top performers should demonstrate heightened performance in both scenarios, some existing evidence indicates variation in target-absent accuracy in SRs who had initially been identified by the CFMT+ (i.e., target-present performance) alone (Bobak, Hancock, et al., 2016). Given that work with typical participants has also failed to find an association between target-present and target-absent face matching performance (McCaffery, Robertson, Young, & Burton, 2018; Megreya & Burton, 2007), inclusion of both measures within a screening test is necessary to provide a complete indicator of top-end face recognition performance.
Finally, most traditional face recognition tasks use tightly controlled facial images that have been stripped of external features that could cue recognition (e.g., Bate, Haslam, Tree, & Hodgson, 2008; Duchaine & Nakayama, 2006; McKone et al., 2011). However, some authors suggest that this adjustment reduces ecological validity by failing to replicate the immense variability that typically occurs between different images of the same face in everyday life (Young & Burton, 2017). In fact, the matching of two unfamiliar faces of the same identity is a notoriously difficult task (e.g., Jenkins, White, Van Montfort, & Burton, 2011; Young & Burton, 2017), even when external features are present and the two images have been collected on the same day (e.g., Bruce et al., 1999). The task becomes even more difficult when images have been captured on different days, and in this instance, the inclusion of extra-facial features can serve to further increase variability between naturalistic images (e.g., where the target has changed hairstyle, grown facial hair, or is wearing alternative make-up). For example, Kramer and Ritchie (2016) examined the influence of glasses on face matching performance. They found that typical participants incorrectly categorized more same-identity pairs when glasses were worn in only one image, compared with pairs where they were worn in both or neither image. Embracing real-world variability in facial presentation may therefore not only be an important means of replicating real-world policing scenarios (particularly where individuals may deliberately attempt to disguise their identity) but may also enhance the difficulty of face recognition tasks, ensuring they are appropriately calibrated for the detection of top performers.
The current study aimed to examine the consistency of superior face recognition skills both across tests that tap into the same process and between tests that assess different processes. We assessed the performance of a group of 30 police officers who had previously been screened for super recognition, surpassing a liberal criterion on at least one of two tests: the CFMT+ and a face matching task. This allowed us to assess face recognition consistency in those with apparent proficiencies in both memory and matching, in addition to those with facilitations in only one of the two processes. All officers completed five tests: a new face memory test that adapted the CFMT+ paradigm to include target-absent trials, three new versions of the face matching task, and a test that requires participants to decide whether a composite target face (generated using a holistic composite system) is present within a simultaneously presented image displaying a crowd of people ("Crowds" task). We included the Crowds test to examine whether proficient face recognition skills, as identified on either of the two preceding types of test, extend to a novel, more real-world policing task. All tests were calibrated to detect performance at the top end of the spectrum (allowing for at least three standard deviations from the control mean), using naturalistic facial images that varied in appearance. Consistency of performance across related tests was considered in terms of the number of times that a participant surpassed criterion performance and by overall index scores.

| Participants
Thirty police officers (10 female, M age = 37.6 years, SD = 7.9) from the United Kingdom took part in this study. These officers had previously been identified as having proficient face recognition skills following a large-scale screening programme carried out by our laboratory (see Data S1). Because we wanted to identify individuals who were proficient at face memory or face matching, these officers had obtained excellent scores on at least one of two tests: the CFMT+ (for full details, see Russell et al., 2009) and a face matching task (the Pairs Matching Test [PMT]; see Bate et al., 2018). Although the CFMT+ is a well-known test, the PMT is a more recent test developed within our laboratory. A detailed description of the latter test can be found in Bate et al. (2018); in brief, the PMT has a similar design to existing face matching tasks (e.g., Burton, White, & McNeil, 2010), but is sufficiently calibrated to detect top performers via single-case statistical comparisons. The task contains 48 (half male) pairs of faces, presented in colour. Half of the trials match in identity, and half are mismatched. Each pair of faces is displayed simultaneously for an unlimited duration, and participants elicit a "same" or "different" response for each pair.
Because each officer only had one attempt at each test, we set the selection cut-off at 1.5 SDs above the control mean (see Data S1).
Although this liberal criterion is lower than that used in previous work, it allowed borderline cases to be included, enabling us to thoroughly examine the importance of repeated testing and performance consistency. Using these cut-offs, 14 officers outperformed controls on both tests, 10 only on the CFMT+, and six only on the PMT. Twenty-eight officers were Caucasian; two were of mixed ethnicity. These individuals perform a wide range of roles within the police force, with 21 having direct contact with the general public. Length of service ranged from 1 to 31 years. Officers participated in this investigation during their normal working hours and did not receive any additional compensation for their time.
Forty (20 female; M = 33.4 years, SD = 10.2) civilian control participants, age-matched to the police participants, also took part in this study. They were randomly selected from Bournemouth University's participant pool, irrespective of their self-perceived face recognition skills. These individuals were offered a small financial incentive to ensure their motivation for the tasks. Ethical approval for the study was granted by the institutional ethics committee.

| Models Memory Test
This new test of face memory is an adaptation of the CFMT+, using naturalistic colour photographs of each individual that have been captured on different days and in different settings (see Figure 1). Images are cropped to display the faces from the neck upwards (image sizes are 8 cm high by 6 cm wide), but no external facial features are removed.
A full description of the Models Memory Test (MMT) can be found in Bate et al. (2018). In brief, the test begins with a similar encoding procedure to the CFMT+: For each of six target faces, three different images of the person (taken on different days and in different settings) are shown sequentially for 3 s and immediately followed by three test trials. Three faces are displayed in each test trial: one of the encoded images and two distractors. As in the CFMT+, the encoding phase terminates with a 20-s review of the six target faces, by simultaneously presenting a new frontal image of each individual.
Ninety test trials (45 target-present) are then presented in a random order, with a screen break at the halfway point. Target-present triads contain one new image of a target face and two matched distractors; target-absent triads contain three distractors that are matched to one of the target faces. Triads in the first half of the test contain images that more closely resemble those used in the encoding phase, whereas those presented after the screen break display the targets under more challenging conditions (e.g., with additional facial hair, or where the face was obscured by accessories or a large change in viewpoint).
Images remain on-screen until a response is made, and no time restriction is imposed. Participants can make a target-present or target-absent response for each trial. Target-present responses were elicited using the corresponding number key (1-3) that indicates the position of the target in the triad, whereas the 0 key represents a target-absent response. Five types of response are possible on this test. For target-present trials, participants can correctly identify the target face (hits), they can incorrectly elicit a target-absent response (misses), or they can incorrectly identify one of the distractor faces (misidentifications). In target-absent trials, participants can elicit the correct response (correct rejections) or incorrectly identify a distractor face (false positives). We recorded each of these responses for each participant and summed the number of hits and correct rejections to calculate an overall accuracy score.
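The five response categories described above can be illustrated with a short scoring sketch. The trial representation used here (a `target_position` of 1-3, or `None` for a target-absent triad, paired with the key pressed) and the function names are assumptions made for illustration; they are not taken from the software used in the study.

```python
from collections import Counter
from typing import Optional


def classify_response(target_position: Optional[int], response: int) -> str:
    """Label a single MMT-style trial.

    target_position: 1-3 for a target-present triad, None for target-absent.
    response: key pressed, 1-3 for a position or 0 for "target absent".
    """
    if response not in (0, 1, 2, 3):
        raise ValueError("response must be 0, 1, 2, or 3")
    if target_position is None:  # target-absent trial
        return "correct_rejection" if response == 0 else "false_positive"
    if response == 0:  # target was present but participant said absent
        return "miss"
    return "hit" if response == target_position else "misidentification"


def score_test(trials):
    """Tally response types; accuracy is (hits + correct rejections) / trials."""
    counts = Counter(classify_response(t, r) for t, r in trials)
    correct = counts["hit"] + counts["correct_rejection"]
    accuracy = 100.0 * correct / len(trials)
    return counts, accuracy


# Hypothetical mini-session of five trials: (target_position, response).
demo_trials = [(2, 2), (1, 0), (3, 1), (None, 0), (None, 2)]
demo_counts, demo_accuracy = score_test(demo_trials)
print(dict(demo_counts), demo_accuracy)
```

In this made-up session each of the five response types occurs once, so overall accuracy reflects only the single hit and the single correct rejection.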

| Pairs Matching Test
Three new blocks of the PMT (see Data S1 and Bate et al., 2018) were developed for this investigation. These assessed participants' ability to match simultaneously presented pairs of male Caucasian faces when (a) the viewpoint of the face severely changed (i.e., by more than 45°) across the two images, (b) the actor was wearing glasses in only one image, and (c) the actor had facial hair in one image but was cleanly shaven in the other (see Figure 2). Each of these three blocks contained 48 trials: 24 matched in identity, whereas the remainder displayed two different individuals. All images were downloaded from Google image searches and were cropped to display the entire face from the neck upwards. Mismatched faces were paired according to their perceived similarity to each other, and all images were adjusted to 10 cm in width and 14 cm in height. Participants completed the three blocks in a counterbalanced order, and trials were randomized within each block. To ensure ecological validity (i.e., in replicating policing scenarios such as CCTV image matching), stimuli were displayed until responses were made, and no time limit was imposed.
Participants made key presses to elicit "same" or "different" responses.
Scores were calculated in terms of hits (the number of correct "same" responses) and correct rejections (the number of correct "different" responses) and summed for overall accuracy.

| Crowds Matching Test
Our final test aimed to replicate a very specific policing scenario, where officers have a composite target face (generated using EvoFIT: a holistic composite system) and they are required to find this individual in a crowd. A detailed description of this test and the composite generation procedure can be found in Bate et al. (2018) and is also summarized in Supporting Information (see Data S2). In brief, an initial set of participants (see Data S2) generated the target composite stimuli, following a pre-existing procedure (Fodarella, Kuivaniemi-Smith, Gawrylowicz, & Frowd, 2015). This process began with participants freely describing a designated target face (half taken from the crowd images used in the final test and half taken from crowd images that were not used in the final test) in as much detail as possible, without guessing. This information was recorded by the experimenter on a face-description sheet, using feature description labels. An age- and gender-appropriate database was then presented to the participant, displaying the inner region of a series of faces. Participants selected faces that best matched the overall appearance of the target; these faces were combined, and the selection procedure repeated. They then selected the best-matching item and improved it using "holistic" (addressing the age, weight, and overall appearance of the face) and "shape" (addressing the size and position of facial features) tools.
Finally, the best-matching set of external features (hair, ears, and neck) was selected, and participants had a final opportunity to improve the face using the same holistic and shape tools.
Thirty-two composites were selected for the final experiment (see Data S2) and incorporated into 32 trials (16 target-present) where participants simultaneously viewed a target composite face at the top of the screen and an image below that showed 25-40 people in a naturalistic setting (e.g., an audience at a concert or sporting event; see Figure 3). Composite faces measured 3 cm in height and 2 cm in width, and crowd images were 9 cm in height and 13 cm in width. Participants were required to decide whether or not the target face was present in each crowd, pressing a key on the keyboard to make their response. Trials were displayed in a random order, with no time restriction for responses. Hits and correct rejections were calculated and summed for overall accuracy.

| Procedure
The majority of the officers were tested in face-to-face laboratory conditions. However, due to limitations in availability, a minority of individuals (N = 5) completed some or all of their testing

| Statistical analyses
Initial analyses compared the performance of online versus laboratory control participants. As no differences were detected on any measure (all ps > 0.05), data were collapsed across all control participants for subsequent analyses. For all tests, the overall mean and SD scores were calculated for all performance measures, and cut-offs in this phase were set at the usual, more conservative level of 1.96 SDs from the control mean. Because all the tests contained target-present and target-absent trials, these items were also analysed separately, together with relevant signal detection measures (see below for each test). Initial exploration of the data revealed that one officer scored 97.78% correct on the target-present trials of the MMT, but made no correct responses on the target-absent trials. We assumed this individual had misunderstood the task and removed their data from all relevant analyses.
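As a minimal sketch of how a control-referenced cut-off of this kind could be computed, the example below derives the criterion from a set of control scores and lists the officers who exceed it. The scores and identifiers are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def control_cutoff(control_scores, n_sds=1.96):
    """Score lying n_sds sample SDs above the control mean."""
    return mean(control_scores) + n_sds * stdev(control_scores)


def exceeds_cutoff(officer_scores, control_scores, n_sds=1.96):
    """Return the IDs of officers whose score is above the control cut-off."""
    cutoff = control_cutoff(control_scores, n_sds)
    return sorted(oid for oid, score in officer_scores.items() if score > cutoff)


# Hypothetical percentage-correct scores, not data from the study.
controls = [60.0, 65.0, 70.0, 75.0, 80.0]
officers = {"officer_a": 90.0, "officer_b": 84.0, "officer_c": 86.0}

print(round(control_cutoff(controls), 2))
print(exceeds_cutoff(officers, controls))             # conservative 1.96 SD criterion
print(exceeds_cutoff(officers, controls, n_sds=1.5))  # liberal 1.5 SD screening criterion
```

With these invented numbers, one borderline officer passes the liberal 1.5 SD criterion but not the 1.96 SD criterion, which is the kind of case that repeated testing is intended to resolve.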

| Relatedness of tests
The main aim of this investigation was to examine consistency of performance across tests that tap the same process and between tests that measure different processes. Initial analyses therefore collapsed data across SR and control participants and explored the relationship between the experimental tests and the CFMT+. Further, because existing work (e.g., Bate et al., 2018) has indicated differences in target-present and target-absent performance in super recognition, we entered data for each test separately for hits and correct rejections.
Initial eigenvalues from a principal components analysis (PCA) indicated that the first three factors explained 33.57%, 23.39%, and 10.71% of the variance, and the remaining eight factors had eigenvalues that were less than 1. Solutions for two, three, four, five, and six factors were each examined using varimax and oblimin rotations of the factor loading matrix. The five-factor oblimin solution (which explained 83.21% of the variance) was preferred, as it offered the best defined factor structure (see Table 1). The first factor had high loadings from target-present measures: hits on the three blocks of the PMT, hits on the MMT, and overall performance on the CFMT+. The second factor had high loadings from correct rejection scores on the three matching blocks, as well as overall scores from the CFMT+. The third and fourth factors represented hits and correct rejections, respectively, on the Crowds test; the fifth factor had a high loading from correct rejections on the MMT. A full correlation matrix is displayed in Table 2.
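The reported percentages can be related to eigenvalues with a little arithmetic. The sketch below assumes that the PCA was run on the correlation matrix, so that eigenvalues sum to the number of measures entered, and that 11 measures were entered (inferred from the three reported components plus the "remaining eight"). Neither assumption is stated explicitly in the text.

```python
N_MEASURES = 11  # assumed: three reported components plus the remaining eight

reported_variance_pct = [33.57, 23.39, 10.71]  # values quoted in the text


def eigenvalue_from_pct(pct, n_measures=N_MEASURES):
    """Convert % variance explained into an eigenvalue (correlation-matrix PCA)."""
    return pct / 100.0 * n_measures


eigenvalues = [round(eigenvalue_from_pct(p), 2) for p in reported_variance_pct]
cumulative_pct = round(sum(reported_variance_pct), 2)

# A component with an eigenvalue below 1 explains less than 100 / n_measures %.
kaiser_threshold_pct = round(100.0 / N_MEASURES, 2)

print(eigenvalues, cumulative_pct, kaiser_threshold_pct)
```

Under these assumptions the first three components all have eigenvalues above 1, and the preferred five-factor solution also retains two components that each fall below that conventional threshold; this is consistent with the solution having been chosen for its interpretability rather than by an eigenvalue rule alone.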
In sum, this analysis suggests that (a) the two target-present memory measures are related, but target-absent memory performance should be independently considered; (b) the three new blocks of the matching test are related, but target-present and target-absent trials should again be considered independently; and (c) both target-present and target-absent performance on the Crowds test is distinct from all other measures. These findings were used to create appropriate indices that assessed consistency of performance across related and unrelated measures.

| Consistency of face memory performance
Overall percentage correct on the MMT was calculated by summing hits and correct rejections. Norms for each of these measures were set at 1.96 SDs from the control mean (see Table 3). Officers' scores ranged from 53.33% to 95.56% correct, with 14 individuals exceeding the control cut-off. Eleven of these officers had also outperformed [...] Figure 4a).

FIGURE 2 Sample pairs from the three new blocks of the PMT that differ according to (a) pose, (b) glasses, and (c) facial hair. Due to issues with image permissions, this figure only displays images that resemble those used in the actual test. All pairs display faces of the same identity
Because the CFMT+ only contains target-present trials, we reasoned that the discrepancy in the individuals identified by overall performance on each test could result from the inclusion of target-absent trials in the MMT (as also suggested by the PCA). Thus, we examined the consistency of performance between the CFMT+ and just the hits from the MMT (see Figure 4b). Ten officers surpassed the control cut-off [...] Given this variation in target-absent performance, there may be added value in considering correct rejections as a further performance indicator. We explored this issue using signal detection analyses and computed scores of sensitivity (d′) and bias (c) for each individual.
Information from hits and false positives was used to calculate d′, a measure of sensitivity that is free from the influence of response bias (Macmillan & Creelman, 2005). Values for the current test can range from −4.59 (consistently incorrect responding) to 4.59 (perfect accuracy), with a score of 0 indicating chance performance. Response bias is indicated by c and assesses whether the participant has a tendency to make target-present or target-absent responses (Macmillan & Creelman, 2005). Positive scores indicate more conservative responding (i.e., the tendency to make target-absent responses) whereas negative scores represent more liberal decisions (i.e., the tendency to make target-present responses); a score of 0 is a neutral response criterion. All target-present responses (i.e., hits and misidentifications) were included in this analysis, allowing us to calculate a measure of response bias that indexed a tendency to make target-present or target-absent decisions.
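The two signal detection measures can be sketched as follows, using the standard definitions d′ = z(H) − z(F) and c = −[z(H) + z(F)]/2. The text does not state how hit or false-positive rates of 0 or 1 were handled; the log-linear correction used below is an assumption, chosen because with 45 trials of each type it yields extreme d′ values close to the ±4.59 bounds quoted above. Function names are hypothetical.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # inverse of the standard normal CDF


def corrected_rate(count, n_trials):
    """Log-linear correction (assumed): keeps rates away from exactly 0 or 1."""
    return (count + 0.5) / (n_trials + 1)


def d_prime_and_c(hits, n_present, false_positives, n_absent):
    """Return (d', c) from hit and false-positive counts.

    d' = z(H) - z(F): sensitivity, 0 at chance.
    c  = -(z(H) + z(F)) / 2: positive values indicate conservative responding.
    """
    z_hit = _z(corrected_rate(hits, n_present))
    z_fp = _z(corrected_rate(false_positives, n_absent))
    return z_hit - z_fp, -0.5 * (z_hit + z_fp)


# Hypothetical response patterns on a test with 45 target-present and
# 45 target-absent trials.
perfect = d_prime_and_c(45, 45, 0, 45)
chance = d_prime_and_c(20, 45, 20, 45)
cautious = d_prime_and_c(30, 45, 2, 45)
print(perfect, chance, cautious)
```

To index bias in the way described in the text, where misidentifications also count as target-present responses, the first argument could be given as hits plus misidentifications; that usage is likewise an interpretation rather than a documented procedure.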
Because d′ accounts for both target-present and target-absent performance, we examined the top performers on this measure in comparison with their identified scores on the two memory tests and the overall Memory Hits Index. Twelve officers achieved d′ scores that were at least 1.96 SDs above the control mean (see Figure 4d). All [...] Thus, officers who excelled at this task showed enhanced sensitivity relative to controls, rather than a change in response bias (i.e., a general tendency to say that the target is present/absent). This conclusion is further supported by the analysis of misidentifications.
Overall, SRs made fewer misidentification errors than controls, even when the number of misidentifications was controlled for the overall number of "target-present" responses. This indicates that the SRs were not simply guessing when they indicated that a target was present in a trial; instead, they were able to accurately identify the target faces substantially more often than control participants.

| Consistency of face matching performance
Our next set of analyses examined the consistency of performance across the three new blocks of the face matching test (i.e., the Pose, Glasses, and Facial Hair manipulations). Hits, correct rejections, and overall accuracy were summed for all participants on each block, and norms for each measure were calculated using the control data.
Cut-offs were again set at 1.96 SDs above the control mean (see Table 4). We initially examined overall accuracy rates in each block.
First, we looked at the officers who had outperformed controls in the screening version of the PMT. Of these 20 officers, 15 exceeded control performance on at least one of the three blocks: Three outperformed controls on all three blocks (see Figure 6a), nine on any two blocks (see Figure 6b), and three on any one block (see Figure 6c).
Five did not outperform controls on any block (see Figure 6d). Next, we looked at the performance of the 10 officers who had not passed the initial PMT screen (i.e., they were included in this study on the basis of their CFMT+ score alone). Remarkably, only one officer failed to exceed control criterion on any one block, and two only surpassed controls on any one block (see Figure 6e). Two officers surpassed control performance on all three blocks and five on any two blocks (see Figure 6f). [...] The performance of each individual officer on each index is displayed in Figure 8a, with all four indices converted to standardized scores for ease of comparison. A correlation matrix is presented in Table 5. There were strong relationships between accuracy and consistency for both hits and correct rejections; however, although consistency of performance was related across hits and correct rejections, accuracy was not. These findings indicate that although it is important to assess accuracy of performance independently for target-present and target-absent trials, consistency is more stable.
The top 10 performers on the Matching Hits Accuracy Index are displayed in Figure 8b. Only half of these individuals would have been picked up in the screening PMT, with z scores ranging from 0.37 to 1.72 in the remaining five officers. As observed for the memory tests, the top performers on matching hits displayed more varied performance on the target-absent trials. Notably, one [...] respectively (see Figure 7c). Similarly to performance on the memory tests, these results confirm that SRs excel at face matching due to better sensitivity, as opposed to a change in response bias.

Note. Lower scores represent more consistent performance, whereas higher scores represent more accurate performance.

| Crowds test
Hits and correct rejections were calculated for the Crowds test and summed to index overall accuracy. Controls achieved scores that ranged from 28.13% to 81.25% (see Table 6). There was no significant difference in the number of hits compared with correct rejections for controls, t(39) = 0.189, p = 0.851. Norms were once again set at 1.96 standard deviations from the control mean, yet no officer surpassed the cut-off for overall accuracy. When d′ was calculated, the same pattern was observed.
These results suggest that it is difficult to surpass the 1.96 cut-off on the Crowds test, perhaps because composites constructed from memory are difficult to recognize or match to target (see discussion below). We therefore lowered the criterion and examined the performance of participants who had performed more than one SD above the control mean on d′. Six officers (

| Consistency of performance between unrelated measures
Finally, we used the most informative measures identified above to look at the consistency of performance across tests that tap different processes. The initial PCA permitted us to combine measures across target-present face memory, but target-absent performance also needs to be considered. We therefore used the measure that combines both types of trial: d′ score on the MMT. For face matching, the PCA indicated that performance on the three new blocks of the PMT could be combined separately for target-present and target-absent trials. Although the officers demonstrated consistency in their performance across both types of trial, combined accuracy scores varied more substantially and provided a means to discriminate superior performers. We therefore selected the signal detection measure of sensitivity (A) to index overall face matching accuracy over target-present and target-absent trials. The PCA also indicated that the Crowds test was not related to the other measures, and target-present and target-absent performance should again be considered independently. Thus, we again used d′ as the critical measure on this test.
We initially looked at performance across all three measures.
Using a 1.96 SD cut-off for the memory and matching measures and a 1.00 SD cut-off (see above) for the Crowds test, it was found that only one officer achieved superior scores across all three indicators. We had expected that performance on the Crowds test [...] Table 7). Finally, it is of note that six officers did not achieve a superior score on either measure (see Figure 9). Although four of [...] This finding poses a practical problem, as different individuals tended to excel at each measure. In policing practice, the correct answer to a facial identity challenge is not known; that is, it is not possible to know whether an officer should be deployed who is particularly good at target-present trials versus one who is particularly good at target-absent performance. Perhaps the best solution is to identify the top performers on measures that encompass both types of performance, such as sensitivity scores calculated from signal detection theory.
Although the top performers on these measures may not be the top performers on target-present or target-absent indices, they are the most consistent overall performers when response bias is accounted for. This is a particularly important issue in real-world face recognition scenarios such as policing, where false leads or even miscarriages of justice can result from errors in either target-absent or target-present judgments. Thus, although we agree that the CFMT+ is an excellent test of target-present face memory, it needs to be supplemented by measures of target-absent face memory to provide a full and informed assessment of top-end face memory performance.
The importance of independent assessment of target-present and target-absent performance also came through for face matching: although consistency was highly correlated across the two types of trial, accuracy was not. This finding suggests at least some stability in repeated performance at the same task, although it should be noted that the analyses were carried out on overall index scores. Although combined scores may eliminate some of the noise that is present in isolated test scores, some caution may need to be exercised when combining performance across multiple attempts at related tasks.
For instance, different patterns of response bias were noted for the "pose" matching items compared with the glasses and facial hair manipulations, perhaps because changes in viewpoint require more substantial 3D transformations than judgments on frontal faces (i.e., when glasses or facial hair are added or removed, but viewpoint does not change). Future work should explore whether different task demands return different superior performers and consequently whether overall indices should be restricted to only the most similar tasks (if the aim is to identify the best performers for specific tasks), or include a range of tasks (if the aim is to identify the most consistent overall performers).
From a theoretical perspective, it seems likely that the finding that different individuals excel at target-present versus target-absent performance results from a genuine independence between the two measures. Indeed, we found no evidence of differences in response bias between SRs and controls on any measure. Further, because SRs are operating at such a high level of sensitivity, it is extremely unlikely that response bias could explain their performance. Instead, the findings reported here fit well with previous work using typical perceivers that suggests a dissociation between target-present and target-absent performance for the matching of unfamiliar faces, an effect that gradually disappears as faces increase in familiarity (Megreya & Burton, 2007). Interestingly, the results reported here extend this finding by suggesting that the effect may hold even for top-end performers, indicating that even these individuals do not have an absolute ability to tolerate within-person variability in images of unfamiliar individuals (see Young & Burton, 2017). The findings reported here also offer evidence for a potential dissociation between different types of superior performer at a broader level, as different patterns of facilitated face matching versus face memory skills were uncovered. This variability in SR presentation has previously been reported in small case series (e.g., Bobak et al., 2017), and a statistical dissociation between face matching and face memory for three SRs was offered in a recent publication from our laboratory. However, those individuals presented with superior face matching but typical face memory skills, the reverse pattern to the individual described in the current paper. This is theoretically important as previous evidence of "super-matchers" without facilitated face memory skills, but not vice versa, suggested that enhanced perceptual processes underpin facilitated face memory performance.
Thus, it may be that composite face recognition tasks are difficult even for SRs who have at least some familiarity with this type of artificial image. Indeed, inaccuracies in the shape and appearance of individual features on composite stimuli, in addition to their spatial positioning (e.g., Frowd et al., 2005), can result even from protocols that are designed to create identifiable images (e.g., Frowd et al., 2012). Consequently, such composite faces are usually much harder to recognize, or even to match to target, than photographs of the target identities themselves (e.g., Frowd et al., 2014; Frowd, Bruce, McIntyre, & Hancock, 2007). These inaccuracies in the size, shape, and positioning of features may be what disrupts the performance of SRs on the Crowds task: SRs may be exceptional at recognizing the highly stable properties of faces (as tapped in tests such as the CFMT+, which has highly controlled images), but relatively less adept at spotting more general "likenesses" between faces.
This hypothesis is supported by the overall patterns of performance observed here. Although the composite faces used in the Crowds test present the most challenging instances of facial variability in the current battery of tasks, it is pertinent that both the MMT and matching tasks used more ambient facial images than the CFMT+. In both this study and that reported by Bate et al. (2018), the MMT appears more sensitive to top-end performance than the CFMT+ (see also Bate et al., 2018), discriminating between individual SRs who achieved very similar scores on the latter test. Likewise, the finding that some SRs can excel at the matching tests but not the Crowds test (and not vice versa) may also be explained by the relative difference in within-person variability between these two tasks. Thus, it may be that the ability to complete more challenging face recognition tasks reflects properties of the images themselves, rather than different individuals being suited to different tasks. In any case, wider screening of personnel using tasks that directly replicate real-world needs should be initiated (see Balsdon, Summersby, Kemp, & White, 2018), and future work might examine the limits of super recognition with regard to image variability.
In sum, the above discussion indicates that (a) task demands of screening tests need to be thoroughly assessed prior to implementation, (b) multiple assessments should be carried out and index scores calculated, (c) screening should allow for different individuals to be short-listed for different tasks, and (d) the best overall performers will likely not be those that excel on target-present measures alone.
Although signal detection measures may offer the best indices of all-round performance, the use of any particular statistical cut-off alongside these measures only offers an arbitrary means of identifying SRs. What may work best in practice is to rank personnel on their overall performance, calculated from multiple attempts at specific tests containing target-present and target-absent items, to create a "leader board" for each required task. At any point in time, the best available personnel may then be selected for a particular task in hand.
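The "leader board" idea can be sketched as follows. This is an illustrative outline only: it assumes that each attempt has already been summarized as a single sensitivity score, that the overall index is the simple mean across attempts, and that the officers and scores shown are hypothetical.

```python
from statistics import mean


def leader_board(attempts):
    """Rank personnel by mean score across repeated attempts, highest first."""
    # Overall index: the mean across attempts (an assumption of this sketch;
    # other ways of summarizing repeated attempts are equally possible).
    index = {name: mean(scores) for name, scores in attempts.items()}
    return sorted(index.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sensitivity scores from three attempts at one matching test
matching_attempts = {
    "A": [2.9, 1.8, 2.0],  # excellent first attempt, weaker afterwards
    "B": [2.4, 2.5, 2.6],  # consistently strong
    "C": [2.1, 2.2, 2.0],
}

board = leader_board(matching_attempts)

# For comparison: the officer a single-attempt screening would have picked
best_single_attempt = max(
    matching_attempts, key=lambda name: matching_attempts[name][0]
)

for rank, (name, score) in enumerate(board, start=1):
    print(f"{rank}. Officer {name}: mean score = {score:.2f}")
print(f"Top scorer on the first attempt alone: Officer {best_single_attempt}")
```

With these invented scores, the officer who would be selected on the first attempt alone is not the officer at the top of the board, and a separate board of this kind would be maintained for each required task.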

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.
How to cite this article: Bate S, Frowd C, Bennetts R, et al.