Estimating Response Time in Spatial Frequency Band Experiments involving Visual Perception*

María del Prado Rivero Expósito, Enrique Vila Abad, Francisco Pablo Holgado Tello, et al. | Universitas Psychologica | V. 16 | No. 1 | January-March | 2017

The present study aimed to determine the optimum response time (RT) needed to identify images of everyday objects when filtered using different spatial frequency bands. Subjects were randomly presented with different images of familiar objects that were both serialized and progressive in their spatial frequencies. The time needed to recognize them was then measured. The results showed that the optimum RT for identifying an image filtered in different spatial frequency bands was approximately 2000 ms of exposure. Specifically, stimuli presented using spatial frequency bands with Gaussian filters of variance V26-V32, which were familiar and of medium size to the viewer, were recognized in a mean time of 2126 ms.

Since the 1980s, work in the field of artificial intelligence has shown that images are analyzed through a combination of spatial filters (which select a range of information to be analyzed) and spatial frequencies. The human visual system is organized to process visual information through spatial frequency (SF) channels that are each sensitive to a particular range of frequencies of repeating light-dark patterns across the visual field (Campbell & Robson, 1968; De Valois & De Valois, 1980; Thurman & Grossman, 2011). Most theories of visual perception have stated that the information contained in the stimulus is extracted and processed in some way (Ullman, 1981; Ramírez-Moreno & Ramírez-Villegas, 2011).
The results of different studies have led to the formulation of theoretical models that attempt to explain how human visual perception systems function (Palmer, 1999; Shipley & Kellman, 2001). The basis for each of these models is always image analysis using spatial frequency systems. Specifically, the conception of the psychophysical dimension of the human visual perception system has led to the study of the following categories: how spatial scales are used for the categorization of faces, objects, and scenes (Morrison & Schyns, 2001; Ramírez-Moreno & Ramírez-Villegas, 2011); the importance of spatial frequency characteristics in the visual identification of letters (Chung, Legge, & Tjan, 2002); and the relevance of categorization when visually processing natural images and scenes (Torralba & Oliva, 2003). In the last example, the authors (Torralba & Oliva, 2003) have attempted to show that in the early stages of visual processing, observers rely on either their prior knowledge of the images they are viewing or the contextual cues surrounding them. This is demonstrated by the resulting natural image statistics, which vary strongly as a function of the interaction between the observer and the world. Low-level visual categorization, which does not rely upon clustering or segmentation, could enhance the observer's ability to locate and identify objects. Many studies support the idea that spatial filtering in the frequency domain occurs during an early stage of visual processing. During this stage, basic operations take place that lead to higher-level processes such as three-dimensionality, recognition, and categorization. Morrison and Schyns (2001) have noted that the presentation of still images in different spatial frequency ranges determines the categorization, recognition time, and identification speed of such images.
Our visual system uses spatial frequency channels to perceive objects in situations of either high or low contrast. An image is displayed as a set of luminance contrasts. Psychophysical studies investigating what is known in visual perception as trellis bars (gratings), which are images of differing frequencies that demonstrate the same degree of luminance, found a contrast sensitivity function. It is through this contrast sensitivity function that the visual system obtains the information from the input image of the stimulus in the frequency domain. This does not mean that the perception of contrast is due exclusively to frequency. Instead, what is significant is that we do not perceive all luminance contrasts, only those that are related to certain spatial frequencies.
We can distinguish physical from perceptual contrast, where contrast is the difference in light intensity between areas that appear adjacent to one another. Perceptual contrast is affected by factors such as the degree of adaptation of the observer, the contour that defines the object, the position of the object in space, and the spatial frequency of the stimulus, understood as the number of light-dark cycles per degree of visual angle. The most common effect is that the higher the spatial frequency, the less contrast subjects perceive, and vice versa. The contrast sensitivity assessment should also take into account the size of the object perceived, because this varies depending on whether the object is small (glasses) or large (a car).
Since the discovery of the contrast sensitivity function for sinusoidal gratings (Campbell & Robson, 1968), one hallmark of vision science has been to determine what SF information is critical for recognizing objects. SF tuning has been measured for various stimuli, such as faces (Costen, Parker, & Craw, 1996; Fiorentini, Maffei, & Sandini, 1983; Gold, Bennett, & Sekuler, 1999), letters (Chung et al., 2002; Parish & Sperling, 1991), and objects (Norman & Ehrlich, 1987). These studies illustrate that diagnostic stimulus information is available in specific SF bands for different objects and that observers readily extract this information for visual categorization and recognition (Chotse Wai, 2004; Sowden & Schyns, 2006).
Measuring SF tuning for objects gives vision researchers information about the scale of the diagnostic features for recognizing a given object or for discriminating between different examples within an object class.
It should be noted that the majority of the experiments mentioned above have defined response time (RT) as the dependent variable. The defining characteristic of RT studies is that the observer must respond by pressing a key as soon as the stimulus has been detected. Response time may be affected by several psychophysical characteristics of the stimuli such as luminance, contrast, and size, among others. Research has shown both that RT decreases asymptotically with increasing stimulus intensity (Jaskowski & Sobieralska, 2004; Bell, Meredith, Van Opstal, & Muñóz, 2006; Carrasco, 2006; Carreiro, Haddad, & Baldo, 2011) and that rapid visual object recognition is favored when the object exhibits the greatest possible contrast between its components (Luna, 2011). This relationship demonstrates that the properties of the stimulus are crucial both in the type of processing that occurs and in the ability of the stimulus to capture the subject's attention (Theeuwes, 2010).
Although some studies have used a presentation time of 500 ms for static stimuli and an inter-stimulus interval of 3000 ms (Carretié, Rios, Periáñez, Kessel, & Álvarez-Linera, 2012), we have not found studies that address optimum presentation times for stimuli presented as a sequential series of filtered images. Therefore, taking into account different studies (Fernández Trespalacios, 2004; Oliva & Torralba, 2001, 2002; Schyns & Oliva, 1994; Torralba & Oliva, 2003), the aim of the current study was to determine the maximum RT required for accurate recognition of filtered image objects. Our hypothesis is that the maximum time needed for recognition of objects is lower than 10,000 ms.

Participants
The sampling was incidental, and a total of eight participants (M = 30.25; SD = 16.69) were recruited from the Department of Psychology at UNED University. Six were right-handed, and two were left-handed. All participants had normal vision or vision that had been corrected to normal. All participants were informed of the nature and purpose of the experiment, and all gave their written, informed consent. The study was approved by the Ethical Committee of the University. Given the sampling method, the results should be considered cautiously.

Stimuli and apparatus
Ten images of familiar objects (a shoe, hat, lamp, key, coffeepot, glasses, bicycle, telephone, car, and alarm clock) were randomly selected, scanned in JPEG format, and filtered using Khoros Pro version 2.0. Each stimulus was filtered with each of sixteen Gaussian band-pass filters whose variance (V) started at 2 and increased in steps of 2 until it reached 32 (V2, V4, V6, V8, V10, V12, V14, V16, V18, V20, V22, V24, V26, V28, V30, and V32). This provided us with 16 filtered versions of each object, for a total of 160 images. Images were presented to subjects through the "Reaction to Visual Stimuli" application designed specifically for this type of study (Cibertec Software 2001).
RTs were automatically collected and analyzed. To expedite the process of image manipulation, images were resized to 128x128 pixels and converted to grayscale. We created Gaussian noise fields by drawing independent samples from a Gaussian distribution (M = 0, SD = 1) for each pixel in a 128x128 array (Serrano, Fabio, & Figliola, 2012). The noise fields were then filtered using each of the sixteen band-pass filters, creating sixteen sets of filtered Gaussian noise.
Stimuli were presented in a random sequence, and a progressive filter pattern was followed to avoid learning effects. Each visual stimulus was preceded by visual noise filtered in the same frequency band.
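The composition of the stimulus set described above can be illustrated with a minimal sketch. It only enumerates the 10 objects x 16 variances and draws one unfiltered 128x128 Gaussian noise field; the band-pass filtering itself was done in Khoros Pro and is not reproduced here. The names `OBJECTS`, `VARIANCES`, `build_stimulus_set`, and `gaussian_noise_field` are hypothetical and introduced only for illustration.

```python
import random
from itertools import product

# The ten familiar objects named in the text.
OBJECTS = ["shoe", "hat", "lamp", "key", "coffeepot",
           "glasses", "bicycle", "telephone", "car", "alarm clock"]

# Filter variances V2, V4, ..., V32 (sixteen values in steps of 2).
VARIANCES = list(range(2, 33, 2))


def build_stimulus_set():
    """Return every (object, variance) pair: one entry per filtered image."""
    return list(product(OBJECTS, VARIANCES))


def gaussian_noise_field(size=128, seed=None):
    """Draw an unfiltered size x size field of independent N(0, 1) samples.

    Assumption: a plain list of lists stands in for the image array; the
    subsequent band-pass filtering step is deliberately omitted.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]


stimuli = build_stimulus_set()
noise = gaussian_noise_field(seed=0)
print(len(VARIANCES), "variances,", len(stimuli), "filtered images")
```

Enumerating the pairs explicitly makes it easy to check that the design yields 160 filtered images and that each object appears once at every variance.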

Procedure
In the experiment, we randomly chose 20 images for each block from the original set of 160 images. After receiving a sheet with instructions, subjects performed a training block to ensure that they understood the task and to familiarize themselves with the filtered stimuli. This training situation consisted of a block of 20 trials which included each of the different types of stimuli to be presented. Although it was a training situation, noise was presented prior to the introduction of the filtered image.
After the training block, subjects were presented with 8 blocks of 20 trials of images, each filtered in a different spatial frequency band. For each trial (see Figure 1), the fixation point was presented in the center of the computer screen for 150 milliseconds to fix the subject's attention. Next, the filtered noise was presented for 500 milliseconds. Finally, the filtered image was presented for up to 10,000 milliseconds. Visual noise and image were presented in the same band-pass filter.
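The timing of a single trial can be summarized in a short sketch. The phase durations are taken from the description above; the `Phase` class and the helper names are hypothetical, and the code only does the arithmetic rather than presenting stimuli.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    duration_ms: int  # maximum duration of the phase in milliseconds


# Trial structure as described: fixation, filtered noise, filtered image.
TRIAL = (
    Phase("fixation", 150),
    Phase("filtered noise", 500),
    Phase("filtered image", 10_000),  # ends earlier if the subject responds
)

EXPERIMENTAL_BLOCKS = 8
TRIALS_PER_BLOCK = 20


def image_onset_ms(phases=TRIAL):
    """Time from trial start until the filtered image appears."""
    return sum(p.duration_ms for p in phases if p.name != "filtered image")


def max_trial_ms(phases=TRIAL):
    """Upper bound on trial length, reached only when no response is given."""
    return sum(p.duration_ms for p in phases)


total_trials = EXPERIMENTAL_BLOCKS * TRIALS_PER_BLOCK
print(image_onset_ms(), max_trial_ms(), total_trials)
```

Under these figures the image appears 650 ms after trial onset, and the 8 experimental blocks together cover 160 trials, matching the size of the filtered image set.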

Figure 1 Example of the trial used
Source: own work

The experiment was conducted in a dimly lit room, isolated from external noise. Each subject was seated in an armchair in front of a 21-inch TFT computer screen that showed the visual stimuli. The screen had a resolution of 1024x768 pixels. Subjects were seated 45 cm from the screen with a chinrest to help them maintain a constant viewing distance. At this distance, the stimuli subtended 7.64 × 7.64 deg and were presented at a rate of 50 Hz. Subjects answered by pressing a keypad button and giving the name of the image presented. RTs were recorded automatically. The response screen remained either until the subject answered by pressing the keypad button or for a maximum of 10,000 ms. The entire experiment lasted approximately 10 minutes.
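As a worked illustration of the viewing geometry, the physical size of a stimulus can be recovered from its visual angle and the viewing distance using the standard relation size = 2 · d · tan(θ/2). The function names below are hypothetical; only the 7.64 deg angle and the 45 cm distance come from the text.

```python
import math


def size_from_visual_angle(angle_deg, distance_cm):
    """Physical extent (cm) of a stimulus subtending angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def visual_angle_from_size(size_cm, distance_cm):
    """Visual angle (deg) subtended by a stimulus of size_cm at distance_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


side_cm = size_from_visual_angle(7.64, 45.0)
print(f"stimulus side: {side_cm:.2f} cm")
```

With the values reported here, each side of the stimulus works out to roughly 6 cm on the screen.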

Data analysis
Data analysis was performed using the statistical program Statistical Product and Service Solutions (SPSS) v. 13 for Windows. We used a linear, mixed design (A2 × B3 × C4) with repeated measures for three factors. Our design includes three independent variables and two dependent variables, each one with different levels.
Factor A represents the independent variable amount of detail, with two levels: greater amount of fine detail (lamp, key, glasses, bicycle, and alarm clock) and greater amount of broad detail (shoe, hat, coffeepot, telephone, and car). Factor B represents the independent variable size, with three levels: small (key, glasses, and alarm clock), medium (shoe, hat, telephone, lamp, and coffeepot), and large (car and bicycle). Factor C represents the independent variable frequency bands, with four levels: level 1 (V2-V8), level 2 (V10-V16), level 3 (V18-V24), and level 4 (V26-V32).
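The coding of each trial into the three factors can be sketched as a simple lookup. The object groupings and band boundaries are those listed above; the names `DETAIL`, `SIZE`, `band_level`, and `code_trial` are hypothetical and do not correspond to anything in the SPSS analysis.

```python
# Factor A: amount of detail (two levels).
DETAIL = {
    "fine": {"lamp", "key", "glasses", "bicycle", "alarm clock"},
    "broad": {"shoe", "hat", "coffeepot", "telephone", "car"},
}

# Factor B: size (three levels).
SIZE = {
    "small": {"key", "glasses", "alarm clock"},
    "medium": {"shoe", "hat", "telephone", "lamp", "coffeepot"},
    "large": {"car", "bicycle"},
}


def band_level(variance):
    """Factor C: map a filter variance (2, 4, ..., 32) to band level 1-4."""
    if variance not in range(2, 33, 2):
        raise ValueError(f"unexpected variance: {variance}")
    # V2-V8 -> 1, V10-V16 -> 2, V18-V24 -> 3, V26-V32 -> 4
    return (variance - 2) // 8 + 1


def level_of(obj, factor):
    """Return the level of `factor` whose object set contains `obj`."""
    for level, members in factor.items():
        if obj in members:
            return level
    raise KeyError(obj)


def code_trial(obj, variance):
    """Return the (detail, size, band) cell of the A2 x B3 x C4 design."""
    return level_of(obj, DETAIL), level_of(obj, SIZE), band_level(variance)


print(code_trial("lamp", 26))
print(code_trial("car", 2))
```

Writing the groupings out this way also shows that the factors are not fully crossed at the object level: for instance, both large objects differ in detail (car is broad, bicycle is fine), while no small object falls in the broad-detail group.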
The two dependent variables recorded were RT and failures/successes (based on recognition versus non-recognition of images).
We also controlled the maximum presentation time of the image (10,000 ms), the sequence number of the presentation, the luminance of images, the spatial frequency with which images were presented, and the visual noise.
All experimental conditions (ambient noise, room lighting, and isolation of external stimuli) were identical for all subjects.

Results
The results show that the optimum RT for identifying an image filtered in different spatial frequency bands was approximately 2000 ms of exposure. Specifically, familiar and medium-sized stimuli presented in banks of spatial frequency bands with Gaussian filters of variance V26-V32 were recognized in a mean time of 2126 ms.
The data also showed that RTs to images composed of fine detail were significantly shorter than those to images composed of broader detail (F = 69.13; p < 0.0005). Reaction times were also shorter when the stimuli were of medium size (F = 92.24; p < 0.0005). With regard to the established banks of spatial frequency bands, paired comparisons revealed significant differences (F = 45.88; p < 0.0005) for level 4 (V26-V32) versus the other bands analyzed. The comparisons for level 3 (V18-V24) and level 2 (V10-V16) also presented significant differences (p < 0.0005) versus the other frequency bands, but did not register significant differences when compared to each other (p = 0.745). Similar findings were obtained in the comparison of level 2 (V10-V16) and level 1 (V2-V8) (p = 0.067).
The fastest responses were reported when medium-sized images were viewed, although there were no significant differences with regard to the amount of detail those images contained (p = 0.256). Reaction times were longer for large stimuli; here, there were significant differences according to the amount of detail those images contained (p < 0.0005) (see Figure 2). Across all size groups, as the images were presented in higher spatial frequency bands, the speed of recognition showed a progressive increase, although this progression was slightly greater in the group of medium-sized stimuli (see Figure 3). It should be noted that the real size of the object represented by each image and the time taken to recognize each stimulus were proportionally related to each other.
With regard to the amount of detail, the fastest responses were produced by images presented with fine detail and high spatial frequency bands (level 4, with Gaussian filters between V26 and V32). Furthermore, the observed differences were significant (p < 0.0005), with the exception of level 2 spatial frequency bands (V10-V16) (see Figure 4). The research also found that, although the RTs to images incorporating fine detail were shorter than those to images incorporating broader detail, the interaction of detail and frequency suggested that this was only the case where high spatial frequencies were also a factor. With low spatial frequencies, regardless of the level of detail (fine vs. broad), there was no difference in the RTs.

Discussion
In the present study, we have begun considering the information that spatial frequencies provide in the recognition of faces, objects, and scenes. Specifically, we have taken into account the studies of Morrison and Schyns (2001), Oliva and Schyns (1997), and Torralba and Oliva (2003). In these studies, a single static stimulus is taken into account, while we have considered stimulus sequences. Progressively filtered images were presented to explore how spatial information provided in serial fashion is able to guide the initial stages of early visual processing. In these presentations, we have controlled luminance, contrast, display time, refresh rate of the monitor, displayed image distance, visual angle, and visual noise.
Our results show that the RT within which a subject can recognize an image filtered at different spatial frequency bands lies below 10,000 ms of exposure. In fact, we can consider that the RT would be approximately 2,000 ms. Specifically, familiar and medium-sized stimuli presented in banks of spatial frequency bands with Gaussian filters of variance between V26 and V32 were recognized in a mean time of 2126 ms. Shorter stimulus duration may be one of the possible factors causing low efficiency in discrimination tasks (Gold et al., 1999; Thurman & Grossman, 2011).
In relation to the independent variable prevalence of detail, the results of this research have shown that RTs are lower under fine detail conditions than under broad detail conditions. This finding is consistent with the fact that the fastest responses also occur for images with predominantly fine detail in high spatial frequency bands, because fine details are perceived in the higher frequency bands, whereas broad details are identified in low frequency bands. With low spatial frequencies, the difference in RT between broad and fine detail conditions is not significant. This finding leads us to believe that the decrease in RTs is due to the presentation of fine detail images in high spatial frequency bands. The progressive sequence of filtering implies that the representation in a space-image scale is an ordered hierarchy ranging from the lowest frequency scales to the highest. As we increase the Gaussian variance (σ²), the fine details of the image (maximum responses of the filters, edges, high spatial frequencies, or fine scales) will appear.
In reference to the size variable, we have found that the fastest responses take place for medium-sized images, without significant differences in the characteristics of detail. This does not happen with small and large images, where RTs are longer and significant differences do exist between characteristics of detail. Average time for correct answers is lower for medium-sized images than it is for small or large images. In fact, in all responses grouped by size category, recognition of objects becomes faster as the spatial frequency bands increase. This may be due, as noted previously, to the contrast sensitivity assessment, which takes into account the size of the object perceived, because contrast sensitivity varies depending on whether the image is small (e.g., glasses), medium (e.g., coffeepot), or large (e.g., car). We also believe that this phenomenon may be attributed to a relationship of proportionality. When the images are presented to subjects, they expect them to be close to their real size. Our results suggest that the real size of the object represented in each image is related to recognition time.
In relation to the third variable, the bands of spatial frequency, we find faster response times in the higher bands (bank no. 4, with Gaussian filters between V26 and V32). There are significant differences across sizes in all banks of bands, and the RTs for large and small sizes differ significantly in banks no. 2 and no. 4. This finding leads us to consider that the decrease in RT is due to the presentation of stimuli predominantly incorporating fine detail in high spatial frequency bands (Díaz Pardo, Suárez Fajardo, Puerto-Leguizamón, & Zona Ortiz, 2015). The spatial frequency bands used in this research span Gaussian variances from V2 to V32 (grouped into four banks). What we observe is not different from other experiments on Gaussian filters. Campbell and Robson (1968) showed that the detection threshold of complex sinusoidal gratings with high spatial frequencies coincided with the threshold obtained for the fundamental harmonic component. Finally, it should be noted that, in relation to the bands, subjects demonstrate a tendency towards faster learning of the stimuli presented: when a stimulus is recognized in a low frequency band and answered correctly, the RT decreases significantly when the same stimulus is presented immediately afterwards in a higher frequency band.

Figure 2 Mean values Reaction Time: Detail x Size

Figure 3 Mean values Reaction Time: Frequency Band (4 groups) x Detail

Figure 4 Marginal Means: Frequency Band (4 groups) x Size