Statistical image properties predict aesthetic ratings in abstract paintings created by neural style transfer

Introduction

The question of whether computers can create artworks has intrigued computer scientists and artists alike (Hertzmann, 2018; Lomas, 2018; Mazzone and Elgammal, 2019; So, 2020; Cetinic and She, 2022). In the art world, the usage of computers has been a research subject for more than 50 years (Giloth and Pocock-Williams, 1990; for a review, see Nake, 2012). After decades of relative quiescence, artificial intelligence (AI) has taken the art world by storm. A key trigger of this recent development was the introduction of Convolutional Neural Networks (CNNs), which have gained enormous popularity, in part because of their highly effective application in computer vision (LeCun et al., 2015). CNNs are neural networks with convolutional layers, which are particularly well suited for processing images. Under supervised training with more than a million stimuli, they can achieve extraordinarily high (human-like) accuracy rates, for example, in recognizing large series of natural objects and scenes (LeCun et al., 2015). Low- and intermediate-level responses of the network resemble those recorded in the early human visual system (Krizhevsky et al., 2012; Yosinsky et al., 2014; Güçlü and van Gerven, 2015; Cadena et al., 2019; Kindel et al., 2019). At higher levels, feature responses integrate over larger input regions to represent increasingly more complex (parts of) objects, similar to neural responses in extrastriate cortical regions (Cadieu et al., 2014; Yamins et al., 2014).

At present, an increasing number of artists are experimenting with computer-assisted art creation and automation in their work. The most widely used approach to generating art is based on a type of CNN called Generative Adversarial Networks (GANs; Goodfellow et al., 2014), as well as their advancements, such as AI Creative Adversarial Networks (AICANs; Elgammal et al., 2017). These developments give rise to questions about ethics, authenticity, and autonomy as well as to philosophical controversies regarding creativity and artistry (Mazzone and Elgammal, 2019; So, 2020; Cetinic and She, 2022).

Neural Style Transfer (NST; Gatys et al., 2015) represents another way in which CNNs have entered the art world. With NST, the color and texture information of one input image [termed style image by Gatys et al. (2015)] can be transferred onto another input image [termed content image by Gatys et al. (2015)], thus generating a novel style-transferred output image (So, 2020). Artists and scientists have widely used these algorithms to generate artworks and experimental stimuli (for reviews, see Semmo et al., 2017; Jing et al., 2020; So, 2020; Santos et al., 2021; Zhang et al., 2021). In recent years, many different NST algorithms have been published with distinct properties, features and performance. Note that the meaning of the term “style” in NST differs from its definition in art history or art theory. In NST, style refers to the perceptual texture of a single artwork, which is represented in a feature space designed to capture texture information (Gatys et al., 2016). In the present study, we use the term in this sense. By contrast, artistic style can be defined as the style of a particular artist, school, or movement. For example, Davis (2011) uses the term “style” to denote specific pictorial configurations that stem from the artwork being of a particular origin. Style analysis (“stylometry”) allows art experts, for example, to identify the artist of an artwork. Style identification can be assisted by computers, for instance by utilizing CNNs (Wallraven et al., 2009; Graham et al., 2012; Van Noord et al., 2015; Chu and Wu, 2018).

NST facilitates the creation of large numbers of artworks for statistical analysis and experimental investigations. However, the use of NST-generated stimuli for aesthetic research has several shortcomings. (1) Although the computational paradigms underlying NST are relatively well defined and understood (Semmo et al., 2017; Kotovenko et al., 2019; Hien et al., 2021), it is less well known how objective (physical) image properties are modulated by NST and how they mediate the aesthetic attributes and the liking of the generated images (Zhang et al., 2021). (2) The responses of beholders may be biased against computer-generated art (Chamberlain et al., 2018). (3) It is debated whether artificial intelligence can create artworks at all (for a review, see Cetinic and She, 2022). For example, Hertzmann (2019) reasoned that computers cannot be credited with authorship of artworks, but they can assist artists and serve as an engine for innovation. Similarly, McCormack et al. (2019) contest that computers can have artistic creativity and autonomy. Taking an opposite viewpoint, Mazzone and Elgammal (2019) claimed that they succeeded in developing an almost autonomous computer algorithm that is capable of producing artworks.

The present study is an attempt to shed more light on computer-generated art. Using NST, we created a set of artificial abstract artworks and analyzed their perceptual structure by calculating statistical image properties (SIPs) that have been associated previously with aesthetic perception and affective images (Braun et al., 2013; Brachmann et al., 2017; Redies and Brachmann, 2017; Grebenkina et al., 2018; Redies et al., 2020; see also Supplementary material for a comprehensive description of the SIPs used in the present study). In a behavioral experiment, we investigated how the SIPs relate to subjective aesthetic ratings.

It is generally accepted that aesthetic ratings depend not only on perceptual processing, but also on cognitive processing and emotional attributes of images (Jacobsen, 2006; Chatterjee and Vartanian, 2014; Graf and Landwehr, 2015; Redies, 2015). Cognitive and emotional factors may potentially modify or confound aesthetic responses to perceptual features of visual stimuli, such as the SIPs. Therefore, in line with our focus on perceptual factors, we minimized the effects of cognitive and emotional processing in the present study by using abstract (non-figurative) stimuli. We combined 25 abstract artworks from different artists and diverse art styles (that served as style images for NST; Gatys et al., 2015; see Supplementary Table 1) with 150 random-phase images (content images for NST; Gatys et al., 2015) to generate 150 novel style-transferred images. Note that our content images for NST did not display any recognizable content. In the following, we will therefore refer to them as random-phase images.

Aesthetic ratings can be defined along different dimensions. Berlyne (1970) asked participants to describe artworks in terms of pleasingness and interestingness. The two terms correlated with other rating terms, such as complexity and novelty, to different degrees. Augustin et al. (2012) found that for different image categories, including artworks, landscapes, and faces, participants use different sets of aesthetic terms to describe them. Lyssenko et al. (2016) studied the qualitative descriptions of abstract artworks and identified both descriptive, image-related terms (for example, structured, colorful, and dark) and affective terms (for example, happy, boring, and warm). Marković and Radonjić (2008) established four subjective dimensions of the aesthetic experience of paintings, which represent the main psychological and behavioral domains: Hedonic Tone and Relaxation (affective or emotional), Regularity (perceptual or cognitive), and Arousal (motivational). For our study, we chose rating dimensions for each of these domains to cover a wide range of the aesthetic experience: Pleasing (Hedonic Tone), Harmonious (Regularity), and Interesting (Arousal). The aesthetic scales used in the present study were previously shown to correlate with image properties (Schwabe et al., 2018; Stanischewski et al., 2020), and they have been associated with different aspects of aesthetic perception and evaluation (Cupchik and Gebotys, 1990; Marković, 2012; Graf and Landwehr, 2015).

As shown before, the SIPs of abstract or modern artworks overlap to a large extent with those of traditional artworks of different cultural provenance, but particular subtypes of modern art can also deviate substantially from traditional art (Redies and Brachmann, 2017; Mather, 2018). We therefore compare the artificially created artworks with a set of 1629 traditional Western paintings (JenAesthetics dataset; Amirshahi et al., 2015). This dataset comprises diverse artworks from different periods, styles, artists, and depicted subject matters. We also investigate whether and how this comparison can be related to the aesthetic ratings of our style-transferred images.

Individuals share common aesthetic taste, but they also show individual preferences. The proportion of private taste versus shared taste varies according to the type of images viewed (Leder et al., 2016; Vessel et al., 2018). Some of the differences in private taste for artworks can be related to differences in the personality traits of the beholders, for example, openness to experience (Chamorro-Premuzic, 2009). Interestingly, the subjective interpretation of the rating terms by individual beholders also depends on personality traits (Lyssenko et al., 2016). In view of these previous results, we also clustered participants and analyzed their results separately.

The purpose of the present study is to address the following research questions: (1) In an exploratory analysis, we compare the SIPs of the input images (original artworks and random-phase images) with their style-transferred derivatives to find out how well NST transfers SIPs. (2) We investigate whether NST transfers participants’ subjective ratings from the two types of input images to the style-transferred (output) images. This analysis was also done for clusters of participants. We hypothesize that the rating responses are largely driven by the style of the original paintings, and that, as a consequence, preference for a particular style is transferred from the original abstract artworks onto their style-transferred counterparts. (3) Furthermore, we were interested in how well the SIPs can predict the aesthetic ratings of the style-transferred images. (4) We compare the artificially created artworks with the JenAesthetics dataset. We hypothesize that style-transferred images prompt higher aesthetic responses in the beholders if the values for the SIPs of the style-transferred images are closer to those of traditional artworks.

Materials and methods

Stimuli

We used three different types of stimuli. First, we selected 25 abstract artworks by different artists. Care was taken to include paintings from diverse abstract art styles, including Abstract Expressionism, Art Informel, Color Field Painting, Constructivism, Dadaism, Hard-Edge Painting, Monochrome Painting, Neo-Expressionism, Op Art, Orphism, and Tachism. Most of the images were from a dataset used in previous studies (Mallon et al., 2014; Lyssenko et al., 2016). Two additional images were downloaded from the internet. The artists and information on the paintings are listed in Supplementary Table 1. Example paintings are shown in Figures 1A–C.

Figure 1. Examples of the three image categories studied. Original artworks (A–C) are shown on top of the random-phase images (D,F,H) that were used to generate the style-transferred images (E,G,I), respectively. The slope of the log-log plots of Fourier power vs. spatial frequency is indicated on the left-hand side of each row. Original artworks are (A) Gelb-Rot-Blau by Wassily Kandinsky (1925); (B) Z VII by László Moholy-Nagy (1926); and (C) Untitled by WOLS, ca. 1940.

Second, we generated a set of 150 random-phase images with different Fourier spectral properties (for examples, see Figures 1D,F,H; Simoncelli and Olshausen, 2001; Galerne et al., 2010). Grayscale random-phase images can be generated easily and in great numbers for different slopes in log-log plots of Fourier power versus spatial frequency (Spehar et al., 2016). The random-phase patterns with different spectral slopes vary in their relation of fine detail and coarse image structure. Because the neural network used by the NST algorithm is trained on colored images and color is an important attribute of aesthetic judgments, we decided to generate colored versions of the random-phase images (Galerne et al., 2010). Colored versions of the random-phase patterns were obtained by merging different grayscale images of the same slope in the three channels of the RGB color space. In the present study, random-phase patterns had Fourier slopes that ranged from –5 to 0 in increments of 1 (–5, –4, –3, –2, –1, and 0). For each slope, 25 images were created. The images had a resolution of 1024 × 1024 pixels.
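
The generation of colored random-phase images described above can be illustrated with a minimal sketch. It is not the code used in the study: the function names, the use of NumPy, and the small image size are assumptions made for illustration. The sketch imposes an amplitude spectrum whose power falls off as f^slope, takes the phases from white noise, and stacks three independent grayscale patterns of the same slope into the RGB channels.

```python
import numpy as np

def random_phase_image(size, slope, rng):
    """Grayscale random-phase pattern whose Fourier power scales as f**slope.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.hypot(fx, fy)              # radial spatial frequency
    f[0, 0] = 1.0                     # placeholder to avoid 0**negative at DC
    amplitude = f ** (slope / 2.0)    # power = amplitude**2 ~ f**slope
    amplitude[0, 0] = 0.0             # remove the DC component
    # Phases of real white noise are Hermitian-symmetric, so the inverse
    # transform is real up to rounding error.
    phase = np.angle(np.fft.fft2(rng.standard_normal((size, size))))
    img = np.fft.ifft2(amplitude * np.exp(1j * phase)).real
    # Rescale to [0, 1]; an affine rescaling leaves the spectral slope unchanged.
    return (img - img.min()) / (img.max() - img.min())

def colored_random_phase_image(size, slope, rng):
    """Merge three independent grayscale patterns of equal slope into RGB."""
    channels = [random_phase_image(size, slope, rng) for _ in range(3)]
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(0)
# 64 x 64 keeps the example fast; the study used 1024 x 1024 pixels.
example = colored_random_phase_image(64, -2.0, rng)
```

Steeper (more negative) slopes concentrate power at low spatial frequencies and yield coarser patterns, whereas a slope of 0 corresponds to white noise.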

Third, we generated 150 images with NST (for examples, see Figures 1E,G,I). Each of the 25 styles of the original paintings was transferred onto 6 colored random-phase images with the different slopes (see above). Each style transfer was based on a different random-phase image. We used a revised version of the Style Transfer by Relaxed Optimal Transport and Self-Similarity (STROTSS) algorithm by Kolkin et al. (2019). The reasons for choosing this neural style transfer method were the availability of verified code, the speed of the method and the ability to produce images at a relatively high resolution (1024 × 1024 pixels). In addition, STROTSS is an optimization-based style transfer method that produces similar quality images for different styles and content. The parameter settings were identical to those used by Kolkin et al. (2019).

For the rating experiment, the stimuli were displayed on a ColorEdge CG241W screen (Eizo, Hakusan, Japan) in a darkened environment. A viewing distance of 80 cm was maintained using a chin rest, resulting in a viewing angle of 20° for the target stimuli that were presented at 28.22 cm × 28.22 cm (800 × 800 pixels). The monitor was calibrated with an i1 Display pro calibrator (X-Rite, Grand Rapids, MI, U.S.A.; settings: brightness, 120 cd/m²; white point, D65; gamma, 1.0 for all RGB channels).
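
As a quick check of the stated geometry, the viewing angle follows from the stimulus size and the viewing distance. The helper name below is an assumption for illustration.

```python
import math

def visual_angle_deg(stimulus_size_cm, viewing_distance_cm):
    """Full visual angle subtended by a stimulus centered on the line of sight."""
    half_angle = math.atan(stimulus_size_cm / (2.0 * viewing_distance_cm))
    return math.degrees(2.0 * half_angle)

# Stimulus width and viewing distance as reported in the text.
angle = visual_angle_deg(28.22, 80.0)
print(f"{angle:.1f} degrees")
```

With the reported values, the result agrees with the 20° given above to within rounding.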

Participants

Forty volunteers (14 male and 26 female) participated in the rating experiment at Jena University Hospital. The duration of the experiment was about 60 min. Participants were paid €8 for taking part in the rating study. The mean age of the participants was 23 (range 18 to 30) years. One participant reported left-handedness; the remaining 39 were right-handed. In a short questionnaire on art interest, applied at the beginning of the experiment, one participant reported no interest in art, 13 participants reported being somewhat interested, and 26 participants reported an interest in art. Sixteen study participants had a medical background (mostly medical students), eleven studied history of art and film studies, and the remaining 13 were university students from various other fields, such as economics, law, or chemistry.

The study was designed according to the specifications of the World Medical Association Declaration of Helsinki and approved by the Ethics Committee of Jena University Hospital (approval no. 2021-2223-Bef). The participants gave their written informed consent prior to the experiment. They were informed that they could freely withdraw from the experiment at any time without any repercussions.

Procedure

Prior to the experiment, the participants were presented with a sheet of instructions for the experiment. Moreover, the participants were asked to answer a few demographic questions (age, gender, profession/field of study, level of interest in art, vision impairment and handedness). After completing the short questionnaire, the experiment was launched in full screen (1920 × 1200 pixels). Participants were asked to complete a short test-like run to familiarize them with the experimental procedure and the rating scale. For this supervised run, unrelated figurative paintings were used.

The experiment was divided into three blocks, one for each of the three image categories (abstract paintings, random-phase images, and style-transferred images; Figure 2C). Each of these main blocks consisted of three sub-blocks for the three rating dimensions (Pleasing, Harmonious, and Interesting). The experiment started in a randomized order with either the random-phase image block or the style-transferred image block (Figure 2C). The abstract paintings were always presented as the final block so that the participants’ ratings of the style-transferred images were not influenced by the original paintings. A disadvantage of this schedule is that the first two blocks possibly affect the ratings of the last block (original paintings). All 40 participants rated all 25 abstract paintings and all 150 style transfers. To avoid screen fatigue, every participant rated only 30 out of the 150 random-phase images (balanced with respect to their Fourier slope), resulting in 8 ratings per random-phase image. Within all main blocks, the order of the sub-blocks was randomized as was the image sequence within all sub-blocks (Figure 2C). In between blocks and sub-blocks, participants were allowed to take an optional break.
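
The arithmetic behind the 8 ratings per random-phase image (40 participants × 30 images / 150 images) can be made explicit with a small sketch. The rotation scheme below is purely hypothetical; the text states only that the subsets were balanced with respect to Fourier slope, not how images were assigned.

```python
from collections import Counter

N_PARTICIPANTS = 40
SLOPES = [-5, -4, -3, -2, -1, 0]
IMAGES_PER_SLOPE = 25
RATED_PER_SLOPE = 5  # 6 slopes x 5 images = 30 images per participant

def assign_images(participant):
    """Return (slope, image_index) pairs for one participant.

    Hypothetical rotation scheme: each participant receives a block of five
    consecutive images per slope, and the block shifts with the participant.
    """
    start = (participant * RATED_PER_SLOPE) % IMAGES_PER_SLOPE
    return [
        (slope, (start + offset) % IMAGES_PER_SLOPE)
        for slope in SLOPES
        for offset in range(RATED_PER_SLOPE)
    ]

# Count how often each of the 150 images is rated across all participants.
ratings_per_image = Counter(
    image for p in range(N_PARTICIPANTS) for image in assign_images(p)
)
```

Under this assumed scheme, every participant sees five images per slope and every image is rated equally often.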

Figure 2. Experimental procedure. The schedule is shown in (A) with the presentation times indicated below each screen shot. (B) Shows a magnification of the screen display where ratings are entered by a mouse click on the scale below the image. (C) Illustrates the sequence of the rating blocks. Within the blue boxes, images and block sequences were randomized while the green box indicates a fixed position. n, number of images.

In each trial, a blank black screen was first presented for 500 ms, followed by a white fixation cross, which appeared for a random duration between 300 and 800 ms (Figure 2A). Then, the target image was presented on the same black background together with a continuous rating scale below the image (Figure 2B). The rating scale for Harmonious ranged from “not harmonious” to “very harmonious.” The other rating scales (Pleasing, Interesting) were presented in an analogous manner. Viewing time was not limited; the next trial was initiated as soon as participants entered their response by clicking on the scale with the computer mouse. Median response time was 2.1 s (interquartile range: 1.6–3.0 s) with no difference between the image categories. The code for the presentation procedure was based on PsychoPy (Peirce, 2009).

Statistical image properties

Aesthetic ratings by human observers correlate with statistical image properties (SIPs; see Introduction section). Previous studies indicated that SIPs can overlap to a large degree in their predictive power for aesthetic ratings (for example, see Redies et al., 2020), possibly because many of these SIPs cover similar aspects of image structure (Braun et al., 2013; Van Geert and Wagemans, 2020). Consequently, the SIPs do not predict aesthetic ratings independently of each other, which can cause problems with multicollinearity in multiple linear regression analysis. Therefore, we needed a set of SIPs that showed as little overlap as possible while still covering the multidimensional SIP space well.

Our starting point was a set of 29 SIPs (calculated at a resolution of 800 × 800 pixels), which are described in detail in the Supplementary material. An exploratory principal component analysis (PCA) with the 29 SIPs revealed that each of the three image categories can be described by a different combination of the variables, confirming the usefulness of the variables in describing images with different structural characteristics. For the subsequent analyses, we reduced the initial set of 29 SIPs to eight largely independent SIPs (Table 1) by pursuing the following strategies:

Table 1. Statistical image properties used in the analysis.

(1) We decreased multicollinearity between the 29 variables (SIPs) by regression subset selection. To this end, we performed an exhaustive search for the subset of SIPs that best predicts the three rating dimensions for the 150 style-transferred images. Regression subset selection was accomplished with the leaps package of the R project (Miller, 1990). The leaps package returned the 10 best models (i.e., models with the highest R²adj values) for all possible model sizes (one to 29 predictive variables). The output graphs indicate how often a given variable is a predictor in the different models. Based on these results, we selected the twelve variables that predicted the ratings most robustly across different models, for at least one of the rating dimensions.

(2) We then calculated a correlation matrix for the twelve remaining SIPs. Spearman’s rank (non-parametric) correlation coefficients ρ were used as many SIPs were not normally distributed. We eliminated another four SIPs which showed relatively high correlations with other SIPs (ρ > 0.6). Figure 3 lists the Spearman coefficients of the correlations between the eight remaining variables. They reflect the complexity and distribution of luminance and color gradients, and features derived from the CIELab and HSV color spaces (Table 1).

Figure 3. Correlation matrix for the eight SIPs that were investigated. The numbers represent the Spearman’s coefficients ρ that were calculated for the 150 style-transferred images. The color indicates positive (blue) and negative (red) correlations. The shading represents the strength of the correlations, with darker shadings representing stronger correlations (see color bar).

(3) The predictive power of the eight remaining variables and their large degree of independence was confirmed by calculating coefficients of determination (R²) in multiple linear regression models. R² values were adjusted to account for the number of predictors and the number of datapoints (R²adj). The R²adj values in the final (reduced) model with eight variables (Supplementary Table 2) were of similar magnitude as the R²adj values in a model comprising the first eight principal components (PCs) of the 29 original variables (see Supplementary Table 2). This result suggests that much of the predictive power was preserved in the final model.
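
Steps (2) and (3) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study (which relied on R): the greedy order in which correlated variables are dropped, the use of the absolute value of ρ, the rank computation without tie handling, and all function and variable names are assumptions. The adjusted coefficient of determination uses the standard formula.

```python
import numpy as np

def spearman_matrix(data):
    """Spearman correlation matrix of the columns of `data`.

    Ranks are obtained by double argsort, which assumes no tied values.
    """
    ranks = np.argsort(np.argsort(data, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def drop_correlated(data, names, threshold=0.6):
    """Greedily keep variables whose |rho| with all kept ones is <= threshold.

    Assumed elimination rule; the text does not specify which variable of a
    correlated pair was removed.
    """
    rho = spearman_matrix(data)
    kept = []
    for i in range(len(names)):
        if all(abs(rho[i, j]) <= threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjustment of R^2 for sample size and number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Synthetic stand-in for three image properties measured on 150 images.
rng = np.random.default_rng(0)
a = rng.standard_normal(150)
b = a + 0.1 * rng.standard_normal(150)   # nearly redundant with a
c = rng.standard_normal(150)             # independent of a and b
selected = drop_correlated(np.column_stack([a, b, c]), ["a", "b", "c"])
```

For a model with eight predictors and 150 images, the adjustment is small: an unadjusted R² of 0.5 (an arbitrary illustrative value) would correspond to an R²adj of about 0.47.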

We exploratively plotted another regression subset selection for the remaining eight variables for all image categories and all rating dimensions (leaps function of R statistics; R Development Core Team, 2017). It reveals that our variable selection consistently predicts the ratings for one or more of the three image categories (see Supplementary Figure 1).

Statistical methods

For statistical analyses, we used the R program (R Development Core Team, 2017) and PRISM for macOS, version 8.4.3 (GraphPad Software, San Diego, CA, U.S.A.). To compare multiple median values, we used the (non-parametric) Kruskal-Wallis test because most SIPs were not normally distributed. Subsequently, Dunn’s post-test was applied to obtain multiplicity-adjusted p-values for pairwise comparisons.

For the β* values, we use the following definitions for the size of the observed effects (Acock, 2014): | β*| < 0.2, weak effect; 0.2 ≤ | β*| < 0.5, moderate effect; and | β*| ≥ 0.5, strong effect. The same scheme was used to describe the strength of Spearman correlations. In the Figures and Tables, β* values for variables with asterisks had a significant effect on the ratings when the other variables were controlled for in the respective models.
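
The thresholds above translate directly into a small classification rule. The function name is an assumption for illustration.

```python
def effect_size_label(coefficient):
    """Classify a standardized coefficient (or Spearman rho) by magnitude,
    using the thresholds stated in the text (Acock, 2014)."""
    magnitude = abs(coefficient)
    if magnitude < 0.2:
        return "weak"
    if magnitude < 0.5:
        return "moderate"
    return "strong"

labels = [effect_size_label(v) for v in (-0.1, 0.2, 0.49, -0.5)]
```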

As a measure for the distance between a given image and the JenAesthetics dataset of paintings in the multidimensional space of SIPs, we calculated the squared Mahalanobis distance with the mahalanobis program in the stats package of R statistics (R Development Core Team, 2017). This measure is a multivariate equivalent of the Euclidean distance and takes the full covariance matrix into account.
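
For readers who do not use R, the squared Mahalanobis distance can be sketched in Python as below. This is an illustrative analogue rather than the code used in the study; the function name and the toy reference set are assumptions.

```python
import numpy as np

def squared_mahalanobis(x, reference):
    """Squared Mahalanobis distance of point `x` from the distribution of
    the rows of `reference` (one row per image, one column per SIP)."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)   # sample covariance matrix
    diff = np.asarray(x, dtype=float) - mean
    # Solve cov @ y = diff instead of inverting the covariance explicitly.
    return float(diff @ np.linalg.solve(cov, diff))

# Toy reference set with two uncorrelated features (not study data).
toy_reference = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
distance = squared_mahalanobis([3.0, 1.0], toy_reference)
```

Because the distance is scaled by the covariance of the reference set, a deviation along a property that varies widely among the reference paintings counts less than the same deviation along a tightly distributed property.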

The participants were clustered according to how they evaluated images along the three rating dimensions Pleasing, Harmonious, and Interesting. K-means clustering was carried out with the kmeans program of R statistics (R Development Core Team, 2017). The clustering of participants was based on: (1) the correlations between the rating dimensions for each participant (five clusters), and (2) the ratings of the random-phase images (four clusters). To find the optimal number of clusters within each approach, we considered the elbow criterion, the silhouette criterion, and the gap criterion. The clearest results were obtained for the elbow criterion while the other criteria yielded ambiguous results. In addition, the number of clusters was chosen so that the number of participants in any cluster exceeded three participants.
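
The elbow criterion can be illustrated with a compact k-means sketch. It does not reproduce the R kmeans routine used in the study: the plain Lloyd iterations, the random initialization, the number of restarts, and the synthetic participant features (three values per participant, standing in for the pairwise correlations between the rating dimensions) are all assumptions.

```python
import numpy as np

def kmeans_wss(points, k, rng, n_iter=100):
    """One run of Lloyd's algorithm.

    Returns cluster labels and the within-cluster sum of squares (WSS).
    """
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each participant to the nearest cluster center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; keep empty clusters fixed.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    wss = float(((points - centers[labels]) ** 2).sum())
    return labels, wss

def best_wss(points, k, rng, n_restarts=10):
    """Lowest WSS over several random initializations."""
    return min(kmeans_wss(points, k, rng)[1] for _ in range(n_restarts))

rng = np.random.default_rng(0)
# Synthetic features for 40 participants in three groups (not study data).
features = np.vstack([
    rng.normal(center, 0.3, size=(n, 3))
    for center, n in ((-1.0, 14), (0.0, 13), (1.0, 13))
])
elbow_curve = {k: best_wss(features, k, rng) for k in range(1, 7)}
```

Plotting elbow_curve against k and looking for the point beyond which the WSS decreases only slowly gives the elbow; cluster sizes can then be checked against a minimum, as described above.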

Results

In the present study, we used a convolutional neural network (CNN) to create novel artworks by transferring the artistic style of 25 abstract paintings onto random-phase images with different Fourier spectral properties (see Materials and methods section; Figure 1). In the following sections, we will address the following questions. (1) How do the objective statistical image properties (SIPs) transfer from the input images (original paintings and random-phase images) onto the output (style-transferred) images? (2) How do the subjective ratings of the participants transfer from the input images onto the style-transferred images for the three aesthetic dimensions (Pleasing, Harmonious, and Interesting)? As a special case, we will study the relation between the aesthetic ratings and the initial Fourier power spectra, on which the computer-generated abstract images are based, also for subgroups of participants. In addition, we will study the correlations between the three rating dimensions both across and within participants. (3) Which of the SIPs can predict the aesthetic ratings and are there any differences between subgroups of participants? (4) How do the predictive SIPs in our dataset relate to the image properties of the JenAesthetics dataset of traditional Western paintings and how does this relation predict aesthetic ratings?

Statistical image properties transfer from the input images onto the style-transferred images

First, we investigated whether there are differences in the SIPs’ median values between image categories. Figure 4 shows box plots of the eight selected SIPs for the 25 original abstract paintings, the 150 random-phase images and the 150 style-transferred images. For comparison, we show results for the JenAesthetics dataset of traditional Western paintings. As demonstrated before (Redies and Brachmann, 2017; Mather, 2018), the SIPs of the abstract artworks overlap extensively with those of traditional artworks. However, the values for the original abstract art scatter more widely and the median values differ significantly from traditional artworks for three variables (Self-similarity, 2nd-order entropy and Variance Pf[30]). As a control, we contrasted the original abstract paintings to a set of 572 abstract artworks from the study by Redies and Brachmann (2017). None of the variables, except for HSV [S], p = 0.041, differed significantly, suggesting that the 25 original paintings were representative of a larger body of abstract paintings (data not shown).

Figure 4. Statistical image properties (SIPs) of the four image categories. The panels (A–H) show box plots of the values of all eight SIPs, respectively, as indicated on the y-axis of the plots. In each plot, data are shown for the JenAesthetics dataset of 1629 traditional Western paintings (black), the 25 original abstract paintings (red), the 150 random-phase images (green), and the 150 style-transferred images (purple). The boxes encompass the median (horizontal line) and represent the 25 – 75 percentiles. The whiskers indicate the 5 – 95 percentiles. Significance levels for the differences between the pairs of image categories are indicated at the top or at the bottom of the panels. Multiplicity-adjusted significance levels are *p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001.

As for the random-phase images, all eight SIPs differ significantly from those of the traditional paintings (except for HSV [S]) and the original paintings (except for Variance Pa[2] and HSV [S]), respectively (Figure 4). These objective differences are in accordance with the unique perceptual appearance of the random-phase images (Figures 1D,F,H).

The style-transferred images differ from JenAesthetics paintings in five SIPs (2nd-order entropy, Variance Pa[2], Variance Pf[30], HSV [S] and HSV [H] entropy) and from the original paintings in three SIPs (Self-similarity, Variance Pf[30], and HSV [H] entropy). They differ from the random-phase images in all image properties, except for Variance Pa[2]. We thus conclude that the style-transferred images are more similar to the original paintings than to the random-phase images, although both types of images were used in their creation.

Second, the similarity of the input and output images of NST was assessed by correlating the SIPs of the style-transferred images with both the original paintings and the random-phase images. Results are shown in Table 2. All SIPs correlate strongly between the style-transferred images and the original paintings (ρ range: 0.60 – 0.95), with highest ρ values for the three color features. By contrast, only Self-similarity and Variance Pf[30] showed significant correlations between the style-transferred images and the random-phase images (ρ = 0.61 and 0.36, respectively).

Table 2. Spearman’s coefficients (ρ) for the correlation between the eight SIPs for all style-transferred images and original paintings as well as the random-phase images, respectively.

Third, we took a closer look at the Fourier spectral slope as the random-phase images were produced based on this measure. For the random-phase images, the set (intended) slopes and measured slopes correspond well to each other (Supplementary Figure 2A). This result validates our method of producing the colored random-phase images. Supplementary Figure 2B illustrates that the slope did not translate from the random-phase images to the style-transferred images. For the set slopes of the random-phase images, the slopes measured for the style-transferred images range from –3.3 to –1.8 (median –2.72; 95% CI: –2.73 to –2.71). This range is in fact similar to the range of the 25 abstract paintings in the present study (median: –2.64, 95% CI: –3.11 to –2.49).

Aesthetic responses transfer from the input images onto the style-transferred images

Each of the three image categories elicits a wide range of aesthetic ratings in the beholder (Figures 5, 6). In the following sections, we will describe how the subjective ratings transfer from the input images (original paintings and random-phase images) to the output (style-transferred) images.

Figure 5. Mean rating responses of participants for the original paintings (A) and for their style-transferred counterparts (B). Ratings are shown for Pleasing (blue), Harmonious (green) and Interesting (red). In both panels, individual artists are ordered from left to right in a sequence of ascending Pleasing responses. Spearman’s coefficients (ρ) for the correlations are listed in Table 3. (C–E) Same data as shown in (A,B) but plotted in slope graphs, separately for the different rating dimensions. Each line connects the mean rating responses for an original painting of one artist and for its style-transferred counterpart.

Figure 6. Rating responses for set slope values of the random-phase images. The boxplots show mean responses (y-axis) by all 40 participants for different set Fourier spectral slopes (–5 to 0; x-axis) of the random-phase images (A–C) and the style-transferred images (E–G). The whiskers represent the 5 – 95% confidence intervals. The rating dimensions are indicated on the top of the panels [(A,E) Pleasing; (B,F) Harmonious; and (C,G) Interesting]. Multiplicity-adjusted significance levels of pairwise comparisons are indicated by the asterisks (*p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001). Panels (D) and (H) show least-squares fits of second-order polynomial (quadratic) functions to the data from the previous three panels (orange, Pleasing; green, Harmonious; and blue, Interesting).

Figure 5 shows the mean ratings per artist for the original paintings (Figure 5A) and for the style-transferred images (Figure 5B), respectively. Artworks are sorted from left to right according to the Pleasing ratings. The sequence of the artists from low to high ratings is roughly similar for the two image categories (Figures 5C–E). We thus correlated the ratings of the original paintings and the style-transferred images and found that the mean responses per artist correlate for all three rating dimensions, but to different degrees (Spearman’s ρ range: 0.48 – 0.80; Table 3). In other words, if participants rated particular original paintings more highly, they tended to do so also for their style-transferred derivatives. Unlike the ratings for the original paintings, the ratings of the random-phase images did not correlate significantly with those of the style-transferred images (Table 3).
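As an illustration of the per-artist correlation analysis, the following Python sketch computes Spearman's ρ between two vectors of mean ratings. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the numbers in the usage example are invented for demonstration only; they are not data from the study.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they occupy."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Invented mean ratings for five artists (illustration only).
original_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
style_transferred_ratings = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman_rho(original_ratings, style_transferred_ratings)
print(round(rho, 2))
```

A value of ρ near 1 would indicate that artists whose original paintings are rated highly also tend to receive high ratings for their style-transferred counterparts.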

Table 3. Spearman’s coefficients (ρ) for the correlations between the three rating dimensions for all style-transferred images and original paintings as well as the random-phase images, respectively.

Random-phase images with different set slope values

To create the style-transferred abstract images, we used random-phase images that possessed slopes of the Fourier power spectrum ranging from –5 to 0. We thus asked whether the rating responses for the different set slope values transferred from the random-phase images onto the style-transferred images. Results are plotted as a function of the Fourier slope in Figure 6. We will first consider the ratings for the random-phase images, followed by the style-transferred images. Note that on a descriptive level, the style transfer did not translate the original slopes from the random-phase images to the output images, as described above (Supplementary Figure 2).

For the random-phase images, rating responses for Pleasing and Interesting follow an inverted U-shape, with the highest responses for slopes of –2 and –3 (Figures 6A,C). Differences are not significant for Harmonious ratings (Figure 6B). These results were confirmed by least-squares fitting of 2nd-order polynomial (quadratic) functions (Figure 6D). Our findings thus extend results by Spehar et al. (2016) for grayscale random-phase images into the color domain.
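A least-squares quadratic fit of this kind can be sketched in a few lines of Python. The ratings below are synthetic values generated from a known parabola, not data from the study, and the function name `fit_quadratic` is hypothetical. An inverted U-shape corresponds to a negative quadratic coefficient, and the peak of the fitted curve lies at –b/(2a).

```python
import numpy as np

def fit_quadratic(slopes, ratings):
    """Least-squares fit of ratings = a*slope**2 + b*slope + c; returns coefficients and vertex."""
    a, b, c = np.polyfit(slopes, ratings, deg=2)
    vertex_slope = -b / (2.0 * a)   # a maximum only if a < 0 (inverted U-shape)
    return (a, b, c), vertex_slope

# Set slopes as in the study; ratings are synthetic (parabola peaking at -2.5).
set_slopes = np.array([-5.0, -4.0, -3.0, -2.0, -1.0, 0.0])
synthetic_ratings = 4.0 - 0.3 * (set_slopes + 2.5) ** 2

coefficients, peak = fit_quadratic(set_slopes, synthetic_ratings)
print(coefficients[0] < 0, round(peak, 2))
```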

Participants tended to rate the corresponding style-transferred images as more Interesting if they were derived from random-phase images with set slope values of less than –2, with a maximum at a set slope value of –3 (Figure 6G, blue in Figure 6H). However, the differences are less pronounced than those for the random-phase images. Interestingly, there is a weak inverse relation between set slope values and responses for Harmonious, with lower responses for set slope values of –5 to –2 (Figure 6F, green in Figure 6H). For Pleasing, no differences in the ratings were obtained for different set slope values (Figure 6E, orange in Figure 6H). Taken together, our data suggest that the transfer of ratings from the random-phase images onto the style-transferred images is much less effective than the transfer from the original paintings.

Previous results by other researchers (Bies et al., 2016; Güclütürk et al., 2016; Spehar et al., 2016) revealed that individual participants favor different degrees of complexity in random-phase patterns. We thus asked whether groups of participants also differed in their taste for the colored versions of the random-phase images. To this end, we clustered participants according to the mean responses of each participant per set slope for all three rating dimensions. About half of the participants (Clusters 1 and 2) exhibit an inverted U-shaped response curve for all three rating dimensions. For the remaining clusters, responses decrease or increase linearly with the set slope value (for detailed results, see Supplementary Figure 3).

Inter-rating correlations

Table 4 lists correlations between the rating dimensions for all three image categories across all participants. The lowest correlations are observed between Harmonious and Interesting, while both dimensions correlate more highly with Pleasing. Figures 5A,B illustrate that ratings for Harmonious and Interesting vary widely for many artists.

Table 4. Spearman’s coefficients (ρ) for the correlations between the different rating dimensions (Pleasing, Harmonious, and Interesting) for all participants.

Despite these general tendencies, we observed marked differences between participants in the correlations between the rating dimensions (data not shown). Therefore, we also calculated the inter-rating correlations within participants and clustered participants according to these correlations. Results for the five clusters obtained (Table 5) indicate that the overlap of Pleasing with Harmonious and Interesting, respectively, is about equally strong for most participants. By contrast, Harmonious and Interesting correlate less strongly with each other (see also Figures 5A,B), and some participants even showed anticorrelated response tendencies. However, these results should be considered preliminary because the number of participants in the different clusters is very small (Dalmaijer et al., 2022).

Table 5. Average Spearman’s coefficients (ρ) for the correlations between the different rating dimensions (Pleasing, Harmonious, and Interesting) for the five groups of participants that were clustered on the basis of the inter-rating correlations.

Statistical image properties explain aesthetic ratings

To determine how well the SIPs explain the aesthetic responses along the three rating dimensions, we performed a multiple linear regression analysis with a model that comprised the eight independent variables (SIPs) selected for our analysis (see Materials and methods section). In the following two sections, we will describe how each variable predicts the ratings of the style-transferred images and compare the results to the original paintings (Figure 7). As described in the Statistical methods section, we classify the β* coefficients as weak, moderate, or strong effects. Because the random-phase images display a rather unique image structure and differ in their image properties from both the original paintings and the style-transferred images, we will not consider them in the analysis of how SIPs explain the aesthetic ratings.
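The following Python sketch shows one common way to obtain standardized coefficients (β*) and the adjusted explained variance from a multiple linear regression: all variables are z-scored before an ordinary least-squares fit. This is an assumption about the procedure, not a reproduction of the study's code. The function names are hypothetical, the data are synthetic, and the effect-size thresholds follow the classification given in the caption of Figure 7.

```python
import numpy as np

def standardized_regression(X, y):
    """OLS on z-scored predictors and response; returns beta* and adjusted R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    # No intercept column is needed because all variables have zero mean.
    beta_star, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    residuals = yz - Xz @ beta_star
    r2 = 1.0 - (residuals @ residuals) / (yz @ yz)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return beta_star, r2_adj

def effect_label(beta_star):
    """Weak (<0.2), moderate (0.2 to <0.5) or strong (>=0.5) by absolute beta*."""
    size = abs(beta_star)
    if size < 0.2:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"

# Synthetic example with three predictors (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
betas, r2_adj = standardized_regression(X, y)
print([effect_label(b) for b in betas], round(r2_adj, 2))
```

Because β* values are expressed in standard-deviation units, they can be compared across predictors that are measured on different scales, which is what makes the weak/moderate/strong classification meaningful.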

Figure 7. Standardized β (β*) values for the influence of the statistical image properties (SIPs) on the rating responses. Data are shown for original artworks (A,C,E) and style-transferred images (B,D,F). The three rating dimensions are Pleasing (A,B), Harmonious (C,D) and Interesting (E,F). The explained variance (R2adj) of the respective model is indicated on top of each panel. Asterisks indicate β* values of variables that had a significant effect on the ratings when the other variables were controlled for; the respective significance levels are *p < 0.05; **p < 0.01; ***p < 0.001. n.s., not significant. (G) Influence of the SIPs on the rating responses, in relation to the JenAesthetics dataset. This overview summarizes results for the ratings of the style-transferred images for all participants (A–F). The influence of the eight independent variables (SIPs) on the ratings (Pleasing, Harmonious, and Interesting) is represented by arrows, which are shown only for those variables that had a significant effect on the ratings when the other variables were controlled for [marked by asterisks in (B,D,F) and Supplementary Table 2]. The size of the arrows indicates the strength of the relation [small arrows, | β*| < 0.2 (weak effect); medium-sized arrows, 0.2 ≤ | β*| < 0.5 (moderate effect); and large arrows, | β*| ≥ 0.5 (strong effect)]. The direction indicates the sign of the relation (upward, positive relation; and downward, negative relation). The colors indicate the changes relative to the results for the JenAesthetics data set (Figure 4). Blue arrows indicate higher ratings if the SIPs are closer to the mean SIPs of the JenAesthetics data set. Red arrows indicate higher ratings if the SIPs are more distant from the mean SIPs of the JenAesthetics data set. Gray arrows indicate no significant differences of the SIPs between the style-transferred images and the JenAesthetics data set.

Style-transferred images

Figure 7 and Supplementary Table 2 list the explained variance for each model (R2adj) and the β* coefficient for each SIP. Overall, the SIPs predict a relatively large part of the observed variance in the ratings. Except for 2nd-order entropy, all SIPs predict the responses to the style-transferred images for at least two of the rating dimensions (weak to strong effects, Figures 7B,D,F). Moreover, the direction of the β* coefficients is the same for the three rating dimensions for most SIPs. Positive β* values are obtained for Variance Pf(30) (Pleasing and Interesting), and negative β* values for Complexity and Variance Pa(2) (Pleasing and Harmonious), and for Lab (b) and HSV (S) (Pleasing and Interesting). Only Self-similarity and HSV (H) entropy show opposite directions for Harmonious and Interesting. Lower levels of Self-similarity are perceived as more Interesting (Figure 7F), whereas higher levels of Self-similarity are rated as more Harmonious (Figure 7D) in the style-transferred images. The opposite tendency is seen for HSV (H) entropy. In addition, higher values of Variance Pf(30) are perceived as more Pleasing and Interesting (Figures 7B,F).

Original paintings

Compared to the style-transferred images, significant predictors (asterisks in Figures 7A,C,E) are less numerous for the original paintings. This result is expected because the SIPs were selected based on the style-transferred images (see Materials and methods section). Moreover, the sample (25 original paintings) is very small for statistical analysis, and the results must therefore be considered preliminary. Nonetheless, the data suggest that participants prefer original paintings with lower values for the variables Self-similarity and HSV (S) for all three rating dimensions. For Self-similarity, images are rated more highly if their values differ more from the mean values of all other image categories (Figures 4B,7A,C,E). For increasing values of HSV (H) entropy, ratings increase for Pleasing and Interesting, while the opposite relation is seen for Harmonious (Figures 7A,C,E). For the sake of completeness, results for random-phase images are listed in Supplementary Table 2.

Clustering participants according to inter-rating correlations

As described above, participants were clustered according to the correlations of rating responses along the three rating dimensions (Table 5). Supplementary Figure 4 and Supplementary Table 3 show the results of the multiple linear regression model for the five clusters. All models are significant, with explained variances ranging from 0.19 to 0.78. The relation between the inter-rating correlations and the preferences for particular SIPs can be described as follows. First, Clusters 1 and 2 show about equally strong correlations between all three rating dimensions. Correspondingly, participants preferred images with similar SIPs for all three rating dimensions. Stronger inter-rating correlations in Cluster 1 than in Cluster 2 correspond to more predictive power of the SIPs in Cluster 1. Second, in Cluster 3, the stronger correlation between ratings of Pleasing and Harmonious is mirrored by a similar pattern of β* values for the two rating dimensions. Third, Cluster 4 lacks a correlation between the ratings of Harmonious and Interesting. Accordingly, the SIPs that are associated with these ratings differ. Fourth, there is a negative correlation between ratings of Harmonious and Interesting in Cluster 5, which is also reflected in opposite signs of the β* values. Again, these preliminary results await confirmation by clustering studies with more participants.

Higher aesthetic ratings for statistical image properties that resemble traditional Western paintings

We next asked how the rating responses to the style-transferred images depend on the relation between their SIPs and those of the JenAesthetics dataset of traditional Western paintings. We speculated that style-transferred images are rated more highly if their SIPs are closer to those of the JenAesthetics dataset (see Introduction section). To address this hypothesis, we examined the five variables that differed between the style-transferred images and the JenAesthetics images (2nd-order entropy, Variance Pa[2], Variance Pf[30], HSV [S], and HSV [H] entropy; Figures 4C–E,G,H). For most of these variables, responses are higher if the values of the SIPs are closer to those of the JenAesthetics images (blue arrows in Figure 7G). In other words, if the median SIP value of the JenAesthetics dataset is lower than that of the style-transferred images, β* values are negative. Consequently, the style-transferred images with smaller SIP values are rated more highly (as an example, see Pleasing and Interesting ratings for HSV [S]; Figure 7G). If the median SIP value of the JenAesthetics dataset is higher than that of the style-transferred images, the inverse applies. For HSV (H) entropy only, Harmonious and Interesting ratings show opposite tendencies in comparison to the JenAesthetics dataset (blue arrow and red arrow in Figure 7G, respectively). For 2nd-order entropy, the effect on the ratings is not significant in the model (Figure 7G), although the mean values for style-transferred images and the JenAesthetics images differ (Figure 4C).

For each SIP, we then correlated the rating responses with the Euclidean distance between the style-transferred images and the median of the JenAesthetics dataset (Figure 8A). We find the strongest negative correlations for Interesting ratings, which suggests that style-transferred images are rated as more Interesting if their SIPs approach those of the JenAesthetics dataset (green shadings in Figure 8A). Similar, yet less consistent, results are found for Pleasing ratings. An interesting exception is HSV (H) entropy, for which images are rated as more Pleasing and Interesting the more distant they are from the JenAesthetics dataset, and as more Harmonious the closer they are.

Figure 8. Influence of the SIPs on the rating responses to the style-transferred images in relation to the median SIP values of the JenAesthetics (JA) dataset. (A) Spearman coefficients ρ for the correlation between the rating responses and the Euclidean distance between each individual SIP and the median SIP of the JenAesthetics dataset, respectively. Negative correlations (green) imply that the ratings are higher if the SIPs are closer to the JenAesthetics dataset. The inverse holds for positive correlations (orange). The second column lists the rank of the style-transferred images relative to the JA dataset. (B–D) Responses for each rating dimension are plotted as a function of the Mahalanobis distance in the 5d space spanned by the five SIPs that differ significantly between the style-transferred images and traditional Western artworks (Figure 4). Each dot represents one style-transferred image. For the linear regression, the solid line represents the fitted line and the dashed lines its 95% confidence interval. Spearman’s coefficients of correlation ρ are given in (A) and (C) with their respective significance levels. For (A–D), significance levels are *p < 0.05; **p < 0.01; ***p < 0.001. n.s., not significant.

To substantiate the above result, we calculated the Mahalanobis distance of each style-transferred image to the median of the JenAesthetics dataset in the multidimensional space spanned by the five SIPs. We then correlated the distances with the aesthetic ratings. Results in Figures 8B–D suggest that style-transferred images that are located closer to the JenAesthetics dataset in this space are rated more highly for Pleasing and Interesting; no such correlation is found for Harmonious ratings.
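The distance computation can be sketched in Python as follows. The sketch assumes that the covariance matrix is estimated from the reference dataset, which the text does not specify; the function name `mahalanobis_to_median` and the small two-dimensional example are hypothetical and serve only to make the calculation concrete.

```python
import numpy as np

def mahalanobis_to_median(images, reference):
    """Mahalanobis distance of each row in `images` to the median of `reference`.

    Both arguments are (n_samples, n_features) arrays of SIP values.
    Assumption: the covariance is estimated from `reference`.
    """
    images = np.asarray(images, dtype=float)
    reference = np.asarray(reference, dtype=float)
    center = np.median(reference, axis=0)
    inv_cov = np.linalg.inv(np.cov(reference, rowvar=False))
    delta = images - center
    # Quadratic form delta^T * inv_cov * delta, evaluated row by row.
    return np.sqrt(np.einsum("ij,jk,ik->i", delta, inv_cov, delta))

# Toy reference set with identity covariance and median (1, 1).
reference = np.array([[0, 0], [2, 0], [0, 2], [2, 2], [1, 1]])
images = np.array([[1, 1], [4, 5]])
distances = mahalanobis_to_median(images, reference)
print(distances)
```

Unlike the per-variable Euclidean distance, the Mahalanobis distance accounts for the different scales of the SIPs and for correlations between them, so that a single number summarizes how far an image lies from the reference dataset in the space of all five properties.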

Discussion

We investigated how neural style transfer (NST; Gatys et al., 2015; Kolkin et al., 2019) can be used to generate abstract images that display a wide range of statistical image properties. With these images, we pursued four aims to better understand the style transfer process. We compared (1) the objective properties (SIPs) and (2) the ratings of the input images (original artworks and random-phase images) with those of the output images (style-transferred images). We asked (3) which SIPs predict aesthetic ratings by human beholders in the style-transferred images and (4) how these SIPs and their predictive value for aesthetic ratings relate to those of a large set of traditional Western paintings (JenAesthetics dataset).

To describe the objective structure of the images, we selected a set of eight statistical image properties (SIPs) that have been related previously to artistic style and aesthetic perception. The selected SIPs cover different aspects of formal image structure and composition. They reflect the density and distribution of oriented luminance and color gradients (Complexity, Self-similarity, 2nd-order entropy), richness and variability of low-level CNN filter responses (Variance Pa[2] and Pf[30]) and color features (Lab [b], HSV [S], and HSV [H] entropy). For the style-transferred images, the eight SIPs assumed a wide range of values (Figure 4) and showed relatively weak correlations between each other (Figure 3).

Importantly, the eight SIPs were strong predictors of the aesthetic rating responses to the style-transferred images (Figures 7,8 and Supplementary Tables 2,3). The explained variances R2adj for models with the eight SIPs are about as high as the R2adj values for models with the first eight principal components of all 29 variables that were considered initially (Supplementary Table 2; see Materials and methods section). Thus, the reduction from 29 to 8 variables did not decrease the explanatory power of the reduced model substantially.

We can only speculate about the origins of the remaining variance, which is not covered by the SIPs. Besides higher-order visual features, possible sources of variance include environmental and genetic factors (Bignardi et al., 2020; for a review, see Chamberlain, 2022), as also found for the evaluation of face attractiveness (Germine et al., 2015). Personality factors also predict aesthetic ratings (Chamorro-Premuzic, 2009). For instance, they explain a large proportion of the variance associated with aesthetic chills in response to art (Silvia and Nusbaum, 2011; Bignardi et al., 2022). A comprehensive model on how these diverse factors interact remains elusive at present.

Transfer of statistical image properties during neural style transfer

Our aim was to quantify how the SIPs changed during their transfer from original artworks onto random-phase images. We found that the style-transferred images differ from original paintings in three SIPs and from random-phase images in seven SIPs. In other words, the style-transferred images resemble original abstract artworks more closely in their image properties than they resemble the random-phase images. The correlation analyses (Table 2) quantify the transfer effects and provide evidence that what was originally termed the “style image” (Gatys et al., 2015) determines the formal features, i.e., the SIPs, whereas the formal features of the “content image” (Gatys et al., 2015) get largely lost in the process of NST. This result suggests that style, as defined in NST (Gatys et al., 2015), can be represented, at least in part, by the eight SIPs in our study. The transfer of color features seems to work particularly well, both subjectively (Figure 1) and objectively, as indicated by high correlations between color values and ratings (Tables 2,3).

As an example, the Fourier slope, which was set to fixed values of –5 to 0 in the random-phase images, transforms to a relatively narrow range of values between about –3 and –2 in the style-transferred images (Supplementary Figure 2). The slopes of the 25 abstract paintings in the present study (–3.34 to –1.59; median: –2.64) span a similar range. This range of values is close to the Fourier slope of natural scenes and other visual artworks (Aks and Sprott, 1996; Graham and Field, 2007; Redies et al., 2007), which human beholders generally prefer (Graham and Redies, 2010).

The range of SIPs of the style-transferred images shows considerable overlap with that of human-made artworks (Figure 4). The variance of the individual SIPs of the style-transferred images is generally higher than that found in traditional Western paintings (JenAesthetics dataset; Figure 4). A large range of variation of SIPs has also been described for abstract art (Redies and Brachmann, 2017) and modern art (Mather, 2018). The SIPs of the style-transferred images thus represent a wide range of values that also cover those of traditional art and abstract/modern art.

Transfer of aesthetic ratings during neural style transfer

Our results revealed that mean rating responses for the original abstract paintings correlate positively with the

FRONTIERS IN NEUROSCIENCE
