Functional analysis of variance of air pollution caused by fine particles

Particulate matter (PM) constitutes the main pollutant in the air and is closely related to the emergence of diseases (Bell et al., 2007). PM is usually classified by aerodynamic diameter; fine particles, with aerodynamic diameters up to 2.5 μm, are known as PM 2.5 (Laden et al., 2000; Febrero-Bande et al., 2007; Al-Hamdan et al., 2009). PM 2.5 is associated with chronic respiratory illnesses such as asthma and lung cancer, as well as with cerebrovascular diseases, among others.

The third station is located at the Universidad del Valle (UV). Station CO is located in the easternmost residential zone of the city and has a heavy impact from mobile sources, especially in the early morning and early evening. Station BA, in the city's northeast, is located in the middle of a small industrial area, close to an air force base. Station UV is in the city's south side, in the middle of the Universidad del Valle main campus, with a moderate impact from mobile sources; it is encircled by a dense residential and commercial neighborhood.
We analyzed the PM 2.5 air pollution data from stations CO, BA, and UV. The comparison may help the local environmental authority decide whether to keep recording measurements at these three sites, providing grounds to suggest a redesign of the local surveillance network. In addition, our results may be used as a guide for the imputation of missing points at one or more sites. In this PM 2.5 air pollution study, we used a functional extension of the classical Analysis of Variance (ANOVA), which compares the variance within each station to the variance among stations and bases its decision on an F-test. This extension is known as FANOVA (Ramsay & Silverman, 2005); its corresponding p-value is a continuous curve, allowing the user to identify the hours at which differences are statistically significant.
We employed discrete data, namely hourly measurements, which are assumed to come from an unknown continuous function. The response is a daily contamination curve, and a set of indicator variables representing the stations is used as predictors. Recent work on this paper's main topic can be found in Górecki & Smaga (2017), Górecki & Smaga (2018), Ruiz-Medina (2016), Estévez-Pérez & Vilar (2013), and Zhang (2013).

Particulate matter data
The available data are the PM 2.5 hourly averages at three stations in 2015. For each day without missing points, there would be 24 observations at each station. Thus, the fine particulate matter levels are discretely measured in time. According to Ramsay & Silverman (2005), a functional data analysis (FDA) is appropriate when measurements satisfy this condition, enabling the adequate representation of the data such that a detailed analysis can be performed. FDA is advantageous when considerable amounts of information need to be analyzed (Ramsay & Silverman, 2005).
The SVCASC takes air contaminant readings every 10 seconds; however, it only reports the PM 2.5 hourly average in µg/m³. Consequently, we expected to retrieve 24 observations per day, during 365 days, for each station. However, we obtained data corresponding to 164 days for station CO, 82 days for station BA, and 167 days for station UV. Missing data are a common problem in these measurement systems.

Functional data analysis
Ferraty & Vieu (2006) formally define a functional random variable: "A random variable χ is called a functional variable if it takes values in an infinite dimensional space (or functional space). An observation x of χ is called a functional datum". In this sense, FDA inherits the descriptive and inferential procedures of classical statistics, extended from scalars to functions.

To proceed with the FDA, the first step is to convert the discrete measurements into a smooth curve through smoothing techniques, using linear combinations of a collection of functions. In Ramsay & Silverman (2005), a basis function system is defined as a set of known functions f_j, with j = 1, 2, ..., k, that are linearly independent and belong to a functional space. Because the goal is to perform a comparison, it is necessary to work in a functional space that is defined over the field of real numbers and that has a norm and an inner product, allowing for the notion of an orthogonal and orthonormal basis as an extension of the concept of linear independence.
On the other hand, Hastie et al. (2009) and James et al. (2013) mention that the most widely used basis function systems for the construction of smooth curves are Fourier series and B-splines. The former seems more appropriate for data with periodic or cyclic behavior, whereas B-splines adapt easily to behavior with local changes, so their use extends to many fields. Ramsay & Silverman (2005) define a spline as a piecewise polynomial function whose flexibility allows it to adapt easily to the behavior of the data. The functions of the Fourier series and of the B-splines form an infinite set {f_j}, j = 1, 2, ...; however, a functional data analysis is conducted on a finite subspace spanned by {f_j}, j = 1, ..., k. In the B-spline case, each function is defined by an order m and a sequence of knots τ, and any spline function can be expressed as a linear combination of the basis functions.
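As a concrete illustration of such a basis (a minimal sketch, not the authors' R code), the snippet below evaluates cubic B-splines of order m = 4 on the hourly domain [0, 24] via the Cox-de Boor recursion; the knot sequence and evaluation point are arbitrary choices for the example. A useful sanity check is the partition-of-unity property: at any interior time t, the k basis values sum to 1.

```python
# Illustrative sketch: cubic B-spline basis functions f_j(t) on [0, 24]
# via the Cox-de Boor recursion. Knots and evaluation point are arbitrary.

def bspline_basis(j, m, t, knots):
    """Value of the j-th B-spline of order m (degree m - 1) at time t."""
    if m == 1:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    left = 0.0
    if knots[j + m - 1] != knots[j]:
        left = ((t - knots[j]) / (knots[j + m - 1] - knots[j])
                * bspline_basis(j, m - 1, t, knots))
    right = 0.0
    if knots[j + m] != knots[j + 1]:
        right = ((knots[j + m] - t) / (knots[j + m] - knots[j + 1])
                 * bspline_basis(j + 1, m - 1, t, knots))
    return left + right

m = 4                                   # order 4 = cubic
interior = [4.0 * i for i in range(7)]  # knots at 0, 4, ..., 24
knots = [0.0] * (m - 1) + interior + [24.0] * (m - 1)  # clamped ends
k = len(knots) - m                      # number of basis functions

t = 10.5
vals = [bspline_basis(j, m, t, knots) for j in range(k)]
print(round(sum(vals), 6))  # 1.0 (partition of unity)
```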
For the construction of the functional data, the discretely taken measurements are denoted z_h, with h = 1, 2, ..., n, where n is the total number of measurements. Subsequently, the model of equation (1) is fitted, where t_h is the h-th time at which the measurement z_h was taken, the c_j are coefficients to be estimated, and ε_h is a random error.
Because there are n observations in total, the smoothing can be represented in matrix form as in equation (2). It is then important to identify the appropriate number k of basis functions, which, according to Febrero-Bande & Oviedo de la Fuente (2012), can be done using the Generalized Cross-Validation (GCV) methodology. Thus, the number k that minimizes equation (3) is the optimum number of basis functions.
In equation (3), tr(S_k) is the trace of the smoothing matrix S_k, and SSE is the error sum of squares, that is, the sum of the squared differences between the observed and fitted values.
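To make the selection of k concrete (a sketch under stated assumptions, not the paper's computation), a common form of the GCV criterion is GCV(k) = n · SSE_k / (n − tr(S_k))²; for an unpenalized basis fit, tr(S_k) = k. The SSE values below are hypothetical, chosen only to show the trade-off: SSE falls as k grows, but the (n − k)² denominator penalizes overfitting.

```python
# Illustrative sketch of GCV-based selection of the basis size k.
# GCV(k) = n * SSE_k / (n - tr(S_k))^2; for an unpenalized fit tr(S_k) = k.
# The SSE values are hypothetical, for illustration only.

def gcv(sse, n, trace_sk):
    return n * sse / (n - trace_sk) ** 2

n = 24  # hourly measurements in one day
sse_by_k = {4: 130.0, 8: 42.0, 12: 20.0, 16: 14.0, 20: 11.0, 23: 10.5}
scores = {k: gcv(s, n, k) for k, s in sse_by_k.items()}
best_k = min(scores, key=scores.get)
print(best_k)  # 12: past this point, the (n - k)^2 penalty dominates
```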

Smoothing of functional data
After defining the system of basis functions to be used, it is important to identify the estimation method for finding the c_j coefficients of the model in equation (2). The ordinary least squares (OLS) method seeks to minimize the sum of squared distances between the observed and fitted values, assuming that the variance of the errors is constant; however, it is preferable to use the weighted least squares (WLS) method because it incorporates the variance-covariance matrix into the model when the errors exhibit autocorrelation structures. This matrix is generally denoted Σ (Ramsay & Silverman, 2005), and in this case the estimate of the coefficient vector C is given by equation (4), which is represented in matrix form.
In equation (4), F is the n × k matrix that contains the values of the basis functions evaluated at the times t_h, that is, f_j(t_h); W is a weight matrix of size n × n taken as Σ⁻¹. Finally, the constructed functional datum is given by equation (5).
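Assuming equation (4) is the standard WLS estimator C = (F′WF)⁻¹F′Wz, the sketch below (not the authors' code) works it out for a deliberately tiny case: two basis functions, f_1(t) = 1 and f_2(t) = t, and a diagonal W, so the 2 × 2 normal equations can be inverted explicitly. The data and weights are hypothetical; since the data lie exactly on z = 1 + 2t, any positive weights recover those coefficients.

```python
# Illustrative sketch of the WLS estimate C = (F'WF)^{-1} F'Wz for two
# basis functions f_1(t) = 1, f_2(t) = t and diagonal weights w.
# Data and weights are hypothetical.

def wls_2basis(times, z, w):
    """Weighted least squares for the basis {1, t}."""
    # Entries of the 2x2 matrix F'WF and the 2-vector F'Wz.
    a11 = sum(w)
    a12 = sum(wi * t for wi, t in zip(w, times))
    a22 = sum(wi * t * t for wi, t in zip(w, times))
    b1 = sum(wi * zi for wi, zi in zip(w, z))
    b2 = sum(wi * t * zi for wi, t, zi in zip(w, times, z))
    det = a11 * a22 - a12 * a12          # explicit 2x2 inverse
    c1 = (a22 * b1 - a12 * b2) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return c1, c2

times = [0.0, 1.0, 2.0, 3.0]
z = [1.0, 3.0, 5.0, 7.0]   # exactly z = 1 + 2t
w = [1.0, 2.0, 2.0, 1.0]   # hypothetical weights
c1, c2 = wls_2basis(times, z, w)
print(round(c1, 6), round(c2, 6))  # 1.0 2.0
```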
The P-spline method, in turn, adds to the OLS method (which takes W = I) a penalty for lack of smoothness governed by a parameter λ, which can also be estimated by GCV, varying only λ. In this case, the estimated coefficients are the solutions to the minimization of equation (6).
In practice, there are N functional data. Once the smoothed curves have been obtained, the following step is the calculation of their functional mean and variance, which are defined in equations (7) and (8), respectively, where N is the total number of curves in the sample.
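The functional mean and variance of equations (7) and (8) are computed pointwise across the N curves of the sample. A minimal sketch (with hypothetical curves already evaluated on a common grid, not the paper's data):

```python
# Illustrative sketch of the functional mean and variance: both are
# computed pointwise across the N sample curves, here evaluated on a
# common grid. The curves below are hypothetical.

def functional_mean(curves):
    n = len(curves)
    return [sum(c[i] for c in curves) / n for i in range(len(curves[0]))]

def functional_var(curves):
    n = len(curves)
    mean = functional_mean(curves)
    return [sum((c[i] - mean[i]) ** 2 for c in curves) / (n - 1)
            for i in range(len(curves[0]))]

curves = [[10.0, 12.0, 8.0],
          [14.0, 12.0, 10.0],
          [12.0, 12.0, 12.0]]
print(functional_mean(curves))  # [12.0, 12.0, 10.0]
print(functional_var(curves))   # [4.0, 0.0, 4.0]
```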

Modeling in FDA
Sometimes, it is of interest to identify whether the variation of a functional variable can be explained by a model based on other independent or regression variables. To do so, it is appropriate to fit functional regression models. In this case, the model considers a functional response and indicator covariates. The model is very useful in studies where characteristics of the response variable are identified according to variables represented as factors. Ramsay & Silverman (2005) illustrate the case with functional temperature modeling, evaluating the effect of the geographic zone in which the weather station is located. In this case, it is necessary to perform a FANOVA because the response variable is functional.
The model is formally given by equation (9), where µ is the global mean of y without the effect of the treatment or group g (in our case, stations CO, BA, and UV), G is the number of groups, α_g is the specific effect of group g on the response variable, and ε_i is the unexplained variation of individual i, i = 1, 2, ..., N.
In this way, the model of equation (9) can be represented as in equation (10).
On the other hand, the matrix representation of model (10) is D = Hβ + ε; therefore, using general linear model theory, the estimator of the vector β is obtained via OLS as in equation (11).
In the functional setting, the analysis of variance table looks exactly like its scalar counterpart; the main difference is that the sums of squares are not scalars but curves, as seen in Table 1. From this analysis, a hypothesis test is performed that allows identifying whether some of the parameters associated with the groups are statistically significant. The statistical hypotheses are H0: β_i = β_j for all i, j versus Ha: β_i ≠ β_j for some i, j = 2, ..., G + 1, i ≠ j. The test statistic F(t) (Table 1) is the F-statistic evaluated pointwise, as proposed by Ramsay & Silverman (2005); the pointwise p-value is the corresponding upper-tail probability of the F distribution evaluated at F(t). The procedures to construct the functional data needed for the FANOVA are carried out using the R packages fda (Ramsay et al., 2018) and fda.usc (Febrero-Bande & Oviedo de la Fuente, 2012).
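The pointwise statistic is just the ordinary one-way ANOVA F computed at each grid point of the curves. The sketch below (a simplification, not the fda/fda.usc implementation) shows this with hypothetical station curves; converting F(t) into a pointwise p-value would then use the upper tail of an F distribution with (G − 1, N − G) degrees of freedom (e.g., scipy.stats.f.sf), omitted here to keep the sketch dependency-free.

```python
# Illustrative sketch: the FANOVA pointwise F-statistic is the one-way
# ANOVA F computed at each time point of the grid. Curves are hypothetical.

def f_stat(groups):
    """One-way ANOVA F for scalar samples, one list per group."""
    all_vals = [v for g in groups for v in g]
    n, G = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ssb / (G - 1)) / (ssw / (n - G))

def pointwise_f(groups_of_curves):
    """F(t) over the grid; groups_of_curves[g] is a list of curves."""
    T = len(groups_of_curves[0][0])
    return [f_stat([[curve[i] for curve in g] for g in groups_of_curves])
            for i in range(T)]

# Three hypothetical stations, two daily curves each, on a 3-point grid.
co = [[10.0, 20.0, 30.0], [12.0, 22.0, 32.0]]
ba = [[30.0, 21.0, 31.0], [32.0, 23.0, 33.0]]
uv = [[20.0, 20.5, 30.5], [22.0, 22.5, 32.5]]
F = pointwise_f([co, ba, uv])
print(F)  # [100.0, 0.25, 0.25]: groups separate only at the first point
```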

Results and Discussion
To convert the 24 discrete hourly measurements of a day into a smooth curve, we selected the B-spline approach, since we have no evidence of any periodicity in the data and have empirical evidence of a lack of periodicity for some days of the week. We chose cubic polynomials because of their flexibility and adaptability, given the high variability of the response; B-splines have also been chosen by other authors (Al-Hamdan et al., 2009) for FDA-based fine particulate matter analysis. Thus, the daily curves of all available complete days of the year were constructed. For this, we picked the number k of cubic B-splines that generates the smooth curves, based on the GCV criterion, which minimizes the mean squared error of the smooth curve estimator. The value of the penalization parameter λ was chosen similarly; in this case, the optimum values are k = 20 and λ = 1.
In addition, the weekday variable was introduced into the analysis so that seven separate models were constructed, one for each day of the week, seeking to refine the analysis and determine whether the curves of the three stations are significantly different. Figure 1 shows the observed data points (Fig. 1A) together with the smoothed data (Fig. 1B). The hourly averages per station (Fig. 1C) and the functional means (Fig. 1D) are also depicted.
The current Colombian air quality regulation (Resolución 2254 de 2017, Ministerio del Medio Ambiente y Desarrollo Territorial) dictates that the maximum hourly average allowed for PM 2.5 is 37 µg/m³. We compared the observed hourly averages against this standard using the data points (Fig. 1C) as well as their functional means (Fig. 1D), and noted that the observed hourly averages never exceeded the established upper limit at any of the three stations. For the yearly global estimation, a functional analysis of variance test was performed to gain a general idea of air pollution at the three stations. Results are shown in Fig. 2: Fig. 2A depicts the PM 2.5 functional means at each of the three stations, and Fig. 2B shows the pointwise p-value of the functional analysis of variance, from which we concluded that PM 2.5 concentrations show significant differences among the three assessed stations from hour 0 to hour 15 of an average day.

Analysis with smoothing parameters per day and per station
Environmental pollution may be influenced by the day of the week owing to the impact of mobile sources, because the circulation of internal combustion vehicles is consistently higher on some days of the week. To evaluate the results in light of this fact, a day-by-day analysis was performed. Fig. 3 reveals that air pollution on an average Monday resembles that of the entire year (Fig. 2). The behavior is similar on Tuesdays, Saturdays, and Sundays, but differs on Wednesdays, Thursdays, and Fridays, as shown in Fig. 4D for Thursdays, where the differences between station curves are not significant during most of the day. In fact, even though significant differences were identified during the first hours of the day, atypical observations occur during those hours, corresponding to the first hours of the year, especially during the New Year celebrations, as shown in Figs. 4A and 4B.
These results lead to the conclusion that the three stations behave differently, as observed in the global analysis for the year 2015, confirming the importance of taking air quality readings in all three sites.

Conclusions
• Environmental pollution data, measured over a continuous domain, in this case time, turn out to be appropriate for functional data analysis. This analysis makes it possible to summarize in a smooth curve the information of n scalar values of a variable that varies continuously but is observed discretely.

• The results of the data exploration are reflected in the functional analysis of variance: in general, environmental pollution at the three monitoring stations shows significant differences that are noticeable during the morning hours and almost imperceptible during the afternoon and evening hours.
• Thanks to this analysis, we determined that the station with the highest average daily pollution levels is station BA. Consequently, the inhabitants of this zone, in the northeast of the city, are exposed to a higher risk of contracting chronic respiratory diseases, revealing the need to implement environmental policies that reduce PM 2.5 levels in such zones of the city.
• According to the present functional exploratory analysis (Fig. 1), stations CO and UV seem to show similar average PM 2.5 levels, even if local environmental conditions differ from one station to another. A further analysis is needed on this issue.
• Due to the large amount of missing data, the number of available functional data was reduced. An option to tackle this drawback would be to impute the missing data to complete the dataset; in this way, a greater number of curves would be available for analysis, since only days with complete hourly measurements were considered in the present work.
• Finally, as a matter of priority, we recommend evaluating the correlations between measurements of the different stations, which could be high because the studied pollutant is airborne, suggesting possible spatial and temporal correlation between monitoring stations. In addition, for the correlation to be more meaningful, measurements should be taken on the same dates at the studied stations, although this implies a reduction of the database.