
^{a}Corresponding author. E-mail:

In this paper, we compare the methods proposed by



Databases often contain outlier observations, which behave differently from the majority. It is important to detect these observations, since they affect the data analysis in several ways. In this respect:

They can mask the data pattern and distort the results, so that the conclusions would be completely different without them.

They can violate the normality assumption required by many multivariate techniques.

Outliers have different causes:

Errors in measurement, collection, or transcription.

Intentional response errors by the respondents.

Sampling errors: the inclusion of statistical units drawn from populations other than the target population.

Intrinsic heterogeneity: the observed elements belong to the target population, but differ from the rest in their choices, attitudes, or behavior.

Sometimes the detection of outliers is the first step in a statistical analysis. Other times, the outliers need to be removed or downweighted; different causes motivate different procedures. The detection of outliers depends on the type of error (or cause) in the data. In the case of errors in measurement or data entry, it is relatively simple to correct them, and it is convenient to eliminate the obvious mistakes. However, a controversial question remains: what should we do when the outliers derive from the intrinsic heterogeneity of the data?

In this paper we discuss some multivariate methods for detecting outliers. Among multivariate methods, there are two approaches to identifying outliers: those based on the distances of the observations from the center of the data, and those based on projections of the original data. Projection pursuit methods identify atypical observations easily, and they have the advantage that it is not necessary to know the data distribution. Their disadvantage, however, is a high computational load, which increases significantly with the number of variables considered.

The document is organized as follows. The first section describes two algorithms used to detect outliers (

The Mahalanobis distance from the center of the data is the classical multivariate way of identifying outliers, that is, observations far from most of the others.

Let x_1, x_2, …, x_n be a random sample from a p-variate normal distribution N_p(µ, Σ). The Mahalanobis distance between an observation x_i and the location µ, weighted by the covariance Σ, is used to decide whether x_i is an outlier:

MD_i² = (x_i − µ)′ Σ⁻¹ (x_i − µ)

The squared Mahalanobis distance has a chi-square distribution with p degrees of freedom, and the observation x_i is considered an outlier if MD_i² exceeds a high quantile χ²_{p,1−α} of that distribution.

When both µ and Σ are unknown, estimators must be used: the sample mean vector and the sample covariance matrix.
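As a minimal sketch of this classical rule (our own Python illustration using numpy and scipy; the cut-off quantile is a common but not unique choice):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.025):
    """Flag observations whose squared Mahalanobis distance from the sample
    mean exceeds the (1 - alpha) chi-square quantile with p degrees of freedom."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    center = X.mean(axis=0)                # estimated location (sample mean)
    cov = np.cov(X, rowvar=False)          # estimated scatter (sample covariance)
    diff = X - center
    # Squared Mahalanobis distance of each row, via a linear solve.
    d2 = np.einsum('ij,ij->i', diff, np.linalg.solve(cov, diff.T).T)
    cutoff = chi2.ppf(1.0 - alpha, df=p)
    return d2, d2 > cutoff
```

An extreme point planted in otherwise standard normal data is flagged, although, as discussed below, the classical estimates themselves are distorted by the contamination.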

The mean vector and the covariance matrix are themselves affected by outliers, and in addition the Mahalanobis distance relies on the assumption of normality. Therefore, it is affected by outliers and does not allow the identification of sets of outliers (

An alternative approach is to use robust location and scale estimators, which are resistant to the influence of outlying observations.

Nevertheless, the distance-based approach presents two difficulties: (i) obtaining a reliable robust location estimator, and (ii) determining and classifying the outliers; it is important to find a metric that separates outliers from regular observations.
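A very simplified robust variant can be sketched as follows (a coordinatewise median/MAD distance; this is our own stand-in for fully affine-equivariant robust estimators such as MCD, which are what the literature actually recommends):

```python
import numpy as np

def robust_distance(X):
    """Simplified robust distance: center by the coordinatewise median,
    scale by the MAD (with the usual 1.4826 consistency factor), and take
    the Euclidean norm of the standardized data. Not affine equivariant;
    a stand-in for estimators such as MCD."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) * 1.4826
    return np.sqrt((((X - med) / mad) ** 2).sum(axis=1))
```

Because median and MAD are barely moved by a few extreme points, the planted outlier's distance stays large instead of being masked.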

Other outlier-detection procedures are based on projections. The underlying idea of these methods is to find a suitable projection of the data in which the outliers are readily apparent and can thus be downweighted to yield a robust estimator.

From that point of view,

The disadvantage of projection pursuit methods is their computational burden, which increases very rapidly with the number of variables, that is, in higher dimensions.

Principal components are the orthogonal directions that maximize the variance along each component. This well-known dimension-reduction method is intuitive for identifying outliers, since outliers increase the variance along their respective directions. Outliers are more visible in the principal-components space, at least along some direction of maximum variance, than in the original data space. Principal components analysis selects a small number of highly informative components, discarding those that do not contribute significant additional information. In this way, the dataset becomes more computationally tractable without losing much information.
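The idea can be illustrated with a small sketch (the planted outliers and all names are ours): points shifted along one direction inflate the variance there, so they appear as extreme scores on the corresponding principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                            # regular observations
X[:3] += 8.0 * np.array([1, 1, 0, 0, 0]) / np.sqrt(2)    # three planted outliers

Xc = X - X.mean(axis=0)                                  # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)        # PCA via SVD
scores = Xc @ Vt.T                                       # principal-component scores
# The planted outliers inflate the variance along their own direction,
# so they show up as extreme scores on the first principal component.
```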

In the following sections, we describe the

Given a sample (x_1, …, x_n) of a

1) The original data are rescaled and centred, see (

where x̄ and s_x are the sample mean and sample variance, respectively.

2) Set

3) Project the sample points onto a lower-dimensional subspace.

Where _{j}
_{j}

Where

4) Compute the distances d_1, d_2, …, d_p.

5) Repeat the procedure (minimizing the kurtosis) to obtain d_{p+1}, d_{p+2}, …, d_{2p}.

6) Determine whether an observation is an outlier in any one of the 2p directions,

Where

7) Define a new sample composed of all observations i if

8) Let U denote the set of all observations not labelled as outliers, and compute the mean vector

Those observations

The Kurtosis method is affine equivariant.
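The core idea, searching for a direction that maximizes the kurtosis of the projected data and then applying a univariate outlier rule on that projection, can be sketched as follows. This is a simplified illustration, not the authors' full algorithm: it uses a generic numerical optimizer with random restarts rather than the specific iterations of the original paper, and a median/MAD rule with an arbitrary cut-off.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

def max_kurtosis_direction(X, restarts=10, seed=0):
    """Search for a unit direction maximizing the kurtosis of the
    projections (simplified: generic optimizer, random restarts)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    def neg_kurt(a):
        n = np.linalg.norm(a)
        if n < 1e-12:                       # guard against degenerate directions
            return 0.0
        return -kurtosis(X @ (a / n), fisher=False)  # Pearson kurtosis

    best = None
    for _ in range(restarts):
        res = minimize(neg_kurt, rng.normal(size=X.shape[1]),
                       method='Nelder-Mead')
        if best is None or res.fun < best.fun:
            best = res
    return best.x / np.linalg.norm(best.x)

def flag_on_projection(X, d, cut=5.0):
    """Univariate rule on the projections: flag points whose
    median/MAD-standardized projection exceeds the cut-off."""
    z = np.asarray(X, dtype=float) @ d
    med = np.median(z)
    mad = np.median(np.abs(z - med)) * 1.4826
    return np.abs(z - med) / mad > cut
```

With a small, concentrated cluster of contamination, the kurtosis-maximizing direction points towards the cluster and the projected outliers are far outside the median/MAD band, in line with the behaviour the authors describe.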

However, other authors have discussed the Kurtosis method and raise important points. The method works well in the presence of scattered outliers or multiple clusters of outliers.

For those cases in which the shape of the contamination is similar to that of the original data, the method can be supplemented with other general methods (an alternative approach is to use clustering methods to supplement the general-purpose robust methods). The greatest chance of success comes from using multiple methods, at least one of which is a general-purpose method, such as FAST-MCD or MULTOUT, and at least one of which is meant for clustered outliers, such as the Kurt method.

However, several key aspects of the proposed Kurt algorithm have been criticized. The standardization in Step 1 uses the classical mean and covariance matrix. It is well known that these estimators are extremely sensitive to outliers, which often leads to labelling outliers as good data points and good points as outliers. The authors answer:

This problem cannot appear in our method. First, note that the algorithm we propose is affine equivariant, independently of the initial standardization. The kurtosis coefficient is invariant to translations and scaling of the data, and a rotation will not affect the maximisers or minimisers. Moreover, we have tried to be careful when defining the operations to generate the successive directions, as well as in the choice of an initial direction for the optimization problems. (

Another criticism concerns the maximization and minimization of the kurtosis in Steps 2 and 3. The authors indicate that the kurtosis is maximal (respectively, minimal) in the direction of the outliers when the contamination is concentrated and small (respectively, large). However, this is not always true for an intermediate level of contamination. The authors answer:

The behaviour of the kurtosis coefficient is particularly useful to reveal the presence of outliers in the cases of small and large contaminations, and this agrees with the standard interpretation of the kurtosis coefficient as measuring both the presence of outliers and the bimodality of the distribution. What is remarkable is that in intermediate cases with a=0.3 the procedure does not break down completely and its performance improves with the sample size. (

The choice of taking 2p orthogonal directions has also been questioned.

Regarding the choice of orthogonal directions, they reply:

Our motivation to use these orthogonal directions is twofold. On the one hand, we wish the algorithm to be able to identify contamination patterns that have more than one cluster of outliers. The second motivation arises from a property of the kurtosis that implies that in some cases the directions of interest are those orthogonal to the maximization or minimization directions. (

The authors did not explain how they obtained the cut-off values β_p in

The results are unfortunately not totally satisfactory; the reason is the large variability in these values in the simulations^{[1]}. This variability has two main effects –it is difficult to find correct values (huge numbers of replications would be required) and for any set of 100 replications there is a high probability that the resulting values will be far from the expected one. Nevertheless, we agree that these values could be estimated with greater detail, although they do not seem to be very significant for the behaviour of the algorithm, except for contaminations very close to the original sample. (

Sequential determination of outliers in Step 8: the mean and covariance matrix of the good data points are computed and used to decide which outliers can be reclassified as good observations. This procedure is repeated until no more outliers can be reallocated. It has been suggested that Step 8 be applied only once. The authors are a little surprised by the criticism of procedures that determine the outliers sequentially. They replied:

The statistical literature is full of examples of very successful sequential procedures and, to indicate just one,

This algorithm was designed primarily for computational efficiency in high dimensions. It consists of two phases: the first detects location outliers and the second detects scatter outliers.

1) Rescale the data matrix X_{(n,p)}

Compute the covariance matrix of the rescaled data X^{*}

2) Compute the eigenvalues and eigenvectors of the covariance matrix of X^{*}

3) Compute a robust kurtosis weight for each component, denoted by w_{j}

To classify the data into outliers and non-outliers, we need to determine a weighted norm of the transformed data Z*, but it does not follow a chi-square distribution. Z* is similar to a robust Mahalanobis distance (RD_{i}) (distance from the median, rescaled by the MAD). Therefore, the algorithm uses a robust distance transformation similar

4) To assign a weight to each observation and use it as a measure of outlyingness, the translated biweight function w_{1i} is calculated (

where M is equal to the 33⅓% quantile of the distances and

5) Use the decomposition from step 2 and calculate the Euclidean norm of the data in the unweighted principal-component space (equivalent to the Mahalanobis distance in the original data, but faster to compute). Then use the

6) Determine the weights w_{2i} for each robust distance with the translated biweight function, where c^{2} is equal to , M^{2} is equal to , and finally calculate the final weight, see (1

where typically the scaling constant is s = 0.25. Outliers are then classified as the points with weight w_{i} < 0.25. This value implies that if one of the weights equals one, the other must be less than 0.0625; if w_{1} = w_{2}, x is classified as an outlier when the common value is less than 0.375.
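The weighting scheme can be sketched as follows (our own Python sketch; the translated biweight follows its standard piecewise definition, and the final-weight combination reproduces the thresholds quoted above):

```python
import numpy as np

def translated_biweight(d, M, c):
    """Translated biweight weight function: 1 for d <= M, 0 for d >= c,
    and a smooth quartic decay in between."""
    d = np.asarray(d, dtype=float)
    w = np.zeros_like(d)
    w[d <= M] = 1.0
    mid = (d > M) & (d < c)
    w[mid] = (1.0 - ((d[mid] - M) / (c - M)) ** 2) ** 2
    return w

def final_weight(w1, w2, s=0.25):
    """Combine the location and scatter weights; points with a final
    weight below 0.25 are classified as outliers."""
    return (w1 + s) * (w2 + s) / (1.0 + s) ** 2
```

With s = 0.25, a point with w1 = 1 needs w2 below 0.0625 to be flagged, and equal weights must both fall below 0.375, matching the thresholds stated above.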

Computational speed, that is, the speed of the computer to process data^{[2]}, is an advantage of this algorithm. Using examples and simulated data

The comparison with methods in low dimension, using simulated data, reveals that PCOut performs well at identifying outliers, with few masked outliers, although it has a higher percentage of non-outliers classified as outliers. It does particularly well for location outliers, where Kurt does very poorly; however, Kurt does exceptionally well for scatter outliers.

In this paper we applied the outlier-detection methods to a sample composed of Argentine companies whose shares are listed on the Buenos Aires Stock Exchange in the period 2004–2012. The database was prepared from the data publicly available on the Buenos Aires Stock Exchange web site (

The variables used to detect outliers are the following financial reporting indicators to analyse cost-effectiveness and cost behaviour (

^{[3]}. It is the sum of the direct and indirect selling expenses and the administrative expenses of a company.

The

We detected outliers using the Mahalanobis distance and the two projection pursuit methods presented by

We used Matlab for the Kurt method and the R package

The

The biplots (

For each method, the first and second principal components of the non-outlier data were calculated and plotted on biplots, which show different structures of the data (see

A plausible criterion for determining multivariate outliers is to use different methods and consider as outliers those observations simultaneously identified as atypical by all of them. In particular, in this application 19.9% of the data were identified as outliers by both the Kurt and PCOut methods.

The different results and performance of each method lead us to consider all the outliers detected by these methods. We aggregated the outliers identified by all the methods, taking advantage of their complementary performance. In this empirical application, we detected 225 outliers (30.24%).
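Operationally, combining methods amounts to simple set operations on the flagged indices (the index sets below are purely illustrative): the intersection gives the consensus outliers, and the union gives the aggregate used here.

```python
# Illustrative index sets of observations flagged by each method.
kurt_outliers = {2, 7, 11, 19}
pcout_outliers = {2, 7, 13, 19, 23}

consensus = kurt_outliers & pcout_outliers   # flagged by both methods
aggregate = kurt_outliers | pcout_outliers   # flagged by at least one method
```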

The biplot without the outliers detected by all the methods (

Outliers distort the results and mask the real data structure, so detecting them is an important task in multivariate data analysis.

This work presented two projection pursuit algorithms for detecting outliers; they are an example of the different algorithms available. It is not possible to single out one algorithm as the best: it depends on the data sample, and the algorithms may show similar performance. Each method has some disadvantage, so we proposed aggregating the outliers detected by different methods (in this paper, the Kurt and PCOut methods) and exploiting their different performance to improve outlier detection. In particular, projection pursuit methods only search for useful projections; they are not affected by non-normality and can be widely applied in diverse data situations.

Multivariate outlier detection is important for thorough data analysis; however, researchers have to decide whether to exclude the outliers from further analysis or to apply robust procedures to reduce their impact.

Research paper.

In the papers and the discussion, a number of computational experiments were conducted to study the practical behavior of the proposed algorithm.

It is defined as Millions of Instructions per Second (MIPS).

The Argentinean GAAP in effect at the time of this study was Resolución Técnica Nº 9 (RT9) of FACPCE. Chapter 5 of RT9 defines Selling Expenses as those related to the sales and distribution of products or services rendered by the firm. RT9 Chapter 5 states that Administration Expenses are expenses incurred by the firm in order to carry on its activities that cannot be attributed to any of the following functions: purchasing (procurement), production (operations), selling, research and development, or financing of goods or services. The same chapter of RT9 states that net sales (revenues) are to be presented in the income statement and that the amount shall exclude returns, discounts and taxes.