Cluster-based LSTM models to improve Dengue cases forecast

Public health problems such as dengue fever need accurate forecasts so governments can take effective preventive measures. Machine learning, and in particular deep learning (DL), has become increasingly popular as the volume of data continuously increases. Nevertheless, performing accurate predictions in areas with fewer cases can be challenging. When we apply DL models using long short-term memory (LSTM) networks in different cities, considering weekly dengue incidence and climate, some models may present heterogeneous behaviors and poor accuracy because they need more data. To mitigate this problem, a clustering analysis across time series is performed, based on scores that measure clustering quality, in 217 Paraguayan cities. First, we compare raw-based and feature-based clustering techniques considering several metrics. Our results indicate that hierarchical clustering combined with Spearman correlation is the most appropriate approach. Then, several LSTM models built using the clustering results are compared. The root-mean-square error confirms that the clustered models improve accuracy by 19.48 ± 18.80%. Finally, we present a comparison between one-dimensional statistical clustering and our best clustering setup. The main contribution of this work is a technique that improves the performance of time series models by combining information from similar time series and weather data.


Introduction
Dengue virus is an arthropod-borne virus transmitted mainly by the Aedes aegypti mosquito acting as a vector, with a higher incidence in urban areas. Symptoms include fever, headaches, joint and muscle pain, and nausea; the disease varies from mild fever to the severe conditions of dengue hemorrhagic fever and shock syndrome [1]-[3]. Over 50,000,000 infections are thought to occur annually worldwide, including 500,000 hospitalizations for dengue hemorrhagic fever [4]. Dengue infections have dramatically increased during the past ten years in South American nations such as Colombia, Ecuador, Paraguay, Peru, Venezuela, and Brazil. Dengue is also recognized as endemic, which is why it is regarded as a public health issue in tropical and subtropical areas [5].
In Paraguay, constant circulation has been observed since 2009, with a sustained increase in cases reported between 2009 and 2015 and a third major epidemic in 2013, the year in which 153,793 cases were reported [6]. More than one hundred deaths were found among all the cases documented by the Ministry of Public Health and Social Welfare (MSPBS) during this period.
Deterministic and statistical models have been developed to predict the incidence of the disease as a function of time, based on epidemiological and entomological studies, including meteorological factors, vector population density, or social media data [7]- [11]. Therefore, multivariate metrics were developed to select the variables for the models [12]- [14]. However, the relationship between dengue incidence and meteorological data is highly complex and cannot be easily inferred [15]- [18]. Additionally, some data from healthcare providers arrive in the reporting system in a delayed manner. This is why data-driven approaches based on machine learning and deep learning have become competitive alternatives to traditional models [19]- [22].
Deep learning approaches, specifically LSTM (Long Short-Term Memory), have proven that they can outperform state-of-the-art models and have been used to forecast influenza trends successfully [23]- [27]. However, fitting LSTM models to the time series of many cities is a challenging task because some cities display low and high incidences in a sort of heterogeneous behavior. In [28], [29], authors proposed to group similar time series instead of fitting one model for each city. However, they have chosen their groups based only on the performance of the model.
Recently, we presented a comparison between traditional machine learning approaches (e.g., Lasso regression (LR), Random Forest (RF), and Support Vector Regression (SVR)) and Deep Learning (DL) by measuring the RMSE [30]. This work presents improvements to the cluster analysis, covering raw-based and feature-based approaches, and additional metrics such as the Calinski-Harabasz and Dunn scores to measure group quality. Group-based DL models are then considered to demonstrate the improvement due to the clustering approach, and new metrics such as the standard deviation of percentual errors and the maximum percentual error are considered. Another important comparison, between one-dimensional statistical clustering and our best clustering setup, is presented to analyze the interpretability of our clusters. This paper is organized as follows: §2 presents the methodology, the dataset used, the sampling, and the preprocessing. §3 provides experiments for the selection of the clustering technique for the time series and its evaluation. §4 provides a brief review of the selected model and explains how the data is grouped for the experiments, showing and discussing the results. Finally, §5 draws the conclusions.

Methodology
A comparison of grouping techniques is performed based on the similarity of the incidence time series. To verify whether clustering improves model performance, an LSTM model is trained considering other natural grouping schemes, such as the administrative division of the country and a global model. Clustering models are presented in §3. The overall framework of this work is illustrated in Figure 1, and the detailed steps are presented later. Figure 1: Summarized workflow of the model selection, presenting the order of procedures: a) reading the databases; b) carrying out the tests to determine the most suitable clustering technique; c) evaluating the LSTM model using the city data grouped by the clustering criteria described in §3. Finally, the interpretability of our clusters is presented.

Datasets
Dengue fever cases (c) from January 2009 to December 2013, organized into 217 cities of 17 states (also called departments) of Paraguay, and the population (p) of each city were provided by the COMIDENCO project [31]-[33]. The COMIDENCO team curated the data to develop epidemiological models. The meteorological information was gathered from weather stations located all around the country [34]. The available meteorological data included daily reports of minimum, average, and maximum temperature; minimum, average, and maximum atmospheric pressure; rainfall; maximum, average, and minimum wind speed; and cloudiness.
To develop predictive models, the data is organized weekly. Once the time series for each city is obtained, the Dengue fever incidence is computed. In this article, we consider the following definition for the incidence of Dengue in a city in terms of percentage.
The incidence of Dengue fever in a city, denoted as Icd, is given by

Icd = 100 · c / p,   (1)

where c is the number of cases per week and p is the population. Once the incidence is calculated, weather features associated with each time series have to be chosen. According to [28], the features selected are average temperature, average atmospheric pressure, and rainfall.
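As a minimal numeric sketch of this definition, the weekly incidence is simply the case count scaled by the city population (the counts and population below are illustrative, not from the dataset):

```python
# Weekly Dengue incidence as a percentage of the city population.
def incidence(cases_per_week, population):
    """Return Icd = 100 * c / p for each weekly case count c."""
    return [100.0 * c / population for c in cases_per_week]

weekly_cases = [3, 7, 12, 5]      # hypothetical counts for one city
icd = incidence(weekly_cases, population=50_000)
```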
To obtain effective values for each feature associated with a time series, we geographically interpolate the values recorded at the two closest weather stations, as needed. Then the features are linearly combined using weight factors proportional to their corresponding city populations. Finally, our model uses four variables: the incidence Icd and the three meteorological variables (which are chosen previously). Each time series has 265 records since we have 265 weeks in the period 2009-2013.
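A hedged sketch of the station interpolation step: here a city's feature value is estimated from its two nearest stations with inverse-distance weights. The weighting scheme and the coordinates/values are illustrative assumptions, not the paper's exact procedure:

```python
import math

def interpolate_feature(city_xy, stations):
    """stations: list of ((x, y), value); uses the two closest ones,
    weighted by inverse distance (an illustrative assumption)."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    two = sorted(stations, key=lambda s: dist(city_xy, s[0]))[:2]
    weights = [1.0 / max(dist(city_xy, p), 1e-9) for p, _ in two]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, two)) / total
```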

Clustering algorithms
Clustering is an unsupervised machine learning task that aims to classify a large amount of data into groups when there is no prior knowledge about the real groups. Partitions are made in such a way that the elements of a group are as similar as possible to each other [35]. Time series clustering can be used to improve the performance of a predictive model for Dengue fever in two ways:
1. Each group can be taken as a unit, so that one model can be fitted for all the individual components of that cluster, reducing the number of necessary adjustments per city.
2. The more data available for deep learning models, the better the network performance becomes, as it helps to avoid overfitting.
The k-means algorithm [36] tries to find a partition of the samples into k clusters so that each sample belongs to exactly one of them, specifically the one whose centroid is closest. A centroid is the middle of a cluster, which can be thought of as the multidimensional average of the cluster. Algorithm 1 shows the k-means partitioning process. In hierarchical clustering [37], clusters are generated hierarchically, as the name implies. It starts by taking every data point as a cluster. Then, the closest points merge into a single cluster, and so on, until all points are in a single cluster. Algorithm 2 shows the hierarchical clustering process. Finally, a cluster of size n is obtained, where n is the initial number of points to be grouped. It may seem pointless to form a single large group with all elements; however, the goal of hierarchical clustering is to form a dendrogram. A dendrogram is a tree that shows the merging process, and from it, cut points can be defined to form groups, as seen in Figure 2 [32], where five formed groups can be observed; the lines represent the distances between elements, and lines of the same color represent elements in the same cluster.
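The merge-then-cut procedure above can be sketched with SciPy's hierarchical clustering: `linkage` builds the dendrogram and `fcluster` cuts it into a chosen number of groups (the toy data and the cut are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (5, 2)),    # one tight blob
                    rng.normal(5, 0.1, (5, 2))])   # another, far away
Z = linkage(points, method="average")              # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
```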
In DBScan [38], for each point, the neighborhood of a given radius must contain at least a minimum number of points for the point to belong to a cluster. DBScan needs two parameters:
• eps: the radius used to define a neighborhood, i.e., two points at a distance ≤ eps are in the same neighborhood.
• MinPts: the minimum number of neighbors, i.e., data points within the eps radius.
The algorithm starts by visiting a random point; its neighborhood is visited, and if it contains enough points (≥ MinPts), it is considered dense enough and a cluster is started on it. Otherwise, the point is labeled as noise. This process continues until a densely connected cluster is built; then a new point is visited to discover another cluster or more noise. Algorithm 3 describes DBScan. As seen, some parameters must be set beforehand to run the algorithms, the most crucial being the number of clusters. Determining the optimal number of clusters (k) is a complex task. The elbow method is a heuristic that consists of graphing the variation of an error metric as a function of the number of clusters and choosing the elbow of the curve as the number of clusters to use. It works by running the clustering algorithm, e.g., k-means, for different values of k, varying k from 1 to n (n ≥ 2), and then choosing the value after which the decrease in error stops being significant.
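The elbow heuristic can be sketched by running k-means for several values of k and recording the within-cluster sum of squares (scikit-learn's `inertia_`); the toy data with three real blobs is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0, 5, 10)])
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
            .fit(data).inertia_ for k in range(1, 7)]
# The curve drops steeply until k = 3 and flattens afterwards;
# the "elbow" at k = 3 matches the number of generated blobs.
```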

LSTM Model
The deep neural networks considered in this work are specifically designed to forecast time series, thanks to the memory cells (LSTM) that preserve long and short dependencies [39], [40]. LSTM cells have input (i), forget (f), and output (o) gates, which determine the addition of new information to the cell state (C), the deletion of less important information from memory, and the output prediction (h), respectively. Similarly to recurrent neural networks, LSTM networks use sequential information, in which the output depends not only on the current inputs but also on previous ones; e.g., the input at a point x_t includes the value x_{t−n} in the same series, where n is the look-back. These gates work together to learn and store long- and short-term information related to the sequence.
States of LSTM cells are computed as follows [41]:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i),      (2)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f),      (3)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o),      (4)
Ĉ_t = tanh(W_c x_t + U_c h_{t−1} + b_c),   (5)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ Ĉ_t,           (6)
h_t = o_t ⊙ tanh(C_t),                     (7)

where W_q ∈ R^{h×d}, U_q ∈ R^{h×h}, and b_q ∈ R^h are weight matrices and biases, respectively; the subscript q can be the input gate i, output gate o, forget gate f, or memory cell c, depending on what is being calculated. The subscripts d and h refer to the number of input features and the number of hidden units, respectively. The symbol ⊙ is the Hadamard (entrywise) product. Vectors i_t ∈ R^h, f_t ∈ R^h, and o_t ∈ R^h are the input, forget, and output gates, respectively. Vector C_t ∈ R^h is the current cell state, and vector Ĉ_t ∈ R^h is the new candidate value for the cell state. The function σ(·) is a sigmoid and modulates equations (2)-(4) between 0 and 1. The decisions of these three gates depend on the current input x_t ∈ R^d and the previous output h_{t−1} ∈ R^h. If a gate is 0, the signal is blocked by the gate. The forget gate f_t defines how much of the previous state is allowed to pass. The input gate i_t decides which new information from the input to update or add to the cell state. The output gate o_t decides which information to output based on the cell state. The memory cell C is an accumulator of the state information: the old cell state C_{t−1} is updated into the new cell state C_t using equation (6). The calculation of both the new candidate value Ĉ_t and the output of the current LSTM block h_t uses the hyperbolic tangent function, as in equations (5) and (7). The cell state and the hidden state are transmitted to the next cell at every time step. Weights and biases are obtained by minimizing a cost function during training. LSTM neural networks consist of a set of connected LSTM cells.
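A single LSTM cell step implementing equations (2)-(7) can be written directly in NumPy; the random weights and the dimensions d and h below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """W[q]: (h, d), U[q]: (h, h), b[q]: (h,) for q in i, f, o, c."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])     # eq. (2)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])     # eq. (3)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])     # eq. (4)
    C_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])   # eq. (5)
    C_t = f_t * C_prev + i_t * C_hat                           # eq. (6)
    h_t = o_t * np.tanh(C_t)                                   # eq. (7)
    return h_t, C_t

d, h = 4, 8                      # 4 input features, 8 hidden units
rng = np.random.default_rng(0)
W = {q: rng.normal(0, 0.1, (h, d)) for q in "ifoc"}
U = {q: rng.normal(0, 0.1, (h, h)) for q in "ifoc"}
b = {q: np.zeros(h) for q in "ifoc"}
h_t, C_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), W, U, b)
```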
LSTM models usually perform better when the data are stationary. To check whether each time series is stationary, the Augmented Dickey-Fuller test was conducted. For each series, we found that the p-value is less than 0.05; therefore, the time series are stationary and easier to generalize [42].
During the training procedure, LSTM cell weights are tuned iteratively, starting from random weights. The main idea of this process is to cycle through all sequences in the training set a certain number of times, where each cycle is called one epoch. The number of training examples utilized per iteration is called batch size. A training step or an epoch can be divided into many iterations based on the batch size.
The most common loss function is used, i.e., the mean squared error. Squaring the error has the advantage of always resulting in a positive error and penalizes the learner more when the error is bigger. Smaller values of the loss function mean that the prediction of our model is improving. The Adam optimization algorithm is used to minimize the loss. The default values from the deep learning library were used (learning rate = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1 × 10⁻⁸) [43].

Training and evaluating time series models
To validate the performance of the time series forecasting models, the data is split into train-test sets, 70%-30% for network training and test, respectively. During training, observed data is the input, and during the test, the performance is evaluated. Therefore, our train set consists of 185 of the 265 records. The input is the incidence of Dengue cases and three meteorological variables (average temperature, average atmospheric pressure, and rainfall). To describe the input associated with each city, consider a vector X_{ti} = [Icd_{ti}, Ta_{ti}, Pra_{ti}, Rw_{ti}] corresponding to a specific week ti. Then the input of the model is the concatenated vector X = [X_{t1}, X_{t2}, X_{t3}, ..., X_{t185}], which contains the information of the 185 weeks. Figure 3 shows the LSTM architecture considered in this work.
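Shaping the multivariate weekly series into supervised training samples can be sketched as follows; the look-back of 4 weeks and the random series are illustrative assumptions:

```python
import numpy as np

def make_windows(series, look_back):
    """series: (n_weeks, n_features); returns (X, y) where each X[i] is
    `look_back` consecutive weeks and y[i] is the next week's incidence."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back, 0])        # column 0 = incidence Icd
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
train = rng.random((185, 4))    # 185 weeks x [Icd, temp, pressure, rain]
X, y = make_windows(train, look_back=4)
```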

Time Series Clustering Analysis
Three main approaches to time series clustering are encountered in the literature (see, for instance, [35]): (i) distance-based, working directly with distances on raw data points; (ii) feature-based, working indirectly with features extracted from the raw data; (iii) model-based, working indirectly with models built from the raw data. The performance of clustering approaches (i) and (ii) depends greatly on the particular distance metric used, because the time series may have noise, different dynamics, different scales, etc.
The primary challenges are a) determining the number of clusters, b) defining the similarity metric, and c) when feature-based clustering is used, determining which features are the most important. This is because there is no prior knowledge with which to categorize the time series. We conducted multiple experiments evaluating the performance of the main clustering algorithms, taking into account raw-based and feature-based approaches [35], to determine which methodology is the most suitable for our case study. The elbow method is employed to choose how many clusters to consider. Additionally, the Silhouette, Calinski-Harabasz, and Dunn scores are used to analyze the quality of the clusters.
For the raw-based approach, k-means, hierarchical, and DBScan clustering are implemented. To evaluate the performance of the aforementioned clustering methods, a set of distance metrics is considered: the Euclidean distance, point-wise correlation, Spearman correlation, and dynamic time warping. For feature-based clustering, following [44], a set of seventeen features is considered: the mean, variance, first-order autocorrelation, strength of trend, strength of linearity, curvature, seasonality, strength of trough, number of peaks, spectral entropy, lumpiness, level of shift using a rolling window, variance change, flat spots using discretization, number of crossing points, Kullback-Leibler (KL) score, and the index of the maximum KL score.
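A few of these global features can be sketched in NumPy (mean, variance, lag-1 autocorrelation, and spectral entropy; the spectral-entropy expression below is a common normalized-periodogram variant and an assumption, not necessarily the exact formula of [44]):

```python
import numpy as np

def lag1_autocorr(x):
    x = x - x.mean()
    return np.sum(x[:-1] * x[1:]) / np.sum(x * x)

def spectral_entropy(x):
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2   # periodogram
    p = psd / psd.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(psd))  # normalized to [0, 1]

def features(x):
    return np.array([x.mean(), x.var(), lag1_autocorr(x), spectral_entropy(x)])

rng = np.random.default_rng(0)
series = np.sin(np.arange(265) * 2 * np.pi / 52) + rng.normal(0, 0.1, 265)
f = features(series)   # a strongly periodic series has low spectral entropy
```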
Silhouette score, Calinski-Harabasz score, and Dunn score are used to analyze the separation distance between the resulting clusters. These scores are especially useful if there is no a priori knowledge of the true label of each object, which is the most common situation in real applications. In this paper, we consider that for a pair of clusters A and B, the silhouette score s(i) is computed as [45]:

s(i) = (b(i) − a(i)) / max{a(i), b(i)},   (8)

where i ∈ A, and a(i) is the mean distance from the point i to all the other points in the cluster A.
Similarly, b(i) is the mean distance associated with the point i to all the points of the cluster B.
To evaluate the quality of each cluster encountered in terms of cohesion, for each cluster A, consider its associated a(i) as a metric between the point i and the corresponding cluster A, i.e. how well the point i is assigned to the cluster A. In this case, the smaller the value of a(i), the better the assignment.
To evaluate how well the clusters are separated from one another, we consider a cluster B such that i ∉ B, and such that the distance between i ∈ A and the set B is the smallest among all other encountered clusters. Expression (8) provides a metric between the considered clusters A and B: the smaller the value of s(i), the closer the clusters A and B.
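Expression (8) can be computed from scratch for a single point, using Euclidean distances on toy one-dimensional clusters (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def silhouette_point(i, points, labels):
    """Silhouette value s(i) of point i, as in expression (8)."""
    d = np.linalg.norm(points - points[i], axis=1)
    own = labels == labels[i]
    a = d[own & (np.arange(len(points)) != i)].mean()   # cohesion a(i)
    b = min(d[labels == l].mean()                       # separation b(i)
            for l in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

points = np.array([[0.0], [0.2], [0.4], [5.0], [5.2]])
labels = np.array([0, 0, 0, 1, 1])
s0 = silhouette_point(0, points, labels)   # close to 1: well clustered
```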
The Calinski-Harabasz score is defined as

CH = [ Σ_{k=1}^{K} n_k ||c_k − c||² / (K − 1) ] / [ Σ_{k=1}^{K} Σ_{i∈k} ||x_i − c_k||² / (N − K) ],   (9)

where n_k and c_k are the number of points and the centroid of the k-th group, respectively, c is the global centroid, N is the total number of data points, and K is the number of clusters. A higher Calinski-Harabasz index means that the clusters are dense and well separated, although there is no accepted cut-off value [46].
The Dunn score, also called the Dunn index, is defined as

D = min_{i≠j} δ(X_i, X_j) / max_k ∆(X_k),   (10)

where δ(X_i, X_j) is the intercluster distance and ∆(X_k) is the intracluster distance. Table 1 presents the Silhouette, Calinski-Harabasz, and Dunn scores used to compare the raw-based and feature-based clustering. Across all metrics, the best results are obtained most consistently with the feature-based methods using Spearman correlation. K-means and hierarchical clustering present similar clustering scores; however, based on the results of [28], we selected hierarchical clustering. The distance-based approach recognized a greater number of districts as outliers while producing seven clusters, whereas the feature-based approach allocated the data into four clusters and recognized fewer districts as outliers.
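The Dunn index defined above can be sketched as the smallest intercluster distance divided by the largest intracluster diameter; single-linkage δ and the diameter ∆ used here are common choices (an assumption), and the one-dimensional clusters are toy data:

```python
import numpy as np

def dunn_index(points, labels):
    clusters = [points[labels == l] for l in np.unique(labels)]
    def pairwise_min(A, B):       # δ: closest pair between two clusters
        return min(np.linalg.norm(a - b) for a in A for b in B)
    def diameter(A):              # ∆: farthest pair inside one cluster
        return max(np.linalg.norm(a - b) for a in A for b in A)
    delta = min(pairwise_min(A, B)
                for i, A in enumerate(clusters)
                for B in clusters[i + 1:])
    big_delta = max(diameter(A) for A in clusters)
    return delta / big_delta

points = np.array([[0.0], [0.5], [5.0], [5.5]])
labels = np.array([0, 0, 1, 1])
d = dunn_index(points, labels)   # 4.5 / 0.5 = 9.0
```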
Acknowledging the feature-based approach leads to better performance on clustering, further experiments were focused on this method following the proposal techniques from [29]. To this end, we obtain global features from a time series to summarize, describe, and group them.

Grouping by incidence to evaluate the forecasting
We divide the time series of the cities into three equal-sized groups according to their population, because less populated areas usually have fewer cases. The first group (denoted as Group 1, which corresponds to the most populated cities) is composed by taking the population from the 66th to the 100th percentile.

Forecasting in groups
This research aims to enhance a deep-learning neural network's capacity to forecast dengue cases through clustering. Grouping can be done in four different ways: (i) by taking into account each city individually; (ii) by taking into account the administrative division of the nation into departments (Paraguay has 17 departments, or states, each of which includes several neighboring cities); (iii) by combining all the series of every city; and (iv) by forming groups using the best clustering technique. Based on this, the approaches considered are: City, trained using only data from the city (217 models); Department, trained using data from the department (17 models); Cluster, trained using data from each cluster encountered as described in §3 (six models); and Country, trained using all the data (one model). Algorithm 4 illustrates the procedure of the function that produces the inputs for each model. Note that the input for Country is the same for all cities. All models are evaluated at the city level to check the generalization of the forecasts. The summarized workflow of the model selection can be seen in Figure 1.
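A hedged sketch in the spirit of the input-producing function of Algorithm 4: given a grouping scheme, select the training series for a city. The dictionaries `series`, `city_to_dept`, and `city_to_cluster` are illustrative stand-ins for the paper's data structures:

```python
def training_series(city, scheme, series, city_to_dept, city_to_cluster):
    """Return the dict of time series used to train the model for `city`."""
    if scheme == "City":
        cities = [city]
    elif scheme == "Department":
        dept = city_to_dept[city]
        cities = [c for c in series if city_to_dept[c] == dept]
    elif scheme == "Cluster":
        cl = city_to_cluster[city]
        cities = [c for c in series if city_to_cluster[c] == cl]
    else:                               # "Country": all cities
        cities = list(series)
    return {c: series[c] for c in cities}

series = {"Asunción": [0.1, 0.3], "Luque": [0.0, 0.2], "Encarnación": [0.4, 0.1]}
city_to_dept = {"Asunción": "Capital", "Luque": "Central", "Encarnación": "Itapúa"}
city_to_cluster = {"Asunción": 1, "Luque": 1, "Encarnación": 2}
cluster_data = training_series("Asunción", "Cluster", series,
                               city_to_dept, city_to_cluster)
```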
The City, Department, Cluster, and Country models were trained with data from the city, the department, the cluster to which the city belongs, and the whole country, respectively. The test set consists of the first thirty-five weeks of 2013. The metrics calculated to evaluate the forecast of each model are the RMSE, the standard deviation of percentual error, and the maximum percentual error. The groups and selected cities were described in §3.1.
To measure the performance of each model, we use the root mean squared error (RMSE), the standard deviation of percentual error (STDERR), and the maximum percentual error (MAXERR), which measure the distance between the predicted and observed values. In this paper, we consider the following definitions.
The root mean squared error is defined as

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (Y_t − Ŷ_t)² ).   (11)

The standard deviation of percentual error is defined as

STDERR = sqrt( (1/n) Σ_{t=1}^{n} (X_t − u)² ),   (12)

where u is the mean of the percentual errors and X_t is defined as

X_t = 100 · |Y_t − Ŷ_t| / Y_t.   (13)

The maximum percentual error is defined as

MAXERR = max_t X_t.   (14)

In equations (11) to (14), Y_t is the Dengue incidence observed at time t, Ŷ_t is the incidence predicted by the model for time t, and n is the size of the set. Figures 5, 6, and 7 present the metric values for the test set and each city of the groups. Figure 5 shows that the Cluster model has the best results for each city of Group 1, followed by the City and Department models. However, in Figures 6 and 7 we can see that the Department and City models perform better than the Cluster model.
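The three evaluation metrics can be sketched in NumPy; the percentual-error expression follows the common absolute-percent-error definition, an assumption where the original equations were garbled:

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def percentual_errors(y, y_hat):
    return 100.0 * np.abs(y - y_hat) / y

def stderr_metric(y, y_hat):
    x = percentual_errors(y, y_hat)
    return np.sqrt(np.mean((x - x.mean()) ** 2))

def maxerr(y, y_hat):
    return percentual_errors(y, y_hat).max()

y = np.array([0.10, 0.20, 0.40])        # observed weekly incidence (toy)
y_hat = np.array([0.12, 0.18, 0.40])    # model forecast (toy)
scores = rmse(y, y_hat), stderr_metric(y, y_hat), maxerr(y, y_hat)
```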
The Cluster model presents the lowest RMSE in most cities compared to City, Department, and Country in Group 1. Figure 8 shows that in the test case, the Cluster model performs better even for cities from Groups 2 and 3.
The typical maximum incidence Icd is around 0.33. All models perform well in cities with incidence rates close to the mean. Only in Group 2 (intermediate-incidence cities) does the Cluster model barely outperform the other models. But in cities from Group 1, such as Capiatá, where the RMSE improves by 17.2% compared to the best-performing benchmark, the Cluster model performs significantly better. The City and Department models frequently fail almost entirely in cities with incidences well outside the average, with the Cluster model performing up to 62.5% better. The Country model consistently comes in second place in the experiments; nevertheless, when it surpasses the Cluster model, the average improvement is only 8.3%, which is negligible in comparison to the other results. Peaks in cities with incidence levels below the national average are understated by the Country model; the city of Encarnación, which has the worst result for this model at 10.3% below average, is a good example. When the model is trained using all 217 cities in the database, the many cities with low incidence pull the forecasts down, which may cause this underestimation.
The City model performs very similarly to Cluster but fails in high-incidence cities. The discrepancy between the best model and the City model in Areguá (a difference of 16.5%) and San Lorenzo (a difference of 62.5%) makes this very evident. Low-incidence cities typically did not keep track of cases in the early years of the epidemic, which could be one reason for this phenomenon: it implies a lack of data for the model, preventing it from learning how outbreaks behave.
The Department model performs the worst overall. This should not come as a huge surprise because the occurrence of dengue cases is unrelated to the political-territorial arrangement.
In general, the Cluster model significantly improves the forecast, especially in cities from Group 1 (the cities with a high incidence of Dengue fever). Regarding the computational work needed to cover the entire country: Cluster requires training six models; Country requires only one, but fitted on a large amount of data, which is quite computationally expensive; and City requires 217 models, which also represents a lot of work. Hence, Cluster presents the best trade-off between computational cost and performance when the whole country needs to be analyzed.

Conclusion
A forecasting method will be a useful tool for guiding the efforts of the health surveillance system ahead of a Dengue regional outbreak, helping to prevent the possibility of its evolution into an epidemic situation. However, due to the lack of (or incomplete) datasets in several regions, forecasting methods may fail in their predictions, affecting the effectiveness of actions taken.
In this work, we propose the grouping of data to improve the performance of the models. Several models were evaluated with different grouping approaches. We tested clustering techniques to correct the lack of data for some cities. Our results indicate that an LSTM model, in combination with a hierarchical clustering algorithm, improves forecast accuracy.
With data clustering, we were able to improve the performance of the models. We were also able to decrease the number of models needed to make a national prediction at the city level, since a model fitted with a cluster can be used to make predictions for each city that belongs to that cluster. We believe this approach can be applied to wide geographic regions and to other mosquito-borne diseases. In future work, we aim to optimize the clustering of the time series by designing more experiments and by combining data augmentation techniques with clustering.