5TH International Congress on Technology - Engineering & Science - Kuala Lumpur - Malaysia (2018-02-01)

Applying Data Science Methods in Production and Geological Data Verification

Exploration and production data analysis is the foundation of oilfield development. Unfortunately, some of the data are not always correct, which makes meaningful analysis difficult. Typical problems include missing information for some time intervals and measurements that do not correspond to a physical model or are inconsistent with each other. For example, workover candidate selection strongly depends on the quality of the initial geological and production data. Decisions based on inaccurate information can lead to negative consequences; therefore, data must be verified promptly for accuracy and consistency. The volume of incoming information is growing and the requirements for data quality are rising, so manual verification is no longer feasible. Developing algorithms for automated analysis of field data has therefore become an urgent task [1].

Various analytical and statistical methods are currently used to verify and improve the quality of operational data, but these models have a number of limitations. Analytical models are subject to a number of physical constraints. Statistical models depend strongly on the input data, and the results can differ considerably depending on the selected model. Numerical models require long computation times, which prevents their use for online monitoring. To address these shortcomings, Big Data analysis methods, including statistical models and machine-learning techniques, can be applied. To assess the quality of the operational data, all available information about the reservoir, the fluid, and well performance was used.

The resulting data verification algorithm is divided into several stages. The first step is to cast all the data into a single structure and then visualize the key issues in the data. The second step is to remove anomalous (non-physical) values. Different classes of methods (Figure 1) were used to detect anomalous values. The first method builds a trend using weighted moving averages and sliding medians, calculates the difference between the trend and the value at each point, and then excludes values that deviate strongly from the trend. The second class of methods detects abnormal values by analyzing the density of observations in multidimensional space. One such method is the LOF algorithm [2], which is based on the concept of local density, where locality is defined by the k nearest neighbors and the distances to them are used to estimate density. This group of methods also includes the Isolation Forest and One-Class SVM algorithms [3].
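For illustration, the sketch below combines the two detector classes described above: a sliding-median trend filter and the density-based LOF and Isolation Forest detectors from scikit-learn. It is a minimal sketch, not the implementation used in the study; the column names, window size, and thresholds are assumptions made for the example.

```python
# Minimal sketch of the two outlier-detection approaches described above.
# Column names (liquid_rate, bhp, water_cut) and all thresholds are illustrative
# assumptions, not values taken from the study.
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def trend_outliers(series: pd.Series, window: int = 15, n_sigma: float = 3.0) -> pd.Series:
    """Flag points that deviate strongly from a sliding-median trend."""
    trend = series.rolling(window, center=True, min_periods=1).median()
    residual = series - trend
    # Robust spread estimate of the residuals (median absolute deviation);
    # 1.4826 rescales MAD to be comparable with a standard deviation.
    mad = (residual - residual.median()).abs().median()
    return residual.abs() > n_sigma * 1.4826 * mad


def density_outliers(df: pd.DataFrame, contamination: float = 0.01) -> pd.Series:
    """Flag points with anomalously low local density (LOF and Isolation Forest)."""
    X = df.dropna()
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination).fit_predict(X)
    iso = IsolationForest(contamination=contamination, random_state=0).fit_predict(X)
    # A point is marked anomalous if either detector labels it -1.
    flags = pd.Series((lof == -1) | (iso == -1), index=X.index)
    return flags.reindex(df.index, fill_value=False)


# Usage (hypothetical): wells is a DataFrame of daily well measurements.
# bad_rate = trend_outliers(wells["liquid_rate"])
# bad_multi = density_outliers(wells[["liquid_rate", "bhp", "water_cut"]])
```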
After removing the incorrect values, a data recovery procedure must be performed. Several algorithms can be used to restore missing values in a data array. The first is a modified ZET-algorithm for data cubes [4]. The ZET-algorithm is based on three assumptions. The first (the redundancy hypothesis) is that real tables are redundant, which manifests itself in the presence of similar objects (rows) and of properties (columns) that depend on each other. The second (the local compactness hypothesis) is that to predict a missing item, only the "competent" part of the table should be used rather than the whole table. The third (the linear constraint hypothesis) is that, of the possible types of constraints between columns (rows), only linear constraints are used in the ZET-algorithm.

The second approach modifies the ZET-algorithm by removing sections that contain missing values. The modification reduces the size of the cube by discarding known uninformative sections that contain a large number of missing values. The full procedure applies, in sequence, the algorithm that deletes sections with missing values and the algorithm that builds a compact subcube. The compact subcubes obtained in the previous step are used to train the algorithms for recovering missing values: linear regression, the k-nearest-neighbors method, and the Random Forest method [5, 6] (a simplified sketch of this step is given below). The experiments showed that data recovery based on the construction of compact subcubes is the most accurate: the average error for compact subcubes was 0.77%, while the average error for compact subcubes with pre-removed missing values was 1.27%.
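The sketch below illustrates the recovery step on a flattened data cube using the k-nearest-neighbors and Random-Forest-based imputers from scikit-learn. It is only a simplified analogue of the procedure described above, assuming a (well, date) by parameter table: the ZET-style construction of a compact subcube is reduced to dropping sparse columns, and all names and parameters are assumptions for the example.

```python
# Minimal sketch of the missing-value recovery step on a (well x parameter x time)
# cube flattened into a 2-D table. The imputers stand in for the recovery models
# named in the text (k-nearest neighbors, Random Forest); the selection of a
# compact, "competent" subtable is only approximated by dropping sparse columns.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def compact_subtable(table: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop uninformative columns with too many gaps (crude analogue of subcube reduction)."""
    keep = [c for c in table.columns if table[c].isna().mean() <= max_missing]
    return table[keep]


def recover_missing(table: pd.DataFrame, method: str = "rf") -> pd.DataFrame:
    """Fill the remaining gaps with a kNN or Random-Forest-based imputer."""
    if method == "knn":
        imputer = KNNImputer(n_neighbors=5)
    else:
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=100, random_state=0),
            max_iter=10,
            random_state=0,
        )
    filled = imputer.fit_transform(table)
    return pd.DataFrame(filled, index=table.index, columns=table.columns)


# Usage (hypothetical): cube has one row per (well, date) and one column per parameter.
# restored = recover_missing(compact_subtable(cube), method="knn")
```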
All methods were tested on synthetic and operational data from a set of oilfields. The developed software package is universal with respect to the data sources and fields used, provided that the input data are brought into a uniform format. The algorithms support control parameters for fine-tuning to the field under study. Further development includes integrating the developed software into corporate information systems and combining these algorithms with physical models. Such an implementation can improve the quality of data analysis and forecasts.

Maksim Simonov, Dmitriy Perets, Alla Andrianova