
Authors

Shurykin Alexander S.

Degree
Undergraduate student, Department of Information Technology in Economics and Management, Branch of the National Research University MPEI in Smolensk
E-mail
shurykins@mail.ru
Location
Smolensk, Russia
Articles

Solving the problem of calendar data preprocessing during the implementation of Data Mining technology

Dirty, that is, low-quality, data has become one of the main obstacles to solving Data Mining tasks effectively. Because source data is accumulated from many different sources, the probability of obtaining dirty data is very high. One of the most important tasks in any Data Mining process is therefore the initial processing (cleaning) of the data, i.e. preprocessing. Preprocessing calendar data is a rather time-consuming procedure that can take up to half of the total time spent implementing the Data Mining technology. This time can be reduced by automating the cleaning process with specially designed tools (algorithms and programs). Such tools do not guarantee complete cleaning of dirty data, however, and in some cases may even introduce additional errors into the source data. The authors developed a model for automated preprocessing of calendar data based on parsing and regular expressions. The proposed algorithm offers flexible configuration of preprocessing parameters, is fairly simple to implement, and produces highly interpretable results, which in turn provides additional opportunities for analyzing unsuccessful applications of Data Mining technology. Although the proposed algorithm cannot clean every type of dirty calendar data, it works successfully in a significant share of real practical situations.
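
As an illustration of what regular-expression-based cleaning of calendar data can look like, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not specify. The pattern table `PATTERNS`, the functions `clean_date` and `preprocess`, and the day-first reading of slash-separated dates are all assumptions made for this example. The configurable pattern list loosely mirrors the idea of flexible preprocessing parameters, and the list of rejected values keeps the outcome easy to interpret.

```python
import re
from datetime import date

# Hypothetical, configurable pattern table: each entry pairs a regular
# expression with the order in which its groups appear (year, month, day).
PATTERNS = [
    (re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$"), ("y", "m", "d")),    # 2021-03-07
    (re.compile(r"^(\d{1,2})\.(\d{1,2})\.(\d{4})$"), ("d", "m", "y")),  # 07.03.2021
    (re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$"), ("d", "m", "y")),    # 07/03/2021 (day-first is an assumption)
]


def clean_date(raw):
    """Return a datetime.date for a recognised string, or None otherwise."""
    text = raw.strip()
    for pattern, order in PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        parts = dict(zip(order, map(int, match.groups())))
        try:
            return date(parts["y"], parts["m"], parts["d"])
        except ValueError:
            # The shape matched, but the date itself is impossible (e.g. 31.02.2021).
            return None
    return None


def preprocess(values):
    """Split raw strings into normalised ISO dates and rejected originals."""
    cleaned, rejected = [], []
    for value in values:
        result = clean_date(value)
        if result is None:
            rejected.append(value)
        else:
            cleaned.append(result.isoformat())
    return cleaned, rejected


cleaned, rejected = preprocess([" 2021-03-07", "07.03.2021", "31.02.2021", "March 7"])
```

In this sketch, unrecognised or impossible values are returned unchanged in `rejected` rather than silently corrected, so a cleaning step of this kind does not add new errors to the source data and its failures remain available for later analysis.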