Articles

№ 6(90), 28 December 2020
Rubric: Models and methods
Authors: Okunev B., Shurykin A.

Dirty data, that is, low-quality data, is currently becoming one of the main obstacles to solving Data Mining tasks effectively. Since source data is accumulated from a variety of sources, the probability of obtaining dirty data is very high. Therefore, one of the most important tasks to be solved in the course of the Data Mining process is the initial processing (cleaning) of data, i.e. preprocessing. It should be noted that preprocessing calendar data is a rather time-consuming procedure that can take up to half of the entire time of implementing Data Mining technology. The time spent on data cleaning can be reduced by automating the process with specially designed tools (algorithms and programs). At the same time, it should be remembered that such tools do not guarantee one hundred percent cleaning of "dirty" data and in some cases may even introduce additional errors into the source data. The authors developed a model for automated preprocessing of calendar data based on parsing and regular expressions. The proposed algorithm is characterized by flexible configuration of preprocessing parameters, fairly simple implementation and high interpretability of results, which in turn provides additional opportunities for analyzing unsuccessful applications of Data Mining technology. Although the proposed algorithm cannot clean every type of dirty calendar data, it functions successfully in a significant share of real practical situations.
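
As a minimal sketch of the general idea (not the authors' exact model), the fragment below normalizes heterogeneous date strings with regular expressions. The patterns, the target ISO format and the clean_date helper are assumptions made for the example; records that no pattern can parse are returned as None rather than guessed, which keeps failures easy to inspect, in line with the interpretability point above.

    import re
    from datetime import datetime

    # Illustrative patterns for common "dirty" calendar formats (assumed, not exhaustive).
    PATTERNS = [
        (re.compile(r"^(\d{1,2})[./-](\d{1,2})[./-](\d{4})$"), "%d %m %Y"),  # 28.12.2020, 28/12/2020
        (re.compile(r"^(\d{4})[./-](\d{1,2})[./-](\d{1,2})$"), "%Y %m %d"),  # 2020-12-28
    ]

    def clean_date(raw):
        """Try each pattern; return an ISO date, or None if the value stays 'dirty'."""
        token = raw.strip()
        for pattern, fmt in PATTERNS:
            match = pattern.match(token)
            if match:
                try:
                    return datetime.strptime(" ".join(match.groups()), fmt).date().isoformat()
                except ValueError:  # e.g. 31.02.2020: syntactically valid, semantically dirty
                    return None
        return None

    for raw in ["28.12.2020", "2020-12-28", "31/02/2020", "december"]:
        print(repr(raw), "->", clean_date(raw))
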
№ 6(90), 28 December 2020
Rubric: Algorithmic efficiency
Author: Kuznetsova A. A.

Average precision (AP), the area under the Precision – Recall curve, is the de facto standard for comparing the quality of algorithms for classification, information retrieval, object detection, etc. However, traditional Precision – Recall curves usually have a zigzag shape, which makes it difficult to calculate average precision and to compare algorithms. This paper proposes a statistical approach to constructing Precision – Recall curves when assessing the quality of object detection algorithms in images. The approach is based on calculating Statistical Precision and Statistical Recall. Instead of the traditional confidence level, a statistical confidence level is calculated for each image as the percentage of objects detected. For each threshold value of the statistical confidence level, the total number of correctly detected objects (Integral TP) and the total number of background objects mistakenly assigned by the algorithm to one of the classes (Integral FP) are calculated over all images, and the values of Precision and Recall are then computed. Statistical Precision – Recall curves, unlike traditional ones, are guaranteed to be monotonically non-increasing. At the same time, the Statistical Average Precision of object detection algorithms on small test datasets turns out to be lower than the traditional Average Precision; on relatively large test image datasets these differences are smoothed out. The use of conventional and statistical Precision – Recall curves is compared on a specific example.
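
Under assumptions about the per-image bookkeeping, a statistical curve of this kind might be assembled roughly as follows; the toy counts, the simple threshold sweep and the variable names are invented for illustration and do not reproduce the paper's exact procedure.

    import numpy as np

    # Per-image bookkeeping (invented toy data): ground-truth objects,
    # correctly detected objects (TP) and background false alarms (FP).
    gt = np.array([10, 8, 12, 5])
    tp = np.array([ 9, 4, 12, 2])
    fp = np.array([ 1, 3,  0, 2])

    # Statistical confidence level of each image: share of its objects detected.
    conf = tp / gt

    precision, recall = [], []
    for thr in np.sort(np.unique(conf))[::-1]:        # sweep thresholds, high to low
        keep = conf >= thr                            # images meeting the threshold
        itp, ifp = int(tp[keep].sum()), int(fp[keep].sum())  # Integral TP / Integral FP
        precision.append(itp / (itp + ifp))
        recall.append(itp / int(gt.sum()))

    # Statistical AP as the area under the (monotone) Precision - Recall curve.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += (r - prev_r) * p
        prev_r = r
    print("Statistical AP ~", round(ap, 3))
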
№ 6(90), 28 December 2020
Rubric: Algorithmic efficiency
Authors: Fedorova E., Afanasyev D., Demin I., Lazarev A., Nersesyan R., Pyltsin I. V.

The main goal of the research is to develop a publicly available tonal-thematic dictionary for Russian that makes it possible to identify the semantic orientation of groups of economic texts and to determine their sentiment (tonal) characteristics. The article describes the main stages of compiling the dictionary using machine learning methods (clustering, word-frequency analysis, correlogram construction) together with expert evaluation of tonality, as well as its expansion with terms from similar foreign dictionaries. The empirical base of the research included annual reports of companies, news from ministries and the Central Bank of the Russian Federation, financial tweets of companies, and RBC news articles in the area of "Economics, Finance, Money and Business". The compiled dictionary differs from its predecessors in the following ways: 1) it is one of the first dictionaries that can be used to rate the tone of economic and financial texts in Russian on five degrees of tonality; 2) it rates the tonality and content of a text across 12 economic topics (e.g., macroeconomics, monetary policy, stock and commodity markets, etc.); 3) the final version of the EcSentiThemeLex dictionary is included in the 'rulexicon' software package (library) for the R and Python programming environments. Step-by-step examples of using the developed library in the R environment are given; it allows the tone and thematic focus of an economic or financial text to be evaluated with concise code. The structure of the library allows original texts to be assessed without prior lemmatization (reduction to elementary forms). The tonal-thematic dictionary EcSentiThemeLex with all word forms compiled in this work will simplify the solution of applied problems of text analysis in the financial and economic sphere and can also potentially serve as a basis for increasing the number of relevant studies in the Russian literature.
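
The fragment below is a toy illustration of how a tonal-thematic lookup works in principle; the entries, topics and scores are invented for the example, and this is neither the published dictionary nor the rulexicon API.

    # Illustrative only: a toy tonal-thematic lookup in the spirit of a
    # 5-degree (-2..+2) dictionary; entries and topics are invented.
    LEXICON = {
        "рост":     {"tone": +2, "topics": {"macroeconomics"}},
        "прибыль":  {"tone": +1, "topics": {"stock market"}},
        "инфляция": {"tone": -1, "topics": {"monetary policy"}},
        "дефолт":   {"tone": -2, "topics": {"macroeconomics", "stock market"}},
    }

    def score(tokens):
        """Average tone on the -2..+2 scale and the set of topics touched."""
        hits = [LEXICON[t] for t in tokens if t in LEXICON]
        if not hits:
            return 0.0, set()
        tone = sum(h["tone"] for h in hits) / len(hits)
        topics = set().union(*(h["topics"] for h in hits))
        return tone, topics

    print(score(["рост", "инфляция", "ставка"]))  # -> (0.5, {'macroeconomics', 'monetary policy'})
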
№ 6(90), 28 December 2020
Rubric: Processes and systems modeling
Author: Veselov A.

Simulation models are of great importance in designing modern computer equipment and digital electronics. At first, monolithic models were widely used for this purpose; however, they worked well only while their size remained relatively small. For this reason, developers gradually abandoned monolithic models in favor of distributed models, which increased simulation speed and expanded the admissible model sizes. Particular attention is now paid to hierarchical distributed models, which make it possible to investigate the behavior of the devices being designed at different levels of detail. Such models noticeably expanded the permissible model sizes and increased simulation speed. However, these distributed models have a drawback: their effectiveness depends noticeably not only on the number of components they include but also on the size of those components. The paper presents the results of a study of the effect of introducing an additional upper hierarchical level on the performance of distributed models based on Petri nets. This modification increases model speed over a wide range of model sizes, with the most significant effect achieved in distributed models containing a large number of small components. The maximum speed of models modified in this way can be an order of magnitude higher than that of unmodified ones. In addition to the overall increase in efficiency of the modified hierarchical distributed models, this also significantly equalized the performance of modified distributed models with subordinate components of different sizes.
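
For readers unfamiliar with the underlying formalism, here is a minimal Petri net fragment (places, transitions, token firing) of the kind such component models are built from; the net and the naive scheduling policy are invented for the example and do not reproduce the author's hierarchical scheme.

    # A tiny Petri net: marking = tokens per place; a transition fires when
    # every input place holds a token, consuming and producing tokens.
    places = {"p1": 2, "p2": 0, "p3": 1}
    transitions = {
        "t1": {"in": ["p1"], "out": ["p2"]},
        "t2": {"in": ["p2", "p3"], "out": ["p1"]},
    }

    def enabled(t):
        return all(places[p] > 0 for p in transitions[t]["in"])

    def fire(t):
        for p in transitions[t]["in"]:
            places[p] -= 1
        for p in transitions[t]["out"]:
            places[p] += 1

    step = 0
    while step < 5 and any(enabled(t) for t in transitions):
        t = next(t for t in transitions if enabled(t))  # naive scheduling policy
        fire(t)
        step += 1
    print(places)
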
№ 6(90), 28 December 2020
Rubric: Research of processes and systems
Authors: Kalabikhina I., Abduselimova I., Arkhangelsky V., Banin E., Klimenko G., Kolotusha A., Nikolaeva U., Shamsutdinova V.

Demographic indicators are important target indicators of state programs for the development of Russia, and operational monitoring of demographic development is key to the successful implementation of these programs. Government statistics are often published with a delay, which prevents their use for operational monitoring and planning. The approach proposed in this work allows rapid assessment of demographic processes and short-term forecasting of demographic trends based on query statistics from Google Trends. The relationships between search queries and demographic indicators are analyzed using Pearson's correlation. The analysis uses annual data (total fertility rate, abortions per 100 births, abortions per 1000 women, marriages and divorces per 1000 population) and monthly data (numbers of births, marriages and divorces) on births, marriages and abortions, with and without lags. The analysis is carried out for Russia as a whole and for the eight most populated regions: Moscow, Moscow Region, Krasnodar Territory, St. Petersburg, Rostov Region, Sverdlovsk Region, Republic of Tatarstan and Republic of Bashkortostan. Using the temporal metrics available in Google Trends since 2004, some demographic indicators can be predicted from related Google search queries using the ARIMA model. Thus, query data can supplement demographic data when building multiple regression models for demographic calculations, or serve as a proxy variable.
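
A sketch of the kind of lagged correlation analysis described above, with invented numbers standing in for Google Trends and official statistics; a natural next step, as the abstract notes, would be an ARIMA model with the query series as a predictor.

    import numpy as np

    # Toy monthly series (invented): a query popularity index and a
    # demographic indicator; real inputs would come from Google Trends
    # and official statistics.
    query  = np.array([55, 60, 58, 70, 75, 72, 80, 78, 85, 90, 88, 95], float)
    births = np.array([40, 42, 41, 45, 50, 49, 52, 55, 54, 60, 58, 63], float)

    def lagged_pearson(x, y, lag):
        """Pearson correlation of x(t) with y(t + lag)."""
        if lag > 0:
            x, y = x[:-lag], y[lag:]
        return float(np.corrcoef(x, y)[0, 1])

    for lag in range(4):
        print(f"lag {lag}: r = {lagged_pearson(query, births, lag):.3f}")
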
№ 6(90), 28 December 2020
Rubric: Research of processes and systems
Authors: Marenko V., Lozhnikov V.

The goal of this work is to describe a new method for studying objects in the form of a set of information tasks. The method includes a simplicial analysis of the cognitive structure of the object of study and consists of several stages. At the first stage, the set of basic factors is identified, the factors are compared pairwise, and a cognitive model is formed as an adjacency matrix at the first level of the hierarchy. The factors are then grouped to form the second level of the hierarchy, components are combined in the cognitive structure at the third level, and the components of the third level are detailed at the fourth level. A series of simulation experiments is conducted to test the stability of the detailed structure of the cognitive model, and the implicit relationships between the basic factors are studied. The method was tested on the cognitive model of students' "lifestyle". The components "living conditions", "cognitive dissonance" and "performance" were grouped at the second level of the hierarchy. A simulation experiment established the presence of pulse resonance in the detailed structure at the fourth level of the hierarchy. A simplicial analysis, whose purpose is the ordering of the elements, was then performed, and the simulation experiment was repeated after it; the result now corresponds to the theory. The influence of an individual's cognitive dissonance on "activity" was revealed; the "activity" factor, in turn, affects cognitive dissonance, among other things. This method is needed to identify significant factors, detect hidden trends and implement measures of social control.
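
One common way to run such pulse (impulse) experiments on a cognitive map is the standard pulse process on a weighted adjacency matrix; the sketch below uses invented factors and weights and may differ from the authors' exact scheme.

    import numpy as np

    # Toy cognitive map (invented weights): A[i, j] is the signed influence
    # of factor i on factor j.
    factors = ["living conditions", "cognitive dissonance", "performance", "activity"]
    A = np.array([
        [ 0.0, -0.4,  0.5,  0.3],
        [ 0.0,  0.0, -0.6, -0.2],
        [ 0.2,  0.0,  0.0,  0.4],
        [ 0.0,  0.3,  0.0,  0.0],
    ])

    # Pulse process: an initial impulse spreads along weighted arcs step by step.
    p = np.array([1.0, 0.0, 0.0, 0.0])  # unit impulse into "living conditions"
    x = np.zeros(len(factors))          # accumulated factor values
    for step in range(5):
        p = A.T @ p                     # propagate the impulse one step
        x = x + p                       # accumulate its effect
        print(step + 1, np.round(x, 3))
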