Cluster analysis: what it is and where it is used

Good day. I respect people who are passionate about their work.

My friend Maxim belongs to this category. He constantly works with numbers, analyzes them, and prepares the corresponding reports.

Yesterday we had lunch together, and for almost half an hour he told me about cluster analysis - what it is and in which cases its use is justified and appropriate. And what about me?

I have a good memory, so I will pass all of this information (some of which, by the way, I already knew) on to you in its original and most informative form.

Cluster analysis is designed to divide a set of objects into homogeneous groups (clusters or classes). This is a multidimensional data classification problem.

There are about 100 different clustering algorithms, but the most commonly used are hierarchical cluster analysis and k-means clustering.

Where is cluster analysis used? In marketing, this is the segmentation of competitors and consumers.

In management: dividing personnel into groups of different levels of motivation, classifying suppliers, identifying similar production situations in which defects occur.

In medicine - classification of symptoms, patients, drugs. In sociology, the division of respondents into homogeneous groups. In fact, cluster analysis has proven itself well in all spheres of human life.

The beauty of this method is that it works even when there is little data and the requirements for normality of the distributions of the random variables, and other requirements of classical statistical methods, are not met.

Let us explain the essence of cluster analysis without resorting to strict terminology:
Let's say you conducted a survey of employees and want to determine how to most effectively manage personnel.

That is, you want to divide employees into groups and highlight the most effective management levers for each of them. At the same time, differences between groups should be obvious, and within the group respondents should be as similar as possible.

To solve the problem, it is proposed to use hierarchical cluster analysis.

As a result, we will get a tree; looking at it, we must decide how many classes (clusters) we want to divide the personnel into.

Suppose that we decide to divide the staff into three groups, then to study the respondents who fall into each cluster we will get a table with approximately the following content:


Let us explain how the above table is formed. The first column contains the number of the cluster - the group, the data for which is reflected in the line.

For example, the first cluster is 80% men. 90% of the first cluster fall into the age category from 30 to 50 years, and 12% of respondents believe that benefits are very important. And so on.

Let's try to create portraits of respondents in each cluster:

  1. The first group consists mostly of men of mature age holding leadership positions. They are not interested in the social package (MED, LGOTI, TIME - free time). They prefer to receive a good salary rather than help from the employer.
  2. Group two, on the contrary, gives preference to the social package. It consists mainly of older people occupying lower positions. Salary is certainly important to them, but there are other priorities.
  3. The third group is the “youngest”. Unlike the previous two, it shows an obvious interest in opportunities for learning and professional development. This category of employees has a good chance of soon joining the first group.

Thus, when planning a campaign to introduce effective personnel management methods, it is obvious that in our situation it makes sense to increase the social package for the second group at the expense, for example, of wages.

If we talk about which specialists should be sent for training, we can definitely recommend paying attention to the third group.

Source: http://www.nickart.spb.ru/analysis/cluster.php

Features of cluster analysis

A cluster is a price of an asset during a certain period of time in which transactions were made. The resulting volume of purchases and sales is indicated by a number inside the cluster.

A bar of any timeframe usually contains several clusters. This allows you to see in detail the volumes of purchases, sales and their balance in each individual bar, at each price level.


A change in the price of one asset inevitably entails a chain of price movements in other instruments.

Attention!

In most cases, a trend movement is recognized only when it is already rapidly developing, and entering the market along the trend at that point risks ending up in a corrective wave.

For successful transactions, you need to understand the current situation and be able to anticipate future price movements. This can be learned by analyzing the cluster graph.

Using cluster analysis, you can see the activity of market participants within even the smallest price bar. This is the most accurate and detailed analysis, as it shows the point distribution of transaction volumes at each price level of the asset.

There is a constant conflict between the interests of sellers and buyers in the market. And every smallest price movement (tick) is a move towards a compromise - a price level - that suits both parties at the given moment.

But the market is dynamic, the number of sellers and buyers is constantly changing. If at one point in time the market was dominated by sellers, then at the next moment there will most likely be buyers.

The number of transactions completed at adjacent price levels is also not the same. And yet, first the market situation is reflected in the total volume of transactions, and only then in the price.

If you see the actions of the dominant market participants (sellers or buyers), then you can predict the price movement itself.

To successfully apply cluster analysis, you first need to understand what a cluster and delta are.


A cluster is a price movement that is divided into levels at which transactions with known volumes were made. Delta shows the difference between the purchases and sales occurring in each cluster.

Each cluster, or group of deltas, allows you to understand whether buyers or sellers dominate the market at a given time.

It is enough to calculate the total delta by summing up sales and purchases. If the delta is negative, the market is oversold and sell transactions are in excess. When the delta is positive, buyers clearly dominate the market.

The delta itself can take a normal or critical value. The delta volume value above normal in the cluster is highlighted in red.

If the delta is moderate, then this characterizes a flat state in the market. With a normal delta value, a trend movement is observed in the market, but a critical value is always a harbinger of a price reversal.
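To make this arithmetic concrete, here is a minimal Python sketch of computing per-level volumes and the delta; the trade records and their format are invented for illustration and do not correspond to any particular trading platform's API.

```python
# Made-up trades: (price level, volume, aggressor side).
# "buy" marks volume bought at the ask, "sell" volume sold at the bid.
trades = [
    (100.0, 5, "buy"), (100.0, 3, "sell"),
    (100.5, 7, "buy"), (100.5, 9, "sell"),
    (101.0, 2, "buy"), (101.0, 2, "sell"),
]

clusters = {}  # price level -> {"buy": total buys, "sell": total sells}
for price, volume, side in trades:
    level = clusters.setdefault(price, {"buy": 0, "sell": 0})
    level[side] += volume

for price in sorted(clusters):
    level = clusters[price]
    delta = level["buy"] - level["sell"]  # positive: buyers dominate this level
    print(price, level, "delta =", delta)
```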

Forex trading using cluster analysis

To obtain maximum profit, you need to be able to determine the transition of the delta from a moderate level to a normal one. Indeed, in this case, you can notice the very beginning of the transition from flat to trend movement and be able to get the greatest profit.

A cluster chart is more visual; you can see significant levels of accumulation and distribution of volumes, and build support and resistance levels. This allows the trader to find the exact entry into the trade.

Using the delta, you can judge the predominance of sales or purchases in the market. Cluster analysis allows you to observe transactions and track their volumes inside a bar of any timeframe.

This is especially important when approaching significant support or resistance levels. Cluster analysis is the key to understanding the market.

Source: http://orderflowtrading.ru/analitika-rynka/obemy/klasternyy-analiz/

Areas and features of application of cluster analysis

The term cluster analysis (first coined by Tryon, 1939) actually includes a set of different classification algorithms.

A general question asked by researchers in many fields is how to organize observed data into visual structures, i.e., how to develop taxonomies.

For example, according to the modern system of biology, humans belong to the primates, mammals, amniotes, vertebrates, and animals.

Note that in this classification, the higher the level of aggregation, the less similarity there is between members in the corresponding class.

Humans bear more similarity to other primates (e.g., apes) than to “distant” members of the class of mammals (e.g., dogs), and so on.

Note that the previous discussion refers to clustering algorithms, but does not mention anything about statistical significance testing.

In fact, cluster analysis is not so much an ordinary statistical method as a “set” of various algorithms for “distributing objects into clusters.”

There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about the classes, but are still in the descriptive stage of the study.

Attention!

It should be understood that cluster analysis determines the most meaningful possible solution.

Therefore, statistical significance testing is not really applicable here, even in cases where p-levels are known (as in the K-means method).

Clustering techniques are used in a wide variety of fields. Hartigan (1975) gave an excellent review of many published studies containing results obtained using cluster analysis methods.

For example, in the field of medicine, clustering of diseases, treatments for diseases, or symptoms of diseases leads to widely used taxonomies.

In the field of psychiatry, correct diagnosis of symptom clusters such as paranoia, schizophrenia, etc. is crucial for successful therapy. In archeology, using cluster analysis, researchers try to establish taxonomies of stone tools, funeral objects, etc.

There are widespread applications of cluster analysis in marketing research. In general, whenever it is necessary to classify “mountains” of information into groups suitable for further processing, cluster analysis turns out to be very useful and effective.

Tree Clustering

The example given in the Main Purpose section explains the purpose of the tree clustering algorithm.

The purpose of this algorithm is to group objects (such as animals) into large enough clusters using some measure of similarity or distance between objects. The typical result of such clustering is a hierarchical tree.

Consider a horizontal tree diagram. It begins with each object as a separate class (on the left side of the diagram).

Now imagine that gradually (in very small steps) you “relax” your criterion about which objects are unique and which are not.

In other words, you lower the threshold related to the decision to combine two or more objects into one cluster.

As a result, you link more and more objects together and aggregate (combine) more and more clusters consisting of increasingly different elements.

Finally, in the last step, all objects are combined together. In these diagrams, the horizontal axes represent the join distance (in vertical tree diagrams, the vertical axes represent the join distance).

So, for each node in the graph (where a new cluster is formed), you can see the distance value for which the corresponding elements are associated into a new single cluster.

When data has a clear "structure" in terms of clusters of objects that are similar to each other, then this structure is likely to be reflected in the hierarchical tree by different branches.

As a result of successful analysis using the merging method, it becomes possible to detect clusters (branches) and interpret them.

Joining, or tree clustering, is used to form clusters on the basis of dissimilarity or distance between objects. These distances can be defined in one-dimensional or multidimensional space.

For example, if you were to cluster types of food in a cafe, you might take into account the number of calories they contain, their price, a subjective taste rating, etc.

The most direct way to calculate distances between objects in multidimensional space is to calculate Euclidean distances.

If you have a two- or three-dimensional space, then this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure).

However, the joining algorithm does not "care" whether the distances "provided" to it are real geometric distances or some other derived distance measure that is more meaningful to the researcher; the challenge for researchers is to select the right measure for their specific application.

Euclidean distance. This appears to be the most common type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

$distance(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

Note that the Euclidean distance (and its square) is calculated from the original data, not the standardized data.

This is a common way to calculate it, which has certain advantages (for example, the distance between two objects does not change when a new object is introduced into the analysis, which may turn out to be an outlier).

Attention!

However, distances can be greatly influenced by differences between the axes from which the distances are calculated. For example, if one of the axes is measured in centimeters, and you then convert it to millimeters (multiplying the values by 10), then the final Euclidean distance (or squared Euclidean distance) calculated from the coordinates will change greatly, and as a result, the results of the cluster analysis may differ greatly from previous ones.

Squared Euclidean distance. Sometimes you may want to square the standard Euclidean distance to give more weight to objects that are farther apart.

This distance is calculated as follows:

$distance(x, y) = \sum_i (x_i - y_i)^2$

City block distance (Manhattan distance). This distance is simply the sum of the absolute differences over the coordinates.

In most cases, this distance measure produces the same results as the ordinary Euclidean distance.

However, note that for this measure the influence of individual large differences (outliers) is reduced (since they are not squared). The Manhattan distance is calculated using the formula:

$distance(x, y) = \sum_i |x_i - y_i|$

Chebyshev distance. This distance can be useful when one wants to define two objects as "different" if they differ in any one coordinate (in any one dimension). The Chebyshev distance is calculated using the formula:

$distance(x, y) = \max_i |x_i - y_i|$

Power distance. Sometimes one wishes to progressively increase or decrease weights related to a dimension for which the corresponding objects are very different.

This can be achieved using the power distance. The power distance is calculated using the formula:

$distance(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/r}$

where r and p are user-defined parameters. A few examples of calculations can show how this measure “works”.

The p parameter is responsible for the gradual weighting of differences along individual coordinates, the r parameter is responsible for the progressive weighting of large distances between objects. If both parameters r and p are equal to two, then this distance coincides with the Euclidean distance.

Percentage of disagreement. This measure is used when the data are categorical. The distance is calculated by the formula:

$distance(x, y) = \frac{\#\{i : x_i \neq y_i\}}{n}$

that is, the proportion of the $n$ coordinates on which the two objects disagree.
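As an illustration, here is a minimal Python sketch of the distance measures above; it assumes numeric feature vectors and uses scipy.spatial.distance where a ready-made function exists, computing the power distance and the percentage of disagreement by hand.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))    # Euclidean distance
print(distance.sqeuclidean(x, y))  # squared Euclidean distance
print(distance.cityblock(x, y))    # city block (Manhattan) distance
print(distance.chebyshev(x, y))    # Chebyshev distance

p, r = 3, 2  # user-defined parameters of the power distance
print(np.sum(np.abs(x - y) ** p) ** (1 / r))

# Percentage of disagreement for categorical data:
a = np.array(["red", "green", "blue"])
b = np.array(["red", "blue", "blue"])
print(np.mean(a != b))  # share of coordinates where the objects differ
```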

Amalgamation or linkage rules

At the first step, when each object is a separate cluster, the distances between these objects are determined by the selected measure.

However, when several objects are linked together, the question arises, how should the distances between clusters be determined?

In other words, an amalgamation or linkage rule is needed for two clusters. There are various possibilities here: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding linkage distance.

In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single link method.

This rule builds “fibrous” clusters, i.e. clusters “linked together” only by individual elements that happen to be closest to each other.

Alternatively, you can use the neighbors in the two clusters that are farthest away from each other among all pairs of objects. This method is called the full link method.

There are also many other methods for combining clusters similar to those discussed.

Single link (nearest neighbor method). As described above, in this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

This rule must, in a sense, string objects together to form clusters, and the resulting clusters tend to be represented by long "chains".

Full link (most distant neighbors method). In this method, distances between clusters are determined by the largest distance between any two objects in different clusters (i.e., "most distant neighbors").

Unweighted pairwise average. In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

The method is effective when objects actually form distinct “clumps,” but it works equally well in cases of extended (“chain”-type) clusters.

Note that in their book, Sneath and Sokal (1973) introduce the abbreviation UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

Weighted pairwise average. The method is identical to the unweighted pairwise average method, except that the size of the corresponding clusters (that is, the number of objects they contain) is used as a weighting factor in the calculations.

Therefore, the proposed method should be used (rather than the previous one) when unequal cluster sizes are assumed.

The book by Sneath and Sokal (1973) introduces the acronym WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

Unweighted centroid method. In this method, the distance between two clusters is defined as the distance between their centers of gravity.

Attention!

Sneath and Sokal (1973) use the acronym UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

Weighted centroid method (median). This method is identical to the previous one, except that the calculations use weights to take into account the difference between the sizes of clusters (i.e., the number of objects in them).

Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

Sneath and Sokal (1973) used the abbreviation WPGMC to refer to it as the weighted pair-group method using the centroid average.

Ward's method. This method is different from all other methods because it uses analysis of variance techniques to estimate the distances between clusters.

The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step.

Details can be found in Ward (1963). Overall, the method seems to be very effective, but it tends to create small clusters.
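For readers who want to experiment, here is a minimal sketch of tree clustering with the linkage rules discussed above, using scipy.cluster.hierarchy on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 objects described by 3 features

# scipy's names for the rules discussed above, from single linkage to Ward.
for method in ["single", "complete", "average", "weighted",
               "centroid", "median", "ward"]:
    Z = linkage(X, method=method)                    # merge history (the tree)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(f"{method:>8}: {labels}")
```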

This method was previously discussed in terms of "objects" that need to be clustered. In all other types of analysis, the question of interest to the researcher is usually expressed in terms of observations or variables.

It turns out that clustering, both by observations and by variables, can lead to quite interesting results.

For example, imagine that a medical researcher is collecting data on various characteristics (variables) of patients' conditions (cases) suffering from heart disease.

A researcher may want to cluster observations (patients) to identify clusters of patients with similar symptoms.

At the same time, the researcher may want to cluster variables to identify clusters of variables that are associated with similar physical conditions.

After this discussion regarding whether to cluster observations or variables, one might ask, why not cluster in both directions?

The Cluster Analysis module contains an efficient two-way join routine that allows you to do just that.

However, two-way joining is used relatively rarely, in circumstances where both observations and variables are expected to contribute simultaneously to the discovery of meaningful clusters.

Thus, returning to the previous example, we can assume that a medical researcher needs to identify clusters of patients who are similar in relation to certain clusters of physical condition characteristics.

The difficulty in interpreting the results obtained arises from the fact that similarities between different clusters may arise from (or be the cause of) some differences in subsets of variables.

Therefore, the resulting clusters are heterogeneous in nature. This may seem a little vague at first; indeed, compared to the other cluster analysis methods described, two-way joining is probably the least commonly used method.

However, some researchers believe that it offers a powerful means of exploratory data analysis (for more detailed information, you may want to refer to Hartigan's description of this method (Hartigan, 1975)).

K-means method

This clustering method differs significantly from such agglomerative methods as Union (tree clustering) and Two-way union. Let's assume you already have hypotheses about the number of clusters (based on observations or variables).

You can tell the system to form exactly three clusters so that they are as distinct as possible.

This is exactly the type of problem that the K-means algorithm solves. In general, the K-means method builds exactly K different clusters located at the greatest possible distances from each other.

In the example of a physical condition, a medical researcher might have a "suspicion" from his clinical experience that his patients mainly fall into three different categories.

Attention!

If this is the case, then the averages of the various measures of physical parameters for each cluster will provide a quantitative way of representing the researcher's hypotheses (eg, patients in cluster 1 have a high parameter 1, a low parameter 2, etc.).

From a computational point of view, you can think of this method as an analysis of variance in reverse. The program starts with K randomly selected clusters and then changes the objects' membership in them so that:

  1. minimize variability within clusters,
  2. maximize variability between clusters.

This method is analogous to ANOVA “in reverse” in the sense that the significance test in ANOVA compares between-group and within-group variability when testing the hypothesis that group means differ from each other.

In K-means clustering, the program moves objects (i.e., observations) from one group (cluster) to another in order to obtain the most significant result when conducting an analysis of variance (ANOVA).

Typically, once the results of a K-means cluster analysis are obtained, the means for each cluster along each dimension can be calculated to assess how different the clusters are from each other.

Ideally, you should obtain widely varying means for most, if not all, of the measurements used in the analysis.
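A minimal sketch of this procedure in scikit-learn, on synthetic data and with an assumed K = 3, might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # 100 observations, 4 measurements (synthetic)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])      # cluster membership of the first 10 observations
print(km.cluster_centers_)  # per-cluster means along each dimension
```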

Source: http://www.biometrica.tomsk.ru/textbook/modules/stcluan.html

Classification of objects according to their characteristics

Cluster analysis is a set of multidimensional statistical methods for classifying objects according to the characteristics that characterize them, dividing a set of objects into homogeneous groups that are similar in defining criteria, and identifying objects of a certain group.

A cluster is a group of objects identified as a result of cluster analysis based on a given measure of similarity or differences between objects.

An object is a specific unit of research that needs to be classified. The objects of classification are, as a rule, observations: for example, consumers of products, countries or regions, products, etc.

Although it is possible to conduct cluster analysis by variables. Classification of objects in multidimensional cluster analysis occurs according to several criteria simultaneously.

These can be either quantitative or categorical variables, depending on the cluster analysis method. Thus, the main goal of cluster analysis is to find groups of similar objects in a sample.

The set of multivariate statistical methods of cluster analysis can be divided into hierarchical methods (agglomerative and divisive) and non-hierarchical (k-means method, two-stage cluster analysis).

However, there is no generally accepted classification of methods, and cluster analysis methods sometimes also include methods for constructing decision trees, neural networks, discriminant analysis, and logistic regression.

The scope of use of cluster analysis, due to its versatility, is very wide. Cluster analysis is used in economics, marketing, archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, sociology and other fields.

Here are some examples of using cluster analysis:

  • medicine – classification of diseases, their symptoms, treatment methods, classification of patient groups;
  • marketing – tasks of optimizing the company’s product line, segmenting the market by groups of goods or consumers, identifying potential consumers;
  • sociology – dividing respondents into homogeneous groups;
  • psychiatry – correct diagnosis of groups of symptoms is decisive for successful therapy;
  • biology - classification of organisms by group;
  • economics – classification of subjects of the Russian Federation according to investment attractiveness.

Source: http://www.statmethods.ru/konsalting/statistics-metody/121-klasternyj-analiz.html

Understanding Cluster Analysis

Cluster analysis includes a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures.

For example, biologists set a goal to divide animals into various types to meaningfully describe the differences between them.

The task of cluster analysis is to divide the original set of objects into groups of similar objects that are close to each other. These groups are called clusters.

In other words, cluster analysis is one of the ways to classify objects according to their characteristics. It is desirable that the classification results have a meaningful interpretation.

The results obtained by cluster analysis methods are used in a wide variety of fields. In marketing, this is the segmentation of competitors and consumers.

In psychiatry, the correct diagnosis of symptoms such as paranoia, schizophrenia, etc. is decisive for successful therapy.

In management, it is important to classify suppliers and identify similar production situations in which defects occur. In sociology, it is the division of respondents into homogeneous groups. In portfolio investing, it is important to group securities by similarity in their return dynamics in order to compile, based on the information obtained about the stock market, an optimal investment portfolio that maximizes return on investment at a given level of risk.

In general, whenever it is necessary to classify a large amount of information of this kind and present it in a form suitable for further processing, cluster analysis turns out to be very useful and effective.

Cluster analysis allows you to consider a fairly large amount of information and greatly compress large amounts of socio-economic information, making them compact and visual.

Attention!

Cluster analysis is of great importance when applied to sets of time series characterizing economic development (for example, general economic and commodity market conditions).

Here you can highlight periods when the values ​​of the corresponding indicators were quite close, and also determine groups of time series whose dynamics are most similar.

In tasks of socio-economic forecasting, the combination of cluster analysis with other quantitative methods (for example, regression analysis) is very promising.

Advantages and Disadvantages

Cluster analysis allows for an objective classification of any objects that are characterized by a number of characteristics. There are a number of benefits that can be derived from this:

  1. The resulting clusters can be interpreted, that is, they can describe what groups actually exist.
  2. Individual clusters can be discarded. This is useful in cases where certain errors were made when collecting data, as a result of which the values of indicators for individual objects deviate sharply. When applying cluster analysis, such objects fall into a separate cluster.
  3. Only those clusters that have the characteristics of interest can be selected for further analysis.

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depends on the selected partition criteria.

When reducing the original data array to a more compact form, certain distortions may arise, and the individual features of individual objects may be lost due to their replacement with the characteristics of generalized values ​​of the cluster parameters.

Methods

Currently, more than a hundred different clustering algorithms are known. Their diversity is explained not only by different computational methods, but also by different concepts underlying clustering.

The following clustering methods are implemented in the Statistica package.

  • Hierarchical algorithms - tree clustering. Hierarchical algorithms are based on the idea of ​​sequential clustering. At the initial step, each object is considered as a separate cluster. In the next step, some of the clusters closest to each other will be combined into a separate cluster.
  • K-means method. This method is used most often. It belongs to the group of so-called reference methods of cluster analysis. The number of clusters K is specified by the user.
  • Two-way joining. When using this method, clustering is carried out simultaneously both by variables (columns) and by observations (rows).

The two-way joining procedure is used in cases where simultaneous clustering across variables and observations can be expected to produce meaningful results.

The results of the procedure are descriptive statistics for the variables and observations, as well as a two-dimensional color chart in which the data values are color-coded.

Based on the color distribution, you can get an idea of ​​homogeneous groups.

Normalization of variables

Partitioning the initial set of objects into clusters involves calculating the distances between objects and selecting objects whose distance is the smallest of all possible.

The most commonly used is the Euclidean (geometric) distance that is familiar to all of us. This metric corresponds to intuitive ideas about the proximity of objects in space (as if the distances between objects were measured with a tape measure).

But for a given metric, the distance between objects can be greatly affected by changes in scales (units of measurement). For example, if one of the features is measured in millimeters and then its value is converted to centimeters, the Euclidean distance between objects will change greatly. This will lead to the fact that the results of cluster analysis may differ significantly from previous ones.

If variables are measured in different units of measurement, then their preliminary normalization is required, that is, a transformation of the original data that converts them into dimensionless quantities.

Note, however, that normalization distorts the geometry of the original space, which can change the clustering results.

In the Statistica package, normalization of a variable $x$ is performed using the formula:

$z = \frac{x - \bar{x}}{\sigma_x}$

where $\bar{x}$ is the mean and $\sigma_x$ is the standard deviation of $x$.

To do this, right-click on the variable name and select the following sequence of commands in the menu that opens: Fill / Standardize Block / Standardize Columns. The mean of the normalized variable will become equal to zero, and the variance will become equal to one.
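Outside Statistica, the same standardization can be sketched, for example, with scikit-learn (the data below is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features in different units: height in cm, weight in kg.
X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])

Xz = StandardScaler().fit_transform(X)  # (x - mean) / standard deviation
print(Xz.mean(axis=0))  # approximately 0 for each column
print(Xz.std(axis=0))   # 1 for each column
```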

K-means method in Statistica program

The K-means method divides a set of objects into a given number K of different clusters located at the greatest possible distances from each other.

Typically, once the results of a K-means cluster analysis are obtained, the means for each cluster along each dimension can be calculated to assess how different the clusters are from each other.

Ideally, you should obtain widely varying means for most of the measurements used in the analysis.

The F-statistic values obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.

As an example, consider the results of a survey of 17 employees of an enterprise on satisfaction with indicators of the quality of their career. The table provides answers to the survey questions on a ten-point scale (1 is the minimum score, 10 is the maximum).

The variable names correspond to the answers to the following questions:

  1. SLC – a combination of personal goals and organizational goals;
  2. OSO – sense of fairness in remuneration;
  3. TBD - territorial proximity to home;
  4. OEB – sense of economic well-being;
  5. KR – career growth;
  6. ZhSR – desire to change jobs;
  7. RSD – sense of social well-being.

Using this data, it is necessary to divide employees into groups and identify the most effective management levers for each of them.

At the same time, differences between groups should be obvious, and within the group respondents should be as similar as possible.

Today, most sociological surveys report only percentages of votes: the number of positive answers or the percentage of the dissatisfied is considered, but the issue is not examined systematically.

Most often, the survey does not show a trend in the situation. In some cases, it is necessary to count not the number of people who are “for” or “against”, but the distance, or the measure of similarity, that is, to determine groups of people who think approximately the same way.

Cluster analysis procedures can be used to identify, based on survey data, some really existing relationships between characteristics and generate their typology on this basis.

Attention!

The presence of any a priori hypotheses of a sociologist when working with cluster analysis procedures is not a necessary condition.

In Statistica, cluster analysis is performed as follows.

When choosing the number of clusters, be guided by the following: the number of clusters, if possible, should not be too large.

The distance at which the objects of a given cluster were united should, if possible, be much less than the distance at which something else joins this cluster.

When choosing the number of clusters, most often there are several correct solutions at the same time.

We are interested, for example, in how the answers to the survey questions compare between ordinary employees and the management of the enterprise. Therefore we choose K=2. For further segmentation, you can increase the number of clusters.

Statistica offers three ways to choose the initial cluster centers:

  1. select observations with the maximum distance between cluster centers;
  2. sort distances and select observations at regular intervals (the default setting);
  3. take the first observations as centers and attach the remaining objects to them.

For our purposes, option 1) is suitable.

Many clustering algorithms often “impose” an unnatural structure on the data and disorient the researcher. Therefore, it is imperative to apply several cluster analysis algorithms and to draw conclusions based on an overall assessment of the results of the algorithms.

The analysis results can be viewed in the dialog box that appears:

If you select the Graph of means tab, a graph of the coordinates of the cluster centers will be built:


Each broken line in this graph corresponds to one of the clusters. Each division on the horizontal axis of the graph corresponds to one of the variables included in the analysis.

The vertical axis corresponds to the average values ​​of the variables for objects included in each of the clusters.

It can be noted that there are significant differences in the attitude of the two groups of people to their careers on almost all questions. There is complete unanimity on only one question - the sense of social well-being (RSD), or rather, the lack thereof (2.5 points out of 10).

We can assume that cluster 1 represents workers and cluster 2 represents management. Managers are more satisfied with career growth (KR) and with the combination of personal goals and organizational goals (SLC).

They have a higher sense of economic well-being (OEB) and of fairness in remuneration (OSO).

They are less concerned about territorial proximity to home (TBD) than the workers, probably due to fewer problems with transport. Managers also have less desire to change jobs (ZhSR).

Despite the fact that the workers are divided into two categories, they answer most questions in relatively the same way. In other words, if something does not suit the general group of employees, senior management is not satisfied with the same thing either, and vice versa.

The agreement of the graphs allows us to conclude that the well-being of one group is reflected in the well-being of the other.

Cluster 1 is not satisfied with the territorial proximity to home. This group is the bulk of workers who mainly come to the enterprise from different parts of the city.

Therefore, it is possible to propose to the main management to allocate part of the profit to the construction of housing for the company’s employees.

There are significant differences in the attitude of the two groups of people to their careers. Those employees who are satisfied with their career growth, who have a high level of agreement between their personal goals and the goals of the organization, do not have the desire to change jobs and feel satisfied with the results of their work.

Conversely, employees who want to change jobs and are dissatisfied with the results of their work are not satisfied with the stated indicators. Senior management should pay special attention to the current situation.

The results of variance analysis for each characteristic are displayed by clicking the Analysis of variance button.

The sum of squared deviations of objects from cluster centers (SS Within) and the sum of squared deviations between cluster centers (SS Between) are displayed, along with the F-statistic values and p significance levels.
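A minimal sketch of this per-variable analysis of variance outside Statistica, using scipy.stats.f_oneway on synthetic survey data, might look like this:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(17, 7))  # 17 respondents, 7 survey variables (synthetic)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for j in range(X.shape[1]):
    groups = [X[labels == k, j] for k in range(2)]
    F, p = f_oneway(*groups)  # between- vs. within-cluster variability
    print(f"variable {j}: F = {F:.2f}, p = {p:.3f}")
```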

Attention!

For our example, the significance levels for two of the variables are quite large, which is explained by the small number of observations. In the full version of the study, the hypothesis about the equality of means for the cluster centers is rejected at significance levels less than 0.01.

The Save classifications and distances button displays the numbers of objects included in each cluster and the distances of objects to the center of each cluster.

The table shows the observation numbers (CASE_NO), the constituent clusters with CLUSTER numbers and the distance from the center of each cluster (DISTANCE).

Information about the cluster membership of objects can be written to a file and used in further analysis. In this example, a comparison of the results obtained with the questionnaires showed that cluster 1 consists mainly of ordinary workers and cluster 2 of managers.

Thus, it can be noted that when processing the results of the survey, cluster analysis turned out to be a powerful method that allows us to draw conclusions that cannot be reached by constructing a histogram of averages or calculating the percentage of people satisfied with various indicators of the quality of working life.

Tree clustering is an example of a hierarchical algorithm, the principle of which is to sequentially combine into a cluster, first the closest, and then increasingly distant elements from each other.

Most of these algorithms start from a similarity (distance) matrix, and each individual element is first considered as a separate cluster.

After loading the cluster analysis module and selecting Joining (tree clustering), in the window for entering clustering parameters, you can change the following parameters:

  • Initial data (Input). They can be in the form of a matrix of the data under study (Raw data) or in the form of a distance matrix (Distance matrix).
  • Clustering of observations (Cases (raw)) or of variables (Variables (columns)) describing the state of an object.
  • Distance measure. Here you can select the following measures: Euclidean distances, Squared Euclidean distances, City-block (Manhattan) distance, Chebychev distance metric, Power distance, Percent disagreement.
  • Clustering method (Amalgamation (linkage) rule). The following options are possible here: Single Linkage, Complete Linkage, Unweighted pair-group average, Weighted pair-group average, Unweighted pair-group centroid, Weighted pair-group centroid (median), Ward's method.

As a result of clustering, a horizontal or vertical dendrogram is constructed - a graph on which the distances between objects and clusters are determined when they are sequentially combined.

The tree structure of the graph allows you to define clusters depending on the selected threshold - a specified distance between clusters.

In addition, a matrix of distances between the original objects (Distance matrix) can be displayed, as well as the means and standard deviations for each original object (Descriptive statistics).

For the example considered, we will conduct a cluster analysis of variables with default settings. The resulting dendrogram is shown in the figure.


The vertical axis of the dendrogram shows the distances between objects and between objects and clusters. Thus, the distance between the variables OEB and OSO is five. At the first step, these variables are combined into one cluster.
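A minimal sketch of clustering the variables and drawing such a dendrogram with scipy (synthetic answers on a ten-point scale; the variable names follow the survey example above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
X = rng.integers(1, 11, size=(17, 7)).astype(float)  # 17 respondents, 7 questions
names = ["SLC", "OSO", "TBD", "OEB", "KR", "ZhSR", "RSD"]

Z = linkage(X.T, method="complete", metric="euclidean")  # cluster the 7 variables
dendrogram(Z, labels=names)
plt.ylabel("joining distance")
plt.show()
```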

Horizontal segments of the dendrogram are drawn at levels corresponding to the threshold distance values ​​selected for a given clustering step.

The graph shows that the question “desire to change jobs” (ZhSR) forms a separate cluster. In general, the desire to go somewhere else visits everyone equally. Next, a separate cluster is formed by the question of territorial proximity to home (TBD).

In terms of importance, it is in second place, which confirms the conclusion about the need for housing construction made based on the results of the study using the K-means method.

The sense of economic well-being (OEB) and of fairness in remuneration (OSO) are combined - this is a block of economic questions. Career growth (KR) and the combination of personal and organizational goals (SLC) are also combined.

Other clustering methods, as well as the choice of other types of distances, do not lead to a significant change in the dendrogram.

Results:

  1. Cluster analysis is a powerful tool for exploratory data analysis and statistical research in any subject area.
  2. The Statistica program implements both hierarchical and structural (non-hierarchical) methods of cluster analysis. The advantages of this statistical package stem from its graphical capabilities. Two-dimensional and three-dimensional graphical displays of the resulting clusters in the space of the studied variables are provided, as well as the results of the hierarchical procedure for grouping objects.
  3. It is necessary to apply several cluster analysis algorithms and draw conclusions based on an overall assessment of the results of the algorithms.
  4. Cluster analysis can be considered successful if it is performed in different ways, the results are compared and general patterns are found, and stable clusters are found regardless of the clustering method.
  5. Cluster analysis allows you to identify problem situations and outline ways to solve them. Therefore, this nonparametric statistical method can be considered an integral part of systems analysis.

Propositions arrived at by purely
logical means, when compared
with reality, turn out to be
completely empty.
A. Einstein

How to correctly analyze and classify data? Why do we need graphs and diagrams?

Workshop lesson

Purpose of the work. Learn to classify and analyze data obtained from text.

Work plan. 1. Analyze the text in order to determine the essential properties of the subject being discussed. 2. Structure the content of the text in order to highlight the classes of objects being discussed. 3. Understand the role of logical schemes, graphs, diagrams for understanding the material being studied, establishing logical connections, and systematization.

Analyze the text. To do this, you need to mentally identify the subject of the text - the essential. Single it out and break it into its component parts in order to find individual elements, features, and aspects of this subject.

Ivan Kramskoy. D. I. Mendeleev

Whose portraits of systematizing scientists would you add to this series?

PORTRAIT OF BALL LIGHTNING. “A portrait of a mysterious natural phenomenon - ball lightning - was produced by specialists from the A. I. Voeikov Main Geophysical Observatory, using computers and forensic methods. The “photofit” of the mysterious stranger was compiled on the basis of data published in the press over three centuries, the results of research surveys, and eyewitness reports from different countries.

Which of its secrets did the floating clot of energy tell scientists?

It is noticed mostly during thunderstorms. At all times, eyewitnesses have recorded four forms of ball lightning: sphere, oval, disk, and rod. This offspring of atmospheric electricity was, for the most part, born in the air. However, according to American surveys, such lightning can be seen with equal frequency landing on various objects - telegraph poles, trees, houses. The size of this amazing companion of thunderstorms is from 15 to 40 cm. Color? Three quarters of eyewitnesses watched sparkling balls of red, yellow, and pink.

The life of this clot of electric plasma is truly as brief as a butterfly's: usually about five seconds. Up to 36% of eyewitnesses saw it live longer, but no more than 30 s. Its death was almost always the same: it spontaneously exploded, sometimes after bumping into various obstacles. The “collective portraits” made by observers from different times and peoples coincided.”

If, after reading the text, you were able to answer the questions about what the text says and what the main features, elements, aspects, and properties of the subject of discussion are, then you have analyzed it. In this case, the subject, the main content of the text, is the idea of ball lightning. The properties of ball lightning are its appearance (size, shape, color) as well as its lifetime and behavioral features.

Based on the analysis of the text, determine its logical structure. Suggest forms of working with this text in order to assimilate it, memorize it, and use it as interesting, unusual material in your future educational work - in discussions and presentations.

HINT. You can draw up an outline of this text, a scheme of it, or theses (generalizations and conclusions that you consider the main thoughts of the text). It is useful to highlight what is new and unfamiliar to you in the material. You can also create a logical diagram of the material. To do this, after analyzing the text, highlight the information that is significant to you, try to combine it into groups, and show the connections between these groups.

The use of tables, graphs, and diagrams helps us systematize material when studying natural science subjects. Suppose we have data on average monthly daytime temperatures for one year for St. Petersburg and Sochi. It is required to analyze and systematize this material in order to identify any patterns.

Let's present a disparate set of data in the form of a table, then in the form of a graph and diagram (Fig. 5, 6). Find patterns in temperature distribution. Answer the questions:

  1. What are the features of the temperature distribution by month in different cities? How are these distributions different?
  2. What is the reason for the processes that lead to this distribution?
  3. Did systematizing the material using a graph or diagram help you complete the task?

Average monthly daily temperatures for one year for St. Petersburg and Sochi

Fig. 5. Graph of average monthly daily temperatures for one year for St. Petersburg and Sochi

Fig. 6. Diagram: average monthly daily temperatures for one year in St. Petersburg and Sochi

Important steps to mastering the methods of scientific knowledge are:

  1. Logical text analysis.
  2. Drawing up a plan, diagrams, highlighting the structure of the material.
  3. Taking notes or writing abstracts.
  4. Identification of new knowledge and its use in discussions, speeches, and in solving new tasks and problems.

Literature for further reading

  1. Einstein, A. Without Formulas / A. Einstein; comp. K. Kedrov; trans. from English. - Moscow: Mysl, 2003.
  2. Methodology of Science and Scientific Progress. - Novosibirsk: Nauka, 1981.
  3. Feyerabend, P. Selected Works on the Methodology of Science / P. Feyerabend. - Moscow: Progress, 1986.

10.2. Data mining (Data Mining)

The sphere of patterns differs from the previous two in that the information accumulated in it is automatically summarized into information that can be characterized AS KNOWLEDGE.

Data mining (DM) technology has come into its own in the last decade, playing a central role in many areas of business.

    We are all subjected to Data Mining dozens of times a day - from receiving mailing lists, promotions in stores, and free newspapers on the street, to fraud detection algorithms that analyze every credit card purchase.

    The reason for the widespread use of data mining methods: they give good results. Technology can significantly improve an organization's ability to achieve its goals.

    Its popularity is growing as tools are improved and widely used, become cheaper and easier to use.

There are two terms translated as intelligent data analysis (IAD): Knowledge Discovery in Databases (KDD) and Data Mining (DM).

Data Mining is the process of searching raw data for (1) correlations, trends, relationships, associations, and patterns by means of various (2) mathematical and statistical algorithms.

    Most IAD methods were originally developed within the framework of artificial intelligence theory in the 1970s and 1980s. But they became widespread only in the 1990s, when the problem of intelligent processing of large and rapidly growing volumes of corporate data required their use as a superstructure over data warehouses.

The purpose of this search (the stages of IAD) is:

        1) Prepare data in a form that clearly reflects business processes.

        2) Build models with which you can predict processes that are critical for business planning:

        • (2a) perform model validation and evaluation;

        3) Conduct historical data analysis to make decisions:

        • (3a) selection and application of the model;

          (3b) correction and updating of models.

Classification of IAD tasks by types of information retrieved

In most cases, IAD tasks are classified according to the type of information extracted. Data Mining tasks (models) are divided into two classes:

    (1) predictive models, with the help of which the prediction of the values of numeric attributes is carried out;

    (2) descriptive models, which describe general patterns of the subject area.

The most striking representative of the first class is the classification problem.

1. Classification is the identification of features, a set of rules, that characterize a group.

This is the most common IAD task. It allows one to identify the features that characterize similar groups of objects (classes), so that, based on the known values of these features, a new object can be assigned to one of the classes.

    A typical use of classification is competition between suppliers of goods and services for certain groups of customers. Classification can help determine the characteristics of unstable customers who are inclined to switch to another supplier, which makes it possible to find the optimal strategy for keeping them from this step (by providing discounts and incentives, or even through individual work with representatives of the “at-risk groups”).

Using the classification model, the following tasks are solved:

    whether a new client belongs to one of a set of existing classes;

    whether a certain course of treatment is suitable for the patient;

    identifying groups of unreliable clients;

    identifying customer groups to which a catalog with new products should be sent.

The following methods can be used to solve the classification problem:

      Lazy-Learning type algorithms, including the well-known Nearest Neighbor and k-Nearest Neighbor algorithms,

      Bayesian Networks or neural networks.

      classification using decision trees;

      support vector machine classification;

      statistical methods, in particular linear regression;

      classification using the CBR method;

      classification using genetic algorithms.

To carry out classification using mathematical methods, one must have a formal description of the object that the mathematical classification apparatus can operate on. Such a description is usually a database. Each object (database record) carries information about some properties of the object. The source data set is divided into two sets: training and test.

        Training set - a set that includes the data used to train (construct) the model.

        Test set - a set used to check the functionality of the model.

The division into training and test sets is carried out by splitting the sample in a certain proportion; for example, the training set may be two-thirds of the data and the test set one-third. This method should be used for samples with a large number of examples. If the sample size is small, it is recommended to use special methods in which the training and test samples may partially overlap.
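A minimal sketch of this two-stage workflow (construct the model on the training set, then check it on the test set) on synthetic data, using scikit-learn and a nearest-neighbor classifier:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))            # formal descriptions of 300 objects
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # two predefined classes (synthetic)

# Two-thirds for training, one-third for testing, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)  # build the model
accuracy = accuracy_score(y_te, model.predict(X_te))  # share of correctly classified examples
print("accuracy:", accuracy)
```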

The classification process consists of two stages: constructing the model and using it.

    Model construction: description of a set of predefined classes.

Each dataset example belongs to one predefined class.

At this stage, the training set is used and the model is constructed. The resulting model is represented by classification rules, a decision tree, or a mathematical formula.

    Using the model: classifying new or unknown values.

Assessing the correctness (accuracy) of the model.

        A) The known values from the test set are compared with the results obtained using the model.

        B) Accuracy level - the percentage of correctly classified examples in the test set.

        C) The test set, i.e., the set on which the constructed model is tested, should not depend on the training set.

If the obtained model accuracy is acceptable, it is possible to use the model to classify new examples whose class is unknown.

Classification Accuracy: Error Rate Estimation

Classification accuracy can be assessed using cross-validation. Cross-validation is a procedure for assessing classification accuracy on data from a test set, which is also called a cross-validation set. The classification accuracy of the test set is compared with the classification accuracy of the training set. If the classification of the test set gives approximately the same accuracy results as the classification of the training set, the model is considered to have passed cross-validation.
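For example, a k-fold cross-validation estimate of classification accuracy can be sketched with scikit-learn as follows (synthetic data; the five-fold split is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Accuracy on 5 held-out folds; each score is computed on data
# the model was not trained on.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())
```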

The most prominent representatives of the second class are problems of clustering, association, sequence, etc.

Fig. Comparison of classification and clustering problems

2. Clustering is the identification of homogeneous groups of data.

It logically continues the idea of classification to a more complex case: one in which the classes themselves are not predefined. The result of using a method that performs clustering is precisely the determination (through a free search) of the grouping inherent in the data under study.

    In the example above, the “risk groups” - categories of clients ready to switch to another supplier - can be identified by clustering means before the churn process begins, which allows problems to be prevented rather than corrected after the fact.

The methods used are “unsupervised” training of a special type of neural network - Kohonen networks - as well as rule induction.

Clustering is designed to divide a collection of objects into homogeneous groups (clusters or classes). If the sample data is represented as points in the feature space, then the clustering problem is reduced to determining “concentrations of points.”

The purpose of clustering is to search for existing structures. Clustering is a descriptive procedure and does not make any statistical inferences, but makes it possible to conduct exploratory analysis and study the “data structure”.

The very concept of “cluster” is defined ambiguously: each study has its own “clusters”. The concept of cluster is translated as “cluster”, “bunch”.

A cluster can be characterized as a group of objects that have common properties.

A cluster can be characterized by two properties:

        internal homogeneity;

        external isolation.

Clusters can be non-overlapping (exclusive) or overlapping.

The quality of clustering can be assessed based on the following procedures:

    manual check;

    establishing control points and checking the resulting clusters;

    determining clustering stability by adding new variables to the model;

    creating and comparing clusters obtained by different methods. Different clustering methods can produce different clusters, and this is normal. However, if different methods produce similar clusters, this indicates that the clustering is correct (a sketch of such a comparison follows this list).
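A minimal sketch of that last procedure, comparing k-means with hierarchical (agglomerative) clustering; the synthetic data and the adjusted Rand index as the similarity measure are assumptions of the example:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three "concentrations of points".
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_ag = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Values near 1 mean the two methods found similar clusters,
# which speaks in favor of the clustering being correct.
print(adjusted_rand_score(labels_km, labels_ag))
```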

Cluster analysis in marketing research

In marketing research, cluster analysis is used quite widely, both in theoretical studies and by practicing marketers who solve problems of grouping various objects: clients, products, and so on.

One of the most important tasks when applying cluster analysis in marketing research is consumer behavior analysis, namely:

    grouping consumers into homogeneous classes in order to obtain the most complete picture of the behavior of the typical client in each group and of the factors influencing that behavior.

An important problem that cluster analysis can solve is positioning, i.e. determining the niche in which a new product offered to the market should be placed. Applying cluster analysis produces a map from which one can determine the level of competition in various market segments and the characteristics a product needs in order to enter a given segment. Analyzing such a map makes it possible to identify new, unoccupied market niches in which existing products can be offered or new ones developed.

Cluster analysis can also be useful, for example, for analyzing a company's clients: all clients are grouped into clusters, and an individual policy is developed for each cluster. This approach significantly reduces the number of objects to analyze while preserving an individual approach to each group of clients.

3. Association rules – the search for events related to one another.

An association is determined not from the property values of a single object or event, but between two or more simultaneously occurring events. The rules produced indicate that when one event occurs, another occurs with some degree of probability. The strength of an association is quantified by several measures; for example, the following three characteristics can be used (a computational sketch follows the list):

    a) predictability determines how often events X and Y occur together, as a proportion of the total number of occurrences of event X;

For example, when a TV (X) is bought, a VCR (Y) is also bought in 65% of cases;

    b) prevalence shows how often events X and Y occur together, relative to the total number of recorded events;

In other words, how often the simultaneous purchase of a TV and a VCR occurs among all purchases made;

    c) expected predictability shows the predictability that would be observed if there were no relationship between the events;

For example, how often would a VCR be purchased regardless of whether a television was purchased?
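A small sketch computing all three characteristics for the TV/VCR example; the transaction data are invented for illustration:

```python
# Each transaction is the set of goods in one purchase.
transactions = [
    {"tv", "vcr"}, {"tv"}, {"tv", "vcr"}, {"vcr"},
    {"tv", "vcr"}, {"bread"}, {"tv"}, {"vcr", "bread"},
]

n = len(transactions)
n_x = sum("tv" in t for t in transactions)            # purchases of X (TV)
n_y = sum("vcr" in t for t in transactions)           # purchases of Y (VCR)
n_xy = sum({"tv", "vcr"} <= t for t in transactions)  # X and Y together

predictability = n_xy / n_x          # a) how often Y accompanies X
prevalence = n_xy / n                # b) share of X-and-Y among all events
expected_predictability = n_y / n    # c) frequency of Y regardless of X

print(predictability, prevalence, expected_predictability)
```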

4. Sequence detection – the search for chains of events related in time.

Like associations, sequences hold between events, but the events are not simultaneous: they are separated by some specific gap in time. An association is thus a special case of a sequence, with a zero time lag.

For example: if a VCR was not purchased together with the TV, then within a month after the purchase of a new TV a VCR is bought in 51% of cases.

5. Forecasting – an attempt to find patterns that adequately reflect the dynamics of the system's behavior, i.e. to predict the system's future behavior from historical information.

It is a form of prediction that, based on current and historical data, estimates the future values of certain numerical indicators.

In problems of this type, traditional methods of mathematical statistics, as well as neural networks, are most often used.

Forecasting (from the Greek prognosis) is defined, in the broad sense of the word, as an anticipatory reflection of the future. The purpose of forecasting is to predict future events.

Solving the forecasting problem comes down to solving the following subtasks:

    selection of a forecasting model;

    analysis of the adequacy and accuracy of the constructed forecast.

Classification and forecasting problems: similarities and differences.

So what do classification and forecasting problems have in common?

Both problems involve a two-stage process: a model is built from a training set, and the model is then used to predict unknown values of the dependent variable.

The difference between classification and forecasting problems is that in the first the predicted value is the class of the dependent variable, while in the second it is numerical values of the dependent variable that are missing or unknown (i.e., lying in the future).

For example, for a travel agency, determining the class of a client is a classification problem, while predicting the income that client will bring in the next year is a forecasting problem.

The basis for forecasting is historical information stored in the database in the form of time series.

There are two fundamental differences between a time series and a simple sequence of observations:

    Members of a time series, in contrast to elements of a random sample, are not statistically independent.

    The members of a time series are not identically distributed.

Trend, seasonality and cycle

The main components of a time series are the trend and the seasonal component.

A trend is a systematic component of a time series that can change over time. A trend is a non-random function formed under the influence of general or long-term factors acting on the time series.

The seasonal component of a time series is a periodically recurring component of a time series. The seasonality property means that at approximately equal intervals of time the shape of the curve that describes the behavior of the dependent variable repeats its characteristic outlines.

The seasonality property is important in determining the amount of historical data to be used for forecasting.

It is important not to confuse the seasonal component of a series with the seasons of nature; despite the similar names, these are different concepts. For example, ice cream sales in summer are much higher than in other seasons, but this is a recurring pattern in the demand for this product; the "season" of a time series is defined by the series itself and need not coincide with the seasons of nature.
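A sketch separating a trend from a seasonal component in synthetic monthly data; the linear trend, the 12-month period and the moving-average estimate are all assumptions of the example:

```python
import numpy as np

months = np.arange(120)                          # 10 years of monthly data
trend = 0.5 * months                             # systematic long-term component
seasonal = 10 * np.sin(2 * np.pi * months / 12)  # repeats every 12 months
series = trend + seasonal + np.random.default_rng(0).normal(0, 1, 120)

# A centered 12-month moving average smooths out the seasonal component
# and leaves an estimate of the trend.
kernel = np.ones(12) / 12
trend_estimate = np.convolve(series, kernel, mode="valid")
seasonal_estimate = series[6:6 + len(trend_estimate)] - trend_estimate
print(trend_estimate[:3], seasonal_estimate[:3])
```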

Fig. Fragment of a time series over one seasonal period

Fig. Fragment of a time series over 12 seasonal periods

Forecast period – the basic unit of time for which the forecast is made.

    For example, we want to know the company's income in a month. The forecast period for this problem is a month.

Forecasting horizon is the number of periods in the future that the forecast covers.

    If the forecast is made 12 months ahead, with data for each month, then the forecast period in this problem is a month and the forecast horizon is 12 months.

Forecast interval – the frequency with which a new forecast is made.

    The forecast interval may coincide with the forecast period.

Forecast accuracy is characterized by forecast error.

The most common types of error are the following (the sketch after the list computes each of them):

    Mean error (ME). It is calculated by simply averaging the errors at each step. The disadvantage of this measure is that positive and negative errors cancel each other out.

    Mean absolute error (MAE). It is calculated as the average of the absolute errors. If it equals zero, the forecast is perfect. Compared with the mean squared error, this measure does not give excessive weight to outliers.

    Sum of squared errors (SSE) and mean squared error (MSE). Calculated as the sum (or the mean) of the squared errors. This is the most commonly used estimate of forecast accuracy.

    Relative error (RE). The previous measures use the actual magnitudes of the errors; the relative error expresses the quality of the fit in terms of errors relative to the actual values.
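A sketch computing the listed measures on invented actual and forecast values:

```python
import numpy as np

actual = np.array([100.0, 110.0, 120.0, 130.0])
forecast = np.array([102.0, 108.0, 123.0, 128.0])
errors = actual - forecast

me = errors.mean()                   # mean error: signs cancel out
mae = np.abs(errors).mean()          # mean absolute error
sse = (errors ** 2).sum()            # sum of squared errors
mse = (errors ** 2).mean()           # mean squared error
re = np.abs(errors / actual).mean()  # mean relative error

print(me, mae, sse, mse, re)
```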

6. Anomalies – identification of anomalous values in the data.

Identifying them makes it possible to detect (1) errors in the data, (2) the emergence of a new, previously unknown pattern, or (3) a refinement of known patterns.
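One simple way to flag anomalous values is a robust z-score rule; the data and the 3.5 threshold in this sketch are assumptions:

```python
import numpy as np

data = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 45.0, 10.3])

# Robust z-score based on the median and the median absolute deviation,
# so that the outlier itself does not inflate the scale estimate.
median = np.median(data)
mad = np.median(np.abs(data - median))
robust_z = 0.6745 * (data - median) / mad
print(data[np.abs(robust_z) > 3.5])  # -> [45.]
```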


Revolutionary discoveries in natural science were often made under the influence of the results of experiments carried out by talented experimenters. Great experiments in biology, chemistry, and physics contributed to changing the understanding of the world in which we live, the structure of matter, and the mechanisms of transmission of heredity. Based on the results of great experiments, other theoretical and technological discoveries were made.

§ 9. Theoretical research methods

Lesson-lecture

There is something in the world more important than the most wonderful discoveries: the knowledge of the methods by which they were made.

Leibniz

Method. Classification. Systematization. Systematics. Induction. Deduction.

Observation and description of physical phenomena. Physical laws. (Physics, grades 7–9.)

What is a method. In science, a method is a way of constructing knowledge, a form of the practical and theoretical mastery of reality. Francis Bacon compared the method to a lamp illuminating the way for a traveler in the dark: "Even a lame man walking along a road outpaces one who walks without a road." A correctly chosen method must be clear and logical, lead to a specific goal, and produce results. The doctrine of a system of methods is called methodology.

The methods of cognition used in scientific activity are empirical (practical, experimental) methods, namely observation and experiment, and theoretical (logical, rational) methods: analysis, synthesis, comparison, classification, systematization, abstraction, generalization, modeling, induction, deduction. In real scientific cognition these methods are always used in unity. For example, developing an experiment requires a preliminary theoretical understanding of the problem and the formulation of a research hypothesis, and after the experiment the results must be processed by mathematical methods. Let us consider the features of some theoretical methods of cognition.

Classification and systematization. Classification organizes the material under study by grouping the set (class) of objects under study into subsets (subclasses) according to a selected characteristic.

For example, all the students of a school can be divided into the subclasses "girls" and "boys". One can also choose another characteristic, such as height, and classify in different ways: set a threshold of 160 cm and divide students into the subclasses "short" and "tall", or split the height scale into 10 cm segments, in which case the classification will be more detailed. Comparing the results of such a classification over several years makes it possible to establish trends in the students' physical development empirically. Consequently, classification as a method can be used to obtain new knowledge and can even serve as the basis for constructing new scientific theories.
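Both variants of such a classification are easy to sketch; the heights below are invented:

```python
heights = [152, 167, 171, 158, 180, 163, 149, 175]

# Variant 1: one threshold gives two subclasses.
short = [h for h in heights if h < 160]
tall = [h for h in heights if h >= 160]
print(short, tall)

# Variant 2: 10 cm bins give a more detailed classification.
bins = {}
for h in heights:
    label = f"{h // 10 * 10}-{h // 10 * 10 + 9} cm"
    bins.setdefault(label, []).append(h)
print(bins)
```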

In science, the same objects are usually classified according to different criteria, depending on the goals. The characteristic (the basis of the classification), however, is always a matter of choice. For example, chemists divide the class of acids into subclasses by degree of dissociation (strong and weak), by the presence of oxygen (oxygen-containing and oxygen-free), by physical properties (volatile and non-volatile; soluble and insoluble), and by other characteristics.

The classification may change as science develops.

In the middle of the 20th century, the study of various nuclear reactions led to the discovery of elementary (non-fissile) particles. Initially they were classified by mass, which is how leptons ("light" particles), mesons ("intermediate"), baryons ("heavy") and hyperons ("superheavy") appeared. The further development of physics showed that classification by mass has little physical meaning, but the terms were retained; as a result, there are now leptons far more massive than baryons.

It is convenient to display a classification in the form of tables or diagrams (graphs). For example, the classification of the planets of the Solar System, represented by a graph diagram, may look like this:

MAJOR PLANETS OF THE SOLAR SYSTEM

    TERRESTRIAL PLANETS: Mercury, Venus, Earth, Mars

    GIANT PLANETS: Jupiter, Saturn, Uranus, Neptune

    PLUTO

Please note that in this classification the planet Pluto constitutes a separate subclass and belongs neither to the terrestrial planets nor to the giant planets. Scientists note that Pluto's properties resemble those of an asteroid, and there may be many such bodies on the periphery of the Solar System.

When complex natural systems are studied, classification in fact serves as the first step toward constructing a natural-scientific theory. The next, higher level is systematization. Systematization is carried out on the basis of the classification of a sufficiently large volume of material, identifying the most essential features that make it possible to present the accumulated material as a system reflecting all the various relationships between the objects. It is needed where there is a great variety of objects and the objects themselves are complex systems. The result of the systematization of scientific data is a systematics, or taxonomy. Systematics as a field of science developed in such areas of knowledge as biology, geology, linguistics, and ethnography.

The unit of systematics is called a taxon. In biology, taxa are, for example, phylum, class, order, family, and genus. Taxa of various ranks are combined into a unified system according to a hierarchical principle. Such a system includes a description of all existing and previously extinct organisms and clarifies the paths of their evolution. If scientists find a new species, they must confirm its place in the overall system; changes can also be made to the system itself, which remains dynamic and continues to develop. Systematics makes it easy to navigate the diversity of organisms: about 1.5 million species of animals alone are known, and more than 500 thousand species of plants, not counting other groups of organisms. Modern biological systematics reflects Saint-Hilaire's law: "The diversity of life forms constitutes a natural taxonomic system consisting of hierarchical groups of taxa of various ranks."
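The hierarchical principle itself can be illustrated by representing taxa as a tree; this toy sketch uses only a few ranks and one lineage:

```python
# A taxon holds its rank, its name, and its subordinate taxa.
taxonomy = {
    "rank": "class", "name": "Mammalia", "children": [
        {"rank": "order", "name": "Carnivora", "children": [
            {"rank": "family", "name": "Felidae", "children": [
                {"rank": "genus", "name": "Panthera", "children": []},
            ]},
        ]},
    ],
}

def print_taxa(taxon, depth=0):
    """Walk the hierarchy, printing each taxon under its parent."""
    print("  " * depth + f"{taxon['rank']}: {taxon['name']}")
    for child in taxon["children"]:
        print_taxa(child, depth + 1)

print_taxa(taxonomy)
```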

Induction and deduction. The path of cognition in which a conclusion about an existing pattern is drawn from the systematization of accumulated information, proceeding from the particular to the general, is called induction. This method of studying nature was developed by the English philosopher F. Bacon. He wrote: "We must take as many cases as possible – both those where the phenomenon under study is present and those where it is absent but where one would expect to find it; then one must arrange them methodically... and give the most likely explanation; finally, try to verify this explanation by further comparison with the facts."

Thought and image

Fig. Portraits of F. Bacon and S. Holmes

Why are the portraits of the scientist and the literary hero placed side by side?

Induction is not the only way of obtaining scientific knowledge about the world. While experimental physics, chemistry and biology were built as sciences mainly through induction, theoretical physics and modern mathematics were built on a system of axioms: consistent, speculative statements regarded as reliable from the point of view of common sense and of the historical level of development of science. Knowledge can then be built on these axioms by drawing conclusions from the general to the particular, moving from premises to consequences. This method is called deduction. It was developed by René Descartes, the French philosopher and scientist.

A striking example of gaining knowledge about one subject in different ways is the discovery of the laws of motion of celestial bodies. At the beginning of the 17th century, on the basis of a large quantity of observational data on the motion of the planet Mars, I. Kepler discovered by induction the empirical laws of planetary motion in the Solar System. At the end of the same century, Newton deductively derived the generalized laws of motion of celestial bodies from the law of universal gravitation.

In real research activity, the methods of scientific research are interconnected.

1. ○ Explain what a research method is. What is the methodology of the natural sciences?

All these approximations should be justified and the errors introduced by each of them should be numerically assessed.

The development of science shows that every natural-scientific law has limits to its application. For example, Newton's laws turn out to be inapplicable to the study of processes in the microworld. To describe those processes, the laws of quantum theory are formulated, and they become equivalent to Newton's laws when applied to the motion of macroscopic bodies. From the modeling point of view, this means that Newton's laws are a model that follows, under certain approximations, from a more general theory. However, the laws of quantum theory are not absolute either and have their own limits of applicability. More general laws have already been formulated and more general equations obtained, which in turn also have limits, and there is no end in sight to this chain. No absolute laws describing everything in nature, from which all particular laws could be derived, have yet been found, and it is not clear whether such laws can be formulated at all. But this means that any law of natural science is in fact a kind of model. The only difference from the models discussed in this paragraph is that a natural-scientific law is a model applicable to the description not of one specific phenomenon but of a wide class of phenomena.

The application of modern practical methods of data analysis and recognition is in demand in technical fields and the humanities, in science and manufacturing, in business and finance. This description presents the basic algorithmic essence of these methods, an understanding of which is useful for the more effective use of recognition and classification methods in data analysis.

1. The recognition problem (supervised classification) and the current state of practical methods for solving it. The main stages in the development of recognition theory and practice are the creation of heuristic algorithms, recognition models and model optimization, and the algebraic approach to model correction. The main approaches are based on the construction of separating surfaces, potential functions, statistical and neural network models, decision trees, and others.

The main approaches and algorithms of combinatorial-logical recognition methods (estimate-calculation models, or algorithms based on the principle of partial precedence) were developed at the Dorodnicyn Computing Centre of the Russian Academy of Sciences. These models are based on the idea of searching for important partial precedents in the feature descriptions of the source data (informative fragments of feature values, or representative sets). For real-valued features, optimal neighborhoods of the informative fragments are found. In other terminology, these partial precedents are called knowledge, or logical patterns, connecting the values of the initial features with the value being recognized or predicted. The patterns found constitute important information about the studied classes (images) of objects. They are used directly in solving recognition or prediction problems; they give a visual representation of the interdependencies present in the data, which has independent value for researchers and can serve as a basis for the subsequent creation of accurate models of the objects, situations, phenomena or processes under study. On the basis of the found body of knowledge, such useful quantities as the importance (information content) of features and objects, logical correlations of features, and logical descriptions of object classes are also calculated, and the problem of minimizing the feature space is solved.

2. Methods for solving the main problem of cluster analysis (unsupervised classification): finding groupings of objects (clusters) in a given sample of multidimensional data. A brief overview of the main approaches to cluster analysis is given, together with a description of the committee method for synthesizing collective solutions.

3. The RECOGNITION software system for intelligent data analysis, recognition and forecasting. The requirements for the system are based on the ideas of universality and intelligence. Universality means that the system can be applied to the widest possible range of problems (in terms of dimension and of the type, quality and structure of the data, and of the values to be calculated). Intelligence is understood as the presence of elements of self-tuning and the ability to solve problems automatically and successfully for an unqualified user. Within the RECOGNITION System, a library of programs has been developed implementing linear, combinatorial-logical, statistical, neural network and hybrid methods of forecasting, classification and knowledge extraction from precedents, as well as collective methods of forecasting and classification.


1. Recognition algorithms based on the calculation of estimates. Recognition is carried out by comparing the recognized object with reference objects over various sets of features and by using voting procedures. The optimal parameters of the decision rule and of the voting procedure are found by solving the problem of optimizing the recognition model: parameter values are determined at which recognition accuracy (the number of correct answers on the training sample) is maximal.

2. Voting algorithms over dead-end tests. The recognized object is compared with the reference objects over various "informative" subsets of features. Dead-end tests (or their analogues for real-valued features) of various random subtables of the original table of reference objects are used as such feature subsystems.

3. Voting algorithms over logical patterns. Sets of logical patterns of each class are computed from the training sample: sets of features and the intervals of their values characteristic of each class. When a new object is recognized, the number of logical patterns of each class satisfied by the object is counted. Each individual "hit" is considered a "vote" in favor of the corresponding class, and the object is assigned to the class with the maximum normalized sum of "votes". The method also makes it possible to estimate feature weights and logical correlations of features, to build logical descriptions of classes, and to find minimal feature subspaces. A simplified sketch follows.
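A much-simplified sketch of such voting: here each logical pattern is a conjunction of interval conditions on features, and the object goes to the class with the largest normalized sum of "votes" (the patterns and the object are invented; in the real method the patterns are learned from the training sample):

```python
# Each pattern: {feature index: (low, high)}; a pattern "fires" on an
# object if every listed feature falls inside its interval.
patterns = {
    "class_A": [{0: (0.0, 1.0), 1: (5.0, 7.0)}, {2: (0.0, 0.5)}],
    "class_B": [{0: (1.0, 2.0)}, {1: (0.0, 5.0), 2: (0.5, 1.0)}],
}

def fires(pattern, x):
    return all(low <= x[i] <= high for i, (low, high) in pattern.items())

def classify(x):
    # Normalize by the number of patterns so classes with more
    # patterns do not automatically win.
    votes = {
        cls: sum(fires(p, x) for p in ps) / len(ps)
        for cls, ps in patterns.items()
    }
    return max(votes, key=votes.get)

print(classify([0.4, 6.0, 0.3]))  # both class_A patterns fire -> class_A
```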

4. Algorithms for statistical weighted voting.

Statistically substantiated logical patterns of the classes are found from the training sample data. When new objects are recognized, an estimate of the probability that the object belongs to each of the classes is calculated as a weighted sum of "votes".

5. Linear machine.

A linear function is found for each class of objects. A recognized object is assigned to the class whose function takes the maximum value on that object. The optimal linear functions of the classes are found by solving the problem of finding a maximal consistent subsystem of a system of linear inequalities formed from the training sample. As a result, a special piecewise-linear surface is found that correctly separates the maximum number of elements of the training set.

6. Fisher linear discriminant.

A classical statistical method for constructing piecewise-linear surfaces separating the classes. Favorable conditions for applying the Fisher linear discriminant are linear separability of the classes, dichotomy (two classes), a "simple structure" of the classes, non-degenerate covariance matrices, and the absence of outliers. The modification of the Fisher linear discriminant created here allows it to be used successfully in "unfavorable" cases as well.

7. K-nearest neighbors method.

A classical statistical method. The recognized object is assigned to the class from which it has the maximum number of neighbors. The optimal number of neighbors and the prior class probabilities are estimated from the training set.
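A sketch with scikit-learn in which the number of neighbors is estimated from the training set; the search grid and the synthetic data are assumptions of the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Estimate the optimal number of neighbors from the training set alone.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 16))}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```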

8. Neural network recognition model with backpropagation learning.

A modification of the well-known method of training a neural network for pattern recognition (the backpropagation method) has been created. The quality criterion for the current parameters of the neural network is a hybrid criterion that takes into account both the sum of squared deviations of the output signals from the required values and the number of classification errors on the training set.

9. Support vector machine.

A method for constructing a nonlinear separating surface using support vectors. In a new feature space (the rectifying space), a separating surface close to linear is constructed. Constructing this surface reduces to solving a quadratic programming problem.
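A sketch using the support vector machine from scikit-learn; the RBF kernel, which implicitly defines the rectifying space, and the synthetic data are assumptions of the example:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel maps the data into a space where a near-linear separating
# surface exists; fitting the model solves a quadratic programming problem.
model = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(model.score(X_test, y_test), len(model.support_vectors_))
```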

10. Algorithms for solving recognition problems by collectives of different recognition algorithms.

The recognition problem is solved in two stages. First, various algorithms of the System are applied independently. Then an optimal collective solution is found automatically using special "corrector" methods; various approaches are used as correctors.
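The two-stage idea can be sketched with a simple voting ensemble; plain majority voting here stands in for the System's more elaborate corrector methods:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: several independent algorithms; stage 2: combine their answers.
ensemble = VotingClassifier([
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(random_state=0)),
])
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```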

11. Cluster analysis methods (automatic classification or unsupervised learning).

The following known approaches are used:

Hierarchical grouping algorithms;

Clustering with the criterion of minimizing the sum of squared deviations;

K-means method.

The classification problem can be solved both with a given number of classes and with an unknown number of classes (see the sketch below).
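A sketch of the k-means method in which an unknown number of classes is chosen by the silhouette score; the candidate range and the score itself are assumptions of the example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# With an unknown number of classes, try several k and keep the best;
# k-means itself minimizes the sum of squared deviations within clusters.
best_k = max(
    range(2, 9),
    key=lambda k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)),
)
print("chosen number of clusters:", best_k)
```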

12. Algorithm for constructing collective solutions to the classification problem.

The classification problem is solved in two stages. First, a set of various solutions (in the form of coverings or partitions) with a fixed number of classes is constructed using various algorithms of the System. Then the optimal collective classification is found by solving a special discrete optimization problem.