1. Introduction to Pattern Recognition
General model of a pattern recognition task
Pattern Recognition (PR) is the scientific discipline dealing with methods for object description and classification. In a PR task we distinguish three basic concepts: classes, patterns and features.
Classes are states of nature or categories of objects associated with concepts or prototypes.
Patterns are physical representations of the objects. Often we will refer to patterns as objects or samples.
Features are measurements or attributes derived from the patterns that may be useful for their characterization. The features may be qualitative or quantitative; discrete features with a large number of possible values are treated as quantitative. In this course we will use discrete features. For example, in handwritten digit recognition the classes are the digits 0-9, a pattern is a single scanned digit, and the features are measurements derived from its image.
A typical PR system has specific functional units, as described below.

Pattern acquisition, which can take several forms: signal or image acquisition, data collection.
Feature extraction and selection. Features are not all equally relevant. Some of them are important only in relation to others and some might be only “noise” in the particular context. Feature selection and extraction are used to improve the quality of the description.
The classification unit is the kernel unit of the PR system. It uses the feature vector provided by the previous unit to assign the object to one of the classes.
Post-processing. Sometimes the output obtained from the PR kernel unit cannot be used directly. It may need, for instance, some decoding operation. This, along with any other operations that may eventually be needed, is called post-processing.
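These stages map loosely onto a scikit-learn Pipeline. The sketch below is illustrative only and not part of the module's own listings; SelectKBest stands in for the feature selection unit and k-NN for the classification unit.
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# "pattern acquisition": a synthetic dataset with 4 features
X, y = make_blobs(n_samples=50, n_features=4)
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=2)),   # feature selection
    ('classify', KNeighborsClassifier(3)),     # classification
])
pipe.fit(X, y)
print(pipe.predict(X[:5]))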
Data sets
The central data structure we use in a PR task is the dataset. It is a matrix in which the rows represent the objects and the columns the features, labels, or other fixed sets of properties (e.g. distances to a fixed set of other objects). The matrix has size m × (k + 1): m row vectors representing the objects, each described by k feature values, plus one column that can represent the label. Objects with the same label belong to the same class. In addition, a list of prior probabilities is also stored, one per class.
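As a sketch of this structure (assuming numpy, which the later listings also use), a dataset of m = 4 objects described by k = 3 features can be stored as a 4 × 4 matrix whose last column holds the labels, with the priors estimated from the label frequencies:
import numpy as np

# m x (k + 1) matrix: three feature columns plus one label column
data = np.array([[1, 2, 1, 1],
                 [1, 2, 3, 1],
                 [2, 3, 4, 2],
                 [2, 3, 4, 2]])
X, labels = data[:, :3], data[:, 3]
# one prior probability per class, estimated from label frequencies
classes, counts = np.unique(labels, return_counts=True)
priors = counts / len(labels)   # here: 0.5 for each of the two classes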
Dataset declaration
A dataset is defined by a tuple (X, y), where X is the list of feature vectors representing the objects and y is the list of labels.
A = (
[[1, 2, 1], [1, 2, 3], [2, 3, 4], [2, 3, 4]],
[1, 1, 2, 2]
)
X, y = A
Class labels can be numbers or strings.
B = (
[[1, 2, 1], [1, 2, 3], [2, 3, 4], [2, 3, 4]],
['dog', 'dog', 'cat', 'cat']
)
X, y = B
Classifiers
In sklearn, a classifier is an object which may be fitted using some dataset and is later able to produce a prediction for a new, unlabelled object.
At the beginning, we need a dataset to fit a classifier.
C = (
[[1, 2, 1], [1, 2, 3], [2, 3, 4], [2, 3, 4]],
['dog', 'dog', 'cat', 'cat']
)
X, y = C
Later, we can import an exemplary classifier (here, k-NN with k=3) and initialise it in the clf variable. The initialised classifier may be fitted by passing X and y to its fit() method.
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(3)
clf.fit(X, y)
A fitted classifier is able to produce a prediction. It is obtained by passing an unlabelled feature vector (here [2, 3, 5]) to its predict() method. Note that predict() expects a 2-D array (one row per object), so the single vector is wrapped in an outer list.
prediction = clf.predict([[2, 3, 5]])
print(prediction)
['cat']
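Because predict() takes one row per object, several objects can be classified in a single call. KNeighborsClassifier also provides a predict_proba() method returning estimated class membership probabilities; a small sketch on the same toy data:
# several unlabelled objects classified at once
print(clf.predict([[2, 3, 5], [1, 2, 2]]))   # 'cat' for the first, 'dog' for the second
# fraction of the 3 nearest neighbours belonging to each class
print(clf.predict_proba([[2, 3, 5]]))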
Generating datasets
Sets of objects can be generated by one of the data generation routines implemented in scikit-learn.
How to generate a dataset
The example routines make_moons and make_blobs generate a dataset with the desired number of samples and features.
from sklearn import datasets
D = datasets.make_moons(n_samples=50, noise=0.05)
E = datasets.make_blobs(n_samples=50, n_features=2)
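Both routines return an (X, y) tuple, so a quick sanity check of the shapes (not part of the original listing) looks like this:
X, y = D
print(X.shape, y.shape)   # (50, 2) and (50,)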
Data visualization
For dataset visualization we often use scatter plots. In this example, we use matplotlib's subplots() to divide the figure into two parts (available in the ax list). Using the scatter() method we plot the first (X[:,0]) against the second (X[:,1]) dimension of a dataset. As color markers (c) we use the class labels (y).

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
X, y = D
ax[0].scatter(X[:, 0], X[:, 1], c=y)
X, y = E
ax[1].scatter(X[:, 0], X[:, 1], c=y)
plt.savefig('scatter_plots.png')
Visualization of decision boundaries over feature space
Knowing the basic concepts, we can visualize how the classifier operates.
First of all, we need to obtain a dataset and save it in an (X, y) tuple.
X, y = E
Later, we initialise a classifier clf and fit it using the obtained dataset.
clf = neighbors.KNeighborsClassifier(3)
clf.fit(X, y)
Next, we need to acquire the range of our decision space by calculating the minimal and maximal values of both dimensions of our dataset.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
Having calculated the ranges, we can prepare a mesh grid with a given step inside the created space.
import numpy as np
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Later, we use our classifier to predict all points in the generated mesh grid and save the result in the Z array.
# flatten the grid to a list of 2-D points, predict, then restore the grid shape
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
At the end, we plot the contour of the prediction over the mesh grid space and a scatter plot of the dataset in the same figure.

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k')
plt.savefig('decision_boundaries.png')
Exercises
As a help, use the script of the entire module.
Exercise 1
Generate a dataset, make a scatter plot, then train and plot several classifiers, at least 3 in one figure.
Exercise 2
For the generated Moons dataset, plot a series of classifiers computed by the k-NN rule for values of k between 1 and 10. Look at the influence of the neighbourhood size on the classification boundary.