While building complex machine learning and/or predictive analytic algorithmic models, huge amounts of historical/sampled data is consumed to train & tune the models. Typically, data comprises of some past observations with qualitative and/or quantitative characteristics associated with each observation. Each such observation is commonly referred to as an ‘example instance’ or ‘example case’ and observation characteristics are commonly referred to as ‘features’ or ‘attributes’. For example, an observation about certain type of customers may contain ‘height’, ‘weight’, ‘age’ and ‘gender’ and these will form the attributes of such observations.

A set of past observations is assumed to be sampled and picked at random from a ‘data universe’. Some ‘attributes’ are of numeric type and some others are ‘discrete’. Numeric type examples are ‘height’, ‘weight’ and ‘age’. Attribute ‘gender’ is discrete with ‘male’ and ‘female’ as values. Another example of discrete attribute is customer ‘status’ with values such as ‘general’, ‘silver’, ‘gold’ and ‘platinum’. Discrete attributes can only take particular values. There may potentially be an infinite number of those values, but each is distinct and there's no grey area in between. Some numeric attributes can also be discrete such as currency coin denominations – 1 cent, 5 cents (nickel), 10 cents (dime), 25 cents (quarter) as in US currency. Non-discrete numeric attributes are considered as ‘continuous’ valued attributes. Continuous attributes are not restricted to well-defined distinct values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Attributes such as ‘height’, ‘weight’ and ‘age’ fall under ‘continuous’ numeric attributes.

Before any algorithmic models are developed, the entire set of observations is transformed into an appropriate mathematical representation. This includes translating qualitative discrete attributes into quantitative forms. For example, a qualitative/discrete attribute such as customer status can be translated into quantitative form by representing value ‘general’ as 0, ‘silver’ as 1, ‘gold’ as 2 and ‘platinum’ as 3. Numeric attribute values are typically normalized. Normalization involves adjusting values measured on different scales to a notionally common scale. Most commonly used Normalization is Unity-based Normalization where attribute values are scaled (down) to fall into a range between 0 and 1. Numeric continuous attributes are discretized. Discretization refers to the process of converting or partitioning continuous attributes to discrete or nominal values. For example, values for a continuous attribute such as ‘age’ can be partitioned into equal interval ranges or bins; ages falling between 1 year and 10 years, 11-20, 21-30, 31-40 and so forth which is typically referred to as ‘binning’. Finite number of such well-defined ‘bins’ are considered and based on the ‘bin’, attribute value falls into, a distinct ‘bin’ value is assigned to that attribute. For example, if ‘age’ falls in 1-10 ‘bin’, a value of 1 may be assigned; if it falls in 11-20 ‘bin’, a value of 2 may be assigned and so forth.

Other critical issues typically encountered in dealing with attribute values are ‘missing values’, ‘inliers‘ and ‘outliers’. Incomplete data is an unavoidable problem in dealing with most of the real world data sources - (i) a value is missing because it was forgotten or lost; (ii) a certain feature is not applicable for a given observation instance (iii) for a given observation, the designer of a training set does not care about the value of a certain attribute (so-called don’t-care value). Such observations with missing attribute values are less commonly discarded but most often techniques are applied to fill-in ‘closest possible missing value’ for the attribute. Such techniques are commonly referred to as missing data imputation methods. Similarly, both ‘inlier’ and ‘outlier’ values would need to be resolved. An 'inlier' is a data value that lies in the interior of a statistical distribution and is in error. Likewise, an 'outlier' is a data value that lies in the exterior of a statistical distribution and is in error. There are statistical techniques to detect and remove such deviators, simplest one being removal using quartiles.

Data preprocessing also includes attribute value transformation, feature extraction and selection, dimensionality reduction etc., based on applicability of such techniques. Overall, data preprocessing can often have a significant impact on generalization performance of a machine learning algorithm. Once observations are converted into appropriate quantitative forms, each observation is considered to be ‘vectorized’ – essentially each observation with quantified attribute values is supposed to be describing a specific point in a multi-dimensional space and each such point is assumed to represent a ‘position-vector’ with its attributes as coordinate components in each dimension. In almost all cases of building machine learning models, matrix/vector algebra is extensively used to carry out mathematical/algebraic operations to deduce the parametric values of the algorithm/model. As such the entire set of all observations, now converted into ‘position vectors’, is represented by a matrix of vectors – either as a column matrix where each observation-vector forms the column or as a row matrix where each observation-vector forms a row, based on how machine algorithm consumes such matrices.

The following are key infrastructure/technology challenges that are encountered right away while dealing with such matrix forms:

- We are talking about, not hundreds or thousands of observations but typically hundred thousands or millions of observations. As such a matrix or even if observations are represented in some other formats, may not fit into single process memory. Besides, there could be hundreds/thousands of attributes associated with each observation magnifying the size further – imagine 1 million by 100K matrix or even higher! As such underlying technology/infrastructure should address this issue meticulously and in a way NOT affecting the runtime/performance of machine learning algorithm

- Persisting chunks of such data/matrices and reading them from disk on-demand as and when needed, may not be an option at all and even if it looks feasible, will not eliminate the problem adequately

- While training/tuning the model, there will be intermediate/temporary matrices that may be created such as Hessian matrices, which would typically be in similar shape & size or even larger, thereby demanding equal proportions or more of memory

- Even if some clustering/partitioning techniques are applied to disperse memory across a network of server nodes, there would be a need to clone such memory partitions for redundancy and fail-over – similar to how Hadoop distributed file system (HDFS) copies chunks of large file data across multiple data-nodes

- Complex matrix algebra computations are typically carried out as part of model training & convergence. This includes not just so-called-simple operations such as matrix-matrix multiplications, scalar operations, additions, subtractions but also spatial transformations such as Givens rotations, Householder transformations and so forth typically executed as part of either eigenvalues extraction, determinant evaluation, matrix decompositions such as QR, singular value decomposition, matrix bi-diagonalization or tri-diagonalization. All such operations would need to be carried out with high-efficiency optimizing memory foot-print and without high latencies

- Most algorithms would need several iterations of training/tuning and each such iteration may take several minutes/hours/days to complete. As such the underlying infrastructure/technology, apart from being scalable, should be highly reliable and fault tolerant. For a model that inherently takes long durations to get trained/tuned, it is undesirable & unimaginable to have some system component crash resulting in entire model building getting restarted again from ground up. As such, long running models should have appropriate ‘save’ points such that an operation can be recovered/restored safely without having to restart and without losing mathematical meaning/accuracy

- Underlying technology/infrastructure should help take advantage of high-speed multi-core processors or even better, graphical processing units (GPUs) resulting in faster training/convergence of complex models

- And most importantly, the underlying technology should address all the above concerns/issues in a very seamless fashion with easy-to-use & easy-to-configure system from both algorithm modelers and from system administrators’ point of view

In the next blog, I will describe how Entrigna addresses many, if not all, of these issues/concerns leveraging Entrigna’s RTES technology solution.