Value Proposition of Business Decisions - A Systemic Perspective

The value proposition of a business decision is measured in terms of its effectiveness in generating expected benefits while accomplishing one or more business goals & outcomes. There are many factors that affect the effectiveness or the value of a business decision. One of the key factors is the decision-action-latency which is defined as the total time taken, after business event(s) occurred, to collect required data, analyze data, extract new insights & intelligence, understand the applicability of such new information and finally arrive at actionable decisions. According to Dr. Richard Hackathorn, an eminent BI analyst & creator of Time-Value curves, the value of data required to make an actionable business decision degrades as time lapses by after pertinent business events have occurred. This is shown in the following Time-Value curve:

Click on the image to enlarge

The decision-action-latency in turn is cumulative of 1] ‘data-latency’ defined as time taken to collect and store the data, 2] ‘analysis-latency’ defined as time taken to analyze the data & extract new insights & new intelligence and 3] ‘decision-latency’ defined as time taken to understand the context & applicability of such new insights & intelligence and to arrive at decisions in terms of what business actions can be taken.

It has to be mentioned here that business decisions can be strategic or tactical in nature. In case of strategic decisions, the value or effectiveness is potentially realized even though the underlying data used to make decisions can be very old accumulated over longer periods of time. Essentially the slope of the Time-Value curve would be small per se with very gradual decrease in value over time. Typically, strategic decisions are based on mining large data comprising of historical observations collected from several business events over a period of time. A retail store making a decision about when to run beer sales is an example of a strategic decision. For example, a retail store after inferring from store sales data that men who purchase diapers over the weekend also tend to buy beer at the same time can make a strategic decision to capitalize on this information to put the beer near the diapers and run beer sales over the weekends.

In case of tactical decisions, the value or effectiveness of business decisions is very short-lived because underlying data/information is highly volatile and inherently contains time-sensitive intelligence reflecting upon the momentary business performance. Essentially, the slope of the Time-Value curve would be very high with the curve being extremely steep. Typically, a tactical decision pertains to a single business event or transaction and hence is based on data collected from a single event that gets compared to or correlated with associated/related data collected from most-recent related business events. Credit card fraud detection is an example of a tactical decision. For example, a credit card company after inferring that a credit card, being used to purchase an item in Chicago, was used thirty minutes earlier in a place somewhere across the globe, can make an immediate tactical decision to capitalize on this information to mark the transaction as fraud and place a hold on the card.

So how do companies ensure that value proposition of business decisions is retained & realized, regardless of strategic or tactical decisions?

As IT evolved over the years, companies automated their operations to collect data for analysis & reporting. In early days, each business application would capture such data in its own 'Reporting' database. As more operational automation got implemented, such 'islands of information' became siloed and proliferated. Soon companies realized the analytical value if the data from all such siloed islands is collectively mined together & correlated. However, collecting & correlating data from all such siloed systems was a challenge due to the incompatibilities between systems and lack of easy ways for these systems to interact & interoperate. A need for an infrastructure to exchange, store and analyze the data so that a unified view of insights & intelligence across the enterprise can be created, was recognized and thus Traditional data warehouse evolved to fill this need. Traditional data warehouse organized information from different sources so that data can be effectively analyzed to generate interesting & meaningful reports. Such reports would provide key insights to make business decisions which would then lead to course-corrective actions. To help smoothen the decision making process, a broad range of tools were developed over the years such as Extract, transform, and load (ETL) utilities for moving data from the various data sources to common data warehouse, Data-mining & Analytical engines to perform complex analysis and Reporting tools & Digital Dashboards to provide management & business decision makers with easy-to-comprehend analysis results.

Click on the image to enlarge

Active or Real-Time Data Warehouses:

The need for a solution that satisfies both the strategic and the tactical goals of an enterprise resulted in the emergence of Active Data Warehouses or Real-Time Data Warehouses. Sometimes these are also referred to as Real-Time or Right-Time Business Intelligence (RTBI) systems. Active Data warehouses not only support the strategic functions of data warehousing for deriving intelligence and knowledge from past enterprise activity, but also provide real-time tactical support to drive enterprise actions that react immediately to events as they occur. The new breed of data warehouses are designed to reduce all three latencies as much as possible by revamping utility tools. The traditional ETL process involved downtime of the data warehouse while the loads were performed. As such they are characterized as being offline ETL facilities. However Active data warehouses needed online ETL facility that not only preserved historical strategic data but also provided current tactical data. The online ETL’s job is to create and maintain a synchronized copy of source data in active data warehouse while constantly keeping it up-to-date. Besides, Active data warehouses needed improved data-mining & analytical engines with ability to incorporate business rules and with flexibility to run analytical models that can consume & adapt to more recent data blended with historical data. In effect, Active data warehouses markedly reduced overall decision-action-latency and thereby tremendously increased the value-proposition of business decisions in meeting business goals as compared to that of traditional data warehouses. In addition, they offered flexibility in making tactical decisions as and when needed by an enterprise. The following diagram shows the Time-Value curve in case of leveraging Active/Real-Time Data Warehouses.

Click on the image to enlarge

Real-Time-Intelligence-Based Decision Systems:

Even though Active data warehouses reduced overall decision-action-latency & thereby increased the value-proposition of business decisions, they were still predominantly used in a traditional sense with a strategic intent albeit leveraging most recent/current data. They were never considered as systems that can act as pure providers of 'Real-Time-Intelligence-Based' decision services. Such Real-Time-Intelligence-Based decision systems would churn & process varying business operational & transactional data on a real-time basis, sense transitory business insights, predict business fore-sights and use such reasoning to make real-time decisions that can then effect immediate actions through business transactions & operations. Such decision system would agglomerate capabilities such as Machine Learning, Data Mining, Rules Processing, Complex Event Processing, Predictive Analytics, Operations Research type of Optimizations, Artificial Intelligence & other Intelligence-Generating Algorithmic techniques and would provide flexibility to mix & match such capabilities for more complex decision orchestrations. The breadth of decision frameworks is necessary because different business objectives require different analytical approaches. For example, a rules engine works great when recognizing a customer for a milestone. Likewise, event processing is well suited for identifying potential customer dis-service scenarios. Finally, optimization techniques work well when making decisions about which promotions to place in front of the customer.

Click on the image to enlarge

Entrigna’s proprietary product RTES falls under ‘Real-Time-Intelligence-Based’ Decision System. For more information, refer to the blog titled ‘From Real-Time Insights To Actionable Decisions - The Road Entrigna Paves’.

Data & Computation challenges in Machine learning/Predictive Analytics – An Architect’s POV

While building complex machine learning and/or predictive analytic algorithmic models, huge amounts of historical/sampled data is consumed to train & tune the models. Typically, data comprises of some past observations with qualitative and/or quantitative characteristics associated with each observation. Each such observation is commonly referred to as an ‘example instance’ or ‘example case’ and observation characteristics are commonly referred to as ‘features’ or ‘attributes’. For example, an observation about certain type of customers may contain ‘height’, ‘weight’, ‘age’ and ‘gender’ and these will form the attributes of such observations.

A set of past observations is assumed to be sampled and picked at random from a ‘data universe’. Some ‘attributes’ are of numeric type and some others are ‘discrete’. Numeric type examples are ‘height’, ‘weight’ and ‘age’. Attribute ‘gender’ is discrete with ‘male’ and ‘female’ as values. Another example of discrete attribute is customer ‘status’ with values such as ‘general’, ‘silver’, ‘gold’ and ‘platinum’. Discrete attributes can only take particular values. There may potentially be an infinite number of those values, but each is distinct and there's no grey area in between. Some numeric attributes can also be discrete such as currency coin denominations – 1 cent, 5 cents (nickel), 10 cents (dime), 25 cents (quarter) as in US currency. Non-discrete numeric attributes are considered as ‘continuous’ valued attributes. Continuous attributes are not restricted to well-defined distinct values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Attributes such as ‘height’, ‘weight’ and ‘age’ fall under ‘continuous’ numeric attributes.

Before any algorithmic models are developed, the entire set of observations is transformed into an appropriate mathematical representation. This includes translating qualitative discrete attributes into quantitative forms. For example, a qualitative/discrete attribute such as customer status can be translated into quantitative form by representing value ‘general’ as 0, ‘silver’ as 1, ‘gold’ as 2 and ‘platinum’ as 3. Numeric attribute values are typically normalized. Normalization involves adjusting values measured on different scales to a notionally common scale. Most commonly used Normalization is Unity-based Normalization where attribute values are scaled (down) to fall into a range between 0 and 1. Numeric continuous attributes are discretized. Discretization refers to the process of converting or partitioning continuous attributes to discrete or nominal values. For example, values for a continuous attribute such as ‘age’ can be partitioned into equal interval ranges or bins; ages falling between 1 year and 10 years, 11-20, 21-30, 31-40 and so forth which is typically referred to as ‘binning’. Finite number of such well-defined ‘bins’ are considered and based on the ‘bin’, attribute value falls into, a distinct ‘bin’ value is assigned to that attribute. For example, if ‘age’ falls in 1-10 ‘bin’, a value of 1 may be assigned; if it falls in 11-20 ‘bin’, a value of 2 may be assigned and so forth.

Other critical issues typically encountered in dealing with attribute values are ‘missing values’, ‘inliers‘ and ‘outliers’. Incomplete data is an unavoidable problem in dealing with most of the real world data sources - (i) a value is missing because it was forgotten or lost; (ii) a certain feature is not applicable for a given observation instance (iii) for a given observation, the designer of a training set does not care about the value of a certain attribute (so-called don’t-care value). Such observations with missing attribute values are less commonly discarded but most often techniques are applied to fill-in ‘closest possible missing value’ for the attribute. Such techniques are commonly referred to as missing data imputation methods. Similarly, both ‘inlier’ and ‘outlier’ values would need to be resolved. An 'inlier' is a data value that lies in the interior of a statistical distribution and is in error. Likewise, an 'outlier' is a data value that lies in the exterior of a statistical distribution and is in error. There are statistical techniques to detect and remove such deviators, simplest one being removal using quartiles.

Data preprocessing also includes attribute value transformation, feature extraction and selection, dimensionality reduction etc., based on applicability of such techniques. Overall, data preprocessing can often have a significant impact on generalization performance of a machine learning algorithm. Once observations are converted into appropriate quantitative forms, each observation is considered to be ‘vectorized’ – essentially each observation with quantified attribute values is supposed to be describing a specific point in a multi-dimensional space and each such point is assumed to represent a ‘position-vector’ with its attributes as coordinate components in each dimension. In almost all cases of building machine learning models, matrix/vector algebra is extensively used to carry out mathematical/algebraic operations to deduce the parametric values of the algorithm/model. As such the entire set of all observations, now converted into ‘position vectors’, is represented by a matrix of vectors – either as a column matrix where each observation-vector forms the column or as a row matrix where each observation-vector forms a row, based on how machine algorithm consumes such matrices.

The following are key infrastructure/technology challenges that are encountered right away while dealing with such matrix forms:

- We are talking about, not hundreds or thousands of observations but typically hundred thousands or millions of observations. As such a matrix or even if observations are represented in some other formats, may not fit into single process memory. Besides, there could be hundreds/thousands of attributes associated with each observation magnifying the size further – imagine 1 million by 100K matrix or even higher! As such underlying technology/infrastructure should address this issue meticulously and in a way NOT affecting the runtime/performance of machine learning algorithm

- Persisting chunks of such data/matrices and reading them from disk on-demand as and when needed, may not be an option at all and even if it looks feasible, will not eliminate the problem adequately

- While training/tuning the model, there will be intermediate/temporary matrices that may be created such as Hessian matrices, which would typically be in similar shape & size or even larger, thereby demanding equal proportions or more of memory

- Even if some clustering/partitioning techniques are applied to disperse memory across a network of server nodes, there would be a need to clone such memory partitions for redundancy and fail-over – similar to how Hadoop distributed file system (HDFS) copies chunks of large file data across multiple data-nodes

- Complex matrix algebra computations are typically carried out as part of model training & convergence. This includes not just so-called-simple operations such as matrix-matrix multiplications, scalar operations, additions, subtractions but also spatial transformations such as Givens rotations, Householder transformations and so forth typically executed as part of either eigenvalues extraction, determinant evaluation, matrix decompositions such as QR, singular value decomposition, matrix bi-diagonalization or tri-diagonalization. All such operations would need to be carried out with high-efficiency optimizing memory foot-print and without high latencies

- Most algorithms would need several iterations of training/tuning and each such iteration may take several minutes/hours/days to complete. As such the underlying infrastructure/technology, apart from being scalable, should be highly reliable and fault tolerant. For a model that inherently takes long durations to get trained/tuned, it is undesirable & unimaginable to have some system component crash resulting in entire model building getting restarted again from ground up. As such, long running models should have appropriate ‘save’ points such that an operation can be recovered/restored safely without having to restart and without losing mathematical meaning/accuracy

- Underlying technology/infrastructure should help take advantage of high-speed multi-core processors or even better, graphical processing units (GPUs) resulting in faster training/convergence of complex models

- And most importantly, the underlying technology should address all the above concerns/issues in a very seamless fashion with easy-to-use & easy-to-configure system from both algorithm modelers and from system administrators’ point of view

In the next blog, I will describe how Entrigna addresses many, if not all, of these issues/concerns leveraging Entrigna’s RTES technology solution.

From Real-Time Insights To Actionable Decisions - The Road Entrigna Paves

What problem does Entrigna solve?

With tremendous increase in computing power and decrease in memory & storage costs, today’s businesses are weighed down with a deluge of mostly disconnected data, much of it highly relevant to effective decision-making; data associated with business applications, processes, transactions, operations, customers, customer insights, products, product insights, policies, systems, business-partnerships, competition etc. produces unmanageable amounts of information in many different formats. Such data is highly volatile and inherently contains time-sensitive business intelligence reflecting upon momentary business performance. If detected real-time, such ‘in-the-moment’ business intelligence can provide real-time insights that can be used to dynamically determine an optimal action to be taken in real time; Essentially such data would need to be ploughed through to discover valuable knowledge & instantaneous insights in a way to make actionable decisions and feed such decisions in real time back into business applications, processes and operations in a way to drive business value & profitability. A few examples of such insights & decisions are: Product recommendations based on customer actions & purchase behavior, Offer optimization based on customer insights, customer churn prediction, customer disservice detection and recommendation of recovery actions, dynamic pricing of products & pricing optimization based on 'in-the-moment' shopping-versus-buying patterns, real-time predictions of perishable inventory shrink, anomaly detection such as fraud and recommendation of course-correction actions.

To derive intelligence and insights in real-time from such volatile & varying data, there is a real need for technology with capabilities such as Machine Learning, Data Mining, Rules Processing, Complex Event Processing, Predictive Analytics, Operations Research type of Optimizations, Artificial Intelligence and other intelligence-generating Mathematical Algorithmic capabilities coupled with flexibility to mix & match such capabilities for more complex decisions orchestration. The breadth of decision frameworks is necessary because different business objectives require different analytical approaches. For example, a rules engine works great when recognizing a customer for a milestone. Likewise, event processing is well suited for identifying potential customer dis-service scenarios. Finally, optimization techniques work well when making decisions about which promotions to place in front of the customer.

Another challenge such technology would need to address is processing & correlating high-velocity data in live-streams coming from disparate data sources and in wide-variety of formats. As such technology should be highly scalable and fault-tolerant with extensive provisioning for large memory distribution & massive parallelization for executing complex CPU bound operations. From licensing & maintenance point of view, such technology should be cost-effective and economically viable. However, there is no single technology that is readily available offering all these capabilities out-of-the-box and with seamless implementation.

Why is this problem a big deal?

There are commercially available technologies that specialize individually in one required capability or the other. For example, there are sophisticated Business Rules Engines, technologies that excel in Complex Event Processing, those that excel in Data-Mining, in Operations Research, in traditional Business Intelligence etc. Each technology perhaps works well within its realm of decision-making capability. In almost all cases such specialized technologies come from different product vendors. There are a few vendors that offer some of these technologies as part of a suite, but as independent products within the suite. As such these technologies are not necessarily designed to speak to each other & inter-operate. However for real-life complex business decisions orchestration, such capabilities would need to be combined in a flexible and seamless fashion.

Other businesses procure individually specialized technologies as mentioned before and make them inter-operate by developing custom middleware solutions. Even then deriving insights & actionable decisions is not comprehensive because required decision orchestration is still not seamless and not fully realized since it is like a split-brain across disparate technologies. This entire saga is prone to large capital investment spent in procuring disparate technologies and in developing custom middleware solutions to make such heterogeneous technologies inter-operate. As such many business initiatives with an urgent intent of exploiting maximum advantage from real-time insights are prone to long delays with long project timelines. Moreover, because decision orchestration cannot be fully realized, such business initiatives get implemented with limited-scope with many requirements de-scoped resulting in businesses losing original business value proposition, a loss typically measured in millions of dollars, besides losing competitive edge.

How does Entrigna solve this problem?

Entrigna developed a real-time decisions platform called RTES – Real Time Expert System. RTES enables real time decision capabilities by offering full range of decision frameworks packaged in one technology that work together seamlessly; Rules Engine, Complex Event Processing, Machine Learning, Artificial Intelligence, Optimization, Clustering/Classification. Essentially, RTES platform exposes such decision frameworks as built-in modularized services that can be combined and applied on an organization’s business data on a real-time basis to identify intelligent patterns that can lead to real time business insights. By packaging such decision capabilities in one technology, RTES enables seamless orchestration of higher-level decision services - meaning an hybrid decision service can be configured as network of individual decision services, as in electrical circuit, in-series and in-parallel. Individual decision services can be rules based, machine learning based, classification based, segmentation/clustering based, predictive or regressive, real-time optimization based.

Since RTES works on live-streams of data, it promotes event-driven data integration approaches. An event can be any number of things but is usually initiated by some action or activity; examples include, a customer shopping for tennis rackets online, sale of 1000 winter gear items in last 1 hour, a snowstorm being predicted for North-East, a flight getting rescheduled; RTES attempts to extract events from live-streams of data in order to initiate the real time decision process.

RTES processes live-data a.k.a events as they occur, combining the event with other valuable data or other events, gaining intelligence from the data and deciding on an action to be taken. Sometimes, knowledge of the event is sufficient information to derive an insight and take action.  More often, additional data must be leveraged to improve intelligence. For example: customer profile, transaction/sales history, channel interaction history, social activity history, external data like weather & traffic. RTES employs data architecture strategy commonly referred to as Data Virtualization that integrates disparate data sources in real time into a usable format while decoupling data processing aspect from intelligence derivation aspect.

To enable derivation of intelligence, RTES makes it easy to combine different decision frameworks. For example, to implement a specific offer optimization requirement, RTES enables use of decision trees to determine customer value score, clustering to segment customers based on customer attributes, neural networks to assess purchase trends by customer segment, optimization to rank most-value-generating products, additional rules to further personalize offers to a specific customer and CEP to augment offers based on external events such as weather & national events - all of these orchestrated seamlessly within one single technology .i.e. RTES.

Once actionable decisions are determined, RTES enables such decisions to be integrated with business applications, processes and operations to enable action in real-time that impact business outcomes. For example, presenting optimized & personalized offers to a customer in order to help that customer buy his/her choices of products more easily such that business objective of increased product sales is met. RTES makes actionable decisions accessible by means of web services, messaging, direct web-sockets, database adapters and other custom adapters.

RTES enabled machine-Learning and AI based predictive decision models can also learn and adapt based on feedback from actions taken. Online predictive models learn from action feedback in real-time while Offline predictive models learn in batch mode from action feedback that is stored first & consumed later. Typically such feedback is enabled for other business purposes such as Enterprise Data Management & Warehousing and RTES can tap into existing feedback channels, without having to necessarily devise new ways of consuming feedback.

How does Entrigna engage clients?

Below are high level steps that Entrigna would typically follow while initiating a client engagement.

Working collaboratively, Entrigna would engage with client by listening to client’s requirements for leveraging real-time intelligence paying close attention to business goals. This is a mandatory step more so because the concept of real-time intelligence is relatively new and as a product & services provider, it becomes critical for Entrigna to streamline client’s ideas, dotting the I’s and crossing the T’s and in the process steering the client realize much more business value off of real-time insights than initially anticipated. This includes capturing client's thoughts around what they presumed as possible solutions versus what they assumed as infeasible, impracticable, anecdotal or imaginative ones, something they thought not implementable at all because technologies that they are aware of, lacked required capability. Of course, this step is very much preliminary with the understanding that Entrigna would help realize more cases & opportunities for real-time intelligence during the course of actual ensuing implementation.

Once client’s initial requirements are discovered, next natural step is to thoroughly understand two important business aspects; 1] business processes, applications & operations where real-time insights-driven-actions would be implemented and 2] data, types of data, potential sources of existing data and sources of new data that would come into play for plowing through intelligence.

Next step is a bit more involved step where hypotheses for different intelligence scenarios are actually given shape in the form of algorithmic models. Entrigna would employ elements of data-science applying them to client’s specific needs. This is very much a collaborative step wherein client’s subject matter experts would work hand-in-hand to vet the intelligence models. Entrigna would quickly prototype more than one model - trained, tuned & tested. Typically, the differences in the models are more in terms of mixing & matching of underlying decision frameworks. Entrigna would then share & review the model results to verify if what models predicted is close to or exceeding what client anticipated. In case there is a need to increase results accuracy, Entrigna would either further fine tune underlying decision algorithm or replace the existing one with a more advanced decision algorithm, ensuring applicability of such decision algorithms within mathematical feasibility boundaries set by data-science.

Finally, once the models are finalized, Entrigna would determine how decisions would get integrated into and consumed by downline applications, processes or operations; thereby Entrigna would expose models as intelligent decision services with appropriate service access mechanisms such as web services, messaging etc.

Math behind Soft-Margin SVM Classifier - Sequential Minimal Optimization

The decision function for an SVM classifier is given by:

\begin{equation*}\begin{aligned}\hat{f}(\bar x) = sgn(\bar w.\bar x - b) \hspace{5 mm} \bar x \in \mathbb{R}^n \hspace{5 mm} and \hspace{2 mm} \bar w \in \mathbb{R}^n\end{aligned}\end{equation*}

$\bar w$ is the normal vector and b is the offset term for the decision surface $\bar w.\bar x = b$.

The corresponding supporting hyperplanes are as follows:

\begin{equation*}\begin{aligned}\bar w.\bar x = b + 1 - \xi \hspace{5 mm} \forall \hspace{2 mm} (\bar x, y) \mid \bar x \in \mathbb{R}^n, \hspace{2 mm} \hspace{2 mm} y = +1, \hspace{2 mm} \xi \geq 0 \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\bar w.\bar x = b - 1 + \xi \hspace{5 mm} \forall \hspace{2 mm} (\bar x, y) \mid \bar x \in \mathbb{R}^n, \hspace{2 mm} \hspace{2 mm} y = -1, \hspace{2 mm} \xi \geq 0 \end{aligned}\end{equation*}

In either of the above supporting hyperplanes, $\xi \geq 0$ is known as the slack variable or error term that measures how far a particular point lies on the wrong side of its respective hyperplane.

The optimization problem to compute the a soft-margin decision surface $\bar w^*.\bar x = b^*$ is expressed as follows:

\begin{equation*}\begin{aligned} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.\bar x_i \geq b + 1 - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.\bar x_j \leq b - 1 + \xi_j \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \xi_i \geq 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \text{where} \hspace{2 mm} \mathbb{D} = \left\{ (\bar x_i, y_i) : \bar x_i \in \mathbb{R}^n, y_i \in \{+1, -1\} \right\} \hspace{2 mm} \text{ is Training Set} \end{aligned}\end{equation*}

Rewriting Supporting Hyperplane Constraints in Compact Form:

The above Supporting Hyperplane constraints can be formatted into compact form as follows:

For Supporting Hyperplane representing all training points labeled as +1, we can rewrite the dot product and free term as follows:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. (+1)\bar x_i \geq 1 + (+1)b - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

Now we can substitute $y_i$ in place of +1, since the inequality will still hold true:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. y_i\bar x_i \geq 1 + y_ib - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

Rearranging the terms, the constraint becomes:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. y_i\bar x_i - y_ib + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\hspace{0.7 in} \Rightarrow y_i(\bar w.\bar x_i - b) + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

For Supporting Hyperplane representing all training points labeled as -1, multiplying LHS and RHS by -1 and changing the inequality from $\leq$ to $\geq$ we get:

\begin{equation*}\begin{aligned} \hspace{0.7 in} (-1)( \bar w.\bar x_j) \geq (-1)(b - 1 + \xi_j) \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \Rightarrow \bar w.(-1)\bar x_j \geq 1 - \xi_j + (-1)b \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

Now we can substitute $y_j$ in place of -1, since the inequality will still hold true:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.y_j\bar x_j \geq 1 - \xi_j + y_jb \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

Rearranging the terms, the constraint becomes:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.y_j\bar x_j - y_jb + \xi_j -1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \Rightarrow y_j(\bar w.\bar x_j - b) + \xi_j -1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

As evident, the compacted constraints for both the supporting hyperplanes are in similar form. Hence the constraint can be now expressed as one constraint for all points in the training set $\mathbb{D}$ as follows:

\begin{equation*}\begin{aligned}\hspace{0.7 in} y_i(\bar w.\bar x_i - b) + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} y_i \in \{+1,-1\} \end{aligned}\end{equation*}

Optimization Problem with Constraints in Compact Form

\begin{equation*}\begin{aligned} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned}\hspace{0.7 in} y_i(\bar w.\bar x_i - b) + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} y_i \in \{+1,-1\} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \xi_i \geq 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \text{where} \hspace{2 mm} \mathbb{D} = \left\{ (\bar x_i, y_i) : \bar x_i \in \mathbb{R}^n, y_i \in \{+1, -1\} \right\} \hspace{2 mm} \text{ is Training Set} \end{aligned}\end{equation*}

Optimization Problem in Lagrange Form

\begin{equation*}\begin{aligned} \underset{\bar \alpha, \bar \beta}{\text{max}} \hspace{1 mm} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i - \sum\limits_{i=1}^m \alpha_i (y_i(\bar w.\bar x_i - b) + \xi_i - 1) - \sum\limits_{i=1}^m \beta_i \xi_i \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} \alpha_i \geq 0, \hspace{2 mm} \beta_i \geq 0, \hspace{2 mm} i = 1, \ldots, m. \end{aligned}\end{equation*}

Where there is a will, there is a way - Intuition behind Lagrange Optimization

In many constrained optimization problems (both maximization and minimization), the optimization problem is first converted into something called Lagrange form and then the optimal solution is evaluated.

The following is the general representation of a Constrained minimization problem:

\begin{equation*}\begin{aligned} \underset{\bar x}{\text{min}} & & f(\bar x) \hspace{5 mm} \bar x \in \mathbb{R}^n \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} g_i(\bar x) \geq 0, \; i = 1, \ldots, m. \end{aligned}\end{equation*}

$f(\bar x)$ is the Objective function and $g_i(\bar x)$ are Constraint functions.

Note that constraint functions $g_i(\bar x)$ are assumed to be linearly independent - meaning there is no linear relationship between themselves and hence one function cannot be expressed in terms of others scaled by some constant.

\begin{aligned}\Rightarrow \sum\limits_{i=1}^m c_i \times g_i(\bar x) \neq 0, \hspace{2 mm} \text{for some} \hspace{2 mm} c_1, c_2, \ldots, c_m \in \mathbb{R} \hspace{2 mm} \text{not all zero} \end{aligned}

The above optimization problem is then written in Lagrange optimization form as follows:

\begin{equation*}\begin{aligned}\underset{\bar \alpha}{\text{max}} \hspace{2 mm} \underset{\bar x}{\text{min}} & f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}\end{equation*}
\begin{aligned}\text{where } \alpha_i \geq 0, \; i = 1, \ldots, m. \end{aligned}

The values $\alpha_1, \alpha_2, \ldots \alpha_m$ in the above representation are called Lagrange Multipliers (a.k.a. KKT Multipliers) .

There seems to be two intuitions to above formulation of Lagrangian optimization - One simple enough to explain in a top-down manner and the second, somewhat related to first one but involves deep understanding. We will look at both the intuitions.

Intuition # 1:
If an optimal solution $\bar x^*$ exists for the constrained optimization problem, then at that optimal point, the gradient vector of the objective function $\nabla f(\bar x)$ is parallel to the gradient vectors of constraint functions $\nabla g_i(\bar x)$. This means that objective function gradient vector at the optimal point $\bar x^*$ can be expressed as some linear combination of constraint function gradient vectors that are parallel to it:

\begin{aligned}\nabla f(\bar x^*) = \alpha_1 \nabla g_1(\bar x^*) + \alpha_2 \nabla g_2(\bar x^*) + \ldots + \alpha_m \nabla g_m(\bar x^*), where \hspace{2 mm} \alpha_i \geq 0 \end{aligned}

\begin{aligned}\Rightarrow \nabla f(\bar x^*) - (\alpha_1 \nabla g_1(\bar x^*) + \alpha_2 \nabla g_2(\bar x^*) + \ldots + \alpha_m \nabla g_m(\bar x^*)) = 0 \end{aligned}

\begin{aligned}\Rightarrow \nabla f(\bar x^*) - \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) = 0 \end{aligned}

The above equation with terms $\nabla f(\bar x^*)$ and $\nabla g_i(\bar x^*)$ seems to suggest that there exists a function $L(\bar \alpha, \bar x)$ whose gradient $\nabla L$ is 0 at $\bar x^*$.

The function $L(\bar \alpha, \bar x)$ and its gradient $\nabla L$ can be represented more generally as follows:

\begin{aligned}\Rightarrow L(\bar \alpha, \bar x) = f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}

\begin{aligned}\Rightarrow \nabla L(\bar \alpha, \bar x) = \nabla f(\bar x) - \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x) \end{aligned}

Note that the values $\alpha_1, \alpha_2, \ldots \alpha_m$ in the above equations are represented as vector $\bar \alpha$ for convenience such that $\bar \alpha = (\alpha_1, \alpha_2, \ldots \alpha_m)$.

Also note that gradient $\nabla L$ is a vector and since $\bar x \in \mathbb{R}^n$, $\nabla L$ will have vector components $( \frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \ldots \frac{\partial}{\partial x_n} )$.

If $\nabla L = 0$ at $\bar x^*$ then each of the vector components $\frac{\partial}{\partial x_i}$ will be equal to 0.

Since $\bar x^*$ is unknown and since finding optimal $\bar x^*$ is our goal, the generalized function $L(\bar \alpha, \bar x)$ seems to give some confidence that there is possibility of finding optimal $\bar x^*$ by solving for point where $\nabla L = 0$. However, for the generalized function $L(\bar \alpha, \bar x)$, there can be many points $\bar x$ where $\nabla L = 0$. All such points are called saddle points or stationary points and the challenge then would be which one of them will be the optimal point $\bar x^*$ that will yield optimal solution to primary objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$.

The intuition postulates that under certain special circumstances (which we will quantify mathematically in later paragraphs), one of the saddle points $\bar x^*$ on $L(\bar \alpha, \bar x)$ can potentially be the optimal point for the main objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$.

Note that the values for multipliers $\alpha_i$ in function $L(\bar \alpha, \bar x)$ are unknown. Essentially, if we have to go about using the function $L(\bar \alpha, \bar x)$ to find the optimal point $\bar x^*$ then the challenge is two fold. We have to not only search for $\bar x^*$ but we have to also find optimal values for $\alpha_i$ simultaneously such that the desired saddle point $\bar x^*$ can be found.

This implies that we have to find an optimal combination of $\bar x^* = ( x_1^*, x_2^*, \ldots x_n^* )$ and $\bar \alpha^* = ( \alpha_1^*, \alpha_2^*, \ldots \alpha_m^*)$ with $\nabla L(\bar \alpha^*, \bar x^*) = 0$ such that $\bar x^*$ can potentially be the optimal point for main objective function $f(\bar x)$ subject to contraints $g_i(\bar x) \geq 0$.

Note that:

\begin{aligned}\nabla L(\bar \alpha^*, \bar x^*) = 0 \Rightarrow \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_1} = \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_2} = \ldots = \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_n} = 0 \end{aligned}

If we go with the above intuition solving for $(\bar \alpha^*, \bar x^*)$ using $L(\bar \alpha, \bar x)$, analytically speaking, the real challenge is that we have to find n variables in $\bar x^* = ( x_1^*, x_2^*, \ldots x_n^* )$ and m values in $( \alpha_1^*, \alpha_2^*, \ldots \alpha_m^*)$. Essentially there are (n+m) unknowns that we need to solve for. So far going with the notion of $\nabla L(\bar \alpha^*, \bar x^*) = 0$, we have arrived at the following n potential equations in (n+m) unknowns.

\begin{aligned}\nabla L = 0 \Rightarrow \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial x_2} = \ldots = \frac{\partial L}{\partial x_n} = 0 \end{aligned}

Missing are the m additional equations in (n+m) unknowns.

This is where the primary intuition can be revisited. Primary intuition suggested that under certain special conditions, one of the saddle points $\bar x^*$ on $L(\bar \alpha, \bar x)$ can potentially be the optimal point for the main objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$. Essentially this means that when such special conditions are met, the optimal value of objective function $f(\bar x)$ at $\bar x^*$ subject to constraints $g_i(\bar x^*) \geq 0$ is nothing but the value of function $L(\bar \alpha, \bar x)$ at $(\bar \alpha^*, \bar x^*)$.

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) = f(\bar x^*)\end{aligned}

However by formulation of generalized function $L(\bar \alpha, \bar x)$, we know that:

\begin{aligned} L(\bar \alpha, \bar x) = f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}

So the value of function $L(\bar \alpha, \bar x)$ at $(\bar \alpha^*, \bar x^*)$ is then given by:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) = f(\bar x^*) - \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) \end{aligned}

This can be re-written as:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) - f(\bar x^*) + \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

If \begin{aligned}L(\bar \alpha^*, \bar x^*) = f(\bar x^*)\end{aligned} then the above equation results in:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) - L(\bar \alpha^*, \bar x^*) + \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

\begin{aligned}\Rightarrow \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

Since we know that $\alpha_i^* \geq 0$ and constraint functions $g_i(\bar x)$ are linearly independent, the above equation will hold true only when each of the terms $\alpha_i^* \times g_i(\bar x^*)$ is equal to 0.

And Voila....!! We have the special conditions and the missing m equations that must be satisfied at $(\bar \alpha^*, \bar x^*)$:

\begin{aligned}\Rightarrow \alpha_i \times g_i(\bar x) = 0, i = 1, \ldots, m. \end{aligned}

Now using the following (n+m) equations in (n+m) unknowns, we can find an optimal combination of $\bar x^* = ( x_1^*, x_2^*, \ldots x_n^* )$ and $\bar \alpha^* = ( \alpha_1^*, \alpha_2^*, \ldots \alpha_m^*)$ with $\nabla L(\bar \alpha^*, \bar x^*) = 0$ such that $\bar x^*$ can potentially be the optimal point for main objective function $f(\bar x)$ subject to contraints $g_i(\bar x) \geq 0$.

\begin{aligned}\frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial x_2} = \ldots = \frac{\partial L}{\partial x_n} = 0 \hspace{2 mm} and \hspace{2 mm} \alpha_i \times g_i(\bar x) = 0, i = 1, \ldots, m. \end{aligned}

Hence this intuition#1 forms the basis for the formulation of Lagrangian Optimization problem from the Constrained Optimization problem.

• In case of Minimization problems, the standard convention is to use constraints as $g_i(\bar x) \geq 0$. In this case, the gradient vector of objective function $\nabla f$ and
• and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in same direction and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\geq 0$. If constraints are declared as $g_i(\bar x) \leq 0$, gradient vector of objective function $\nabla f$ and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in opposite directions and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\leq 0$ to compensate for opposite directions
• Alternatively, in case of Minimization problems, if the constraints are declared as $g_i(\bar x) \leq 0$, gradient vectors linear combination can be expressed as $\nabla f(\bar x^*) = -\sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ to account for opposite directions and $\alpha_i$ will be $\geq 0$
• In case of Maximization problems, the standard convention is to use constraints as $g_i(\bar x) \leq 0$. In this case, the gradient vector of objective function $\nabla f$ and
• and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in same direction and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\geq 0$. If constraints are declared as $g_i(\bar x) \geq 0$, gradient vector of objective function $\nabla f$ and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in opposite directions and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\leq 0$ to compensate for opposite directions
• Alternatively, in case of Maximization problems, if the constraints are declared as $g_i(\bar x) \geq 0$, gradient vectors linear combination can be expressed as $\nabla f(\bar x^*) = -\sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ to account for opposite directions and $\alpha_i$ will be $\geq 0$