## Three Customer Buying Behaviors Omnichannel Should Solve

It seems like omnichannel is discussed everywhere in retail these days.  However, even with all this talk, there isn't significant content about the specific customer buying behaviors that omnichannel should solve.

So, to get the discussion going, we'll share three customer buying behaviors that omnichannel should focus on and that, when addressed, should generate incremental revenue for the merchant.

### Definition of Omnichannel

Before we share the buying behaviors, let's ground ourselves with a definition.  Although there are several floating around, we define omnichannel as:

Omnichannel is a customer focused approach that works alongside the customer as he or she interacts with the merchant.  With omnichannel, when a customer engages in a particular channel, his or her prior interactions in other channels are known and that knowledge is used to optimize the interaction in the current channel.  Omnichannel should also leverage knowledge of one other huge "channel" - what's going on in the outside world.  This includes information like location, weather, traffic and national / local events.

### Three Customer Buying Behaviors Omnichannel Should Solve

#### 1. Buy online, pick up in store

Ok, this is definitely not new. Many merchants, like The Container Store, have implemented this customer solution as it delivers real value to the shopper. The shopper gains the efficiency of the website purchase process and the same day fulfillment of an in store visit while minimizing the time spent at the store. Implementation requires in store operational changes; namely, establishing a process to receive these orders from the online channel, having staff available to package the order, and creating a customer service line where customers pick up their orders. In addition, the merchant must work through supply chain and inventory management requirements in order to ensure items bought online are available at the local store.

#### 2. Right offer, right time, right place

As the shopper fluidly moves back and forth across channels, he or she seeks information during the buying process. This includes marketing offers, as a way to identify saving opportunities, but the offers must be relevant and timely.

Ideally, a merchant would want a 360 degree view of all the shopper activity and be able to uniquely identify the shopper at each and every channel interaction.  This would enable the merchant to best select which offer to present and ensure consistency across channels.

However, the 360 degree view will not be available in all cases and for every merchant. Fortunately, a full view is not needed to optimize marketing offers. Classification algorithms coupled with real time decision capabilities can take whatever data is known at the channel interaction (e.g., customer transaction history, mobile identifier, IP address, profile data) and assign the customer to a micro segment. Optimization algorithms then decide which offers are best for each micro segment. Again, knowing exactly who the customer is isn't necessary because segment specific offers will perform far better than a "one size fits all" approach.
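The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the segment names, the spend threshold and the conversion rates are all hypothetical, a hand-written rule stands in for a trained classification model, and picking the highest observed conversion rate stands in for the optimization step.

```python
from typing import Optional

# Hypothetical conversion rates per (micro segment, offer); illustrative numbers only.
CONVERSION_RATES = {
    "loyal_high_spend": {"free_shipping": 0.04, "10_percent_off": 0.09},
    "known_device_new": {"free_shipping": 0.06, "10_percent_off": 0.05},
    "anonymous": {"free_shipping": 0.03, "10_percent_off": 0.02},
}


def assign_segment(total_spend: Optional[float], device_id: Optional[str]) -> str:
    """Stand-in for a classifier: use whatever data is known at the interaction."""
    if total_spend is not None and total_spend >= 500:
        return "loyal_high_spend"
    if device_id is not None:
        return "known_device_new"
    return "anonymous"


def best_offer(segment: str) -> str:
    """Stand-in for the optimizer: pick the offer with the highest rate."""
    rates = CONVERSION_RATES[segment]
    return max(rates, key=rates.get)
```

Note that `assign_segment` still returns a usable segment when nothing is known about the shopper, which mirrors the point that full identification is not required.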

#### 3. Research in one channel, buy in another

More frequently, shoppers research items online and then go to a store to make the actual purchase - i.e., webrooming.   However, this behavior also works in the opposite direction where a shopper may touch, feel and try on a product in a store and then go home to shop online for the best deal - i.e., showrooming.

In the online to offline scenario, the merchant would ideally want to identify the customer as he or she walks into the store and greet the shopper with a message - e.g., "the product you've researched online is sold in aisle 10, here's a 5% off coupon, and you should consider purchasing this additional item with it".

To execute tactics like this, the merchant must possess exact knowledge and tracking of the customer across channels. The Loyalty ID of a merchant's loyalty program is the best way to enable this. Of course, the merchant must motivate shoppers through the loyalty program to "mark" themselves along their buying process. In addition, real time decision capabilities are necessary in order to know when a shopper is interacting with a channel and intelligently decide the optimal action to take based on prior interactions.

What do you think?  Are there other customer buying behaviors omnichannel should solve?

## A New Way to Engage Customers With Real Time Decisions

Most merchants miss a huge opportunity to intelligently engage in real time with their customers.  Capitalizing on this missed opportunity could deliver 3% to 5% revenue growth through increased transactions and reduction of customer churn.

Traditionally, merchants have engaged shoppers through pricing, promotion and loyalty program tactics delivered through marketing campaigns and customer service.  Highly effective marketing organizations would take the data and insight from the customer interactions to determine how best to optimize future marketing efforts (e.g., target audience, message, offers, promotion channel).

These practices are tried and true. Improving execution against these tactics will continue to deliver results and no doubt should remain a focus area for marketing organizations, especially with access to Big Data as an input into the analytics. However, this engagement model is static by nature with periodic batch updates.

Real time decision capabilities give marketing organizations an additional way to engage customers. This engagement is based on "in the moment" intelligence that dynamically determines an optimal action to take in real time. Although current personalization tactics are a simple example, robust real time decision capabilities go well beyond them, delivering significantly higher effectiveness.

### Where Real Time Decisions Sit in the Customer Engagement Model

Role of Real Time Decisions in Customer Engagement

Whether initiated by a marketing campaign or the customer, "events" occur all the time where a customer interacts with the merchant.  A basic example of "events" is when a customer enters a merchant channel - i.e., visit website, use mobile app, enter store, call contact center.  At that moment, the merchant has the opportunity to optimize the interaction with the customer.  Desired outcomes could be any combination of

• Close a sale that was initiated but not completed in another channel
• Optimize offers presented to drive transactions
• Recognize key customer lifecycle milestones
• Deliver excellent customer service - either surprise & delight or recovery actions.

In addition, once a customer interaction reaches an outcome (e.g., purchase, loyalty accrual/redemption, customer service resolution, abandonment), the merchant has another opportunity to re-engage the customer to drive additional desirable behaviors or outcomes.  This cycle could repeat continuously.

Beyond just capturing the "event" itself, real time decision capabilities leverage additional data to help determine an optimal action.  This data could be from both internal and external sources - e.g., customer profile, transaction history, prior channel activity, weather, geo-location.  Sophisticated and flexible decision intelligence frameworks (e.g., machine learning algorithms, complex event processing, optimization techniques, rules engine) are prerequisite tools to derive intelligence (for more detail on required real time decision capabilities see How to Execute Real Time Decisions).

When you step back, real time decisions is an automated, algorithmic means of engaging customers in much the same way store clerks have for ages. However, because of the automation, the capabilities can pull in large amounts of data to feed "intelligence" and work in 21st century e-channels like mobile.

Tell us what you think. We'd love to hear.

## What Retailers Can Learn From Airlines About Omnichannel

The financial troubles of airlines are well documented. So what can retailers learn from a historically struggling industry about how to deliver omnichannel, an innovation that has promise to drive incremental revenue and deliver a fantastic customer experience?  Well, actually quite a lot.

### Airline Omnichannel Requirements

Airlines were executing omnichannel long before the omnichannel buzz word existed. Fulfillment of an airline's product (getting a person from point A to point B) is arguably one of the most complex, high touch consumer experiences among relatively frequent purchases. After a ticket is purchased, a traveler could have 10 or more customer service interactions before the final destination is reached on issues like

• Seat assignments
• Special requests (e.g., meals, wheelchair)
• Changing travel plans
• Flight delays and cancellations
• Flight check in
• Boarding plane
• In flight experience
• Lost baggage and other baggage claims.

It gets even more complicated when considering the number of channels which must be in synch with one another, including:

• Distributors (i.e., travel agencies like American Express and Expedia)
• Website
• Mobile app
• Call center
• Email and text for proactive communications
• Airport kiosks
• Airport customer service reps
• Flight attendants.

Each of these channels performs some combination of selling air travel, selling ancillary products and handling a myriad of customer service issues.

Further, a customer often interacts with multiple channels for a specific need.  For example, if a flight is delayed, the traveler may visit the website, check the mobile app, contact the call center, receive an email, look at the airport kiosk and speak with a gate agent all for the same issue.

Putting aside everyone's travel horror stories (and we all have them), airlines do a lot of things right in this challenging environment.

### Airline Lessons for Retailer Omnichannel

There are many lessons a retailer can learn from airlines about omnichannel.  We'll share our four biggest insights.

##### 1. Effective use of unique identifiers

Like other retail sectors, airlines have a unique transaction ID for purchased tickets called a passenger name record or PNR.  The PNR ensures linkage of customer service interactions across the channels and through the travel experience. In addition, loyalty frequent flier numbers connect transactions over time to customers.  The loyalty ID has been used effectively for personalization and customer life cycle management.  Leveraging these identifiers, especially loyalty IDs, can be an effective enabler to deliver omnichannel for retailers.  For example, loyalty IDs can be used to link customer research of products in digital channels with a purchase transaction in a brick & mortar store.
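As a rough illustration of that last example, the Python sketch below uses a loyalty ID to link products viewed online with products bought in a store. The record layout, field names and sample values are hypothetical and only meant to show the join on the shared identifier.

```python
from collections import defaultdict

# Hypothetical event records; field names are illustrative assumptions.
online_views = [
    {"loyalty_id": "L100", "sku": "TENT-2P"},
    {"loyalty_id": "L100", "sku": "STOVE-1"},
    {"loyalty_id": "L200", "sku": "BOOTS-9"},
]
store_purchases = [
    {"loyalty_id": "L100", "sku": "TENT-2P"},
    {"loyalty_id": "L200", "sku": "SOCKS-3"},
]


def researched_and_bought(views, purchases):
    """Return (loyalty_id, sku) pairs that were viewed online and bought in store."""
    viewed = defaultdict(set)
    for view in views:
        viewed[view["loyalty_id"]].add(view["sku"])
    return [
        (p["loyalty_id"], p["sku"])
        for p in purchases
        if p["sku"] in viewed.get(p["loyalty_id"], set())
    ]


linked = researched_and_bought(online_views, store_purchases)
```

Without the shared loyalty ID there would be no key to join the two channels on, which is why the identifier matters more than the matching logic itself.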

##### 2. "Intelligent" middleware investments to speed cross channel deployment

Most companies have legacy technology systems that are difficult to enhance. Airlines are no different and some have learned the hard way the cost of not investing in "intelligent" middleware that can act as channel "brains". Without channel "brains", new cross-channel features become a different project in each channel. This drives significant costs and elongates timelines. Since investment dollars are constrained, often functionality is implemented in one channel but not another. This leads to missed revenue opportunities and poor customer experiences. Instead, with "intelligent" middleware, new functionality can be launched much more like a single deployment, enabled in all channels at once.

##### 3. Leveraging events for insights to trigger action

In several contexts, airlines have leveraged the occurrence of an event to take action in real time. Events are essentially customer digital footprints that can be captured at every touch point and are usually initiated by a customer action; examples include

• Customer visits website
• Customer uses mobile app
• Customer visits store
• Customer purchases product
• Customer logs a complaint
• Retailer geo-locates customer.

By applying predictive analytics and correlation techniques on events, intelligence can be extracted in real time that can be used for instantaneous decision making and subsequent actions.  For example, airlines might take several immediate actions if a flight delay causes a missed connecting flight - rebook the traveler on next available flight, ensure the bags of traveler are transferred accordingly, offer a food and hotel compensation voucher to the traveler.

##### 4. Data management and data virtualization

Data management and access are critical to the execution of omnichannel.  Airlines have many legacy systems with each acting as a source of truth for different elements of essential data fields.  It is a natural tendency to undergo a large scale enterprise data management project to bring all of these sources of information together into one system (e.g., CRM or MDM ).  However, going down this path will significantly delay omnichannel delivery.  Instead, airlines that have enabled virtualized data stores fed from the different systems have improved time to market for critical omnichannel functionality like real time decisions and offer optimization.

Omnichannel is relatively new for retail and the sector is still trying to define the opportunity. But retail has the opportunity to learn from the lessons of airlines, an industry that has been executing omnichannel for decades.

Tell us what you think. We'd like to hear.

## The Value of Big Data - Defining It Once and For All

Big Data is a huge buzz word. Like so many "next big things", there is often misunderstanding about what it is and its value. This leaves many companies and industries with an open and lingering question about what they should be doing with Big Data.

However, unlike other "next big things", Big Data is not overhyped in its possibilities.  It truly opens up a tremendous frontier of business intelligence.  Unfortunately, given the misunderstandings, it is not always clear how to take advantage of it.

## What is Big Data

Before answering the value of Big Data, it's worth a quick summary of what it is.  Big Data is generally thought of as the "3 Vs"; i.e., data that has

• Volume (terabytes & beyond)
• Velocity (streaming real time)
• Variation (structured & unstructured).

Twitter is a great example. Its data is very large, generated in real time and unstructured (admittedly hashtags and handles provide some structure but nowhere near that of a traditional relational database).

It's also important to understand that not all "3 Vs" need to be present.  The concept of Big Data is relevant if Big Data processing technologies are needed to unlock the value of a company's data or take advantage of external data.

At a high level, Big Data processing technologies offer two key capabilities. First, they provide a novel way to store data that is especially well suited for any of the "3 Vs", with a focus on fast access for queries and updates. Second, they provide Map / Reduce functionality that identifies relationships between unstructured data elements and helps in building 'keys' or 'indices' that are needed for fast access and cross-data relationships.
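To make the Map / Reduce idea concrete, here is a small single-process Python sketch that builds an index from hashtags to message IDs. It only illustrates the map and reduce phases; it is not a distributed framework, and the sample messages and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical unstructured records: (message id, free text).
messages = [
    (1, "Loving the new #sale at the store"),
    (2, "Flight delayed again #travel"),
    (3, "Big #sale on #travel gear"),
]


def map_phase(record):
    """Emit (hashtag, message id) pairs for one record."""
    message_id, text = record
    return [(word.lower(), message_id) for word in text.split() if word.startswith("#")]


def reduce_phase(pairs):
    """Group message ids by hashtag to form an index."""
    index = defaultdict(list)
    for key, message_id in pairs:
        index[key].append(message_id)
    return dict(index)


pairs = [pair for record in messages for pair in map_phase(record)]
hashtag_index = reduce_phase(pairs)
```

In a real deployment the map step would run in parallel across many machines and the framework would handle grouping by key; the resulting index plays the role of the 'keys' mentioned above.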

## Big Data's "4th V" - Value

It is critical to emphasize that processing Big Data is not an objective unto itself and value is not created by just implementing Big Data processing and storage.  Big Data must be directly used to help improve core objectives such as

• Optimize offers (marketing offers, cross/up sell, product configuration)
• Improve customer service (omnichannel, surprise & delight, recovery)
• Predict and prevent customer churn
• Improve inventory management / product forecasting.

Big Data can be leveraged to accomplish this and create value if used as an input into any of the following business intelligence processes within your organization:

1. Collection and maintenance of enterprise data assets
2. Batch analytics and static decisions
3. Real time analytics and decisions

Each process is described in more detail below, along with how Big Data can create value in it. We've shared a marketing related example for each but obviously there are many other operational processes that leverage business intelligence and can benefit from Big Data.

#### Enterprise Data Assets

By processing Big Data, intelligence can be extracted to build data assets over time which plug into existing enterprise data management solutions like customer relationship management (CRM).  These new data elements help improve performance for any process that leverages CRM data (e.g., marketing campaign management).  It is analogous to the objective of building marketing databases that contain customer profile and email address.  However, with Big Data, the profile information that is built contains detailed digital activity and social activity, sentiment and influence.  By knowing these attributes, marketing campaign target lists, channel selection and offer/content can all be improved.

#### Batch Analytics and Static Decisions

Batch analytics and static decisions are traditionally how companies make decisions.  Historical data is compiled and analysis is done to inform decisions like marketing mix / planning decisions.  However, with Big Data, a company can now bring in granular data about digital interactions that could be terabytes in size.  This granular, cross channel data enables much more sophisticated and accurate marketing optimization models leading to more effective marketing campaigns and resource allocation.

#### Real Time Analytics and Decisions

Real time decisions is the ability to intelligently engage customers and improve outcomes based on real time events. It requires processing events as they occur, combining the event with other valuable data, gaining intelligence from the data and deciding on an action to improve the customer interaction. All of this is done in real time. Big Data opens up new data sources like granular data on digital interactions and external data like weather to feed algorithms that optimize which marketing offer to present on a customer by customer basis.

In the end, Big Data is similar to a lot of innovations.  It is a new and innovative way to improve solutions for existing core objectives like marketing offer optimization.  Perhaps, Big Data's value is not mysterious after all.

What do you think?  We'd love to hear your thoughts.

## How to Deliver Omnichannel Real Time Decisions

Omnichannel execution has become imperative for retailers as consumers increasingly combine their shopping activity across online, brick & mortar and mobile channels.

With omnichannel, when a customer engages in any channel, the retailer is aware of their prior interactions in other channels and uses that knowledge to optimize the interaction in the current channel.

A well executed omnichannel strategy requires real time decision capabilities - the ability to process events as they occur, combine the event with other valuable data, gain intelligence from the data and decide on an action to improve the customer interaction. All of this is done in real time.

#### Real Time Decision Process

Real time decisions process

A Real Time Decision Process can be viewed as five distinct steps.

##### Events

Execution of omnichannel real time decisions is triggered by an event.  An event can be any number of things but is usually initiated by a customer action; examples include,

• Customer visits website
• Customer uses mobile app
• Customer visits store
• Customer calls contact center
• Retailer geo-locates customer.

The event presents an opportunity to engage and must be captured in order to initiate the real time decision process.

##### Virtualized Data

Sometimes, knowledge of the event is sufficient information to take action. More often, additional data must be leveraged to improve intelligence. Many different types of information are potentially needed, including

• Customer profile
• Transaction/sales history
• Channel interaction history
• Social activity history
• External data like weather, traffic, national/local events.

Data virtualization is the technical process that integrates these disparate data sources in real time into a usable format.
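One way to picture data virtualization is as a thin layer that queries each source at request time and merges the results, rather than copying everything into one store first. The Python sketch below is a simplified illustration; the three lookup functions, their return values and the class name are hypothetical stand-ins for real source systems.

```python
from typing import Callable, Dict


# Hypothetical source systems, each the source of truth for different fields.
def crm_lookup(customer_id: str) -> dict:
    return {"name": "A. Shopper", "tier": "gold"}


def orders_lookup(customer_id: str) -> dict:
    return {"last_purchase": "running shoes"}


def weather_lookup(customer_id: str) -> dict:
    return {"weather": "rain"}


class VirtualCustomerView:
    """Combine sources at request time instead of copying them into one store."""

    def __init__(self, sources: Dict[str, Callable[[str], dict]]):
        self.sources = sources

    def get(self, customer_id: str) -> dict:
        view = {"customer_id": customer_id}
        for lookup in self.sources.values():
            view.update(lookup(customer_id))
        return view


view = VirtualCustomerView(
    {"crm": crm_lookup, "orders": orders_lookup, "weather": weather_lookup}
).get("C1")
```

Adding a new source here means registering one more lookup function, which hints at why this approach can be faster to deliver than consolidating every system into a single database.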

##### Intelligence

Intelligence must be derived based on the event and virtualized data to determine the optimal action.  Predictive analytic capabilities are necessary and a wide range of decision frameworks must be available, including

• Rules engine
• Complex event processing
• Classification / clustering
• Optimization
• Machine learning
• Artificial intelligence.

The breadth of decision frameworks is necessary because different business objectives require different analytical approaches.  For example, a rules engine works great when recognizing a customer for a milestone.  Likewise, event processing is well suited for identifying potential customer dis-service scenarios.  Finally, optimization techniques work well when making decisions about which promotions to place in front of the customer.
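As an example of the simplest of these frameworks, the Python sketch below shows a tiny rules engine that recognizes customer milestones. The rule names, thresholds, customer fields and action names are all hypothetical; a production rules engine would typically load rules from configuration rather than hard-code them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    action: str


# Hypothetical milestone rules; thresholds and action names are assumptions.
RULES: List[Rule] = [
    Rule(
        "tenth_purchase",
        lambda c: c.get("purchase_count") == 10,
        "send_thank_you_offer",
    ),
    Rule(
        "one_year_member",
        lambda c: c.get("member_days", 0) >= 365 and not c.get("anniversary_sent", False),
        "send_anniversary_message",
    ),
]


def decide(customer: dict) -> Optional[str]:
    """Return the action of the first matching rule, or None if no rule matches."""
    for rule in RULES:
        if rule.condition(customer):
            return rule.action
    return None
```

Rules like these are easy for business users to read and audit, which is part of why a rules engine fits milestone recognition better than a statistical model would.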

##### Action

Once determined, the decision must be integrated with a customer facing channel or business process in order to impact the outcome in real time.  The types of actions should be related to achieving core objectives such as

• Offer optimization
• Winning/completing the sale
• Customer lifecycle milestones
• Customer service (cross channel coordination, surprise & delight, recovery).

##### Feedback

Feedback takes two forms.  First, the outcome of the action/decision is fed back to the algorithms used in the Intelligence step.  This can be done in real time for online learning algorithms or stored and leveraged in a batch mode for off line learning algorithms.

Second, data and insight from the Real Time Decision Process is fed into enterprise data management processes like customer relationship management (CRM) and customer data management (CDM).
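The first form of feedback can be sketched as a running estimate that is updated after every outcome. In the Python example below, a simple per-offer acceptance rate stands in for an online learning algorithm; the class and method names are hypothetical.

```python
class OfferStats:
    """Running acceptance estimate per offer, updated after each outcome."""

    def __init__(self):
        self.shown = {}
        self.accepted = {}

    def record(self, offer: str, accepted: bool) -> None:
        """Feed one outcome back: the offer was shown and accepted or not."""
        self.shown[offer] = self.shown.get(offer, 0) + 1
        self.accepted[offer] = self.accepted.get(offer, 0) + int(accepted)

    def rate(self, offer: str) -> float:
        """Current acceptance rate; 0.0 for offers that were never shown."""
        shown = self.shown.get(offer, 0)
        return self.accepted.get(offer, 0) / shown if shown else 0.0


stats = OfferStats()
stats.record("free_shipping", True)
stats.record("free_shipping", False)
```

A batch (off line) variant would instead append each outcome to storage and recompute the estimates on a schedule.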

It's important to note that real time decisions and enterprise data management processes are related but separate: each relies on the other as an input. As a consequence, implementation of real time decisions does not require a multi-year, multi-million dollar enterprise data management project to be complete.

Execution of real time decisions requires a complex set of capabilities that includes predictive analytics, data virtualization and real time decision software. However, when done well, real time decisions delivers the full promise of omnichannel benefits.

What do you think?  We'd love to hear your thoughts.

## Importance of Real Time Decisions in Omnichannel Marketing

With the growing trend of consumers combining their shopping activity across online, brick & mortar and mobile channels, significant discussion has arisen in the retail industry about the need for omnichannel marketing.

### What is Omnichannel

While perhaps a single definition doesn't yet exist, one way of thinking about omnichannel marketing is that it is centered around the customer.  Historically, multi-channel strategies have tried to ensure brand consistency across channels and optimize performance in each channel based on respective strengths.  Multi-channel is more of an inward focused approach.

With omnichannel, when a customer engages in a particular channel, their prior interactions in other channels are known and that knowledge is used to optimize the interaction in the current channel. Omnichannel is more of a customer focused approach that works alongside the customer as they interact with the retailer.

Taking this one step further, omnichannel should also leverage knowledge of one other huge "channel" - what's going on in the outside world.  This includes information like location, weather, traffic and national / local events which can have as much impact on optimizing the customer interaction as anything else.

High level concept of omnichannel

### Omnichannel Execution

In order to create maximum value, omnichannel strategies must be directly applied to improving performance of core business objectives.  Otherwise, retailers run the risk of implementing a very complex initiative without an end goal in sight. For the marketer, a well executed omnichannel strategy can improve

• Offer optimization
• Conversions / transactions of sales leads
• Customer lifecycle management - recognition of events & milestones
• Customer service - cross channel coordination, surprise & delight tactics, recovery.

Omnichannel implementation requires many capabilities including enterprise data management solutions, predictive analytics and real time decision software.  It may also require changes to organizational design and operational processes.

### Role of Real Time Decisions in Omnichannel

Real time decisions is critical for a well executed omnichannel strategy.  Omnichannel leverages knowledge of prior interactions across channels to optimize the interaction in the current channel.  In most cases, the opportunity for the retailer to interact with the customer in the current channel is "now" or real time.

Execution of omnichannel and real time decisions is triggered by an event.  An event can be any number of things but is usually initiated by a customer action; examples include

• Customer visits website
• Customer visits brick & mortar store
• Customer engages in a social channel
• Retailer determines geo-location of customer.

A retailer must be able to capture this event and leverage additional data like prior interactions across channels in order to derive intelligence. A wide range of tools, from sophisticated predictive analytics like machine learning algorithms to more straightforward business rules engines, should be leveraged to derive intelligence. From this intelligence, a decision is determined for the optimal action. The decision must be integrated into a customer facing channel or business process in order to impact the outcome in real time.

The entire process, outlined above, is called real time decisions.   As you can see, it is central to enabling the knowledge and coordination of actions across channels.  In the next post, we'll talk more about the execution of omnichannel real time decisions.

In the meantime, what do you think?  We'd love to hear your thoughts.

## How Loyalty Programs Should Leverage Big Data

With so much buzz around Big Data, it seems like virtually every industry is scrambling to figure out what to do with it.  I had the honor of speaking at the Loyalty Americas 2013 conference this week to experts in the Travel and Loyalty industries about that very topic, namely "how should Loyalty programs leverage Big Data".

Since the topic generated so much interest and a lot of great questions, I thought I would share the highlights of what was discussed.

First, loyalty programs may not have the Volume part of Big Data's three Vs (volume, variation, velocity) but they definitely have Variation and Velocity. As such, Big Data technology solutions still have relevance.

However, processing Big Data is not an objective unto itself.  It's a means to an end.  For Loyalty, Big Data capabilities open up new sources of data that are fast moving and sometimes unstructured (e.g., social data).  The real question is what do the new sources of data help enable.

Regardless of Big Data, Loyalty's key levers to motivate, engage and reward their members are:

• Points accrual offers and opportunities
• Points redemption offers and opportunities
• Cross / up sell offers of merchant's core products
• Customer service - either 'surprise and delight' or recovery actions.

The Loyalty industry should think of Big Data as an exciting new input to better execute against one of the aforementioned levers.  I stress the word 'input' because Big Data is just that.  Predictive analytics / decision frameworks that leverage Big Data are the true enabler of value.

Yes, processing Big Data into a usable format is critical and foundational.  However, you must still extract 'intelligence' out of the data in order to take action.  That's where predictive analytics / decision frameworks like complex event processing, machine learning algorithms, optimization algorithms, rules engines, etc. come in.  To capture value (i.e., improved execution against one of the value levers), the output of the enablers must be plugged into business processes in order to take action.

For Loyalty, the goal must be to take action based on the input in real time.  Loyalty is all about engagement with members to gain and retain stickiness.  There is a huge opportunity to engage with your membership in real time based on actions they are taking on web, mobile and social channels and experiences in your operation.  It's the loyalty program of the future and it's here now.

With that said, there is also tremendous value to be gained from using Big Data to build a deeper understanding of members over time. In parallel, loyalty programs should be enhancing their CRM to capture these data insights, but this effort should not replace real time decisions and actions.

What do you think about Loyalty and Big Data?  We'd love to hear your thoughts.

## Math behind Soft-Margin SVM Classifier - Sequential Minimal Optimization

The decision function for an SVM classifier is given by:

\begin{equation*}\begin{aligned}\hat{f}(\bar x) = sgn(\bar w.\bar x - b) \hspace{5 mm} \bar x \in \mathbb{R}^n \hspace{5 mm} and \hspace{2 mm} \bar w \in \mathbb{R}^n\end{aligned}\end{equation*}

$\bar w$ is the normal vector and b is the offset term for the decision surface $\bar w.\bar x = b$.
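As a small illustration of the decision function, the Python sketch below evaluates $sgn(\bar w.\bar x - b)$ for a hand-picked $\bar w$ and $b$. The function names, the sample values and the choice to map a value of exactly zero to +1 are assumptions made for the example.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(w, b, x):
    """Return sgn(w.x - b); a value of exactly 0 is mapped to +1 by assumption."""
    return 1 if dot(w, x) - b >= 0 else -1


# Hand-picked parameters: the decision surface is x1 + x2 = 1.
w, b = [1.0, 1.0], 1.0
labels = [predict(w, b, x) for x in ([2.0, 2.0], [0.0, 0.0])]
```

The first point lies on the side the normal vector points to and is labeled +1; the second lies on the opposite side and is labeled -1.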

The corresponding supporting hyperplanes are as follows:

\begin{equation*}\begin{aligned}\bar w.\bar x = b + 1 - \xi \hspace{5 mm} \forall \hspace{2 mm} (\bar x, y) \mid \bar x \in \mathbb{R}^n, \hspace{2 mm} \hspace{2 mm} y = +1, \hspace{2 mm} \xi \geq 0 \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\bar w.\bar x = b - 1 + \xi \hspace{5 mm} \forall \hspace{2 mm} (\bar x, y) \mid \bar x \in \mathbb{R}^n, \hspace{2 mm} \hspace{2 mm} y = -1, \hspace{2 mm} \xi \geq 0 \end{aligned}\end{equation*}

In either of the above supporting hyperplanes, $\xi \geq 0$ is known as the slack variable or error term that measures how far a particular point lies on the wrong side of its respective hyperplane.

The optimization problem to compute a soft-margin decision surface $\bar w^*.\bar x = b^*$ is expressed as follows:

\begin{equation*}\begin{aligned} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i \right) \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\text{subject to: }\end{aligned}\end{equation*}
\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.\bar x_i \geq b + 1 - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.\bar x_j \leq b - 1 + \xi_j \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \xi_i \geq 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \text{where} \hspace{2 mm} \mathbb{D} = \left\{ (\bar x_i, y_i) : \bar x_i \in \mathbb{R}^n, y_i \in \{+1, -1\} \right\} \hspace{2 mm} \text{ is Training Set} \end{aligned}\end{equation*}
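To get a feel for the objective, the Python sketch below evaluates it for a fixed $\bar w$ and $b$ on a toy training set. For fixed $\bar w$ and $b$, the constraints above require $\xi_i \geq 1 - (\bar w.\bar x_i - b)$ when $y_i = +1$ and $\xi_j \geq 1 + (\bar w.\bar x_j - b)$ when $y_j = -1$, together with $\xi \geq 0$, so the smallest feasible slack for a point is $\max(0, 1 - y(\bar w.\bar x - b))$. The function names and sample data are assumptions for illustration; this only evaluates the objective and does not solve the optimization problem.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def min_slack(w, b, x, y):
    """Smallest slack satisfying the constraint for (x, y) at fixed w and b."""
    return max(0.0, 1.0 - y * (dot(w, x) - b))


def soft_margin_objective(w, b, data, C):
    """0.5 * w.w + C * sum of slacks, with each slack at its smallest feasible value."""
    slacks = [min_slack(w, b, x, y) for x, y in data]
    return 0.5 * dot(w, w) + C * sum(slacks)


# Toy training set: two points outside the margin, one inside it.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], 1)]
value = soft_margin_objective([1.0, 0.0], 0.0, data, C=10.0)
```

Here only the third point needs slack (0.5), so the objective is 0.5 from the margin term plus 10 × 0.5 from the penalty term. A larger $C$ makes that violation more expensive relative to widening the margin.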

Rewriting Supporting Hyperplane Constraints in Compact Form:

The above Supporting Hyperplane constraints can be formatted into compact form as follows:

For Supporting Hyperplane representing all training points labeled as +1, we can rewrite the dot product and free term as follows:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. (+1)\bar x_i \geq 1 + (+1)b - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

Now we can substitute $y_i$ in place of +1, because $y_i = +1$ for every point covered by this constraint, so the inequality still holds:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. y_i\bar x_i \geq 1 + y_ib - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

Rearranging the terms, the constraint becomes:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. y_i\bar x_i - y_ib + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\hspace{0.7 in} \Rightarrow y_i(\bar w.\bar x_i - b) + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

For Supporting Hyperplane representing all training points labeled as -1, multiplying LHS and RHS by -1 and changing the inequality from $\leq$ to $\geq$ we get:

\begin{equation*}\begin{aligned} \hspace{0.7 in} (-1)( \bar w.\bar x_j) \geq (-1)(b - 1 + \xi_j) \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \Rightarrow \bar w.(-1)\bar x_j \geq 1 - \xi_j + (-1)b \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

Now we can substitute $y_j$ in place of -1, because $y_j = -1$ for every point covered by this constraint, so the inequality still holds:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.y_j\bar x_j \geq 1 - \xi_j + y_jb \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

Rearranging the terms, the constraint becomes:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.y_j\bar x_j - y_jb + \xi_j -1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \Rightarrow y_j(\bar w.\bar x_j - b) + \xi_j -1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j) \in \mathbb{D}, \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

As is evident, the compacted constraints for both supporting hyperplanes have the same form. Hence they can now be expressed as a single constraint for all points in the training set $\mathbb{D}$ as follows:

\begin{equation*}\begin{aligned}\hspace{0.7 in} y_i(\bar w.\bar x_i - b) + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} y_i \in \{+1,-1\} \end{aligned}\end{equation*}
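
The equivalence claimed above can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the grid of test values are hypothetical, and `wx` stands for an already computed dot product $\bar w.\bar x$ for a single point.

```python
def two_case_ok(wx, b, xi, y):
    """Original pair of constraints, selected by the label y."""
    if y == +1:
        return wx >= b + 1 - xi   # supporting hyperplane for label +1
    return wx <= b - 1 + xi       # supporting hyperplane for label -1


def compact_ok(wx, b, xi, y):
    """Single compact constraint y(w.x - b) + xi - 1 >= 0."""
    return y * (wx - b) + xi - 1 >= 0


# All test values are multiples of 0.25, so the arithmetic is exact
# in floating point and boundary cases compare reliably.
grid = [k / 4 for k in range(-12, 13)]
slacks = [0.0, 0.5, 1.0, 2.5]

mismatches = [
    (wx, b, xi, y)
    for wx in grid
    for b in grid
    for xi in slacks
    for y in (+1, -1)
    if two_case_ok(wx, b, xi, y) != compact_ok(wx, b, xi, y)
]
print(len(mismatches))  # number of disagreements between the two forms
```

On this grid the two formulations agree on every combination, including the boundary cases where the constraint holds with equality.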

Optimization Problem with Constraints in Compact Form

\begin{equation*}\begin{aligned} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned}\hspace{0.7 in} y_i(\bar w.\bar x_i - b) + \xi_i - 1 \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i) \in \mathbb{D}, \hspace{2 mm} y_i \in \{+1,-1\} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \xi_i \geq 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \text{where} \hspace{2 mm} \mathbb{D} = \left\{ (\bar x_i, y_i) : \bar x_i \in \mathbb{R}^n, y_i \in \{+1, -1\} \right\} \hspace{2 mm} \text{ is Training Set} \end{aligned}\end{equation*}
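
To make the objective concrete, here is a small sketch that evaluates it for a fixed candidate $(\bar w, b)$. It relies on an observation that is not stated above but follows from the constraints: for fixed $(\bar w, b)$ and $C > 0$, the smallest slack satisfying both constraints is $\xi_i = \max(0, 1 - y_i(\bar w.\bar x_i - b))$. The toy data, the function names and the chosen values of $\bar w$, $b$ and $C$ are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def minimal_slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i(w.x_i - b) + xi_i - 1 >= 0 for each point."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) - b)) for xi, yi in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """Value of (1/2) w.w + C * sum(xi_i) using the minimal slacks."""
    return 0.5 * dot(w, w) + C * sum(minimal_slacks(w, b, X, y))


# Hypothetical toy training set: two well-separated points and two that
# violate their supporting hyperplanes for the chosen (w, b).
X = [(2.0, 2.0), (0.0, 0.0), (1.0, 0.5), (0.5, 1.5)]
y = [+1, -1, +1, -1]
w, b = (1.0, 1.0), 2.0

print(minimal_slacks(w, b, X, y))
print(soft_margin_objective(w, b, X, y, C=1.0))
print(soft_margin_objective(w, b, X, y, C=10.0))
```

Only the last two points need non-zero slack. Increasing $C$ leaves $\frac{1}{2} \bar w.\bar w$ unchanged but makes those violations more expensive, which is the trade-off that the constant $C$ controls.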

Optimization Problem in Lagrange Form

\begin{equation*}\begin{aligned} \underset{\bar \alpha, \bar \beta}{\text{max}} \hspace{1 mm} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i - \sum\limits_{i=1}^m \alpha_i (y_i(\bar w.\bar x_i - b) + \xi_i - 1) - \sum\limits_{i=1}^m \beta_i \xi_i \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} \alpha_i \geq 0, \hspace{2 mm} \beta_i \geq 0, \hspace{2 mm} i = 1, \ldots, m. \end{aligned}\end{equation*}
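
As a sketch of where this form leads (a standard next step, not derived in the text above), write $L$ for the bracketed expression; the name $L$ is our shorthand here. Setting the partial derivatives of $L$ with respect to the primal variables $\bar w$, $b$ and $\xi_i$ to zero would give:

\begin{equation*}\begin{aligned} \frac{\partial L}{\partial \bar w} &= \bar w - \sum\limits_{i=1}^m \alpha_i y_i \bar x_i = 0 \\ \frac{\partial L}{\partial b} &= \sum\limits_{i=1}^m \alpha_i y_i = 0 \\ \frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \beta_i = 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

Note the positive sign in the second line: the term $-\alpha_i y_i(\bar w.\bar x_i - b)$ contributes $+\alpha_i y_i b$ because the decision surface is written as $\bar w.\bar x = b$.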

## Where there is a will, there is a way - Intuition behind Lagrange Optimization

In many constrained optimization problems (both maximization and minimization), the optimization problem is first converted into something called Lagrange form and then the optimal solution is evaluated.

The following is the general representation of a Constrained minimization problem:

\begin{equation*}\begin{aligned} \underset{\bar x}{\text{min}} & & f(\bar x) \hspace{5 mm} \bar x \in \mathbb{R}^n \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} g_i(\bar x) \geq 0, \; i = 1, \ldots, m. \end{aligned}\end{equation*}

$f(\bar x)$ is the Objective function and $g_i(\bar x)$ are Constraint functions.

Note that constraint functions $g_i(\bar x)$ are assumed to be linearly independent, meaning that no constraint function can be expressed as a linear combination of the others. Equivalently, no non-trivial linear combination of them vanishes identically:

\begin{aligned}\Rightarrow \sum\limits_{i=1}^m c_i \times g_i(\bar x) \not\equiv 0 \hspace{2 mm} \text{for every choice of} \hspace{2 mm} c_1, c_2, \ldots, c_m \in \mathbb{R} \hspace{2 mm} \text{not all zero} \end{aligned}

The above optimization problem is then written in Lagrange optimization form as follows:

\begin{equation*}\begin{aligned}\underset{\bar \alpha}{\text{max}} \hspace{2 mm} \underset{\bar x}{\text{min}} & f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}\end{equation*}
\begin{aligned}\text{where } \alpha_i \geq 0, \; i = 1, \ldots, m. \end{aligned}
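
To see the max-min structure at work, consider a toy problem of our own choosing (it does not appear in the original text): minimize $f(\bar x) = x_1^2 + x_2^2$ subject to $g(\bar x) = x_1 + x_2 - 1 \geq 0$. The sketch below solves the inner minimization in closed form and performs the outer maximization with a coarse grid search over $\alpha \geq 0$; the grid range and step are arbitrary assumptions.

```python
def inner_min(alpha):
    """Minimum over x of L(alpha, x) = x1^2 + x2^2 - alpha * (x1 + x2 - 1).

    For fixed alpha, L is convex in x, so the minimum is where
    dL/dx1 = dL/dx2 = 0, which gives x1 = x2 = alpha / 2.
    """
    x1 = x2 = alpha / 2
    value = x1 ** 2 + x2 ** 2 - alpha * (x1 + x2 - 1)
    return value, (x1, x2)


# Outer maximization over alpha >= 0 by grid search on [0, 3] in steps of 0.01.
alphas = [k / 100 for k in range(0, 301)]
best_alpha = max(alphas, key=lambda a: inner_min(a)[0])
best_value, best_x = inner_min(best_alpha)

print(best_alpha, best_x, best_value)
```

The search settles on $\alpha = 1$ with $\bar x = (0.5, 0.5)$, and the max-min value $0.5$ equals $f(0.5, 0.5)$, the constrained minimum of the toy problem.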

The values $\alpha_1, \alpha_2, \ldots \alpha_m$ in the above representation are called Lagrange Multipliers (a.k.a. KKT Multipliers).

There are two intuitions behind the above formulation of Lagrangian optimization: one simple enough to explain in a top-down manner, and a second that is related to the first but requires deeper understanding. We will look at both intuitions.

Intuition # 1:
If an optimal solution $\bar x^*$ exists for the constrained optimization problem, then at that optimal point the gradient vector of the objective function $\nabla f(\bar x^*)$ is parallel to the gradient vector of the constraint function when there is a single constraint and, more generally, lies in the span of the constraint gradient vectors $\nabla g_i(\bar x^*)$. This means that the objective function gradient vector at the optimal point $\bar x^*$ can be expressed as a linear combination of the constraint function gradient vectors:

\begin{aligned}\nabla f(\bar x^*) = \alpha_1 \nabla g_1(\bar x^*) + \alpha_2 \nabla g_2(\bar x^*) + \ldots + \alpha_m \nabla g_m(\bar x^*), \hspace{2 mm} \text{where} \hspace{2 mm} \alpha_i \geq 0 \end{aligned}

\begin{aligned}\Rightarrow \nabla f(\bar x^*) - (\alpha_1 \nabla g_1(\bar x^*) + \alpha_2 \nabla g_2(\bar x^*) + \ldots + \alpha_m \nabla g_m(\bar x^*)) = 0 \end{aligned}

\begin{aligned}\Rightarrow \nabla f(\bar x^*) - \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) = 0 \end{aligned}
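
The gradient relation can be checked numerically on a small example. The sketch below assumes a toy problem that is not part of the original text: minimize $f(\bar x) = x_1^2 + x_2^2$ subject to $g(\bar x) = x_1 + x_2 - 1 \geq 0$, whose constrained minimum is at $\bar x^* = (0.5, 0.5)$. The gradients are approximated by central differences, so the helper `grad` and its step size are illustrative choices.

```python
def f(x):
    """Toy objective function."""
    return x[0] ** 2 + x[1] ** 2


def g(x):
    """Toy constraint function, required to satisfy g(x) >= 0."""
    return x[0] + x[1] - 1


def grad(func, x, h=1e-6):
    """Central-difference approximation of the gradient of func at x."""
    out = []
    for k in range(len(x)):
        up = list(x)
        up[k] += h
        down = list(x)
        down[k] -= h
        out.append((func(up) - func(down)) / (2 * h))
    return out


x_star = [0.5, 0.5]
grad_f = grad(f, x_star)
grad_g = grad(g, x_star)

# With a single constraint, grad f = alpha * grad g, so alpha is the
# common ratio of the components.
alpha = grad_f[0] / grad_g[0]
residual = [df - alpha * dg for df, dg in zip(grad_f, grad_g)]

print(grad_f, grad_g, alpha, residual)
```

Both gradients come out close to $(1, 1)$, so $\alpha \approx 1 \geq 0$ and the residual $\nabla f(\bar x^*) - \alpha \nabla g(\bar x^*)$ is numerically zero.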

The above equation with terms $\nabla f(\bar x^*)$ and $\nabla g_i(\bar x^*)$ seems to suggest that there exists a function $L(\bar \alpha, \bar x)$ whose gradient $\nabla L$ is 0 at $\bar x^*$.

The function $L(\bar \alpha, \bar x)$ and its gradient $\nabla L$ can be represented more generally as follows:

\begin{aligned}\Rightarrow L(\bar \alpha, \bar x) = f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}

\begin{aligned}\Rightarrow \nabla L(\bar \alpha, \bar x) = \nabla f(\bar x) - \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x) \end{aligned}

Note that the values $\alpha_1, \alpha_2, \ldots \alpha_m$ in the above equations are represented as vector $\bar \alpha$ for convenience such that $\bar \alpha = (\alpha_1, \alpha_2, \ldots \alpha_m)$.

Also note that the gradient $\nabla L$ (taken with respect to $\bar x$) is a vector and since $\bar x \in \mathbb{R}^n$, $\nabla L$ will have vector components $( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, \ldots \frac{\partial L}{\partial x_n} )$.

If $\nabla L = 0$ at $\bar x^*$ then each of the vector components $\frac{\partial L}{\partial x_i}$ will be equal to 0.

Since $\bar x^*$ is unknown and finding the optimal $\bar x^*$ is our goal, the generalized function $L(\bar \alpha, \bar x)$ gives some confidence that $\bar x^*$ can be found by solving for a point where $\nabla L = 0$. However, for the generalized function $L(\bar \alpha, \bar x)$, there can be many points $\bar x$ where $\nabla L = 0$. All such points are called stationary points (referred to here as saddle points), and the challenge is then to determine which one of them is the optimal point $\bar x^*$ that yields the optimal solution to the primary objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$.

The intuition postulates that under certain special circumstances (which we will quantify mathematically in later paragraphs), one of the saddle points $\bar x^*$ on $L(\bar \alpha, \bar x)$ can potentially be the optimal point for the main objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$.

Note that the values for multipliers $\alpha_i$ in function $L(\bar \alpha, \bar x)$ are unknown. Essentially, if we use the function $L(\bar \alpha, \bar x)$ to find the optimal point $\bar x^*$, then the challenge is twofold. We not only have to search for $\bar x^*$ but also have to find optimal values for $\alpha_i$ simultaneously, such that the desired saddle point $\bar x^*$ can be found.

This implies that we have to find an optimal combination of $\bar x^* = ( x_1^*, x_2^*, \ldots x_n^* )$ and $\bar \alpha^* = ( \alpha_1^*, \alpha_2^*, \ldots \alpha_m^*)$ with $\nabla L(\bar \alpha^*, \bar x^*) = 0$ such that $\bar x^*$ can potentially be the optimal point for main objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$.

Note that:

\begin{aligned}\nabla L(\bar \alpha^*, \bar x^*) = 0 \Rightarrow \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_1} = \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_2} = \ldots = \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_n} = 0 \end{aligned}

If we go with the above intuition solving for $(\bar \alpha^*, \bar x^*)$ using $L(\bar \alpha, \bar x)$, analytically speaking, the real challenge is that we have to find n variables in $\bar x^* = ( x_1^*, x_2^*, \ldots x_n^* )$ and m values in $( \alpha_1^*, \alpha_2^*, \ldots \alpha_m^*)$. Essentially there are (n+m) unknowns that we need to solve for. So far going with the notion of $\nabla L(\bar \alpha^*, \bar x^*) = 0$, we have arrived at the following n potential equations in (n+m) unknowns.

\begin{aligned}\nabla L = 0 \Rightarrow \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial x_2} = \ldots = \frac{\partial L}{\partial x_n} = 0 \end{aligned}

Missing are the m additional equations in (n+m) unknowns.

This is where the primary intuition can be revisited. Primary intuition suggested that under certain special conditions, one of the saddle points $\bar x^*$ on $L(\bar \alpha, \bar x)$ can potentially be the optimal point for the main objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$. Essentially this means that when such special conditions are met, the optimal value of objective function $f(\bar x)$ at $\bar x^*$ subject to constraints $g_i(\bar x^*) \geq 0$ is nothing but the value of function $L(\bar \alpha, \bar x)$ at $(\bar \alpha^*, \bar x^*)$.

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) = f(\bar x^*)\end{aligned}

However by formulation of generalized function $L(\bar \alpha, \bar x)$, we know that:

\begin{aligned} L(\bar \alpha, \bar x) = f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}

So the value of function $L(\bar \alpha, \bar x)$ at $(\bar \alpha^*, \bar x^*)$ is then given by:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) = f(\bar x^*) - \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) \end{aligned}

This can be re-written as:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) - f(\bar x^*) + \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

If $L(\bar \alpha^*, \bar x^*) = f(\bar x^*)$ then the above equation results in:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) - L(\bar \alpha^*, \bar x^*) + \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

\begin{aligned}\Rightarrow \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

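Since we know that $\alpha_i^* \geq 0$ and, because $\bar x^*$ satisfies the constraints, $g_i(\bar x^*) \geq 0$, every term $\alpha_i^* \times g_i(\bar x^*)$ is non-negative. A sum of non-negative terms can be 0 only when each term is 0, so the above equation will hold true only when each of the terms $\alpha_i^* \times g_i(\bar x^*)$ is equal to 0.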

And voilà! We have the special conditions (commonly known as complementary slackness) and the missing m equations that must be satisfied at $(\bar \alpha^*, \bar x^*)$:

\begin{aligned}\Rightarrow \alpha_i \times g_i(\bar x) = 0, i = 1, \ldots, m. \end{aligned}

Now using the following (n+m) equations in (n+m) unknowns, we can find an optimal combination of $\bar x^* = ( x_1^*, x_2^*, \ldots x_n^* )$ and $\bar \alpha^* = ( \alpha_1^*, \alpha_2^*, \ldots \alpha_m^*)$ with $\nabla L(\bar \alpha^*, \bar x^*) = 0$ such that $\bar x^*$ can potentially be the optimal point for main objective function $f(\bar x)$ subject to constraints $g_i(\bar x) \geq 0$.

\begin{aligned}\frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial x_2} = \ldots = \frac{\partial L}{\partial x_n} = 0 \hspace{2 mm} \text{and} \hspace{2 mm} \alpha_i \times g_i(\bar x) = 0, \hspace{2 mm} i = 1, \ldots, m. \end{aligned}
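
As an illustration, the (n+m) equations can be worked through on a toy problem of our own (not from the original text): minimize $f(\bar x) = x_1^2 + x_2^2$ subject to $g(\bar x) = x_1 + x_2 - 1 \geq 0$, so that n = 2 and m = 1. The equation $\alpha \times g(\bar x) = 0$ splits into two cases, each solved by hand in the comments; the helper `is_kkt_point` and its tolerance are hypothetical. The sketch also shows that the equations alone are not enough: the inequalities $g(\bar x) \geq 0$ and $\alpha \geq 0$ still have to be checked for each candidate.

```python
def grad_f(x):
    """Gradient of f(x) = x1^2 + x2^2."""
    return (2 * x[0], 2 * x[1])


def g(x):
    """Constraint function, required to satisfy g(x) >= 0."""
    return x[0] + x[1] - 1


def grad_g(x):
    """Gradient of g(x) = x1 + x2 - 1 (constant)."""
    return (1.0, 1.0)


def is_kkt_point(alpha, x, tol=1e-9):
    """Check grad L = 0, alpha * g(x) = 0, g(x) >= 0 and alpha >= 0."""
    stationary = all(
        abs(df - alpha * dg) <= tol for df, dg in zip(grad_f(x), grad_g(x))
    )
    slack_ok = abs(alpha * g(x)) <= tol
    feasible = g(x) >= -tol and alpha >= -tol
    return stationary and slack_ok and feasible


# Case alpha = 0: grad f = 0 gives x = (0, 0).
# Case g(x) = 0: 2*x1 = alpha, 2*x2 = alpha and x1 + x2 = 1
#                give x = (0.5, 0.5) and alpha = 1.
candidates = [(0.0, (0.0, 0.0)), (1.0, (0.5, 0.5))]
accepted = [(a, x) for a, x in candidates if is_kkt_point(a, x)]

print(accepted)
```

Both candidates satisfy the three equations, but the first violates $g(\bar x) \geq 0$, which leaves $\bar x^* = (0.5, 0.5)$ with $\alpha^* = 1$.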

Hence Intuition # 1 forms the basis for the formulation of the Lagrangian Optimization problem from the Constrained Optimization problem.

• In case of Minimization problems, the standard convention is to use constraints as $g_i(\bar x) \geq 0$. In this case, the gradient vector of objective function $\nabla f$ and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in same direction and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\geq 0$. If constraints are declared as $g_i(\bar x) \leq 0$, gradient vector of objective function $\nabla f$ and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in opposite directions and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\leq 0$ to compensate for opposite directions
• Alternatively, in case of Minimization problems, if the constraints are declared as $g_i(\bar x) \leq 0$, gradient vectors linear combination can be expressed as $\nabla f(\bar x^*) = -\sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ to account for opposite directions and $\alpha_i$ will be $\geq 0$
• In case of Maximization problems, the standard convention is to use constraints as $g_i(\bar x) \leq 0$. In this case, the gradient vector of objective function $\nabla f$ and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in same direction and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\geq 0$. If constraints are declared as $g_i(\bar x) \geq 0$, gradient vector of objective function $\nabla f$ and gradient vectors of constraint functions $\nabla g_i$ at the optimal point $\bar x^*$ will point in opposite directions and $\alpha_i$ in $\nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ will be $\leq 0$ to compensate for opposite directions
• Alternatively, in case of Maximization problems, if the constraints are declared as $g_i(\bar x) \geq 0$, gradient vectors linear combination can be expressed as $\nabla f(\bar x^*) = -\sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*)$ to account for opposite directions and $\alpha_i$ will be $\geq 0$
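
The sign conventions in the list above can be illustrated on a toy minimization problem, which is our own assumption and not from the original text: minimize $f(\bar x) = x_1^2 + x_2^2$ with the same constraint written either as $g(\bar x) = x_1 + x_2 - 1 \geq 0$ or as $h(\bar x) = 1 - x_1 - x_2 \leq 0$. The optimum $\bar x^* = (0.5, 0.5)$ is the same in both cases; only the sign of the multiplier changes.

```python
def grad_f(x):
    """Gradient of f(x) = x1^2 + x2^2."""
    return (2 * x[0], 2 * x[1])


grad_g = (1.0, 1.0)    # gradient of g(x) = x1 + x2 - 1, used as g(x) >= 0
grad_h = (-1.0, -1.0)  # gradient of h(x) = 1 - x1 - x2, used as h(x) <= 0

x_star = (0.5, 0.5)
gf = grad_f(x_star)

# The components of each gradient are equal here, so one ratio is enough.
alpha_g = gf[0] / grad_g[0]            # from grad f = alpha * grad g
alpha_h_same_form = gf[0] / grad_h[0]  # from grad f = alpha * grad h
alpha_h_negated = -gf[0] / grad_h[0]   # from grad f = -alpha * grad h

print(alpha_g, alpha_h_same_form, alpha_h_negated)
```

With $g(\bar x) \geq 0$ the multiplier is $+1$. Declaring the constraint as $h(\bar x) \leq 0$ gives $-1$ if the same linear combination is kept, and $+1$ again once the combination is written with a leading minus sign.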