Thursday, March 30, 2017

How to choose a machine learning algorithm/model?



I drew a figure to illustrate how to select an ML algorithm according to different factors (click to see the large figure):










NOTE: the following material is quoted from (not my original content):
http://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html


If we tackle a supervised learning problem, my advice is to start with the simplest hypothesis space. That is, try a linear model such as logistic regression. If this doesn't work "well" (i.e., it doesn't meet the expectation or performance criterion that we defined earlier), I would move on to the next experiment.
Random Forests vs. SVMs
I would say that random forests are probably THE "worry-free" approach - if such a thing exists in ML: there are no real hyperparameters to tune (maybe except for the number of trees; typically, the more trees we have, the better). In contrast, there are a lot of knobs to turn in SVMs: choosing the "right" kernel, the regularization penalty, the slack variable, ...
Both random forests and SVMs are non-parametric models (i.e., the complexity grows as the number of training samples increases). Training a non-parametric model can thus be more expensive, computationally, compared to a generalized linear model, for example. The more trees we have, the more expensive it is to build a random forest. Also, we can end up with a lot of support vectors in SVMs; in the worst-case scenario, we have as many support vectors as we have samples in the training set. Although there are multi-class SVMs, the typical implementation for multi-class classification is One-vs.-All; thus, we have to train an SVM for each class -- in contrast to decision trees or random forests, which can handle multiple classes out of the box.
To summarize, random forests are much simpler to train for a practitioner; it's easier to find a good, robust model. The complexity of a random forest grows with the number of trees in the forest, and the number of training samples we have. In SVMs, we typically need to do a fair amount of parameter tuning, and in addition to that, the computational cost grows linearly with the number of classes as well.
Deep Learning
As a rule of thumb, I'd say that SVMs are great for relatively small data sets with fewer outliers. Random forests may require more data but they almost always come up with a pretty robust model. And deep learning algorithms... well, they require "relatively" large datasets to work well, and you also need the infrastructure to train them in reasonable time. Also, deep learning algorithms require much more experience: setting up a neural network using deep learning algorithms is much more tedious than using off-the-shelf classifiers such as random forests and SVMs. On the other hand, deep learning really shines when it comes to complex problems such as image classification, natural language processing, and speech recognition. Another advantage is that you have to worry less about the feature engineering part. Again, in practice, the decision of which classifier to choose really depends on your dataset and the general complexity of the problem -- that's where your experience as a machine learning practitioner kicks in.
When it comes to predictive performance, there are cases where SVMs do better than random forests and vice versa:
The same is true for deep learning algorithms if you look at the MNIST benchmarks (http://yann.lecun.com/exdb/mnist/): the best-performing model on this benchmark is a committee of 35 ConvNets, reported to have a 0.23% test error; the best SVM model has a test error of 0.56%. The ConvNet ensemble may reach a better accuracy (for the sake of this comparison, let's pretend that these are totally unbiased estimates), but without question, I'd say that the 35-ConvNet committee is far more expensive (computationally). So when you make that decision, ask: is a 0.33% improvement worth it? In some cases it may be worth it (e.g., in the financial sector for non-real-time predictions); in other cases it probably won't be.
So, my practical advice is:
  • Define a performance metric to evaluate your model
  • Ask yourself: what performance score is desired, what hardware is required, and what is the project deadline?
  • Start with the simplest model
  • If you don't meet your expected goal, try more complex models (if possible)
/////////////////////////////////////////////////////////////


Cross Validation
What you do is simply split your dataset into K non-overlapping subsets (folds), train a model using K-1 folds, and estimate its performance using the fold you left out. You do this for each possible choice of the held-out fold (first leave the 1st fold out, then the 2nd, ..., then the Kth, and train with the remaining folds). After finishing, you compute the mean performance over all folds (and maybe also the variance/standard deviation of the performance).
How to choose the parameter K depends on the time you have. Usual values of K are 3, 5, 10, or even N, where N is the size of your data (that's the same as Leave-One-Out Cross Validation). I prefer 5 or 10.
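Since this post already uses C++ for the code at the bottom, here is a minimal sketch of the K-fold mechanics described above; trainAndScore is a hypothetical callback standing in for whatever model you want to evaluate (train on the first index set, return a score on the second):

#include <algorithm>
#include <functional>
#include <numeric>
#include <random>
#include <vector>
using namespace std;

// Split sample indices 0..numSamples-1 into K shuffled, non-overlapping folds.
vector<vector<int>> kFoldIndices(int numSamples, int K, unsigned seed = 42) {
    vector<int> idx(numSamples);
    iota(idx.begin(), idx.end(), 0);
    shuffle(idx.begin(), idx.end(), mt19937(seed));
    vector<vector<int>> folds(K);
    for (int i = 0; i < numSamples; i++) folds[i % K].push_back(idx[i]);  // round-robin assignment
    return folds;
}

// K-fold cross validation: train on K-1 folds, score on the held-out fold, average.
double crossValidate(int numSamples, int K,
                     const function<double(const vector<int>&, const vector<int>&)> &trainAndScore) {
    vector<vector<int>> folds = kFoldIndices(numSamples, K);
    double total = 0.0;
    for (int k = 0; k < K; k++) {
        vector<int> trainIdx;
        for (int j = 0; j < K; j++)
            if (j != k) trainIdx.insert(trainIdx.end(), folds[j].begin(), folds[j].end());
        total += trainAndScore(trainIdx, folds[k]);   // performance on the left-out fold
    }
    return total / K;                                 // mean performance across the K folds
}

On your own data, trainAndScore would fit the model on the training indices and return, say, accuracy on the test indices.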
Model Selection
Let's say you have 5 methods (ANN, SVM, KNN, etc.) and 10 parameter combinations for each method (depending on the method). You simply have to run Cross Validation for each method and parameter combination (5x10 = 50) and select the best model, method, and parameters. Then you re-train with the best method and parameters on all your data, and you have your final model!
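As a sketch of that selection loop (the Candidate fields are illustrative, and cvScore is a hypothetical callback that would run the crossValidate routine above for one method/parameter setting):

#include <functional>
#include <map>
#include <string>
#include <vector>
using namespace std;

// One candidate = a method plus one hyperparameter setting (names are illustrative).
struct Candidate {
    string method;                 // e.g. "SVM", "KNN", "ANN"
    map<string, double> params;    // e.g. {{"C", 1.0}, {"gamma", 0.1}}
};

// Run cross validation for every method/parameter combination and keep the best.
Candidate selectModel(const vector<Candidate> &grid,
                      const function<double(const Candidate&)> &cvScore) {
    Candidate best = grid.front();
    double bestScore = cvScore(best);
    for (size_t i = 1; i < grid.size(); i++) {
        double score = cvScore(grid[i]);
        if (score > bestScore) { bestScore = score; best = grid[i]; }
    }
    return best;   // finally, re-train this method/parameter setting on ALL your data
}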
Well, there are some more things to say. If, for example, you use a lot of methods and many parameter combinations for each, it's very likely you will overfit. In cases like these you have to use nested Cross Validation.
Nested Cross Validation
In nested Cross Validation you perform Cross Validation around the Model Selection procedure. Again you first split your data into K folds. In each step you use K-1 folds as your training data and the remaining one as your test data. Then you run Model Selection (the procedure I explained above) for each possible combination of those K folds. After finishing this you will have K selected models, one for each combination of folds. After that you test each model on its held-out test data and choose the best one. Again, after having that last model, you train a new one with the same method and parameters on all the data you have. That's your final model.
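Here is a skeleton of that procedure, reusing kFoldIndices and Candidate from the sketches above; selectOnTrain and evalOnHoldout are hypothetical callbacks (run model selection on the outer training indices; train the chosen candidate on them and score it on the outer test indices). For simplicity this skeleton reports the mean of the outer-fold scores, which is a common way to read nested CV results:

#include <functional>
#include <vector>
using namespace std;

// Nested cross validation: the outer loop estimates how well the WHOLE model-selection
// procedure generalizes; the inner loop (selectModel above) only ever sees the outer
// training folds. Returns the mean score over the outer test folds.
double nestedCrossValidate(int numSamples, int Kouter,
                           const function<Candidate(const vector<int>&)> &selectOnTrain,
                           const function<double(const Candidate&, const vector<int>&,
                                                 const vector<int>&)> &evalOnHoldout) {
    vector<vector<int>> folds = kFoldIndices(numSamples, Kouter);
    double total = 0.0;
    for (int k = 0; k < Kouter; k++) {
        vector<int> trainIdx;
        for (int j = 0; j < Kouter; j++)
            if (j != k) trainIdx.insert(trainIdx.end(), folds[j].begin(), folds[j].end());
        Candidate chosen = selectOnTrain(trainIdx);          // inner CV / model selection
        total += evalOnHoldout(chosen, trainIdx, folds[k]);  // score on the outer test fold
    }
    return total / Kouter;  // afterwards, run model selection once on ALL data for the final model
}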
Of course there are many variations of these methods and other things I didn't mention. If you need more information, look for publications on these topics.



////////////////////////////////////////////////////////////////


Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).
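To make the "bunch of counts" point concrete, here is a minimal sketch of a categorical Naive Bayes classifier (integer-coded feature values and labels, add-one smoothing); it is only an illustration of the idea, not a production implementation:

#include <cmath>
#include <map>
#include <vector>
using namespace std;

// Minimal count-based Naive Bayes: training really is just counting.
class NaiveBayes {
    map<int, int> classCount_;                      // class label -> number of samples
    map<int, vector<map<int, int>>> featCount_;     // class -> per-feature value counts
    int numFeatures_ = 0, numSamples_ = 0;

public:
    void fit(const vector<vector<int>> &X, const vector<int> &y) {
        numFeatures_ = X.empty() ? 0 : (int)X[0].size();
        numSamples_ = (int)X.size();
        for (size_t i = 0; i < X.size(); i++) {
            classCount_[y[i]]++;
            vector<map<int, int>> &perFeature = featCount_[y[i]];
            perFeature.resize(numFeatures_);
            for (int f = 0; f < numFeatures_; f++) perFeature[f][X[i][f]]++;  // the counts
        }
    }

    int predict(const vector<int> &x) const {
        int best = -1;
        double bestLogProb = -1e300;
        for (const auto &c : classCount_) {
            // log P(class) + sum over features of log P(value | class), add-one smoothed
            double lp = log((double)c.second / numSamples_);
            const vector<map<int, int>> &perFeature = featCount_.at(c.first);
            for (int f = 0; f < numFeatures_; f++) {
                auto it = perFeature[f].find(x[f]);
                int count = (it == perFeature[f].end()) ? 0 : it->second;
                // the +1s keep unseen feature values from zeroing out the whole product
                lp += log((count + 1.0) / (c.second + perFeature[f].size() + 1.0));
            }
            if (lp > bestLogProb) { bestLogProb = lp; best = c.first; }
        }
        return best;
    }
};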
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
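Since the easy online update is what sets logistic regression apart here, a minimal sketch of a single stochastic-gradient-descent update (the learning rate and L2 strength are arbitrary illustrative values):

#include <cmath>
#include <vector>
using namespace std;

// Minimal online logistic regression: one stochastic-gradient step per example,
// so the model can be updated as new data arrives.
class OnlineLogisticRegression {
    vector<double> w_;   // weights; the last entry is the bias term
    double lr_, l2_;

public:
    OnlineLogisticRegression(int numFeatures, double lr = 0.1, double l2 = 1e-4)
        : w_(numFeatures + 1, 0.0), lr_(lr), l2_(l2) {}

    // P(y = 1 | x) under the current weights.
    double predictProba(const vector<double> &x) const {
        double z = w_.back();                      // bias
        for (size_t j = 0; j < x.size(); j++) z += w_[j] * x[j];
        return 1.0 / (1.0 + exp(-z));              // sigmoid
    }

    // One SGD step on a single (x, y) pair with y in {0, 1}: gradient of the
    // log-loss plus an L2 penalty on the weights (not the bias).
    void update(const vector<double> &x, int y) {
        double err = predictProba(x) - y;          // dLoss/dz
        for (size_t j = 0; j < x.size(); j++)
            w_[j] -= lr_ * (err * x[j] + l2_ * w_[j]);
        w_.back() -= lr_ * err;                    // bias update, no regularization
    }
};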
Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.



//////////////////////////////////////

Logistic Regression Pros:

  • Convenient probability scores for observations
  • Multi-collinearity is not really an issue and can be countered with L2 regularization to an extent
  • Supports online learning, so new data can be incorporated in the future
  • Efficient implementations available across tools
  • Widespread industry comfort with logistic regression solutions [ oh that’s important too!]

Logistic Regression Cons:

  • Doesn’t perform well when feature space is too large
  • Doesn’t handle large number of categorical features/variables well
  • Relies on transformations for non-linear features
  • Relies on entire data [ Not a very serious drawback I’d say]
Logistic regression algorithm
Let’s discuss Decision Trees and Support Vector Machines.
Decision trees are inherently indifferent to monotonic transformations and non-linear features [ this is different from non-linear correlation among predictors] because they simply cut the feature space into rectangles [ or (hyper)cuboids], which adjust themselves to any monotonic transformation. Since decision trees are designed to work with discrete intervals or classes of predictors anyway, any number of categorical variables is not really an issue for them. Models obtained from a decision tree are fairly intuitive and easy to explain to the business. Probability scores are not a direct result, but you can use the class probabilities assigned to terminal nodes instead. This brings us to the biggest problem associated with decision trees: they fit the training set far too closely. You can build a decision tree model on your training set which might outperform all other algorithms, but it’ll prove to be a poor predictor on your test set. You’ll have to rely heavily on pruning and cross validation to get a non-overfitting model with decision trees.
This problem of over-fitting is overcome to a large extent by using Random Forests, which are nothing but a very clever extension of decision trees. But random forests take away the easy-to-explain business rules, because now you have thousands of such trees and their majority vote, which makes things complex. Also, by design, decision trees force interactions between variables, which makes them rather inefficient if most of your variables have no or only very weak interactions. On the other hand, this design also makes them rather less susceptible to multicollinearity. Whew!
Summarizing Decision Trees:

Decision Trees Pros:

  • Intuitive Decision Rules
  • Can handle non-linear features
  • Take into account variable interactions

Decision Trees Cons:

  • Prone to overfitting the training set [Random Forests to your rescue]
  • No ranking score as a direct result
Decision Trees
Now to Support Vector Machines. The best thing about support vector machines is that they rely on boundary cases to build the much-needed separating curve. They can handle non-linear decision boundaries, as we saw earlier. Reliance on boundary cases also enables them to handle missing data for “obvious” cases. SVMs can handle large feature spaces, which makes them one of the favorite algorithms for text analysis, which almost always produces a huge number of features; there, logistic regression is not a very good choice.
Results of SVMs are not as intuitive as decision trees for a layman. With non-linear kernels, SVMs can be very costly to train on huge data. In summary:

SVM Pros:

  • Can handle large feature space
  • Can handle non-linear feature interactions
  • Do not rely on entire data

SVM Cons:

  • Not very efficient with a large number of observations
  • It can sometimes be tricky to find an appropriate kernel
I have tried to compile a simple workflow for you to decide which algorithm to use out of these three, which is as follows:
  • Always start with logistic regression, if nothing else then to use its performance as a baseline
  • See if decision trees (Random Forests) provide significant improvement. Even if you do not end up using the resultant model, you can use random forest results to remove noisy variables
  • Go for SVM if you have a large number of features and the number of observations is not a limitation for the available resources and time
Support vector Machines
At the end of the day, remember that good data beats any algorithm anytime. Always see if you can engineer a good feature by using your domain knowledge. Try various iterations of your ideas while experimenting with feature creation. Another thing to try with efficient computing infra available these days is to use ensembles of multiple models. We’ll discuss them next, so, stay tuned!

Friday, March 10, 2017

Largest Rectangle in Histogram

Problem: Largest Rectangle in Histogram
Given n non-negative integers representing the histogram's bar heights where the width of each bar is 1, find the area of the largest rectangle in the histogram.
Above is a histogram where the width of each bar is 1, given height = [2,1,5,6,2,3].

The largest rectangle is shown in the shaded area, which has area = 10 units.
For example,
Given height = [2,1,5,6,2,3],
return 10.


Let's use the example from the problem, [2,1,5,6,2,3].
First, if the stack is empty, push index i onto it. So the first index, i=0, goes in. Note that the stack stores indices, not heights. Then i++.
Continue: when i=1, we find h[i] is smaller than the element at the top of the stack, so we pop. (From this you can already guess that the stack only ever holds indices whose heights are monotonically increasing.)
At this point the stack is empty, so the area is computed as h[t] * i, where t is the index just popped off the top of the stack. (This was the blue region in the original figure.)
Continue. The stack is empty now, so we push again. Notice that as long as the heights keep increasing we keep pushing, until we reach i=4, where h[i]=2 is smaller than the element at the top of the stack.
Now we start computing rectangle areas. First pop the top of the stack, t=3 (the green region in the figure).
Next, notice that the element indexed by the new top of the stack is still greater than the element at the current i, so we pop again and compute another area (the pink region).
Finally, the element indexed by the top of the stack is no longer greater than the element at the current i, so the loop continues: we push i and advance it, until we meet a descending element again, namely the dummy element 0 that we appended at the end.
In the same way we compute the areas for what is left on the stack. Since the current element (the dummy 0) is the smallest, every index on the stack gets popped and takes part in an area computation.
Note that maxArea is updated each time an area is computed.
To summarize: the stack always keeps the indices of an increasing run of bars; when a smaller element is encountered, we pop one index at a time and compute the area that the popped bar can span, until the element at the top of the stack is smaller than the current element.
But why is this method correct? At first I couldn't quite figure it out either; it just seemed impressively clever.
------------------------------------------------- Update ----------------------------------------------------------------
Here is one way to understand the algorithm (refer to the figure below).
Say we finally reach a decreasing bar (the red one in the figure). The bars above the red line (i.e., the ones in the stack that are taller than this rightmost, shorter bar) can never form a larger rectangle with it than the rectangles we compute while popping them (the yellow area): the red bar is their bottleneck, so the largest rectangle any of them can form with the red bar is the red bar's height times the index range those taller bars span. But every one of those taller bars is taller than the red bar, so the rectangles they form among themselves are already at least as large as anything they could form with the red bar. So the red bar can never take part in the taller bars' best rectangle, and it is safe to pop them.
The shorter bar need not despair, though. Although it lacks height, it can still team up with bars farther away that are even shorter than it is, and those rectangles can beat the taller bars' rectangles thanks to their width. That is why we stop popping as soon as the top of the stack is shorter than the red bar.
One more detail deserves attention: how the area is computed while popping.
h[t] * (S.empty() ? i : i - S.top() - 1)
h[t] is the height of the bar whose index was just popped. The area computed here is the largest rectangle h[t] can form with the taller bars before it. Note that every index remaining on the stack points to a bar shorter than h[t]; if h[t] is the shortest so far, the stack is now empty. And every bar strictly between the current top of the stack and the current i (excluding both endpoints) is taller than both of them, as the figure shows:
So h[t] is exactly the bottleneck of those taller bars between S.top() and i, and the width they span is i - S.top() - 1.
In other words, this popping process is how the loop invariant is maintained: the elements remaining on the stack must be smaller than the element at the current i.


--------------------------------------------------------------------
NOTE: for an example like [2,1,1,1,1,2,3], a height-1 index always sits at the bottom of the stack (equal heights get popped and re-pushed), and when i sweeps to the appended 0 at the end, the final calculation is effectively 1*7, because in (S.empty() ? i : i - S.top() - 1) the stack is empty at that point.
NOTE also: in the example [2,1,0,1,1,1,2,3], index i=2 (height 0) stays at the bottom of the stack throughout the sweep, so for the bars to its right the width is always computed as i - S.top() - 1; the stack never becomes empty while they are being popped.

class Solution {
public:
    int largestRectangleArea(vector<int> &h) {
        stack<int> S;          // holds indices of bars with increasing heights
        h.push_back(0);        // append a dummy 0-height bar so everything gets popped at the end
        int sum = 0;           // maximum area found so far
        for (int i = 0; i < (int)h.size(); i++) {
            if (S.empty() || h[i] > h[S.top()]) {
                S.push(i);     // heights still increasing: keep pushing
            } else {
                int tmp = S.top();   // pop the taller bar; its height bounds the rectangle
                S.pop();
                // width runs from just after the new top of the stack up to i-1
                // (or from 0 to i-1 if the stack is now empty)
                sum = max(sum, h[tmp] * (S.empty() ? i : i - S.top() - 1));
                i--;           // stay on the current bar and compare it with the new top
            }
        }
        return sum;
    }
};
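A quick sanity check on the example from the problem; this assumes the Solution class above is compiled in the same file together with the usual headers (<vector>, <stack>, <algorithm>, <iostream>) and using namespace std:

int main() {
    vector<int> height = {2, 1, 5, 6, 2, 3};
    Solution sol;
    // Expected output: 10 (the bars of height 5 and 6 form a 5 x 2 rectangle).
    cout << sol.largestRectangleArea(height) << endl;
    return 0;
}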