Thursday, March 30, 2017

How to choose a machine learning algorithm/model?



I drew a figure to illustrate how to select an ML algorithm according to different factors (click to see the large figure):










NOTE: The following material is quoted from (it is not my original content):
http://www.kdnuggets.com/2016/04/deep-learning-vs-svm-random-forest.html


If we tackle a supervised learning problem, my advice is to start with the simplest hypothesis space first. I.e., try a linear model such as logistic regression. If this doesn't work "well" (i.e., it doesn't meet our expectation or performance criterion that we defined earlier), I would move on to the next experiment.
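To make that concrete, here is a minimal sketch of the "start with the simplest hypothesis space" advice, assuming Python with scikit-learn; the dataset and the 0.90 accuracy target are purely illustrative assumptions:

# Fit a logistic regression baseline and check it against a criterion defined up front.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Simplest hypothesis space first: a linear model.
baseline = LogisticRegression(max_iter=5000)
scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
print("logistic regression baseline: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Only if the baseline misses the criterion we defined beforehand (0.90 is an
# arbitrary illustrative target) would we move on to more complex models.
if scores.mean() < 0.90:
    print("baseline below target -- try a more complex model next")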
Random Forests vs. SVMs
I would say that random forests are probably THE "worry-free" approach - if such a thing exists in ML: there are no real hyperparameters to tune (except maybe the number of trees; typically, the more trees we have, the better). In contrast, there are a lot of knobs to be turned in SVMs: choosing the "right" kernel, regularization penalties, the slack variable, ...
Both random forests and SVMs are non-parametric models (i.e., the complexity grows as the number of training samples increases). Training a non-parametric model can thus be more expensive, computationally, compared to a generalized linear model, for example. The more trees we have, the more expensive it is to build a random forest. Also, we can end up with a lot of support vectors in SVMs; in the worst-case scenario, we have as many support vectors as we have samples in the training set. Although there are multi-class SVMs, the typical implementation for multi-class classification is One-vs.-All; thus, we have to train an SVM for each class -- in contrast to decision trees or random forests, which can handle multiple classes out of the box.
To summarize, random forests are much simpler to train for a practitioner; it's easier to find a good, robust model. The complexity of a random forest grows with the number of trees in the forest, and the number of training samples we have. In SVMs, we typically need to do a fair amount of parameter tuning, and in addition to that, the computational cost grows linearly with the number of classes as well.
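The difference in "knobs" is easy to see in code. Below is a small sketch, assuming scikit-learn and an illustrative dataset: the random forest runs with defaults apart from the number of trees, while the SVM needs a (made-up) grid over kernel, C, and gamma:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Random forest: more trees generally helps; little else to tune.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("random forest (defaults):", cross_val_score(rf, X, y, cv=5).mean())

# SVM: kernel choice, regularization strength C, and kernel width gamma all matter.
svm = make_pipeline(StandardScaler(), SVC())
grid = {"svc__kernel": ["linear", "rbf"],
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": ["scale", 0.01, 0.001]}
tuned_svm = GridSearchCV(svm, grid, cv=5)
print("SVM (tuned over the grid):", cross_val_score(tuned_svm, X, y, cv=5).mean())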
Deep Learning
As a rule of thumb, I'd say that SVMs are great for relatively small data sets with fewer outliers. Random forests may require more data but they almost always come up with a pretty robust model. And deep learning algorithms... well, they require "relatively" large datasets to work well, and you also need the infrastructure to train them in reasonable time. Also, deep learning algorithms require much more experience: setting up a neural network using deep learning algorithms is much more tedious than using off-the-shelf classifiers such as random forests and SVMs. On the other hand, deep learning really shines when it comes to complex problems such as image classification, natural language processing, and speech recognition. Another advantage is that you have to worry less about the feature engineering part. Again, in practice, the decision of which classifier to choose really depends on your dataset and the general complexity of the problem -- that's where your experience as a machine learning practitioner kicks in.
When it comes to predictive performance, there are cases where SVMs do better than random forests and vice versa.
The same is true for deep learning algorithms if you look at the MNIST benchmarks (http://yann.lecun.com/exdb/mnist/): the best-performing model in this set is a committee of 35 ConvNets, which was reported to have a 0.23% test error; the best SVM model has a test error of 0.56%. The ConvNet ensemble may reach a better accuracy (for the sake of this comparison, let's pretend that these are totally unbiased estimates), but without question, the 35-ConvNet committee is far more expensive computationally. So, when you make that decision, ask: is a 0.33% improvement worth it? In some cases it may be worth it (e.g., in the financial sector for non-real-time predictions); in other cases it probably won't be.
So, my practical advice is:
  • Define a performance metric to evaluate your model
  • Ask yourself: What performance score is desired? What hardware is required? What is the project deadline?
  • Start with the simplest model
  • If you don't meet your expected goal, try more complex models (if possible)
/////////////////////////////////////////////////////////////


Cross Validation
What you do is simply split your dataset into K non-overlapping subsets (folds), train a model using K-1 folds, and estimate its performance on the fold you left out. You do this for each possible choice of held-out fold (first leave the 1st fold out, then the 2nd, ..., then the Kth, training on the remaining folds each time). After finishing, you estimate the mean performance across all folds (and maybe also the variance/standard deviation of the performance).
How to choose the parameter K depends on the time you have. Usual values of K are 3, 5, 10, or even N, where N is the size of your data (that's the same as Leave-One-Out Cross Validation). I prefer 5 or 10.
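Here is a minimal sketch of K-fold cross validation with K=5, written as an explicit loop so the train/held-out split is visible (dataset and model are illustrative; scikit-learn's cross_val_score does the same thing in one call):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                       # train on K-1 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out fold

print("mean accuracy: %.3f, std: %.3f" % (np.mean(fold_scores), np.std(fold_scores)))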
Model Selection
Let's say you have 5 methods (ANN, SVM, KNN, etc.) and 10 parameter combinations for each method (depending on the method). You simply have to run Cross Validation for each method and parameter combination (5x10 = 50 runs) and select the best model, method, and parameters. Then you re-train with the best method and parameters on all your data and you have your final model!
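A sketch of that model-selection loop, assuming scikit-learn; the three methods and their small parameter grids below are made up for illustration (3 methods x small grids instead of 5 x 10, but the idea is the same):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

candidates = {
    "knn": (KNeighborsClassifier(), {"model__n_neighbors": [1, 5, 15]}),
    "svm": (SVC(), {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}),
    "ann": (MLPClassifier(max_iter=2000), {"model__hidden_layer_sizes": [(10,), (50,)]}),
}

best_name, best_score, best_search = None, -1.0, None
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=5).fit(X, y)   # CV over every parameter combination
    if search.best_score_ > best_score:
        best_name, best_score, best_search = name, search.best_score_, search

print(best_name, best_score, best_search.best_params_)
# GridSearchCV refits the winner on all the data, which is the "final model" step above.
final_model = best_search.best_estimator_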
Well, there are some more things to say. If, for example, you use a lot of methods and many parameter combinations for each, it's very likely you will overfit. In cases like these you have to use nested Cross Validation.
Nested Cross Validation
In nested Cross Validation you perform Cross Validation around the Model Selection procedure. Again, you first split your data into K folds. At each step you use K-1 folds as your training data and the remaining fold as your test data. You then run Model Selection (the procedure I explained above) on each of those training sets. After finishing this you will have K models, one for each combination of folds. After that you test each model on its held-out fold and choose the best one. Again, after choosing that last model, you train a new one with the same method and parameters on all the data you have. That's your final model.
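A sketch of nested cross validation, assuming scikit-learn: the inner GridSearchCV plays the role of the Model Selection procedure, and the outer loop estimates how well that whole procedure generalizes (the grid and fold counts are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # performance estimate

grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
inner_search = GridSearchCV(SVC(), grid, cv=inner_cv)

# Each outer fold: run the full selection on K-1 folds, then test on the held-out fold.
outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv)
print("nested CV estimate: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))

# Final model: rerun the selection on all the data, as the text describes.
final_model = inner_search.fit(X, y).best_estimator_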
Of course there are many variations of these methods and other things I didn't mention. If you need more information about these, look for publications on these topics.



////////////////////////////////////////////////////////////////


Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, an NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can't learn interactions between features (e.g., it can't learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they're together).
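As a tiny illustration of the "bunch of counts" point, here is a word-count Naive Bayes text classifier sketch (the toy sentences and labels are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot, awful acting",
         "loved the acting", "awful movie, hated it"]
labels = [1, 0, 1, 0]   # 1 = positive review, 0 = negative review (toy data)

# Counting words + applying Bayes' rule: fast to train, decent baseline.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["loved the plot"]))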
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
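For the online-update point, here is a sketch using scikit-learn's SGDClassifier with a logistic loss, i.e., a logistic regression fit by online gradient descent; the "stream" of mini-batches is simulated from an illustrative dataset:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling just to keep SGD well behaved in this sketch
classes = np.unique(y)

# loss="log_loss" makes this a logistic regression trained incrementally.
model = SGDClassifier(loss="log_loss", random_state=0)
for start in range(0, len(X), 100):
    X_batch, y_batch = X[start:start + 100], y[start:start + 100]
    model.partial_fit(X_batch, y_batch, classes=classes)

# When fresh data arrives later, another partial_fit call updates the model in place.
print(model.predict_proba(X[:3]))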
Advantages of Decision Trees: Easy to interpret and explain (for some people – I'm not sure I fall into this camp). They easily handle feature interactions and they're non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don't support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.



//////////////////////////////////////

Logistic Regression Pros:

  • Convenient probability scores for observations
  • Multi-collinearity is not really an issue and can be countered with L2 regularization to an extent
  • Supports online learning, so it can take in new data in the future
  • Efficient implementations available across tools
  • Widespread industry comfort with logistic regression solutions [oh, that's important too!]

Logistic Regression Cons:

  • Doesn’t perform well when feature space is too large
  • Doesn’t handle large number of categorical features/variables well
  • Relies on transformations for non-linear features
  • Relies on entire data [ Not a very serious drawback I’d say]
[Image: Logistic regression algorithm]
Let’s discuss Decision Trees and Support Vector Machines .
Decision trees are inherently indifferent to monotonic transformations or non-linear features [this is different from non-linear correlation among predictors] because they simply cut the feature space into rectangles [or (hyper)cuboids], which can adjust themselves to any monotonic transformation. Since decision trees are designed to work with discrete intervals or classes of predictors, any number of categorical variables is not really an issue for decision trees. Models obtained from a decision tree are fairly intuitive and easy to explain to business. Probability scores are not a direct result, but you can use the class probabilities assigned to terminal nodes instead. This brings us to the biggest problem associated with decision trees: they are highly prone to overfitting. You can build a decision tree model on your training set that outperforms all other algorithms, but it will prove to be a poor predictor on your test set. You'll have to rely heavily on pruning and cross validation to get a non-overfitting model with decision trees.
This problem of over-fitting is overcome to a large extent by using Random Forests, which are nothing but a very clever extension of decision trees. But random forests take away the easy-to-explain business rules, because now you have hundreds or thousands of such trees and their majority votes, which makes things complex. Also, by design, decision trees have forced interactions between variables, which makes them rather inefficient if most of your variables have no or very weak interactions. On the other hand, this design also makes them less susceptible to multicollinearity. Whew!
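A quick sketch of that overfitting story, on an illustrative dataset: a fully grown decision tree versus a random forest, compared on held-out data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)               # unpruned tree
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# The unpruned tree typically scores ~1.0 on training data and drops on test data;
# the forest trades single-tree interpretability for better generalization.
print("tree   train/test: %.3f / %.3f" % (tree.score(X_tr, y_tr), tree.score(X_te, y_te)))
print("forest train/test: %.3f / %.3f" % (forest.score(X_tr, y_tr), forest.score(X_te, y_te)))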
Summarizing Decision Trees:

Decision Trees Pros:

  • Intuitive Decision Rules
  • Can handle non-linear features
  • Take into account variable interactions

Decision Trees Cons:

  • Prone to overfitting the training set [Random Forests to your rescue]
  • No ranking score as a direct result
[Image: Decision trees]
Now to Support Vector Machines. The best thing about support vector machines is that they rely on boundary cases to build the much-needed separating curve. They can handle non-linear decision boundaries, as we saw earlier. Reliance on boundary cases also enables them to handle missing data for "obvious" cases. SVMs can handle large feature spaces, which makes them one of the favorite algorithms for text analysis, where you almost always end up with a huge number of features and logistic regression is not a very good choice.
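A sketch of the text-analysis case, assuming scikit-learn: TF-IDF features produce a very high-dimensional space, which a linear SVM handles comfortably (the two newsgroup categories chosen here are illustrative, and the dataset is downloaded on first use):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

# TF-IDF turns raw text into tens of thousands of sparse features; a linear SVM copes well.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(clf, data.data, data.target, cv=5)
print("linear SVM on TF-IDF features: %.3f" % scores.mean())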
Results of SVMs are not as intuitive as decision trees for a layman. With non-linear kernels, SVMs can be very costly to train on huge data sets. In summary:

SVM Pros:

  • Can handle large feature space
  • Can handle non-linear feature interactions
  • Do not rely on entire data

SVM Cons:

  • Not very efficient with a large number of observations
  • It can sometimes be tricky to find an appropriate kernel
I have tried to compile a simple workflow for you to decide which algorithm to use out of these three (a rough code sketch follows the list):
  • Always start with logistic regression; if nothing else, use its performance as a baseline
  • See if decision trees (Random Forests) provide a significant improvement. Even if you do not end up using the resultant model, you can use the random forest results to remove noisy variables
  • Go for SVM if you have a large number of features and the number of observations is not a limitation for the available resources and time
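Here is the rough code sketch of that workflow, assuming scikit-learn and an illustrative dataset; in a real project you would plug in your own data, metric, and tuning grids:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "1. logistic regression (baseline)": make_pipeline(StandardScaler(),
                                                       LogisticRegression(max_iter=5000)),
    "2. random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "3. SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("%-35s %.3f +/- %.3f" % (name, scores.mean(), scores.std()))

# The random forest's feature_importances_ can also be used to flag and drop
# noisy variables, as suggested in step 2 above.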
[Image: Support vector machines]
At the end of the day, remember that good data beats any algorithm anytime. Always see if you can engineer a good feature by using your domain knowledge. Try various iterations of your ideas while experimenting with feature creation. Another thing to try with efficient computing infra available these days is to use ensembles of multiple models. We’ll discuss them next, so, stay tuned!
