Hands-On Neural Networks with Keras

From data science to ML

Pick up any book on data science and there is a fair chance that you will come across an elaborate explanation involving the intersection of fields such as statistics and computer science, along with some domain knowledge. As you flip through the pages, you will notice some nice visualizations: graphs, bar charts, the works. You will be introduced to statistical models, significance tests, data structures, and algorithms, each producing impressive results for some demonstrative use case. Yet none of this is data science itself. These are indeed the very tools you will use as a successful data scientist, but the essence of data science can be summarized far more simply: data science is the scientific domain that deals with generating actionable knowledge from raw data. This is done by iteratively observing a real-world problem, quantifying the overall phenomenon along different dimensions, or features, and predicting future outcomes that permit desired ends to be achieved. ML is simply the discipline of teaching machines to do data science.

While some computer scientists may appreciate this recursive definition, some of you may wonder what is meant by quantifying a phenomenon. Well, you see, most observations in the real world, be it the amount of food you eat, the kind of programs you watch, or the colors you like on your clothes, can all be defined as (approximate) functions of some other quasi-dependent features. For example, the amount of food you eat on any given day can be defined as a function of other things, such as how much you ate at your previous meal, your general inclination toward certain types of food, or even the amount of physical exertion you get.

Similarly, the kind of programs you like to watch may be approximated by features such as your personality traits, interests, and the amount of free time in your schedule. Reductively speaking, we work with quantifying and representing differences between observations (for example, the viewing habits between people), to deduce a functional predictive rule that machines may work with.

We induce these rules by defining the possible outcomes that we are trying to predict (for example, whether a given person likes comedies or thrillers) as a function of input features (for example, how that person scores on the Big Five personality test) that we collect when observing a phenomenon at large (such as the personalities and viewing habits of a population).
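The idea of expressing an outcome as a function of input features can be sketched in a few lines of Python. Everything here is a toy assumption: the feature encoding, the weights, and the decision threshold are made up purely for illustration; in practice, such a function would be learned from observed data rather than hand-written:

```python
import numpy as np

# Hypothetical sketch: predict a viewer's genre preference as a function of
# their Big Five personality scores. The weights are illustrative only, not
# learned from any real data.
def predict_preference(big_five):
    """big_five: [openness, conscientiousness, extraversion,
                  agreeableness, neuroticism], each scored 0..1."""
    weights = np.array([0.6, -0.1, 0.8, 0.2, -0.5])  # made-up weights
    score = np.dot(weights, big_five)
    return "comedy" if score > 0.5 else "thriller"

# A highly open, extraverted viewer maps to "comedy" under these toy weights
print(predict_preference(np.array([0.9, 0.4, 0.9, 0.5, 0.2])))  # → comedy
```

A learned model replaces the hand-picked weights with values fitted to observations, but the shape of the problem is the same: inputs (features) in, predicted outcome out.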

If you have selected the right set of features, you will be able to derive a function that reliably predicts the output classes you are interested in (in our case, viewer preferences). What do I mean by the right features? Well, it stands to reason that viewing habits have more to do with a person's personality traits than with, say, their travel habits. Predicting whether someone is inclined towards horror movies as a function of their eye color and real-time GPS coordinates will be quite useless, as those features are not informative about what we are trying to predict. Hence, we always choose relevant features (through domain knowledge or significance tests) to reductively represent a real-world phenomenon. We then use this representation to predict the future outcomes we are interested in. This representation itself is what we call a predictive model.
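One simple way to see why an irrelevant feature is useless is to measure how strongly each candidate feature varies with the outcome. The sketch below uses entirely synthetic data under an assumed generating rule (extraversion drives comedy preference; eye color is pure noise), so the numbers mean nothing beyond illustrating the relevance check itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, assumed data: extraversion influences comedy preference,
# eye color does not. Both are invented for this illustration.
extraversion = rng.uniform(0, 1, n)
eye_color = rng.integers(0, 3, n).astype(float)  # irrelevant feature
likes_comedy = (extraversion + rng.normal(0, 0.2, n) > 0.5).astype(float)

# A crude relevance check: absolute correlation with the target
r_relevant = abs(np.corrcoef(extraversion, likes_comedy)[0, 1])
r_irrelevant = abs(np.corrcoef(eye_color, likes_comedy)[0, 1])
print(r_relevant, r_irrelevant)  # the relevant feature scores far higher
```

Correlation is only one of many relevance measures (significance tests, mutual information, and domain knowledge all serve the same purpose), but the principle holds: features that carry no information about the outcome contribute nothing to the predictive model.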