Machine Learning for Hackers
Format: PDF / Kindle (mobi) / ePub
If you’re an experienced programmer interested in crunching data, this book will get you started with machine learning—a toolkit of algorithms that enables computers to train themselves to automate useful tasks. Authors Drew Conway and John Myles White help you understand machine learning and statistics tools through a series of hands-on case studies, instead of a traditional math-heavy presentation.
Each chapter focuses on a specific problem in machine learning, such as classification, prediction, optimization, and recommendation. Using the R programming language, you’ll learn how to analyze sample datasets and write simple machine learning algorithms. Machine Learning for Hackers is ideal for programmers from any background, including business, government, and academic research.
- Develop a naïve Bayesian classifier to determine if an email is spam, based only on its text
- Use linear regression to predict the number of page views for the top 1,000 websites
- Learn optimization techniques by attempting to break a simple letter cipher
- Compare and contrast U.S. Senators statistically, based on their voting records
- Build a “whom to follow” recommendation system from Twitter data
date.strings <- strftime(date.range, "%Y-%m")

With the new date.strings vector, we need to create a new data frame that has all year-months and states. We will use this to perform the matching with the UFO sighting data. As before, we will use the lapply function to create the columns and the do.call function to convert this to a matrix and then a data frame:

states.dates <- lapply(us.states, function(s) cbind(s, date.strings))
states.dates <- data.frame(do.call(rbind, states.dates),
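The excerpt is cut off mid-call, so the following is only a minimal, self-contained sketch of the same lapply / do.call(rbind, ...) pattern. The toy us.states and date.range values, the stringsAsFactors argument, and the column names state and year.month are assumptions made for illustration, not taken from the excerpt.

```r
# Hypothetical stand-ins for the excerpt's us.states and date.range
us.states <- c("ak", "al", "ar")
date.range <- seq.Date(from = as.Date("1990-01-01"),
                       to = as.Date("1990-04-01"),
                       by = "month")

# One "YYYY-MM" string per month in the range
date.strings <- strftime(date.range, "%Y-%m")

# For each state, build a two-column character matrix pairing it with every month
states.dates <- lapply(us.states, function(s) cbind(s, date.strings))

# Stack the per-state matrices, then convert the result to a data frame
# (stringsAsFactors = FALSE is an assumption; it keeps the columns as plain strings)
states.dates <- data.frame(do.call(rbind, states.dates),
                           stringsAsFactors = FALSE)

# Illustrative column names
names(states.dates) <- c("state", "year.month")

head(states.dates)
```

The result has one row per state and month pair (3 states by 4 months here), which is the shape needed to match against per-state monthly counts.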
listed in Table 3-2.

Table 3-2. Testing our classifier against “hard ham”

Email type    Number classified as ham    Number classified as spam
Hard ham      184                         65

Congratulations! You’ve written your first classifier, and it did fairly well at identifying hard ham as nonspam. In this case, we have approximately a 26% false-positive rate. That is, about one-quarter of the hard ham emails are incorrectly identified as spam. You may think this is poor performance, and in production we would not want to
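The 26% figure can be checked directly from the two counts in Table 3-2. The variable names in this short sketch are illustrative only.

```r
# Counts from Table 3-2: hard ham emails by predicted class
n.classified.ham <- 184   # correctly identified as ham
n.classified.spam <- 65   # incorrectly identified as spam (false positives)

# False-positive rate: share of true ham emails that were labeled spam
false.positive.rate <- n.classified.spam / (n.classified.ham + n.classified.spam)

round(false.positive.rate, 3)
# 0.261
```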
subjects and bodies of emails received by a user, then future emails that contain these terms in the subject and body may be more important than those that do not. This is actually a common technique, and it is mentioned briefly in the description of Google’s priority inbox. By adding content features based on terms for both the email subject and body, we will encounter an interesting problem of weighting. Typically, there are considerably fewer terms in an email’s subject than the body;
complete introduction to the R programming language. As you might expect, no such introduction could fit into a single book chapter. Instead, this chapter is meant to prepare the reader for the tasks associated with doing machine learning in R, specifically the process of loading, exploring, cleaning, and analyzing data. There are many excellent resources on R that discuss language fundamentals such as data types, arithmetic concepts, and coding best practices. In so far as those topics are
package:

set.seed(1)
library('glmnet')

Having done that setup work, we can loop over several possible values for Lambda to see which gives the best results on held-out data. Because we don’t have a lot of data, we do this split 50 times for each value of Lambda to get a better sense of the accuracy we get from different levels of regularization. In the following code, we set a value for Lambda, split the data into a training set and test set 50 times, and then assess our model’s performance on
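The excerpt ends before the code it describes, so what follows is not the authors' code. It is a rough sketch of the general procedure (try several Lambda values, repeat a random train/test split 50 times for each, and average the held-out error). The synthetic data, the 80/20 split, the candidate Lambda values, and the choice of a linear model scored by RMSE are all assumptions made for illustration.

```r
library('glmnet')
set.seed(1)

# Synthetic regression data (assumption; the excerpt's dataset is not shown)
n <- 200
p <- 20
x <- matrix(rnorm(n * p), nrow = n)
y <- as.numeric(x[, 1] - 2 * x[, 2] + rnorm(n))

lambdas <- c(0.001, 0.01, 0.1, 1)  # hypothetical candidate values
n.splits <- 50

performance <- data.frame()

for (lambda in lambdas) {
  for (i in 1:n.splits) {
    # Random 80/20 split into training and test rows
    train.idx <- sample(1:n, round(0.8 * n))

    # Fit the regularization path on the training rows only
    fit <- glmnet(x[train.idx, ], y[train.idx])

    # Predict held-out rows at this particular Lambda
    pred <- predict(fit, newx = x[-train.idx, ], s = lambda)

    # Held-out root mean squared error for this split
    rmse <- sqrt(mean((y[-train.idx] - pred)^2))

    performance <- rbind(performance,
                         data.frame(Lambda = lambda, Iteration = i, RMSE = rmse))
  }
}

# Average held-out error per Lambda; lower is better
mean.rmse <- aggregate(RMSE ~ Lambda, data = performance, FUN = mean)
mean.rmse
```

Averaging over many random splits reduces the noise that any single split introduces when the dataset is small, which is the motivation the passage gives for repeating the split 50 times.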