Machine Learning with Apache Mahout: The Lay of the Land

By admin,


ORIGINAL: Dr. Dobbs

Mahout greatly simplifies extracting recommendations and relationships from input datasets. Here we look at setting up Mahout and running its recommender on a small data sample.

Building intelligent applications that learn from user input and the data they process is becoming a popular requirement, and these applications require machine learning techniques.

Apache Mahout is a highly scalable machine learning library that enables developers to use optimized algorithms, such as collaborative filtering and random forest decision-tree-based classifiers. As such, Apache Mahout is becoming one of the most popular libraries for machine-learning projects. In this first of a pair of articles, I’ll explain how to create a Mahout recommender by taking advantage of one of its collaborative filtering algorithms.

Working with Collaborative Filtering Recommenders

If you have visited e-commerce or social network websites, you’ve probably seen a recommender engine in action. Recommender engines try to infer tastes and preferences for a user based on his or her past actions and similarities to other users. In addition, recommender engines try to identify unknown items that might be of interest to users.

People follow patterns in what they like and dislike. For example, people tend to like things that are similar to other things they like, and they tend to like things that similar people like. Recommendation algorithms use these patterns to predict likes and dislikes. It is possible to generate recommendations based on either users or items.

Apache Mahout implements a wide range of machine-learning and data-mining algorithms. However, Mahout has a specific focus on collaborative filtering (recommender engines), clustering, and classification. Here, I’ll focus on one of the recommender engines that Mahout includes out of the box. For reference, you can peruse the official instructions to check out the code and build the latest Mahout version.

Collaborative filtering recommenders, such as the one I’m going to look at, require you to specify a relationship between the users and the items. The collaborative filtering recommender engine doesn’t need to know details about the properties of each item to produce a recommendation. Mahout provides a collaborative filtering framework that enables you to supply a simple input and generate recommendations based on this input. In addition, you can build a domain-specific content-based recommender that considers the specific attributes of either the items or the users on top of the framework that Mahout provides.

A small database with relationships between users and items makes it easy to understand how collaborative filtering recommenders work in Mahout. Consider six users with the IDs 1001 through 1006.

Find the pom.xml file within the project’s root folder (shown at the bottom of the file panel in Figure 2). The following lines show the initial content of this file, with just a junit dependency. In my case, the Mahout version is 0.8. If you are working with a different Mahout version, a different value will appear for version.

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>0.8</version>
  </parent>
  <groupId>com.first</groupId>
  <artifactId>firstrecommender</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>firstrecommender</name>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Because you want to use the different Mahout libraries to create a recommender, it is necessary to include all the dependencies in the pom.xml file. The following lines show the new dependencies in the edited pom.xml, which add these Mahout libraries:

  • mahout-core,
  • mahout-math, and
  • mahout-utils.

In addition, there is a value specified for parent/relativePath to set the relative path to the parent project.

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout</artifactId>
    <version>0.8</version>
    <relativePath>../pom.xml</relativePath>
  </parent>
  <groupId>com.first</groupId>
  <artifactId>firstrecommender</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>firstrecommender</name>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-math</artifactId>
      <version>0.8</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Working with a Generic-User-Based Recommender

The org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender class implements a recommender that uses a DataModel and a UserNeighborhood to produce recommendations. The org.apache.mahout.cf.taste.model.DataModel implementations represent a repository of information about users and their associated preferences for items. I will use the CSV file created above as the DataModel.

The org.apache.mahout.cf.taste.neighborhood.UserNeighborhood implementations compute a neighborhood of users similar to a given user, and the recommender engine can use this neighborhood to compute recommendations.


The six users in the sample data have the following IDs:

  • 1001
  • 1002
  • 1003
  • 1004
  • 1005
  • 1006
Each user has one or more scores that indicate their preference for each item ID. The score is a value from 1 to 10. The item IDs start with a 9 prefix to easily differentiate them from the user IDs. Figure 1 shows the six users (blue circles) with relationships to the different items (orange circles) and the score values represented by lines with different colors according to the following ranges:
  • Score value from 1 to 4: the user dislikes the item (red solid line).
  • Score value from 5 to 7: the user likes the item, but isn’t excited about it and has some criticisms (red dashed line).
  • Score value from 8 to 10: the user really likes the item (green line).

 

Figure 1: Six users (1001-1006) and how much they like items (9001-9015).

You can see that user 1001 very much likes items 9001 and 9003, but this user doesn’t like item 9002. Based on the preferences of other users that have similar tastes to user 1001, I want to know the best items to recommend to user 1001.

The following data listing shows the contents of a comma-separated values (CSV) file that defines the input data represented in Figure 1, with the user IDs, the item IDs, and the score values. You should create a text file named dataset1.csv because you will use it later.

#userId, itemId, score
1001,9001,10
1001,9002,1
1001,9003,9
1002,9001,3
1002,9002,5
1002,9003,1
1002,9004,10
1003,9001,2
1003,9002,6
1003,9003,2
1003,9004,9
1003,9005,10
1003,9006,8
1003,9007,9
1004,9001,9
1004,9002,2
1004,9003,8
1004,9004,3
1004,9010,10
1004,9011,9
1004,9012,8
1005,9001,8
1005,9002,3
1005,9003,7
1005,9004,1
1005,9010,9
1005,9011,10
1005,9012,9
1005,9013,8
1005,9014,1
1005,9015,1
1006,9001,7
1006,9002,4
1006,9003,8
1006,9004,1
1006,9010,7
1006,9011,6
1006,9012,9


It is possible to use a CSV file as the input for a Mahout recommender engine and generate a specific number of recommendations for one of the users with just a few lines of code.
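
To make the file layout concrete, here is a minimal plain-Java sketch that reads userId,itemId,score lines into a nested map. It is only an illustration of the data shape: Mahout does its own parsing, and the class name PreferenceCsvSketch and its parse method are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: parses "userId,itemId,score" lines.
class PreferenceCsvSketch {

    // Returns userId -> (itemId -> score), keeping the order of the input lines.
    static Map<Long, Map<Long, Float>> parse(List<String> lines) {
        Map<Long, Map<Long, Float>> preferences = new LinkedHashMap<>();
        for (String rawLine : lines) {
            String line = rawLine.trim();
            // This sketch skips blank lines and lines starting with '#',
            // such as the "#userId, itemId, score" header of dataset1.csv.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split(",");
            long userId = Long.parseLong(fields[0].trim());
            long itemId = Long.parseLong(fields[1].trim());
            float score = Float.parseFloat(fields[2].trim());
            preferences
                .computeIfAbsent(userId, id -> new LinkedHashMap<>())
                .put(itemId, score);
        }
        return preferences;
    }

    public static void main(String[] args) {
        // The first few lines of dataset1.csv
        List<String> sample = Arrays.asList(
            "#userId, itemId, score",
            "1001,9001,10",
            "1001,9002,1",
            "1001,9003,9",
            "1002,9001,3");
        Map<Long, Map<Long, Float>> preferences = parse(sample);
        // Print the item-to-score map for user 1001
        System.out.println(preferences.get(1001L));
    }
}
```

The nested map mirrors the idea behind the input: a collaborative filtering recommender only needs to know which user gave which score to which item, not anything about the items themselves.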




Creating a New Mahout Project with Maven and Eclipse

Mahout requires both Maven and a Java JDK, and I assume you’ve already built Mahout and that you have Maven installed. Follow these steps to create a new Mahout project with Maven. I’ve also added the necessary steps to work with the Eclipse IDE. You can skip the steps related to Eclipse if you are using another IDE.


  • Open a command prompt or console in your operating system.
  • Go to your Mahout folder.
  • Run the Maven command to create an empty project named firstrecommender with the package namespace com.first: mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.first -DartifactId=firstrecommender
  • Go to the firstrecommender folder that Maven has created for you with the new project.
  • Execute the mvn compile Maven command to build the recently created project, which contains some code to display Hello world!
  • Now, execute the mvn exec:java -Dexec.mainClass="com.first.App" Maven command to run the built project. Notice that the main class is com.first.App. This class has a main method with a single line of code: System.out.println( "Hello World!" );.


Of course, a project that displays a Hello World! message isn’t our goal. But you can use it as a template to start working with the different Mahout libraries. Import the Maven project into Eclipse or your favorite IDE to see the structure of the project (see Figure 2). The src/main/java folder includes the com.first.App.java file with the com.first.App class.


Figure 2: The initial structure for the generated project in Eclipse.
First, I need to add the CSV file to the project. Create a new data subdirectory within the root folder and add the previously saved dataset1.csv file in this new subdirectory. Then, add a new Java class named GenericUserBasedRecommender1 in the src/main/java folder and include it in the com.first package. The following lines show the code for GenericUserBasedRecommender1.java:


package com.first;

import java.io.*;
import java.util.*;
import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;

public class GenericUserBasedRecommender1 {
    public static void main(String[] args) throws Exception {
        // Create a data source from the CSV file
        File userPreferencesFile = new File("data/dataset1.csv");
        DataModel dataModel = new FileDataModel(userPreferencesFile);

        // Measure how similar two users are with the Pearson correlation
        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
        // Build a neighborhood with the 2 users nearest to a given user
        UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel);

        // Create a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity
        Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

        // Generate a list of 3 recommended items for user 1001
        List<RecommendedItem> itemRecommendations = genericRecommender.recommend(1001, 3);

        // Display the item recommendations generated by the recommendation engine
        for (RecommendedItem recommendedItem : itemRecommendations) {
            System.out.println(recommendedItem);
        }
    }
}

Execute the mvn compile Maven command to rebuild the recently modified project.

Then, execute the

mvn exec:java -Dexec.mainClass="com.first.GenericUserBasedRecommender1"

Maven command to run the built project. Notice that the main class is now com.first.GenericUserBasedRecommender1.

The following lines show the last lines of the output generated by the execution.
RecommendedItem[item:9010, value:9.500863]
RecommendedItem[item:9011, value:9.499137]
RecommendedItem[item:9012, value:8.499137]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.883s
[INFO] Finished at: Wed Oct 16 23:25:14 PST 2013
[INFO] Final Memory: 14M/154M
[INFO] ------------------------------------------------------------------------
The code is easy to understand and uses many Mahout classes to
recommend the following three items to user 1001 with different score
values:
  • Item 9010 with a value of 9.500863.
  • Item 9011 with a value of 9.499137.
  • Item 9012 with a value of 8.499137.
Thus, the first item that the recommender engine would suggest to user 1001 based on the preferences of similar users (neighbors) is item 9010, with the highest value of 9.500863.

How the Calculation Works

The code in the GenericUserBasedRecommender1.main method creates a data source from the data/dataset1.csv CSV file. The org.apache.mahout.cf.taste.impl.model.file.FileDataModel constructor receives the File instance containing the preferences data.

Then, the code uses the FileDataModel instance to create an instance of the org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity class. This class provides an implementation of the Pearson correlation. For example, for two users, named user1 and user2, PearsonCorrelationSimilarity calculates the following values:
  • sumSquareUser1: Sum of the square of all the preference values for user1.
  • sumSquareUser2: Sum of the square of all the preference values for user2.
  • sumUser1XUser2: Sum of the product of the preference values for user1 and user2, for all the items that include preferences from both users.
Then, PearsonCorrelationSimilarity calculates the correlation with the following formula: sumUser1XUser2 / sqrt(sumSquareUser1 * sumSquareUser2).
This correlation centers the data: it shifts each user’s preference values so that their mean is 0. With centered data, the correlation is equivalent to the cosine similarity, so you can interpret it as the cosine of the angle between the two vectors formed by the users’ preference values.
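
The formula can be tried out by hand with a few lines of plain Java. The following is a minimal sketch, not Mahout's implementation: the PearsonSketch class and its pearson method are hypothetical names, and the sketch assumes that only the items scored by both users (9001, 9002, and 9003 for user 1001 and its candidates) enter the sums.

```java
// Hypothetical illustration of the centered correlation; not Mahout code.
class PearsonSketch {

    // user1Scores[i] and user2Scores[i] are the two users' scores for the same item.
    static double pearson(double[] user1Scores, double[] user2Scores) {
        int n = user1Scores.length;
        double meanUser1 = 0.0;
        double meanUser2 = 0.0;
        for (int i = 0; i < n; i++) {
            meanUser1 += user1Scores[i] / n;
            meanUser2 += user2Scores[i] / n;
        }
        double sumUser1XUser2 = 0.0;
        double sumSquareUser1 = 0.0;
        double sumSquareUser2 = 0.0;
        for (int i = 0; i < n; i++) {
            // Center each score so that every user's mean becomes 0
            double centered1 = user1Scores[i] - meanUser1;
            double centered2 = user2Scores[i] - meanUser2;
            sumUser1XUser2 += centered1 * centered2;
            sumSquareUser1 += centered1 * centered1;
            sumSquareUser2 += centered2 * centered2;
        }
        double denominator = Math.sqrt(sumSquareUser1 * sumSquareUser2);
        // A user whose scores are all equal has no direction to compare
        return denominator == 0.0 ? Double.NaN : sumUser1XUser2 / denominator;
    }

    public static void main(String[] args) {
        // Scores for items 9001, 9002, 9003 taken from dataset1.csv
        double[] user1001 = {10, 1, 9};
        double[] user1004 = {9, 2, 8};
        double[] user1002 = {3, 5, 1};
        System.out.println("1001 vs 1004: " + pearson(user1001, user1004));
        System.out.println("1001 vs 1002: " + pearson(user1001, user1002));
    }
}
```

Under these assumptions, user 1004 comes out with a correlation close to 1 (very similar tastes to user 1001), while user 1002 comes out negative (roughly opposite tastes).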

Next, the code uses the FileDataModel and the PearsonCorrelationSimilarity instances to create an instance of the org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood class. This class computes a neighborhood consisting of the two nearest users to a given user because the n argument that defines the neighborhood size is set to 2. There are many other constructors for this class that allow you to specify values for additional arguments.
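
The idea of a nearest-N neighborhood can be sketched in plain Java as "sort the other users by similarity and keep the top n". This is an illustration only, not how Mahout implements it; NearestNeighborsSketch and nearest are hypothetical names, and the similarity values in main are rounded figures I worked out by hand from dataset1.csv for user 1001, so treat them as approximate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a nearest-N neighborhood; not Mahout code.
class NearestNeighborsSketch {

    // similarityToTarget maps each candidate user ID to its similarity with the target user.
    static List<Long> nearest(int n, Map<Long, Double> similarityToTarget) {
        List<Map.Entry<Long, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarityToTarget.entrySet()) {
            // Ignore users whose similarity could not be computed
            if (!Double.isNaN(entry.getValue())) {
                candidates.add(entry);
            }
        }
        // Highest similarity first
        candidates.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<Long> neighbors = new ArrayList<>();
        for (int i = 0; i < Math.min(n, candidates.size()); i++) {
            neighbors.add(candidates.get(i).getKey());
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Approximate, hand-computed similarities between user 1001 and the others
        Map<Long, Double> similarityTo1001 = new LinkedHashMap<>();
        similarityTo1001.put(1002L, -0.81);
        similarityTo1001.put(1003L, -0.99);
        similarityTo1001.put(1004L, 0.9995);
        similarityTo1001.put(1005L, 0.9961);
        similarityTo1001.put(1006L, 0.94);
        // With n = 2, the two most similar users form the neighborhood
        System.out.println(nearest(2, similarityTo1001));
    }
}
```

With n set to 2, as in the article's code, the neighborhood in this sketch consists of users 1004 and 1005.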

The code creates a generic-user-based recommender (org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender) instance with the FileDataModel, the NearestNUserNeighborhood, and the PearsonCorrelationSimilarity instances. Then, it is simply necessary to call the recommend method for the new GenericUserBasedRecommender instance with the user ID and the desired number of recommendations to generate. This method returns a List<org.apache.mahout.cf.taste.recommender.RecommendedItem>. Each RecommendedItem instance encapsulates a recommended item and includes the item ID (recommendedItem.getItemID()) and a float value (recommendedItem.getValue()) that expresses the strength of the preference. A simple for loop displays each RecommendedItem in the console.
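
One way to see where a value such as 9.500863 can come from is to average the neighbors' scores for an item, weighting each score by that neighbor's similarity to the target user. The sketch below does exactly that in plain Java. It is an assumption about the arithmetic, not a description of Mahout's internals; WeightedEstimateSketch and estimate are hypothetical names, and the two similarity values are rounded figures I computed by hand for users 1004 and 1005.

```java
// Hypothetical illustration of a similarity-weighted estimate; not Mahout code.
class WeightedEstimateSketch {

    // neighborSimilarities[i] is the similarity of neighbor i to the target user,
    // neighborScores[i] is that neighbor's score for the item being estimated.
    static double estimate(double[] neighborSimilarities, double[] neighborScores) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < neighborSimilarities.length; i++) {
            weightedSum += neighborSimilarities[i] * neighborScores[i];
            totalSimilarity += neighborSimilarities[i];
        }
        // Without any weight there is nothing to average
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        // Approximate similarities of users 1004 and 1005 to user 1001
        double[] similarities = {0.999522, 0.996078};
        // Scores that users 1004 and 1005 gave to item 9010 in dataset1.csv
        double[] scoresFor9010 = {10, 9};
        System.out.println("Estimate for item 9010: " + estimate(similarities, scoresFor9010));
    }
}
```

With these inputs the estimate lands at about 9.5009, which is consistent with the 9.500863 printed for item 9010 in the output above. Because user 1004 is slightly more similar to user 1001 than user 1005 is, the estimate sits just above the plain average of 9.5.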

This shows how you can use one of the Mahout recommender engines with just a few lines of code. In my example, the code uses a simple CSV file as the data source, but it is just as easy to work with larger and more complex data sources. In addition, several Mahout features run on top of Apache Hadoop and take advantage of its great scalability. In the next article, I’ll discuss more-advanced machine learning algorithms included in Apache Mahout, which you can also use with just a few lines of code.

Gaston Hillar is a frequent contributor to Dr. Dobb’s.
