Item-based Collaborative Filtering 1 (Distributed Version Using Mrjob/MapReduce)


Recommender System Practice (推荐系统实践), by Xiang Liang (项亮)





The growth of the Internet has made it much more difficult to effectively extract useful information from all the available online information. The overwhelming amount of data necessitates mechanisms for efficient information filtering. One of the techniques used for dealing with this problem is called collaborative filtering.

The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with similar tastes to themselves. Collaborative filtering explores techniques for matching people with similar interests and making recommendations on this basis.

Item-based Collaborative Filtering

Item-based collaborative filtering (users who bought x also bought y) proceeds in an item-centric manner:

1) Build an item-item matrix determining relationships between pairs of items

2) Infer the tastes of the current user by examining the matrix and matching that user’s data
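
The two steps above can be sketched in plain, single-machine Python. This is only an illustration under assumptions: the toy `user_items` data, the function names, and the choice of cosine similarity on co-occurrence counts are all hypothetical and not taken from the article's own code.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: user -> set of items the user interacted with.
user_items = {
    "u1": {"a", "b", "d"},
    "u2": {"b", "c", "e"},
    "u3": {"c", "d"},
    "u4": {"b", "c", "d"},
    "u5": {"a", "d"},
}

def item_similarity(user_items):
    """Step 1: build the item-item matrix (cosine similarity on co-occurrence)."""
    popularity = defaultdict(int)                      # N(i): users who have item i
    co_counts = defaultdict(lambda: defaultdict(int))  # C[i][j]: users who have both
    for items in user_items.values():
        for i in items:
            popularity[i] += 1
        for i, j in combinations(sorted(items), 2):
            co_counts[i][j] += 1
            co_counts[j][i] += 1
    return {
        i: {j: c / sqrt(popularity[i] * popularity[j]) for j, c in row.items()}
        for i, row in co_counts.items()
    }

def recommend(user, user_items, sim, top_n=3):
    """Step 2: score unseen items by their similarity to the user's own items."""
    seen = user_items[user]
    scores = defaultdict(float)
    for i in seen:
        for j, w in sim.get(i, {}).items():
            if j not in seen:
                scores[j] += w
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

sim = item_similarity(user_items)
recs = recommend("u1", user_items, sim)
print(recs)
```

The matrix is built once from all users, so recommending for a single user only needs lookups in the rows of the items that user already has.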


This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.

Users were selected at random for inclusion. All users selected had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in three files, movies.dat, ratings.dat and tags.dat. Also included are scripts for generating subsets of the data to support five-fold cross-validation of rating predictions. More details about the contents and use of all these files follow.


This implementation uses ratings.dat from the MovieLens data set as the input file.
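
For orientation, each line of ratings.dat in this data set has the form `UserID::MovieID::Rating::Timestamp`. A minimal parsing sketch follows; the helper name `parse_rating` and the sample line are illustrative assumptions, not part of the original implementation.

```python
def parse_rating(line):
    """Split one ratings.dat line into (user_id, movie_id, rating, timestamp)."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    # Ratings use half-star increments, so keep them as floats.
    return user_id, movie_id, float(rating), int(timestamp)

# Made-up sample line in the ratings.dat format.
sample = "1::122::5::838985046\n"
parsed = parse_rating(sample)
print(parsed)
```

The ids are left as strings here because they are only used as keys; converting them to integers would work equally well.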

The first step sums up the item popularity.
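
A minimal sketch of such a popularity count, written as a mapper/reducer pair. To keep it runnable without a cluster or the mrjob package, the functions are plain generators and `run_step` is a hypothetical local stand-in for one MapReduce round; with mrjob, the same logic would typically live in the `mapper` and `reducer` methods of an `MRJob` subclass. The sample lines are made up.

```python
from collections import defaultdict

def mapper(_, line):
    # Emit (movie_id, 1) for every rating line.
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    yield movie_id, 1

def reducer(movie_id, counts):
    # Popularity of an item = number of ratings it received.
    yield movie_id, sum(counts)

def run_step(records, mapper, reducer):
    """Local stand-in for one MapReduce round: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out

# Made-up sample lines in the ratings.dat format.
lines = [
    "1::122::5::838985046",
    "1::185::5::838983525",
    "2::122::3::868245777",
    "2::185::4::868245608",
    "2::231::3::868245645",
    "3::185::4::1136075500",
]
popularity = dict(run_step(((None, l) for l in lines), mapper, reducer))
print(popularity)
```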

The next step calculates the correlations among items.
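
One way to compute such correlations in several MapReduce rounds is sketched below. It is an assumption-laden illustration rather than the article's code: the function names are hypothetical, `run_step` simulates a round locally, and the measure is the simple asymmetric ratio C(i, j) / N(i) (co-occurrence count divided by the popularity of item i), chosen because each row can be normalized independently. Rounds 2 and 3 define only a reducer and rely on a pass-through mapper; with mrjob, each round would typically be an `MRStep` returned from `steps()`.

```python
from collections import defaultdict
from itertools import permutations

def identity_mapper(key, value):
    # Pass-through mapper for rounds that only define a reducer.
    yield key, value

def run_step(records, mapper, reducer):
    """Local stand-in for one MapReduce round: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out

# Round 1: group ratings by user, then emit every ordered item pair.
def map_user_item(_, line):
    user_id, movie_id, _rating, _ts = line.strip().split("::")
    yield user_id, movie_id

def reduce_emit_pairs(user_id, movie_ids):
    items = sorted(set(movie_ids))
    for i in items:
        yield (i, i), 1              # diagonal entry accumulates N(i)
    for i, j in permutations(items, 2):
        yield (i, j), 1              # off-diagonal entry accumulates C(i, j)

# Round 2: sum the counts of each pair and re-key by the first item.
def reduce_sum_pairs(pair, counts):
    i, j = pair
    yield i, (j, sum(counts))

# Round 3: normalize every row i by its own popularity N(i).
def reduce_normalize(i, entries):
    entries = list(entries)
    n_i = next(c for j, c in entries if j == i)
    for j, c in sorted(entries):
        if j != i:
            yield (i, j), c / n_i

# Made-up sample lines in the ratings.dat format.
lines = [
    "1::122::5::838985046",
    "1::185::5::838983525",
    "2::122::3::868245777",
    "2::185::4::868245608",
    "2::231::3::868245645",
    "3::185::4::1136075500",
]
step1 = run_step(((None, l) for l in lines), map_user_item, reduce_emit_pairs)
step2 = run_step(step1, identity_mapper, reduce_sum_pairs)
similarity = dict(run_step(step2, identity_mapper, reduce_normalize))
print(similarity)
```

Each reducer's output becomes the next round's input, which is why a round can consist of a reducer alone: the framework only needs to shuffle the previous output by key before reducing again.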


2 comments to this article

  1. daniel

    on April 26, 2015 at 2:04 pm

    I am new to Python and recently read your article about item-based CF, but I really don't understand reduce3(): where do its key and value come from without a mapper()? Could you please tell me?
    Thanks a lot!

    • yekeren

      on July 22, 2015 at 3:29 pm

      They come from the key and value generated in reduce2(); a default mapper() is then added to the 3rd round (like the Linux pipeline command ‘cat’). In the next step, keys are shuffled for reduce3().
