Item-based Collaborative Filtering 1 (Distributed Version Using Mrjob/MapReduce)

Reference

推荐系统实践 (Recommender System Practice), by 项亮 (Xiang Liang)

MovieLens http://grouplens.org/datasets/movielens/

Wikipedia http://en.wikipedia.org/wiki/Collaborative_filtering

Mrjob https://github.com/Yelp/mrjob

Introduction

The growth of the Internet has made it much more difficult to effectively extract useful information from all the available online information. The overwhelming amount of data necessitates mechanisms for efficient information filtering. One of the techniques used for dealing with this problem is called collaborative filtering.

The motivation for collaborative filtering comes from the idea that people often get the best recommendations from someone with similar tastes to themselves. Collaborative filtering explores techniques for matching people with similar interests and making recommendations on this basis.

Item-based Collaborative Filtering

Item-based collaborative filtering, invented by Amazon.com ("users who bought x also bought y"), proceeds in an item-centric manner:

1) Build an item-item matrix that captures the relationships between pairs of items

2) Infer the tastes of the current user by looking up that user's data in the matrix
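The two steps above can be sketched on a single machine before worrying about distribution. The snippet below is a minimal illustration, not the article's implementation: the toy `user_items` data, the function names, and the use of raw co-occurrence counts as the item-item relationship are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy data: user -> set of items the user interacted with.
user_items = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
    "u4": {"A", "D"},
}

def build_item_matrix(user_items):
    """Step 1: for each ordered item pair, count the users who have both."""
    matrix = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for i, j in permutations(sorted(items), 2):
            matrix[i][j] += 1
    return matrix

def recommend(matrix, seen, top_n=3):
    """Step 2: score unseen items by their summed co-occurrence with seen items."""
    scores = defaultdict(int)
    for i in seen:
        for j, count in matrix[i].items():
            if j not in seen:
                scores[j] += count
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

matrix = build_item_matrix(user_items)
recommendations = recommend(matrix, seen={"A", "D"})
print(recommendations)
```

In practice the raw counts are usually normalized so that very popular items do not dominate every recommendation list.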

MovieLens

This data set contains 10,000,054 ratings and 95,580 tags applied to 10,681 movies by 71,567 users of the online movie recommender service MovieLens.

Users were selected at random for inclusion. All users selected had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in three files: movies.dat, ratings.dat and tags.dat. Also included are scripts for generating subsets of the data to support five-fold cross-validation of rating predictions. More details about the contents and use of these files are given in the data set's README.

Implementation

This implementation uses ratings.dat from the MovieLens data set as its input file.
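Each line of ratings.dat is a record of the form UserID::MovieID::Rating::Timestamp. As a small sketch of how such a line might be parsed (the helper name `parse_rating` and the sample record are illustrative, not taken from the original scripts):

```python
def parse_rating(line):
    """Split one 'UserID::MovieID::Rating::Timestamp' record into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)

sample = "1::122::5::838985046\n"  # illustrative record in the ratings.dat layout
parsed = parse_rating(sample)
print(parsed)
```

The rating is parsed as a float because this data set allows half-star ratings.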

Below are the contents of itempop.py, which sums up the popularity of each item.
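As a rough idea of what a popularity job of this kind could look like, here is a sketch written in plain Python so that it runs without a cluster. It is not the original itempop.py: the `mapper` and `reducer` functions only mirror the general shape such methods take on an mrjob `MRJob` subclass, and `run_local` is a hypothetical stand-in for the framework's shuffle phase.

```python
from collections import defaultdict

def mapper(_, line):
    """Emit (movie_id, 1) for every rating record."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    yield movie_id, 1

def reducer(movie_id, counts):
    """Sum the ones emitted for a movie to get its popularity."""
    yield movie_id, sum(counts)

def run_local(lines):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

# Hypothetical records in the ratings.dat layout.
ratings = [
    "1::10::5::0",
    "1::20::3::0",
    "2::10::4::0",
    "3::10::2::0",
    "3::30::4::0",
]
popularity = run_local(ratings)
print(popularity)
```

Here popularity is simply the number of users who rated the movie; the rating value itself is ignored.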

Below are the contents of itemcf.py, which calculates the correlations among items.
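The sketch below shows one way a multi-round item-similarity job could be arranged. It is not the original itemcf.py. Several things are assumed for illustration: the three-round layout with no explicit mapper in rounds 2 and 3, the similarity measure (co-occurrence count divided by the square root of the product of the two items' popularities), the `popularity` table being available as a lookup, and the local `run_step` driver that imitates the shuffle.

```python
import math
from collections import defaultdict
from itertools import permutations

def mapper1(_, line):
    """Round 1 map: key each rating record by user."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    yield user_id, movie_id

def reducer1(user_id, movie_ids):
    """Round 1 reduce: emit every ordered pair of movies rated by one user."""
    for i, j in permutations(sorted(set(movie_ids)), 2):
        yield (i, j), 1

def make_reducer2(popularity):
    """Round 2 reduce: turn co-occurrence counts into normalized similarities."""
    def reducer2(pair, ones):
        i, j = pair
        similarity = sum(ones) / math.sqrt(popularity[i] * popularity[j])
        yield i, (j, similarity)
    return reducer2

def reducer3(movie_id, neighbors, top_k=10):
    """Round 3 reduce: keep the top_k most similar movies for each movie."""
    ranked = sorted(neighbors, key=lambda n: (-n[1], n[0]))
    yield movie_id, ranked[:top_k]

def identity_mapper(key, value):
    """Pass records through unchanged, as a round without a mapper would."""
    yield key, value

def run_step(records, mapper, reducer):
    """Stand-in for one MapReduce round: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output

# Hypothetical records in the ratings.dat layout.
ratings = [
    "1::10::5::0", "1::20::3::0",
    "2::10::4::0", "2::20::4::0",
    "3::10::2::0", "3::30::4::0",
]
popularity = {"10": 3, "20": 2, "30": 1}  # assumed output of a popularity job

step1 = run_step([(None, line) for line in ratings], mapper1, reducer1)
step2 = run_step(step1, identity_mapper, make_reducer2(popularity))
step3 = run_step(step2, identity_mapper, reducer3)
similar = dict(step3)
print(similar["10"])
```

Each round's reducer output becomes the next round's input. In rounds 2 and 3 the records pass through an identity mapper and are regrouped by key before the reducer sees them.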

2 comments to this article

  1. daniel

    on April 26, 2015 at 2:04 pm

    Hi:
    I am new to Python and I recently read your article about item-based CF. But I really don't understand reduce3(): where do the key and value come from without a mapper()? Could you please tell me?
    Many thanks!
    Dan

    • yekeren

      on July 22, 2015 at 3:29 pm

      They come from the keys and values emitted by reduce2(). A default mapper() is added to the 3rd round and passes the records through unchanged, much like the Linux command ‘cat’ in a pipeline. The keys are then shuffled and handed to reduce3().
