Hadoop Experiment - Using Pig


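Using Pig's scripting language, Pig Latin, we can write a script that performs MapReduce operations similar to those in the previous post. Note that I will be using the same CSV file as before. The first script simply loads the file, prints the schema and dumps the contents.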


gamedata = LOAD 'nesgamedata.csv' AS (index:int, name:chararray, grade:chararray, publisher:chararray, reader_rating:chararray, number_of_votes:int, publish_year:int, total_grade:chararray);

DESCRIBE gamedata;

DUMP gamedata;

Run the script on the Hadoop machine (only the last rows of the DUMP output are shown here):

[[email protected] gamedata]# pig -f gamedata_01.pig
(269,Winter Games,12,Epyx,13,24,1987,12.96)
(270,Wizards and Warriors,9,Rare,6,55,1987,6.053571428571429)
(271,World Games,6,Epyx,9,8,1986,8.666666666666666)
(272,Wrath of the Black Manta,7,Taito,6,31,1989,6.03125)
(273,Wrecking Crew,10,Nintendo,8,18,1985,8.105263157894736)
(275,Xexyz,10,Hudson Soft,5,26,1989,5.185185185185185)
(277,Yoshi's Cookie,5,Nintendo,7,23,1993,6.916666666666667)
(279,Zelda II: The Adventure of Link,3,Nintendo,4,112,1989,3.9911504424778763)
(280,Zelda, The Legend of,3,Nintendo,3,140,1986,3.0)
(281,Zombie Nation,4,Kaze,8,26,1991,7.851851851851852)
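
A note on the LOAD statement: it does not name a delimiter, so Pig uses the default of PigStorage, which is a tab. If your copy of the file is separated by commas, a sketch like the one below may be needed. The comma delimiter is an assumption about the file, and firstRows is a hypothetical relation name. The LIMIT step is optional; it just avoids dumping the whole relation while checking that the columns line up.

```pig
-- Assumption: nesgamedata.csv is comma-delimited; the script above does not say.
-- PigStorage splits on tabs unless a delimiter is passed explicitly.
gamedata = LOAD 'nesgamedata.csv' USING PigStorage(',') AS (index:int, name:chararray, grade:chararray, publisher:chararray, reader_rating:chararray, number_of_votes:int, publish_year:int, total_grade:chararray);

-- Print only a handful of rows instead of the whole relation.
-- firstRows is a hypothetical name used for this sketch.
firstRows = LIMIT gamedata 10;
DUMP firstRows;
```

Keep in mind that PigStorage splits on every occurrence of the delimiter. A name that itself contains a comma, such as row 280 above, would be split into two fields unless the file quotes it and a CSV-aware loader (for example CSVExcelStorage from the piggybank library) is used instead.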

Now let's calculate, for all Nintendo games, the average user rating for each grade given by the author of the website.


gamedata = LOAD 'nesgamedata.csv' AS (index:int, name:chararray, grade:int, publisher:chararray, reader_rating:int, number_of_votes:int, publish_year:int, total_grade:float);

gamesNintendo = FILTER gamedata BY publisher == 'Nintendo';

gamesRatings = GROUP gamesNintendo BY grade;

averaged = FOREACH gamesRatings GENERATE group AS rating,
        AVG(gamesNintendo.total_grade) AS avgRating;

DUMP averaged;

Run the script on the Hadoop machine:

[[email protected] gamedata]# pig -f gamedata_02.pig

From the output we can observe that, on average, the users do not really agree with the author on the ratings. The author often gives a game a higher grade than the users do.
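
To make this comparison easier to read, the averages could be sorted by the author's grade and shown next to the number of games in each group, so that an average based on a single game is not over-interpreted. The following is only a sketch of that idea: it repeats the load, filter and group steps from the script above, and the names summary, ordered and numGames are hypothetical.

```pig
-- Same load, filter and group steps as in the script above.
gamedata = LOAD 'nesgamedata.csv' AS (index:int, name:chararray, grade:int, publisher:chararray, reader_rating:int, number_of_votes:int, publish_year:int, total_grade:float);
gamesNintendo = FILTER gamedata BY publisher == 'Nintendo';
gamesRatings = GROUP gamesNintendo BY grade;

-- For each author grade: how many Nintendo games received it,
-- and the average user rating of those games.
summary = FOREACH gamesRatings GENERATE group AS rating,
        COUNT(gamesNintendo) AS numGames,
        AVG(gamesNintendo.total_grade) AS avgRating;

-- Sort by the author's grade so the rows are easy to compare.
ordered = ORDER summary BY rating;

DUMP ordered;
```

Each output tuple would then have the form (rating, numGames, avgRating), one per distinct author grade.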