Pig Introduction
DS 730

Contents

Overview
Activity Tasks
Task 1: Install Pig
  Install Pig
  Test Pig
Task 2: Run Pig
  Check Pig
  Move Files
Task 3: Practice Basic Pig Latin Statements
Task 4: Practice More Detailed Pig Latin Examples
  Enter Commands at Prompt
  Run Scripts
Task 5: Solve Baseball Problems Using Pig Code
Submitting Your Work

Overview
In this activity, you will be running some Pig code on the Linux Hadoop system. As with Hadoop and HDFS, you need to be mindful of where your files are located. Pig often retrieves files from HDFS, so be sure your files are located in the correct directory. You must also be very careful with capitalization: a file called Baseball.csv cannot be referenced as baseball.csv, nor can a variable called agged be used as Agged.

Activity Tasks

Task 1: Install Pig

Important: If you used my script to install Hadoop in a previous activity, Pig is already installed for you. You are encouraged to read Task 1, but it is already done. If you did not install Hadoop using my script, you must do Task 1.

This step will install Pig on your Linux machine. It takes about 10 minutes depending on your internet connection speed. If you used my script to install Hadoop, then Pig is already installed and you should skip to Task 2.

Install Pig

I am assuming you have your Ubuntu machine set up with Hadoop from The Hadoop Introduction activity. If so, open up Ubuntu and log in using your ubuntu account; you should land at the familiar command prompt.

Now we need to install Pig. Before installing Pig, make sure to start Hadoop by typing start-all.sh

The current version of Pig is 0.17.0. Type the following command to get Pig:

wget http://www.uwosh.edu/faculty_staff/krohne/ds730/pig-0.17.0.tar.gz

Extract Pig to the home directory:

tar xvzf pig-0.17.0.tar.gz

Move Pig to a better location:

sudo mv pig-0.17.0 /usr/local/pig

Type vim ~/.bashrc to open the file in an editor called vim.

Press the DOWN ARROW key to scroll all the way to the bottom.

Press the letter i key to get into “insert” mode.

You’ll notice the word -- INSERT -- on the bottom.

Append this to the bottom of your file:

#PIG VARIABLES START
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
#PIG VARIABLES END

After you have added that information to your file, press the ESC key.

You should notice the -- INSERT -- disappear.

Type in the following exactly as it’s written:

:wq

That’s colon, then w, then q. This should save your file and take you back to the command prompt. If you are struggling with vim, search for a vim tutorial (such as http://www.openvim.com/) and read up on the commands.

When you are back at the command line, type:

source ~/.bashrc
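
If you want a quick sanity check that the PATH change took effect, you can ask Pig to print its version (the exact output wording may differ slightly by build):

pig -version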

Test Pig

Congratulations, you’ve just installed Pig. Now we just need to test it. To ensure Pig is working, type:

pig -x local

Confirm you see grunt> at the bottom of the screen; if so, you most likely did everything right.

The startup output says “Connecting to hadoop file system…” but this is not true: when you use the local switch, you are connecting to your local filesystem.

In order to quit Pig, simply type quit and Pig will end.

If you want to connect to your Hadoop file system, first start up HDFS by typing in start-all.sh. Once HDFS is running, you can:

type pig -x mapreduce

OR

type pig and it will default to the HDFS.

If you set yours up identically to mine, at the grunt> prompt you should be able to type ls /home/ubuntu and see your files listed (note the hdfs:// portion of each filename).

Type quit to quit out of Pig and continue to Task 2.

Task 2: Run Pig

Now that we have a Pig implementation ready to use, let’s run it and make sure everything is correct.

Check Pig

Connect to your Linux machine.

Type start-all.sh to start Hadoop.

It may take a few seconds to start up.

Once it is started, type jps and confirm you see something like this:

1642 SecondaryNameNode
1913 NodeManager
2218 Jps
1431 DataNode
1781 ResourceManager
1301 NameNode

The numbers may be different but the programs should be the same. If you do not see SecondaryNameNode, it is ok.

Type pig and you will be at the grunt> prompt.

Type ls /home/ubuntu and make sure files show up in the list.

If files do appear, type quit to quit out of Pig.

Move Files

Type in the following to get the Pig sample files onto your Linux system:

wget http://www.uwosh.edu/faculty_staff/krohne/ds730/pigFiles.tar.gz

Decompress the file by using the following command:

tar xvzf pigFiles.tar.gz

Move those files to the HDFS into /home/ubuntu/pigtest/
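
If you need a refresher on the commands, here is a minimal sketch (assuming the extracted csv files sit in your current directory and your HDFS home is /home/ubuntu):

hdfs dfs -mkdir -p /home/ubuntu/pigtest
hdfs dfs -put *.csv /home/ubuntu/pigtest/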

First, open up the csv files so you can get a handle on what is saved in each file. As we are solving each one of these problems, you should think about how difficult it would be to do this using MapReduce. Keep in mind that if your file size were on the order of terabytes or higher, Microsoft Excel or other spreadsheet software would struggle to even open it.

Once the csv files are moved into the pigtest folder on the HDFS, type pig and press ENTER.

Task 3: Practice Basic Pig Latin Statements

We will spend quite a bit of time going through the Pig Latin syntax now. Each step will contain a statement and an explanation of the statement.

Complaints = LOAD 'hdfs:/home/ubuntu/pigtest/Complaints.csv' USING PigStorage(',') AS (date:chararray, product:chararray, subprod:chararray, issue:chararray, subissue:chararray, narrative:chararray, response:chararray, company:chararray, state:chararray, zip:chararray, submitted:chararray, senttocomp:chararray, compresponse:chararray, timelyresponse:chararray, disputed:chararray, id:int);

Explanation:

The first identifier, Complaints, is simply the name of the relation I am using. It can be called anything you want.

The LOAD call is a built-in command to load the data from the file. The name of the file follows the LOAD command.

Then we need to tell Pig how to parse it. Since it’s comma delimited, we use the USING PigStorage(',') command.

Finally, we give a schema to the relation. Instead of having to reference columns by their position in the relation, it’s easier to give them names and types.

The actual load will not happen until we need it.

DESCRIBE Complaints;

Explanation: This will print out our schema. It should print out something like:

Complaints: {date: chararray, product: chararray, subprod: chararray, issue: chararray, subissue: chararray, narrative: chararray, response: chararray, company: chararray, state: chararray, zip: chararray, submitted: chararray, senttocomp: chararray, compresponse: chararray, timelyresponse: chararray, disputed: chararray, id: int}

Oshkosh = FILTER Complaints BY state=='WI' AND zip=='54904';

Explanation: This will filter all tuples where the state is equal to WI and the zip code is equal to 54904. Oshkosh is now a relation that only contains tuples where state is Wisconsin and the zip is 54904.

Note: All of your favorite AND, OR and NOT operators work in the expressions. You can read about them here: http://pig.apache.org/docs/r0.17.0/basic.html#boolops
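
These compose just as you would expect; a throwaway example (illustrative only, not part of the exercise):

notWisconsin = FILTER Complaints BY NOT (state=='WI' OR state=='MN');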

DUMP Oshkosh;

Explanation: This command will print out the relation to the screen. It will start a MapReduce program to run through all of this for us. There should be 19 records that print out. Instead of having to count them, you can use the following commands (explained in the future) and it should print out (19):

grouped = GROUP Oshkosh ALL;
total = FOREACH grouped GENERATE COUNT(Oshkosh);
DUMP total;

STORE Oshkosh INTO 'hdfs:/home/ubuntu/pigtest/wisconsin' USING PigStorage(':', '-schema');

Explanation:

We are creating a MapReduce program to write out our Oshkosh relation to some (possibly more than 1) files. As with MapReduce outputs, the folder must not already exist.

We are using PigStorage again as a way to simplify our code. We want to output our relation using a different delimiter this time. We will delimit with a colon instead of a comma.

The last '-schema' argument will save our schema to a file called .pig_schema. This will be great for future loads because we won’t have to specify a schema each time. If the .pig_schema file is already in the folder we are reading in, that will be the schema that is used. This lets us do away with everything after the AS command in step 1. Our Oshkosh relation will be stored in some number of part-m-xxxxx files. Mine turned out to only save into 1 file. Keep track of this wisconsin folder as you will be uploading it at the end of this activity.
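
To see the saved schema pay off, here is a sketch of a later session reloading the stored data without an AS clause (assuming the wisconsin folder from the STORE above exists):

OshkoshAgain = LOAD 'hdfs:/home/ubuntu/pigtest/wisconsin' USING PigStorage(':');
DESCRIBE OshkoshAgain;

Because the folder contains .pig_schema, the DESCRIBE should print the full schema even though we never specified one in the LOAD.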

readable = FOREACH Oshkosh GENERATE date, product, company, compresponse;

Explanation: We may not be interested in all of the data from the original relation. If so, we can include only the columns we want. For instance, I may only want the date, product, company and what the company responded with. We can run this and then run a DUMP readable to see what the relation is.

DUMP readable;

firstfilter = FOREACH Complaints GENERATE date, product, company, state, zip, submitted;

Explanation: We may only care about some of the columns.

filtered = FILTER firstfilter BY state IS NOT NULL;

Explanation: Cleans up some of the bogus data that gets read in by ensuring that every tuple has a non-null state value.

groupedbystate = GROUP filtered BY state;

Explanation: This will group all of our tuples by the state value so that we can run aggregate functions on them. This groupedbystate relation contains two columns: a group column that is the name of the state, and a bag of tuples.

Example: If we look at a tuple where group==X, every tuple in the bag of tuples is a tuple in the original filtered relation where state==X. I’ll explain this a second time with the next command if it didn’t make sense now.

DESCRIBE groupedbystate;

Explanation:

You’ll notice the first column is called group. This group is the same type as state because that is what we grouped by. We will use this group column in our aggregate functions a bit later.

The second column is a subset of the filtered relation. The second column, which is called a bag, is all of the tuples of filtered that have “group” as its state column.

Example: Assume you have a tuple in groupedbystate that has its group value set to WI. The second column of that groupedbystate tuple is going to be a bag of tuples such that each tuple (from filtered) has a state value of WI.

Assuming this DESCRIBE made sense, the next statement should be rather straightforward. Don’t move on to the next step unless this step is crystal clear. Look at the example here if this is not clear: http://pig.apache.org/docs/r0.17.0/basic.html#group You may have to scroll down a little bit but the example starts with “Suppose we have relation A” and it talks about names, ages and gpa. It is a very good example.
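
Concretely, given the firstfilter schema above, the DESCRIBE output should look something close to this (sketched, not copied from a live session):

groupedbystate: {group: chararray,filtered: {(date: chararray,product: chararray,company: chararray,state: chararray,zip: chararray,submitted: chararray)}}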

agged = FOREACH groupedbystate GENERATE group, COUNT(filtered) AS total;

Explanation: The FOREACH is straightforward; for every tuple in groupedbystate, do something… we want to create a new tuple. Remember, each tuple has a bag of tuples that all have the same state. For each tuple, generate a tuple in our new relation. The first column in our new relation is simply the name of the state since that is what we grouped by in the previous step. COUNT is a built-in function that simply counts all of the tuples in the second column (which is a bag of tuples).
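
To make that concrete, a grouped tuple shaped like the first line below would generate the second (the bag contents are made up for illustration):

(WI, {(d1,p1,c1,WI,54904,Web), (d2,p2,c2,WI,53711,Phone)})
(WI, 2)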

sorted = ORDER agged BY total DESC;

Explanation: Sort the states so that those with the most complaints come first, i.e., in descending order of total.

topten = LIMIT sorted 10;

Explanation: Limit the number of states that end up in our output to 10.

DUMP topten;

Explanation: Should end up printing California, followed by Florida, followed by Texas and the rest of the top ten.

wisc = FILTER agged BY group=='WI';

DUMP wisc;

webandphone = FILTER filtered BY (submitted=='Web' OR submitted=='Phone') AND state=='WI';

groupbymore = GROUP webandphone BY (state,submitted);

Explanation: We can group tuples from our relation using multiple values. Here I am interested in not only how many complaints are from Wisconsin, but how many of them were submitted via the web and via phone. Because I know I only care about those two options, I filtered them out first (step 18) before working with them. The more filtering I can do early on, the better. You always want to filter first, then group and aggregate second. The grouped relation now has a tuple-valued group key, sketched below.
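
A sketch of the shape of groupbymore (bag contents elided; the keys are the only part that matters here):

((WI,Web), {…all WI complaints submitted via the web…})
((WI,Phone), {…all WI complaints submitted via phone…})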

agged = FOREACH groupbymore GENERATE group, COUNT(webandphone) AS total;

Explanation: Aggregate all of the results into a simple relation we can read.

DUMP agged;

Task 4: Practice More Detailed Pig Latin Examples

Task 3 covered the basic Pig Latin statements. In this task, we will go over some more detailed examples. We will look at a different example where we have separate csv files and we want to join them together in some fashion.

Enter Commands at Prompt

batters = LOAD 'hdfs:/home/ubuntu/pigtest/Batting.csv' USING PigStorage(',');

Explanation: We will load up the data without specifying a schema as the number of columns may be huge and we may only need a subset of them.

realbatters = FILTER batters BY $1>0;

Explanation: If we do not specify a schema, we can reference the columns in the relation by column numbers starting at 0. The year column is therefore referenced by $1. If a batter doesn’t have a year, we want to filter them out.

run_data = FOREACH realbatters GENERATE $0 AS id, $1 AS year, $6 AS runs;

Explanation: We are creating a new relation that only contains the player’s id, the year and the total number of runs scored that year.
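
One subtlety worth knowing: because we never declared types, these fields default to bytearray. A DESCRIBE run_data should print something close to this (sketched):

run_data: {id: bytearray,year: bytearray,runs: bytearray}

This is also why MAX below produces values like 136.0: Pig casts the untyped bytearray to double before aggregating.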

grouped_by_year = GROUP run_data BY year;

best_per_year = FOREACH grouped_by_year GENERATE group, MAX(run_data.runs) AS best;

Explanation: Here we are looking at each tuple in our grouping. Each tuple has a group key of the year. Each tuple in our bag contains a person who played during that year. We are interested in the person who scored the most runs in that particular year. If two players had the same number of runs, both will show up in our relation.

DUMP best_per_year;

get_player_ids = JOIN best_per_year BY ($0, best), run_data BY (year,runs);

Explanation: In order to join the maximum runs back up with the player that got it, we need to do a join. We are joining our new relation that has the $0 (which is the year) and best (which is a number of runs) with our original relation with (year, runs). Every time the ($0, best) exactly equals the (year, runs), the tuples will combine. This will give us multiple rows if two players had the same number (see 1967).

Example: In our best_per_year, we have (2011, 136.0). In our run_data relation, we have (grandcu01,2011,136). Since 2011==2011 and 136==136, we get a new tuple in our joined relation consisting of (2011,136.0,grandcu01,2011,136). However, we probably don’t want them in this order nor do we want duplicate column values, so… (continue with next steps)

nicer_data = FOREACH get_player_ids GENERATE $0 AS year, $2 AS id, $4 AS runs;

DUMP nicer_data;

Explanation: The data looks good but the id may not mean anything to some people. Therefore, we need to join up with the master relation and get the names of the players instead of the ids. The following steps are everything we’ve done with a different file.

names = LOAD 'hdfs:/home/ubuntu/pigtest/Master.csv' USING PigStorage(',');

master_data = FOREACH names GENERATE $0 AS id, $13 AS first, $14 AS last;

complete_data = JOIN nicer_data BY id, master_data BY id;

finished = FOREACH complete_data GENERATE $0 AS year, $4 AS first, $5 AS last, $2 AS runs;

sorted = ORDER finished BY year DESC;

DUMP sorted;

Run Scripts

The grunt shell was nice for testing our code and making sure everything worked. However, we will be creating Pig scripts to simply run everything without any interaction from us. Save all of the statements above in a file called baseball.pig (the script is copied below for convenience).

Then run the file by doing:

pig baseball.pig
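
As an aside, the same command accepts the local switch if you ever want to dry-run a script against files on the local disk instead of HDFS (note this particular script would also need local copies of the csv files, with local paths in place of the hdfs:/ ones):

pig -x local baseball.pig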

Confirm that your relation will be stored in the baseballsorted folder, as seen in the last line of the script:

batters = LOAD 'hdfs:/home/ubuntu/pigtest/Batting.csv' USING PigStorage(',');
realbatters = FILTER batters BY $1>0;
run_data = FOREACH realbatters GENERATE $0 AS id, $1 AS year, $6 AS runs;
grouped_by_year = GROUP run_data BY year;
best_per_year = FOREACH grouped_by_year GENERATE group, MAX(run_data.runs) AS best;
get_player_ids = JOIN best_per_year BY ($0, best), run_data BY (year,runs);
nicer_data = FOREACH get_player_ids GENERATE $0 AS year, $2 AS id, $4 AS runs;
names = LOAD 'hdfs:/home/ubuntu/pigtest/Master.csv' USING PigStorage(',');
master_data = FOREACH names GENERATE $0 AS id, $13 AS first, $14 AS last;
complete_data = JOIN nicer_data BY id, master_data BY id;
finished = FOREACH complete_data GENERATE $0 AS year, $4 AS first, $5 AS last, $2 AS runs;
sorted = ORDER finished BY year DESC;
STORE sorted INTO 'hdfs:/home/ubuntu/pigtest/baseballsorted' USING PigStorage(',');
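
Once the script finishes, you can peek at the stored output from the command line; a sketch (the part file names can vary, hence the wildcard):

hdfs dfs -ls /home/ubuntu/pigtest/baseballsorted
hdfs dfs -cat /home/ubuntu/pigtest/baseballsorted/part-* | head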

Note: We have seen a lot of the syntax that you will need for the majority of your Pig scripts. For more information on anything else you may want to know about Pig, go to http://pig.apache.org/docs/r0.17.0/ and you will find a ton of information about Pig. I recommend starting here: http://pig.apache.org/docs/r0.17.0/basic.html.

Task 5: Solve Baseball Problems Using Pig Code

Solve the following 4 problems using the baseball relations and store those answers into a text file that you will submit as part of your work. Make sure you are reading in your data from the /home/ubuntu/pigtest folder. Along with the answers, you should store the Pig code you used to get your answers in a Pig script and upload those scripts with your work. The column headings will tell you which column represents what statistic. For each question, print out the playerID(s) of the player(s) who answer the question. If there are any ties, report all players who tied. You should dump your output to the terminal window for each question.

1. Who was the heaviest player to hit more than 5 triples (3B) in 2005?

2. In the batting file, if a player played for more than 1 team in a season, that player will have his name show up in multiple tuples with the same year. For example, in 2011, Francisco Rodriguez (rodrifr03) played for the New York Mets and then played for the Milwaukee Brewers (see tuples 95279 and 95280). The question you have to answer is this: what player played for the most teams in any single season and how many teams did he play for? A player may have played for the same team twice in the same season at different times in the season. If this is the case, you should count this as two different teams.

3. What player had the most extra base hits during the entire 1980s (1980 to 1989)? Note that this question is not asking about any 1 specific year. It is asking about the entire 10 year span in the 80s. An extra base hit is a double, triple or home run (columns 2B, 3B, HR).

4. Of the right-handed batters who were born in October and died in 2011, which one had the most hits in his career? The column with the heading of H is the hits column. Do not consider switch hitters to be right-handed batters.

Submitting Your Work

Save your output from the following steps into a folder called part1 containing:

The wisconsin folder from Task 3, step 5

The baseballsorted folder from Task 4, Run Scripts, step 3

In another folder called part2 store:

The answers to the four problems in Task 5 in a single text file called answers.txt

This file must contain 4 sections of the form:

P1
P2
P3
P4

Note that each answer above must contain only the answer (or field(s)) that are asked for, not including supporting evidence.

The Pig scripts containing the Pig code you used to get your answers, stored in files called:

P1.pig
P2.pig
P3.pig
P4.pig

When you are finished, submit both the part1 and part2 folders.