Friday, May 30, 2008

Scraping the World of Warcraft armory with Python

Well, in the last couple of weeks I started a new project. This was totally random but gave me great results.

I wrote a Python (man, this language is awesome) program to scrape the WoW armory for my guild's gear. We have been progressing really well over the last couple of months, and we have been talking about killing two of the really big bosses. I wanted to see how well we were geared for these two encounters.

I googled around and I found this post. Got to give Peter credit; he got me started down the path that forced me to learn a little Python.

My code is horribly ugly, and I know a script monkey could have done this better and in his/her sleep. But I am very happy with the results that I got.

I didn't know and still don't know very much about navigating xml with Python, so I manually filtered my guild's toons with the armory and saved the xml file.
I hacked down the file so that all that was left was a list of members.
The guild xml file layout looked like this:
<character url= rank= raceid= race= name= level= genderid= gender= classid= class=>

Each character had its own line.

Once I had that in a nice format, I had to figure out how to navigate the xml.
Searching around I found the ElementTree module.
It took me a while to find out specifically how to use it, but when I did it was very easy.
I started with:
from xml.etree.ElementTree import XML, fromstring, tostring
from xml.etree import ElementTree as ET
Then I parsed the xml file:
xml_file = "guild70s.xml"
try:
    tree = ET.parse(xml_file)
except Exception, inst:
    print "Unexpected error opening %s: %s" % (xml_file, inst)
    raise  # nothing else can run without the roster
elem = tree.getroot()

This was the most confusing point for me. When I look at an xml file, I understand it, but I did not know how Python/ElementTree looks at it.
I just used print to figure out where I was in the xml file.
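
For anyone equally unsure how ElementTree sees a file, here is a minimal sketch of that print-based exploration. It is written for Python 3, while the code in this post is Python 2. The sample xml and the outline helper are made up for illustration; only tag, attrib, and looping over an element's children come from ElementTree itself.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the real roster file has more attributes per character.
SAMPLE = """<members>
  <character name="Noldor" level="70"/>
  <character name="Example" level="70"/>
</members>"""


def outline(elem, depth=0):
    """Return one line per element: indented tag plus sorted attribute names."""
    lines = ["  " * depth + elem.tag + " " + str(sorted(elem.attrib))]
    for child in elem:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

Each level of indentation in the printout is one level of nesting, which makes it easier to tell which element a given index or find() call will land on.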

In this case, elem is the members element in the guild xml file.
A basic "for" loop then hands you each character element inside members, one at a time, as e:
for e in elem:
Getting at specific attributes in the character lines was easy as well.
strUrl = "" + e.get("url")
You just need to look in the xml file and find the name of the attribute that you want to retrieve. This also points out another neat thing about the guild xml file that I downloaded earlier. Each toon has its url as an attribute, including all the formatting, i.e. url="r=Zul%27jin&amp;n=Noldor" for my toon. This is great because you don't have to worry about the url formatting, compared to what Pete did above.
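
As a rough illustration of the loop and the attribute lookup together, here is a small self-contained sketch in Python 3 (the post's own code is Python 2). The roster is typed inline rather than read from guild70s.xml, and "Example" is a made-up second toon; the attribute names are the ones from the layout above.

```python
from xml.etree import ElementTree as ET

# Hypothetical two-member roster in the layout described above.
# "Example" is a made-up toon; Noldor's url is the one quoted in the post.
ROSTER = """<members>
  <character name="Noldor" level="70" url="r=Zul%27jin&amp;n=Noldor"/>
  <character name="Example" level="70" url="r=Zul%27jin&amp;n=Example"/>
</members>"""

members = ET.fromstring(ROSTER)  # plays the role of tree.getroot()

roster = []
for e in members:
    # get() returns the attribute value, or None if the attribute is missing
    roster.append((e.get("name"), e.get("url")))

for name, url in roster:
    print(name + " -> " + url)
```

One detail worth knowing: the parser decodes the &amp; in the file, so e.get("url") hands back a plain & while the %27 is left untouched.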

At this point, I could read each toon's xml file from the armory. The problem that I ran into was that I kept getting HTTP 503 (Service Unavailable) errors at random points in the script. I don't know if there was a lot of traffic or if the site was going down for maintenance. To avoid having to run the script over and over, I saved each toon's xml file to a folder.
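
The save-to-a-folder idea can be sketched roughly as below, in Python 3 (the post's own code is Python 2). This is not the code I actually ran: cached_fetch, its retries parameter, and the flaky_fetch stand-in are all hypothetical names, and the retry loop is an extra that the original script did not have. The real download is left out so the sketch runs without touching the armory.

```python
import os
import tempfile


def cached_fetch(name, cache_dir, fetch, retries=3):
    """Return the xml text for name, downloading only if it is not cached.

    fetch is any callable that takes name and returns the xml as a string,
    raising IOError (which urllib's HTTP errors derive from) on failure.
    retries must be at least 1.
    """
    path = os.path.join(cache_dir, name + ".xml")
    if os.path.isfile(path):
        with open(path) as f:
            return f.read()  # already downloaded on an earlier run

    last_error = None
    for _ in range(retries):
        try:
            text = fetch(name)
        except IOError as err:
            last_error = err  # e.g. a 503; try again
            continue
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)
        with open(path, "w") as f:
            f.write(text)
        return text
    raise last_error


# Demo with a stand-in fetcher that fails twice, then succeeds.
attempts = []


def flaky_fetch(name):
    attempts.append(name)
    if len(attempts) < 3:
        raise IOError("HTTP Error 503: Service Unavailable")
    return "<page/>"


cache = tempfile.mkdtemp()
first = cached_fetch("Noldor", cache, flaky_fetch)   # retried until it works
second = cached_fetch("Noldor", cache, flaky_fetch)  # served from the folder
```

Because the file on disk is checked first, rerunning the script after a failure only downloads the toons that are still missing.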

Navigating the toon xml files was a bit more complicated. Someone more skilled than I would know how to do this better, but this is how I got it to work.
I started with this:
children = elem[0]
child = children[1]
items = child.find("items")

Then I got it down to:
children = elem[0]
items = children[1].find("items")

I could not figure out how to do it in one line, so if anyone has any thoughts that would be great.
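
Two possible one-liners, sketched in Python 3 (the post's own code is Python 2) against a made-up toon file. The tag names page, characterInfo, character, and characterTab are assumptions about the layout, as are the item ids; only items and the id attribute appear in this post.

```python
from xml.etree import ElementTree as ET

# Assumed shape of a saved toon file, trimmed to the elements that matter.
# The tag names and ids are assumptions made for this example.
TOON = """<page>
  <characterInfo>
    <character name="Noldor"/>
    <characterTab>
      <items>
        <item id="1"/>
        <item id="2"/>
      </items>
    </characterTab>
  </characterInfo>
</page>"""

elem = ET.fromstring(TOON)

# Index chaining: first child of the root, then that element's second child.
items_by_index = elem[0][1].find("items")

# Path search: the first <items> element anywhere below the root.
items_by_path = elem.find(".//items")

print([e.get("id") for e in items_by_path])
```

The path version does not care where items sits among its siblings, so it should keep working if the layout shifts a little, as long as the first items element in the file is the one you want.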

Basically that covers all the hard parts. I downloaded all the items that each toon had.
I used:
os.path.isfile(geardir + e.get("id") + ".xml")
to skip toons and gear that I had already downloaded.

In the end, I created a useful tool to see how well my guild is geared for the content that we are in.

The code is splintered into multiple files so I can't post all of them. I will just post one.
This code reads the guild roster xml file, finds the character, and then retrieves the items for that toon. Then it goes on to the next toon until all the guild's gear has been downloaded.

import urllib2
import xml.dom.minidom
from xml.etree.ElementTree import XML, fromstring, tostring
from xml.etree import ElementTree as ET
import os

itemurl = ""
geardir = "gear/"

# Read the trimmed guild roster.
xml_file = "guild70s.xml"
try:
    tree = ET.parse(xml_file)
except Exception, inst:
    print "Unexpected error opening %s: %s" % (xml_file, inst)
    raise  # nothing else can run without the roster
elem = tree.getroot()

for z in elem:
    # Open the toon's xml file that was saved earlier.
    xml_file = "characters/" + z.get("url") + ".xml"
    try:
        toon_tree = ET.parse(xml_file)
    except Exception, inst:
        print "Unexpected error opening %s: %s" % (xml_file, inst)
        continue  # move on to the next toon
    print z.get("url")
    children = toon_tree.getroot()[0]
    child = children[1]
    items = child.find("items")

    for e in items:
        # Only download gear that is not already in the gear folder.
        if not os.path.isfile(geardir + e.get("id") + ".xml"):
            oOpener = urllib2.build_opener()
            oOpener.addheaders = [('User-agent',
                'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv: Gecko/20070515 Firefox/')]
            req = urllib2.Request(itemurl + e.get("id"))
            # Fetch the item's xml and save it under its id.
            itemstr = oOpener.open(req).read()
            fout = open(geardir + e.get("id") + ".xml", 'w')
            fout.write(itemstr)
            fout.close()
            print e.get("id")

Side Note: When I did this, the special É (Alt-144) character was not working. We had two people in our guild who used it, and I could not even see their pages with Firefox, so it must be an Armory issue.
