02 - 16 2 Geocoding Visualization - en

The document discusses visualizing data by retrieving it from the network, processing it, storing it in a database, and then visualizing it. An example process is described that takes locations data, geocodes it using an API, stores it in a database, and then visualizes it on a map. This is presented as an example of a personal data mining workflow using Python.


So now, in this last chapter, we're going to talk a little bit about visualizing data. But what we're really doing is summing everything up, because we're going to retrieve data from the network, process the data, store it in a database, and then write it out and visualize it. It's all coming together, and it turns out that this notion of gathering data over the network is a pretty common thing. It might require a cleaning or processing step, and part of the problem is that when you're pulling data off the net, you want to be able to restart the process, because it'll run and run, and then your computer will crash or go to sleep or something. You don't want to start over from the beginning, because it might be quite a bit of data and it takes a while to retrieve, or, as we've seen, you might be talking to an API that's got a rate limit that says you have to stop at 14 of these things, or stop at 200, or whatever. So, this is often a restartable process,
and it's usually a simple process, a relatively small amount of code: you have a queue of things you want to retrieve, you go get the next one, you store it in the database, then the next one, store it in the database. When you start the process up, you start filling this database up with stuff, and then if it blows up and you restart it, the first thing it does is read the database and say, "Oh, I don't need any of those," and then it starts to get the next one, and the next one, and the next one. That is how you make this restartable. Databases are really good at making sure that the program writing to the database can blow up without corrupting your data. You don't get partial data; a record is either written or it's not written. So these things can blow up, and sometimes you blow them up on purpose because you want to restart them, and when you start them up again, they scan the database and say, "Where was I? Oh, I'll start here." So, this is often a slow and restartable process. It also might be rate-limited for some reason. It runs for a while; for the things we'll do in this chapter, it might actually run for days.
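To make that pattern concrete, here's a minimal sketch of a restartable retrieval loop. The queue contents and the fetch_item() function are placeholders of my own, not part of the course code:

```python
import sqlite3

# Hypothetical stand-in for the slow network call (an API request, a page fetch).
def fetch_item(name):
    return 'retrieved data for ' + name

conn = sqlite3.connect('cache.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Cache (name TEXT UNIQUE, data TEXT)')

todo = ['thing-one', 'thing-two', 'thing-three']  # the queue of things to retrieve

for name in todo:
    # Restartability: skip anything that's already in the database.
    cur.execute('SELECT data FROM Cache WHERE name = ?', (name,))
    if cur.fetchone() is not None:
        continue
    data = fetch_item(name)  # slow, might crash, might hit a rate limit
    cur.execute('INSERT INTO Cache (name, data) VALUES (?, ?)', (name, data))
    conn.commit()  # commit after every item so a crash loses nothing

conn.close()
```

If the program blows up halfway through, you just run it again; the SELECT at the top of the loop is the "Where was I?" scan.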
Then, once you have your data, you start doing stuff inside your computer where you don't really care so much about the network. This might be raw data that came in off the APIs, and you want to put it into some new, smaller format, so you might go from one database to another database, or from a database to a file, and produce data that's really ready for visualization. The raw data might be a little complex, or there might be flaws in it, so you might write scanners that go, "Oh, wait a sec, this is inconsistent; sometimes it looks like this and sometimes it looks like that, so I'll clean that stuff up." Then you do some visualization, or write some Python programs that loop through the data once it's cleaned up and do some summing or adding or whatever it is they're doing: analyzing or visualizing.
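As a rough sketch of that second, local-only pass, assuming the Cache table from the sketch above: read the raw rows, normalize them, and write out a small summary file that's ready for the next step:

```python
import sqlite3

conn = sqlite3.connect('cache.sqlite')
cur = conn.cursor()

counts = dict()
for (name, data) in cur.execute('SELECT name, data FROM Cache'):
    # Cleaning step: the raw data may be inconsistent, so normalize it.
    cleaned = data.strip().lower()
    counts[cleaned] = counts.get(cleaned, 0) + 1

# Produce a small file ready for analysis, without touching the network again.
with open('summary.txt', 'w') as out:
    for value, count in sorted(counts.items()):
        out.write(str(count) + '\t' + value + '\n')

conn.close()
```

Notice there's nothing network-related in it at all; it can run as many times as you like.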
We're going to use things like Google Maps to do our visualization, along with a lot of JavaScript and a thing called D3.js, which is a JavaScript library. Now, in this class we're not teaching JavaScript, and we're not going to teach you Google Maps. I've provided all these things, so that when you run these programs, that stuff is all there. But if you want to learn, and see some examples of how to make a simple little JavaScript visualization with a line, a word cloud, or a map, we've got them, and you can take a look at those things.
Now, this is one form of data mining, and it's really data mining for an individual, where you're pulling this data down, getting it locally, and then working with it. There are other, much more sophisticated data mining technologies that you might find yourself using, but often you'll find that Python is part of them, or Python helps you manage them, or you write a Python program to scan through these things, or to prepare them, or to do something. So, there are lots of different data mining technologies; this is just one oversimplified, very Python-oriented one. I'd call this personal data mining. If you really want to become a data mining expert, you should take classes; this is just taking some of the skills we've learned in this class and solving some data mining problems.
So, the first application that we're going to data mine is an extension of an application we played with back in the JSON chapter. The idea is that it has a queue of locations. These are not pretty locations; they're locations as users typed them into a text field, from data from many years ago. It's anonymized data from the students who took one of my very first MOOCs, a MOOC on Internet history, reduced and anonymized just so we can play with it. But it's not accurate: we don't have GPS coordinates. If we use the Google geodata API with JSON, we can get those, but we need to avoid rate limiting, so we're going to cache the results in a database, meaning we're only going to retrieve each piece of data once. Then we're going to use the Google Maps API to visualize it in a browser.
The sample code is right there, and that sample code, geodata.zip, has a README that tells you exactly what to do to run this; it shouldn't be very hard for you to run it and produce a nice visual result. Here's the basic process of what's going to happen. There is a list of the things to retrieve called where.data, which is just a list of the locations, but these are not correct; they don't have GPS coordinates, they're just as typed into a text field by a user. geoload.py starts reading this list and checks to see whether each location is already in the database; this is a restartable process, as I mentioned. It finds the first unretrieved location, goes out and calls a web service, parses the result, and puts it into the database, then goes to the next one, parses that, puts it in the database, and this runs for a while. Then maybe it blows up, and you fix whatever broke, or you start your computer back up, and it runs for a while longer. So, this is a restartable process that, in effect, is adding stuff to this database. It's an SQLite database, and you can use the SQLite Browser to look at it if you like, the way we did in the database chapter. So, you can run it and see what you got, run it some more and see what you got, and debug it by using the SQLite Browser.
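The real logic lives in geoload.py inside geodata.zip; as a simplified sketch of the idea (the service URL here is a placeholder, and the actual file also deals with API keys, errors, and rate limits), it looks something like this:

```python
import sqlite3
import urllib.request, urllib.parse

# Placeholder endpoint: the real geoload.py configures the actual
# geocoding service and any required API key.
serviceurl = 'https://example.com/geojson?'

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)')

for line in open('where.data'):
    address = line.strip()
    # Restartable: if this address was already retrieved, skip it.
    cur.execute('SELECT geodata FROM Locations WHERE address = ?', (address,))
    if cur.fetchone() is not None:
        continue
    url = serviceurl + urllib.parse.urlencode({'address': address})
    data = urllib.request.urlopen(url).read().decode()  # the web service call
    cur.execute('INSERT INTO Locations (address, geodata) VALUES (?, ?)',
                (address, data))
    conn.commit()  # each row is either fully written or not written at all

conn.close()
```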
Then at some point you've got all of your data, and we've got this application called geodump.py that reads through all of this data and prints some information out, nice summary information. It's really common to want to do this, to get some summary information just for sanity checking, so you don't have to use the SQLite Browser. But geodump.py also writes out a little JavaScript file called where.js, which is then combined with where.html and the Google APIs; the JavaScript puts all these little pins on a map based on whatever data is in the database. So, that's our first end-to-end retrieve, process, and visualize application.
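Here's a sketch of the geodump.py idea, with an assumed JSON shape (the real file parses whatever structure the geocoding service actually returns): loop through the database and emit a JavaScript data file.

```python
import sqlite3
import json

conn = sqlite3.connect('geodata.sqlite')
cur = conn.cursor()

out = open('where.js', 'w')
out.write('myData = [\n')
count = 0
for (address, geodata) in cur.execute('SELECT address, geodata FROM Locations'):
    try:
        js = json.loads(geodata)
        lat = js['lat']  # assumed field names; the real service's JSON
        lng = js['lng']  # is nested, and geodump.py digs the values out
    except (ValueError, KeyError):
        continue  # skip rows that failed to geocode
    if count > 0:
        out.write(',\n')
    out.write(json.dumps([lat, lng, address]))
    count = count + 1
out.write('\n];\n')
out.close()

print(count, 'records written to where.js')
print('Open where.html in a browser to see the data')
```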
So, up next, we're going to show how we can use this to build a very simple search
engine and then run the PageRank algorithm.
