1. Geographical data in Twitter JSON
In this lesson, we'll learn about the different types of geographical data available in the Twitter JSON object.
2. Locations in Twitter text
A common, although imprecise, place to find location data is the tweet text itself. Users mention or allude to a location where they are or that they have visited. In the tweet above, I say that I'm in Zurich, indicating that I am at that particular location. These data may be helpful but are difficult to work with for several reasons. First, they require some natural language processing to detect that a location is mentioned at all. Second, we need to establish that the user is actually at that location. In the tweet above, I mention Zurich, but also Toronto, and it's a non-trivial problem to decide that I am actually in Zurich. Lastly, once we have a location, we need to resolve it into a latitude and longitude.
3. User-defined location
Another value in the Twitter JSON which offers location information is the user-defined location field. This is a free-text field which the user fills out. It can be found in the `location` field in the Twitter user JSON. While this may be a somewhat reliable place to look for location, a number of issues arise when using this field. First, the location may be overly vague. In my profile above, you can see my location next to the Location Pin icon. I list my location as "Bay Area", which can cover quite a large geographic area. Second, the location can be nonsensical or fictional, such as "my bed" or "inside of Justin Bieber's heart". Lastly, it doesn't resolve to a specific longitude and latitude.
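As a minimal sketch of reading this field, assuming the tweet has already been parsed into a Python dictionary: the sample tweet below is made up for illustration, and `get_user_location` is a hypothetical helper name, not part of any library.

```python
import json

# Hypothetical, trimmed-down tweet; real tweets carry many more fields.
tweet_json = '{"text": "Hello", "user": {"screen_name": "example_user", "location": "Bay Area"}}'
tweet = json.loads(tweet_json)


def get_user_location(tweet):
    """Return the user-defined location string, or None if it is missing or empty."""
    user = tweet.get("user") or {}
    location = user.get("location")
    # Treat an empty string the same as a missing value.
    return location or None


user_location = get_user_location(tweet)
print(user_location)
```

Because the field is free text, the returned string still needs cleaning and geocoding before it can be placed on a map.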
4. place JSON
The next value in which we can find geographical information is `place`. `place` is a child JSON object which contains several pieces of information. The most important of these is the `bounding_box`, which contains specific coordinates. The bounding box is a special kind of geographical object which allows Twitter to encode some uncertainty in location. As the name suggests, it is a set of four longitude and latitude coordinates which form a box surrounding the location. This is the most common type of geographical object we'll find in Twitter data, so we'll spend the most time with it. The `place` object also includes the country, country code, full and short name of the place, and the place type, along with some identification data.
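The sketch below shows one way to pull the corners out of `place`, assuming the usual layout in which `bounding_box["coordinates"]` holds a single ring of four `[longitude, latitude]` pairs. The coordinate values are invented for illustration, and `get_bounding_box_corners` is a hypothetical helper name.

```python
# Hypothetical tweet fragment; the coordinate values are made up.
tweet = {
    "place": {
        "full_name": "Zurich, Switzerland",
        "name": "Zurich",
        "place_type": "city",
        "country": "Switzerland",
        "country_code": "CH",
        "bounding_box": {
            "type": "Polygon",
            # One ring of four [longitude, latitude] corners.
            "coordinates": [
                [[8.44, 47.32], [8.44, 47.43], [8.63, 47.43], [8.63, 47.32]]
            ],
        },
    }
}


def get_bounding_box_corners(tweet):
    """Return the list of [longitude, latitude] corners, or None if there is no place."""
    place = tweet.get("place")
    if not place or not place.get("bounding_box"):
        return None
    return place["bounding_box"]["coordinates"][0]


corners = get_bounding_box_corners(tweet)
print(tweet["place"]["full_name"], corners)
```

Checking for a missing `place` first matters because many tweets carry no place information at all.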
5. Calculating the centroid
The bounding box can range from a city block to a whole state or even a country. For simplicity's sake, one way to handle these data is to translate the bounding box into what's called a centroid, or the center of the bounding box. In the image, the centroid is marked by a small cross.
The calculation of the centroid is straightforward. We assume that there are only two unique longitudes and two unique latitudes. We take the two longitudes, add them together and divide by two, then do the same for the two latitudes. This gives the midpoint of the box along each axis, and together those two midpoints form the centroid.
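A minimal sketch of that calculation follows, under the same assumption of an axis-aligned box. The function name `bounding_box_centroid` and the sample corners are hypothetical; taking the minimum and maximum of each axis is one way to obtain the two unique values, and it also copes with boxes whose corners all coincide.

```python
def bounding_box_centroid(corners):
    """Return (longitude, latitude) of the center of an axis-aligned bounding box.

    `corners` is a list of [longitude, latitude] pairs.
    """
    longitudes = [lon for lon, lat in corners]
    latitudes = [lat for lon, lat in corners]
    # The two unique values on each axis are its minimum and maximum.
    central_lon = (min(longitudes) + max(longitudes)) / 2
    central_lat = (min(latitudes) + max(latitudes)) / 2
    return central_lon, central_lat


# Made-up corners chosen so the arithmetic is easy to check by hand.
corners = [[8.0, 47.0], [8.0, 48.0], [9.0, 48.0], [9.0, 47.0]]
print(bounding_box_centroid(corners))  # (8.5, 47.5)
```

For very large boxes, such as a whole country, the centroid is only a rough stand-in for the true location.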
6. coordinates JSON
The most specific type of geographical object is the `coordinates` object, which is simply a single longitude and latitude pair. The `type` value indicates that this is a point, and the coordinates denote an exact place. However, because this information is so exact, even fewer tweets contain it.
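Reading the point could be sketched as follows, assuming the object follows GeoJSON ordering with longitude first and latitude second. The sample values are invented, and `get_point` is a hypothetical helper name.

```python
# Hypothetical tweet fragment; the coordinate values are made up.
tweet = {"coordinates": {"type": "Point", "coordinates": [8.54, 47.37]}}


def get_point(tweet):
    """Return (longitude, latitude) for an exact point, or None if the tweet has none."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]
    return lon, lat


print(get_point(tweet))
```

Since most tweets lack this field, a practical approach is to fall back to the `place` bounding box when `get_point` returns `None`.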
7. Let's practice!
In the following exercises, you're going to practice accessing these parts of the Twitter JSON and write the centroid-calculating function.