People often relocate from one city to another for a job opportunity or for family reasons, and they often find that they have little to no choice in picking their locality. They have to rely on a local contact or a middleman, with no assurance of finding what they want. What if we could apply data science algorithms to publicly available data and the Foursquare Places API to get what we want, or at least a list of similar neighbourhoods worth considering? Let's find out.
Consider a person who wants to relocate from Manhattan, New York to Toronto. We have to find a location worth considering. We are going to compare the neighbourhoods of New York City with the neighbourhoods of Toronto using the Foursquare Places API.
New York City has a total of 5 boroughs and 306 neighbourhoods.
This dataset is freely available on the web: https://geo.nyu.edu/catalog/nyu_2451_34572
For the city of Toronto, we extract the neighbourhood information from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
The geographical coordinates of the neighbourhoods in both cities are used as input for the Foursquare API, which returns venue information for each neighbourhood. This lets us explore the neighbourhoods of both cities.
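To make the role of the coordinates concrete, here is a minimal sketch of how a venue request could be assembled for one neighbourhood. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `ll`, `radius` and `limit` parameters; the helper name `build_explore_url`, the default radius and version date, and the placeholder credentials are hypothetical, and the project's actual code may differ.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a request URL for venues around one neighbourhood centre.

    radius (metres) and version (YYYYMMDD) are illustrative defaults.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude of the neighbourhood
        "radius": radius,
        "limit": limit,        # up to 100 venues per neighbourhood
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials and approximate Manhattan coordinates.
url = build_explore_url(40.7831, -73.9712, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The sketch only builds the URL; sending it (for example with an HTTP client) and parsing the response is left out because it needs real credentials.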
The data described in the section above is used to extract the neighbourhood information of each city. The New York City data is available in JSON format. It is loaded into a pandas DataFrame to get all the neighbourhoods of New York City:
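As a rough illustration of this step, the sketch below flattens GeoJSON-style records into a pandas DataFrame. It assumes each record sits under a `features` key with `properties.borough`, `properties.name` and a `[longitude, latitude]` pair under `geometry.coordinates`; the two sample records and the function name `features_to_frame` are made up for the example and are not copied from the dataset.

```python
import json

import pandas as pd

# Illustrative records only; the structure is an assumption about the file.
sample = json.loads("""
{
  "features": [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.85, 40.89]}},
    {"properties": {"borough": "Manhattan", "name": "Marble Hill"},
     "geometry": {"coordinates": [-73.91, 40.88]}}
  ]
}
""")


def features_to_frame(geojson):
    """Turn a list of GeoJSON-style features into one row per neighbourhood."""
    rows = []
    for feature in geojson["features"]:
        lng, lat = feature["geometry"]["coordinates"]  # GeoJSON order: lng, lat
        rows.append({
            "Borough": feature["properties"]["borough"],
            "Neighbourhood": feature["properties"]["name"],
            "Latitude": lat,
            "Longitude": lng,
        })
    return pd.DataFrame(
        rows, columns=["Borough", "Neighbourhood", "Latitude", "Longitude"]
    )


ny_df = features_to_frame(sample)
print(ny_df)
```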
Neighbourhoods of Toronto:
The neighbourhood data of both cities is then processed, and the Foursquare Places API is used to get the top 100 venues in each locality. One-hot encoding is applied to the venue categories, and the data is then grouped by neighbourhood to get the dataframe ready for comparison.
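The encoding and grouping step can be sketched as follows. The tiny `venues` table and its column names are invented for the example, and averaging the one-hot columns per neighbourhood (so each value is the share of venues in that category) is an assumption about how the grouping was done.

```python
import pandas as pd

# Hypothetical venue list: one row per venue returned for a neighbourhood.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park", "Bar"],
})

# One column per venue category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of each column per neighbourhood = frequency of each category.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` is then a fixed-length profile of a neighbourhood, which is what makes neighbourhoods from different cities comparable.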
Dimensionality reduction is then performed on the merged dataframe of neighbourhoods from both cities using Principal Component Analysis (PCA). This lets us reduce the number of features while retaining most of the variance in the data.
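A minimal sketch of this reduction with scikit-learn is shown below. The input is synthetic (20 features generated from 5 hidden factors) and the 95% variance target is an assumed threshold, not the value used in the project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the merged feature matrix:
# 60 "neighbourhoods", 20 features driven by 5 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(60, 20))

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (assumed target: 95%).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

retained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape, "variance retained:", round(retained, 3))
```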
Then the silhouette score method is used to find the best k, that is, the optimal number of clusters for the K-means clustering algorithm. Here is the result:
Using this method, we concluded that the best value of k for running K-means on the merged dataframe is 3. We therefore ran K-means to group all the neighbourhoods of both cities into three clusters.
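The selection of k described above can be sketched like this. The data is synthetic (three well-separated blobs), and the candidate range of 2 to 7 clusters is an assumption; the idea is simply to fit K-means for each candidate k and keep the one with the highest mean silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=42,
)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The best k is the one with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

A score close to 1 means points sit well inside their own cluster and far from the others, so the peak of this curve is a reasonable choice for k.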
After running K-means on the processed dataframe, the neighbourhoods of both cities are assigned to their respective clusters. Here are the results plotted on a map:
New York neighbourhoods:
Toronto neighbourhoods:
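To show how the cluster labels turn into a shortlist, here is a small sketch on made-up data. The neighbourhood names, the two-dimensional features and the column names are all hypothetical; only the idea (fit K-means with k = 3 on the merged data, then look up Toronto neighbourhoods that share a cluster with the current home) follows the text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up features: three tight groups of four neighbourhoods each.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
features = np.vstack([c + rng.normal(scale=0.3, size=(4, 2)) for c in centres])

# Each group holds two New York and two Toronto neighbourhoods.
merged = pd.DataFrame({
    "City": ["New York", "New York", "Toronto", "Toronto"] * 3,
    "Neighbourhood": [f"N{i}" for i in range(12)],
})

# Cluster the merged data and attach the label to every neighbourhood.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
merged["Cluster"] = kmeans.labels_

# Shortlist: Toronto neighbourhoods in the same cluster as the current home.
home = merged.iloc[0]  # a New York neighbourhood
candidates = merged[
    (merged["City"] == "Toronto") & (merged["Cluster"] == home["Cluster"])
]
print(candidates)
```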
Using this neighbourhood analysis of each city, we have an answer to the question we began with. A person who lives in Manhattan, likes their neighbourhood and wants to relocate to Toronto now has a shortlist of localities in the same cluster that they can move to and expect to enjoy. Now they have a choice.
code : click here
report: click here