Adrian Kay
Extract information from news articles by using Web scraping and NLP

Photo by Hannah Gibbs on Unsplash

An alternative way to verify the findings from your algorithms and machine learning models is to compare them against news articles, and what better source than results that have already been vetted and verified by a journalist.

This article focuses mainly on New York City’s commercial real estate space; the information we will extract from these news articles is the properties’ addresses and their overall status. The process is broken down into three stages:

  1. Web-Scraping — Scrapy
  2. NLP ( Natural Language Processing) — Natural Language Toolkit (NLTK)
  3. Utilizing third-party APIs — ZoLa NYC’s Zoning & Land Use Map API

Web scraping is a very powerful tool for extracting information that is not otherwise available. For this exercise, Python’s Scrapy will be used, since the target sites are not rendered with JavaScript or AJAX. The targeted news websites are as follows:

  1. The Real Deal
  2. New York Yimby
  3. New York Post
  4. Commercial Observer
  5. Brownstoner

Most of these websites are fairly easy to scrape since the site structure is based on page numbers.

Stage 1 & 2

After understanding the site structure and knowing the maximum number of pages for a given category, generate a list of URLs to scrape. The next stage is to extract the link (`href`) of each article and loop through stage 3.

# Sample residential real estate URL list (the base URL prefix is omitted here)
resi_urls = ['{}/'.format(i) for i in range(1, 270)]

# Sample XPath extraction from HTML
article_urls = response.xpath('//*[@class="entry-title entry-summary"]/a[1]/@href').extract()

Stage 3 & 4

Once the article page is loaded, the title, subtitle, publication date, content, and tags are all extracted and stored.
Lesson learned: when the paragraph text contains hyperlinks, a single XPath expression will not pick up the hyperlink text. An OR (`|`) condition in the XPath expression is needed to include all of the text.

# Sample code if the content contains hyperlink tags
content =''.join(response.xpath('//*[@class="post-content-box"]/p/text() | //p/a/text()').extract())
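The same pitfall can be reproduced with the standard library alone, as a rough illustration (the sample paragraph is invented): an element’s direct `.text` stops at the first nested element, so the hyperlink text has to be gathered separately, with `itertext()` playing the role of the XPath OR.

```python
import xml.etree.ElementTree as ET

# A paragraph whose text is interrupted by a hyperlink
p = ET.fromstring('<p>The tower at <a href="/205-fifth">205 Fifth Avenue</a> sold.</p>')

direct_text = p.text               # stops at the <a> element: "The tower at "
full_text = "".join(p.itertext())  # includes the hyperlink text and the tail
```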

This section lays out the steps necessary to build a named entity recognizer with NLTK to identify the addresses, locations, or organizations associated with an article.

After transforming all scraped news articles into a clean database, the next step is to pre-process the text data.

  1. Remove all stop words
  2. Word tokenization
  3. Part-of-speech tagging (POS tagging)
import nltk
from nltk.corpus import stopwords

stop = stopwords.words('english')

def nltk_process(document):
    # Remove stop words, then tokenize and POS-tag each sentence
    document = " ".join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

4. Pattern recognition
Most addresses in the news articles are normally structured as follows:

205 Fifth Avenue
123 Stone Street
35th West 5th Avenue

Note that addresses in articles don’t necessarily follow the traditional structure of standard addresses, which increases the complexity of the regular expression patterns. The pattern below should be able to capture the majority of addresses in the US. Link to POS tags list

# Sample pattern for addresses
ADDRESS: {<JJ.?|CD.?>+<CD.?|JJ.?|NNP.?>+<CD|NNP>+}
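As a rough cross-check of the same idea with plain regular expressions (a hedged sketch only; the POS-tag grammar above remains the method used here), the three example addresses can be matched like this:

```python
import re

# Simplified pattern: a house number (optionally ordinal), zero or more
# capitalized name words or ordinal numbers, then a street-type suffix.
# It is far from exhaustive.
ADDRESS_RE = re.compile(
    r'\b\d+(?:st|nd|rd|th)?\s+'
    r'(?:[A-Z][a-z]+\s+|\d+(?:st|nd|rd|th)\s+)*'
    r'(?:Avenue|Street|Boulevard|Road|Place)\b'
)

samples = ["205 Fifth Avenue", "123 Stone Street", "35th West 5th Avenue"]
matches = [ADDRESS_RE.search(s).group() for s in samples]
```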

5. Chunking
Chunking is the process of parsing the predefined pattern over the POS-tagged sentences and outputting labeled chunks wherever the pattern matches.
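Assuming NLTK is installed, the chunking step can be sketched with `nltk.RegexpParser` and a hand-tagged sentence (hand-tagging avoids downloading the tagger model; the example sentence is invented for illustration):

```python
import nltk

# The ADDRESS grammar from the pattern-recognition step above
grammar = "ADDRESS: {<JJ.?|CD.?>+<CD.?|JJ.?|NNP.?>+<CD|NNP>+}"
chunker = nltk.RegexpParser(grammar)

# A hand-tagged sentence, as nltk.pos_tag would produce
tagged = [("The", "DT"), ("property", "NN"), ("at", "IN"),
          ("205", "CD"), ("Fifth", "JJ"), ("Avenue", "NNP"),
          ("sold", "VBD")]

tree = chunker.parse(tagged)
# Collect the words inside every chunk labeled ADDRESS
addresses = [" ".join(word for word, tag in subtree.leaves())
             for subtree in tree.subtrees()
             if subtree.label() == "ADDRESS"]
```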

To take these parsed addresses to the next level and add a geospatial component, an API is used. These articles mainly focus on New York City, and NYC’s Zoning & Land Use Map (ZoLa) website contains a hidden API in its search box. It is therefore possible to piggyback on this API with the Python `requests` package to extract the full address and coordinates for further analysis on our platform. In this case, the address and the Borough, Block and Lot (BBL) are extracted.

Example output from ZoLa’s API. (Note Lat/Long is also available)
import urllib.parse
import requests

def address_api(address):
    if address == ' ':
        return None, None
    # Percent-encode the address for use in the query string
    address = urllib.parse.quote(address.encode('utf-8'))
    # Note: the base URL was truncated in the original; only the query string survives
    url = "[]=geosearch&helpers[]=bbl&helpers[]=neighborhood&helpers[]=zoning-district&helpers[]=zoning-map-amendment&helpers[]=special-purpose-district&helpers[]=commercial-overlay&q={}".format(address)
    r = requests.get(url)
    if r.status_code == 200:
        if 'json' in r.headers.get('Content-Type') and len(r.json()) > 0:
            results = r.json()
            addresses = [x['label'] if x['type'] == 'lot' else None for x in results]
            bbl = [x['bbl'] if x['type'] == 'lot' else None for x in results]
            return addresses, bbl
    return None, None
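Since the endpoint’s base URL is truncated above, here is just the URL-encoding step in isolation, with a placeholder host (everything except `urllib.parse.quote` is an assumption for illustration):

```python
import urllib.parse

def build_query(base_url, address):
    # Percent-encode the address so spaces survive inside the query string
    return "{}?q={}".format(base_url, urllib.parse.quote(address))

url = build_query("https://example.invalid/search", "205 Fifth Avenue")
```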

There are endless possibilities in using NLTK: for example, auto-tagging, translation services, sentence summarization, and so on. Also, for larger files and Big Data workloads, Hadoop and Apache Spark would be needed to speed up the process.

Link to GitHub
