Big Data With Purpose: How We Calculated the Fines of 1.55 Million Laws

This is a technical explanation of how we built our “PenaltyAI Search” service that combs 1.55 million world laws from 79 countries for fines. It can answer questions like “What would I pay for violating money laundering laws in Jamaica?” or “How much would a smuggler who warehouses stolen goods in China pay if they’re caught?“.

The penalties are extracted by an offline algorithm that runs on an Azure VM that does the following steps:

  1. Find laws that mention keywords associated with civil penalties (as a first pass)
  2. Convert all word numbers (like “one million”) into international number format (“1,000,000.00”)
  3. Identify the paragraphs that likely contain civil penalties based on words and numbers
  4. Merge several penalties into one, whether they related to the same “clause” (section) of a law
  5. Extract all the clauses and penalties
  6. Exclude certain classes of text that are almost never penalties but look like penalties (such as laws about gold coins and section references in laws that have to do with money)
  7. Recognize currencies in text, and combine this data with our table of national currencies, and convert penalties into USD using Yahoo! Finance rates (through the XML API call)
  8. Store the penalties and clauses in a MySQL database (RDS)

Screenshot of one of the MySQL tables for penalties

We then note in our search instance whether or not a law has penalties attached to it, so that the search instance can filter by laws that have penalties (as opposed to our regular search that includes laws that don’t have explicit fines attached to them). This process is run as a batch job offline because our 1.55 million+ laws takes several hours to process and no one would wait that long for their search results!

When a user does a search, the search is first sent to our Elasticsearch instance, and then the penalties are looked up from the MySQL database afterwards. This allows full-text search of laws to be combined with penalties, and in a way that results in much less strain on our relational database (because penalties are looked up by IDs rather than a JOIN). Storing the penalties separately allows us to reduce the amount of data in the in-memory search instance, and decouples our services (since we have other types of search like technical standards and law analytics).

The laws themselves are indexed, downloaded, converted to text, parsed, and converted to English, using our pipeline that runs on another Azure VM with RDS as the data store. We make extensive use of the Microsoft Translator API to convert foreign legislation to English (since most of the world’s laws are published in languages other than English). Our use of the service is actually listed on the “Customers” page for Microsoft Translator. We’ve written elsewhere on our blog about some of the ways we gather and process world legislation.