NLP in the real world: How Global-Regulation organized its comparative law search engine with Watson

In the legal services industry, Global-Regulation is using NLP and machine translation to build the most comprehensive world law search engine. I recently spoke with CTO Sebastian Dusterwald to discuss how Global-Regulation uses Watson NLP technology to translate laws into English.

Sean: Sebastian, tell us about Global-Regulation and what your team does.

Sebastian: At Global-Regulation, it is our mission to democratize access to laws from across the globe. We handle large amounts of text data. We index, process, and translate nearly 2 million laws from nearly 100 countries, from Brazil to China to France to Italy and more, using machine translation. We help make laws searchable and accessible in English. We do all of this with a very small team, and none of it would be possible without the amazing AI-powered cloud services provided by the Watson platform.

Sean: Very cool. Do you have any recent examples to share about how the team is using Watson?

Sebastian: Recently one of our clients approached us about adding categories to our law metadata, to make it easier for them to find the laws relevant to their business use case: monitoring specific types of laws (such as those in healthcare and cybersecurity) to maintain regulatory compliance. With so many laws in our database, discoverability is always an issue, so we thought this could be a great feature to add to our site. The problem is that very few of our sources provide any sort of categorization metadata, and those that do all use slightly different categories, so simply grabbing this data during indexing was out.

We needed a system that could analyze and process our text data, and then categorize it in preset bins. IBM suggested that we try out the IBM Watson Natural Language Understanding (NLU) API. This does exactly what we want out of the box: it allows us to upload training data and then to classify natural language text based on that.

Sean: Interesting, so what did you do next?

Sebastian: Well, we went through our database to find several laws that we thought were representative of each category, from finance to cybersecurity to environment-based laws. We then went through each of those laws and picked out chunks of text that we thought were relevant to the category. This was the most complicated and labor-intensive part of the implementation process. We had to pick chunks of text that were specific enough to train the NLP algorithm about the domain the law refers to, while being generic enough not to over-train the algorithm. This meant avoiding words such as specific names of countries or people, or dates. Including them would have risked training the algorithm on keywords that look very specific but have nothing to do with the category at hand.

The training set was simply entered into a spreadsheet, and then uploaded to the IBM Watson NLU API. After a short wait for it to process the data, the API was ready to accept queries. Our approach was to use the first 1024 characters of a law to classify it. This generated quite good results, in part because the first 1024 characters of a law typically include its title, which tends to include a number of keywords that the algorithm can use. At this stage we were pretty sure that the IBM Watson NLP technology would be suitable for our use case, albeit with a little bit of fine-tuning.

Sean: That’s great! Can you tell me a bit more about how you then fine-tuned Watson to meet your client’s use case?

Sebastian: The first thing we did was to take samples across each of the laws in our database, such as healthcare, welfare, and privacy-based laws. Instead of taking just the first 1024 characters of each document, we took five 1024-character chunks evenly spread across the document. We then averaged the confidence scores returned by the IBM Watson NLU API across those chunks and chose the category with the highest averaged score as the category for that law. This significantly increased the accuracy of the classifier for our dataset.
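
To make the sampling-and-averaging step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Global-Regulation's actual code: `classify_chunk` is a hypothetical callable standing in for whatever call returns per-category confidence scores for a piece of text, and the function names are invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

CHUNK_SIZE = 1024   # characters per sample, as described above
NUM_SAMPLES = 5     # samples per law, as described above


def sample_chunks(text: str, num_samples: int = NUM_SAMPLES,
                  chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return up to num_samples chunks of chunk_size characters, evenly spread over text."""
    if len(text) <= chunk_size or num_samples < 2:
        return [text[:chunk_size]]
    last_start = len(text) - chunk_size
    step = last_start / (num_samples - 1)
    # A set removes duplicate start offsets in short documents.
    starts = sorted({round(i * step) for i in range(num_samples)})
    return [text[s:s + chunk_size] for s in starts]


def classify_law(text: str,
                 classify_chunk: Callable[[str], Dict[str, float]]) -> Tuple[str, float]:
    """Average per-category confidence over the sampled chunks and return the best category.

    classify_chunk is a hypothetical stand-in for the remote classifier call;
    it must return a mapping of category name to confidence score.
    """
    chunks = sample_chunks(text)
    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        for category, confidence in classify_chunk(chunk).items():
            totals[category] += confidence
    if not totals:
        raise ValueError("classifier returned no categories")
    averages = {category: total / len(chunks) for category, total in totals.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]
```

A category that a given chunk does not return is treated here as scoring zero for that chunk, which is one possible design choice among several.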

Next we looked at laws that we found to be classified incorrectly. We compiled a list of such laws and went through them, using more text fragments from each of these to add to the training set. Once this was completed, we uploaded it to the IBM Watson NLU API and waited for it to train a new and improved classifier. This further improved the accuracy and at this point we were happy with it. So as a final step we started to run the classifier across our entire database of laws.

Sean: Glad to hear it all worked out. Do you have any final thoughts or takeaways you would like to share about working with IBM Watson NLP technology?

Sebastian: Yes, absolutely! As you can tell, automatically translating and classifying nearly 2 million documents into a number of categories was a daunting task for a small team working with limited resources. With the volume of new laws coming in globally, our company needs to keep up with the demand and constant changes to existing laws at a global scale. We can say confidently that without the help of the Watson platform we would not be able to translate and categorize the millions of documents coming into our database in such a short time. We managed to have the basic implementation running in about a week, which is phenomenal! Thanks to the Watson platform, our small company can punch well above our weight.


Originally published in Watson Blog


Can computers write laws?


I’ve been thinking about your implicit question for the last couple of years, on and off. It’s a fascinating question, and my immediate answer was no. But could it be yes? Or why is the answer actually no?
I think I might have a decent answer to this now.
The answer is no because existing laws operate in systems that include other laws and elements outside the law (culture, geography, etc.). It’s too difficult to write a law that fits squarely within parameters.

I think that the basis for every law, and many times for a sub-system of laws, is a regulatory mechanism, e.g., command and control for criminal law or green-light regulation for environmental law (you will get a tax credit if you reduce your emissions).

So the first step (i.e., the regulatory adviser of the future) is to come up with an efficient and suitable (i.e., culturally/geographically apt) solution. Articulating it into the law is probably the next step.
So, if you can distill regulatory mechanisms from laws, use regulatory case studies to measure effectiveness and include location appropriateness, then with the right state of the art (and/or a genius developer) you can come close to this idea. Moreover, with such a system in place you may be able to provide creative regulatory solutions.
It seems that this discussion corresponds with what we are seeing both with Global-Regulation clients and in the market offering (Archers’ Reg Content Analysis) to compare regulatory updates with corporate policies so that the policies can be updated accordingly.
This is fascinating both theoretically and practically.

I think we’re both right.

The current situation is so bad (i.e. divorced from reality or logic, since most places aren’t basing laws on best practices or evidence) that it’s not possible for a computer to do the same job as the current bad job.
But a better way is definitely possible and it would involve exactly what you said. But it would be wholesale replacement of systems of law with better systems. Not tweaks to regs or proposing a new law. And the real explanation for why this won’t work for the current approach is that the current approach is extremely ad-hoc and often not driven by a search for the best law.
It’s also related to another issue: expensive government IT projects that fail. They often fail because people want technology that does what the people do but that’s not possible or desirable. Redoing a process is very hard normally and when you add in a million undefined behaviours (like how people operate) it doesn’t really work.
But if you start with the tech, embodying an ideal system, then you might get an ideal system.
Modelling a broken system (that people either don’t know is broken or won’t admit to) will never result in an ideal solution. It can only result in a mediocre muddled path forward.
Sometimes the answer is a genuine new system and new approach that doesn’t relate to the old model. Like, compare Roman fraternal benefit societies to modern insurance products. Sometimes the new one is a genuine innovation in itself.

Theoretically, the elected government is supposed to represent the public’s priorities and hence instruct the administration to come up with regulatory mechanisms to be translated into laws to effectively address these priorities. This is the democratic system.

In practice, this is broken at a few junctures, with time playing a major role (the life span of the government and the next elections).
Our discussion is focused on the regulatory mechanisms part and what role technology is playing and could play in this part. For simplicity we can assume (although usually the opposite occurs) that the government does not instruct the administration regarding the regulatory mechanism of choice but rather points to the challenge that should be resolved (there is probably tons of literature on this part alone).
With this in mind, technology can be regarded as just a tool to which people assign good and bad or technology could be regarded as setting its own agenda. I tend to support the latter approach.
Our vision with Global-Regulation was to use technology in order to assist the administration in determining the most efficient and culturally suitable way to address the challenge. We encountered laziness, short-sightedness and searching under the streetlight.
Hence, we now discuss whether, and if so how well, technology can replace the administration in the said regulatory and legislative process. Can it also assist in the other problematic parts of the said democratic process?
I assume that the more technology improves at the said task of providing a regulatory solution, the more difficult it will be to ignore, and the more regulatory solutions will be based on effectiveness and lesson-drawing from other jurisdictions and fields. Including the entire legislative process (e.g., combining Govtmonitor and Global-Regulation) is probably one of the steps in this direction. Creating model laws, similar to what Wolters Kluwer Capital markets clause analytics and IBM Watson’s Compare and Comply are doing with contracts, could also be valuable.
One potential business idea that jumped out at me when I read what you wrote: preemptively suggesting areas of law for governments to work on, based on what’s done elsewhere. That’s actually possible. Given the Global-Regulation database, you could actually identify some areas of regulation that could form a regulatory agenda for harmonisation or for inspiration. You can actually work out the legislative agenda pretty easily by taking the laws as passed by year and taking the most common words in those laws (which Global-Regulation already has because that’s used for the similarity search) and then find the areas that aren’t well represented in the recent laws of the given jurisdiction. That would probably work.
I’m not sure if it’s much of a business, but, if it’s actually the case that laws converge across countries (I dunno if that’s true) then the above tool could also be used to predict legislative agendas.
For example: If you know that most places in North America have recently passed laws that mention data breaches and related words, there’s a decent chance that’s going to happen in Ontario too. I have no idea how well this correlation works but it’s logical.
So you mean mapping the jurisdictional connections between laws and legal systems based on the similarity of the laws’ text and then making predictions based on the frequency of words? That sounds great!
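
As a rough illustration of the word-frequency idea discussed above, the following Python sketch counts the most common terms in recent laws from other jurisdictions and reports those that never appear in the target jurisdiction's recent laws. Everything here is an assumption made for the example: the function names, the tiny stopword list and the plain term counting are placeholders, not how Global-Regulation's similarity search works.

```python
import re
from collections import Counter
from typing import Iterable, List

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"act", "law", "laws", "regulation", "regulations", "rules",
             "the", "and", "for", "with", "amendment"}


def term_counts(law_texts: Iterable[str]) -> Counter:
    """Count lowercase alphabetic terms longer than three letters, ignoring stopwords."""
    counts: Counter = Counter()
    for text in law_texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return counts


def agenda_gaps(laws_elsewhere: Iterable[str], laws_in_target: Iterable[str],
                top_n: int = 10) -> List[str]:
    """Return the most common terms elsewhere that are absent from the target jurisdiction."""
    elsewhere = term_counts(laws_elsewhere)
    local = term_counts(laws_in_target)
    return [term for term, _ in elsewhere.most_common(top_n) if local[term] == 0]
```

For instance, if several neighbouring jurisdictions have recently passed laws mentioning "data" and "breach" and the target jurisdiction has none, those terms would surface as candidate agenda items. Whether such gaps actually predict future legislation is, as noted above, an open question.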

Weekly Updates on New Laws Globally

Updating the company’s policies based on new legislation is a major part of the regulatory manager’s work. Until recently, this task has been especially challenging, with no reliable one-stop source for updates on new legislation.

While a few companies provide updates on new legislation, their coverage is limited to North America (see, for example, Lexis’ State Net and Pulse, Fiscal Note, and Govtmonitor in Canada). Others provide updates on financial regulation, like 8of9 RegAlytics and

In order to face this challenge, Global-Regulation has utilised its system to start providing weekly updates on new laws from 46 countries, about half of which are machine translated to English.

Our ‘new laws’ section shows new laws from 46 countries with the option to filter by country.

In addition, we created an option to receive weekly email alerts on new laws based on the user’s keyword (please note: keyword email alerts for new laws can be created only by subscribers).

After creating the alerts, the user will receive an email every time her keywords appear in new laws.

These personally customised keyword based email alerts are available for unlimited users under the corporate subscription.


Are lawyers afraid of Legal Technology?!

Lawyers, or more precisely their information managers, will always find a problem in your legal information technology. It is not updated, it is not accurate, it uses machine translation, it’s not your junior lawyer, it does not make coffee.

Is it because they are afraid for their jobs? Imagining this IBM PR creation (ROSS) thing that will take over 90% of the firm’s employment opportunities and leave young associates (and old) unemployed?!
Or maybe it is because they would not settle for less than the perfect machine that will save their firm a fortune by doing all the work by itself with no need to generate a pay slip?!

I argue that they are not ready yet for legal technology. Sure, they use search engines like Lexis and Westlaw but when it comes to real legal technology – they are afraid. It disrupts their perception of the legal profession.

As always, the client is the one paying for the fear.


What is the Future of Finance? or is UBS a top performer?



“UBS never took enough interest in its risks”, Financial Times, 20.12.2012


Let’s start with the bad news: we did not win the Americas UBS Future of Finance challenge 2017. The good news is that we had the opportunity to pitch our RegTech vision and, no less important, to get an inside look at UBS’s use of technology (or lack thereof) in this field.

Our pitch was simple: you (UBS) need a regulatory compliance system, much like the one we’re currently offering for world laws but much more advanced: a smart system that can track, translate, map, compare and digest new regulatory change in less than an hour, globally; a learning system that will co-evolve with the bank’s systems and thus prevent future fines and minimize risk.

The justification was straightforward: according to a recent BCG report, the number of individual regulatory changes that banks must track on a global scale has more than tripled since 2011, to an average of 200 revisions per day. This is not a scale humans can handle efficiently. Hence it is no surprise that banks paid $42 billion in fines in 2016 alone and $321 billion since 2008.

Technically speaking, the Americas finals in which we participated were organized to the last detail. Though dietary options were not available (vegan, gluten-free etc.), the bank allocated relevant representatives to meet with each finalist and provide feedback on the pitch. For us these meetings felt like development meetings, as the bank people offered great ideas to enhance our vision.

More importantly, it was an indication from a first-hand internal source that the bank (and other banks as well) is light years behind when it comes to RegTech and regulatory compliance. Given the bank’s spending in this field (in the billions) it is quite amazing, and it was certainly reassuring going into the pitching competition.

Inconveniently, while the mentoring session was held at the bank’s offices in Manhattan, the finals were held at the offices in New Jersey. This divide forced the candidates to move from one hotel to another and/or struggle with the massive transportation challenges that New York City has to offer.

With no expected diversity, the judges were all IT people. The Americas CEO, Tom Naratil, gave the opening speech but failed to stay for the actual competition. The judges were provided with feedback from the previous day’s mentors (ours was excellent) but did not provide any feedback or reasons for their choice of the winning pitch or of the 2nd and 3rd runners-up.

The winner, Authomate, pitched a mobile security system to allow the bank’s clients to log into the bank’s portal safely. While the technology may be new, this is by no means an innovative or disruptive concept. Moreover, based on corporate logic, this will probably be the last technology UBS will adopt.

It is too early to say if the bank will be interested in our vision for the future. In the same way, it was not clear whether the finalists were supposed to pitch a future venture that can be developed with the bank, or what they already have (Authomate) to be used by the bank. Either way, one thing was clear: as with most big corporations, UBS’s structure is very fragmented, and capturing the attention of the relevant person is extremely challenging.

To summarize the experience, I would like to use the same citation I used at the end of my pitch: “Increasing regulation is here to stay – much like a permanent rise in sea level. In an era of rising regulatory seas, focus on management is mandatory, not optional. Top performers will use the opportunity to incorporate technical innovation” (BCG Report).

Whether UBS is a top performer is yet to be seen.


Global-Regulation + RSA Archer = More Visibility into your Compliance Needs

Governance, Risk management and Compliance (GRC) platforms are the organization’s tool to help handle, among others, its regulatory affairs. This is what the RSA Archer® Suite is designed to provide through RSA® Archer® Regulatory & Corporate Compliance Management.

With most of the world’s laws (1.6 million laws from 90 countries translated from 30 languages) in English, in addition to a complexity map and an AI-driven penalty identifier, Global-Regulation is positioned perfectly to complement the RSA Archer Suite.

This is the reason that Global-Regulation has a technology interoperability with RSA Archer Suite to offer customers an XML download of the world’s laws directly to RSA Archer Regulatory & Corporate Compliance Management, empowering customers to obtain better visibility into their compliance needs.

Now, with the launch of the RSA Archer Exchange available to RSA Archer customers, this technology interoperability can be even more seamless and easy than before.


We’re on DuckDuckGo

Users can now run legislation searches directly from the DuckDuckGo search engine. We’re using “!laws”.

Just type “!laws climate change” in DuckDuckGo to be taken directly to Global-Regulation’s search results for “climate change”.


Software That Reads Laws: PenaltyAI Search – Global Risk & Compliance Redefined

About a year ago I sat with my CTO in the Manhattan office of one of the world’s largest accounting firms. Their global regulatory compliance team was very impressed by what we’ve done so far with Global-Regulation and wanted to know what more we could do. As usual with large firms, they wanted a system that does everything, from tracking new bills to predicting the future (step 10 instead of step 3).
The ambition to create the ultimate risk and compliance system stuck with us. This ambition came to life when we realized, in one of our internal discussions about our global law search engine, that penalties are the kind of information that can be identified with a high degree of certainty by an artificially intelligent system.

My story begins in the 2000s when I helped the Israeli court system work with IBM to digitize legal information. I’ve seen the slow evolution of legaltech and listened to the ambitious ideas of tech people. But I’ve also seen the reality of legal technology and wondered: how can we give machines the insight of lawyers?

Fast forward to 2017: after seemingly endless testing, experimenting, coding, consulting (thank you to Kyle Gorman from Google for the words to numbers converter recommendation) and hard work, we are extremely excited to present the PenaltyAI Search, the first and only AI system that identifies compliance clauses in legislation on a global scale, extracts the actual penalty amounts and serves it all to the user in US dollars.

Now risk and compliance professionals can search and identify risk levels across jurisdictions on a specific topic without even reading the law. Let’s say that you are an IBM executive considering global expansion of your Watson services to new markets: with a click of a mouse you can now use the PenaltyAI Search feature of Global-Regulation to learn what the risk level of your goal would be.

Screenshot of PenaltyAI Search for "tobacco nicotine"

Combine this with our complexity feature, suggested search ideas and related laws, and a risk & compliance team can feed Governance, Risk and Compliance (GRC) platforms with all the information needed to launch a new business line in a matter of hours. Before, this would have taken months and required an army of translators and a division of analysts to determine risk and compliance.

We see this as a great achievement on several levels:

  1. an AI system that can really read legal text and produce useful meaning; and,
  2. enabling risk and compliance professionals to explore real and relevant data on a global scale, in English; and,
  3. allowing governments and businesses to assess and enhance their compliance efforts; and finally,
  4. enabling researchers to compare and contrast risk and compliance data globally.

Thank you big accounting firm for teaching us that even seemingly unsuccessful business meetings can bring great results. Thank you Microsoft Canada for your help in connecting us with the Microsoft Translator team. Thank you LegalX (now LawMade). Thank you Ken Thompson for UNIX and regular expressions. Thank you to my wife and children for your daily inspiration.

If you’d like to know more about how the system works technically, my CTO has written a blog post on building PenaltyAI Search.

Computers can now tell us about penalties for world laws.


Big Data With Purpose: How We Calculated the Fines of 1.55 Million Laws

This is a technical explanation of how we built our “PenaltyAI Search” service that combs 1.55 million world laws from 79 countries for fines. It can answer questions like “What would I pay for violating money laundering laws in Jamaica?” or “How much would a smuggler who warehouses stolen goods in China pay if they’re caught?”.

The penalties are extracted by an offline algorithm that runs on an Azure VM and performs the following steps:

  1. Find laws that mention keywords associated with civil penalties (as a first pass)
  2. Convert all word numbers (like “one million”) into international number format (“1,000,000.00”)
  3. Identify the paragraphs that likely contain civil penalties based on words and numbers
  4. Merge several penalties into one, where they relate to the same “clause” (section) of a law
  5. Extract all the clauses and penalties
  6. Exclude certain classes of text that are almost never penalties but look like penalties (such as laws about gold coins and section references in laws that have to do with money)
  7. Recognize currencies in text, and combine this data with our table of national currencies, and convert penalties into USD using Yahoo! Finance rates (through the XML API call)
  8. Store the penalties and clauses in a MySQL database (RDS)
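
Step 2 in the list above can be illustrated with a small Python sketch. This is not the converter the pipeline actually uses (the post does not show it); it is a minimal hand-written example, with invented function names, that handles simple English number phrases and prints them in the "1,000,000.00" format mentioned in the step.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def words_to_number(phrase: str) -> int:
    """Convert a simple English number phrase such as 'one million' to an int."""
    total = 0      # sum of completed groups (thousands, millions, ...)
    current = 0    # value of the group being built
    for token in re.split(r"[\s-]+", phrase.lower().strip()):
        if token == "and":
            continue
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"unrecognised number word: {token!r}")
    return total + current


def format_international(value: int) -> str:
    """Format a number with thousands separators and two decimals."""
    return f"{value:,.2f}"
```

Calling `format_international(words_to_number("one million"))` gives `"1,000,000.00"`. A production converter would also need to cope with ordinals, fractions and non-English text, which this sketch ignores.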

Screenshot of one of the MySQL tables for penalties

We then note in our search instance whether or not a law has penalties attached to it, so that the search instance can filter by laws that have penalties (as opposed to our regular search that includes laws that don’t have explicit fines attached to them). This process is run as a batch job offline because our 1.55 million+ laws take several hours to process and no one would wait that long for their search results!

When a user does a search, the search is first sent to our Elasticsearch instance, and then the penalties are looked up from the MySQL database afterwards. This allows full-text search of laws to be combined with penalties, and in a way that results in much less strain on our relational database (because penalties are looked up by IDs rather than a JOIN). Storing the penalties separately allows us to reduce the amount of data in the in-memory search instance, and decouples our services (since we have other types of search like technical standards and law analytics).
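
The two-stage lookup described above (search first, then fetch penalties by ID) can be sketched as follows. To keep the example self-contained it uses stand-ins: a plain dictionary scan instead of Elasticsearch and Python's built-in `sqlite3` instead of MySQL. The table name, column names and function names are all hypothetical.

```python
import sqlite3
from typing import Dict, List, Tuple

Penalty = Tuple[str, float]  # (clause, amount in USD)


def search_law_ids(query: str, index: Dict[int, str]) -> List[int]:
    """Stand-in for full-text search: IDs of laws whose text contains every query term."""
    terms = query.lower().split()
    return [law_id for law_id, text in index.items()
            if all(term in text.lower() for term in terms)]


def fetch_penalties(conn: sqlite3.Connection,
                    law_ids: List[int]) -> Dict[int, List[Penalty]]:
    """Look up penalties for the given law IDs with a single IN query (no JOIN)."""
    if not law_ids:
        return {}
    placeholders = ",".join("?" for _ in law_ids)
    rows = conn.execute(
        "SELECT law_id, clause, amount_usd FROM penalties "
        f"WHERE law_id IN ({placeholders}) ORDER BY law_id, amount_usd",
        law_ids,
    )
    result: Dict[int, List[Penalty]] = {}
    for law_id, clause, amount in rows:
        result.setdefault(law_id, []).append((clause, amount))
    return result


def search_with_penalties(query: str, index: Dict[int, str],
                          conn: sqlite3.Connection) -> List[Tuple[int, List[Penalty]]]:
    """Run the search, then attach penalties to each hit by ID."""
    law_ids = search_law_ids(query, index)
    penalties = fetch_penalties(conn, law_ids)
    return [(law_id, penalties.get(law_id, [])) for law_id in law_ids]
```

Because the relational store only ever sees a list of IDs, it can be swapped or scaled independently of the search service, which is the decoupling the paragraph above describes.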

The laws themselves are indexed, downloaded, converted to text, parsed, and converted to English, using our pipeline that runs on another Azure VM with RDS as the data store. We make extensive use of the Microsoft Translator API to convert foreign legislation to English (since most of the world’s laws are published in languages other than English). Our use of the service is actually listed on the “Customers” page for Microsoft Translator. We’ve written elsewhere on our blog about some of the ways we gather and process world legislation.


Graphing the World’s Laws: Visualization of 1.55 Million Laws + Our PenaltyAI Search

The graph above is the first time that penalties for non-compliance with the world’s laws have been visualized. It was made possible by the culmination of Global-Regulation Inc.’s R&D efforts over the last year to create an automated AI method for reading penalty provisions from civil laws – see the system here.

Our system (that we’re calling “PenaltyAI Search”) is now able to extract penalties from legislation (statutes and regulations) and present them in US dollars, along with the original text. This is a multi-phase process that starts with an AI-based algorithm that identifies the penalty clauses. The next step is to extract the penalty amount from the penalty clause. This step includes a complex linguistic mechanism that converts amounts written in words into numbers (for example, “one hundred thousand” to 100,000) and handles Indian English notation like “lakh” and “crore”. The next step is to convert different notation systems into a standardized decimal format (e.g. “560,99” to 560.99). The final step is converting all the world’s currencies into USD to enable comparison on a global scale (which is done on an ongoing basis to account for currency fluctuations).
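
The notation and currency steps can be illustrated with a short Python sketch. It is a simplified example under stated assumptions, not the production logic: the separator heuristic, the function names and the exchange rate used in the usage note are all invented for illustration.

```python
import re
from decimal import Decimal
from typing import Dict

# Indian English scale words mentioned above.
INDIAN_SCALES = {"lakh": Decimal(100_000), "crore": Decimal(10_000_000)}


def normalize_amount(raw: str) -> Decimal:
    """Parse amounts such as '560,99', '1,000,000.00' or '5 lakh' into a Decimal.

    Heuristic (an assumption of this sketch): a comma that is the last
    separator and is followed by one or two digits is a decimal comma;
    every other comma is a thousands separator.
    """
    text = raw.strip().lower()
    multiplier = Decimal(1)
    match = re.fullmatch(r"(.+?)\s*(lakh|crore)s?", text)
    if match:
        text, multiplier = match.group(1), INDIAN_SCALES[match.group(2)]
    digits = re.sub(r"[\s']", "", text)
    last_comma, last_dot = digits.rfind(","), digits.rfind(".")
    if last_comma > last_dot and len(digits) - last_comma - 1 in (1, 2):
        # Decimal comma: dots (if any) are thousands separators.
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    return Decimal(digits) * multiplier


def to_usd(amount: Decimal, currency: str, usd_rates: Dict[str, Decimal]) -> Decimal:
    """Convert amount to USD using a table of USD-per-unit rates, rounded to cents."""
    return (amount * usd_rates[currency]).quantize(Decimal("0.01"))
```

For example, `normalize_amount("560,99")` yields `Decimal("560.99")`, and with a made-up rate table `{"EUR": Decimal("1.10")}` the call `to_usd(Decimal("560.99"), "EUR", ...)` returns `Decimal("617.09")`. Genuinely ambiguous inputs such as "1.000" cannot be resolved by this heuristic alone and would need context such as the law's country.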

As for the graph at the top of this page, it was created by applying PenaltyAI Search to all of the laws in the database (currently around 1.55 million laws from 79 countries) and then excluding countries with only a small number of laws available or too few penalties to make any useful statistical inferences. We’re making available the Excel file for the graph here: World Penalties – Feb 9 2017. We’ve excluded any penalties other than those within the top twenty most frequent for each country in order to eliminate outliers. If you make any use of this data please link back to this blog post and let us know by pinging us on Twitter @globeregulation.
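
The filtering described above can be sketched in a few lines of Python. The top-twenty cutoff comes from the text; the minimum number of penalties per country is not stated in the post, so the `min_penalties` default below is a placeholder, and the function name is invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def filter_penalties(records: Iterable[Tuple[str, float]],
                     top_n: int = 20,
                     min_penalties: int = 1) -> Dict[str, List[float]]:
    """Keep, per country, only amounts among that country's top_n most frequent values.

    records is an iterable of (country, amount_usd) pairs. Countries with
    fewer than min_penalties records are dropped (the threshold is an
    assumption; the post does not give one).
    """
    by_country: Dict[str, List[float]] = defaultdict(list)
    for country, amount in records:
        by_country[country].append(amount)

    kept: Dict[str, List[float]] = {}
    for country, amounts in by_country.items():
        if len(amounts) < min_penalties:
            continue
        frequent = {amount for amount, _ in Counter(amounts).most_common(top_n)}
        kept[country] = [a for a in amounts if a in frequent]
    return kept
```

Filtering on frequency rather than on magnitude means a single mis-parsed, very large amount is dropped simply because it occurs only once, without any need to pick a numeric cutoff per currency or country.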

The PenaltyAI Search system has been implemented into the search engine and soon (within the next week) the user will be able to search, explore and drill down for a given topic, across jurisdictions or filtered by country. As usual, these features will be accompanied by our innovative visualization display.

We see this system as a groundbreaking event in the field of extracting valuable information from legal text using algorithmic methods. On the theoretical level this is proof that the text of legislation can be mined for insights, and on the practical level, this is a celebratory milestone for compliance and GRC professionals, who will be able to use our system to simplify their work.

Congratulations to our technical team that enabled us to go to where no legal tech product has gone before.

More updates will be available in the next edition of our newsletter and will be rolled out to subscribers shortly thereafter.