Great talk with Demi about natural language processing, machine translation, and the challenges Global-Regulation solved with IBM Watson.
In the legal services industry, Global-Regulation is using NLP and machine translation to build the most comprehensive world law search engine. I recently spoke with CTO Sebastian Dusterwald to discuss how Global-Regulation uses Watson NLP technology to translate laws into English.
Sean: Sebastian, tell us about Global-Regulation and what your team does.
Sebastian: At Global-Regulation, it is our mission to democratize access to laws from across the globe. We handle large amounts of text data. We index, process, and translate nearly 2 million laws from nearly 100 countries, from Brazil to China to France to Italy and more, using machine translation. We help make laws searchable and accessible in English. We do all of this with a very small team, and none of it would be possible without the amazing AI-powered cloud services provided by the Watson platform.
Sean: Very cool. Do you have any recent examples to share about how the team is using Watson?
Sebastian: Recently one of our clients asked us to add categories to our law metadata, to make it easier to find the laws relevant to their business use case: monitoring specific types of laws (such as those in healthcare and cybersecurity) to maintain regulatory compliance. With so many laws in our database, discoverability is always an issue, so we thought this could be a great feature to add to our site. The problem is that very few of our sources provide any sort of categorization metadata, and those that do all use slightly different categories, so simply grabbing this data during indexing was out of the question.
We needed a system that could analyze and process our text data, and then categorize it into preset bins. IBM suggested that we try the IBM Watson Natural Language Understanding (NLU) API. It does exactly what we want out of the box: it lets us upload training data and then classify natural language text against it.
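The training data Sebastian describes is essentially pairs of text fragments and category labels. A minimal sketch of preparing such a set (the fragments, labels, and two-column CSV layout here are illustrative assumptions, not Global-Regulation's actual data):

```python
import csv
import io

# Hypothetical training examples: (text fragment, category) pairs.
# The fragments and category names are illustrative only.
examples = [
    ("a breach of patient record confidentiality shall be reported", "healthcare"),
    ("operators of critical information infrastructure must disclose incidents", "cybersecurity"),
    ("emission limits for stationary sources of air pollution", "environment"),
]

def to_training_csv(rows):
    """Serialize (text, label) pairs into the simple two-column CSV
    layout commonly accepted by text-classifier training endpoints."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for text, label in rows:
        writer.writerow([text, label])
    return buf.getvalue()

csv_data = to_training_csv(examples)
```

The spreadsheet mentioned in the interview would play the same role: one row per labeled fragment, exported and uploaded for training.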
Sean: Interesting, so what did you do next?
Sebastian: Well, we went through our database to find several laws that we thought were representative of each category, from finance to cybersecurity to environmental laws. We then went through each of those laws and picked out chunks of text that we thought were relevant to the category. This was the most complicated and labor-intensive part of the implementation process. Care had to be taken to select chunks of text that were specific enough to teach the NLP algorithm about the domain a law refers to, while being generic enough not to over-train it. This meant avoiding words such as specific names of countries or people, or dates. Including them would have risked training the algorithm on keywords that look very specific but have nothing to do with the category at hand.
The training set was simply entered into a spreadsheet and uploaded to the IBM Watson NLU API. After a short wait for it to process the data, the API was ready to accept queries. Our approach was to use the first 1024 characters of a law to classify it. This generated quite good results, in part because the first 1024 characters of a law typically include its title, which tends to contain a number of keywords the algorithm can use. At this stage we were pretty sure that IBM Watson NLP technology would be suitable for our use case, albeit with a little fine-tuning.
Sean: That’s great! Can you tell me a bit more about how you then fine-tuned Watson to meet your client’s use case?
Sebastian: The first thing we did was to take samples from across each law in our database, covering categories such as healthcare, welfare, and privacy. Instead of taking just the first 1024 characters of each document, we took five 1024-character chunks evenly spread across the document. We then averaged the confidence scores returned by the IBM Watson NLU API for each category and chose the category with the highest average as the category for that law. This significantly increased the accuracy of the classifier for our dataset.
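The sampling-and-averaging scheme Sebastian describes can be sketched roughly as follows (the chunk classifier is stubbed out as a plain function; in production it would be a remote Watson NLU call, and all names here are illustrative):

```python
def sample_chunks(text, n=5, size=1024):
    """Take n chunks of `size` characters evenly spread across the document."""
    if len(text) <= size:
        return [text]
    step = (len(text) - size) // max(n - 1, 1)
    return [text[i * step : i * step + size] for i in range(n)]

def classify_document(text, classify_chunk):
    """Average per-category confidence scores across chunks and pick the
    category with the highest mean score. `classify_chunk` stands in for
    the remote NLU call and returns a {category: confidence} dict."""
    chunks = sample_chunks(text)
    totals = {}
    for chunk in chunks:
        for category, score in classify_chunk(chunk).items():
            totals[category] = totals.get(category, 0.0) + score
    averages = {c: s / len(chunks) for c, s in totals.items()}
    return max(averages, key=averages.get)
```

Spreading the samples across the document rather than reading only the head means a long law whose title is generic can still be classified from its operative sections.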
Next we looked at laws that we found to be classified incorrectly. We compiled a list of such laws and went through them, using more text fragments from each of these to add to the training set. Once this was completed, we uploaded it to the IBM Watson NLU API and waited for it to train a new and improved classifier. This further improved the accuracy and at this point we were happy with it. So as a final step we started to run the classifier across our entire database of laws.
Sean: Glad to hear it all worked out. Do you have any final thoughts or takeaways you would like to share about working with IBM Watson NLP technology?
Sebastian: Yes, absolutely! As you can tell, automatically translating and classifying nearly 2 million documents into a number of categories was a daunting task for a small team working with limited resources. With the volume of new laws coming in globally, our company needs to keep up with the demand and constant changes to existing laws at a global scale. We can say confidently that without the help of the Watson platform we would not be able to translate and categorize the millions of documents coming into our database in such a short time. We managed to have the basic implementation running in about a week, which is phenomenal! Thanks to the Watson platform, our small company can punch well above our weight.
Originally published in Watson Blog – https://www.ibm.com/blogs/watson/2020/05/nlp-in-the-real-world-how-global-regulation-organized-its-comparative-law-search-engine-with-watson/
About a year ago I sat with my CTO in the Manhattan office of one of the world’s largest accounting firms. Their global regulatory compliance team was very impressed by what we’d done so far with Global-Regulation and wanted to know what more we could do. As usual with large firms, they wanted a system that does everything – from tracking new bills to predicting the future (step 10 instead of step 3).
The ambition to create the ultimate risk and compliance system stuck with us. It came to life when we realized, in one of our internal discussions about our global law search engine, that penalties are the kind of information an artificially intelligent system can identify with a high degree of certainty.
My story begins in the 2000s when I helped the Israeli court system work with IBM to digitize legal information. I’ve seen the slow evolution of legaltech and listened to the ambitious ideas of tech people. But I’ve also seen the reality of legal technology and wondered: how can we give machines the insight of lawyers?
Fast forward to 2017, after seemingly endless testing, experimenting, coding, consulting (thank you to Kyle Gorman from Google for the words-to-numbers converter recommendation) and hard work – we are extremely excited to present PenaltyAI Search – the first and only AI system that identifies compliance clauses in legislation on a global scale, extracts the actual penalty amounts, and serves it all to the user in US dollars.
Now risk and compliance professionals can search and identify risk levels across jurisdictions on a specific topic without even reading the law. Let’s say you are an IBM executive considering global expansion of your Watson services into new markets – with a click of a mouse you can now use the PenaltyAI Search feature of Global-Regulation to learn the risk level involved in that goal.
Combine this with our complexity feature, suggested search ideas and related laws – and a risk and compliance team can feed Governance, Risk and Compliance (GRC) platforms with all the information needed to launch a new business line in a matter of hours. Before, this would have taken months and required an army of translators and a division of analysts to determine risk and compliance.
We see this as a great achievement on several levels:
- an AI system that can really read legal text and produce useful meaning;
- enabling risk and compliance professionals to explore real and relevant data on a global scale, in English;
- allowing governments and businesses to assess and enhance their compliance efforts; and
- enabling researchers to compare and contrast risk and compliance data globally.
Thank you big accounting firm for teaching us that even seemingly unsuccessful business meetings can bring great results. Thank you Microsoft Canada for your help in connecting us with the Microsoft Translator team. Thank you LegalX (now LawMade). Thank you Ken Thompson for UNIX and regular expressions. Thank you to my wife and children for your daily inspiration.
If you’d like to know more about how the system works technically, my CTO has written a blog post on building PenaltyAI Search.
Computers can now tell us about penalties for world laws.
This is a technical explanation of how we built our “PenaltyAI Search” service that combs 1.55 million world laws from 79 countries for fines. It can answer questions like “What would I pay for violating money laundering laws in Jamaica?” or “How much would a smuggler who warehouses stolen goods in China pay if they’re caught?”.
The penalties are extracted by an offline algorithm, running on an Azure VM, that performs the following steps:
- Find laws that mention keywords associated with civil penalties (as a first pass)
- Convert all word numbers (like “one million”) into international number format (“1,000,000.00”)
- Identify the paragraphs that likely contain civil penalties based on words and numbers
- Merge several penalties into one, where they relate to the same “clause” (section) of a law
- Extract all the clauses and penalties
- Exclude certain classes of text that are almost never penalties but look like penalties (such as laws about gold coins and section references in laws that have to do with money)
- Recognize currencies in text, and combine this data with our table of national currencies, and convert penalties into USD using Yahoo! Finance rates (through the XML API call)
- Store the penalties and clauses in a MySQL database (RDS)
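A toy sketch of the first two steps above, assuming a Python implementation (the keyword list and the word-number grammar are illustrative; the production converter handles far more cases):

```python
import re

# Illustrative first-pass keywords; the real list is far longer.
PENALTY_KEYWORDS = re.compile(r"\b(fine|penalty|punishable|liable to pay)\b", re.I)

# Toy word-number vocabulary; a real converter covers the full grammar.
WORD_NUMBERS = {"one": 1, "hundred": 100, "thousand": 1_000, "million": 1_000_000}

def mentions_penalty(text):
    """First pass: does the law mention penalty-associated keywords?"""
    return bool(PENALTY_KEYWORDS.search(text))

def words_to_number(phrase):
    """Convert simple word numbers like 'one hundred thousand' to 100000.
    Multipliers (hundred, thousand, ...) scale the running value."""
    current = 0
    for word in phrase.lower().split():
        value = WORD_NUMBERS.get(word)
        if value is None:
            continue  # skip unknown words in this toy version
        if value >= 100:
            current = max(current, 1) * value
        else:
            current += value
    return current
```

The keyword pass is cheap, so it can screen millions of laws before the more expensive clause-level analysis runs.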
We then note in our search instance whether or not a law has penalties attached to it, so that the search instance can filter by laws that have penalties (as opposed to our regular search, which includes laws without explicit fines attached). This process is run as a batch job offline because our 1.55 million+ laws take several hours to process, and no one would wait that long for their search results!
When a user does a search, the search is first sent to our Elasticsearch instance, and then the penalties are looked up from the MySQL database afterwards. This allows full-text search of laws to be combined with penalties, and in a way that results in much less strain on our relational database (because penalties are looked up by IDs rather than a JOIN). Storing the penalties separately allows us to reduce the amount of data in the in-memory search instance, and decouples our services (since we have other types of search like technical standards and law analytics).
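The two-step lookup might look roughly like this, with in-memory dicts standing in for the Elasticsearch index and the MySQL penalties table (all names and data here are hypothetical):

```python
# Stand-ins for the real stores: Elasticsearch resolves the full-text
# query to law IDs; the relational table is keyed by law ID so no JOIN
# is needed at query time.
SEARCH_INDEX = {
    "money laundering": [101, 102],  # query -> matching law IDs
}
PENALTIES_BY_LAW_ID = {
    101: {"amount_usd": 250_000, "clause": "s. 12(1)"},
    # law 102 has no extracted penalty
}

def search_with_penalties(query):
    """Full-text search first, then attach penalties by law ID."""
    law_ids = SEARCH_INDEX.get(query, [])
    return [
        {"law_id": law_id, "penalty": PENALTIES_BY_LAW_ID.get(law_id)}
        for law_id in law_ids
    ]
```

Keeping the penalty data out of the search index keeps the in-memory footprint small, and a missing entry simply means the law had no extractable fine.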
The laws themselves are indexed, downloaded, converted to text, parsed, and converted to English, using our pipeline that runs on another Azure VM with RDS as the data store. We make extensive use of the Microsoft Translator API to convert foreign legislation to English (since most of the world’s laws are published in languages other than English). Our use of the service is actually listed on the “Customers” page for Microsoft Translator. We’ve written elsewhere on our blog about some of the ways we gather and process world legislation.
The graph above is the first time that penalties for non-compliance with the world’s laws have been visualized. It was made possible by the culmination of Global-Regulation Inc.’s R&D efforts over the last year to create an automated AI method for reading penalty provisions from civil laws – see the system here.
Our system (which we’re calling “PenaltyAI Search”) is now able to extract penalties from legislation (statutes and regulations) and present them in US dollars, along with the original text. This is a multi-phase process that starts with an AI-based algorithm that identifies the penalty clauses. The next step is to extract the penalty amount from each penalty clause. This step includes a complex linguistic mechanism that converts amounts in words into numbers (e.g., “one hundred thousand” to 100,000), including Indian English notation like “lakh” and “crore”. The next step is to convert different notation systems into a standardized decimal format (e.g., “560,99” to 560.99). The final step is converting all the world’s currencies into USD to enable comparison on a global scale (which is done on an ongoing basis to account for currency fluctuations).
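A toy sketch of the decimal-normalization and Indian-notation steps described above (the patterns are illustrative assumptions; real legislative text mixes thousands separators, ranges, and multiple scripts):

```python
import re

# Indian English large-number units mentioned above.
INDIAN_UNITS = {"lakh": 100_000, "crore": 10_000_000}

def normalize_decimal(amount):
    """Normalize comma-decimal notation ('560,99') to '560.99'.
    Only rewrites the unambiguous two-digit-cents case; anything else
    (e.g. '1,000', likely a thousands separator) is left alone."""
    if re.fullmatch(r"\d+,\d{2}", amount):
        return amount.replace(",", ".")
    return amount

def indian_to_number(phrase):
    """Expand 'lakh'/'crore' notation, e.g. '5 lakh' -> 500000.0."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s+(lakh|crore)s?", phrase.strip().lower())
    if not match:
        raise ValueError(f"unrecognized amount: {phrase!r}")
    return float(match.group(1)) * INDIAN_UNITS[match.group(2)]
```

Normalizing everything to a plain decimal before the currency-conversion stage means the USD step only ever sees one number format.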
As for the graph at the top of this page, it was created by applying PenaltyAI Search to all of the laws in the Global-Regulation.com database (currently around 1.55 million laws from 79 countries) and then excluding countries with only a small number of laws available or too few penalties to make any useful statistical inferences. We’re making the Excel file for the graph available here: World Penalties – Feb 9 2017. We’ve excluded any penalties outside the top twenty most frequent for each country in order to eliminate outliers. If you make any use of this data, please link back to this blog post and let us know by pinging us on Twitter @globeregulation.
The PenaltyAI Search system has been implemented into the Global-Regulation.com search engine and soon (within the next week) the user will be able to search, explore and drill down for a given topic, across jurisdictions or filtered by country. As usual, these features will be accompanied by our innovative visualization display.
We see this system as a groundbreaking event in the field of extracting valuable information from legal text using algorithmic methods. On the theoretical level this is proof that the text of legislation can be mined for insights, and on the practical level, it is a celebratory milestone for compliance and GRC professionals, who will be able to use our system to simplify their work.
Congratulations to our technical team that enabled us to go to where no legal tech product has gone before.
More updates will be available in the next edition of our newsletter and will be rolled out to subscribers shortly thereafter.
After using MS machine translation (and some Google) to translate more than 750,000 laws and regulations from 26 languages, we are featured in a new MS Translator Case Study:
What if you could discuss your search query with the search engine? Well, now you can. Our new feature suggests search ideas based on the user’s query. These search ideas are extracted from our database of the world’s laws itself.
Here’s how it works:
1. We take the text of every law in the world and extract the most frequently mentioned word pairs, on a per-law basis. This way we create a new database of word pairs.
2. When someone does a search we check the database of word pairs and take the word pairs that occur most frequently in association with the word or word pair that the user is searching for. So a search for “coffee” will return keyword suggestions for words that appear in laws that mention “coffee” most commonly.
3. We then filter the words and take the best matches and display those to the user. These are the search ideas.
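The three steps above can be sketched roughly like this (a simplified in-memory version; the names are illustrative and the production filtering is more involved):

```python
from collections import Counter

def word_pairs(text):
    """Extract adjacent word pairs (bigrams) from a law's text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def build_pair_index(laws):
    """Map each word to the pairs it co-occurs with, with frequencies,
    accumulated on a per-law basis as described above."""
    index = {}
    for law_text in laws:
        for pair, count in Counter(word_pairs(law_text)).items():
            for word in pair:
                index.setdefault(word, Counter())[pair] += count
    return index

def suggest(index, query, k=3):
    """Return the k most frequent pairs associated with the query word."""
    return [" ".join(pair) for pair, _ in index.get(query, Counter()).most_common(k)]
```

So a search for “coffee” surfaces the pairs that most often appear in coffee-mentioning laws, and clicking a suggestion simply re-runs the lookup with the new query.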
You can click on the search ideas in yellow at the top, and they will be updated according to your most recent search. For example, let’s say you started with Coffee –> then you choose ‘Coffee Agreement’
And then choose ‘system certificates’. This is endless.
This new feature enables you to interact with the search engine and follow a trail based on the database of word pairs we created from our gigantic database of the world’s laws.
We are very excited to announce the launch of the first ever legislation chatbot: GRBOT, built on the Kik mobile platform. GRBOT enables users to type a request – e.g., “search for United States drone laws” or “show me EU laws about organic farming” – and it will send the most relevant laws to the user’s mobile device, with a link to see more.
You can see the bot in the Kik Bot Shop here: https://bots.kik.com/#/grbot.
The only thing needed to connect to GRBOT is to download the Kik app from the app store and friend the GRBOT account. When you first start chatting, a message is displayed showing a few examples of how to interact with the bot.
We think we’re the first ever legislation chatbot but there have been other legal bots created, both on chat platforms and accessible through other services.
As part of our work integrating artificial intelligence, and especially machine learning, into Global-Regulation‘s system, we’ve conducted a comparison of the big four providers of ML text analytics: Microsoft, Google, IBM and Amazon. This post is a follow-up to a previous post about an AI-assisted compliance system.
Microsoft – MS ML Studio offers some text analytics options. Although not particularly helpful for the purpose of identifying segments within legislation, MS ML Studio is the friendliest system among the ML tools in this comparison. It is so friendly that even a user with minimal background in programming and ML can use it (with some patience and a strong will 🙂). In MS ML Studio there is a link to new text analytics models, but unfortunately it is broken.
Google – Tensorflow offers some text analytics features. This is not a friendly tool and the text analytics options it does offer are vague. However, the vector representation of words may be useful when analyzing legal text and training a model to identify segments within legislation. This is a different approach than the structured text analytics offered by MS and IBM – see below.
In the context of a previous post about an AI-assisted compliance system, Tensorflow vector representation may be the solution for the first part of the challenge, i.e., manually identifying compliance clauses and training the model with these clauses. Nonetheless, new challenges arise in the implementation stage, since the system will be able to identify laws that include compliance clauses but not the specific clauses within the law.
Overcoming this challenge will require an additional stage in which the laws may be broken into chunks of text before running the model to identify the clauses. As laws are not always (and usually not) machine friendly, this process creates its own challenges.
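One simple way to do that chunking, sketched under the assumption of plain-text laws with blank-line paragraph breaks (illustrative only; real legal documents are rarely this tidy):

```python
import re

def split_into_chunks(law_text, max_chars=1000):
    """Break a law into paragraph-sized chunks so each can be scored
    separately by a clause-identification model. Splits on blank lines,
    then merges short paragraphs together up to max_chars."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", law_text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Running the clause model over each chunk, rather than the whole law, is what would let it point at the specific provision instead of merely flagging the document.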
IBM – Now offered through AlchemyLanguage, IBM has one text analytics feature, analyzing entities and relevance. Before migrating its text analytics features in July 2016, IBM offered a few text analytics options that are no longer available. This system analyzes factors such as ‘Fear’, ‘Anger’ and ‘Joy’ – not exactly what one would need to analyze legal text. In addition, IBM’s customer service does not really work: attempts to get access to their system failed even after persistent emails.
Finally, it should be mentioned that Amazon’s ML platform does not provide any text analytics options.
One would expect that the first step in analyzing legal text would be to use ML text analytics options. This seems like the shortest path toward identifying segments within legislation and the best way to ride the advancements in this field. However, upon testing these ML text analytics capabilities, it becomes clear that this is not the answer: in their present state of development, ML text analytics tools are not capable of doing much serious work beyond classifying text as ‘Joy’ or ‘Anger’.
The more ‘simplified’ approach taken by Tensorflow vector representation is much more relevant for the purpose of analyzing legal text and identifying segments in big data, even though it is far from the ‘Watson Dream’, where you ‘work with Watson’ and get your text analyzed with the click of a mouse.