Using Artificial Intelligence to Study Corruption (I): Academic and Non-Governmental Organization Research
Written by Jonathan J. Rusch
Corruption has long presented a challenge for scholars in multiple fields because the term itself is inherently protean. Depending on the purpose for which it is defined, it can be considered normative or descriptive, and cover an exceptionally broad spectrum of behaviors ranging from any abuse of entrusted power for private gain to specific crimes such as bribery of domestic and foreign officials. In general, it would seem, as Professor Dan Hough of the University of Sussex has acknowledged, that “finding agreement on what is and isn’t corrupt will be virtually impossible.”
Moreover, even currently available and well-regarded measures of corruption – such as Transparency International’s Corruption Perceptions Index and Global Corruption Barometer and the World Justice Project’s Rule of Law Index – take considerable time to conduct and analyze. Whether oriented toward people’s perceptions of, or personal experiences with, corruption, they typically require extensive human involvement in gathering data, often through in-country interviews of many individuals, and in manually analyzing those data to reach findings and conclusions.
Corruption studies such as these have genuine academic and practical value. It is important, however, for scholars to recognize that studying some aspects of corruption will be possible only by using information technology (IT) to amass and analyze very large volumes of data and to reduce the time that researchers need to review and draw meaningful conclusions from those data.
One especially promising approach is the use of artificial intelligence (AI) in analyzing corruption-related data. This post will summarize several noteworthy studies that academic and non-governmental organizations have successfully conducted using AI.
Artificial Intelligence: A Very Short Course
At the outset, it is necessary to confine the scope of the discussion, as the term “artificial intelligence” has its own protean tendencies. For some, the broad term presages a dystopian future in which artificial beings possess and use intelligence superior to that of humans. No less a figure than the late Professor Stephen Hawking warned that “[t]he development of full artificial intelligence could spell the end of the human race.” But this type of AI, also known as artificial general intelligence (AGI), lies far in the future (assuming it is even possible).
On the other hand, artificial narrow intelligence (ANI), which involves the use of IT programmed to perform a single task by drawing information from one or more large datasets, is now ubiquitous in many fields, from medical diagnosis to the development of self-driving cars. One subset of ANI is machine learning (ML), which has been defined as “provid[ing] systems the ability to automatically learn and improve from experience without being explicitly programmed. In ML, there are different algorithms (e.g. neural networks) that help to solve problems.” A subset of ML is deep learning, “which uses the neural networks to analyze different factors with a structure that is similar to the human neural system.”
One of the first widely reported academic studies of corruption using ML is a 2018 article by researchers Félix J. López-Iturriaga and Iván Pastor Sanz. This study sought to develop a neural network prediction model of corruption based on economic factors. It used a database that brought together actual cases of corruption in Spanish provinces that were reported by the media or went to court between 2000 and 2012.
Applying the model to the corruption-case database, the study found “that the taxation of real estate, economic growth, the increase in real estate prices, the growing number of deposit institutions and non-financial firms, and the same political party remaining in power for long periods seem to induce public corruption.” It also stated that the model provides “different profiles of corruption risk depending on the economic conditions of a region conditional on the timing of the prediction,” as well as “different time frameworks to predict corruption up to 3 years before cases are detected.”
Subsequently, a 2019 article by two Brazilian researchers, Tiago Colliri and Liang Zhao, used a network-based technique to explore the corruption-predictive value of public bills-voting data from the Brazilian House of Representatives. The study first amassed data on the votes of 2,455 Brazilian congressmen in a total of 3,407 bills-voting sessions in the House of Representatives from 1991 to 2019. After extracting and cleaning the data from those sessions, the researchers had a total of 2,455 representatives and 1,656,547 votes.
They then used information, confirmed from Brazilian judiciary official sources, indicating which of those congressmen was currently convicted or had been arrested for corruption or other financial crimes such as money laundering, embezzlement, or misappropriation of public funds. After identifying a total of 33 such congressmen, the researchers used two methods for detecting future convictions of congressmen based on the voting data.
The more predictive method, according to the researchers, was a link-prediction model. The link-prediction approach seeks to predict whether there will be links between two nodes in a network, “based on the attribute information and the observed existing link information.” The researchers’ link-prediction model reportedly achieved a high degree of accuracy (up to 90 percent) in predicting future convictions of congressmen. The study’s other method, a weight matrix that used only one node in the researchers’ network to predict congressmen’s future convictions, achieved only 24 percent predictive accuracy. The researchers concluded that their work “contributes to the development of big data platform[s] to monitor politicians’ behavior.”
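As a rough illustration of the general link-prediction idea, the following Python sketch ranks pairs of nodes that are not yet linked by the Jaccard similarity of their neighbor sets. This is a common textbook heuristic, not the model that Colliri and Zhao used, and the toy graph, node labels, and function names are all hypothetical.

```python
from itertools import combinations


def jaccard_score(graph, u, v):
    """Jaccard similarity of the neighbor sets of nodes u and v."""
    neighbors_u, neighbors_v = graph[u], graph[v]
    union = neighbors_u | neighbors_v
    return len(neighbors_u & neighbors_v) / len(union) if union else 0.0


def predict_links(graph):
    """Rank all currently unlinked node pairs by Jaccard score, highest first."""
    candidates = [
        (u, v, jaccard_score(graph, u, v))
        for u, v in combinations(sorted(graph), 2)
        if v not in graph[u]  # only score pairs with no observed link
    ]
    return sorted(candidates, key=lambda item: item[2], reverse=True)


# Hypothetical toy network (assumption): an edge means that two
# legislators frequently voted the same way.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
    "E": {"F"},
    "F": {"E"},
}

ranked = predict_links(toy_graph)
for u, v, score in ranked[:3]:
    print(f"{u}-{v}: {score:.2f}")
```

In this toy network, nodes A and D share all of their neighbors, so the heuristic ranks a link between them as the most likely. A real study would work with far richer data and would validate its predictions against held-out links.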
Non-Governmental Organization Research
ANI has also proven valuable to anti-corruption advocacy organizations and investigative journalists. In 2018, the anti-corruption organization Global Witness reported on its use of ML to obtain information on the location of unregulated mines, which would be pertinent to investigating corruption in the mining sector. In partnership with two other entities, Global Witness sought to explore the feasibility of using satellite imagery and AI for automatic detection of certain types of mining activity across a large area.
Global Witness’s approach was to build a computer program with a model that used ML to review a large volume of satellite images, focusing on small-scale mining in eastern Democratic Republic of Congo. Global Witness obtained those data from Google Earth Engine, which makes historical data from the Landsat satellite publicly available.
To train its model, Global Witness also built a training dataset of information about artisanal mines in the DRC, using data that it obtained from an independent research institute, the International Peace Information Service (IPIS). Global Witness data scientists built an application that used the IPIS data to create images of “shapes” (i.e., pixels marked in an image), which they fed into the ML algorithm to train the model. They then evaluated the model “by seeing how many of the known (but ‘unseen’) mine sites from the IPIS dataset it was capable of accurately identifying whilst only being trained on a partial portion of the data.”
Global Witness found that its best model could identify 79.7 percent of the known DRC mines. It recognized, however, that the model had a high false positive rate (i.e., “it often identified pixels as mines where no mining activity was present”). The model was able to identify correctly only 48.4 percent of all the pixels that were not part of known mine sites. Global Witness therefore concluded that its model “needs some refinement,” but that “this is a fruitful avenue for further research.”
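The two reported figures resemble what ML practitioners usually call recall (the share of true positives that are found) and specificity (the share of true negatives that are correctly rejected). The Python sketch below shows how such rates are computed and why a specificity of 48.4 percent implies a false positive rate of 51.6 percent. The counts are hypothetical, chosen only to reproduce the reported percentages; they are not Global Witness’s actual data, and the function names are illustrative.

```python
def recall(true_pos, false_neg):
    """Share of actual positives that the model correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of actual negatives that the model correctly leaves unflagged."""
    return true_neg / (true_neg + false_pos)


def false_positive_rate(true_neg, false_pos):
    """Share of actual negatives wrongly flagged; equals 1 - specificity."""
    return false_pos / (true_neg + false_pos)


# Hypothetical counts (assumption), scaled to match the reported rates.
mines_found, mines_missed = 797, 203          # known mine sites
non_mine_correct, non_mine_flagged = 484, 516  # non-mine pixels

print(f"recall:              {recall(mines_found, mines_missed):.1%}")
print(f"specificity:         {specificity(non_mine_correct, non_mine_flagged):.1%}")
print(f"false positive rate: {false_positive_rate(non_mine_correct, non_mine_flagged):.1%}")
```

Read this way, the model found most known mines but wrongly flagged roughly half of the non-mine pixels, which is consistent with Global Witness’s own conclusion that the model needed refinement.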
The International Consortium of Investigative Journalists (ICIJ) has also made productive use of ANI, in its investigation of the so-called “Luanda Leaks.” That investigation delved into how Isabel dos Santos, the daughter of Angola’s former president José Eduardo dos Santos, reportedly took hundreds of millions of dollars in public funds out of Angola. The investigation required a tremendous expenditure of human resources; the ICIJ noted that more than 120 journalists in 20 countries needed more than eight months to review the files.
To assist in its analysis of the investigation’s more than 700,000 leaked documents (equivalent to 356 gigabytes), the ICIJ partnered with an AI team from the business news website Quartz to build an AI system. That system was created, as Quartz put it,
to “read” all the documents and help journalists from Quartz, ICIJ and other partner organizations find the kinds of documents they expected in the cache of leaks—regardless of file format, spelling, transcription errors, phrasing, or even the language of the document.
A key element of the system was a piece of software that “transforms any sentence into a list of 512 numbers, called a vector.” Although the numbers in the vector are not really meaningful on their own, when those numbers are taken together, “sentences that mean about the same thing have vectors that are close to each other.” Furthermore, the vectors are similar for sentences with similar meaning, even if the sentences are in different languages.
That feature was reportedly important for the project, as the documents under review included documents in both English and Portuguese (the latter spoken in Angola). As a result, reporters were able to detail key elements of how dos Santos was able to amass and control her wealth, including the roles of Western accountants and consultants and her acquisition of “massive stakes in Portuguese banks” that facilitated her movement of stolen funds around the world.
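To make the idea of vectors being “close to each other” concrete, the Python sketch below compares sentence vectors by cosine similarity, one common closeness measure. It is not the ICIJ/Quartz system: no embedding model is called, the 4-number vectors are made up by hand as stand-ins for the 512-number vectors described above, and the sentences and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors):
    """Return the other sentence whose vector is closest to the query's."""
    others = (sentence for sentence in vectors if sentence != query)
    return max(
        others,
        key=lambda sentence: cosine_similarity(vectors[query], vectors[sentence]),
    )


# Hand-made toy vectors (assumption): a real system would obtain these
# from a multilingual sentence-embedding model.
toy_vectors = {
    "The company transferred the funds.": [0.9, 0.1, 0.3, 0.0],
    "A empresa transferiu os fundos.": [0.8, 0.2, 0.4, 0.1],
    "The weather was sunny today.": [0.0, 0.9, 0.1, 0.7],
}

query = "The company transferred the funds."
print(most_similar(query, toy_vectors))
```

Because the English and Portuguese sentences were given similar vectors, the search returns the Portuguese sentence for the English query, which mirrors how reporters could locate related documents regardless of language.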
* * *
These studies collectively show that ANI, especially ML, can be a genuinely useful and efficient means of conducting meaningful corruption-related research on large volumes of public data. Researchers in other countries should review these studies and work to identify other large datasets, including open-source data, that could be productively used to explore other dimensions of corruption.
Note: This post is the first of several C-BERC posts that will explore the topic of ANI’s utility in anti-corruption research. Future posts will touch on the uses of ANI in corporate anti-corruption compliance programs, and the ethical and legal complexities in using ANI to study corruption.
* * *
Jonathan J. Rusch is Adjunct Professor at Georgetown University Law Center and American University Washington College of Law, and a member of the C-BERC Advisory Council.