India's Tryst with AI: The Data Conundrum
India and Artificial Intelligence
Artificial Intelligence (AI) is indisputably the next big thing. For a change, even the Indian government has taken notice of it well in advance. The government is serious about AI and has allocated Rs. 3,660 crores for the implementation of Artificial Intelligence and other emerging technologies under the aegis of the National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS). Developing AI requires four things: a skilled workforce, data, computational power, and industry know-how. We already have a large enough skilled workforce, with both private and public sector institutions working to grow this number. Cloud computing has taken care of computational power and has truly democratised access on this front. Industry know-how is also, to a large extent, in the public domain, with almost all major cutting-edge work in AI, such as Keras, openly available. Even industrial support frameworks have been made open source.
The missing piece of the puzzle is training data. AI needs a tremendous amount of data to produce useful applications. Large organizations such as Google, Facebook and others that sit on a constant, endless stream of user data have a clear advantage on this front. The accuracy of an AI model depends heavily on the depth and breadth of the training data available. The more data is fed to the model, the better its chances of accurately performing tasks and predicting outcomes. Thus, making machine-readable data available becomes the most important challenge that lies ahead of us in advancing artificial intelligence. However, this data cannot be used as it is. To bring about an AI revolution, the data provided must be wholesome and accurate, both in the form in which it is provided and in the information it contains for achieving the desired outcome. There can be no scope for ambiguity, as inaccuracies in the data will produce undesirable results.
In the Indian scenario, the government already owns a large amount of data collected through its various programmes and surveys. However, most of this data is not in the public domain. The government’s data is crucial for the development of AI as it is large enough to train a complicated and reliable AI model and to reduce the scope for bias. The limited data that is available in the public domain is of no use for AI development as it is not machine readable, i.e., it is not in a form that machines can read easily, such as the CSV, ODS, XLS and XML file formats.
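To make "machine readable" concrete, here is a minimal sketch of what a CSV file allows a program to do. The sample data, the column names and the `load_rows` helper are all hypothetical and invented for illustration; they are not taken from any government dataset.

```python
import csv
import io

# Hypothetical sample: regions, column names and values are invented.
SAMPLE_CSV = """region,year,records
State A,2016,120
State B,2016,95
"""

def load_rows(csv_text):
    """Parse CSV text into a list of dicts, converting numeric columns to int."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["year"] = int(row["year"])
        row["records"] = int(row["records"])
        rows.append(row)
    return rows

rows = load_rows(SAMPLE_CSV)

# Because the structure is explicit, aggregation needs no manual transcription.
total_records = sum(row["records"] for row in rows)
print(total_records)
```

The same figures published as a scanned image or a free-text PDF would first have to be transcribed by hand or extracted with error-prone tools before any such processing could begin.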
Legal Framework Governing Governmental Data Disclosure in India
Data in India can be obtained through two legal mechanisms. The first is the landmark Right to Information Act, 2005 (RTI), which makes the government accountable and provides citizens with access to governmental information. However, the RTI suffers from the same limitation of not providing machine-readable data and therefore has limited utility. The other mechanism is the Open Government Data (OGD) platform, an initiative of the Indian government along with the United States (US) government under the National Data Sharing and Accessibility Policy, 2012 (NDSAP). It aims to share the non-sensitive data held by various government agencies and departments with the public in a human- and machine-readable format to ensure accountability and transparency. The NDSAP is in consonance with the RTI, which calls for cataloguing and indexing all the data available with various public authorities and making it available to the public at large through various modes to reduce dependency on the RTI.
The OGD platform is a step in the right direction towards providing open access to data in a machine-readable format. The policy calls for periodic sharing and updating of data by Ministries and other government departments. The drawback is that it lacks an enforcement mechanism to make government agencies comply with the data-sharing requirement. The data being made available on the OGD portal is not up to date, which affects both the growth of technologies such as AI and the policy of transparency. The OGD platform is the result of a collaboration between the Indian and US governments for transparent governance through open access to information. The distinction between the US and India concerning this programme is that while the US has enacted a comprehensive “Open Government Data Act 2019” (OGD Act), India has put in place a mere policy which does not provide any statutory right. This has led to the problem of minimal or no implementation.
Analysis of Implementation of Existing Legal Framework
We have analysed a few datasets available on the Indian OGD platform. The first dataset that we have analysed is that of the Ministry of Law and Justice. This data is available in four catalogues which are All India Advocate List (Advocate list) with twenty-one datasets, Employee Statistics of Department of Legal Affairs with three datasets (Employee statistics), References received from various Ministries/Departments seeking legal advice from Department of Legal Affairs (References) with one dataset, and State-wise List of Notaries Appointed by the Central Government with twenty-four datasets (Notaries list).
These datasets have not been updated since 2017 and contain limited data. In the Advocate list, only one dataset is from the year 2012 and the rest are from 2016. Moreover, they do not cover all the states and union territories. Similarly, the Employee statistics has only one dataset on the representation of female employees in the Department of Legal Affairs, and that too only from January 2013 to 2014. The other two datasets share the same title, i.e., the total number of government servants and the number of scheduled caste, scheduled tribe, other backward classes, ex-servicemen and physically disabled employees in the Department of Legal Affairs, but cover different years: one runs from January 2008 to 2015 and the other from January 2010 to 2014. The Notaries lists are from various states and union territories and date from 2016, but do not cover all of them. The References dataset covers only 1st April 2015 to 10th March 2016.
The second dataset is from the Ministry of Corporate Affairs (MCA). The data is available under twenty-seven catalogues and has not been updated since 2018. Of these, seventeen have only one dataset, three have two datasets, six have three datasets, and one has one hundred and seven datasets. The MCA website, on the other hand, has up-to-date data but does not provide macro datasets. Further, a repository of data that is not on the platform has been made available behind a paywall on the official MCA portal. By contrast, Companies House, the official United Kingdom regulator, provides the same data in bulk and in a machine-readable format free of cost.
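The catalogue breakdown given above can be checked with simple arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative, and the dataset total is derived here rather than quoted from the portal.

```python
# Breakdown stated in the text: {datasets per catalogue: number of catalogues}
catalogue_breakdown = {1: 17, 2: 3, 3: 6, 107: 1}

# The catalogue counts should add up to the twenty-seven catalogues mentioned.
total_catalogues = sum(catalogue_breakdown.values())

# Derived figure: total datasets across all MCA catalogues.
total_datasets = sum(size * count for size, count in catalogue_breakdown.items())

# Share of all datasets that sit in the single largest catalogue.
largest_share = 107 / total_datasets

print(total_catalogues, total_datasets, f"{largest_share:.0%}")
```

On these figures, the counts are consistent with twenty-seven catalogues, and a single catalogue accounts for 107 of the 148 datasets, roughly 72 per cent, which shows how thin the remaining catalogues are.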
The data provided on the OGD portal is limited. Moreover, there is no historical data available. Why would people refer to such a platform if they could find data on other government portals? The non-availability of data not only defeats the purpose behind the initiative but also impedes the AI revolution that India wants to start and has already invested in. There are eight principles which govern open data, and conformity to these makes the data truly open. Conformity to these principles can be considered an acceptable benchmark for the holistic development of AI in India.
The question that arises next is whether India’s OGD platform adheres to these principles. The following is an analysis of these principles and of whether India is compliant:
1. Complete: Whether all of the data is publicly available.
Only limited data is available, and even that cannot be downloaded in every format listed on the portal.
2. Primary: Whether the collected data has been modified.
In many cases, the analysis of data is provided, but primary information is unavailable.
3. Timely: Whether the data is made available in a timely manner.
The data updating process is very slow.
4. Accessible: Whether data is available to all users.
5. Machine Processable: Whether the data can be read by machines.
6. Non-Discriminatory: Whether there is no requirement of registration.
Optional registration for API access is available.
7. Non-Proprietary: Whether a particular entity has control over the data.
8. License-Free: Whether the data is subject to intellectual property rights.
India’s OGD platform is compliant with many of the open data principles. Still, compliance is irrelevant until crucial issues such as the unavailability of data and the delay in updating it are resolved. Until the platform overcomes these two problems, it will not be able to achieve its goals of transparent governance and progress in the field of AI. India needs to move from a policy-based approach to a legislation-based one, as the US has done by enacting the OGD Act. Despite the appointment of a Chief Data Officer to every department and ministry as per the guidelines, the NDSAP has failed to achieve its objective. The need of the hour is to focus on implementation.
J. E. Olson, Data Quality: The Accuracy Dimension, 29 (1st ed., 2003).
Preamble, National Data Sharing and Accessibility Policy (NDSAP), 2012, Ministry of Science and Technology, Government of India.
Ibid. (Objective, NDSAP).
S. 4(1)(a), Right to Information Act, 2005.
S. 4(2), Right to Information Act, 2005.
Authored by Mohit Yadav, Co-founder of Altinfo & Vishal Gahlayan, Advocate Practicing in the Courts of Haryana and Delhi. This blog is part of the RSRR Blog Series on Artificial Intelligence, in collaboration with Mishi Choudhary & Associates.