OUR LATEST INITIATIVE
The first cross-industry metadata standards to bring transparency to the origin of datasets used for both traditional data and AI applications.
Experts from 19 D&TA enterprises have co-created these standards to help organizations determine if data is suitable and trusted for use. The proposed standards are currently being tested.
DEVELOPED BY PRACTITIONERS FROM 19 ORGANIZATIONS
Quick links
- Review the standards and give your input
- Learn more about the standards: what, why, and how
- FAQs
- Read the New York Times coverage
- Read the press release
- Read the blog post
- Join our Community of Practice
1. Review the standards
To learn more about the proposed standards and gain necessary context, please watch this video or download the information pack.
2. Give your input
Please share your input on the metadata and values by filling out the survey.
WHAT ARE THE STANDARDS
The eight proposed Data Provenance Standards surface metadata on source, legal rights, privacy & protection, generation date, data type, generation method, intended use and restrictions, and lineage. Each metadata field has associated values. To see example values, download the Standards information pack. In addition, the standards call for using a unique provenance metadata ID with each dataset.
Standard / Metadata values
- Provenance Metadata Unique ID: Blockchain ledger ID; Other
- Source: Username; Organization name; Software/system name; Sensor/IoT device name; Machine name; File name; IP Address; Other
- Legal rights: Attribution rights; Copyrights or trademarks; Required data storage or processing geolocation; Applicable laws and regulations
- Privacy & protection: Protected data classification; Applied privacy enhancing techniques
- Generation date: yyyy-MM-dd'T'HH:mm:ss
- Data type: Structured; Unstructured
- Generation method: Web scraping/Crawling; Feeds; Syndication; Data mining; Machine generated/MLOps; Sensor and IoT output; Social media; User generated content; Primary user source
- Intended use and restrictions: Intended use; Restricted audience
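As a rough illustration of how the fields above might travel with a dataset, the sketch below records them as a simple Python structure and checks completeness. The field names and values are our own hypothetical rendering, not the official D&TA schema.

```python
# Required provenance fields, following the eight proposed standards plus the
# unique ID. Names are illustrative only, not the official schema.
REQUIRED_FIELDS = [
    "provenance_metadata_unique_id",
    "source",
    "legal_rights",
    "privacy_and_protection",
    "generation_date",
    "data_type",
    "generation_method",
    "intended_use_and_restrictions",
    "lineage",
]

# A hypothetical provenance record for one dataset.
record = {
    "provenance_metadata_unique_id": "prov-2023-000123",
    "source": {"organization_name": "Example Data Co."},
    "legal_rights": {"applicable_laws_and_regulations": ["GDPR"]},
    "privacy_and_protection": {"protected_data_classification": "none"},
    "generation_date": "2023-11-30T12:00:00",
    "data_type": "Structured",
    "generation_method": "Sensor and IoT output",
    "intended_use_and_restrictions": {"intended_use": "analytics"},
    "lineage": {"primary_source": "internal telemetry"},
}

def missing_fields(rec):
    """Return any required provenance fields absent from the record."""
    return [field for field in REQUIRED_FIELDS if field not in rec]

print(missing_fields(record))  # an empty list means the record is complete
```

A consuming system could run such a check before admitting a dataset into a pipeline, flagging records whose provenance metadata is incomplete.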
This essential information about the origin of and rights associated with data allows enterprises to make informed choices about the data they source and use. The result can be improvements in operational efficiency, regulatory compliance, collaboration and value generation.
“These practical data standards, co-created by senior practitioners across industry, are designed to help ensure AI workflows are not only compliant with ever-changing government regulations and free of bias, but also generate increased business value.”
— ROB THOMAS, Senior Vice President, IBM Software and Chief Commercial Officer
WHY WE CREATED THE STANDARDS
Data transparency is critical.
Trust in the insights and decisions coming from both traditional data and AI applications depends on understanding the origin, lineage, and rights associated with the data that feeds them. Lack of transparency has real costs, including unnecessary risks and foregone opportunities. And yet, many organizations today cannot answer basic data questions without considerable difficulty and investment.
To realize the value of data and AI requires a reliable cross-industry baseline of data transparency. Our Data Provenance Standards propose a solution.
Almost 40% of data scientists' working time goes to basic data preparation and cleansing tasks, according to a 2022 Anaconda report.
61% of CEOs cite lack of clarity on data lineage and provenance as a top barrier to adoption of generative AI, according to the annual IBM Institute for Business Value CEO study.
“Companies like ours feel a deep responsibility to ensure new value creation, as well as trust and transparency of data with all of our customers and stakeholders. Data provenance is critical to those efforts.”
— KEN FINNERTY, President, IT & Data Analytics at UPS
HOW WE CREATED THE STANDARDS
We took a "for industry, by industry" approach to creating the standards:
- Started by identifying provenance pain-points from 25 use cases across our member industries
- Iterated the standards through over 100 working sessions with Alliance Members and broader industry representatives
- Consolidated from 53 to 8 standards, focusing on business value and feasibility
- Co-created by CDOs, CIOs, and leads on data strategy, enterprise data and AI governance, compliance and legal from organizations across healthcare, automotive, IT, media, banking and finance, retail, education and other industries
It's not just about managing data; it's about fostering trust and reliability in AI.
Join our Community of Practice as we shape robust, transparent, and adoptable Data Provenance Standards.
FAQs
Currently, member companies are testing the proposed data provenance standards in various use cases, encompassing areas such as regulatory compliance, supply chain management, procurement processes, and virtual patient healthcare systems. These tests are not only focused on assessing business value and understanding the practical implementation requirements, but also include a crucial component of governance testing. This governance testing is designed to ensure that the standards adhere to relevant policies and legal frameworks, thereby making data management practices transparent, accountable, and aligned with both organizational and regulatory expectations.
We are actively calling for input. (1) You can learn more by watching the video above and/or downloading our info pack. (2) You can take our survey. The survey will be open until January 20, 2024. (3) You can join our open Community of Practice to offer new use cases, share learnings, and refine the work. Our goal is to create a baseline set of data provenance standards that is both valuable and implementable for a wide range of cross-industry traditional data and AI use cases.
No. The data provenance standards are designed for both traditional data and AI use cases. They are designed to bring transparency to the origin of datasets, which is helpful for use cases across the enterprise. Note that the standards currently only apply at the dataset level, not the element, table, or row level.
The data provenance standards address most generative AI use cases, but may not be applicable to all use cases. For example, data provenance standards can be challenging to apply effectively to large language models (LLMs) that are trained on vast amounts of public data sourced from diverse locations on the Internet.
Our goal is to refine the standards—based on both testing and practitioner input—and release a version 1 in early Q2 2024. Timing is subject to change. The standards will be available in a variety of formats, ranging from a PDF checklist to JSON and XML, which allows for automation. The standards and supporting tools will be available without charge.
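Because the standards specify generation dates in the yyyy-MM-dd'T'HH:mm:ss pattern and will ship in machine-readable formats such as JSON, a consuming system could validate that field automatically. The sketch below assumes a flat JSON layout with a `generation_date` key; the published schema may differ.

```python
import json
from datetime import datetime

def parse_generation_date(payload: str) -> datetime:
    """Parse the generation_date field from a JSON provenance record,
    enforcing the yyyy-MM-dd'T'HH:mm:ss pattern from the standards."""
    metadata = json.loads(payload)
    # strptime raises ValueError if the value deviates from the pattern.
    return datetime.strptime(metadata["generation_date"], "%Y-%m-%dT%H:%M:%S")

# Hypothetical record; the key name is assumed, not taken from the schema.
raw = '{"generation_date": "2024-01-15T09:30:00"}'
print(parse_generation_date(raw))  # 2024-01-15 09:30:00
```

Rejecting malformed dates at ingestion keeps downstream lineage queries consistent, since every dataset's generation date parses the same way.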
While we can provide guidance on how enterprises and data providers can effectively implement the standards to ensure consistency, interoperability, and data discoverability, the standards and metadata have been designed with ample flexibility to address diverse needs. A values library will also be available for organizations to use and reference, further enhancing consistency across industries. The library will be flexible to accommodate unique values not previously identified through existing use cases.
We have explored and are considering recommendations around immutable records, such as blockchain and data watermarking. Such mechanisms make it nearly impossible to delete or alter data without the consensus of network participants. This immutability can help protect the integrity of metadata associated with the dataset. However, because blockchain, watermarking, and similar solutions can be technically challenging to adopt and cost-prohibitive, the D&TA is not prescribing a method of immutability at this time.
Yes, the data provenance standards can be integrated with systems designed to support data governance, including Collibra.
The data provenance standards are designed to be cross-industry. They were created by practitioners from healthcare, automotive, IT, media, banking and finance, retail, education and other industries. We encourage practitioners to share new use cases and offer new metadata fields for consideration, and we would welcome any such practitioners to join our Community of Practice.
Yes. Organizations must navigate a complex landscape of laws and regulations to ensure their data practices are compliant. The Legal Rights standard identifies the legal or regulatory framework applicable to the current dataset, along with the required data attributions, associated copyright or trademark, and localization and processing requirements. The Privacy and Protection Standard further supports compliance by identifying if sensitive data is present and if any privacy-enhancing techniques have been applied, thus identifying where further steps must be taken to ensure data security and confidentiality.
We expect the standards to evolve over time, based on the experience of those implementing them. Currently, they are in a testing phase. We will refine the standards, based on both testing and community input, and release a fully accessible version 1 in early Q2 2024. Timing is subject to change.
The standards currently apply at the dataset level. Once version 1 is in use, we expect the standards to be refined over time and become applicable at the data cell or row level.
After the testing phase, we plan to share examples or case studies of our learnings and plans for implementation.
© 2023, The Center for Global Alliance.
All rights reserved.