OUR LATEST INITIATIVE
The first cross-industry metadata standards to bring transparency to the origin of datasets used for both traditional data and AI applications.
Experts from 19 D&TA enterprises have co-created these standards to help organizations determine if data is suitable and trusted for use. The proposed standards are currently being tested.
DEVELOPED BY PRACTITIONERS FROM 19 ORGANIZATIONS
Quick links
- Review the standards and give your input
- Learn more about the standards: what, why, and how
- FAQs
- Read the New York Times coverage
- Read the press release
- Read the blog post
- Join our Community of Practice
1. Review the standards
To learn more about the proposed standards and gain necessary context, please watch this video or download the information pack.
2. Give your input
Please share your input on the metadata and values by filling out the survey.
WHAT ARE THE STANDARDS
The eight proposed Data Provenance Standards surface metadata on source, legal rights, privacy & protection, generation date, data type, generation method, intended use and restrictions, and lineage. Each metadata field has associated values. To see example values, download the Standards information pack. In addition, the standards call for using a unique provenance metadata ID with each dataset.
Standard / Metadata values
- Provenance Metadata Unique ID: Blockchain ledger ID; Other
- Source: Username; Organization name; Software/system name; Sensor/IoT device name; Machine name; File name; IP Address; Other
- Legal rights: Attribution rights; Copyrights or trademarks; Required data storage or processing geolocation; Applicable laws and regulations
- Privacy & protection: Protected data classification; Applied privacy enhancing techniques
- Generation date: yyyy-MM-dd'T'HH:mm:ss
- Data type: Structured; Unstructured
- Generation method: Web scraping/Crawling; Feeds; Syndication; Data mining; Machine generated/MLOps; Sensor and IoT output; Social media; User generated content; Primary user source
- Intended use and restrictions: Intended use; Restricted audience
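As a rough illustration of how the fields above might travel with a dataset, the sketch below records them as a simple Python structure and checks completeness. The field names and values are our own hypothetical rendering, not the official D&TA schema.

```python
# Required provenance fields, following the eight proposed standards plus the
# unique ID. Names are illustrative only, not the official schema.
REQUIRED_FIELDS = [
    "provenance_metadata_unique_id",
    "source",
    "legal_rights",
    "privacy_and_protection",
    "generation_date",
    "data_type",
    "generation_method",
    "intended_use_and_restrictions",
    "lineage",
]

# A hypothetical provenance record for one dataset.
record = {
    "provenance_metadata_unique_id": "prov-2023-000123",
    "source": {"organization_name": "Example Data Co."},
    "legal_rights": {"applicable_laws_and_regulations": ["GDPR"]},
    "privacy_and_protection": {"protected_data_classification": "none"},
    "generation_date": "2023-11-30T12:00:00",
    "data_type": "Structured",
    "generation_method": "Sensor and IoT output",
    "intended_use_and_restrictions": {"intended_use": "analytics"},
    "lineage": {"primary_source": "internal telemetry"},
}

def missing_fields(rec):
    """Return any required provenance fields absent from the record."""
    return [field for field in REQUIRED_FIELDS if field not in rec]

print(missing_fields(record))  # an empty list means the record is complete
```

A consuming system could run such a check before admitting a dataset into a pipeline, flagging records whose provenance metadata is incomplete.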
This essential information about the origin of and rights associated with data allows enterprises to make informed choices about the data they source and use. The result can be improvements in operational efficiency, regulatory compliance, collaboration and value generation.
“These practical data standards, co-created by senior practitioners across industry, are designed to help ensure AI workflows are not only compliant with ever-changing government regulations and free of bias, but also generate increased business value.”
— ROB THOMAS, Senior Vice President, IBM Software and Chief Commercial Officer
WHY WE CREATED THE STANDARDS
Data transparency is critical.
Trust in the insights and decisions coming from both traditional data and AI applications depends on understanding the origin, lineage, and rights associated with the data that feeds them. Lack of transparency has real costs, including unnecessary risks and foregone opportunities. And yet, many organizations today cannot answer basic data questions without considerable difficulty and investment.
To realize the value of data and AI requires a reliable cross-industry baseline of data transparency. Our Data Provenance Standards propose a solution.
Almost 40% of data scientists' working time goes to basic data preparation and cleansing tasks, according to a 2022 Anaconda report.
61% of CEOs cite lack of clarity on data lineage and provenance as a top barrier to adoption of generative AI, according to the annual IBM Institute for Business Value CEO study.
“Companies like ours feel a deep responsibility to ensure new value creation, as well as trust and transparency of data with all of our customers and stakeholders. Data provenance is critical to those efforts.”
— KEN FINNERTY, President, IT & Data Analytics at UPS
HOW WE CREATED THE STANDARDS
We took a "for industry, by industry" approach to creating the standards:
- Started by identifying provenance pain-points from 25 use cases across our member industries
- Iterated the standards through over 100 working sessions with Alliance Members and broader industry representatives
- Consolidated from 53 to 8 standards, focusing on business value and feasibility
- Co-created by CDOs, CIOs, and leads on data strategy, enterprise data and AI governance, compliance and legal from organizations across healthcare, automotive, IT, media, banking and finance, retail, education and other industries
It's not just about managing data; it's about fostering trust and reliability in AI.
Join our Community of Practice as we shape robust, transparent, and adoptable Data Provenance Standards.
FAQs
Currently, member companies are testing the proposed data provenance standards in various use cases, encompassing areas such as regulatory compliance, supply chain management, procurement processes, and virtual patient healthcare systems. These tests are not only focused on assessing business value and understanding the practical implementation requirements, but also include a crucial component of governance testing. This governance testing is designed to ensure that the standards adhere to relevant policies and legal frameworks, thereby making data management practices transparent, accountable, and aligned with both organizational and regulatory expectations.
We are actively calling for input. (1) You can learn more by watching the video above and/or downloading our info pack. (2) You can take our survey. The survey will be open until January 20, 2024. (3) You can join our open Community of Practice to offer new use cases, share learnings, and refine the work. Our goal is to create a baseline set of data provenance standards that is both valuable and implementable for a wide range of cross-industry traditional data and AI use cases.
No. The data provenance standards are designed for both traditional data and AI use cases. They are designed to bring transparency to the origin of datasets, which is helpful for use cases across the enterprise. Note that the standards currently only apply at the dataset level, not the element, table, or row level.
The data provenance standards address most generative AI use cases, but may not be applicable to all use cases. For example, data provenance standards can be challenging to apply effectively to large language models (LLMs) that are trained on vast amounts of public data sourced from diverse locations on the Internet.
Our goal is to refine the standards—based on both testing and practitioner input—and release a version 1 in early Q2 2024. Timing is subject to change. The standards will be available in a variety of formats, ranging from a PDF checklist to JSON and XML, which allows for automation. The standards and supporting tools will be available without charge.
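Because the standards specify generation dates in the yyyy-MM-dd'T'HH:mm:ss pattern and will ship in machine-readable formats such as JSON, a consuming system could validate that field automatically. The sketch below assumes a flat JSON layout with a `generation_date` key; the published schema may differ.

```python
import json
from datetime import datetime

def parse_generation_date(payload: str) -> datetime:
    """Parse the generation_date field from a JSON provenance record,
    enforcing the yyyy-MM-dd'T'HH:mm:ss pattern from the standards."""
    metadata = json.loads(payload)
    # strptime raises ValueError if the value deviates from the pattern.
    return datetime.strptime(metadata["generation_date"], "%Y-%m-%dT%H:%M:%S")

# Hypothetical record; the key name is assumed, not taken from the schema.
raw = '{"generation_date": "2024-01-15T09:30:00"}'
print(parse_generation_date(raw))  # 2024-01-15 09:30:00
```

Rejecting malformed dates at ingestion keeps downstream lineage queries consistent, since every dataset's generation date parses the same way.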
While we can provide guidance on how enterprises and data providers can effectively implement the standards to ensure consistency, interoperability, and data discoverability, the standards and metadata have been designed with ample flexibility to address diverse needs. A values library will also be available for organizations to use and reference, further enhancing consistency across industries. The library will be flexible to accommodate unique values not previously identified through existing use cases.
We have explored and are considering recommendations around immutable records, such as blockchain and data watermarking. Such mechanisms make it nearly impossible to delete or alter data without the consensus of network participants. This immutability can help protect the integrity of metadata associated with the dataset. However, because blockchain, watermarking, and similar solutions can be technically challenging to adopt and cost-prohibitive, the D&TA is not prescribing a method of immutability at this time.
Yes, the data provenance standards can be integrated with systems designed to support data governance, including Collibra.
The data provenance standards are designed to be cross-industry. They were created by practitioners from healthcare, automotive, IT, media, banking and finance, retail, education and other industries. We encourage practitioners to share new use cases and offer new metadata fields for consideration, and we would welcome any such practitioners to join our Community of Practice.
Yes. Organizations must navigate a complex landscape of laws and regulations to ensure their data practices are compliant. The Legal Rights standard identifies the legal or regulatory framework applicable to the current dataset, along with the required data attributions, associated copyright or trademark, and localization and processing requirements. The Privacy and Protection Standard further supports compliance by identifying if sensitive data is present and if any privacy-enhancing techniques have been applied, thus identifying where further steps must be taken to ensure data security and confidentiality.
We expect the standards to evolve over time, based on the experience of those implementing them. Currently, they are in a testing phase. We will refine the standards, based on both testing and community input, and release a fully accessible version 1 in early Q2 2024. Timing is subject to change.
The standards currently apply at the dataset level. Once version 1 is in use, we expect the standards to be refined over time and become applicable at the data cell or row level.
After the testing phase, we plan to share examples or case studies of our learnings and plans for implementation.
© 2023, The Center for Global Alliance.
All rights reserved.