AI Transparency Data Disclosures: Preparing For California AB 2013

Beginning January 1, 2026, Cal. AB 2013 (the California Generative Artificial Intelligence Training Data Transparency Act) requires companies that design, code, produce, or significantly update certain generative AI systems to publicly disclose on their websites information about the datasets used to train those systems.

Cal. AB 2013 Questionnaire

Part A. Is Our Company Subject to the Requirements of Cal. AB 2013?

Have we designed, coded or produced a generative AI system¹ that we (or a third party) have made publicly available?
Have we retrained, fine-tuned or otherwise significantly updated a generative AI system that we (or a third party) have made publicly available, in a way that changes its functionality or performance?

If the answer to both questions in Part A is no, your company is not subject to the requirements of Cal. AB 2013.

If the answer to either question in Part A is yes, please move to Part B to determine whether an exception is applicable.

Part B. Is Our Generative AI System Subject to Any Exceptions under Cal. AB 2013?

Are any of the following statements true about this generative AI system?
1. The system's sole purpose is to help ensure security and integrity;²
2. The system's sole purpose is the operation of aircraft in the national airspace;
3. The system was developed for national security, military, or defense purposes and is made available only to a U.S. federal government entity.

If the answer to any of the subparts in Part B is yes, the generative AI system is not within the scope of Cal. AB 2013.

If the answer to each of the subparts in Part B is no, your company is subject to the requirements of Cal. AB 2013 and you should move to Part C to determine the information required for your company's website disclosure.
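
For teams that track AI systems in an internal inventory, the Part A and Part B screening above can be sketched in a few lines of Python. This is an illustrative sketch only, not a complete statement of the statute; the `SystemFacts` record, its field names, and the `disclosure_required` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class SystemFacts:
    """Hypothetical record of one system's answers to Parts A and B."""
    # Part A: either answer being True brings the company within scope.
    developed_and_public: bool
    significantly_updated_and_public: bool
    # Part B: any answer being True takes the system out of scope.
    sole_purpose_security_integrity: bool = False
    sole_purpose_aircraft_operation: bool = False
    national_security_federal_only: bool = False


def disclosure_required(facts: SystemFacts) -> bool:
    """Return True when the questionnaire leads to Part C."""
    in_scope = (
        facts.developed_and_public
        or facts.significantly_updated_and_public
    )
    exempt = (
        facts.sole_purpose_security_integrity
        or facts.sole_purpose_aircraft_operation
        or facts.national_security_federal_only
    )
    return in_scope and not exempt
```

A result of `True` corresponds to "move to Part C"; a result of `False` means either both Part A answers were no or at least one Part B exception applies.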

Part C. What Information Should We Disclose Regarding Our Generative AI System?

For each dataset (i.e., each single, pre-packaged collection of data) used to train (including to test, validate or fine-tune) the generative AI system subject to Cal. AB 2013,³ please:
1. Identify the source or owner of the dataset and indicate whether the dataset was purchased or licensed;
2. Provide the time period when the data in the dataset were collected, indicate whether collection is ongoing, and identify when the dataset was first used in the development of the generative AI system;
3. Provide the number of data points included in the dataset (for a dynamic dataset, a general range is acceptable);
4. Confirm whether the dataset contains:
  1. Data protected by copyright, trademark or patent law (or, if applicable, indicate whether the data are entirely in the public domain, and therefore not subject to copyright, trademark or patent law);
  2. Personal information;⁴
  3. Aggregate consumer information;⁵
  4. Synthetic data.⁶
5. Confirm whether we cleaned, processed or otherwise modified the dataset:
  1. If so, please describe the intended purpose of the cleaning, processing or modification of the dataset.
6. Describe:
  1. If the dataset includes labels, the types of labels used;
  2. If the dataset does not include labels, the general characteristics of the data;⁷
  3. With respect to items 1 and 2 above, as applicable, how the dataset will contribute to the purpose of the generative AI system.⁸

Cal. AB 2013 Template Disclosure⁹

System Information¹⁰

System Name:
Developer: [Company Name]
Substantial Modification:¹¹ [Yes] or [No]
- [Model Dependencies and Developer, as applicable: [indicate underlying model/version and developer]]
Release Date: [MM/DD/YYYY]
Version: [e.g., v1.0, v2.5]

High-Level Summary of Datasets

[Dataset 1] or [Group of Datasets 1]¹²

Key Information	[Template Disclosure]
Source or Owner	[Name of source of the Dataset] or [Name of owner of the Dataset]
Purchased or Licensed	[This Dataset was [purchased or licensed] or [N/A]]
Time Period of Collection of Data in Dataset	[Month, YYYY – Month, YYYY] or [Data in this Dataset was originally collected beginning in [YYYY]; data collection for this Dataset is currently ongoing]¹³
Date of First Use in Development of System	[Month, YYYY]
Number of Data Points	[Approximately [value and unit of measurement of data points in the Dataset - e.g., 5,000,000 tokens, 10,000 images, 100 recordings]]¹⁴
Contains Data Protected by Copyright, Trademark, or Patent Law	[Yes] or [No]¹⁵
Dataset is Entirely in the Public Domain	[Yes] or [No]
Contains Personal Information	[Yes] or [No]¹⁶
Contains Aggregate Consumer Information	[Yes] or [No]
Uses Synthetic Data	[Yes] or [Yes, the purpose of the [use or continuous use] of the synthetic data in this Dataset is to [explain the functional need of the use of the synthetic data in relation to the purpose of the System - e.g., train safety filters of the System]]¹⁷ or [No]
Dataset Has Been Cleaned, Processed or Otherwise Modified	[Yes, this Dataset was [high-level description of cleaning, processing, and/or other modification activity - e.g., filtered for profanity] in order to [high-level description of purpose of modification activity in relation to System - e.g., maintain platform standards]] or [No]
Description of Types of Data Points	[List the types of labels used in the Dataset or, if labels are not used in the Dataset, include a general description of the format of the Dataset and sample values - e.g., images, including images of trees and lakes]
Purpose in Relation to System	[This Dataset furthers the intended purpose of the System by [general description of how the Dataset helps the System achieve its purpose - e.g., improving the System's reasoning]]
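
Companies that maintain many Datasets may find it easier to keep the template fields in a structured record and generate the two-column table from it. The following Python sketch shows one possible shape for such a record. It is an assumption-laden illustration, not a required format: the `DatasetDisclosure` class, its field names, and the helpers `yes_no` and `data_point_range` are hypothetical, and the range labels simply mirror the example ranges suggested for dynamic Datasets in the notes to this template.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def yes_no(flag: bool) -> str:
    """Render a boolean as the template's [Yes] or [No]."""
    return "Yes" if flag else "No"


def data_point_range(count: int) -> str:
    """Bucket an estimated count into a general range (illustrative thresholds)."""
    if count < 10_000:
        return "less than 10,000 data points"
    if count <= 1_000_000:
        return "10,000 to 1,000,000 data points"
    return "over 1,000,000 data points"


@dataclass
class DatasetDisclosure:
    """Hypothetical record mirroring the rows of the template table."""
    source_or_owner: str
    acquisition: Optional[str]      # "purchased", "licensed", or None for N/A
    collection_period: str
    first_use: str
    data_points: str
    protected_ip: bool
    public_domain: bool
    personal_information: bool
    aggregate_consumer_information: bool
    synthetic_data: bool
    modification: Optional[str]     # activity and purpose, or None if unmodified
    data_point_types: str
    purpose: str

    def rows(self) -> List[Tuple[str, str]]:
        """Return (Key Information, Template Disclosure) pairs in table order."""
        acquired = (
            f"This Dataset was {self.acquisition}" if self.acquisition else "N/A"
        )
        modified = (
            f"Yes, this Dataset was {self.modification}"
            if self.modification
            else "No"
        )
        return [
            ("Source or Owner", self.source_or_owner),
            ("Purchased or Licensed", acquired),
            ("Time Period of Collection of Data in Dataset", self.collection_period),
            ("Date of First Use in Development of System", self.first_use),
            ("Number of Data Points", self.data_points),
            ("Contains Data Protected by Copyright, Trademark, or Patent Law",
             yes_no(self.protected_ip)),
            ("Dataset is Entirely in the Public Domain", yes_no(self.public_domain)),
            ("Contains Personal Information", yes_no(self.personal_information)),
            ("Contains Aggregate Consumer Information",
             yes_no(self.aggregate_consumer_information)),
            ("Uses Synthetic Data", yes_no(self.synthetic_data)),
            ("Dataset Has Been Cleaned, Processed or Otherwise Modified", modified),
            ("Description of Types of Data Points", self.data_point_types),
            ("Purpose in Relation to System", self.purpose),
        ]
```

Keeping one record per Dataset (or per group of Datasets) makes it straightforward to regenerate the published table whenever a System is substantially modified.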

Footnotes

1 "Generative artificial intelligence" refers to AI that can generate synthetic content, such as text, images, video, and audio.

2 For purposes of this questionnaire, "security and integrity" means the ability to detect security incidents, resist malicious, deceptive, fraudulent, or illegal actions and to help prosecute those responsible for such actions, and to ensure physical safety.

3 For pre-existing AI models that we fine-tuned, distilled or otherwise modified, these questions should be answered with respect to the training content we used and our own modifications of the generative AI system, not the training content or process used by the underlying AI model provider.

4 "Personal information" means information that identifies or is reasonably capable of being associated or linked with a particular consumer or household.

5 "Aggregate consumer information" means information that relates to a group or category of individuals, from which individual identities have been removed such that the information is not reasonably linkable to any individual or household, including via a device.

6 "Synthetic data" refers to data generated when seed data are used to create artificial data that have some of the statistical characteristics of the seed data.

7 General characteristics may include the format (e.g., image, audio, video, text) and sample values of the underlying data points. Sample values will depend on the format of the data: (i) for image, examples may include photography, visual art works, infographics, social media images, logos, or brands; (ii) for audio, examples may include musical compositions and recordings, audiobooks, radio shows and podcasts, or private audio communications; (iii) for video, examples may include music videos, films, TV programs, performances, video games, video clips, journalistic videos, or social media videos; (iv) for text, examples may include fiction and non-fiction text, scientific text, press publications, legal and official documents, social media comments, or source code.

8 This purpose explanation can be relatively high-level (e.g., "because the dataset is comprised of images of trees, it will help our AI system, which is intended to identify objects in nature, achieve its intended purpose", or "because the dataset is comprised of guitar sounds, it will help our AI system, which is intended to create music based on specific genres, achieve its intended purpose").

9 Cal. AB 2013 requires companies that design, code, produce, or significantly update certain generative AI systems or services (the "Systems") to publicly disclose on their websites information concerning the datasets used to train those generative AI systems (the "Datasets"). Systems that are exempted from this disclosure requirement are Systems (i) with the sole purpose of helping ensure security and integrity; (ii) with the sole purpose of operating aircraft in the national airspace; or (iii) that were developed for national security, military, or defense purposes and are made available only to a U.S. federal government entity.

10 For pre-existing Systems (i.e., third-party Systems) to which the disclosing company has made a "substantial modification" (defined below), this template should be completed with respect to the training content the disclosing company used and such company's modifications of the pre-existing System, not the Datasets used by the developer of the pre-existing System to train the pre-existing System.

11 "Substantial Modification" refers to a new version, new release, or other update to a System that materially changes its functionality or performance, including the results of retraining or fine-tuning.

12 The template disclosures do not necessarily need to be provided for each specific Dataset used to train the relevant System (although, if the System was trained on a limited number of Datasets, this may be easiest). For Systems that were trained on a large number of Datasets, grouping disclosures by common characteristics of the Datasets may be the preferred approach (e.g., Datasets consisting of English speech recordings and Datasets consisting of Spanish speech recordings could be grouped together in a set of disclosures regarding speech recordings).

13 Use the second disclosure option if the collection of data in the Dataset is continuous/ongoing.

14 For a dynamic Dataset, a general range of data points is acceptable (e.g., less than 10,000 data points, between 10,000 and 1,000,000 data points, over 1,000,000 data points).

15 If the Dataset contains data protected by copyright, trademark, or patent law, the disclosing company may wish to provide additional context regarding such data, including how the relevant intellectual property rights have been secured by the disclosing company. The determination as to whether additional context or clarification to this disclosure is appropriate will depend on the facts and circumstances relating to the Dataset, the System, and the disclosing company.

16 If the Dataset contains personal information, the disclosing company may wish to provide additional context regarding the type of personal information contained in the Dataset, including whether such personal information is publicly available. The determination as to whether additional context or clarification to this disclosure is appropriate will depend on the facts and circumstances relating to the Dataset, the System, and the disclosing company.

17 The statute provides that developers may (but are not required to) include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the system or service.

The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.
