StorageConnector
StorageConnector is a connector that connects to a storage system (e.g. AWS S3, Google Cloud Storage, Azure Storage), extracts metadata from files (e.g. CSV, Parquet), and transforms and uploads the metadata to a destination application (dataspot) using the upload API.
Find StorageConnector configuration examples here.
Functionality
StorageConnector follows the general connector architecture and workflow.
The metadata is extracted from the storage system and transformed to assets in the destination application (dataspot).
| Source | Asset |
|---|---|
| Directory | Collection |
| File | UmlClass, UmlAttribute, UmlDatatype |
StorageConnector extracts partitioned metadata from the subdirectory structure by identifying the partition columns and merging the partitioned files to UmlClass assets.
Partitioning refers to the technique of organizing data into directories and subdirectories based on the values of one or more columns, rather than storing all data in a single large file or folder.
The partition columns are not stored in the metadata files themselves.
Rather, their values are encoded in the subdirectory structure.
Each directory level represents a column, typically using a key=value convention to specify the column value.
Example: Parquet dataset of sales data partitioned by year and month
s3a://my-s3-bucket/sales_data/
├── year=2023/
│   ├── month=01/
│   │   ├── sales-0001.parquet
│   │   └── sales-0002.parquet
│   ├── month=02/
│   │   └── sales-0001.parquet
The transformed metadata is uploaded to the destination application by calling the upload API. The reconciliation options of the upload API specify how uploaded metadata is reconciled with existing metadata. The workflow options of the upload API specify the workflow statuses of inserted, updated, or deleted metadata.
Filters
The filtering mechanism for directories and files follows the same core principles. Each filter specifies matching criteria that determine whether the filter applies to a specific metadata object, as well as options for transforming the metadata to assets in the destination application.
Filters are nested according to the metadata hierarchy:
- Directory filters contain file filters.
- File filters contain partition filters.
Nested filters only apply to objects within the scope of their parent filter. When a parent filter matches and is applied, the nested filters are used to extract and transform the subordinate metadata objects.
If a filter list is null or empty, no metadata objects at that level are extracted.
Filters are evaluated in their declaration order - from top to bottom. For each metadata object, the first filter that matches is applied - the remaining filters at that level are ignored.
Because resolution is single-pass and only the first matching filter is applied, filter lists should be ordered from most specific to most general. This ordering keeps the extraction rules predictable, makes it possible to include or exclude precise subsets of the metadata, and allows each slice of metadata to be transformed differently in the destination application.
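As a minimal sketch of this ordering, the following configuration places a directory filter for sales/annual before the more general filter for sales. The directory names, file types, and stereotype value are illustrative assumptions; only the property names are taken from this page.

```yaml
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # most specific first: directories under 'sales/annual'
        - paths:
            - sales/annual
          files:
            - type: parquet
              stereotype: parquet
        # more general: all remaining directories under 'sales'
        - paths:
            - sales
          files:
            - type: csv
```

If the two filters were swapped, the sales filter would also match sales/annual (its path starts with sales), so the sales/annual filter would never be applied.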
Configuration
A StorageConnector service is configured by defining its unique name, the service type StorageConnector, and the configuration.
Example: StorageConnector
services:
  MyService:
    type: StorageConnector
While YAML itself doesn't enforce any naming style for property names, multi-word properties (for example, secret key) are typically specified in lowercase separated by hyphens (for example, secret-key).
This naming style - commonly referred to as kebab-case - is used in the following descriptions and examples.
However, all multi-word properties can also be specified in camelCase (for example, secretKey).
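For illustration, a secret-key authentication block written in camelCase might look like the sketch below. The spelling accessKey is an assumption derived by analogy with the secretKey example above; the value of method is left unchanged on the assumption that the naming style applies to property names only, not to property values.

```yaml
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: secret-key
        # camelCase spelling of 'access-key' and 'secret-key'
        accessKey: ${aws.s3.access-key}
        secretKey: ${aws.s3.secret-key}
```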
In addition to the general connector configuration that specifies the destination application, StorageConnector has the following configuration to specify the source as well as the ingestion filters.
Properties marked with * are required for StorageConnector to run.
Source
StorageConnector connects to a storage system using the specified connection URL and authentication settings.
🔑 Property source.url *
The connection URL of the storage system.
Required. StorageConnector connects to the storage system specified by the connection URL.
Example: Property source.url
services:
  MyService:
    type: StorageConnector
    source:
      url: s3a://my-s3-bucket
🔑 Property source.properties
The additional properties of the storage system connection, specified as a map of key/value pairs.
The map key is the property name. The map value is the property value.
The default is null (no additional properties).
If additional properties are defined, StorageConnector sets the corresponding properties of the storage system connection.
Additional properties can be used to modify the storage system connection, by setting properties that are either not supported by the connection URL or contain sensitive data (e.g. an access token) that shouldn't be included in the connection URL.
Example: Property source.properties
services:
  MyService:
    type: StorageConnector
    source:
      properties:
        fs.s3a.endpoint.region: eu-central-1
Authentication
StorageConnector can specify the authentication settings of the storage system.
🔑 Property source.authentication
The authentication settings of the storage system.
Optional. The default is null (no authentication).
If an authentication is defined, StorageConnector connects to the storage system with the specified authentication.
Otherwise, StorageConnector connects without authentication or falls back to the default authentication provider of the storage system.
🔑 Property source.authentication.method
The authentication method.
Required. The property is required if source.authentication is specified.
Depending on the storage system, StorageConnector supports the following authentication methods:
| URL scheme | Storage system | Authentication method | method |
|---|---|---|---|
| s3a | AWS S3 | Profile | profile |
| s3a | AWS S3 | Secret key | secret-key |
| gs | Google Cloud Storage | Keyfile | keyfile |
| abfss | Azure Storage | OAuth 2.0 | oauth |
Example: Property source.authentication.method
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: keyfile
Profile
StorageConnector can use a profile for connecting to the storage system.
🔑 Property source.authentication.profile
The profile name.
The named profiles are stored in profile files (e.g. ~/.aws/credentials).
The property can only be specified and is required if source.authentication.method is profile.
StorageConnector loads the specified profile from the profile file (e.g. ~/.aws/credentials) and uses it for authentication.
Example: Property source.authentication.profile
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: profile
        profile: ${aws.s3.profile}
Secret key
StorageConnector can use a secret key (static credentials) for connecting to the storage system.
🔑 Property source.authentication.access-key
The access key.
Required. The property can only be specified and is required if source.authentication.method is secret-key.
StorageConnector uses the specified access key and secret key for authentication.
The secret key is specified in source.authentication.secret-key.
Example: Property source.authentication.access-key
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: secret-key
        access-key: ${aws.s3.access-key}
        secret-key: ${aws.s3.secret-key}
🔑 Property source.authentication.secret-key
The secret key.
Required. The property can only be specified and is required if source.authentication.method is secret-key.
StorageConnector uses the specified access key and secret key for authentication.
The access key is specified in source.authentication.access-key.
Example: Property source.authentication.secret-key
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: secret-key
        access-key: ${aws.s3.access-key}
        secret-key: ${aws.s3.secret-key}
Keyfile
StorageConnector can use a JSON keyfile for connecting to the storage system.
🔑 Property source.authentication.keyfile
The absolute or relative path to the JSON keyfile.
Required. The property can only be specified and is required if source.authentication.method is keyfile.
StorageConnector loads the specified JSON keyfile and uses it for authentication.
Example: Property source.authentication.keyfile
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: keyfile
        keyfile: /home/connector/gs-sa-key.json
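Combined with the connection URL, a complete source block for Google Cloud Storage could look like the following sketch. The bucket name my-gcs-bucket is hypothetical; the gs scheme and the keyfile method are taken from the authentication method table above.

```yaml
services:
  MyService:
    type: StorageConnector
    source:
      # hypothetical bucket name
      url: gs://my-gcs-bucket
      authentication:
        method: keyfile
        keyfile: /home/connector/gs-sa-key.json
```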
OAuth 2.0
StorageConnector can use OAuth 2.0 authentication for connecting to the storage system.
The application supports non-interactive (machine to machine) grants to obtain an access token as a client application or to obtain an ID token as an end-user.
🔑 Property source.authentication.provider-url
The URL of the identity provider that supports OAuth 2.0.
Required. The property can only be specified and is required if source.authentication.method is oauth.
StorageConnector uses the provider URL and the client ID to contact the identity provider.
The client ID is specified in source.authentication.client-id.
Example: Property source.authentication.provider-url
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
🔑 Property source.authentication.client-id
The OAuth 2.0 client ID.
Required. The property can only be specified and is required if source.authentication.method is oauth.
StorageConnector uses the provider URL and the client ID to contact the identity provider.
The provider URL is specified in source.authentication.provider-url.
Example: Property source.authentication.client-id
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
🔑 Property source.authentication.credentials.type
The credentials type.
Required. The property can only be specified and is required if source.authentication.method is oauth.
Depending on the storage system, StorageConnector supports the following credentials types to obtain an access or ID token:
| URL scheme | Credentials type | type |
|---|---|---|
| abfss | Client credentials with client secret | client-secret |
| abfss | Resource owner password credentials | password |
Example: Property source.authentication.credentials.type
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: password
Client credentials with client secret
The OAuth 2.0 client credentials grant with a client secret is a non-interactive (machine to machine) authentication.
StorageConnector authenticates as a client application, rather than as an end-user, to obtain an access token.
🔑 Property source.authentication.credentials.client-secret
The client secret.
Required. The property can only be specified and is required if source.authentication.credentials.type is client-secret.
StorageConnector uses the specified client secret to obtain an access token from the identity provider.
Example: Property source.authentication.credentials.client-secret
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: client-secret
          client-secret: ${aws.s3.client-secret}
Resource owner password credentials
The OAuth 2.0 resource owner password credentials (ROPC) grant with a username and password is a non-interactive (machine to machine) authentication.
StorageConnector authenticates as an end-user and uses the OpenID Connect (OIDC) authentication layer to obtain an ID token.
If the identity provider requires multi-factor authentication (MFA), and therefore user interaction, resource owner password credentials (ROPC) are not a suitable machine to machine authentication method. In that case, a non-interactive authentication must be used instead (e.g. client credentials with client secret).
🔑 Property source.authentication.credentials.username
The username.
Typically, the username is an e-mail address.
The property can only be specified and is required if source.authentication.credentials.type is password.
StorageConnector uses the specified username and password to obtain an ID token from the identity provider.
The password is specified in source.authentication.credentials.password.
Example: Property source.authentication.credentials.username
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: password
          username: ${aws.s3.username}
          password: ${aws.s3.password}
🔑 Property source.authentication.credentials.password
The password.
Required. The property can only be specified and is required if source.authentication.credentials.type is password.
StorageConnector uses the specified username and password to obtain an ID token from the identity provider.
The username is specified in source.authentication.credentials.username.
Example: Property source.authentication.credentials.password
services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: password
          username: ${aws.s3.username}
          password: ${aws.s3.password}
Options
🔑 Property ingestion.options.read
The mode that determines which files to read, based on when they were created or last modified.
Optional. The default is all.
StorageConnector supports the following read modes:
| read | Description |
|---|---|
| all | Read all files, regardless of when they were modified. |
| latest | Read the latest modified file only. |
| since-last-run | Read the files modified since the last run. |
| since-timestamp | Read the files modified since the specified timestamp. |
Example: Property ingestion.options.read
services:
  MyService:
    type: StorageConnector
    ingestion:
      options:
        read: all
🔑 Property ingestion.options.timestamp
The timestamp that limits which files to read.
Optional. The default is null (no limit).
If the read mode is since-timestamp, StorageConnector reads the files modified since the specified timestamp.
If the timestamp is null, StorageConnector reads all files, regardless of when they were modified.
The timestamp can be specified in the following formats:
| Format | Description | Example |
|---|---|---|
| Epoch milliseconds | Number of milliseconds since the Unix epoch (1970-01-01T00:00:00.000). | 1752512124375 |
| ISO 8601 | ISO 8601 representation, with millisecond precision, without a time zone. | 2025-07-14T16:55:24.375 |
Example: Property ingestion.options.timestamp
services:
  MyService:
    type: StorageConnector
    ingestion:
      options:
        read: since-timestamp
        timestamp: 2025-07-14T16:55:24.375
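The same limit can be sketched in the epoch milliseconds format. The value below is the example from the format table; it denotes the same instant as the ISO 8601 example if that value is interpreted as UTC, which is an assumption, since the ISO 8601 format carries no time zone.

```yaml
services:
  MyService:
    type: StorageConnector
    ingestion:
      options:
        read: since-timestamp
        # epoch milliseconds; 2025-07-14T16:55:24.375 when read as UTC
        timestamp: 1752512124375
```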
Directories
StorageConnector extracts directories from the storage system and transforms them to Collection assets.
Directory filters define which directories to extract and specify their transformation options.
🔑 Property ingestion.directories
The ordered list of directory filters that specify which directories to extract and their transformation options.
Optional. The default is null (don't extract any directories).
For each directory in the storage system, StorageConnector evaluates the directory filters in their declaration order - from top to bottom.
The first directory filter that matches is applied - the remaining directory filters are ignored.
A directory matches a directory filter, if the directory path starts with one of the directory filter's paths.
In this case, the directory is extracted and transformed to Collection assets using the transformation options of the directory filter.
The files of the directory are extracted using the file filters nested in the applied directory filter.
Notice that a directory matches a directory filter, if the directory path starts with one of the directory filter's path patterns. Since all directories in the subdirectory tree also start with that path pattern, these recursive subdirectories - and consequently their files - are also considered for extraction.
If the list of directory filters is null or empty, no directories are extracted.
Example: Property ingestion.directories
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # extract directories 'sales/annual' and 'finance'
        - paths:
            - sales/annual
            - finance
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
        # extract directory 'product'
        - paths:
            - product
          files:
            # extract csv files
            - type: csv
If the list of directory filters is a single-value list, it can be formatted as a single value, rather than as a list with a single value.
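A sketch of this shorthand, assuming it works as described in the sentence above: the single directory filter is written directly as a mapping under directories, without the leading list dash. The directory name and file type are illustrative.

```yaml
services:
  MyService:
    type: StorageConnector
    ingestion:
      # one directory filter, written as a single value instead of a one-item list
      directories:
        paths:
          - sales
        files:
          - type: csv
```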
🔑 Property ingestion.directories[].paths
The list of path patterns applied to filter directories based on their directory path.
A path pattern may contain the wildcard character *, matching any directory name in the path.
The property is required for each directory filter.
StorageConnector evaluates the path patterns in their declaration order - from top to bottom - to determine whether to extract the directory.
A directory matches a path pattern, if the directory path starts with the path pattern.
If the file type is partitioned, the first path pattern that matches the extracted directory is also the base directory for extracting partitioned metadata.
The subdirectories located in the base directory are considered to be object directories containing partition columns and files.
Each object directory is transformed to a UmlClass and the partition columns and files extracted from the object's subdirectory tree are merged into this single UmlClass asset.
Example: Property ingestion.directories[].paths
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # extract directory 'sales/annual' and directories matching 'finance/*'
        - paths:
            - sales/annual
            - finance/*
🔑 Property ingestion.directories[].deployment
The deployment of the transformed assets.
Optional. The default is null (no deployment).
For each UmlClass asset transformed from the extracted files in the directory, StorageConnector creates a Deployment link and sets the specified deployment system, favorite flag, and qualifier.
Example: Property ingestion.directories[].deployment
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # extract directory 'sales'
        - paths:
            - sales
          # transform the assets with deployments to '/Systems/Sales/ARS'
          deployment:
            deployed-in: /Systems/Sales/ARS
            favorite: true
            qualifier: SPOT
          files:
            # extract csv files
            - type: csv
🔑 Property ingestion.directories[].deployment.deployed-in
The system in which the asset is deployed for execution or storage purposes.
Required. The property is required if ingestion.directories[].deployment is specified.
🔑 Property ingestion.directories[].deployment.favorite
The flag that specifies if the deployment should be marked as favorite.
Optional. The default is false (not favorite).
🔑 Property ingestion.directories[].deployment.qualifier
The additional qualifier characterizing the deployment.
Optional. The default is null (no qualifier).
Files
StorageConnector extracts files from the storage system and transforms them to UmlClass, UmlAttribute, and UmlDatatype assets.
File filters are nested in directory filters and define which files to extract and specify their transformation options.
🔑 Property ingestion.directories[].files
The ordered list of file filters that specify which files to extract and their transformation options.
Required. The property is required for each directory filter.
For each file in the extracted directory, StorageConnector evaluates the file filters in their declaration order - from top to bottom.
The first file filter that matches is applied - the remaining file filters are ignored.
A file matches a file filter, if the file extension matches the file filter's extensions.
In this case, the file is extracted and transformed to UmlClass, UmlAttribute, and UmlDatatype assets using the transformation options of the file filter.
Partitioned metadata is extracted using the partition filter nested in the applied file filter.
If the file type is partitioned, each object directory is determined by the matching path pattern and transformed to a UmlClass.
The extracted files in the object's subdirectory tree are merged into this single UmlClass asset - rather than creating a separate UmlClass asset for each file - and the partition columns are extracted from the directory structure.
If the list of file filters is null or empty, no files are extracted.
Example: Property ingestion.directories[].files
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
            - finance
          files:
            # extract unstructured files
            - type: unstructured
If the list of file filters is a single-value list, it can be formatted as a single value, rather than as a list with a single value.
🔑 Property ingestion.directories[].files[].type
The file type to extract.
Required. The property is required for each file filter.
If a file matches the file filter, StorageConnector extracts the metadata from the file as specified by the file type.
The following file types are supported:
| type | Description |
|---|---|
| parquet | Extract metadata from the metadata section in Parquet files. The file is transformed to UmlClass, UmlAttribute, and UmlDatatype assets. |
| csv | Extract metadata from the header record in CSV files. The file is transformed to UmlClass and UmlAttribute assets. |
| unstructured | Extract metadata from unstructured files. The file is transformed to a UmlClass asset. |
The file type determines the default file extensions to process.
Example: Property ingestion.directories[].files[].type
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
            - finance
          files:
            # extract parquet and csv files
            - type: parquet
            - type: csv
🔑 Property ingestion.directories[].files[].extensions
The list of file extensions to extract.
Optional. The default is null (use the default file extensions of the file type).
If file extensions are defined, StorageConnector uses them to match the extension of each file to determine whether to extract the file.
Otherwise, the default file extensions of the file type are used.
The default file extensions are:
| type | Default |
|---|---|
| parquet | [parquet, pqt] |
| csv | [csv] |
| unstructured | (all extensions) |
Example: Property ingestion.directories[].files[].extensions
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - documents
          files:
            # extract 'docx' and 'pdf' files
            - type: unstructured
              extensions:
                - docx
                - pdf
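Because the first matching file filter is applied and unstructured matches all extensions by default, the order of file filters matters when file types are combined. The sketch below lists the structured file types first and uses unstructured as a catch-all; the directory name and the additional txt extension for CSV files are illustrative assumptions.

```yaml
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # structured file types first
            - type: parquet
            - type: csv
              extensions:
                - csv
                - txt
            # catch-all last: matches every file not matched above
            - type: unstructured
```

If the unstructured filter were listed first, it would match every file, and the parquet and csv filters would never be applied.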
🔑 Property ingestion.directories[].files[].stereotype
The stereotype of the transformed UmlClass asset.
The default is file.
If a stereotype is defined, StorageConnector sets the stereotype of the transformed UmlClass asset.
Example: Property ingestion.directories[].files[].stereotype
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract parquet files and transform them with stereotype 'parquet'
            - type: parquet
              stereotype: parquet
If the specified stereotype doesn't exist in the scheme of the destination application, the stereotype is ignored.
🔑 Property ingestion.directories[].files[].delimiter
The delimiter character separating the record values in CSV files.
The property can only be specified if ingestion.directories[].files[].type is csv.
The default is , (comma).
StorageConnector uses the specified delimiter character to parse the CSV file.
Example: Property ingestion.directories[].files[].delimiter
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract csv files using the delimiter character ';'
            - type: csv
              delimiter: ;
Partition
StorageConnector extracts partition columns from the subdirectory structure and transforms them to UmlAttribute assets.
Partition filters are nested in file filters and define which partition columns to extract and specify their transformation options.
🔑 Property ingestion.directories[].files[].partition
The partition filter that specifies that the file type is partitioned.
If the file type is partitioned, each object directory is determined by the matching path pattern. The partition columns - encoded in the subdirectory structure - and files in the object subdirectory tree are considered to belong to the same object.
The default is null (the file type is not partitioned).
Each object directory - determined by the matching path pattern - is extracted and transformed to a UmlClass asset.
The files in the object's subdirectory tree are merged into this single UmlClass asset - rather than creating a separate UmlClass asset for each file.
For each subdirectory, starting from the object directory, StorageConnector evaluates the partition filter to identify the partition columns.
A subdirectory is a partition column, if the subdirectory name contains the delimiter character.
In this case, the partition column is extracted and transformed to a UmlAttribute asset using the transformation options of the partition filter.
Example: Property ingestion.directories[].files[].partition
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
🔑 Property ingestion.directories[].files[].partition.delimiter
The delimiter character separating the partition column key from its value.
Optional. The default is null (don't extract partition columns).
If a delimiter character is defined, StorageConnector uses it to identify the partition columns in the subdirectory structure.
Otherwise, partition columns are not extracted from the subdirectory structure.
Example: Property ingestion.directories[].files[].partition.delimiter
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
🔑 Property ingestion.directories[].files[].partition.stereotype
The stereotype of the transformed UmlAttribute asset.
The default is null (no stereotype).
If a stereotype is defined, StorageConnector sets the stereotype of the transformed UmlAttribute asset.
Example: Property ingestion.directories[].files[].partition.stereotype
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
                # transform the assets with stereotype 'column'
                stereotype: column
If the specified stereotype doesn't exist in the scheme of the destination application, the stereotype is ignored.
🔑 Property ingestion.directories[].files[].partition.columns
The list of additional partition columns of the file type.
Optional. The default is null (no additional partition columns).
If additional partition columns are defined, StorageConnector transforms them to UmlAttribute and UmlDatatype assets.
The list of additional partition columns allows partition columns to be defined manually, in case the subdirectory structure doesn't reflect the column keys.
Example: Parquet dataset of sales data partitioned by year and month without column keys
s3a://my-s3-bucket/sales_data/
├── 2023/
│   ├── 01/
│   │   ├── sales-0001.parquet
│   │   └── sales-0002.parquet
│   ├── 02/
│   │   └── sales-0001.parquet
Example: Property ingestion.directories[].files[].partition.columns
services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
                # additional partition columns 'year' and 'category'
                columns:
                  - key: year
                    datatype: DATE
                  - key: category
🔑 Property ingestion.directories[].files[].partition.columns[].key
The key of the additional partition column.
Required. The property is required for each additional partition column.
StorageConnector transforms the key to a UmlAttribute asset.
🔑 Property ingestion.directories[].files[].partition.columns[].datatype
The datatype of the additional partition column.
Optional. The default is null (no datatype).
If a datatype is defined, StorageConnector transforms it to a UmlDatatype asset.
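Putting the source and ingestion settings together, a complete service definition might look like the following sketch. It uses only properties described on this page; the combination of values is illustrative, and the general connector configuration for the destination application is omitted. The delimiter values are quoted here so that any YAML parser reads them as plain strings.

```yaml
services:
  MyService:
    type: StorageConnector
    source:
      url: s3a://my-s3-bucket
      properties:
        fs.s3a.endpoint.region: eu-central-1
      authentication:
        method: secret-key
        access-key: ${aws.s3.access-key}
        secret-key: ${aws.s3.secret-key}
    ingestion:
      options:
        # read only the files modified since the last run
        read: since-last-run
      directories:
        - paths:
            - sales
          deployment:
            deployed-in: /Systems/Sales/ARS
          files:
            # partitioned parquet files, merged per object directory
            - type: parquet
              stereotype: parquet
              partition:
                delimiter: "="
                stereotype: column
            # semicolon-separated csv files
            - type: csv
              delimiter: ";"
```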