StorageConnector

StorageConnector is a connector that connects to a storage system (e.g. AWS S3, Google Cloud Storage, Azure Storage), extracts metadata from files (e.g. CSV, Parquet), and transforms and uploads the metadata to a destination application (dataspot) using the upload API.

Tooltip

Find StorageConnector configuration examples here.

Functionality

StorageConnector follows the general connector architecture and workflow.

The metadata is extracted from the storage system and transformed to assets in the destination application (dataspot).

Source    | Asset
Directory | Collection
File      | UmlClass, UmlAttribute, UmlDatatype

StorageConnector extracts partitioned metadata from the subdirectory structure by identifying the partition columns and merging the partitioned files to UmlClass assets.

Did you know?

Partitioning refers to the technique of organizing data into directories and subdirectories based on the values of one or more columns, rather than storing all data in a single large file or folder. The partition columns are not stored in the metadata files themselves. Rather, their values are encoded in the subdirectory structure. Each directory level represents a column, typically using a key=value convention to specify the column value.

Example: Parquet dataset of sales data partitioned by year and month

s3a://my-s3-bucket/sales_data/
├── year=2023/
│ ├── month=01/
│ │ ├── sales-0001.parquet
│ │ └── sales-0002.parquet
│ ├── month=02/
│ │ └── sales-0001.parquet
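
The key=value convention described above can be illustrated with a short Python sketch. It is not StorageConnector's implementation; the function name `partition_values` and the default delimiter are assumptions made for this illustration only.

```python
from pathlib import PurePosixPath


def partition_values(relative_path: str, delimiter: str = "=") -> dict[str, str]:
    """Collect the key/value pairs encoded in the directory names of a file path."""
    values: dict[str, str] = {}
    # Only the parent directories are inspected; the file name itself is skipped.
    for part in PurePosixPath(relative_path).parent.parts:
        key, sep, value = part.partition(delimiter)
        if sep and key:  # a directory without the delimiter is not a partition level
            values[key] = value
    return values


print(partition_values("sales_data/year=2023/month=01/sales-0001.parquet"))
# {'year': '2023', 'month': '01'}
```

The directory `sales_data` contains no delimiter, so it contributes no partition column.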

The transformed metadata is uploaded to the destination application by calling the upload API. The reconciliation options of the upload API specify how uploaded metadata is reconciled with existing metadata. The workflow options of the upload API specify the workflow statuses of inserted, updated, or deleted metadata.

Filters

The filtering mechanism for directories and files follows the same core principles. Filters specify matching criteria, to determine whether the filter applies to a specific metadata object, as well as options for transforming the metadata to assets in the destination application.

Filters are nested according to the metadata hierarchy:

  • Directory filters contain file filters.
  • File filters contain partition filters.

Nested filters only apply to objects within the scope of their parent filter. When a parent filter matches and is applied, the nested filters are used to extract and transform the subordinate metadata objects.

Note

If a filter list is null or empty, no metadata objects at that level are extracted.

Filters are evaluated in their declaration order - from top to bottom. For each metadata object, the first filter that matches is applied - the remaining filters at that level are ignored.

Tooltip

Because resolution is single-pass and only the first matching filter is applied, filter lists should be ordered from most specific to most general. This ordering keeps extraction rules predictable: precise subsets of the metadata can be included or excluded, and each slice of metadata can be transformed to assets in the destination application in its own way.
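
The first-match rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the connector's code: a filter is represented as a hypothetical `(label, predicate)` pair, and `first_match` is an illustrative name.

```python
from typing import Callable, Optional

# Hypothetical filter representation: a label plus a matching predicate.
Filter = tuple[str, Callable[[str], bool]]


def first_match(filters: Optional[list[Filter]], obj: str) -> Optional[str]:
    """Return the label of the first filter that matches, or None."""
    if not filters:
        return None  # a null or empty filter list extracts nothing
    for label, matches in filters:
        if matches(obj):
            return label  # the remaining filters at this level are ignored
    return None


filters = [
    ("annual sales", lambda path: path.startswith("sales/annual")),  # most specific first
    ("all sales", lambda path: path.startswith("sales")),            # general fallback
]

print(first_match(filters, "sales/annual/2023"))  # annual sales
print(first_match(filters, "sales/monthly"))      # all sales
```

If the two filters were declared in the opposite order, the general filter would match every path under `sales` and the specific one would never be applied.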

Configuration

A StorageConnector service is configured by defining its unique name, the service type StorageConnector, and the configuration.

Example: StorageConnector

services:
  MyService:
    type: StorageConnector

Tooltip

While YAML itself doesn't enforce any naming style for property names, multi-word properties (for example, secret key) are typically specified in lowercase separated by hyphens (for example, secret-key). This naming style - commonly referred to as kebab-case - is used in the following descriptions and examples. However, all multi-word properties can also be specified in camelCase (for example, secretKey).

In addition to the general connector configuration that specifies the destination application, StorageConnector has the following configuration to specify the source as well as the ingestion filters.

Tooltip

Properties marked with * are required for StorageConnector to run.

Source

StorageConnector connects to a storage system using the specified connection URL and authentication settings.

🔑 Property source.url *

The connection URL of the storage system.

required

StorageConnector connects to the storage system specified by the connection URL.

Example: Property source.url

services:
  MyService:
    type: StorageConnector
    source:
      url: s3a://my-s3-bucket

🔑 Property source.properties

The additional properties of the storage system connection, specified as a map of key/value pairs.

Note

The map key is the property name. The map value is the property value.

optional

The default is null (no additional properties).

If additional properties are defined, StorageConnector sets the corresponding properties of the storage system connection.

Tooltip

Additional properties can be used to modify the storage system connection, by setting properties that are either not supported by the connection URL or contain sensitive data (e.g. an access token) that shouldn't be included in the connection URL.

Example: Property source.properties

services:
  MyService:
    type: StorageConnector
    source:
      properties:
        fs.s3a.endpoint.region: eu-central-1

Authentication

The authentication settings specify how StorageConnector authenticates with the storage system.

🔑 Property source.authentication

The authentication settings of the storage system.

optional

The default is null (no authentication).

If an authentication is defined, StorageConnector connects to the storage system with the specified authentication. Otherwise, StorageConnector connects without authentication or falls back to the default authentication provider of the storage system.

🔑 Property source.authentication.method

The authentication method.

required

The property is required if source.authentication is specified.

Depending on the storage system, StorageConnector supports the following authentication methods:

URL scheme | Storage system       | Authentication method | method
s3a        | AWS S3               | Profile               | profile
s3a        | AWS S3               | Secret key            | secret-key
gs         | Google Cloud Storage | Keyfile               | keyfile
abfss      | Azure Storage        | OAuth 2.0             | oauth
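
The relationship between URL scheme and authentication method can be expressed as a lookup. The following Python sketch only restates the table; the names `SUPPORTED_METHODS` and `is_supported` are hypothetical and do not refer to anything in StorageConnector.

```python
from urllib.parse import urlparse

# URL scheme -> authentication methods listed for that storage system.
SUPPORTED_METHODS = {
    "s3a": {"profile", "secret-key"},
    "gs": {"keyfile"},
    "abfss": {"oauth"},
}


def is_supported(url: str, method: str) -> bool:
    """Check whether the method is listed for the URL scheme of the connection URL."""
    scheme = urlparse(url).scheme
    return method in SUPPORTED_METHODS.get(scheme, set())


print(is_supported("s3a://my-s3-bucket", "secret-key"))  # True
print(is_supported("s3a://my-s3-bucket", "keyfile"))     # False
```

A check of this kind could be run before a service is started, so that a mismatched combination is detected early.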

Example: Property source.authentication.method

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: keyfile

Profile

StorageConnector can use a profile for connecting to the storage system.

🔑 Property source.authentication.profile

The profile name.

Tooltip

The named profiles are stored in profile files (e.g. ~/.aws/credentials).

required

The property can only be specified and is required if source.authentication.method is profile.

StorageConnector loads the specified profile from the profile file (e.g. ~/.aws/credentials) and uses it for authentication.

Example: Property source.authentication.profile

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: profile
        profile: ${aws.s3.profile}

Secret key

StorageConnector can use a secret key (static credentials) for connecting to the storage system.

🔑 Property source.authentication.access-key

The access key.

required

The property can only be specified and is required if source.authentication.method is secret-key.

StorageConnector uses the specified access key and secret key for authentication.

Note

The secret key is specified in source.authentication.secret-key.

Example: Property source.authentication.access-key

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: secret-key
        access-key: ${aws.s3.access-key}
        secret-key: ${aws.s3.secret-key}

🔑 Property source.authentication.secret-key

The secret key.

required

The property can only be specified and is required if source.authentication.method is secret-key.

StorageConnector uses the specified access key and secret key for authentication.

Note

The access key is specified in source.authentication.access-key.

Example: Property source.authentication.secret-key

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: secret-key
        access-key: ${aws.s3.access-key}
        secret-key: ${aws.s3.secret-key}

Keyfile

StorageConnector can use a JSON keyfile for connecting to the storage system.

🔑 Property source.authentication.keyfile

The absolute or relative path to the JSON keyfile.

required

The property can only be specified and is required if source.authentication.method is keyfile.

StorageConnector loads the specified JSON keyfile and uses it for authentication.

Example: Property source.authentication.keyfile

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: keyfile
        keyfile: /home/connector/gs-sa-key.json

OAuth 2.0

StorageConnector can use OAuth 2.0 authentication for connecting to the storage system. It supports non-interactive (machine-to-machine) grants to obtain an access token as a client application or an ID token as an end-user.

🔑 Property source.authentication.provider-url

The URL of the identity provider that supports OAuth 2.0.

required

The property can only be specified and is required if source.authentication.method is oauth.

StorageConnector uses the provider URL and the client ID to contact the identity provider.

Note

The client ID is specified in source.authentication.client-id.

Example: Property source.authentication.provider-url

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1

🔑 Property source.authentication.client-id

The OAuth 2.0 client ID.

required

The property can only be specified and is required if source.authentication.method is oauth.

StorageConnector uses the provider URL and the client ID to contact the identity provider.

Note

The provider URL is specified in source.authentication.provider-url.

Example: Property source.authentication.client-id

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e

🔑 Property source.authentication.credentials.type

The credentials type.

required

The property can only be specified and is required if source.authentication.method is oauth.

Depending on the storage system, StorageConnector supports the following credentials types to obtain an access or ID token:

URL scheme | Credentials type                      | type
abfss      | Client credentials with client secret | client-secret
abfss      | Resource owner password credentials   | password

Example: Property source.authentication.credentials.type

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: password

Client credentials with client secret

The OAuth 2.0 client credentials grant with a client secret is a non-interactive (machine to machine) authentication. StorageConnector authenticates as a client application, rather than as an end-user, to obtain an access token.

🔑 Property source.authentication.credentials.client-secret

The client secret.

required

The property can only be specified and is required if source.authentication.credentials.type is client-secret.

StorageConnector uses the specified client secret to obtain an access token from the identity provider.

Example: Property source.authentication.credentials.client-secret

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: client-secret
          client-secret: ${aws.s3.client-secret}

Resource owner password credentials

The OAuth 2.0 resource owner password credentials (ROPC) grant with a username and password is a non-interactive (machine to machine) authentication. StorageConnector authenticates as an end-user and uses the OpenID Connect (OIDC) authentication layer to obtain an ID token.

Attention

If the identity provider requires multi-factor authentication (MFA), and therefore user interaction, resource owner password credentials (ROPC) are not a suitable machine-to-machine authentication method. Use a non-interactive authentication instead (e.g. client credentials with client secret).

🔑 Property source.authentication.credentials.username

The username.

Note

Typically, the username is an e-mail address.

required

The property can only be specified and is required if source.authentication.credentials.type is password.

StorageConnector uses the specified username and password to obtain an ID token from the identity provider.

Note

The password is specified in source.authentication.credentials.password.

Example: Property source.authentication.credentials.username

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: password
          username: ${aws.s3.username}
          password: ${aws.s3.password}

🔑 Property source.authentication.credentials.password

The password.

required

The property can only be specified and is required if source.authentication.credentials.type is password.

StorageConnector uses the specified username and password to obtain an ID token from the identity provider.

Note

The username is specified in source.authentication.credentials.username.

Example: Property source.authentication.credentials.password

services:
  MyService:
    type: StorageConnector
    source:
      authentication:
        method: oauth
        provider-url: https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
        client-id: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: password
          username: ${aws.s3.username}
          password: ${aws.s3.password}

Options

🔑 Property ingestion.options.read

The mode that determines which files to read, based on when they were created or last modified.

optional

The default is all.

StorageConnector supports the following read modes:

read            | Description
all             | Read all files, regardless of when they were modified.
latest          | Read the latest modified file only.
since-last-run  | Read the files modified since the last run.
since-timestamp | Read the files modified since the specified timestamp.
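
To make the four read modes concrete, here is a minimal Python sketch of how a file list could be narrowed down by mode. It is an illustration under assumptions, not the connector's logic: `FileInfo` and `select_files` are hypothetical names, the "since" comparison is assumed to be inclusive, "latest" is assumed to apply to the whole file list, and a missing threshold is assumed to mean "read everything".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    path: str
    modified_ms: int  # last-modified time in epoch milliseconds


def select_files(
    files: list[FileInfo],
    read: str = "all",
    timestamp_ms: Optional[int] = None,
    last_run_ms: Optional[int] = None,
) -> list[FileInfo]:
    """Return the files to read for the given read mode."""
    if read == "all":
        return list(files)
    if read == "latest":
        return [max(files, key=lambda f: f.modified_ms)] if files else []
    if read == "since-last-run":
        threshold = last_run_ms
    elif read == "since-timestamp":
        threshold = timestamp_ms
    else:
        raise ValueError(f"unknown read mode: {read}")
    if threshold is None:  # assumption: no threshold means no limit
        return list(files)
    return [f for f in files if f.modified_ms >= threshold]


files = [FileInfo("a.csv", 100), FileInfo("b.csv", 200), FileInfo("c.csv", 300)]
print([f.path for f in select_files(files, "latest")])                             # ['c.csv']
print([f.path for f in select_files(files, "since-timestamp", timestamp_ms=200)])  # ['b.csv', 'c.csv']
```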

Example: Property ingestion.options.read

services:
  MyService:
    type: StorageConnector
    ingestion:
      options:
        read: all

🔑 Property ingestion.options.timestamp

The timestamp that limits which files to read.

optional

The default is null (no limit).

If the read mode is since-timestamp, StorageConnector reads the files modified since the specified timestamp. If the timestamp is null, StorageConnector reads all files, regardless of when they were modified.

The timestamp can be specified in the following formats:

Format             | Description                                                               | Example
Epoch milliseconds | Number of milliseconds since the Unix epoch (1970-01-01T00:00:00.000).    | 1752512124375
ISO 8601           | ISO 8601 representation, with millisecond precision, without a time zone. | 2025-07-14T16:55:24.375
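
The two formats can be told apart by whether the value consists of digits only. The following Python sketch shows one way to parse both into the same representation; `parse_timestamp` is a hypothetical helper, and reading the ISO 8601 value as UTC is an assumption (the two example values above denote the same instant only under that reading).

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # Unix epoch as a naive datetime, read as UTC


def parse_timestamp(value) -> datetime:
    """Parse epoch milliseconds or an ISO 8601 string without a time zone."""
    text = str(value).strip()
    if text.isdigit():
        # Integer arithmetic avoids floating-point rounding of the milliseconds.
        return EPOCH + timedelta(milliseconds=int(text))
    return datetime.fromisoformat(text)


print(parse_timestamp(1752512124375))  # 2025-07-14 16:55:24.375000
print(parse_timestamp(1752512124375) == parse_timestamp("2025-07-14T16:55:24.375"))  # True
```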

Example: Property ingestion.options.timestamp

services:
  MyService:
    type: StorageConnector
    ingestion:
      options:
        read: since-timestamp
        timestamp: 2025-07-14T16:55:24.375

Directories

StorageConnector extracts directories from the storage system and transforms them to Collection assets.

Directory filters define which directories to extract and specify their transformation options.

🔑 Property ingestion.directories

The ordered list of directory filters that specify which directories to extract and their transformation options.

optional

The default is null (don't extract any directories).

For each directory in the storage system, StorageConnector evaluates the directory filters in their declaration order - from top to bottom. The first directory filter that matches is applied - the remaining directory filters are ignored.

A directory matches a directory filter if the directory path starts with one of the directory filter's path patterns. In this case, the directory is extracted and transformed to Collection assets using the transformation options of the directory filter. The files of the directory are extracted using the file filters nested in the applied directory filter.

Attention

Notice that a directory matches a directory filter, if the directory path starts with one of the directory filter's path patterns. Since all directories in the subdirectory tree also start with that path pattern, these recursive subdirectories - and consequently their files - are also considered for extraction.

Note

If the list of directory filters is null or empty, no directories are extracted.

Example: Property ingestion.directories

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # extract directories 'sales/annual' and 'finance'
        - paths:
            - sales/annual
            - finance
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
        # extract directory 'product'
        - paths:
            - product
          files:
            # extract csv files
            - type: csv

Tooltip

If the list of directory filters is a single-value list, it can be formatted as a single value, rather than as a list with a single value.

🔑 Property ingestion.directories[].paths

The list of path patterns applied to filter directories based on their directory path.

Note

A path pattern may contain the wildcard character *, matching any directory name in the path.

required

The property is required for each directory filter.

StorageConnector evaluates the path patterns in their declaration order - from top to bottom - to determine whether to extract the directory. A directory matches a path pattern, if the directory path starts with the path pattern.
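
The "starts with" rule and the wildcard can be sketched as follows. This Python example assumes that matching is done per path segment (so the pattern `sales` would not match a directory named `sales_data`); the documentation does not state this, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional


def matches_pattern(directory: str, pattern: str) -> bool:
    """Check whether the leading segments of the directory path match the pattern."""
    dir_parts = [p for p in directory.split("/") if p]
    pattern_parts = [p for p in pattern.split("/") if p]
    if len(dir_parts) < len(pattern_parts):
        return False  # the path is shorter than the pattern, so it cannot start with it
    # zip() stops at the end of the pattern, which makes this a prefix comparison.
    return all(fnmatchcase(d, p) for d, p in zip(dir_parts, pattern_parts))


def first_matching_pattern(directory: str, patterns: list[str]) -> Optional[str]:
    """Evaluate the patterns in declaration order and return the first match."""
    return next((p for p in patterns if matches_pattern(directory, p)), None)


patterns = ["sales/annual", "finance/*"]
print(first_matching_pattern("sales/annual/2023", patterns))  # sales/annual
print(first_matching_pattern("finance/2023/q1", patterns))    # finance/*
print(first_matching_pattern("finance", patterns))            # None
```

Because only the leading segments are compared, every directory below a matching directory matches as well, which is the recursive behavior described for directory filters.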

Partitioning

If the file type is partitioned, the first path pattern that matches the extracted directory is also the base directory for extracting partitioned metadata. The subdirectories located in the base directory are considered to be object directories containing partition columns and files. Each object directory is transformed to a UmlClass and the partition columns and files extracted from the object's subdirectory tree are merged into this single UmlClass asset.

Example: Property ingestion.directories[].paths

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # extract directory 'sales/annual' and directories matching 'finance/*'
        - paths:
            - sales/annual
            - finance/*

🔑 Property ingestion.directories[].deployment

The deployment of the transformed assets.

optional

The default is null (no deployment).

For each UmlClass asset transformed from the extracted files in the directory, StorageConnector creates a Deployment link and sets the specified deployment system, favorite flag, and qualifier.

Example: Property ingestion.directories[].deployment

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        # extract directory 'sales'
        - paths:
            - sales
          # transform the assets with deployments to '/Systems/Sales/ARS'
          deployment:
            deployed-in: /Systems/Sales/ARS
            favorite: true
            qualifier: SPOT
          files:
            # extract csv files
            - type: csv

🔑 Property ingestion.directories[].deployment.deployed-in

The system in which the asset is deployed for execution or storage purposes.

required

The property is required if ingestion.directories[].deployment is specified.

🔑 Property ingestion.directories[].deployment.favorite

The flag that specifies if the deployment should be marked as favorite.

optional

The default is false (not favorite).

🔑 Property ingestion.directories[].deployment.qualifier

The additional qualifier characterizing the deployment.

optional

The default is null (no qualifier).

Files

StorageConnector extracts files from the storage system and transforms them to UmlClass, UmlAttribute, and UmlDatatype assets.

File filters are nested in directory filters and define which files to extract and specify their transformation options.

🔑 Property ingestion.directories[].files

The ordered list of file filters that specify which files to extract and their transformation options.

required

The property is required for each directory filter.

For each file in the extracted directory, StorageConnector evaluates the file filters in their declaration order - from top to bottom. The first file filter that matches is applied - the remaining file filters are ignored.

A file matches a file filter, if the file extension matches the file filter's extensions. In this case, the file is extracted and transformed to UmlClass, UmlAttribute, and UmlDatatype assets using the transformation options of the file filter. Partitioned metadata is extracted using the partition filter nested in the applied file filter.

Partitioning

If the file type is partitioned, each object directory is determined by the matching path pattern and transformed to a UmlClass. The extracted files in the object's subdirectory tree are merged into this single UmlClass asset - rather than creating a separate UmlClass asset for each file - and the partition columns are extracted from the directory structure.

Note

If the list of file filters is null or empty, no files are extracted.

Example: Property ingestion.directories[].files

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
            - finance
          files:
            # extract unstructured files
            - type: unstructured

Tooltip

If the list of file filters is a single-value list, it can be formatted as a single value, rather than as a list with a single value.

🔑 Property ingestion.directories[].files[].type

The file type to extract.

required

The property is required for each file filter.

If a file matches the file filter, StorageConnector extracts the metadata from the file as specified by the file type. The following file types are supported:

type         | Description
parquet      | Extract metadata from the metadata section in Parquet files. The file is transformed to UmlClass, UmlAttribute, and UmlDatatype assets.
csv          | Extract metadata from the header record in CSV files. The file is transformed to UmlClass and UmlAttribute assets.
unstructured | Extract metadata from unstructured files. The file is transformed to a UmlClass asset.

Tooltip

The file type determines the default file extensions to process.

Example: Property ingestion.directories[].files[].type

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
            - finance
          files:
            # extract parquet and csv files
            - type: parquet
            - type: csv

🔑 Property ingestion.directories[].files[].extensions

The list of file extensions to extract.

optional

The default is null (use the default file extensions of the file type).

If file extensions are defined, StorageConnector uses them to match the extension of each file to determine whether to extract the file. Otherwise, the default file extensions of the file type are used.

The default file extensions are:

type         | Default
parquet      | [parquet, pqt]
csv          | [csv]
unstructured | (all extensions)
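
The fallback from explicit extensions to the defaults of the file type can be sketched in Python. The names `DEFAULT_EXTENSIONS` and `file_matches` are hypothetical, and the case-insensitive comparison is an assumption that the documentation does not confirm.

```python
from pathlib import PurePosixPath
from typing import Optional

# Default extensions per file type; None stands for "all extensions".
DEFAULT_EXTENSIONS = {
    "parquet": ["parquet", "pqt"],
    "csv": ["csv"],
    "unstructured": None,
}


def file_matches(file_name: str, file_type: str, extensions: Optional[list[str]] = None) -> bool:
    """Check a file name against explicit extensions or the defaults of the file type."""
    allowed = extensions if extensions else DEFAULT_EXTENSIONS[file_type]
    if allowed is None:
        return True  # every extension is accepted
    extension = PurePosixPath(file_name).suffix.lstrip(".").lower()
    return extension in {e.lower() for e in allowed}  # assumption: case-insensitive


print(file_matches("sales-0001.pqt", "parquet"))                # True
print(file_matches("report.pdf", "csv"))                        # False
print(file_matches("notes.txt", "unstructured", ["docx", "pdf"]))  # False
```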

Example: Property ingestion.directories[].files[].extensions

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - documents
          files:
            # extract 'docx' and 'pdf' files
            - type: unstructured
              extensions:
                - docx
                - pdf

🔑 Property ingestion.directories[].files[].stereotype

The stereotype of the transformed UmlClass asset.

optional

The default is file.

If a stereotype is defined, StorageConnector sets the stereotype of the transformed UmlClass asset.

Example: Property ingestion.directories[].files[].stereotype

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract parquet files and transform them with stereotype 'parquet'
            - type: parquet
              stereotype: parquet

Tooltip

If the specified stereotype doesn't exist in the scheme of the destination application, the stereotype is ignored.

🔑 Property ingestion.directories[].files[].delimiter

The delimiter character separating the record values in CSV files.

Tooltip

While comma is the canonical delimiter character in CSV (per RFC 4180), other delimiter characters (e.g. ;, \t, |) may also be used to accommodate regional conventions or to avoid collisions with the actual record values.

optional

The property can only be specified if ingestion.directories[].files[].type is csv.
The default is , (comma).

StorageConnector uses the specified delimiter character to parse the CSV file.

Example: Property ingestion.directories[].files[].delimiter

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract csv files using the delimiter character ';'
            - type: csv
              delimiter: ;

Partition

StorageConnector extracts partition columns from the subdirectory structure and transforms them to UmlAttribute assets.

Partition filters are nested in file filters and define which partition columns to extract and specify their transformation options.

🔑 Property ingestion.directories[].files[].partition

The partition filter that specifies that the file type is partitioned.

Note

If the file type is partitioned, each object directory is determined by the matching path pattern. The partition columns - encoded in the subdirectory structure - and files in the object subdirectory tree are considered to belong to the same object.

optional

The default is null (the file type is not partitioned).

Each object directory - determined by the matching path pattern - is extracted and transformed to a UmlClass asset. The files in the object's subdirectory tree are merged into this single UmlClass asset - rather than creating a separate UmlClass asset for each file.

For each subdirectory, starting from the object directory, StorageConnector evaluates the partition filter to identify the partition columns. A subdirectory is a partition column, if the subdirectory name contains the delimiter character. In this case, the partition column is extracted and transformed to a UmlAttribute asset using the transformation options of the partition filter.
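
The merge of partitioned files into one object per object directory can be modelled roughly as follows. This Python sketch is a simplified reading of the description above, not the connector's algorithm: it assumes that each direct subdirectory of the base directory is an object directory, and `group_by_object` is a hypothetical name.

```python
from collections import defaultdict


def group_by_object(base: str, file_paths: list[str], delimiter: str = "=") -> dict:
    """Group files by object directory and collect the partition column keys."""
    objects = defaultdict(lambda: {"columns": [], "files": []})
    prefix = base.rstrip("/") + "/"
    for path in file_paths:
        if not path.startswith(prefix):
            continue  # outside the base directory
        parts = path[len(prefix):].split("/")
        if len(parts) < 2:
            continue  # the file sits directly in the base directory
        object_dir, *subdirs, _file_name = parts
        entry = objects[object_dir]
        entry["files"].append(path)
        for name in subdirs:
            key, sep, _value = name.partition(delimiter)
            # A subdirectory is a partition column if its name contains the delimiter.
            if sep and key not in entry["columns"]:
                entry["columns"].append(key)
    return dict(objects)


paths = [
    "warehouse/sales_data/year=2023/month=01/sales-0001.parquet",
    "warehouse/sales_data/year=2023/month=01/sales-0002.parquet",
    "warehouse/sales_data/year=2023/month=02/sales-0001.parquet",
]
result = group_by_object("warehouse", paths)
print(result["sales_data"]["columns"])     # ['year', 'month']
print(len(result["sales_data"]["files"]))  # 3
```

The three files end up in a single entry, which corresponds to one UmlClass asset with the partition columns `year` and `month` as additional attributes.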

Example: Property ingestion.directories[].files[].partition

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =

🔑 Property ingestion.directories[].files[].partition.delimiter

The delimiter character separating the partition column key from its value.

optional

The default is null (don't extract partition columns).

If a delimiter character is defined, StorageConnector uses it to identify the partition columns in the subdirectory structure. Otherwise, partition columns are not extracted from the subdirectory structure.

Example: Property ingestion.directories[].files[].partition.delimiter

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =

🔑 Property ingestion.directories[].files[].partition.stereotype

The stereotype of the transformed UmlAttribute asset.

optional

The default is null (no stereotype).

If a stereotype is defined, StorageConnector sets the stereotype of the transformed UmlAttribute asset.

Example: Property ingestion.directories[].files[].partition.stereotype

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
                # transform the assets with stereotype 'column'
                stereotype: column

Tooltip

If the specified stereotype doesn't exist in the scheme of the destination application, the stereotype is ignored.

🔑 Property ingestion.directories[].files[].partition.columns

The list of additional partition columns of the file type.

optional

The default is null (no additional partition columns).

If additional partition columns are defined, StorageConnector transforms them to UmlAttribute and UmlDatatype assets.

Note

The list of additional partition columns allows partition columns to be defined manually, in case the subdirectory structure doesn't reflect the column keys.

Example: Parquet dataset of sales data partitioned by year and month without column keys

s3a://my-s3-bucket/sales_data/
├── 2023/
│ ├── 01/
│ │ ├── sales-0001.parquet
│ │ └── sales-0002.parquet
│ ├── 02/
│ │ └── sales-0001.parquet

Example: Property ingestion.directories[].files[].partition.columns

services:
  MyService:
    type: StorageConnector
    ingestion:
      directories:
        - paths:
            - sales
          files:
            # extract partitioned parquet files
            - type: parquet
              partition:
                delimiter: =
                # additional partition columns 'year' and 'category'
                columns:
                  - key: year
                    datatype: DATE
                  - key: category

🔑 Property ingestion.directories[].files[].partition.columns[].key

The key of the additional partition column.

required

The property is required for each additional partition column.

StorageConnector transforms the key to a UmlAttribute asset.

🔑 Property ingestion.directories[].files[].partition.columns[].datatype

The datatype of the additional partition column.

optional

The default is null (no datatype).

If a datatype is defined, StorageConnector transforms it to a UmlDatatype asset.