最新レポート「エンタープライズ AI と最新のデータアーキテクチャをめぐる状況」

ダウンロードする
  • Cloudera Cloudera
  • | テクニカル

    Snowflake でCloudera Data PlatformのCDW Iceberg テーブルをアクセス

    Cloudera Author Profile Picture
    Modern open office space

    Cloudera テクニカルシリーズ


    目的:

    Cloudera Data PlatformのCDWに、既にIcebergテーブルに保存したデータがあります。Snowflake からアクセスしてみたいです。

    1) アーキテクチャ

    2) 新規作成

    2.1) Snowflake の外部Volumeを新規作成

    Step 1: Configure access permissions for the S3 bucket

    Create a policy in AWS IAM.

    I created a policy called zzeng-Snowflake-ext :

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion"
                ],
                "Resource": "arn:aws:s3:::<my-bucket>/data/zzeng/*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation"
                ],
                "Resource": "arn:aws:s3:::<my-bucket>",
                "Condition": {
                    "StringLike": {
                        "s3:prefix": [
                            "data/zzeng/*"
                        ]
                    }
                }
            }
        ]
    }
    

    AWS Console の設定:

    Step 2: AWSのIAMロールを作成

    zzeng-Snowflake-ext-role:

    作成した後、ARNを控える

    Step 3: SSE-KMS暗号化のIAMロール (optional)

    今回暗号化しないので、SKIP

    Step 4: Snowflakeの外部Volumeを作成

    Snowflake のSQL実行:

    CREATE OR REPLACE EXTERNAL VOLUME extIcebergVolC
       STORAGE_LOCATIONS =
          (
             (
                NAME = 'zzeng-iceberg-se-s3-ap-northeast-1'
                STORAGE_PROVIDER = 'S3'
                STORAGE_BASE_URL = 's3://<my-bucket>/data/zzeng/'
                STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<my-AWS-id*****>:role/zzeng-Snowflake-ext-role'
             )
          );
          ```
    
    
    ### Step 5: Snowflake accountとAWS IAM userを紐づけ
    
    Snowflake で下記SQLを実行:
    
    ```sql
    DESC EXTERNAL VOLUME extIcebergVolC;
    

    実行結果:

    zzeng#COMPUTE_WH@ZZENG.PUBLIC>DESC EXTERNAL VOLUME extIcebergVolC;
    +-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
    | parent_property   | property           | property_type | property_value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | property_default |
    |-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------|
    |                   | ALLOW_WRITES       | Boolean       | true                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | true             |
    | STORAGE_LOCATIONS | STORAGE_LOCATION_1 | String        | {"NAME":"zzeng-iceberg-se-s3-ap-northeast-1","STORAGE_PROVIDER":"S3","STORAGE_BASE_URL":"s3://<my-bucket-id>/data/zzeng/","STORAGE_ALLOWED_LOCATIONS":["s3://<my-bucket-id>/data/zzeng/*"],"STORAGE_REGION":"us-east-2","PRIVILEGES_VERIFIED":true,"STORAGE_AWS_ROLE_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_IAM_USER_ARN":"<STORAGE_AWS_ROLE_ARN>","STORAGE_AWS_EXTERNAL_ID":"<a long string for STORAGE_AWS_EXTERNAL_ID>","ENCRYPTION_TYPE":"NONE","ENCRYPTION_KMS_KEY_ID":""} |                  |
    | STORAGE_LOCATIONS | ACTIVE             | String        | zzeng-iceberg-se-s3-ap-northeast-1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                  |
    +-------------------+--------------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
    3 Row(s) produced. Time Elapsed: 0.949s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>
    

    STORAGE_AWS_IAM_USER_ARN と STORAGE_AWS_EXTERNAL_IDを控える

    Step 6: Object StoreのBucketeアクセスのIAM user権限を設定

    前のStep 5で貰った値で、AWSのIAM RoleのTrustshipを編集。

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "",
          "Effect": "Allow",
          "Principal": {
            "AWS": "<snowflake_user_arn>"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<snowflake_external_id>"
            }
          }
        }
      ]
    }
    

    2.2) CDWで新しいIcebergテーブルを作成

    下記S3フォルダで作成する予定。

    s3a://${my-test-bucket}/data/${user_id}/airlines/airlines
    

    HUE画面で、下記SQLを実行

    CREATE DATABASE ${user_id}_airlines_ice;
    drop table if exists ${user_id}_airlines_ice.airlines;
    
    CREATE EXTERNAL TABLE ${user_id}_airlines_ice.airlines (code string, description string) 
    STORED BY ICEBERG
    STORED AS PARQUET
    LOCATION 's3a://${cdp_env_bucket}/data/${user_id}/airlines/airlines'
    tblproperties("format-version"="2",'external.table.purge'='true');
    
    INSERT INTO ${user_id}_airlines_ice.airlines
      SELECT * FROM ${user_id}_airlines_csv.airlines_csv;
      
      
    select * from ${user_id}_airlines_ice.airlines;
      
    select count(*) from ${user_id}_airlines_ice.airlines;
    

    SQLを実行したら下記フォルダが出来上がる。

    Metadataの情報をチェック:

    SHOW CREATE TABLE airlines;
    

    実行結果:

    CREATE EXTERNAL TABLE `zzeng_airlines_ice`.`airlines`(
      `code` string, 
      `description` string)
    ROW FORMAT SERDE 
      'org.apache.iceberg.mr.hive.HiveIcebergSerDe' 
    STORED BY 
      'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
    
    LOCATION
        's3a://<my-bucket-name>/data/zzeng/airlines/airlines'
      TBLPROPERTIES (
        'bucketing_version'='2', 
        'current-schema'='{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"code","required":false,"type":"string"},{"id":2,"name":"description","required":false,"type":"string"}]}', 
        'current-snapshot-id'='6686807318178502869', 
        'current-snapshot-summary'='{"added-data-files":"1","added-records":"1491","added-files-size":"25686","changed-partition-count":"1","total-records":"1491","total-files-size":"25686","total-data-files":"1","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0"}', 
        'current-snapshot-timestamp-ms'='1710324490640', 
        'external.table.purge'='true', 
        'format-version'='2', 
        'iceberg.orc.files.only'='false', 
        'metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json', 
        'previous_metadata_location'='s3a://<my-bucket-name>/data/zzeng/airlines/airlines/metadata/00000-9179e76f-369b-47ca-b01f-20e6bffd36a5.metadata.json', 
        'serialization.format'='1', 
        'snapshot-count'='1', 
        'table_type'='ICEBERG', 
        'transient_lastDdlTime'='1710324491', 
        'uuid'='1a86667c-6c7c-4318-a976-ba7fd49c13b0', 
        'write.delete.mode'='merge-on-read', 
        'write.format.default'='parquet', 
        'write.merge.mode'='merge-on-read', 
        'write.update.mode'='merge-on-read')
    

    2.3) Snowflake でIceberg Table を作成し、先ほどのCDW Iceberg Tableをアクセス

    SQL

    CREATE OR REPLACE ICEBERG TABLE airlines
      CATALOG='zzengIcebergCatalogInt'
      EXTERNAL_VOLUME='extIcebergVolC'
      BASE_LOCATION='airlines/airlines'
      METADATA_FILE_PATH='metadata/00001-7e28a998-42f9-4466-8884-32d450af5c85.metadata.json'
    ;
    

    3) 結果チェック

    In Snowflake:

    zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
    +----------+
    | COUNT(*) |
    |----------|
    |     1491 |
    +----------+
    1 Row(s) produced. Time Elapsed: 0.393s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>select * from AIRLINES limit 3;
    +------+--------------------+
    | CODE | DESCRIPTION        |
    |------+--------------------|
    | 02Q  | Titan Airways      |
    | 04Q  | Tradewind Aviation |
    | 05Q  | "Comlux Aviation   |
    +------+--------------------+
    3 Row(s) produced. Time Elapsed: 4.705s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>
    

    In CDW Hive HUE:

    select count(*) from ${user_id}_airlines_ice.airlines;
    

    4) データ変更

    4.1) データ追加

    現時点、SnowflakeのIcebergカタログを使わず、外部のIcebergテーブルを利用する場合、Insertできないです。
    Reference URL:

    こちらの説明では、

    You can use INSERT and UPDATE statements to modify an Iceberg table that uses Snowflake as the catalog.

    Snowflake をIcebergのカタログとして利用する時はINSERT/UPDATEできますが、Catalog Integrationの場合は更新できません。

    実際やってみました。

    zzeng#COMPUTE_WH@ZZENG.PUBLIC>INSERT INTO airlines (code, description) VALUES
                                       ('A1', 'Airline 1 Description'),
                                       ('A2', 'Airline 2 Description'),
                                       ('A3', 'Airline 3 Description'),
                                       ('A4', 'Airline 4 Description'),
                                       ('A5', 'Airline 5 Description'),
                                       ('A6', 'Airline 6 Description'),
                                       ('A7', 'Airline 7 Description'),
                                       ('A8', 'Airline 8 Description'),
                                       ('A9', 'Airline 9 Description'),
                                       ('A10', 'Airline 10 Description'),
                                       ('A11', 'Airline 11 Description'),
                                       ('A12', 'Airline 12 Description'),
                                       ('A13', 'Airline 13 Description'),
                                       ('A14', 'Airline 14 Description'),
                                       ('A15', 'Airline 15 Description'),
                                       ('A16', 'Airline 16 Description'),
                                       ('A17', 'Airline 17 Description'),
                                       ('A18', 'Airline 18 Description'),
                                       ('A19', 'Airline 19 Description'),
                                       ('A20', 'Airline 20 Description');
    091357 (42601): SQL Compilation error: Iceberg table AIRLINES with an external catalog integration is a read-only table and cannot be modified
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>
    

    エラーです。

    CDWのHiveでやってみると、

    INSERT INTO `zzeng_airlines_ice`.`airlines` (code, description) VALUES
    ('A1', 'Airline 1 Description'),
    ('A2', 'Airline 2 Description'),
    ('A3', 'Airline 3 Description'),
    ('A4', 'Airline 4 Description'),
    ('A5', 'Airline 5 Description'),
    ('A6', 'Airline 6 Description'),
    ('A7', 'Airline 7 Description'),
    ('A8', 'Airline 8 Description'),
    ('A9', 'Airline 9 Description'),
    ('A10', 'Airline 10 Description'),
    ('A11', 'Airline 11 Description'),
    ('A12', 'Airline 12 Description'),
    ('A13', 'Airline 13 Description'),
    ('A14', 'Airline 14 Description'),
    ('A15', 'Airline 15 Description'),
    ('A16', 'Airline 16 Description'),
    ('A17', 'Airline 17 Description'),
    ('A18', 'Airline 18 Description'),
    ('A19', 'Airline 19 Description'),
    ('A20', 'Airline 20 Description');
    

    成功。

    4.2) CDWでIcebergテーブルを更新したら、Snowflake 側に反映

    現在の課題:Hive側でIcebergテーブルを更新しましたが、Snowflake 側は反映されていない。

    なぜ?

    Icebergを更新したら、Metadataが変わりました。
    Snowflake側はそれを検知していないから、前の世代のMetadataを見ていた。

    zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
    +----------+
    | COUNT(*) |
    |----------|
    |     1491 |
    +----------+
    1 Row(s) produced. Time Elapsed: 0.678s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>
    

    Snowflake 公式ドキュメント:

    In Snowflake, The metadata files do not identify the most recent snapshot of an Iceberg table.

    Ref URL: https://docs.snowflake.com/en/user-guide/tables-iceberg

    手動でSnowflake のMetadata設定変更。

    まずは新しいMetadataのファイル名を確認。

    これでALTER文作成:

    ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
    

    更新、動作確認結果:

    zzeng#COMPUTE_WH@ZZENG.PUBLIC>ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
    +----------------------------------+
    | status                           |
    |----------------------------------|
    | Statement executed successfully. |
    +----------------------------------+
    1 Row(s) produced. Time Elapsed: 10.199s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
    +----------+
    | COUNT(*) |
    |----------|
    |     1511 |
    +----------+
    1 Row(s) produced. Time Elapsed: 0.204s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>
    

    5) 制約事項

    https://docs.snowflake.com/en/user-guide/tables-iceberg

    Copied from Snowflake documents:

    Metadata files

    zzeng#COMPUTE_WH@ZZENG.PUBLIC>ALTER ICEBERG TABLE AIRLINES REFRESH 'metadata/00002-c33ae888-1af3-4f64-830b-eac9e0a95983.metadata.json';
    +----------------------------------+
    | status                           |
    |----------------------------------|
    | Statement executed successfully. |
    +----------------------------------+
    1 Row(s) produced. Time Elapsed: 10.199s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>select count(*) from AIRLINES;
    +----------+
    | COUNT(*) |
    |----------|
    |     1511 |
    +----------+
    1 Row(s) produced. Time Elapsed: 0.204s
    zzeng#COMPUTE_WH@ZZENG.PUBLIC>
    
    The metadata files do not identify the most recent snapshot of an Iceberg table.
    
    You cannot modify the location of the data files or snapshot using the ALTER ICEBERG TABLE command. To modify either of these settings, you must recreate the table (using the CREATE OR REPLACE ICEBERG TABLE syntax).
    

    Your form submission has failed.

    This may have been caused by one of the following:

    • Your request timed out
    • A plugin/browser extension blocked the submission. If you have an ad blocking plugin please disable it and close this message to reload the page.