Publishing Packages

Warning

This section hasn’t been updated recently and is probably of historical value only.

The metasync program can build multiple package types and upload them to an S3 bucket. Typical usage is:

$ metasync -c -e -f -z -s s3://library.metatab.org

With these options, the metasync program will create Excel, ZIP and Filesystem packages and store them in the S3 bucket library.metatab.org. In this case, the Filesystem package is not created in the local filesystem, but only in S3. (A Filesystem package is basically what you get after unzipping a ZIP package.)

Because generating all of the packages and uploading them to S3 is common, the metasync -S option is a shortcut for generating all package types and uploading them:

$ metasync -S s3://library.metatab.org

Currently, metasync will only write packages to S3. For S3 access, metasync uses boto3, so refer to the boto3 credentials documentation for instructions on setting your S3 access key and secret.

One important side effect of the metasync program is that it adds Distribution terms to the main metadata.csv file before creating the packages, so every package the program syncs includes references to the S3 locations of all of the packages. For instance, the example invocation above will add these Distribution terms:

Distribution    http://s3.amazonaws.com/library.metatab.org/simple_example-2017-us-1.xlsx
Distribution    http://s3.amazonaws.com/library.metatab.org/simple_example-2017-us-1.zip
Distribution    http://s3.amazonaws.com/library.metatab.org/simple_example-2017-us-1/metadata.csv

These Distribution terms are valuable documentation, but they are also required for the metakan program to create entries for the package in CKAN.
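The layout of these URLs follows mechanically from the bucket name and the package name. The following Python sketch shows that structure; the build_distributions() helper is hypothetical, for illustration only, and is not part of metasync:

```python
# Sketch: how Distribution URLs are composed from a bucket name and a
# package name. Hypothetical helper, not metasync's actual code.

def build_distributions(bucket, package_name):
    """Return the S3 URLs that become Distribution terms."""
    base = "http://s3.amazonaws.com/{}/{}".format(bucket, package_name)
    return [
        base + ".xlsx",          # Excel package
        base + ".zip",           # ZIP package
        base + "/metadata.csv",  # Filesystem package entry point
    ]

urls = build_distributions("library.metatab.org", "simple_example-2017-us-1")
for u in urls:
    print(u)
```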

Adding Packages to CKAN

The metakan program reads a Metatab file, creates a dataset in CKAN, and adds resources to the CKAN entry based on the Distribution terms in the Metatab data. For instance, with a localhost CKAN server, and the metadata file from the “Publishing Packages” section example:

$ metakan  --ckan http://localhost:32768/ --api f1f45...e9a9

This command would create a CKAN dataset with the metadata in the metadata.csv file in the current directory, reading the Distribution terms. It would add resources for simple_example-2017-us-1.xlsx and simple_example-2017-us-1.zip. For the simple_example-2017-us-1/metadata.csv entry, it would read the remote metadata.csv file, resolve the resource URLs, and create a resource entry in CKAN for the metadata.csv file and all of the resources referenced in the remote metadata.csv file.

Note that because part of the information in the CKAN dataset comes from the local metadata.csv file and part of the resources are discovered from the remote file, the two can easily become unsynchronized. For this reason, it is important to run the metasync program to create the Distribution terms before running the metakan program.
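Since Metatab files are plain CSV with the term in the first column and the value in the second, the Distribution terms metakan reads can be extracted with nothing but the standard library. This is an illustrative sketch of that layout, not metakan's actual code:

```python
# Sketch: extract Distribution terms from Metatab CSV text.
# Assumes the standard term-in-first-column, value-in-second-column
# layout; for illustration only, not metakan's implementation.
import csv
import io

def distributions(metatab_text):
    rows = csv.reader(io.StringIO(metatab_text))
    return [row[1] for row in rows
            if row and row[0].strip().lower() == 'distribution']

example = """Distribution,http://s3.amazonaws.com/library.metatab.org/simple_example-2017-us-1.xlsx
Distribution,http://s3.amazonaws.com/library.metatab.org/simple_example-2017-us-1.zip
"""

print(distributions(example))
```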

For an example of a CKAN entry generated by metakan, see http://data.sandiegodata.org/dataset/fns-usda-gov-f2s_census-2015-2

Publish to CKAN from S3

The metakan program can publish all of the CSV packages available in an S3 bucket by giving it an S3 url instead of a Metatab file. For instance, to publish all of the CSV packages in the library.metatab.org bucket, run:

$ metakan  --ckan http://localhost:32768/ --api f1f45...e9a9 s3://library.metatab.org

As with publishing a local Metatab file, the CSV packages in the S3 bucket may have Distribution terms that identify other packages that should also be published into the CKAN dataset.

Adding Packages to Data.World

The metaworld program publishes the package to Data.World. Only Excel and CSV packages are published, because ZIP packages would be disaggregated, conflicting with the CSV packages. The program is a bit buggy, and when creating a new package the server may return a 500 error. If it does, just re-run the program.

The metaworld program takes no options. To use it, you must install and configure the datadotworld Python package, which stores your credentials.

$ metaworld

Private Datasets

Datasets that should be protected from unauthorized access can be written to S3 with a private ACL and accessed using S3 credentials. To use private datasets:

  • Use the metaaws program to set up an S3 bucket with a policy and users.
  • Add a Root.Access term to the dataset’s Metatab document.
  • Synchronize the dataset to S3 with metasync.
  • Set up credentials for an S3 user.
  • Access the dataset using an S3 URL.

Set Up the S3 Bucket

Suppose we want to store datasets in a bucket named bucket.example.com. After creating the bucket, initialize it with subdirectories and policies using the metaaws program:

$ metaaws init-bucket bucket.example.com

Configure and Sync a Dataset

To make a dataset private, add a Root.Access term to the Root section, with a value of private.
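For example, assuming the usual Metatab term/value column layout (as in the Distribution listing earlier in this section), the Root section of metadata.csv would gain a line like:

```
Access          private
```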

Create S3 Users

Use the metaaws program to create users and add permissions to the bucket. First, initialize the bucket with the appropriate policies:

$ metaaws init-bucket bucket.example.com

Then, create a new user.

$ metaaws new-user foobar
Created user : foobar
arn          : arn:aws:iam::095555823111:user/metatab/foobar
Access Key   : AKIAJXMFAP3X5TRYYQ5Q
Secret Key   : b81zw4LRDKVILzrZbS0B8KMn88xbY9BEEnwzKrz2
The secret key and access key should be given to the user, to set up as described in the next section.

Set Up S3 Credentials

The access and secret keys should be stored in a boto configuration file, such as ~/.aws/credentials. See the boto3 configuration documentation for details. Here is an example of a credentials file:

[default]
aws_access_key_id = AKIAJXMFAP3X5TRYYQ5Q
aws_secret_access_key = b81zw4LRDKVILzrZbS0B8KMn88xbY9BEEnwzKrz2

If you have multiple credentials, you can put them in different sections by changing [default] to the name of another profile. For instance, here is a credentials file with a default and alternate profile:

[default]
aws_access_key_id = AKIAJXMFAP3X5TRYYQ5Q
aws_secret_access_key = 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
[fooprofile]
aws_access_key_id = AKIAX5TRYYQ5QJXMFAP3
aws_secret_access_key = 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
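Since boto3 reads ~/.aws/credentials as a standard INI file, the profiles can be inspected with Python's configparser. This sketch uses the dummy key values from the example above:

```python
# Sketch: ~/.aws/credentials is INI-formatted, so standard configparser
# can read its profiles. The keys are the dummy values from the docs.
import configparser

CREDENTIALS = """
[default]
aws_access_key_id = AKIAJXMFAP3X5TRYYQ5Q
aws_secret_access_key = b81zw4LRDKVILzrZbS0B8KMn88xbY9BEEnwzKrz2

[fooprofile]
aws_access_key_id = AKIAX5TRYYQ5QJXMFAP3
aws_secret_access_key = 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
"""

config = configparser.ConfigParser()
config.read_string(CREDENTIALS)

# Each section name is a profile, selectable with metasync -p
print(config['fooprofile']['aws_access_key_id'])
```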

To use the alternate credentials with the metasync program, use the -p option:

$ metasync -p fooprofile -S s3://library.metatab.org

To use the alternate credentials with the open_package() function, you will need to set them in the shell before you run any programs. The metasync -C option will display the credentials in a form that can be eval’d by the shell, and the -p option selects an alternate profile.

$ metasync -C -p fooprofile
export AWS_ACCESS_KEY_ID=AKIAX5TRYYQ5QJXMFAP3
export AWS_SECRET_ACCESS_KEY=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
# Run  'eval $(metasync -C -p fooprofile )' to configure credentials in a shell

The last line of the output shows the command to run to set the credentials in the shell:

$ eval $(metasync -C -p fooprofile )

Setting credentials in the shell is only required if you access the private dataset via open_package(), although it should also work when using the metasync and metapack programs.
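Because boto3 picks credentials up from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables that the eval command sets, you can verify the shell is configured before calling open_package(). The helper below is a hypothetical convenience, not part of metapack:

```python
# Sketch: check that the environment variables set by
# 'eval $(metasync -C ...)' are present before opening a private package.
# have_s3_credentials() is a hypothetical helper, for illustration only.
import os

def have_s3_credentials():
    return bool(os.environ.get('AWS_ACCESS_KEY_ID')
                and os.environ.get('AWS_SECRET_ACCESS_KEY'))

# Simulate what the eval'd metasync -C output does, with dummy values:
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIAX5TRYYQ5QJXMFAP3'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'EEnwzKrz2KVILzrZb81zw4LRDbY9BbS0B8KMn88x'

print(have_s3_credentials())
```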

Using Private Files

Private files can’t be easily downloaded using a web browser, but there are a few other ways to fetch them.

  • Use an S3 client, such as CyberDuck, S3 Browser, CloudBerry or S3 Tools.
  • Use the metapack program to dump a CSV file.

To use the metapack program, first list the resources in the remote package:

$ metapack -r s3://library.civicknowledge.com/private/carr/civicknowledge.com-rcfe_health-1.csv
seniors s3://library.civicknowledge.com/private/carr/civicknowledge.com-rcfe_health-1/data/seniors.csv
rcfe_tract s3://library.civicknowledge.com/private/carr/civicknowledge.com-rcfe_health-1/data/rcfe_tract.csv
rcfe_sra s3://library.civicknowledge.com/private/carr/civicknowledge.com-rcfe_health-1/data/rcfe_sra.csv
rcfe_seniors_tract s3://library.civicknowledge.com/private/carr/civicknowledge.com-rcfe_health-1/data/rcfe_seniors_tract.csv

Then, run the same command again, but append a fragment to the URL and redirect the output to a CSV file. For instance, for the seniors resource, append #seniors to the URL:

$ metapack -r s3://.../civicknowledge.com-rcfe_health-1.csv#seniors > seniors.csv
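The fragment simply names a resource within the package, so the URL splits cleanly with the standard library. This sketch illustrates the URL structure only, not metapack's internals:

```python
# Sketch: the '#seniors' fragment names a resource inside the package.
# urllib.parse separates it from the package URL; illustrative only.
from urllib.parse import urlsplit

url = ('s3://library.civicknowledge.com/private/carr/'
       'civicknowledge.com-rcfe_health-1.csv#seniors')
parts = urlsplit(url)

print(parts.fragment)                 # the resource name
print(parts.path.rsplit('/', 1)[-1])  # the package file itself
```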

You can also fetch the entire data package, downloading all of the data files, by creating a local Filesystem, ZIP, or Excel package. The easiest to use is the Filesystem package, created with metapack -f:

$ metapack -f s3://.../civicknowledge.com-rcfe_health-1.csv

The command will create a complete data package with unpacked CSV files in the _packages subdirectory.