Metapack Overview¶
Metapack is a system for packaging data, based on a structured metadata format called Metatab. A Metapack data package is a collection of data and metadata, where the metadata is expressed in Metatab format, usually in a CSV file. Metapack packages come in several variants:
Filesystem: files in a directory
ZIP: A ZIPped Filesystem package
S3: A Filesystem package in an S3 bucket
Excel: All metadata and data in a single Excel spreadsheet
Source: A Filesystem package with data processing instructions, to build all other package types.
Each package has data, metadata and documentation.
Root Metadata includes the title, name, identifiers, and other top level information
Resources are data that is included in the data package, usually as a CSV file.
References are URLs to other documents or data
Documentation includes a README file, links to websites, and inline notes.
Data Dictionary, a list of all tables and their columns.
All of this data and metadata is accessible through either the Metapack programmatic interface or the CLI commands.
The resources, references and documentation metadata make heavy use of URLs to refer to external resources, and the resources and references use custom URLs to refer to row-oriented data.
For this overview, we’ll refer to the metadata file for the example.com-full-2017-us-1 package.
Just Enough Metatab¶
To fully understand this documentation, you’ll want to have a basic understanding of Metatab. The best information is in the Specification, but you can get by with a short introduction.
Metatab is a tabular format for structured data, so you can put records that have multiple properties and children into a spreadsheet. Each of the Metatab records is a Term. Terms have a short name and a fully qualified name. For instance, the term that holds title information is the Root.Title term, but can be shortened to Title.
In a Metatab file, the first column always holds a term name, and the second column holds the term value. The columns after the first two can hold properties, which are also child terms.
Child relationships are encoded by giving the child a term name whose first part matches the name of the parent term. For instance, these rows:
Root.Table | TableName |
Table.Column | ColumnName |
Will create a Root.Table term, with a value of ‘TableName’ and a Table.Column child, with a column name of ‘ColumnName’. The parent portion of a term name can be elided if it can be inferred from the previous term, so the above example can also be written as:
Root.Table | TableName |
.Column | ColumnName |
That is, if the Term starts with a ‘.’, it is assumed to be a child of the most recent top level term. If there is no parent portion to the term, and no ‘.’, the term is assumed to be the child of Root. The most common way to present this information, however, is to elide Root, but be explicit about most other parents. So this example is most often written as:
Table | TableName |
Table.Column | ColumnName |
Rows can also have properties: values in the third column of the file or later, which are converted to child terms. The term name for each property is specified in a header, which is part of the section the terms are in. A Metatab document starts with a root section, but the section can be explicitly set with a Section term. Here is an example, from the Schema section of a typical Metatab document:
Section | Schema | DataType |
Table | TableName | |
Table.Column | ColumnName | integer |
In the Section row, the third column, “DataType”, declares that in this section, any value in the third column is a child of the row’s term, with a name of DataType. Therefore, the third line of this example results in a Table.Column term with a value of “ColumnName”, and the Table.Column term has a child term of type Column.DataType with a value of “integer”. So these rows are equivalent to:
Table | TableName |
Table.Column | ColumnName |
Column.DataType | integer |
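The expansion of property columns into child terms can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the Metatab parser; the function name expand_properties and the tuple-based row layout are assumptions made for the example.

```python
def expand_properties(rows):
    """Rewrite rows that carry property columns as explicit term/value rows.

    Hypothetical helper for illustration; not part of Metatab or Metapack.
    """
    prop_names = []  # property names declared by the current Section row
    out = []
    for row in rows:
        term, value, *props = row
        if term == "Section":
            # Columns after the section name declare the property names.
            prop_names = list(props)
            out.append((term, value))
            continue
        out.append((term, value))
        # A property becomes a child of the last part of the row's term name.
        parent = term.split(".")[-1]
        for name, prop_value in zip(prop_names, props):
            if name and prop_value:
                out.append((f"{parent}.{name}", prop_value))
    return out


rows = [
    ("Section", "Schema", "DataType"),
    ("Table", "TableName", ""),
    ("Table.Column", "ColumnName", "integer"),
]
expanded = expand_properties(rows)
for term, value in expanded:
    print(f"{term} | {value} |")
```

The Section row itself is kept in the output, so the last three printed rows correspond to the equivalent form shown above.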
For writing in text files, there is a “Text Lines” version of the format, which consists of only the term and value portions of the format; all properties are represented explicitly as children. This is the format you will see in this documentation. For instance, the Schema section example would be expressed in Text Lines as:
Section: Schema
Table: TableName
Table.Column: ColumnName
Column.DataType: Integer
In both CSV and Lines format, the parent portion of a term name can be elided if it can be inferred from the previous term, so the above example can also be written as:
Section: Schema
Table: TableName
.Column: ColumnName
Column.DataType: Integer
The Lines format is more compact and more readable in text files, so occasionally documentation will use the Lines format.
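As a rough illustration of the elision rules, the following Python sketch reads Text Lines and resolves each term to a parent and a short name. It is a simplified reading of the rules in this section, not the Metatab parser, and the name parse_lines is hypothetical.

```python
def parse_lines(text):
    """Return (parent, term, value) triples from Metatab-style Text Lines.

    Hypothetical helper for illustration; the real parser handles far more.
    """
    records = []
    last_top = "Root"  # most recent top level term, used for a leading '.'
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values may contain URLs.
        name, value = (part.strip() for part in line.split(":", 1))
        if name.startswith("."):
            parent, term = last_top, name[1:]   # elided parent
        elif "." in name:
            parent, term = name.split(".", 1)   # explicit parent
        else:
            parent, term = "Root", name         # no parent means Root
        if parent == "Root":
            last_top = term
        records.append((parent, term, value))
    return records


example = """\
Section: Schema
Table: TableName
.Column: ColumnName
Column.DataType: Integer
"""
records = parse_lines(example)
for record in records:
    print(record)
```

Here .Column resolves to a child of Table because Table is the most recent top level term, while Column.DataType names its parent explicitly.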
Root, Documentation and Contact Metadata¶
The Root section is the first, unlabeled section of a Metatab document, which contains information such as the package title, name, and identification numbers. In the example.com-full-2017-us-1 example file, the root section contains:
Declare | metatab-latest |
Title | A Metatab Example Data Package |
Description | An example data package, from the Metatab tutorial at |
Description | https://github.com/CivicKnowledge/metatab-py/blob/master/README.rst |
Identifier | 96cd659b-94ad-46ae-9c18-4018caa64355 |
Name | example.com-full-2017-us-1 |
Dataset | full |
Origin | example.com |
Time | 2017 |
Space | US |
Version | 1 |
Modified | 2017-09-20T16:00:18 |
Issued | 2017-09-20T16:43:33 |
Giturl | |
Distribution | http://library.metatab.org/example.com-full-2017-us-1/metadata.csv |
Distribution | |
Distribution | |
Some of the important terms in this section include:
Declare: specifies the terms that are valid for the document and their datatypes.
Title: The dataset title
Description: A simple description, which can be split across multiple terms.
Identifier: An automatically generated unique string for this dataset.
Name: The formal name of the dataset, which is created from the Origin, Dataset, Variation, Time, Space, Grain and Version terms.
Distribution: Indicates where other versions of this same package are located on the Web.
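The relationship between the Name term and its component terms can be illustrated with a small Python sketch. The joining rule used here (lowercase the non-empty parts and join them with hyphens) is an assumption inferred from the example package name, not a statement of how Metapack computes it, and build_name is a hypothetical function.

```python
def build_name(origin, dataset, variation=None, time=None,
               space=None, grain=None, version=None):
    """Join the non-empty name components with hyphens, lowercased.

    Hypothetical helper; the joining rule is inferred from the example.
    """
    parts = [origin, dataset, variation, time, space, grain, version]
    return "-".join(str(p).lower() for p in parts if p not in (None, ""))


# Values taken from the root section of the example package.
name = build_name(origin="example.com", dataset="full",
                  time=2017, space="US", version=1)
print(name)
```

With the root section values shown above, this produces the same string as the Name term, example.com-full-2017-us-1.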
The Documentation section has links to URLs, or text files included in a ZIP package, for important documentation, download pages, data dictionaries, or notes.
The Contacts section lists the names, URLs and email addresses for people or organizations that created, wrangled or published the dataset.
Resources and References¶
The heart of the metadata is the Resources and References sections. Both sections have the same format, with an important difference: the Resources section declares row-oriented datafiles that are included in data packages (i.e., files that are copied into a ZIP package), while the References section specifies URLs to objects that are not included in the data package, and may not be row-oriented data.
The Resources section has (in Lines format):
Section: Resources|Name|schema|StartLine|HeaderLines|Description|nrows|
Datafile: http://public.source.civicknowledge.com/example.com/sources\
/test_data.zip#renter_cost.csv
.Name: renter_cost
.Startline: 5
.Headerlines: 3,4
.Description: Portion of income spent on rent, extracted from the ACS
.Nrows: 12000
The values for the Datafile terms are URLs that reference row-oriented data on the web. The fragment portion of the URL, preceded by a ‘#’, names the file within the ZIP file to extract. The .Startline argument indicates that the first data line of the file is on line 5, not line 1 as is typical, and the .Headerlines argument indicates that rather than using line 0 for the headers, the headers are on lines 3 and 4. The values in line 3 and line 4 will be concatenated column-wise.
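To make the effect of Startline and Headerlines concrete, here is a minimal Python sketch. It assumes line numbers are zero-based indexes into the list of rows and that header values are joined with a space; both are assumptions for the example, as are the function name and the sample rows, which are invented and not taken from the real file.

```python
def split_headers_and_data(rows, header_lines, start_line):
    """Build one header row from several lines, and return the data rows.

    Hypothetical helper; assumes zero-based line numbers and a space joiner.
    """
    header_rows = [rows[i] for i in header_lines]
    width = max(len(r) for r in header_rows)
    headers = []
    for col in range(width):
        # Concatenate the non-empty header values found in this column.
        parts = [r[col].strip() for r in header_rows
                 if col < len(r) and r[col].strip()]
        headers.append(" ".join(parts))
    return headers, rows[start_line:]


# Invented sample rows, laid out like a file with a preamble.
rows = [
    ["A title line"],            # line 0
    [],                          # line 1
    ["A note line"],             # line 2
    ["id", "rent", "rent"],      # line 3: first header line
    ["", "median", "margin"],    # line 4: second header line
    ["1", "1200", "50"],         # line 5: first data line
    ["2", "950", "40"],
]
headers, data = split_headers_and_data(rows, header_lines=[3, 4], start_line=5)
print(headers)
print(data)
```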
Datafiles can also reference resources in other Metatab packages, such as with this resource line:
Datafile: metapack+http://library.metatab.org/\
example.com-simple_example-2017-us-2.csv#random-names
.Name: random-names-csv
.Schema: random-names
.Description: Names and a random number
The metapack+ portion of the URL indicates that the URL references a metapack package, and the fragment #random-names is a resource in the package.
In source packages, resources can also reference programs:
Datafile: program+file:scripts/rowgen.py
.Name: rowgen
The preceding examples are actually from a source package, so when this package is built all of the resources will be downloaded and processed into standard CSV files, with a corresponding change to their URLs.
The References section uses the same URL structure, but the data for the references is not copied into the data package. References frequently refer to more complex data, such as geographic shape files:
Reference: shape+http://ds.civicknowledge.org/sangis.org/Subregional_Areas_2010.zip
.Name: sra
.Description: Sub-regional areas
The shape+ protocol is defined in the rowgenerators module. The full set of URL patterns that the rowgenerators module recognizes can be found by running the rowgen-urls -l program.
Resource Urls¶
We’ve seen a few URLs in the previous sections, but they should be described in more detail, because URLs are so central to the system. These URLs have a few extra components that are not common in web URLs. The parts of these URLs are:
An optional protocol, the part of the scheme before a ‘+’ character.
A normal URL, or a file path.
A fragment, indicated with a ‘#’ character. Fragments can contain:
  - One or two segments, separated by a ‘;’ character, to indicate files within a resource container
  - Multiple arguments, separated with ‘&’ characters.
The protocol describes additional handling for the URL, such as the shape+ protocol, which indicates a shapefile. The segments refer to files in a container, such as a file in a ZIP archive, or a spreadsheet in an Excel workbook. There are two segments, so you can refer to a spreadsheet in an Excel workbook that’s inside a ZIP file. The arguments can override information about the resource described by the URL, such as forcing a file that ends in ‘.txt’ to be interpreted as a CSV file.
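The parts described above can be pulled apart with the Python standard library. The sketch below is only an approximation for illustration and is not the rowgenerators implementation; the function name split_resource_url and the dictionary keys are assumptions, and it assumes segments are separated by ‘;’ and arguments by ‘&’, as in the example URLs in this overview.

```python
from urllib.parse import urlsplit, parse_qsl


def split_resource_url(url):
    """Split a resource URL into protocol, plain URL, segments and arguments.

    Hypothetical helper for illustration; not the rowgenerators parser.
    """
    parts = urlsplit(url)
    scheme = parts.scheme
    protocol = None
    if "+" in scheme:
        # The part of the scheme before '+' is the extra protocol.
        protocol, scheme = scheme.split("+", 1)
    # Segments come first in the fragment; arguments follow the first '&'.
    seg_part, _, arg_part = parts.fragment.partition("&")
    segments = [s for s in seg_part.split(";") if s]
    arguments = dict(parse_qsl(arg_part))
    plain_url = parts._replace(scheme=scheme, fragment="").geturl()
    return {
        "protocol": protocol,
        "url": plain_url,
        "segments": segments,
        "arguments": arguments,
    }


shape = split_resource_url(
    "shape+http://ds.civicknowledge.org/sangis.org/Subregional_Areas_2010.zip")
excel = split_resource_url(
    "http://example.com/test_data.zip#renter_cost_excel07.xlsx;1")
forced = split_resource_url(
    "http://example.com/sources/file.txt#&target_format=csv")
print(shape)
print(excel)
print(forced)
```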
When Resource URLs are processed in the rowgenerator.appurl module, the processing distinguishes several important application-specific parts of the URL:
proto. This is set to the scheme_extension if it exists, the scheme otherwise.
resource_file. The filename of the resource to download. It is usually the last part of the URL, but can be overridden in the fragment.
resource_format. The format name of the resource, normally drawn from the resource_file extension, but can be overridden in the fragment.
target_file. The filename of the file that will be produced by Url.get_target, but may be overridden.
target_format. The format of the target_file, but may be overridden. The format is just a file extension string, without the ‘.’.
target_segment. A sub-component of the target_file, such as the worksheet in a spreadsheet.
fragment_query. Holds additional parts of the fragment.
Several of these parts can be overridden by URL arguments, which appear in the fragment. The system will accept any URL arguments, but the ones it recognizes are:
resource_file. Used to force the name of the resource to download, if it is not available as the last component of the URL path.
resource_format. Used to force the file type of the resource, if the resource extension is not correct.
target_file. Used to force the name of a target file, if it can’t be inferred from the URL.
target_format. Used to force the format of the target file, by specifying an alternate file extension to use.
encoding. Text encoding to be used when reading the target.
headers. For row-oriented data, the row numbers of the headers, as a comma-separated list of integers.
start. For row-oriented data, the row number of the first row of data (as opposed to headers).
end. For row-oriented data, the row number of the last row of data.
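The way an argument overrides a value that would otherwise be inferred can be sketched as follows. The precedence shown (an explicit target_format argument, then the extension of the target file, where the target file is an explicit target_file argument, the first segment, or the last path component) is an assumption made for illustration; the rowgenerators module may resolve these differently, and the function name is hypothetical.

```python
from os.path import splitext
from urllib.parse import urlsplit, parse_qsl


def guess_target_format(url):
    """Return a target format for a URL, honoring fragment arguments.

    Hypothetical helper; the precedence used here is an assumption.
    """
    parts = urlsplit(url)
    seg_part, _, arg_part = parts.fragment.partition("&")
    args = dict(parse_qsl(arg_part))
    if "target_format" in args:
        # An explicit argument wins over anything inferred from names.
        return args["target_format"]
    target_file = (args.get("target_file")
                   or seg_part.split(";")[0]
                   or parts.path.rsplit("/", 1)[-1])
    # The format is the file extension, without the '.'.
    return splitext(target_file)[1].lstrip(".").lower()


plain = guess_target_format("http://example.com/sources/file.csv")
forced = guess_target_format(
    "http://example.com/sources/file.txt#&target_format=csv")
inner = guess_target_format(
    "http://example.com/test_data.zip#renter_cost_excel07.xlsx")
print(plain, forced, inner)
```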
Here are a few example URLs that are common in Metapack metadata:
- http://example.com/sources/file.csv
A simple URL to a CSV file
- http://example.com/sources/file.txt#&target_format=csv
A simple URL to a CSV file that has the wrong extension, so the fragment forces the csv format
- http://example.com/sources/file.csv#&encoding=latin-1
A simple URL to a CSV file, but with latin-1 encoding.
- http://example.com/test_data.zip#renter_cost_excel07.xlsx
An Excel file within a ZIP file, defaulting to the first spreadsheet in the workbook.
- http://example.com/test_data.zip#renter_cost_excel07.xlsx;1
The second spreadsheet in an Excel workbook within a ZIP file.
- python:pylib#func
References a row-generating Python function in the pylib module
- gs://1VGEkgXXmpWya7KLkrAPHp3BLGbXibxHqZvfn9zA800w
The first tab of a google spreadsheet, referenced by its ID number.
- metatab+http://library.metatab.org/example.com-simple_example-2017-us-1#random-names
A resource in a Metapack package.
- socrata+http://chhs.data.ca.gov/api/views/tthg-z4mf
A file in a Socrata data repository
Most of these URL forms will only be seen in source packages for resources, but may appear in the references section of any package type. Other packages only have resource URLs that refer to well-formed CSV files that have been loaded into the package.
The rowgen program, part of the rowgenerators module, will convert the row data referenced by a URL into CSV or a table, so it’s handy for testing URLs:
$ rowgen http://public.source.civicknowledge.com/example.com/sources/test_data.zip#simple-example.csv
id,uuid,int,float
1,eb385c36-9298-4427-8925-fe09294dbd5f,30,99.7346915319786
2,fbe2ba34-b130-49b7-bd84-3dc6efb63266,79,18.7600680401673
3,b63c1b4c-0d48-43ae-9f1d-83b0291462b5,21,34.2058855203307
4,bcf29f19-79f3-427d-b068-898e21bdc933,52,85.1947994474281
...
Schemas: The Data Dictionary¶
The last major section of the metadata is the Schema section, which holds information about each of the tables and each column in the table. Like a typical Data Dictionary, this information usually (or should, anyway) includes a description of each column.
The schema will have, at least, these values:
Column name
Datatype
And will often also include:
Column description
An alternate name for the column
An alternate name is the main column name, rewritten with no spaces, unusual characters or uppercase letters.
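One plausible way to derive such an alternate name is sketched below in Python. The exact rule Metapack applies is not stated here, so the choice to replace runs of other characters with a single underscore is an assumption, and alt_name is a hypothetical function.

```python
import re


def alt_name(column_name):
    """Lowercase a column name and replace other characters with '_'.

    Hypothetical helper; the replacement rule is an assumption.
    """
    name = re.sub(r"[^a-z0-9]+", "_", column_name.strip().lower())
    return name.strip("_")


print(alt_name("Median Rent ($)"))
print(alt_name("GEOID"))
```

The result contains only lowercase letters, digits and underscores, which makes it safe to use as an identifier in most tools.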
Continue to the next section, Using Metapack Packages, for basic use patterns.