The 'official' Archive::BagIt repository. Contains patches to update Archive::BagIt to version 1.0 of BagIt, see RFC 8493 (https://tools.ietf.org/html/rfc8493).
Version 0.100, released 2025-02-25.

NAME

    Archive::BagIt - The main module to handle Bags

SYNOPSIS

    This module will hopefully help with the basic commands needed to
    create and verify a bag. It supports BagIt 1.0 according to RFC
    8493 (https://tools.ietf.org/html/rfc8493).

    To get started, you only need to know the following methods:

 read a BagIt

        use Archive::BagIt;
    
        #read in an existing bag:
        my $bag_dir = "/path/to/bag";
        my $bag = Archive::BagIt->new($bag_dir);

 construct a BagIt around a payload

        use Archive::BagIt;
        my $bag2 = Archive::BagIt->make_bag($bag_dir);

 verify a BagIt-dir

        use Archive::BagIt;
    
        # Validate a BagIt archive against its manifest
        my $bag3 = Archive::BagIt->new($bag_dir);
        my $is_valid1 = $bag3->verify_bag();
    
        # Validate a BagIt archive against its manifest, report all errors
        my $bag4 = Archive::BagIt->new($bag_dir);
        my $is_valid2 = $bag4->verify_bag( {report_all_errors => 1} );

 read a BagIt-dir, change something, store

    Because all methods operate lazily, make sure to parse the existing
    bag *BEFORE* you modify it. Otherwise it will be overwritten!

        use Archive::BagIt;
        my $bag5 = Archive::BagIt->new($bag_dir); # lazy, nothing is read yet
        $bag5->load(); # this updates the object representation by parsing the given $bag_dir
        $bag5->store(); # this writes the bag anew

SOURCE

    The original development version was on github at
    http://github.com/rjeschmi/Archive-BagIt and may be cloned from there.

    The current development version is available at
    https://git.fsfe.org/art1pirat/Archive-BagIt

Conformance to RFC8493

    The module should fulfill the RFC requirements, with the following
    limitations:

    only encoding UTF-8 is supported

    version 0.97 or 1.0 allowed

    version 0.97 requires tag-/manifest-files with md5-fixity

    version 1.0 requires tag-/manifest-files with sha512-fixity

    BOM is not supported

    Carriage returns in bagit files are not allowed

    fetch.txt is unsupported

    At the moment only Linux-style file paths are supported.

    For a more detailed overview, see the test suite under
    t/verify_bag.t and the corresponding test bags from the BagIt
    conformance test suite of the Library of Congress under
    bagit_conformance_suite/.

    See https://datatracker.ietf.org/doc/rfc8493/?include_text=1 for
    details.

TODO

    enhanced testsuite

    reduce complexity

    use modern perl code

    add flag to enable very strict verify

METHODS

 Constructor

    The constructor creates a bag object from a single argument,

        use Archive::BagIt;
    
        #read in an existing bag:
        my $bag_dir = "/path/to/bag";
        my $bag = Archive::BagIt->new($bag_dir);

    or from named arguments

        use Archive::BagIt;
    
        #read in an existing bag:
        my $bag_dir = "/path/to/bag";
        my $bag = Archive::BagIt->new(
            bag_path => $bag_dir,
        );

    The arguments are:

    bag_path - path to bag-directory

    force_utf8 - if set, the warnings about non-portable filenames are
    disabled (default: warnings enabled)

    use_async - if set, it uses IO::Async to read payload files
    asynchronously, only useful under Linux.

    use_parallel - if set, it uses Parallel::parallel_map to calculate
    digests of payload files in parallel, only useful if the underlying
    filesystem supports parallel reads and if multiple CPU cores are
    available.

    use_plugins - expected manifest plugin strings, if set it uses the
    requested plugins, example Archive::BagIt::Plugin::Manifest::SHA256.
    HINT: this option *disables* the forced fixity check in verify_bag()!

    The bag object will use $bag_dir, BUT an existing $bag_dir is not read.
    If you use store() an existing bag will be overwritten!

    See load() if you want to parse/modify an existing bag.

 use_parallel()

    if set it uses parallel digest processing, default: false

 use_async()

    if set it uses async IO, default: false

 has_force_utf8()

    Checks whether force_utf8 was set.

    If set, warnings about potential file path problems are ignored.

 bag_path([$new_value])

    Getter/setter for bag path

 metadata_path()

    Getter for metadata path

 payload_path()

    Getter for payload path

 checksum_algos()

    Getter for registered Checksums

 bag_version()

    Getter for bag version

 bag_encoding()

    Getter for bag encoding.

    HINT: the current version of Archive::BagIt only supports UTF-8, but
    the method could return other values depending on given Bags.

 bag_info([$new_value])

    Getter/Setter for bag info. Expects/returns an array of HashRefs
    implementing simple key-value pairs.

    HINT: RFC8493 does not allow *reordering* of entries!
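
    The data shape described above can be pictured with plain Perl. This
    is only a sketch: it assumes each HashRef carries exactly one
    key-value pair, and values_for_key() is a hypothetical helper, not
    part of Archive::BagIt. It shows why an ordered array is used:
    repeated keys and the original order both survive.

```perl
use strict;
use warnings;

# Assumed shape: an ordered list of single-pair hashrefs.
my @bag_info = (
    { 'Source-Organization' => 'Example Library' },
    { 'Contact-Name'        => 'Jane Doe' },
    { 'External-Identifier' => 'id-0001' },
    { 'External-Identifier' => 'id-0002' },
);

# Hypothetical helper (not part of Archive::BagIt):
# collect all values stored under a key, in their original order.
sub values_for_key {
    my ($info_ref, $searchkey) = @_;
    return map  { $_->{$searchkey} }
           grep { exists $_->{$searchkey} } @{$info_ref};
}

my @ids = values_for_key(\@bag_info, 'External-Identifier');
print join(', ', @ids), "\n";    # prints "id-0001, id-0002"
```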

 has_bag_info()

    returns true if bag info exists.

 errors()

    Getter to return collected errors after a verify_bag() call with Option
    report_all_errors

 warnings()

    Getter to return collected warnings after a verify_bag() call

 digest_callback()

    This method can be reimplemented by derived classes to handle fixity
    checks in their own way. The getter returns an anonymous function
    with the following interface:

       my $digest = $self->digest_callback;
       $digest->( $digestobject, $filename );

    This anonymous function MUST use the get_hash_string() function of the
    Archive::BagIt::Role::Algorithm role, which is implemented by each
    Archive::BagIt::Plugin::Algorithm::XXXX module.

    See Archive::BagIt::Fast for details.

 get_baginfo_values_by_key($searchkey)

    Returns all values which match $searchkey, undef otherwise

 is_baginfo_key_reserved_as_uniq($searchkey)

    returns true if key is reserved and should be unique

 is_baginfo_key_reserved( $searchkey )

    returns true if key is reserved

 verify_baginfo()

    checks the baginfo keys and returns true if all is fine; otherwise it
    returns undef and the message is pushed to errors(). Warnings are
    pushed to warnings().

 delete_baginfo_by_key( $searchkey )

    deletes the entry for the given $searchkey if it exists. If multiple
    entries with $searchkey exist, only the last one is deleted.

 exists_baginfo_key( $searchkey )

    returns true if a given $searchkey exists

 append_baginfo_by_key($searchkey, $newvalue)

    Appends a key value pair to bag_info.

    HINT: check the return code to see whether the append was successful,
    because some keys need to be unique.

 add_or_replace_baginfo_by_key($searchkey, $newvalue)

    It replaces the first entry with $newvalue if $searchkey exists,
    otherwise it appends.
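
    The bag-info helpers above could be combined as in the following
    sketch. It only uses methods named in this README. The sub
    update_contact(), the key choices and the values are illustrative
    assumptions, and it is assumed that append_baginfo_by_key() returns a
    false value when the append is refused.

```perl
use strict;
use warnings;

# Sketch: change bag-info entries of an existing bag and write it back.
# Archive::BagIt is loaded at run time, inside the sub.
sub update_contact {
    my ($bag_dir, $contact) = @_;
    require Archive::BagIt;

    my $bag = Archive::BagIt->new($bag_dir);
    $bag->load();    # parse the existing bag first, otherwise store() overwrites it

    # Replaces the first 'Contact-Name' entry, or appends one if none exists.
    $bag->add_or_replace_baginfo_by_key('Contact-Name', $contact);

    # Assumption: a false return value means the append was refused,
    # for example because the key must be unique.
    $bag->append_baginfo_by_key('External-Identifier', 'id-0002')
        or warn "could not append External-Identifier\n";

    $bag->store();
    return $bag;
}
```

    A call would look like update_contact("/path/to/bag", "Jane Doe").
    Keep in mind that create_baginfo() overwrites 'Bagging-Date',
    'Bag-Software-Agent', 'Payload-Oxum' and 'Bag-Size', so setting those
    by hand has no lasting effect.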

 forced_fixity_algorithm()

    Getter to return the forced fixity algorithm depending on BagIt version

 manifest_files()

    Getter to find all manifest-files

 tagmanifest_files()

    Getter to find all tagmanifest-files

 payload_files()

    Getter to find all payload-files

 non_payload_files()

    Getter to find all non payload-files

 plugins()

    Getter/setter to algorithm plugins

 manifests()

    Getter/Setter to all manifests (objects)

 algos()

    Getter/Setter to all registered Algorithms

 load_plugins

    By default SHA512 and MD5 are loaded and therefore used. If you want
    to create a bag with only one or a specific checksum algorithm, you
    can use this method to (re-)register it. It expects a list of strings
    with namespaces of the form Archive::BagIt::Plugin::Algorithm::XXX,
    where XXX is your chosen fixity algorithm.

 load()

    Triggers loading of an existing bag

 verify_bag($opts)

    A method to verify a bag deeply. If $opts is set with
    {return_all_errors} all fixity errors are reported. The default is to
    croak with an error message if any error is detected.

    HINT: You might also want to check Archive::BagIt::Fast to see a more
    direct way of accessing files (and thus faster).
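
    Because verify_bag() croaks by default, callers usually want to trap
    the exception. The following is a minimal sketch of that pattern;
    check_bag() is a hypothetical wrapper, not part of the module. It
    relies only on the default behaviour and does not pass any option,
    since the option key is spelled both report_all_errors and
    return_all_errors in this README.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns (1, undef) for a valid bag,
# (0, $message) otherwise.
sub check_bag {
    my ($bag_dir) = @_;

    my $ok = eval {
        # Loading inside eval turns a missing module into a message, too.
        require Archive::BagIt;
        my $bag = Archive::BagIt->new($bag_dir);
        # Croaks on the first detected error by default.
        $bag->verify_bag() or die "verify_bag() returned false\n";
        1;
    };
    return (1, undef) if $ok;

    my $message = $@ || 'unknown verification error';
    return (0, $message);
}

my ($ok, $message) = check_bag('/path/to/bag');
print $ok ? "bag is valid\n" : "bag is invalid: $message\n";
```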

 calc_payload_oxum()

    returns an array with octets and streamcount of payload-dir
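
    RFC 8493 writes Payload-Oxum as "OctetCount.StreamCount", that is the
    total number of payload bytes and the number of payload files. The
    sketch below shows how such a pair can be computed with core modules
    only. payload_oxum() is a hypothetical helper for illustration and
    not the module's implementation.

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Temp qw(tempdir);

# Hypothetical helper: sum the sizes of all regular files below
# $payload_dir and count them.
sub payload_oxum {
    my ($payload_dir) = @_;
    my ($octets, $streams) = (0, 0);
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless -f $File::Find::name;
                $octets  += -s _;    # reuse the stat result of the -f test
                $streams += 1;
            },
        },
        $payload_dir,
    );
    return ($octets, $streams);
}

# Small demonstration on a temporary directory with two 5-byte files.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $fh, '>', "$dir/$name" or die "cannot write $dir/$name: $!";
    print {$fh} 'hello';
    close $fh or die "cannot close $dir/$name: $!";
}

my ($octets, $streams) = payload_oxum($dir);
print "Payload-Oxum: $octets.$streams\n";    # prints "Payload-Oxum: 10.2"
```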

 calc_bagsize()

    returns a string with the human-readable size of the payload

 create_bagit()

    creates a bagit.txt file

 create_baginfo()

    creates a bag-info.txt file

    Hint: the entries 'Bagging-Date', 'Bag-Software-Agent', 'Payload-Oxum'
    and 'Bag-Size' will be automagically set, existing values in internal
    bag-info representation will be overwritten!

 store()

    store a bagit-obj if bagit directory-structure was already constructed.

 init_metadata( $bag_path, $options)

    A constructor that will just create the metadata directory

    This won't make a bag, but it will create the conditions to do that
    eventually

 make_bag( $bag_path, $options )

    A constructor that will make and return a bag from a directory.

    It expects that a preliminary bagit-dir exists. If a data directory
    already exists there, it assumes the directory is already a bag (no
    checking for invalid files in root).

FAQ

 How to access the manifest-entries directly?

    Try this:

       # $bag is an Archive::BagIt object
       foreach my $algorithm ( keys %{ $bag->manifests }) {
           my $entries_ref = $bag->manifests->{$algorithm}->manifest_entries();
           # $entries_ref is a hashref mapping file paths to digests, like:
           # {
           #     "data/hello.txt" => "e7c22b994c59d9cf2b48e549b1e24666636045930d3da7c1acb299d1c3b7f931f94aae41edda2c2b207a36e10f8bcb8d45223e54878f5b316e7ce3b6bc019629"
           # }
       }

    Similar for tagmanifests

 How fast is Archive::BagIt?

    I have made great efforts to optimize Archive::BagIt for high
    throughput. There are two limiting factors:

    calculation of checksums, by switching from the module "Digest" to
    OpenSSL by using Net::SSLeay a significant speed increase could be
    achieved.

    loading the files referenced in the manifest files was previously done
    serially and using synchronous I/O. By using the IO::Async module, the
    files are loaded asynchronously and the checksums are calculated in
    parallel. If the underlying file system supports parallel accesses, the
    performance gain is huge.

    On my system with 8 cores, SSD and a large 9GB bag with 568 payload
    files the results for verify_bag() are:

                         processing time            run time        throughput
       Version       user time   system time   total time   relative     MB/s
        v0.71         38.31s       1.60s        39.938s       100%        230
        v0.81         25.48s       1.68s        27.1s          67%        340
        v0.82         48.85s       3.89s         6.84s         17%       1346

 How fast is Archive::BagIt::Fast?

    It depends. On my system with 8 cores, SSD and a 38MB bag with 48
    payload files the results for verify_bag() are:

                      Rate         Base         Fast
       Base         3.01/s           --         -21%
       Fast         3.80/s          26%           --

    On my system with 8 cores, SSD and a large 9GB bag with 568 payload
    files the results for verify_bag() are:

                    s/iter         Base         Fast
       Base           74.6           --          -9%
       Fast           68.3           9%           --

    But you should measure which variant is best for you. In general the
    default Archive::BagIt is fast enough.

 How to update an old bag of version v0.97 to v1.0?

    You could try this:

       use Archive::BagIt;
       my $bag = Archive::BagIt->new( $my_old_bag_filepath );
       $bag->load();
       $bag->store();

 How to create UTF-8 based paths under MS Windows?

    For versions before Windows 10: I have no idea, and suggestions for a
    portable solution are very welcome! For Windows 10: Thanks to
    https://superuser.com/questions/1033088/is-it-possible-to-set-locale-of-a-windows-application-to-utf-8/1451686#1451686
    you have to enable UTF-8 support via 'System Administration' ->
    'Region' -> 'Administrative' -> 'Region Settings' -> Flag 'Use Unicode
    UTF-8 for worldwide language support'

    Hint: The better way is to use only portable filenames. See perlport
    for details.

BUGS

    There are problems related to Parallel::parallel_map and IO::AIO under
    MS Windows. The tests are skipped there. Use the parallel feature or
    Archive::BagIt::Fast at your own risk on an MS Windows system. If
    you are an MS Windows developer, feel free to send me patches or hints
    to fix the issues.

THANKS

    Thanks to Rob Schmidt <rjeschmi@gmail.com> for the trustful handover of
    the project and thanks for your initial work! I would also like to
    thank Patrick Hochstenbach and Rusell McOrmond for their valuable and
    especially detailed advice! And without the helpful, sometimes rude
    help of the IRC channel #perl I would have been stuck in a lot of
    problems. Without the support of my colleagues at SLUB Dresden, the
    project would never have made it this far.