{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "# S/ISO/PDF \n", "\n", "PDF stands for Portable Document Format, which was created by Adobe[^1] \n", "and is currently maintained by the International Organization for Standardization (ISO) as an open international standard [^2]. \n", "\n", "Some commonly used specialized PDF types include: \n", "\n", "- ISO 14289/PDF/UA for accessible PDF documents and processors (extends PDF/A conformance level A) \n", "- ISO 15930/PDF/X for printing \n", "- ISO 19005/**PDF/A for long-term archiving** [^3] \n", "  - Sub-parts:\n", "    - ISO 19005-1:2005/PDF/A-1 (based on PDF v1.4)\n", "    - ISO 19005-2:2011/PDF/A-2 (based on PDF v1.7)\n", "    - ISO 19005-3:2012/PDF/A-3 (adds embedding of arbitrary file formats)\n", "    - ISO 19005-4:2020/PDF/A-4 (based on PDF v2.0)\n", "  - Not allowed: audio, video, 3D objects, JavaScript, certain actions, encryption, non-standard metadata\n", "  - Required: embedded fonts that are licensed for embedding \n", "- ISO 24517/PDF/E for representing engineering documents (CAD, etc.).\n", "\n", "For regulatory submissions, the FDA currently supports \"PDF versions 1.4 through 1.7, PDF/A-1 and PDF/A-2\"[^4].\n", "Steps for **creating and validating** PDF/A files can be found in references [^5] and [^6].\n", "\n", "The module `stdiso.pdfsummary` depends on the package `pypdf` [^7].\n", "It includes functions for creating summaries of a specified PDF file.\n" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "## PDF File Summary\n", "\n", "To use `mtbp3.stdiso`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "from mtbp3.stdiso.pdfsummary import pdfSummary\n", "\n", "# An empty path loads the bundled example PDF\n", "pfr = pdfSummary(path=\"\")\n", "print(pfr.get_summary_string())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"If the path is left empty, an example PDF file is loaded for illustration.\n", "More details about the example PDF can be found here: https://arxiv.org/abs/1706.03762." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the outline tree:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "print(pfr.show_outline_tree())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Work with Images\n", "\n", "The summary above shows that there is one image on the 3rd page. \n", "To extract the first image on the 3rd page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "# page_index and image_index are zero-based: page_index=2 is the 3rd page\n", "img = pfr.get_image(page_index=2, image_index=0, outfolder='')\n", "print(type(img))\n", "print(img.size)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "# Scale to a width of 300 pixels while keeping the aspect ratio\n", "display(img.resize((300, int(300*img.size[1]/img.size[0]))))" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "The `resize()` call above scales the image to a width of 300 pixels, preserving its aspect ratio, before it is displayed. Use `display(img)` in Jupyter if resizing is not required.\n", "\n", "To save the 2nd image on the 4th page to a file, pass the path of an existing folder using `outfolder='add_path_here'`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "img_path = pfr.get_image(page_index=3, image_index=1, outfolder='.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `get_image()` returns a file path instead of the image when the `outfolder` option is not an empty string. 
\n", "To read and display the saved image file in Jupyter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "from IPython.display import Image \n", "\n", "img = Image(filename=img_path, width=300)\n", "display(img)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reference\n", "\n", "[^1]: Adobe. (2024). Everything you need to know about the PDF. ([web page](https://www.adobe.com/acrobat/about-adobe-pdf.html))\n", "[^2]: ISO. (2021). The standard for PDF is revised. ([web page](https://www.iso.org/news/ref2608.html))\n", "[^3]: pdfa.org. (2013). PDF/A in a Nutshell 2.0. ([web page](https://pdfa.org/resource/pdfa-in-a-nutshell-2-0/))\n", "[^4]: FDA. (2016). Portable Document Format (PDF) Specifications. ([pdf](https://www.fda.gov/files/drugs/published/Portable-Document-Format-Specifications.pdf))\n", "[^5]: Adobe. (2023). PDF/X-, PDF/A-, and PDF/E-compliant files (Acrobat Pro). ([web page](https://helpx.adobe.com/acrobat/using/pdf-x-pdf-a-pdf.html))\n", "[^6]: pypdf Contributors. (2024). PDF/A Compliance. ([web page](https://pypdf.readthedocs.io/en/stable/user/pdfa-compliance.html))\n", "[^7]: pypdf Contributors. (2024). pypdf. ([web page](https://pypdf.readthedocs.io/en/stable/index.html))\n", "\n", "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }