# S/ISO/PDF 

PDF stands for Portable Document Format. It was created by Adobe[^1] 
and is currently maintained by the International Organization for Standardization (ISO) as an open international standard [^2]. 

Some commonly used specialized PDF types include: 

- ISO 14289/PDF/UA for accessible PDF documents and processors (extends PDF/A conformance level A) 
- ISO 15930/PDF/X for printing 
- ISO 19005/**PDF/A for long-term archiving** [^3] 
  - Sub-parts:
    - ISO 19005-1:2005/PDF/A-1 (based on PDF v1.4)
    - ISO 19005-2:2011/PDF/A-2 (based on PDF v1.7)
    - ISO 19005-3:2012/PDF/A-3 (adds support for embedding files of any format)
    - ISO 19005-4:2020/PDF/A-4 (based on PDF v2.0)
  - Not allowed: audio, video, 3D objects, JavaScript, certain actions, encryption, non-standard metadata
  - Required: embedded fonts with a proper license
- ISO 24517/PDF/E for representing engineering documents (CAD, etc.).

For regulatory submissions, the FDA currently supports "PDF versions 1.4 through 1.7, PDF/A-1 and PDF/A-2"[^4].
Steps for **creating and validating** PDF/A files can be found in references [^5] and [^6].
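
The version range quoted above can be checked roughly by reading the file header. The following is a minimal sketch using only the Python standard library; the names `pdf_header_version`, `is_fda_supported_version`, and `FDA_SUPPORTED_VERSIONS` are illustrative and are not part of `mtbp3` or `pypdf`. It only reads the `%PDF-x.y` header, which a `/Version` entry in the document catalog can override, and it does not check PDF/A conformance.

```python
import re

# Versions listed in the FDA quote above (PDF 1.4 through 1.7).
FDA_SUPPORTED_VERSIONS = {"1.4", "1.5", "1.6", "1.7"}


def pdf_header_version(data: bytes):
    """Return the version from a header such as b'%PDF-1.7', or None if absent."""
    # The header is expected at the start of the file; search only the first 1024 bytes.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def is_fda_supported_version(data: bytes) -> bool:
    """Return True if the header version falls in the 1.4 to 1.7 range."""
    return pdf_header_version(data) in FDA_SUPPORTED_VERSIONS


# Example with in-memory bytes; for a real file, read the first 1024 bytes in "rb" mode.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
print(pdf_header_version(sample))        # 1.7
print(is_fda_supported_version(sample))  # True
```

A header check like this is only a quick screen; the validation steps in the references above remain the way to confirm PDF/A compliance.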

The module `mtbp3.stdiso.pdfsummary` depends on the `pypdf` package [^7].
It includes functions for creating summaries of a specified PDF file.


## PDF File Summary

To use `mtbp3.stdiso`:

In [None]:
from mtbp3.stdiso.pdfsummary import pdfSummary

pfr = pdfSummary(path="")
print(pfr.get_summary_string())

If the path is left empty, an example PDF file is loaded for illustration.
More details about the example PDF can be found here: https://arxiv.org/abs/1706.03762.

To view the outline tree:

In [None]:
print(pfr.show_outline_tree())

## Work with Images

The summary above shows that there is one image on the 3rd page. 
To extract the first image on the 3rd page:

In [None]:
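# Indices are zero-based: page_index=2 is the 3rd page, image_index=0 is its first image.
# An empty outfolder returns the image object instead of saving it to disk.
img = pfr.get_image(page_index=2, image_index=0, outfolder='')
print(type(img))
print(img.size)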

In [None]:
display(img.resize((300, int(300*img.size[1]/img.size[0]))))

The `resize()` call above scales the image to a width of 300 pixels, preserving its aspect ratio, before displaying it. Use `display(img)` in Jupyter if resizing is not required.
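
The size arithmetic inside that `resize()` call can be written as a small helper. This is only a sketch; `scaled_size` is a hypothetical name and not part of `mtbp3`. It mirrors the expression used above, so the height is truncated with `int()` rather than rounded.

```python
def scaled_size(size, target_width):
    """Return (width, height) scaled to target_width with the aspect ratio preserved."""
    width, height = size
    # Same arithmetic as the cell above: new_height = target_width * height / width.
    return (target_width, int(target_width * height / width))


print(scaled_size((600, 400), 300))  # (300, 200)
```

With such a helper, the display call could be written as `display(img.resize(scaled_size(img.size, 300)))`.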

To save the 2nd image on the 4th page to a file, pass the path of an existing folder using `outfolder='add_path_here'`:

In [None]:
img_path = pfr.get_image(page_index=3, image_index=1, outfolder='.')

The function `get_image()` returns a file path instead of the image when the `outfolder` option is not an empty string. 
To read and display the saved image file in Jupyter:

In [None]:
from IPython.display import Image 

img = Image(filename=img_path, width=300)
display(img)
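
Because `outfolder` is expected to point to an existing folder, it can help to create the folder before calling `get_image()`. The following is a minimal sketch using `pathlib`; `ensure_folder` and the folder name `extracted_images` are hypothetical and not part of `mtbp3`.

```python
from pathlib import Path


def ensure_folder(path):
    """Create the folder (and any missing parents) if needed and return it as a string."""
    folder = Path(path)
    # exist_ok=True makes the call safe when the folder is already there.
    folder.mkdir(parents=True, exist_ok=True)
    return str(folder)


# Hypothetical usage with the summary object created earlier:
# img_path = pfr.get_image(page_index=3, image_index=1,
#                          outfolder=ensure_folder("extracted_images"))
```

Calling the helper repeatedly is harmless, so it can sit directly in the `get_image()` call.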

## References

[^1]: Adobe. (2024). Everything you need to know about the PDF. ([web page](https://www.adobe.com/acrobat/about-adobe-pdf.html))
[^2]: ISO. (2021). The standard for PDF is revised. ([web page](https://www.iso.org/news/ref2608.html))
[^3]: pdfa.org. (2013). PDF/A in a Nutshell 2.0. ([web page](https://pdfa.org/resource/pdfa-in-a-nutshell-2-0/))
[^4]: FDA. (2016). Portable Document Format (PDF) Specifications. ([pdf](https://www.fda.gov/files/drugs/published/Portable-Document-Format-Specifications.pdf))
[^5]: Adobe. (2023). PDF/X-, PDF/A-, and PDF/E-compliant files (Acrobat Pro). ([web page](https://helpx.adobe.com/acrobat/using/pdf-x-pdf-a-pdf.html))
[^6]: pypdf Contributors. (2024). PDF/A Compliance. ([web page](https://pypdf.readthedocs.io/en/stable/user/pdfa-compliance.html))
[^7]: pypdf Contributors. (2024). pypdf. ([web page](https://pypdf.readthedocs.io/en/stable/index.html))




