The Art of Tinkering and PDF Optimization

友人G · February 5, 2022, 4:00pm

Posted by: @FriendG Edited by: @Sunday

The Art of Tinkering and PDF Optimization

The Origin of PDF

My first contact with the PDF format was in high school. At that time, the classrooms were equipped with Seewo all-in-one machines. The chemistry teacher scanned collected test papers into PDFs, imported them into OneNote, and then annotated and explained them on the big screen.

The magic of annotating directly on a computer (after all, the Seewo all-in-one runs Windows) stuck in a mind otherwise crammed with college entrance exam material. After graduating from high school and just before entering university, I chose the Surface Pro 6 as my laptop for my university years. Because the Surface Pro is a 2-in-1 notebook that keeps Windows while adding a touchscreen, I was finally able to realize my wish to digitize my notes.

But this was just the beginning.

In the first semester of freshman year, when using the Surface Pro as a tablet, I could clearly feel the heat coming off it and see the glare on its screen. Both hurt my concentration, and the weak power management of a Windows tablet couldn’t sustain long sessions away from a charger. So I soon gave up on using the Surface Pro as a writing tablet.

Although the Surface Pro was no longer my main tool for digital learning, digitization was inevitable. In early 2020, the COVID-19 pandemic suddenly arrived. The second semester of freshman year began amidst laughter and “clinical learning.” Staying at home provided a good research environment, and my PDF research started from there.

Endless Tinkering

Getting PDFs

You can’t cook without rice. Obtaining original PDF resources is the beginning of everything. The PDFs discussed here refer specifically to scanned books.

According to the source, we can divide them into:

Acquiring existing PDF resources;
Creating PDFs from physical books.

Each method has pros and cons. Getting existing PDFs is convenient, but many new or niche books have no PDF at all, and the quality of those that exist varies greatly. Making PDFs from physical books requires the books themselves, suitable lighting, and scanning (or photographing) equipment, but the achievable quality and resolution are much higher.

Searching for Existing PDFs — Sifting Through the Sand

Existing PDFs are mostly found through a few major databases.

Chaoxing

A Xu has already explained Chaoxing in detail, so I won’t elaborate. Competition in this circle is intense, and many libraries have recently been shut down. Whoever has the most complete library now has the upper hand. Let’s just say those who know, know (manual dog head.jpg).

Here’s a picture showing that I have dabbled a bit.

Yingke

Yingke Qianxin is a magical literature service company. After WeChat cracked down heavily on group bots, the company surprisingly still manages to run group bots on a large scale, perhaps under a policy similar to the one for VPNs: individuals are blocked, companies are not.

Yingke cooperates with many university libraries to provide literature search services via WeChat groups, including but not limited to Chinese and foreign journals (searched by citation or DOI), Chinese and foreign publications (searched by ISBN), etc. Thanks to the group bots, the search efficiency and experience are quite comfortable. Unfortunately, Yingke Group 1 of Beijing University of Chinese Medicine is full; perhaps we can only hope for Group 2 or other schools’ Yingke groups. Below is me blending into Yingke’s group of Guizhou Tongren College.

TCM Digital Library

The TCM Digital Library is an online digital library built by the China Traditional Chinese Medicine Publishing House. It has many well-made Epub and PDF resources but they can only be viewed online, not downloaded.

Regarding Epub resources from this website, leaked copies have been seen; perhaps some expert cracked the site’s anti-theft measures, but no further news is known. As for the PDF resources on the website, I once attempted to crack them (I had cracked Renwei’s digital library before using online methods), hoping to download high-quality original PDFs (close to those exported from Word), but ultimately failed.

But the story didn’t end there. In January 2022, during the final exams of the first semester of my junior year, I found that “Practical New Course on Internal Medicine of Traditional Chinese Medicine,” which I needed for the next semester, had almost no copies online (apart from document delivery through Duxiu, whose quality and speed were worrying), yet the website had the original, non-scanned version. The opportunity was there, but I had no way in. After pondering it (slacking off during finals) for several days, I arrived at a peculiar but simple method: screenshots.

Looking back, I feel deeply moved: the light at the end of the tunnel, and more so the simplicity of the great way (manual dog head.jpg). The following outlines the process (Windows environment) for those who come after.

Later, I also found A Xu’s relevant tutorial, which assured me I was not alone on this path (updated 2022.2.14).

One day I randomly tried using a virtual HDMI to create a virtual 4K display, which surprisingly worked (updated 2022.2.15).

1. Use ShareX for full-screen screenshots, because it is fast and can hide notifications.
2. Use the “Ever Recorder” component in Quicker to record keyboard and mouse operations. Specifically, record “PrintScreen (full-screen screenshot), then click next page,” which is the smallest unit of automation.
3. Choose a relatively high-resolution display for the following steps, because screenshot resolution is limited by display resolution. I used the Surface Pro 6, whose resolution is 2736×1824.
4. Open the online PDF of the book and set the display mode to “Single Page,” i.e., one page per screen. Set the zoom mode to “Fit Page.” Because my Surface Pro 6 supports portrait orientation, I chose “Fit Width.”
5. Play back the macro recorded in step 2 and repeat it hundreds of times (depending on the page count). You can take a tea break in the meantime.
6. This gives you a batch of screenshots, part of which is the PDF content you need.
7. Use IrfanView to batch-crop the screenshots.
8. Then follow the general PDF optimization workflow.
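For the batch-crop step above, the crop rectangle can be worked out ahead of time instead of by trial and error. Here is a minimal sketch, assuming the viewer scales each page to fit the screen and centers it with no toolbars visible. Both are assumptions about the viewer, and the function name and the A4-like 210:297 page ratio are only illustrative:

```python
def fit_page_crop_box(screen_w, screen_h, page_w, page_h):
    """Return (left, top, right, bottom) of a page that is scaled to fit
    the screen and centered on it. All values are in screen pixels."""
    # "Fit Page" uses the largest scale at which the whole page is visible.
    scale = min(screen_w / page_w, screen_h / page_h)
    shown_w = round(page_w * scale)
    shown_h = round(page_h * scale)
    # Centering splits the leftover space evenly on both sides.
    left = (screen_w - shown_w) // 2
    top = (screen_h - shown_h) // 2
    return left, top, left + shown_w, top + shown_h


# Landscape 2736x1824 screen, page with a 210:297 aspect ratio (assumed).
print(fit_page_crop_box(2736, 1824, 210, 297))
```

The resulting numbers can then be typed into the batch-crop dialog. If the viewer shows toolbars or margins, the rectangle has to be adjusted by hand.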

Creating PDFs from Physical Books — Self-Reliance

Before 2021, this wasn’t a hard requirement. But textbooks published under the 14th and 15th Five-Year Plans are unlikely to appear online within the next few years, so producing PDFs from physical books has become an unavoidable hurdle.

There are two paths: DIY and outsourcing.

DIY

The Zhihu user Chalk Age has explained this in great detail, so I won’t repeat it here. In my hands-on experience, it is hard to become as skilled as he is right away just to produce a few e-books, and medical students are under heavy academic pressure, so I recommend outsourcing instead.

Outsourcing

Outsourcing means sending books to a print shop or a Taobao merchant for scanning. There are generally two kinds: destructive and non-destructive scanning. Destructive scanning removes the spine and feeds the loose pages through a sheet-fed scanner, which essentially eliminates curvature and skew. Non-destructive scanning typically uses an overhead scanner and keeps the book intact, but the scan quality is slightly worse. Choose according to your needs.

General PDF Optimization Workflow

After obtaining a PDF, sometimes issues such as large file size, ghosting, light text, skew, or creases appear. Then optimization is necessary. Referring again to the Zhihu user Chalk Age’s answer (recommended reading), the general steps are as follows:

1. Extract images from the PDF

Scanned PDFs are essentially collections of images. We need to get those images back out before processing them further. I chose PDF Patch Ding for image extraction.

This completes the first step, extracting images from the PDF.

2. Batch optimize images

Image optimization reduces file size and improves readability. We use ComicEnhancerPro. Chalk Age’s answer explains this more fully.

Open the software
Drag or open images
Focus on the parameter bar
Deskew & crop
Reduce colors
Adjust curves
After other corrections, batch export and wait for completion.

That completes a simple batch image optimization.
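To give a feel for what a curve adjustment does to a scanned page, here is a minimal sketch of a simple “levels” curve on grayscale values. It is not ComicEnhancerPro’s actual algorithm. The function names and the black and white points are assumptions chosen for illustration:

```python
def levels_curve(black_point, white_point):
    """Build a 256-entry lookup table that maps gray values at or below
    black_point to 0, at or above white_point to 255, and stretches the
    range in between linearly."""
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    table = []
    for value in range(256):
        clamped = min(max(value, black_point), white_point)
        table.append(round((clamped - black_point) * 255 / span))
    return table


def apply_curve(pixels, table):
    """Apply a lookup table to rows of 0-255 grayscale values."""
    return [[table[p] for p in row] for row in pixels]


# Faint gray text (60) on a dingy background (210) becomes black on white.
table = levels_curve(60, 200)
print(apply_curve([[60, 210]], table))
```

Pushing the background to pure white and the text to pure black is also what makes the later color reduction effective, since fewer distinct gray levels remain.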

3. Recombine images into a new PDF

After batch optimizing images, recombine them into a new PDF using PDF Patch Ding. The specific steps are as follows:

4. OCR (Optical Character Recognition)

My first contact with OCR was in high school due to the Zhejiang entrance exam’s technology subject including information technology… In retrospect, OCR is like magic that turns stone into gold. It makes an ordinary image-based PDF into a searchable and copyable file, greatly improving work and study efficiency.

Layered PDF

To make layered PDFs, ABBYY is a common choice.

Download and install ABBYY (ABBYY 15 is shown here as an example).
Set languages.
Adjust some OCR options:
1. Turn off MRC compression
2. Turn off splitting of two-page spreads
Start OCR.
After OCR completes, save as a searchable PDF.

Vectorization

Unlike a layered PDF, a vectorized PDF stores the recognized text directly as editable text, so characters stay sharp instead of turning jagged when you zoom in. However, without the original image layer, recognition errors in the text can be confusing.

Here, we use Wondershare PDF:

Download and install Wondershare PDF.
Select OCR PDF or Batch Processing.
Select files and perform OCR.

5. Getting and adding bookmarks

Bookmarks can be obtained from Duxiu, either directly from the website or via small tools that search by SS number.

Bookmarks can be added in bulk with PdgCntEditor. Each line just has to follow the format BookmarkName + tab + page number, and nesting levels are indicated by leading tabs. See the picture for an example.
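The line format described above is easy to generate from a script. Here is a minimal sketch that follows only the rule stated here (title, tab, page number, one leading tab per nesting level). The function name and the sample chapter titles are made up for illustration:

```python
def format_bookmarks(entries):
    """Turn (level, title, page) tuples into tab-delimited bookmark lines.

    level 0 is a top-level bookmark; each deeper level adds one leading tab.
    """
    lines = []
    for level, title, page in entries:
        lines.append("\t" * level + f"{title}\t{page}")
    return "\n".join(lines)


sample = [
    (0, "Chapter 1", 1),
    (1, "Section 1.1", 3),
    (1, "Section 1.2", 9),
    (0, "Chapter 2", 15),
]
print(format_bookmarks(sample))
```

The printed text can be pasted into the editor in one go.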

6. PDF page number offset

The directory you obtain usually numbers its pages from the start of the main text, but our PDFs often have a cover, preface, and contents pages before it. The page numbers therefore need an offset. Directory page numbers can be adjusted in PdgCntEditor. Page label offsets can be adjusted in Adobe Acrobat DC. Generally, offset the directory first (mandatory), then the page labels (optional, but it adds polish).
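The directory offset itself is simple arithmetic: every page number is shifted by the number of PDF pages that come before page 1 of the main text. Here is a minimal sketch on tab-delimited bookmark text, assuming the title + tab + page number line format. The function name and the offset of 12 are illustrative:

```python
def shift_bookmark_pages(text, offset):
    """Add offset to the page number at the end of each bookmark line.

    Lines without a tab-separated integer at the end are left untouched.
    """
    shifted = []
    for line in text.splitlines():
        head, sep, page = line.rpartition("\t")
        if not sep:
            shifted.append(line)  # no tab at all, nothing to shift
            continue
        try:
            number = int(page)
        except ValueError:
            shifted.append(line)  # last field is not a page number
            continue
        shifted.append(f"{head}\t{number + offset}")
    return "\n".join(shifted)


# Main-text page 1 is the 13th page of the PDF, so the offset is 12.
print(shift_bookmark_pages("Chapter 1\t1\n\tSection 1.1\t3", 12))
```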

Example of directory page number offset adjustment:

Example of page label offset adjustment (this example PDF has the page label offset adjusted for clarity):

Other PDF Optimization Tips

These are less commonly used and thus simplified here, but can be discussed further if needed.

Removing PDF passwords

There are many types of PDF passwords; here are two simple ones.

Reading restriction password

This requires knowing the password first to remove it. After entering the PDF reading interface with the password, remove Security Method through Properties. Use Adobe Acrobat DC here.

Editing restriction password

This can be cleverly bypassed with small tools. Here, we use PDFPasswordRemover. Just drag the edit-restricted PDF in, and the software will decrypt it automatically, generating a password-free new PDF.

Cropping PDF margins

If the white margins are fairly regular, use Adobe Acrobat DC to crop.

If irregular, use briss to batch crop. It can automatically recognize and crop different pages.
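The basic idea behind automatic margin cropping can be sketched as finding the bounding box of everything that is not background. This is only an illustration of the principle, not how briss works internally. The function name, the threshold of 200, and the list-of-rows image representation are assumptions:

```python
def content_bbox(pixels, threshold=200, padding=0):
    """Find the bounding box of non-white content on a grayscale page.

    pixels is a list of rows of 0-255 values. Returns (left, top, right,
    bottom) with right and bottom exclusive, or None for a blank page.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # A row or column counts as content if any pixel is darker than threshold.
    rows = [y for y in range(height) if any(p < threshold for p in pixels[y])]
    cols = [x for x in range(width)
            if any(pixels[y][x] < threshold for y in range(height))]
    if not rows or not cols:
        return None
    left = max(cols[0] - padding, 0)
    top = max(rows[0] - padding, 0)
    right = min(cols[-1] + 1 + padding, width)
    bottom = min(rows[-1] + 1 + padding, height)
    return left, top, right, bottom


page = [[255] * 6 for _ in range(5)]
page[1][2] = 0  # two dark pixels stand in for printed text
page[2][3] = 0
print(content_bbox(page))
```

On real scans, specks of dust near the edge would widen the box, which is one reason dedicated tools do more than this.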

Replacing PDF fonts

Use the PitStop Pro plugin for Adobe Acrobat DC.

PDF page splitting (many-in-one)

Sometimes teachers send PPT slides with multiple slides per page, requiring page splitting. Use A-PDF Page Cut here.
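Splitting a many-in-one page comes down to computing one crop box per slide. Here is a minimal sketch, assuming the slides sit in an evenly spaced grid with no margins or gutters. The function name and the 2×2 layout are illustrative, and real handouts usually need the boxes nudged:

```python
def split_boxes(page_w, page_h, rows, cols):
    """Return one (left, top, right, bottom) box per grid cell, in reading
    order (left to right, then top to bottom).

    Coordinates use a top-left origin. PDF's native coordinates start at
    the bottom-left, so flip the vertical axis before using them there.
    """
    cell_w = page_w / cols
    cell_h = page_h / rows
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * cell_w, r * cell_h,
                          (c + 1) * cell_w, (r + 1) * cell_h))
    return boxes


# A page holding four slides in a 2x2 grid.
for box in split_boxes(800, 600, 2, 2):
    print(box)
```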

Creating directories from PDF structure

Sometimes we get PDFs exported from Word, which are more regular. We can auto-create directories by matching heading formats using PDF Patch Ding.

Alternative Approaches

Creating PDFs from Word

This is perhaps the most common source for high-quality PDFs. Creating PDFs from Word is like taking a snapshot that freezes a moment of Word’s state.

Before one-click exporting to PDF, we might want to pay attention to some details.

Creating bookmarks from headings

If the Word file contains its own outline levels, selecting Create Bookmarks from Headings will automatically create a directory in the PDF. This option is found in Word’s PDF export Options.

Embedding fonts

Fonts are generally in TTF and OTF formats. OTF fonts are newer; well-known examples include Adobe’s Source Han Serif and Sans.

But note that OTF fonts often cannot be properly embedded in PDFs exported from Word and become jagged. Then it’s necessary to use the TTF version of the font, which can be converted online or locally.

Creating PDFs from Epub/Mobi/Azw3

Epub, Mobi, Azw3 resources are rarer and typically come from Amazon, JD Reading, and other bookstores. Their refined layout and directories make them highly practical. Usually, software like Calibre is used to export them as PDFs with excellent results.

Conclusion

Tinkering is endless, but perhaps what seduces us are the “flow” and “peak” experiences during the process. Even today, when I complete a beautiful PDF like a masterpiece, my soul still trembles like a puppy soaked in a torrential rain.

Copyright statement: Free reprint — Non-commercial — No derivatives — Attribution (Creative Commons 3.0 License)

Author: FriendG

Contact: guyuanye1973 (WeChat), guyuanye1973@foxmail.com

PDF Purchase: I can provide textbooks and other e-book PDFs, including OCR (searchable and copyable) and bookmark directories. For textbooks from the 14th and 15th Five-Year periods, if required for my profession, I scan and create PDFs myself; inquire about specific books.

Publication Date: February 6, 2022

王白水 · June 11, 2023, 4:23am

I’m almost forgetting about this article

一般路过木头人 · June 8, 2024, 5:18am

NB (Posts must be at least 8 characters)
