Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz

Last modified on December 03, 2020


rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, zip, tar.*, movie subtitles (mkv, mp4), and many more.



Examples

PDFs

Say you have a large folder of papers or lecture slides, and you can't remember which one of them mentioned GRUs. With rga, you can just run this:

~$ rga "GRU" slides/
slides/2016/iciness1516_lecture14.pdf
Page 34:   GRU                            LSTM
Page 35:   GRU                            CONV
Page 38:     - Are making an attempt out GRU-RCN! (imo most attention-grabbing mannequin)

slides/2018/cs231n_2018_ds08.pdf
Page  3: ●   CNNs, GANs, RNNs, LSTMs, GRU
Page 35: ● 1) temporal pooling 2) RNN (e.g. LSTM, GRU)

slides/2019/cs231n_2019_lecture10.pdf
Page 103:   GRU [Learning phrase representations using rnn
Page 105:    - Common to use LSTM or GRU

and it will recursively find a string in pdfs, including if some of them are zipped up.

You can do mostly the same thing with pdfgrep -r, but you will miss content in other file types and it will be much slower:

Searching in 65 pdfs with 93 slides each

[Bar chart: run time in seconds (lower is better) for pdfgrep, rga (first run), and rga (subsequent runs)]

On the first run rga is mostly faster because of multithreading, but on subsequent runs (with the same files but any regex query) rga will cache the text extraction, so it becomes almost as fast as searching in plain text files. All runs were done with a warm FS cache.

Other files

rga will recursively descend into archives and match text in every file type it knows.

Here is an example directory with different file types:

demo
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub

(see the actual directory here)

~$ rga "hello" demo/

demo/greeting.mkv
metadata: chapters.chapter.0.tags.title="Chapter 1: Hello"
00:08.398 --> 00:11.758: Hello from a movie!

demo/hello.odt
Hello from an OpenDocument file!

demo/hello.sqlite3
tbl: greeting='hello', from='sqlite database!'

demo/somearchive.zip
dir/greeting.docx: Hello from a MS Office document!
dir/inner.tar.gz: greeting.pdf: Page 1: Hello from a PDF!
greeting.epub: Hello from an E-Book!

It can even search jpg / png images and scanned pdfs using OCR, though this is disabled by default since it is not useful that often and pretty slow.

~$ # find screenshot of crates.io
~$ rga crates ~/screenshots --rga-adapters=+pdfpages,tesseract
screenshots/2019-06-14-19-01-10.png
crates.io I Browse All Crates  Docs v
Documentation Repository Dependent crates

~$ # there it is!

Setup

Linux, Windows and OSX binaries are available in GitHub releases. See the readme for more information.

For Arch Linux, I have packaged rga in the AUR: yay -S ripgrep-all

Technical details

The code and a few more details are here: https://github.com/phiresky/ripgrep-all

rga simply runs ripgrep (rg) with some options set, especially --pre=rga-preproc and --pre-glob.
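ripgrep's --pre hook is simple: for every candidate file, rg runs the given command with the file path as its argument and searches the command's stdout instead of the file itself. Here is a minimal stand-in preprocessor in Python (the script name and the CSV transform are invented for illustration; rga's real preprocessor is rga-preproc):

```python
#!/usr/bin/env python3
# Minimal stand-in for rga-preproc: ripgrep invokes the `--pre` command as
# `CMD <path>` and greps whatever the command prints to stdout.
import sys

def preprocess(path: str) -> str:
    """Return searchable text for `path`; fall back to the raw contents."""
    if path.endswith(".csv"):
        # Example transform: put each CSV field on its own line.
        with open(path, encoding="utf-8", errors="replace") as f:
            return "\n".join(field for line in f
                             for field in line.strip().split(","))
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.write(preprocess(sys.argv[1]))
```

You would wire it up with something like `rg --pre ./pre.py --pre-glob '*.csv' pattern dir/`, where --pre-glob limits which files are routed through the (potentially slow) preprocessor.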

rga-preproc [fname] will match an "adapter" to the given file based on either its filename or its mime type (if --rga-accurate is given). You can see all built-in adapters in src/adapters.
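A rough sketch of that dispatch logic (the adapter names below are hypothetical; the real table lives in src/adapters, and --rga-accurate sniffs the mime type from file content rather than guessing it from the name):

```python
# Hypothetical, simplified adapter table: by extension, and by mime type
# for "accurate" matching.
from typing import Optional
import mimetypes

ADAPTERS_BY_EXT = {".pdf": "poppler", ".docx": "pandoc", ".zip": "zip"}
ADAPTERS_BY_MIME = {"application/pdf": "poppler", "application/zip": "zip"}

def pick_adapter(filename: str, accurate: bool = False) -> Optional[str]:
    """Match an adapter by extension, or by mime type in accurate mode."""
    if accurate:
        # Stdlib stand-in: guess the mime type from the name. The real
        # --rga-accurate inspects the file's content bytes instead.
        mime, _ = mimetypes.guess_type(filename)
        if mime in ADAPTERS_BY_MIME:
            return ADAPTERS_BY_MIME[mime]
    for ext, adapter in ADAPTERS_BY_EXT.items():
        if filename.endswith(ext):
            return adapter
    return None
```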

Some rga adapters run external binaries to do the actual work (such as pandoc or ffmpeg), in most cases by writing to stdin and reading from stdout. Others use a Rust library or bindings to achieve the same effect (like sqlite or zip).
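In Python terms, that stdin/stdout pattern looks like this. The upper-casing "extractor" below is a placeholder so the sketch runs anywhere; a real adapter would invoke a tool like pandoc or ffmpeg here:

```python
# Sketch of an adapter that shells out to an external tool: write the file
# bytes to the child's stdin, read the extracted text from its stdout.
import subprocess
import sys

def run_external_adapter(cmd: list, data: bytes) -> bytes:
    """Pipe `data` through the external command and return its stdout."""
    proc = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return proc.stdout

# Placeholder "extractor": any filter that reads stdin and writes stdout works.
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(run_external_adapter(upper, b"hello from a fake document").decode())
```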

To read archives, the zip and tar libraries are used, which work fully in a streaming fashion - this means that the RAM usage is low and no data is ever actually extracted to disk!

Most adapters read the files from a Read, so they work completely on streamed data (which may come from anywhere, including from inside nested archives).
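The same streaming idea can be sketched with Python's stdlib, whose zipfile and tarfile modules also accept arbitrary file-like objects, so a nested tar.gz can be read straight out of a zip member without any temp files (the demo file names here are invented):

```python
# Sketch: read a file nested inside archives without touching disk.
import io
import tarfile
import zipfile

def build_demo_zip() -> bytes:
    """Build demo.zip containing inner.tar.gz containing greeting.txt."""
    inner = io.BytesIO()
    with tarfile.open(fileobj=inner, mode="w:gz") as tar:
        payload = b"Hello from a nested file!"
        info = tarfile.TarInfo("greeting.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as z:
        z.writestr("inner.tar.gz", inner.getvalue())
    return outer.getvalue()

def read_nested(zip_bytes: bytes) -> bytes:
    """Stream greeting.txt out of the tar.gz inside the zip, all in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        with z.open("inner.tar.gz") as member:
            # mode "r|gz" is forward-only streaming: no seeking required
            with tarfile.open(fileobj=member, mode="r|gz") as tar:
                for entry in tar:
                    return tar.extractfile(entry).read()
    return b""
```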

During extraction, rga-preproc will compress the data with ZSTD to a memory cache while simultaneously writing it uncompressed to stdout. After completion, if the memory cache is smaller than 2MByte, it is written to an rkv cache. The cache is keyed by (adapter, filename, mtime), so if a file changes its content is extracted again.
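A toy version of that caching scheme (zlib and an in-process dict stand in for ZSTD and rkv, neither of which is in Python's stdlib):

```python
# Sketch of the cache described above: compress extracted text, persist it
# only if it stays under a size limit, keyed by (adapter, filename, mtime).
import zlib

CACHE_LIMIT = 2 * 1024 * 1024  # 2 MByte, as in rga
cache = {}  # (adapter, filename, mtime) -> compressed bytes

def extract_with_cache(adapter: str, filename: str, mtime: float,
                       extract) -> bytes:
    key = (adapter, filename, mtime)  # changed mtime means a cache miss
    if key in cache:
        return zlib.decompress(cache[key])
    text = extract()                  # the expensive extraction step
    compressed = zlib.compress(text)
    if len(compressed) < CACHE_LIMIT:
        cache[key] = compressed
    return text
```

Keying on mtime means a touched or edited file is transparently re-extracted, while repeat queries over unchanged files skip extraction entirely, which is where the "subsequent runs" speedup in the benchmark above comes from.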

Future Work

  • I wanted to add an image adapter (based on object classification / detection) for fun, so you could grep for "mountain" and it would show pictures of mountains, like in Google Photos. It worked with YOLO, but something more useful and state-of-the-art like this proved very hard to integrate.
  • 7z adapter (I couldn't find a usable Rust library with streaming support)
  • Allow per-adapter configuration options (probably via env (RGA_ADAPTERXYZ_CONF=json))
  • Maybe use a different disk kv-store as a cache instead of rkv, because I had some weird problems with it. SQLite is great. All other Rust alternatives I could find don't allow writing from multiple processes.
  • Tests!
  • There are some more (mostly technical) todos in the code I don't know how to fix. Help wanted.
  • Other starting points
  • pdfgrep
  • this gist has my proof of concept version of a caching extractor to use ripgrep as a replacement for pdfgrep.
  • this gist is a more extensive preprocessing script by @ColonolBuendia
  • lesspipe is a tool to make less work with many different file types. Different use case, but similar in what it does.
