rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, zip, tar.*, movie subtitles (mkv, mp4), etc.
Say you have a large folder of papers or lecture slides, and you can't remember which one of them mentioned GRUs. With rga, you can just run this:
```
~$ rga "GRU" slides/

slides/2016/winter1516_lecture14.pdf
Page 34: GRU LSTM
Page 35: GRU CONV
Page 38: - Try out GRU-RCN! (imo best model)

slides/2018/cs231n_2018_ds08.pdf
Page 3: ● CNNs, GANs, RNNs, LSTMs, GRU
Page 35: ● 1) temporal pooling 2) RNN (e.g. LSTM, GRU)

slides/2019/cs231n_2019_lecture10.pdf
Page 103: GRU [Learning phrase representations using rnn
Page 105: - Common to use LSTM or GRU
```
and it will recursively find a string in pdfs, including if some of them are zipped up.
You can do mostly the same thing with `pdfgrep -r`, but you will miss content in other file types and it will be much slower:
[Benchmark chart: searching in 65 pdfs with 93 slides each; run time in seconds, lower is better]
On the first run rga is mostly faster because of multithreading, but on subsequent runs (with the same files but any regex query) rga will cache the text extraction, so it becomes almost as fast as searching in plain text files. All runs were done with a warm FS cache.
rga will recursively descend into archives and match text in every file type it knows.
Here is an example directory with different file types:
```
demo
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub
```
(see the actual directory here)
```
~$ rga "hello" demo/

demo/greeting.mkv
metadata: chapters.chapter.0.tags.title="Chapter 1: Hello"
00:08.398 --> 00:11.758: Hello from a movie!

demo/hello.odt
Hello from an OpenDocument file!

demo/hello.sqlite3
tbl: greeting='hello', from='sqlite database!'

demo/somearchive.zip
dir/greeting.docx: Hello from a MS Office document!
dir/inner.tar.gz: greeting.pdf: Page 1: Hello from a PDF!
greeting.epub: Hello from an E-Book!
```
It can even search jpg / png images and scanned pdfs using OCR, though this is disabled by default since it is not useful that often and pretty slow.
```
~$ # find screenshot of crates.io
~$ rga crates ~/screenshots --rga-adapters=+pdfpages,tesseract
screenshots/2019-06-14-19-01-10.png
crates.io I Browse All Crates Docs v
Documentation Repository Dependent crates
~$ # there it is!
```
Linux, Windows and OSX binaries are available in GitHub releases. See the readme for more information.
For Arch Linux, I have packaged `rga` in the AUR:

```
yay -S ripgrep-all
```
The code and a few more details are here: https://github.com/phiresky/ripgrep-all
rga simply runs ripgrep (`rg`) with some options set, especially `--pre=rga-preproc` and `--pre-glob '*'`.

`rga-preproc [fname]` will match an "adapter" to the given file based on either its filename or its mime type (if `--rga-accurate` is given). You can see all adapters currently included in src/adapters.
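ripgrep's `--pre` mechanism is what makes this work: for every file matching `--pre-glob`, `rg` runs the given command with the file path as its argument and searches the command's stdout instead of the raw file. Here is a minimal sketch of that mechanism with a toy preprocessor script (the script path and its lowercasing behavior are made up for illustration; they are not what rga-preproc does):

```shell
# Toy stand-in for rga-preproc: rg would call this once per file and
# search whatever it prints to stdout.
cat > /tmp/toy-preproc.sh <<'EOF'
#!/bin/sh
# Lowercase the file contents before they are searched.
tr 'A-Z' 'a-z' < "$1"
EOF
chmod +x /tmp/toy-preproc.sh

# What rga effectively does under the hood (requires ripgrep installed):
#   rg --pre /tmp/toy-preproc.sh --pre-glob '*' "some pattern" somedir/

# Demonstrate the preprocessor on its own:
printf 'HELLO World\n' > /tmp/toy-input.txt
/tmp/toy-preproc.sh /tmp/toy-input.txt   # prints "hello world"
```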
Some rga adapters run external binaries to do the actual work (such as pandoc or ffmpeg), mostly by writing to stdin and reading from stdout. Others use a Rust library or bindings to achieve the same effect (like sqlite or zip).
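The spawn-a-binary pattern is ordinary pipe plumbing. A sketch in miniature, with `tr` standing in for a real extractor like pandoc (the function name and file are mine, purely for illustration):

```shell
# Adapter pattern: feed the input file to the child's stdin,
# read the extracted text from its stdout.
extract_with_external_tool() {
    # $1 = input file; a real adapter would pick the binary by file type.
    # `tr` plays the role of pandoc/ffmpeg here.
    cat "$1" | tr 'a-z' 'A-Z'
}

printf 'hello from a fake docx\n' > /tmp/fake.docx
extract_with_external_tool /tmp/fake.docx   # prints "HELLO FROM A FAKE DOCX"
```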
To read archives, the tar and zip libraries are used, which work fully in a streaming fashion. This means that RAM usage is low and no data is ever actually extracted to disk!

Most adapters read the files from a `Read`, so they work completely on streamed data (which may come from anywhere, including from within nested archives).
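The streaming property can be illustrated from the command line: tar can emit a member straight to stdout without unpacking anything to disk, which mirrors what the streaming tar library gives rga internally (the example archive and paths below are made up):

```shell
# Build a small example archive in /tmp.
mkdir -p /tmp/rga-demo/dir
printf 'Hello from inside a tarball!\n' > /tmp/rga-demo/dir/greeting.txt
tar -czf /tmp/rga-demo.tar.gz -C /tmp/rga-demo dir

# -O streams the member to stdout: nothing is ever written to disk,
# so a search tool can consume the text directly.
tar -xzOf /tmp/rga-demo.tar.gz dir/greeting.txt
# prints: Hello from inside a tarball!
```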
During extraction, rga-preproc will compress the data with ZSTD into a memory cache while simultaneously writing it uncompressed to stdout. After completion, if the memory cache is smaller than 2 MByte, it is written to an rkv cache. The cache is keyed by (adapter, filename, mtime), so if a file changes, its content is extracted again.
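The invalidation logic of such a key can be sketched in shell (the key format and function are mine, not rga's actual serialization; `stat -c %Y` assumes GNU coreutils):

```shell
# Cache key = (adapter, filename, mtime). If the file's mtime changes,
# the key changes, so the stale cached extraction is never returned.
cache_key() {
    adapter="$1"; file="$2"
    printf '%s:%s:%s' "$adapter" "$file" "$(stat -c %Y "$file")"
}

printf 'x' > /tmp/cached.pdf
k1=$(cache_key pdfpages /tmp/cached.pdf)
touch -t 203001010000 /tmp/cached.pdf   # simulate the file being modified
k2=$(cache_key pdfpages /tmp/cached.pdf)
[ "$k1" != "$k2" ] && echo "cache entry invalidated"
```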
Future plans / ideas:

- I wanted to add a photo adapter (based on object classification / detection) for fun, so that you could grep for "mountain" and it would show pictures of mountains, like in Google Photos. It worked with YOLO, but something more useful and state-of-the-art like this proved very hard to integrate.
- 7z adapter (couldn't find a usable Rust library with streaming support)
- Allow per-adapter configuration options (probably via env vars (RGA_ADAPTERXYZ_CONF=json))
- Maybe use a different disk kv-store as a cache instead of rkv, because I had some weird problems with it. SQLite is great. All other Rust alternatives I could find don't allow writing from multiple processes.
- There are some more (mostly technical) todos in the code that I don't know how to fix. Help wanted.
- Other starting points:
- this gist has my proof-of-concept version of a caching extractor to use ripgrep as a replacement for pdfgrep.
- this gist is a more extensive preprocessing script by @ColonolBuendia
- lesspipe is a tool to make `less` work with many different file types. Different use case, but similar in what it does.