Beep Boop Bip

File 163696117380.jpg - (196.22KB , 707x1000 , f57d134f27578d7936a88ca395828b04.jpg )
No. 2493
Might as well make a general thread for this.
Article on search engines with their own index:
https://seirdy.one/2021/03/10/search-engines-with-own-indexes.html

I'm interested in making and hosting a curated search engine (hand-picked domains to index) with the following feature set:

Site specific searching
Date range specific searching
Exact string mandating
Image Search
Maybe document type specific searching too
Maybe a synonym system

After hours of research, I still have no idea where to start: whether it'd be better to make everything from scratch or cobble together things that already exist, and what, if anything, I'd need to make myself with the latter option. This post >>/ot/36920 pointed me in a general direction, but both Solr and Elasticsearch seem meant for searching internal text-based documents. I found virtually no information about using them in this kind of context.

There are also licensing concerns with Elasticsearch, though it's popular and there's a domain-specific web crawler intended to be easily used with it (ACHE). There's also the issue of UI....

I'm surprised there's no preexisting solution for my use case. Yacy seems to be the closest, but pretty much nobody who's talked about it likes it.
>> No. 2495
Great thread idea OP. Here are my 2¢:

I believe that Solr does in fact support crawling (it didn't always though, so what you find online might be out of date). I believe that Lucene also supports image similarity via some plugins, and it should definitely handle all of the other things you mentioned.

Yacy seems interesting, it's built on top of Solr if I read correctly, and it might be worth a shot trying it out just to see if it is something that already fits your bill, especially since it provides an out-of-the-box solution for the frontend whereas you will almost certainly need to write your own if you use Solr directly.

The thing about Solr is that it's got a billion knobs and dials to tweak (real enterprise grade software™), so even though you won't have to reinvent the wheel by implementing query rewrite and ranking, you'll still have to spend a ton of time configuring it for your particular use-case. Solr might be something of an overkill for your needs, but on the flip side, because it's so customizable you will get very high quality results and can e.g. add image search support very easily.

Keep in mind that any system that is meant for searching internal text based documents can support your use-case as well, you just need to throw a crawler in front of it that will convert websites into "documents" that can be indexed by the search engine.
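
As a rough sketch of that "convert websites into documents" step, here is one way to turn a fetched page into a plain record using only Python's standard library. The `PageToDocument` class, the `html_to_document` helper, and the url/title/body record shape are all made up for illustration; a real crawler would also need fetching, encoding detection, and boilerplate removal.

```python
from html.parser import HTMLParser


class PageToDocument(HTMLParser):
    """Collect the <title> text and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.text_parts.append(data)


def html_to_document(url, html):
    """Return a dict that any full-text indexer could ingest (hypothetical shape)."""
    parser = PageToDocument()
    parser.feed(html)
    parser.close()
    # Text chunks are joined with spaces and whitespace is collapsed.
    # This can split a word that is broken up by inline tags.
    title = " ".join("".join(parser.title_parts).split())
    body = " ".join(" ".join(parser.text_parts).split())
    return {"url": url, "title": title, "body": body}
```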

I did some searching and didn't really find any pre-baked stacks for a local search engine other than Yacy, like you mentioned. But I will keep looking.

One other alternative, which I consider the 95% solution, is to use some crawler combined with SQLite's full-text search capability. I'm a big fan of this option since it's very simple, very self-contained, and easy to hack on. It already satisfies most of your criteria except image search and synonyms. The downside is that you're limited by the ranking algorithms in SQLite's FTS, so there's a ceiling on how far you can push this. Maybe you can start with this, and see if it's good enough?
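
To make the SQLite idea concrete, here's a minimal sketch, assuming your Python's bundled sqlite3 was compiled with FTS5 (usually the case, but not guaranteed). The table layout, the `site` and `fetched` columns, and the `open_index`/`add_page`/`search` helpers are my own invention for illustration. Note that `fetched` is the crawl date, not the page's publication date, so it's only a stand-in for real date range searching.

```python
import sqlite3


def open_index(path=":memory:"):
    db = sqlite3.connect(path)
    # url/site/fetched are stored but not tokenized; title/body are searchable.
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5("
        "url UNINDEXED, site UNINDEXED, fetched UNINDEXED, title, body)"
    )
    return db


def add_page(db, url, site, fetched, title, body):
    # fetched is an ISO date string such as "2021-11-16" so that
    # plain string comparison orders dates correctly.
    db.execute(
        "INSERT INTO pages (url, site, fetched, title, body) VALUES (?, ?, ?, ?, ?)",
        (url, site, fetched, title, body),
    )


def search(db, query, site=None, since=None, until=None, limit=10):
    # query uses FTS5 syntax; wrap words in double quotes for an exact phrase.
    sql = "SELECT url, title FROM pages WHERE pages MATCH ?"
    params = [query]
    if site is not None:
        sql += " AND site = ?"
        params.append(site)
    if since is not None:
        sql += " AND fetched >= ?"
        params.append(since)
    if until is not None:
        sql += " AND fetched <= ?"
        params.append(until)
    # bm25() returns smaller values for better matches, so ascending order works.
    sql += " ORDER BY bm25(pages) LIMIT ?"
    params.append(limit)
    return db.execute(sql, params).fetchall()
```

Something like `search(db, '"accusative case"', site="tohno-chan.com")` would then cover the site-specific and exact-string cases from the OP's list.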
>> No. 2496
>>2495
Yeah after looking at this further I see only 3 options forward:

* Use Yacy, which is the only end-to-end deployable solution I've found.
* Try wrangling Solr into what you want. You will have to do more research here on how to set up the crawler, and will have to write some sort of frontend to call into Solr and return the results. Pros: it's very extensible, you will get very high quality results, it's very scalable. Cons: Hard to set up, probably overkill for your use-case?
* Build your own, cobbling together a crawler and a full-text search engine. My suggestion of crawler+sqlite FTS is the easiest and most self-contained one I could think of, but there are dozens of full-text search engines out there, some more full-featured than others. Some are integrated with a database, some only serve as indexes. Another option for an FTS engine is https://www.lesbonscomptes.com/recoll/ and it even seems to have a nice GUI. Typesense and sonic are other lightweight options I've seen. Another thing to take a look at is https://github.com/dainiusjocas/lucene-grep which you might be able to use as a starting point to hack on.

When you do crawling, you'll have to decide what type of crawling you want to do. E.g. whether you want to recrawl sites to update data with a fresh copy, or if you're satisfied with just crawling once at add time and then using that forever. If what you're indexing is mostly static content that will not change, the latter might be fine. You also have to consider the choice of database, and the intermediate crawl format. E.g. you could just save files as HTML directly to disk, and just have an index of URL to filename and additional metadata that you use to avoid recrawling. That would be the simplest. Maybe even stick that in an sqlite db. Depending on the crawler you use, some of this may already be decided for you.
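
Here's roughly what that "HTML on disk plus a URL-to-filename index in sqlite" bookkeeping could look like. Everything here (the `crawl` table, the hashed filenames, the `needs_fetch`/`store_page` helpers, the `max_age` parameter) is an assumption of mine rather than what any particular crawler does. Passing `max_age=None` gives the crawl-once-and-keep-forever behaviour; a number of seconds gives periodic recrawling.

```python
import hashlib
import sqlite3
import time
from pathlib import Path


def open_crawl_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS crawl ("
        "url TEXT PRIMARY KEY, filename TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return db


def needs_fetch(db, url, max_age=None, now=None):
    """True if url was never crawled, or its stored copy is older than max_age seconds."""
    row = db.execute("SELECT fetched_at FROM crawl WHERE url = ?", (url,)).fetchone()
    if row is None:
        return True
    if max_age is None:
        return False  # crawl once, use that copy forever
    now = time.time() if now is None else now
    return now - row[0] > max_age


def store_page(db, root, url, html, now=None):
    """Write the page to disk under a hashed name and record it in the index."""
    now = time.time() if now is None else now
    # Hashing the URL avoids having to sanitize it into a valid filename.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / name).write_text(html, encoding="utf-8")
    db.execute(
        "INSERT OR REPLACE INTO crawl (url, filename, fetched_at) VALUES (?, ?, ?)",
        (url, name, now),
    )
    db.commit()
    return name
```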
>> No. 2498
File 163711650698.png - (126.62KB , 600x620 , ed9d35039f7bcc1b0e0e2c70123cbe76.png )
>>2495
>>2496
Thank you for the information and suggestions.

>you will almost certainly need to write your own if you use Solr directly
I wonder if Searx could work. Searx is marketed as a proxy for common search engines, but I've seen at least one case of yacy being used as a source for results. So maybe it could receive things directly from an index which has no front end of its own?
https://linuxreviews.org/YaCy#Searx.2BYacy:_A_Huge_Disappointment

Post edited on 16th Nov 2021, 6:57pm
>> No. 2499
>>2498
I don't think using searx would give you much in this case, since it is only a presentation layer. See this PR, for instance, which shows how to use searx as a frontend for recoll: https://github.com/searx/searx/pull/1257/files

So at best you can use searx as a nice UI for whatever solution you come up with, but that's not really the hard part here.

>So maybe it could receive things directly from an index which has no front end of its own?
To be clear, searx is just a pretty UI. All it will do is query a backend for you and return the results in a pretty way. I don't think it is capable of doing any sort of ranking on its own, and certainly will not handle the underlying FTS process.

To make a search engine, you need an underlying data source, an index of the data (e.g. a posting list in the simplest case), then retrieval algorithms on top of that (query rewrite, stemming etc.), then ranking algorithms to sort the returned results. Luckily there are already projects out there which have taken care of building these things; the only thing left is putting them together.
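
To show how small the core of that pipeline is, here's a toy version: a posting list (term to documents), AND retrieval, and a crude tf-idf ranking. This is only a sketch under my own assumptions: the `TinyIndex` class and its scoring formula are invented for illustration, there is no stemming or query rewrite, and each document is assumed to be added exactly once. Real engines like Lucene do all of this far better.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    def __init__(self):
        # Posting list: term -> {doc_id: how often the term occurs in that doc}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        terms = tokenize(query)
        if not terms:
            return []
        # Retrieval: keep only documents that contain every query term.
        hits = set.intersection(*(set(self.postings.get(t, {})) for t in terms))
        # Ranking: sum of term frequency times a simple inverse document frequency.
        scores = {}
        for doc_id in hits:
            scores[doc_id] = sum(
                self.postings[t][doc_id]
                * math.log(1 + self.doc_count / len(self.postings[t]))
                for t in terms
            )
        # Best score first; ties broken by doc_id for stable output.
        return sorted(scores, key=lambda d: (-scores[d], d))
```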

Btw good post on Yacy, that kind of confirms my fear that using Solr for this is basically enterprise-tier overkill. We're not building a distributed fault-tolerant system like Google here, we just want something you can chuck on a raspberry pi connected to a NAS to crawl and index a few thousand sites at most. Sadly I think that Solr/Lucene probably has the best algorithms for querying and ranking that you'll be able to find. So I guess to me that indicates that suggestion #3 is the way forward: try to concoct your own search stack by combining a crawler coupled with either sqlite FTS, Lucene (which seems less daunting than Solr as shown by that lucene-grep project), or one of the other FTS engines I mentioned.
>> No. 2500
>>2499
>So at best you can use searx as a nice UI for whatever solution you come up with
So, uh, is there more to the front-end than that? Where does the front-end I'd have to make myself if I chose solr begin? Is it just the configuration for solr?

As a side note, people like redis....
https://redis.com/modules/redis-search/

Post edited on 16th Nov 2021, 9:48pm
>> No. 2501
>>2500
>So, uh, is there more to the front-end than that? Where does the front-end I'd have to make myself if I chose solr begin? Is it just the configuration for solr?

Well, like I said, it's pretty much just the UI layer I'm referring to when I say frontend. E.g. the thing that takes the JSON of results or whatever and lays them out nicely. This is really the most trivial part, which is why I said you shouldn't focus on it. But yes, once you get to the point where your search stack is up and running and you can send it a request and have it return the results in some formatted manner (either a bunch of rows or JSON or whatever), then you can throw a searx frontend on top of that.
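
Just to illustrate how thin that layer is, a frontend in this sense can be as little as the following. The JSON shape (a list of objects with `url`, `title`, `snippet`) and the `render_results` helper are hypothetical; your backend will return whatever it returns.

```python
import html
import json


def render_results(results_json):
    """Turn a JSON list of hits into an HTML ordered list."""
    results = json.loads(results_json)
    items = []
    for r in results:
        # Escape everything that came from crawled pages before emitting HTML.
        url = html.escape(r["url"], quote=True)
        title = html.escape(r.get("title") or r["url"])
        snippet = html.escape(r.get("snippet", ""))
        items.append(f'<li><a href="{url}">{title}</a><p>{snippet}</p></li>')
    return "<ol>\n" + "\n".join(items) + "\n</ol>"
```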

As for what you need to do to get Solr to work for your use-case, I have no idea. I've never used Solr. But if it's anything like other enterprise scale™ Apache software then it's going to be a lot of config files. Maybe start with some tutorials and just see if you can get it to index some local files. Once you get that working, see how you can integrate a crawler to fetch the documents. I've read that Solr has an inbuilt crawler, but that Nutch also used to be the de facto choice to use with Solr.

I found this https://www.cs.toronto.edu/~muuo/blog/build-yourself-a-mini-search-engine/ which might be a good start, but it wants you to train your own ranker(!?!) Maybe google around and try to find some other tutorials, and you'll probably have to read a lot on Solr configuration in order to figure out the right dials and knobs to set.
>> No. 2511
Some other alternate search engines: gigablast and marginalia. Both are good if you actually want to _explore_ the web without every result being SEO gamed linkbait.

At this point Google's only strength is the size of their index. Their query rewriting and "NLP" based ranking make actually finding organic content terrible. I'm sure a part of this is due to the prevalence of SEO tactics, but if they really cared they could easily penalize this: they could trivially apply transformer based techniques to detect superficial fluff content, or content that's just copied from more authoritative sources.
>> No. 3392
File 171170458641.jpg - (765.30KB , 2149x3034 , 39fc90f87f1ebc40d2bbfded6985bde2.jpg )
Since posting this thread, I wrote an imageboard engine, and now I feel like coming back to it. I want the index to be distributed, and I'm debating between two approaches: BitTorrent's DHT, or a distributed database like CockroachDB. I'm leaning towards the latter. From what I understand, the entire DHT would need to be crawled, and that would have to be filtered to leave only the "index". It just doesn't seem suited to the problem.

CockroachDB is key-value, but it's compatible with PostgreSQL's FTS syntax, so it could be self-contained.
https://www.cockroachlabs.com/docs/stable/full-text-search#how-does-full-text-search-work

Some kind of federation scheme, so users could opt into multiple, curated indexes, would also be nice.
>> No. 3396
File 171290144159.png - (2.01MB , 888x1245 , 116290029_p0.png )
I made something bare-bones that partially crawled lainchan's webring, along with a few other sites. It crawls seed URLs to a depth of 3, and does a single-page scrape for cross-domain links. I underestimated how slow and computationally expensive web crawling is, but the result seems somewhat promising.

https://codeberg.org/vodahn/Tektite
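
For anyone following along, the crawl policy described above can be sketched like this. To be clear, this is NOT Tektite's code, just my reading of "depth 3 on the seed's own domain, one page only for cross-domain links". The `crawl` and `extract_links` names are made up, and `fetch` is a placeholder you supply (a function taking a URL and returning HTML or None); a real one needs timeouts, robots.txt handling, and rate limiting, which is where most of the slowness comes from.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, with #fragments removed."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urldefrag(urljoin(base_url, href)).url
        if urlsplit(url).scheme in ("http", "https"):
            links.append(url)
    return links


def crawl(seed, fetch, max_depth=3):
    """Breadth-first crawl. fetch(url) must return an HTML string or None."""
    seed_host = urlsplit(seed).netloc
    seen = {seed}
    queue = deque([(seed, 0, True)])  # (url, depth, follow its links?)
    pages = {}
    while queue:
        url, depth, follow = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if not follow or depth >= max_depth:
            continue
        for link in extract_links(url, html):
            if link in seen:
                continue
            seen.add(link)
            # Cross-domain pages are fetched once but their links are not followed.
            same_site = urlsplit(link).netloc == seed_host
            queue.append((link, depth + 1, same_site))
    return pages
```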
>> No. 3397
File 171308458420.jpg - (2.33MB , 1158x1637 , 108913776_p0.jpg )
If you have an IPv6 address, here's an instance you can try out.
http://[2a07:e03:3:d3::1]:1025/

It's usable for independent use, so figuring out index sharing comes next. I really underestimated how lengthy a process crawling is.
>> No. 3398
This link should be usable by anyone:
http://fukumen.mooo.com:1025/
>> No. 3495
In practice, if you want to find something then trying google+bing+yandex+mojeek+brave seems to be the best option at the moment. And even then none of the search engines were able to pass the test in >>42986 of

>site:tohno-chan.com "accusative case"

In fact I think mojeek's search is just broken, I can clearly see they have some stuff in the index but when I search for the exact text in the returned snippets I get nothing.

And if none of them can properly index or return queries for a completely server-side plain html page like TC, I don't have high hopes for anything more complex.

Orange bar site shills Kagi a lot, but in my experience that's mostly just an amalgamation of google+bing+yandex; they don't actually bring anything new to the table you couldn't do with a searx instance, besides maybe convenience.
>> No. 3496
>>3495
I use 4get in my daily life, which unlike searx, supports yandex.
https://git.lolcat.ca/lolcat/4get
>> No. 3497
>>3496
I used it for a while but the captchas started getting annoying.
>> No. 3498
>>3497
Not every instance has captchas.
>> No. 3499
You can make a hundred meta search engines, but the AI will still craft SEO so it bypasses filters. I used 4get on ddg and the amount of shit was pretty much the same.