Stars: 213
Forks: 69
Pull Requests: 16
Issues: 25
Watchers: 25
Last Updated: 2022-10-24 12:27:55
:smile: Emoji synonyms to build your own emoji-capable search engine (elasticsearch, solr, OpenSearch)
License: MIT License
Languages: PHP, Makefile
https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji
Add support for emoji and flags in any Lucene compatible search engine!
If you wish to search 🍩 to find donuts in your documents, you came to the right place. We offer synonym files ready to use in an Elasticsearch or OpenSearch analyzer.
There are no requirements for Elasticsearch >= 6.7.
Version | Requirements |
---|---|
Elasticsearch >= 6.4 and < 6.7 | You need to install the official ICU Plugin. See our blog post about this change. |
Elasticsearch < 6.4 | You need our custom ICU Tokenizer Plugin, see our blog post (2016). |
Run the following test to verify that you get 4 EMOJI tokens:
GET _analyze
{
"text": ["🍩 🇫🇷 👩🚒 🚣🏾♀"]
}
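If you want to check the result programmatically, a minimal Python sketch could count the tokens by type in the `_analyze` response. It assumes the response is a JSON object with a `tokens` list whose entries carry a `type` field, and that emoji tokens are reported with the type `<EMOJI>`. The helper name `count_tokens_of_type` and the sample response are hypothetical and hand-written for illustration; they are not real output from a cluster.

```python
def count_tokens_of_type(analyze_response, token_type="<EMOJI>"):
    """Count tokens of a given type in an _analyze response body (a dict)."""
    tokens = analyze_response.get("tokens", [])
    return sum(1 for token in tokens if token.get("type") == token_type)


# Hand-written, illustrative response; real responses also carry offsets.
sample_response = {
    "tokens": [
        {"token": "🍩", "type": "<EMOJI>", "position": 0},
        {"token": "donut", "type": "<ALPHANUM>", "position": 1},
    ]
}

emoji_count = count_tokens_of_type(sample_response)
print(emoji_count)  # 1 for this hand-written sample
```

For the test request above, the same helper should report 4 if the tokenizer handles emoji as described.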
What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.
We build Solr / Lucene compatible synonym files for all languages supported by Unicode CLDR so you can set them up in an analyzer. They look like this:
👩🚒 => 👩🚒, firefighter, firetruck, woman
👩✈ => 👩✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom
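To make the file format concrete, here is a small Python sketch that reads lines of the form `emoji => emoji, word, word` into a dictionary. It is only an illustration of the explicit-mapping syntax shown above, not how Elasticsearch loads the file. The function names are hypothetical, and the sketch treats the left-hand side as a single term, which is enough for these files but not for the full Solr synonym syntax.

```python
def parse_synonym_line(line):
    """Parse one 'source => target1, target2' line, or return None."""
    line = line.strip()
    # Skip blank lines and '#' comments.
    if not line or line.startswith("#"):
        return None
    source, separator, targets = line.partition("=>")
    if not separator:
        return None  # not an explicit mapping; ignored in this sketch
    expansions = [t.strip() for t in targets.split(",") if t.strip()]
    return source.strip(), expansions


def load_synonyms(lines):
    """Build a dict mapping each emoji to its list of expansions."""
    table = {}
    for line in lines:
        parsed = parse_synonym_line(line)
        if parsed is not None:
            source, expansions = parsed
            table[source] = expansions
    return table


sample = [
    "# two lines taken from the example above",
    "🥓 => 🥓, bacon, meat, food",
    "🥔 => 🥔, potato, vegetable, food",
]
synonyms = load_synonyms(sample)
print(synonyms["🥓"])  # ['🥓', 'bacon', 'meat', 'food']
```

Note that the emoji itself appears on the right-hand side, so the original token is kept alongside the words it expands to.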
For emoticons, use this mapping with a char_filter to replace emoticons with emoji.
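As a rough sketch of what such a char_filter could look like, the Python snippet below builds an index settings body that uses the built-in `mapping` character filter with a `mappings_path`. The filter name `emoticons_char_filter`, the analyzer name `with_emoticons`, and the path `analysis/emoticons.txt` are assumptions made for illustration; adapt them to your own setup.

```python
import json


def emoticon_analysis_settings(mappings_path="analysis/emoticons.txt"):
    """Return an "analysis" settings body with an emoticon char_filter."""
    return {
        "analysis": {
            "char_filter": {
                # Hypothetical name; "mapping" is the built-in filter type.
                "emoticons_char_filter": {
                    "type": "mapping",
                    "mappings_path": mappings_path,
                }
            },
            "analyzer": {
                # Hypothetical analyzer: char filters run before the tokenizer.
                "with_emoticons": {
                    "type": "custom",
                    "char_filter": ["emoticons_char_filter"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }


settings = emoticon_analysis_settings()
print(json.dumps({"settings": settings}, indent=2))
```

Because the char_filter rewrites the text before tokenization, an emoticon replaced this way can then be expanded by an emoji synonym filter placed in the same analyzer.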
Download the emoji and emoticon file you want from this repository and store them in PATH_TO_ES/config/analysis (or anywhere Elasticsearch can read).
config
├── analysis
│ ├── cldr-emoji-annotation-synonyms-en.txt
│ └── emoticons.txt
├── elasticsearch.yml
...
Use them like this (this is a complete English example with Elasticsearch >= 6.7):
PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}
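The `emoji_variation_selector_filter` in this example removes the Unicode variation selectors U+FE0E (text presentation) and U+FE0F (emoji presentation) from each token, presumably so that an emoji matches its synonym entry whether or not a selector follows it. The Python sketch below reproduces the same pattern outside Elasticsearch to show its effect; the helper name is hypothetical.

```python
import re

# Same pattern as the filter: U+FE0E (text style) or U+FE0F (emoji style).
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")


def strip_variation_selectors(token):
    """Remove variation selectors so both forms of an emoji compare equal."""
    return VARIATION_SELECTORS.sub("", token)


heart_with_selector = "\u2764\uFE0F"  # heavy black heart + emoji selector
stripped = strip_variation_selectors(heart_with_selector)
print(len(heart_with_selector), len(stripped))  # 2 1
```

In the analyzer this filter is listed before `english_emoji`, so the synonym lookup sees the token with the selector already removed.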
You can now test the result with:
GET tweets/_analyze
{
"field": "content",
"text": "🍩 🇫🇷 👩🚒 🚣🏾♀"
}
You will need:
Edit the tag in tools/build-released.php and run php tools/build-released.php.
Run php tools/build-emoticon.php.
Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications are made to the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).
This repository is distributed under the MIT License. Feel free to use and contribute as you please!