jolicode/emoji-search

Stars: 213

Forks: 69

Pull Requests: 16

Issues: 25

Watchers: 25

Last Updated: 2022-10-24 12:27:55

#emoji #elasticsearch #plugin #emoticons #cldr #analyzer #elasticsearch-plugin #hacktoberfest #opensearch

😄 Emoji synonyms to build your own emoji-capable search engine (Elasticsearch, Solr, OpenSearch)

License: MIT License

Languages: PHP, Makefile

https://jolicode.com/blog/elasticsearch-icu-now-understands-emoji

🙂 Emoji, flags & emoticons support for Elasticsearch

Add support for emoji and flags in any Lucene compatible search engine!

If you want to search for 🍩 and find donuts in your documents, you have come to the right place. We offer synonym files ready to use in Elasticsearch and OpenSearch analyzers.

Requirements to index emoji in Elasticsearch

There are no requirements for Elasticsearch >= 6.7.

Requirements for older versions of Elasticsearch:

Version	Requirements
Elasticsearch >= 6.4 and < 6.7	You need to install the official ICU Plugin. See our blog post about this change.
Elasticsearch < 6.4	You need our custom ICU Tokenizer Plugin, see our blog post (2016).

Run the following test to verify that you get 4 EMOJI tokens:

GET _analyze
{
  "text": ["🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"]
}
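If you want to script this check, the following is a minimal sketch that counts emoji tokens in an `_analyze` response. It assumes the response has the usual `tokens` list and that emoji tokens carry the type `<EMOJI>`; the `count_emoji_tokens` helper and the sample response are hypothetical, written by hand for illustration and not captured from a real cluster.

```python
def count_emoji_tokens(analyze_response):
    """Count tokens whose type is <EMOJI> in an _analyze response body."""
    tokens = analyze_response.get("tokens", [])
    return sum(1 for token in tokens if token.get("type") == "<EMOJI>")


# Hand-written sample shaped like an _analyze response (illustrative only).
sample_response = {
    "tokens": [
        {"token": "🍩", "type": "<EMOJI>", "position": 0},
        {"token": "🇫🇷", "type": "<EMOJI>", "position": 1},
        {"token": "👩‍🚒", "type": "<EMOJI>", "position": 2},
        {"token": "🚣🏾‍♀", "type": "<EMOJI>", "position": 3},
    ]
}

if count_emoji_tokens(sample_response) == 4:
    print("OK: 4 EMOJI tokens")
else:
    print("Unexpected token count, check your tokenizer setup")
```

In practice you would pass the parsed JSON body returned by your HTTP client instead of `sample_response`.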

The Synonyms, flags and emoticons

What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.

We build Solr / Lucene compatible synonym files in all languages supported by Unicode CLDR so you can set them up in an analyzer. They look like this:

👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
👩‍✈ => 👩‍✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom
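To make the file format concrete, here is a minimal sketch that parses lines like the ones above into a dictionary. The `parse_synonym_line` helper is hypothetical and only handles the explicit `=>` mapping form shown here, plus `#` comment lines; it is not part of this repository.

```python
def parse_synonym_line(line):
    """Parse one explicit mapping such as '🥓 => 🥓, bacon, meat, food'.

    Returns (emoji, expansions) or None for blank, comment, or other lines.
    """
    line = line.strip()
    # Skip blank lines and '#' comments.
    if not line or line.startswith("#"):
        return None
    left, separator, right = line.partition("=>")
    if not separator:
        return None
    expansions = [term.strip() for term in right.split(",") if term.strip()]
    return left.strip(), expansions


sample = """\
🥓 => 🥓, bacon, meat, food
🇬🇧 => 🇬🇧, united kingdom
"""

# Build an emoji -> expansions lookup from the sample lines.
mapping = dict(filter(None, map(parse_synonym_line, sample.splitlines())))
print(mapping["🥓"])  # ['🥓', 'bacon', 'meat', 'food']
```

Note that the emoji itself is part of the expansion, so a document containing 🥓 still matches a search for 🥓.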

For emoticons, use the emoticon mapping file (emoticons.txt) with a char_filter to replace emoticons with emoji.
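As a rough sketch of what such a setup could look like, the snippet below builds the analysis settings for a `mapping` char filter and prints them as JSON. The names `emoticons_char_filter` and `with_emoticons` are hypothetical, and the path assumes the file is stored as `analysis/emoticons.txt` in the Elasticsearch config directory; adapt both to your own index.

```python
import json


def emoticon_analysis_settings(mappings_path="analysis/emoticons.txt"):
    """Build index analysis settings that rewrite emoticons before tokenizing."""
    return {
        "analysis": {
            "char_filter": {
                # Hypothetical filter name; "mapping" is the char filter type.
                "emoticons_char_filter": {
                    "type": "mapping",
                    "mappings_path": mappings_path,
                }
            },
            "analyzer": {
                # Hypothetical analyzer name, kept minimal on purpose.
                "with_emoticons": {
                    "char_filter": ["emoticons_char_filter"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }


settings = emoticon_analysis_settings()
print(json.dumps({"settings": settings}, indent=2))
```

Because char filters run before the tokenizer, the emoticons are already emoji by the time the synonym token filter sees them.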

Installation

Download the emoji and emoticon files you want from this repository and store them in PATH_TO_ES/config/analysis (or anywhere Elasticsearch can read).

config
├── analysis
│   ├── cldr-emoji-annotation-synonyms-en.txt
│   └── emoticons.txt
├── elasticsearch.yml
...

Use them like this (this is a complete English example for Elasticsearch >= 6.7):

PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"]
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}

You can now test the result with:

GET tweets/_analyze
{
  "field": "content",
  "text": "🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"
}
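The `emoji_variation_selector_filter` in the settings above removes the variation selectors U+FE0E and U+FE0F from tokens. Presumably this lets an emoji typed with a selector match the same synonym rule as the bare character. The sketch below reproduces that replacement in plain Python, outside Elasticsearch; `strip_variation_selectors` is a hypothetical helper name.

```python
import re

# Same pattern as the pattern_replace filter: text (FE0E) and emoji (FE0F) selectors.
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")


def strip_variation_selectors(token):
    """Remove variation selectors so both forms of an emoji compare equal."""
    return VARIATION_SELECTORS.sub("", token)


heart_with_selector = "\u2764\uFE0F"  # heart followed by the emoji presentation selector
heart_bare = "\u2764"

print(strip_variation_selectors(heart_with_selector) == heart_bare)  # True
```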

How to contribute

Build from CLDR SVN

You will need:

PHP CLI
PHP zip and curl extensions

Edit the tag in tools/build-released.php and run php tools/build-released.php.

Update emoticons

Run php tools/build-emoticon.php.

Licenses

Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications are made to the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).

This repository is distributed under the MIT License. Feel free to use and contribute as you please!

OPEN ISSUES

Mark the synonym token filter as updateable and provide a better example by @damienalexandre
Automatic emoji file update upon CLDR release by @damienalexandre

RELEASES

6.2.4 by @damienalexandre
6.1.3 by @damienalexandre
6.1.2 by @damienalexandre
5.6.6 by @damienalexandre
by @damienalexandre
5.3.0 by @damienalexandre
Fix 5.2.2 by @damienalexandre
5.2.2 by @damienalexandre
5.2.1 by @damienalexandre
5.2.0 by @damienalexandre
5.1.2 by @damienalexandre
5.1.1 by @damienalexandre
5.0.2 by @damienalexandre
5.0.1 by @damienalexandre
5.0.0 by @damienalexandre
