# SPDX-License-Identifier: AGPL-3.0-or-later
"""
DuckDuckGo WEB
~~~~~~~~~~~~~~

DDG's WEB search:

- DuckDuckGo WEB      : ``https://links.duckduckgo.com/d.js?q=..``  (HTTP GET)
- DuckDuckGo WEB no-AI: ``https://noai.duckduckgo.com/``            (HTTP GET)
- DuckDuckGo WEB html : ``https://html.duckduckgo.com/html``        (HTTP POST no-JS / form data)
- DuckDuckGo WEB lite : ``https://lite.duckduckgo.com/lite``        (HTTP POST no-JS / form data)

DDG's content search / see engine ``duckduckgo_extra.py``

- DuckDuckGo Images   : ``https://duckduckgo.com/i.js?q=...&vqd=...``
- DuckDuckGo Videos   : ``https://duckduckgo.com/v.js?q=...&vqd=...``
- DuckDuckGo News     : ``https://duckduckgo.com/news.js?q=...&vqd=...``

.. hint::

   For WEB searches and to determine the ``vqd`` value, DDG-html (no-JS) is
   used.

Special features of the no-JS services (DDG-lite & DDG-html):

- The no-JS clients receive a form that contains all the controlling parameters.
- When the form data is submitted, a real WEB browser sets the HTTP *Sec-Fetch*
  headers.

HTML ``<form>``, HTTP-Headers & DDG's bot Blocker:

  The HTTP User-Agent_ (see below) is generated by the WEB-client and is
  checked by DDG's bot blocker.

To simulate the behavior of a real browser session, it might be necessary to
evaluate additional headers.  For example, in the response from DDG, the
Referrer-Policy_ is always set to ``origin``.  A real browser would then include
the following header in the next request::

    Referer: https://html.duckduckgo.com/
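
The rule behind this header can be sketched as follows.  With
``Referrer-Policy: origin`` a browser sends only the origin of the referring
page, not its path.  The helper name ``origin_referer`` is hypothetical and only
serves as an illustration:

.. code:: python

    from urllib.parse import urlsplit

    def origin_referer(url: str) -> str:
        # keep scheme and host, drop path and query (policy "origin")
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}/"

    origin_referer("https://html.duckduckgo.com/html/")
    # --> 'https://html.duckduckgo.com/'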

The fields of the html-form are reverse-engineered from DDG-html and may be
subject to additional bot detection mechanisms and breaking changes in the
future.

Query field:

Intro page: https://html.duckduckgo.com/html/

- ``q`` (str): Search query string
- ``b`` (str): Beginning parameter - empty string for first page requests.  If a
  second page is requested, this field is not set!

Search options:

- ``kl`` (str): Keyboard language/region code (e.g. 'en-us' default: 'wt-wt')
- ``df`` (str): Time filter, maps to values like 'd' (day), 'w' (week), 'm' (month), 'y' (year)

The key/value pairs ``df`` and ``kl`` are additionally saved in the cookies,
for example::

    Cookie: kl=en-us; df=m

*next page* form fields:

- ``nextParams`` (str): Continuation parameters from previous page response,
  typically empty string.  Opposite of ``b``; this field is not set when
  requesting the first result page.

- ``api`` (str): API endpoint identifier, typically 'd.js'
- ``o`` (str): Output format, typically ``json``
- ``v`` (str): Typically ``l`` for subsequent pages


- ``dc`` (int): Display count - value equal to offset (s) + 1
- ``s`` (int): Search offset for pagination
- ``vqd`` (str): Validation query digest
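
A minimal sketch of how the form fields listed above fit together.  The helper
name ``build_form`` is hypothetical, and the page sizes (10 results on the first
page, 15 on each following page) are an assumption taken from this engine's
``request`` function, not a documented DDG contract:

.. code:: python

    def build_form(query: str, pageno: int, vqd: str = "") -> dict[str, str | int]:
        # hypothetical helper, for illustration only
        form: dict[str, str | int] = {"q": query}
        if pageno == 1:
            # first page: "b" is an empty string, no paging fields are sent
            form["b"] = ""
            return form
        if not vqd:
            raise ValueError("follow-up pages need a vqd value")
        # assumed page sizes: 10 results on page 1, 15 on later pages
        offset = 10 + (pageno - 2) * 15
        form.update(
            {
                "vqd": vqd,
                "nextParams": "",
                "api": "d.js",
                "o": "json",
                "v": "l",
                "s": offset,
                "dc": offset + 1,
            }
        )
        return form

    build_form("searxng", 1)           # --> {'q': 'searxng', 'b': ''}
    build_form("searxng", 3, "4-123")  # --> 's' is 25 and 'dc' is 26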

General assumptions regarding DDG's bot blocker:

- Except ``Cookie: kl=..; df=..`` DDG does not use cookies in any of its
  services.

- DDG does not accept queries with more than 499 chars.

- The ``vqd`` value ("Validation query digest") is needed to pass DDG's bot
  protection and is used by all requests to DDG.

- The ``vqd`` value is generally not needed for the first query (intro); it is
  only required when additional pages are accessed (or when new content needs to
  be loaded for the query while scrolling).

- The second page (additional content) for a query cannot be requested without
  ``vqd``, as this would lead to an immediate blocking, since such a use-case
  does not exist in the process flows provided by DDG (and is a clear indication
  of a bot).
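
Since a ``vqd`` value is tied to a query and to the User-Agent_ header, a cache
for it can be keyed by both.  A minimal sketch with a plain dict and
``hashlib``; this engine uses ``EngineCache`` instead, and the names
``vqd_key``, ``remember_vqd``, ``lookup_vqd`` and ``_vqd_store`` are
hypothetical:

.. code:: python

    import hashlib

    _vqd_store: dict[str, str] = {}  # in-memory stand-in for a real cache

    def vqd_key(query: str, user_agent: str) -> str:
        # one key per (query, User-Agent) pair
        return hashlib.sha256(f"{query}//{user_agent}".encode("utf-8")).hexdigest()

    def remember_vqd(query: str, user_agent: str, vqd: str) -> None:
        _vqd_store[vqd_key(query, user_agent)] = vqd

    def lookup_vqd(query: str, user_agent: str) -> str:
        # an empty string means: no vqd known, do not request follow-up pages
        return _vqd_store.get(vqd_key(query, user_agent), "")

    remember_vqd("searxng", "Mozilla/5.0", "4-123")
    lookup_vqd("searxng", "Mozilla/5.0")  # --> '4-123'
    lookup_vqd("searxng", "curl/8.0")     # --> ''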

The following HTTP headers are being evaluated (and may possibly be responsible
for issues):

User-Agent_:
  The HTTP User-Agent is also involved in the formation of the vqd value, read
  `DuckDuckGo Bot Detection Research & Solution`_.  It is not checked whether
  the UA is a known header, but it is possible that certain UA headers (such as
  curl) are filtered.

Sec-Fetch-Mode_:
  In the past, Sec-Fetch-Mode had to be set to 'navigate', otherwise there were
  problems with the bot blocker.  I don't know if DDG still evaluates this
  header today.

Accept-Language_:
  DDG-Lite and DDG-HTML TRY to guess the user's preferred language from the HTTP
  ``Accept-Language`` header.  Optionally, the user can select a region filter
  (but not a language).

DDG's bot blocker blocks the IP (DDG does not have a client session!):

- As far as is known, it is possible to unblock an IP by executing a DDG query
  in a real web browser over the blocked IP (at least that's my assumption).

  How exactly the blocking mechanism currently works is not fully known, and
  there were also changes to the bot blocker in the period of Q3/Q4 2025: in the
  past, the IP blocking was implemented as a 'sliding window' (unblock after
  about 1 hour without requests from this IP).

Terms / phrases that you keep coming across:

- ``d.js``, ``i.js``, ``v.js``, ``news.js`` are the endpoints of DDG's web API
  through which additional content for a query can be requested (vqd
  required).

  The ``*.js`` endpoints return a JSON response and can therefore only be
  executed on a JS-capable client.

  The service at https://lite.duckduckgo.com/lite offers general WEB searches
  (no news, videos etc).  DDG-lite and DDG-html can be used by clients that do
  not support JS, aka *no-JS*.

  DDG-lite works a bit differently: here, ``d.js`` is not an endpoint but a
  field (``api=d.js``) in a form that is sent to DDG-lite.

- The request argument ``origin=funnel_home_website`` is often seen in the DDG
  services when the category is changed (e.g., from web search to news, images,
  or to the video category)

.. _DuckDuckGo Bot Detection Research & Solution:
   https://github.com/ggfevans/searxng/blob/mod-sidecar-harvester/docs/ddg-bot-detection-research.md

.. _Sec-Fetch-Mode:
   https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Sec-Fetch-Mode

.. _Referrer-Policy:
   https://developer.mozilla.org/docs/Web/HTTP/Reference/Headers/Referrer-Policy#directives

.. _Referer:
   https://developer.mozilla.org/de/docs/Web/HTTP/Reference/Headers/Referer

.. _User-Agent:
   https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/User-Agent

.. _Accept-Language:
   https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Accept-Language

"""
# pylint: disable=global-statement

import json
import re
import typing as t

import babel
import lxml.html

from searx import locales
from searx.enginelib import EngineCache
from searx.enginelib.traits import EngineTraits
from searx.exceptions import SearxEngineCaptchaException
from searx.external_bang import EXTERNAL_BANGS, get_node  # type: ignore
from searx.result_types import EngineResults
from searx.utils import (
    ElementType,
    eval_xpath,
    eval_xpath_getindex,
    extr,
    extract_text,
    gen_useragent,
)

if t.TYPE_CHECKING:
    from searx.extended_types import SXNG_Response
    from searx.search.processors import OnlineParams

about: dict[str, str | bool] = {
    "website": "https://lite.duckduckgo.com/lite/",
    "wikidata_id": "Q12805",
    "use_official_api": False,
    "require_api_key": False,
    "results": "HTML",
}

categories: list[str] = ["general", "web"]
paging: bool = True
time_range_support: bool = True
safesearch: bool = True
"""DDG-lite: user can't select but the results are filtered."""

ddg_url: str = "https://html.duckduckgo.com/html/"
"""The process flow for determining the ``vqd`` values was implemented for the
no-JS variant (DDG-html)"""

time_range_dict: dict[str, str] = {"day": "d", "week": "w", "month": "m", "year": "y"}

_CACHE: EngineCache = None  # pyright: ignore[reportAssignmentType]
"""Persistent (SQLite) key/value cache that deletes its values after ``expire``
seconds."""

_HTTP_User_Agent: str = gen_useragent()


def get_cache() -> EngineCache:
    global _CACHE
    if _CACHE is None:  # pyright: ignore[reportUnnecessaryComparison]
        _CACHE = EngineCache("duckduckgo")  # pyright: ignore[reportUnreachable]
    return _CACHE


def set_vqd(query: str | int, value: str, params: "OnlineParams") -> None:
    cache = get_cache()
    key = cache.secret_hash(f"{query}//{params['headers']['User-Agent']}")
    cache.set(key=key, value=value, expire=3600)


def get_vqd(
    query: str,
    params: "OnlineParams",
) -> str:
    """Returns the ``vqd`` value that fits to the *query* (and HTTP User-Agent_
    header).

    :param query: the query term
    :param params: request parameters
    """
    cache = get_cache()
    key = cache.secret_hash(f"{query}//{params['headers']['User-Agent']}")
    value: str = cache.get(key=key) or ""
    if value:
        logger.debug("get_vqd: re-use cached value: %s", value)
    return value


def get_ddg_lang(
    eng_traits: EngineTraits,
    sxng_locale: str,
    default: str = "en_US",
) -> str | None:
    """Get DuckDuckGo's language identifier from SearXNG's locale.

    .. hint::

       `DDG-lite <https://lite.duckduckgo.com/lite>`__ and the *no Javascript*
       page https://html.duckduckgo.com/html do not offer a language selection
       to the user.

    DDG defines its languages by a region code (:py:obj:`fetch_traits`).  To
    get region and language of a DDG service use:

    .. code:: python

       eng_region = traits.get_region(params["searxng_locale"], traits.all_locale)
       eng_lang = get_ddg_lang(traits, params["searxng_locale"])

    It might be confusing, but the ``l`` value of the cookie is what SearXNG
    calls the *region*:

    .. code:: python

        # !ddi paris :es-AR --> {'ad': 'es_AR', 'ah': 'ar-es', 'l': 'ar-es'}
        params['cookies']['ad'] = eng_lang
        params['cookies']['ah'] = eng_region
        params['cookies']['l'] = eng_region

    """
    lang: str | None = eng_traits.get_language(sxng_locale, default)

    return eng_traits.custom["lang_region"].get(sxng_locale, lang) or None


ddg_reg_map: dict[str, str] = {
    "tw-tzh": "zh_TW",
    "hk-tzh": "zh_HK",
    "ct-ca": "skip",  # ct-ca and es-ca both map to ca_ES
    "es-ca": "ca_ES",
    "id-en": "id_ID",
    "no-no": "nb_NO",
    "jp-jp": "ja_JP",
    "kr-kr": "ko_KR",
    "xa-ar": "ar_SA",
    "sl-sl": "sl_SI",
    "th-en": "th_TH",
    "vn-en": "vi_VN",
}

ddg_lang_map: dict[str, str] = {
    # use ar --> ar_EG (Egyptian Arabic)
    "ar_DZ": "lang_region",
    "ar_JO": "lang_region",
    "ar_SA": "lang_region",
    # use bn --> bn_BD
    "bn_IN": "lang_region",
    # use de --> de_DE
    "de_CH": "lang_region",
    # use en --> en_US,
    "en_AU": "lang_region",
    "en_CA": "lang_region",
    "en_GB": "lang_region",
    # Esperanto
    "eo_XX": "eo",
    # use es --> es_ES,
    "es_AR": "lang_region",
    "es_CL": "lang_region",
    "es_CO": "lang_region",
    "es_CR": "lang_region",
    "es_EC": "lang_region",
    "es_MX": "lang_region",
    "es_PE": "lang_region",
    "es_UY": "lang_region",
    "es_VE": "lang_region",
    # use fr --> fr_FR
    "fr_CA": "lang_region",
    "fr_CH": "lang_region",
    "fr_BE": "lang_region",
    # use nl --> nl_NL
    "nl_BE": "lang_region",
    # use pt --> pt_PT
    "pt_BR": "lang_region",
    # skip these languages
    "od_IN": "skip",
    "io_XX": "skip",
    "tokipona_XX": "skip",
}


def quote_ddg_bangs(query: str) -> str:
    """To avoid a redirect, the !bang directives in the query string are
    quoted."""

    _q: list[str] = []

    for val in re.split(r"(\s+)", query):
        if not val.strip():
            continue

        if val.startswith("!") and get_node(EXTERNAL_BANGS, val[1:]):
            val = f"'{val}'"
        _q.append(val)
    return " ".join(_q)


def request(query: str, params: "OnlineParams") -> None:

    if len(query) >= 500:
        # DDG does not accept queries with more than 499 chars
        params["url"] = None
        return

    query = quote_ddg_bangs(query)
    eng_region: str = traits.get_region(
        params["searxng_locale"],
        traits.all_locale,
    )  # pyright: ignore[reportAssignmentType]

    # HTTP headers
    # ============

    headers = params["headers"]

    # The vqd value is generated from the query and the UA header. To be able to
    # reuse the vqd value, the UA header must be static.
    headers["User-Agent"] = _HTTP_User_Agent

    headers["Sec-Fetch-Dest"] = "document"
    headers["Sec-Fetch-Mode"] = "navigate"
    headers["Sec-Fetch-Site"] = "same-origin"
    headers["Sec-Fetch-User"] = "?1"

    headers["Referer"] = "https://html.duckduckgo.com/"

    ui_lang = params["searxng_locale"]
    if not headers.get("Accept-Language"):
        headers["Accept-Language"] = f"{ui_lang},{ui_lang}-{ui_lang.upper()};q=0.7"

    # DDG search form (POST data)
    # ===========================

    # form_data: dict[str,str] = {"v": "l", "api": "d.js", "o": "json"}
    # """The WEB-API "endpoint" is ``api``."""

    data = params["data"]
    data["q"] = query
    params["url"] = ddg_url
    params["method"] = "POST"

    if params["pageno"] == 1:
        data["b"] = ""
    else:
        # vqd is required to request other pages after the first one
        vqd = get_vqd(query=query, params=params)
        if vqd:
            data["vqd"] = vqd
        else:
            # Don't try to call follow-up pages without a vqd value.
            # DDG recognizes this as a request from a bot. This lowers the
            # reputation of the SearXNG IP and DDG starts to activate CAPTCHAs.
            # Setting the suspend time to zero is OK --> DDG does not block the IP
            raise SearxEngineCaptchaException(
                suspended_time=0,
                message=f"VQD missed (page: {params['pageno']}, locale: {params['searxng_locale']})",
            )

        if params["searxng_locale"].startswith("zh"):
            # Some locales (at least China) do not have a "next page" button and DDG
            # will return a HTTP/2 403 Forbidden for a request of such a page.
            params["url"] = None
            return

        data["nextParams"] = ""
        data["api"] = "d.js"
        data["o"] = "json"
        data["v"] = "l"

        offset = 10 + (params["pageno"] - 2) * 15  # Page 2 = 10, Page 2+n = 10 + n*15
        data["dc"] = offset + 1
        data["s"] = offset

    if eng_region == "wt-wt":
        # Put empty kl in form data if language/region set to all
        # data["kl"] = ""
        data["kl"] = "wt-wt"
    else:
        data["kl"] = eng_region
        params["cookies"]["kl"] = eng_region

    t_range: str = time_range_dict.get(str(params["time_range"]), "")
    if t_range:
        data["df"] = t_range
        params["cookies"]["df"] = t_range

    params["headers"]["Content-Type"] = "application/x-www-form-urlencoded"
    params["headers"]["Referer"] = ddg_url

    logger.debug("param headers: %s", params["headers"])
    logger.debug("param data: %s", params["data"])
    logger.debug("param cookies: %s", params["cookies"])


def is_ddg_captcha(dom: ElementType) -> bool:
    """In case of a CAPTCHA, DDG responds with its own *not a Robot* dialog and
    does not redirect to a CAPTCHA page."""

    return bool(eval_xpath(dom, "//form[@id='challenge-form']"))


def response(resp: "SXNG_Response") -> EngineResults:
    res = EngineResults()

    if resp.status_code == 303:
        return res

    doc = lxml.html.fromstring(resp.text)
    params = resp.search_params

    if is_ddg_captcha(doc):
        # set suspend time to zero is OK --> ddg does not block the IP
        raise SearxEngineCaptchaException(suspended_time=0, message=f"CAPTCHA ({params['data'].get('kl')})")

    form = eval_xpath(doc, '//input[@name="vqd"]/..')

    # Some locales (at least China) do not have a "next page" button and DDG
    # will return a HTTP/2 403 Forbidden for a request of such a page.
    if len(form):
        form = form[0]
        form_vqd = eval_xpath(form, '//input[@name="vqd"]/@value')[0]
        q: str = str(params["data"]["q"])
        set_vqd(
            query=q,
            value=str(form_vqd),
            params=resp.search_params,
        )

    # just select "web-result" and ignore results of class "result--ad result--ad--small"
    for div_result in eval_xpath(doc, '//div[@id="links"]/div[contains(@class, "web-result")]'):
        _title = eval_xpath(div_result, ".//h2/a")
        _content = eval_xpath_getindex(div_result, './/a[contains(@class, "result__snippet")]', 0, [])
        res.add(
            res.types.MainResult(
                title=extract_text(_title) or "",
                url=eval_xpath(div_result, ".//h2/a/@href")[0],
                content=extract_text(_content) or "",
            )
        )

    zero_click_info_xpath = '//div[@id="zero_click_abstract"]'
    zero_click = extract_text(eval_xpath(doc, zero_click_info_xpath)).strip()  # type: ignore

    if zero_click and (
        "Your IP address is" not in zero_click
        and "Your user agent:" not in zero_click
        and "URL Decoded:" not in zero_click
    ):
        res.add(
            res.types.Answer(
                answer=zero_click,
                url=eval_xpath_getindex(doc, '//div[@id="zero_click_abstract"]/a/@href', 0),
            )
        )
    return res


def fetch_traits(engine_traits: EngineTraits):
    """Fetch languages & regions from DuckDuckGo.

    SearXNG's ``all`` locale maps to DuckDuckGo's "All regions" (``wt-wt``).
    DuckDuckGo's language "Browsers preferred language" (``wt_WT``) makes no
    sense in a SearXNG request since SearXNG's ``all`` will not add an
    ``Accept-Language`` HTTP header.  The value in ``engine_traits.all_locale``
    is ``wt-wt`` (the region).

    Besides regions, DuckDuckGo also defines its languages by region codes.  For
    example, these are the English languages in DuckDuckGo:

    - en_US
    - en_AU
    - en_CA
    - en_GB

    The function :py:obj:`get_ddg_lang` evaluates DuckDuckGo's language from
    SearXNG's locale.
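
    DDG's region tags are written as ``<territory>-<language>``.  A minimal
    sketch of the default conversion into a locale string that babel can parse;
    the helper name ``ddg_region_to_locale`` is hypothetical, and the exceptions
    listed in ``ddg_reg_map`` are not handled here:

    .. code:: python

        def ddg_region_to_locale(eng_tag: str) -> str:
            # "us-en" --> territory "us", language "en"
            eng_territory, eng_lang = eng_tag.split("-")
            return eng_lang + "_" + eng_territory.upper()

        ddg_region_to_locale("us-en")  # --> 'en_US'
        ddg_region_to_locale("ar-es")  # --> 'es_AR'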

    """
    # pylint: disable=too-many-branches, too-many-statements, import-outside-toplevel

    from searx.network import get  # see https://github.com/searxng/searxng/issues/762
    from searx.utils import js_obj_str_to_python

    # fetch regions

    engine_traits.all_locale = "wt-wt"

    # updated from u661.js to u.7669f071a13a7daa57cb / should be updated automatically?
    resp = get("https://duckduckgo.com/dist/util/u.7669f071a13a7daa57cb.js", timeout=5)
    if not resp.ok:
        raise RuntimeError("Response from DuckDuckGo regions is not OK.")

    js_code = extr(resp.text, "regions:", ",snippetLengths")

    regions = json.loads(js_code)
    for eng_tag, name in regions.items():
        if eng_tag == "wt-wt":
            engine_traits.all_locale = "wt-wt"
            continue

        region = ddg_reg_map.get(eng_tag)
        if region == "skip":
            continue

        if not region:
            eng_territory, eng_lang = eng_tag.split("-")
            region = eng_lang + "_" + eng_territory.upper()

        try:
            sxng_tag = locales.region_tag(babel.Locale.parse(region))
        except babel.UnknownLocaleError:
            print("ERROR: %s (%s) -> %s is unknown by babel" % (name, eng_tag, region))
            continue

        conflict = engine_traits.regions.get(sxng_tag)
        if conflict:
            if conflict != eng_tag:
                print("CONFLICT: babel %s --> %s, %s" % (sxng_tag, conflict, eng_tag))
            continue
        engine_traits.regions[sxng_tag] = eng_tag

    # fetch languages

    engine_traits.custom["lang_region"] = {}

    js_code = extr(resp.text, "languages:", ",regions")

    languages: dict[str, str] = js_obj_str_to_python(js_code)
    for eng_lang, name in languages.items():
        if eng_lang == "wt_WT":
            continue

        babel_tag = ddg_lang_map.get(eng_lang, eng_lang)
        if babel_tag == "skip":
            continue

        try:
            if babel_tag == "lang_region":
                sxng_tag = locales.region_tag(babel.Locale.parse(eng_lang))
                engine_traits.custom["lang_region"][sxng_tag] = eng_lang
                continue

            sxng_tag = locales.language_tag(babel.Locale.parse(babel_tag))

        except babel.UnknownLocaleError:
            print("ERROR: language %s (%s) is unknown by babel" % (name, eng_lang))
            continue

        conflict = engine_traits.languages.get(sxng_tag)
        if conflict:
            if conflict != eng_lang:
                print("CONFLICT: babel %s --> %s, %s" % (sxng_tag, conflict, eng_lang))
            continue
        engine_traits.languages[sxng_tag] = eng_lang
