quartz/quartz/components
うろちょろ ec26ebcc9e
feat: improve search tokenization for CJK languages (#2231)
* feat: improve search tokenization for CJK languages

Enhance the encoder function to properly tokenize CJK (Chinese, Japanese,
Korean) characters while maintaining English word tokenization. This fixes
search issues where CJK text was not searchable due to whitespace-only
splitting.

Changes:
- Tokenize CJK characters (Hiragana, Katakana, Kanji, Hangul) individually
- Preserve whitespace-based tokenization for non-CJK text
- Support mixed CJK/English content in search queries
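
The changes above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the name `encodeSimple`, the lowercasing step, and the exact character ranges in the regex are assumptions.

```typescript
// Hypothetical sketch; names and ranges are illustrative, not the actual Quartz code.
// Hiragana, Katakana, CJK Unified Ideographs, Hangul Syllables (BMP only).
const CJK_CHAR = /[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uac00-\ud7af]/

function encodeSimple(text: string): string[] {
  const tokens: string[] = []
  // Split on whitespace first, then break each chunk around CJK characters.
  for (const chunk of text.toLowerCase().split(/\s+/)) {
    let word = ""
    for (const ch of chunk) {
      if (CJK_CHAR.test(ch)) {
        tokens.push(word) // pending non-CJK word (may be empty, filtered below)
        word = ""
        tokens.push(ch) // each CJK character is its own token
      } else {
        word += ch
      }
    }
    tokens.push(word)
  }
  // Drop the empty strings produced by leading/trailing whitespace and flushes.
  return tokens.filter((t) => t.length > 0)
}
```

With this sketch, `encodeSimple("て以来")` yields one token per character, and a mixed query such as `"Quartz 検索 test"` yields the two English words plus one token for each of the two Kanji.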

This addresses the CJK search issues reported in #2109 where Japanese text
like "て以来" was not searchable because the encoder only split on whitespace.

Tested with Japanese, Chinese, and Korean content to verify character-level
tokenization works correctly while maintaining English search functionality.

* perf: optimize CJK search encoder with manual buffer tracking

Replace regex-based tokenization with index-based buffer management.
This improves performance by ~2.93x according to benchmark results.

- Use explicit buffer start/end indices instead of string concatenation
- Replace split(/\s+/) with direct whitespace code point checks
- Remove redundant filter() operations
- Add CJK Extension B support (U+20000-U+2A6DF)
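
A minimal sketch of the index-based approach, assuming hypothetical names (`isCjk`, `isSpace`, `encodeIndexed`) and a whitespace set limited to space, tab, LF, and CR; the encoder in this commit may differ in detail.

```typescript
// Hypothetical sketch; not the actual Quartz implementation.
function isCjk(code: number): boolean {
  return (
    (code >= 0x3040 && code <= 0x309f) || // Hiragana
    (code >= 0x30a0 && code <= 0x30ff) || // Katakana
    (code >= 0x4e00 && code <= 0x9fff) || // CJK Unified Ideographs
    (code >= 0xac00 && code <= 0xd7af) || // Hangul Syllables
    (code >= 0x20000 && code <= 0x2a6df) // supplementary range named in this commit
  )
}

// Assumption: only these four whitespace code points are treated as separators.
function isSpace(code: number): boolean {
  return code === 0x20 || code === 0x09 || code === 0x0a || code === 0x0d
}

function encodeIndexed(text: string): string[] {
  const lower = text.toLowerCase()
  const tokens: string[] = []
  let start = -1 // index where the pending non-CJK word begins, -1 if none
  let i = 0
  while (i < lower.length) {
    const code = lower.codePointAt(i)!
    const width = code > 0xffff ? 2 : 1 // astral code points take two UTF-16 units
    const cjk = isCjk(code)
    if (cjk || isSpace(code)) {
      if (start !== -1) {
        tokens.push(lower.slice(start, i)) // flush the buffered word with one slice
        start = -1
      }
      if (cjk) tokens.push(lower.slice(i, i + width))
    } else if (start === -1) {
      start = i
    }
    i += width
  }
  if (start !== -1) tokens.push(lower.slice(start))
  return tokens
}
```

Tracking only a start index means each word is copied once, by `slice`, rather than being rebuilt one character at a time, and no empty tokens are produced that would need filtering afterwards.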

Performance: ~878ms → ~300ms (100 iterations, mixed CJK/English text)
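
The benchmark script itself is not part of this message. A timing harness along the following lines would produce comparable numbers; `timeEncoder` and `speedup` are hypothetical helpers, and `Date.now()` is used only to keep the sketch dependency-free.

```typescript
// Hypothetical harness; the benchmark used for this commit is not shown in the source.
function timeEncoder(
  encode: (s: string) => string[],
  sample: string,
  iterations: number,
): number {
  const startedAt = Date.now()
  let produced = 0
  for (let n = 0; n < iterations; n++) {
    // Accumulate the token count so the call cannot be optimized away.
    produced += encode(sample).length
  }
  const elapsedMs = Date.now() - startedAt
  if (produced < 0) throw new Error("unreachable")
  return elapsedMs
}

// Ratio of old to new wall-clock time, rounded to two decimals.
function speedup(beforeMs: number, afterMs: number): number {
  return Math.round((beforeMs / afterMs) * 100) / 100
}
```

Applied to the figures above, `speedup(878, 300)` gives 2.93, which matches the stated improvement.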

* test: add comprehensive unit tests for CJK search encoder

Add 21 unit tests covering:
- English word tokenization
- CJK character-level tokenization (Japanese, Korean, Chinese)
- Mixed CJK/English content
- Edge cases

All tests pass, confirming the encoder correctly handles CJK text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-02 10:04:38 -08:00
pages fix: add proper popover hint to tag content page 2025-04-10 16:28:36 -07:00
scripts feat: improve search tokenization for CJK languages (#2231) 2025-12-02 10:04:38 -08:00
styles fix(css): not break word in the search button placeholder (#2182) 2025-10-31 17:01:51 -07:00
ArticleTitle.tsx chore(types): add additional hint for LSP support (#864) 2024-02-13 23:53:44 -05:00
Backlinks.tsx feat: support non-singleton explorer 2025-03-10 15:13:22 -07:00
Body.tsx chore(types): add additional hint for LSP support (#864) 2024-02-13 23:53:44 -05:00
Breadcrumbs.tsx fix: cleanup a href link construction, global shared trie, breadcrumbs use trie 2025-03-23 17:24:43 -07:00
Comments.tsx feat(giscus): expose language option for Comments component (#2012) 2025-06-08 11:23:01 +02:00
ConditionalRender.tsx feat: conditional render component 2025-03-23 17:34:14 -07:00
ContentMeta.tsx fix: use time HTML element for date strings (#1622) 2024-12-03 01:41:55 -05:00
Darkmode.tsx feat: support non-singleton darkmode 2025-03-10 11:44:47 -07:00
Date.tsx fix: use time HTML element for date strings (#1622) 2024-12-03 01:41:55 -05:00
DesktopOnly.tsx feat: flex component, document higher-order layout components 2025-03-11 14:56:43 -07:00
Explorer.tsx fix(a11y): aria-controls and role fixes 2025-08-03 22:44:35 -07:00
Flex.tsx fix(flex): respect DesktopOnly and MobileOnly components (#1971) 2025-06-02 18:36:57 +02:00
Footer.tsx feat(layout): add afterBody 2024-07-09 19:09:31 -07:00
Graph.tsx fix(graph): make graph non-singleton, proper cleanup, fix radial 2025-03-10 11:39:08 -07:00
Head.tsx feat(fonts): allow PageTitle to have its own font subset (#1848) 2025-03-18 21:43:32 -07:00
Header.tsx chore(types): add additional hint for LSP support (#864) 2024-02-13 23:53:44 -05:00
index.ts feat: reader mode 2025-04-17 19:45:17 -07:00
MobileOnly.tsx feat: flex component, document higher-order layout components 2025-03-11 14:56:43 -07:00
OverflowList.tsx fix(a11y): aria-controls and role fixes 2025-08-03 22:44:35 -07:00
PageList.tsx fix(RecentNotes): Prevent folder pages from always appearing first (closes #1901) (#1904) 2025-04-04 10:36:29 -07:00
PageTitle.tsx feat(fonts): allow PageTitle to have its own font subset (#1848) 2025-03-18 21:43:32 -07:00
ReaderMode.tsx feat(i18n): readermode translations and icon (#1961) 2025-05-07 21:56:18 +02:00
RecentNotes.tsx feat: ability to hide tags in the recent notes component (#1147) 2024-05-21 09:50:58 -07:00
renderPage.tsx Prevent double-loading of afterDOMReady scripts (#2213) 2025-11-27 14:51:56 -08:00
Search.tsx fix(style): layout flow, search restyle 2025-09-17 15:26:49 -07:00
Spacer.tsx fix(div): update class name to remove weird space afterwards (#763) 2024-01-29 21:51:13 -08:00
TableOfContents.tsx fix(a11y): aria-controls and role fixes 2025-08-03 22:44:35 -07:00
TagList.tsx fix: cleanup a href link construction, global shared trie, breadcrumbs use trie 2025-03-23 17:24:43 -07:00
types.ts feat: support non-singleton explorer 2025-03-10 15:13:22 -07:00