Getting the Sentence Around Selected Text with JavaScript: A Detailed Walkthrough

2023-09-20 02:24:08

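Preface

I recently received an issue asking that, when a word is selected, its surrounding context and the source URL be saved along with it. I had thought about this feature long ago, but it seemed hard to implement, so I kept putting it off. Having finally built it, I found it is not complicated after all. The complete code is listed in full at the end of this article; read on for the analysis.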

How it works

Getting the selected text

Calling window.getSelection() returns a Selection object, and calling .toString() on it returns the selected text.

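Anchor node and focus node

The Selection object also holds two important pieces of information, anchorNode and focusNode: the node where the selection began and the node where it ended. anchorOffset and focusOffset store the selection's offsets within those two nodes.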
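One detail worth keeping in mind: anchorNode is where the user started selecting, so when the selection is dragged backwards the anchor comes after the focus in document order. The sketch below is an alternative, not what this article's code does. It reads the endpoints from the selection's Range, whose start always precedes its end. The function name getSelectionEndpoints and the shape of the returned object are assumptions made for illustration.

```javascript
// Hypothetical helper: returns the selection endpoints in document order.
// `selection` is a Selection object, e.g. the result of window.getSelection().
function getSelectionEndpoints(selection) {
  // getSelection() can return null, and a selection may contain no range.
  if (!selection || selection.rangeCount === 0) {
    return null
  }
  // A Range always has its start before its end, regardless of drag direction.
  const range = selection.getRangeAt(0)
  return {
    startNode: range.startContainer,
    startOffset: range.startOffset,
    endNode: range.endContainer,
    endOffset: range.endOffset
  }
}

// In a browser: const endpoints = getSelectionEndpoints(window.getSelection())
```

With these endpoints, the head would be collected from startNode and the tail from endNode, whichever direction the user dragged.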

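At this point a first approach may come to mind: that makes things easy. With the start and end nodes and their offsets, we can get the head and tail of the sentence, put the selected text in the middle, and the whole sentence falls out.

Of course it is not that simple.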

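A point worth stressing

In general, anchorNode and focusNode are both Text nodes (and because we are dealing with text here, other cases are simply ignored). Consider markup like the following, where "Saladict" sits inside an inline element (<strong> is used here for illustration):

<strong>Saladict</strong> is awesome!

If "awesome" is selected, anchorNode and focusNode are both the Text node " is awesome!", so the preceding "Saladict" cannot be reached from them.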

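Nesting raises the same problem: when the selected text is itself wrapped in an inline element, the rest of the sentence lives outside that element. For example (tags again chosen for illustration):

<strong>Saladict</strong> is <em>awesome</em>!

So we also need to walk the sibling and parent nodes to get the complete sentence.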

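Where to stop?

The next problem is the traversal boundary: how far should we walk? My rule is to pass through inline-level elements and stop at the first block-level element. The most accurate way to tell whether an element is inline-level or block-level is window.getComputedStyle(), but I consider that too heavy for this purpose, and strict accuracy is not required, so the check uses a set of common inline tags instead.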

const INLINE_TAGS = new Set([
  // Inline text semantics
  'a', 'abbr', 'b', 'bdi', 'bdo', 'br', 'cite', 'code', 'data', 'dfn', 'em', 'i',
  'kbd', 'mark', 'q', 'rp', 'rt', 'rtc', 'ruby', 's', 'samp', 'small',
  'span', 'strong', 'sub', 'sup', 'time', 'u', 'var', 'wbr'
])
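For comparison, the getComputedStyle() check that the article sets aside as too heavy might look roughly like this. It is only a sketch: the name isInlineLevel and the win parameter are assumptions, and the test on the display value is a simplification that does not cover every inline-level display type (ruby, for example).

```javascript
// Hypothetical helper: true when the element's computed `display` starts with
// "inline" ("inline", "inline-block", "inline-flex", ...).
// `win` is a window-like object; in a browser, pass `window`.
function isInlineLevel(element, win) {
  const display = win.getComputedStyle(element).display
  return display.startsWith('inline')
}

// In a browser: isInlineLevel(document.querySelector('em'), window)
```

Unlike the tag set, this reflects the page's stylesheets, so a span styled as display: block would count as a boundary. The cost is a style computation for each element visited.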

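Summary of the approach

A sentence is made of three parts: the selected text in the middle, plus a head and a tail collected by walking the sibling and parent nodes.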

Implementation

Selected text

Get the text first, and bail out if there is none.

const selection = window.getSelection()
const selectedText = selection.toString()
if (!selectedText.trim()) { return '' }

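Getting the head

For anchorNode only Text nodes are considered. anchorOffset gives the part of anchorNode's content that precedes the selection.

Next, prepend the sibling nodes before anchorNode, and finally the sibling elements before anchorNode's parent element. Note that the latter step walks elements rather than nodes, which cuts down the number of iterations. And because hidden content should not be collected, innerText is used instead of textContent.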

let sentenceHead = ''
const anchorNode = selection.anchorNode
if (anchorNode.nodeType === Node.TEXT_NODE) {
  // text of the anchor node that precedes the selection
  let leadingText = anchorNode.textContent.slice(0, selection.anchorOffset)
  // prepend the sibling nodes before the anchor node
  for (let node = anchorNode.previousSibling; node; node = node.previousSibling) {
    if (node.nodeType === Node.TEXT_NODE) {
      leadingText = node.textContent + leadingText
    } else if (node.nodeType === Node.ELEMENT_NODE) {
      leadingText = node.innerText + leadingText
    }
  }

  // climb through inline ancestors, prepending their preceding sibling elements
  for (
    let element = anchorNode.parentElement;
    element && INLINE_TAGS.has(element.tagName.toLowerCase()) && element !== document.body;
    element = element.parentElement
  ) {
    for (let el = element.previousElementSibling; el; el = el.previousElementSibling) {
      leadingText = el.innerText + leadingText
    }
  }

  sentenceHead = (leadingText.match(sentenceHeadTester) || [''])[0]
}
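A side effect of walking previousElementSibling at the ancestor level is that Text nodes sitting between elements are skipped there. The variation below is a sketch, not the author's implementation. It walks every sibling node at each level instead, trading a few more iterations for completeness. The names collectLeadingText and nodeText, the numeric node-type constants, and the shortened inline set are all assumptions made so the sketch is self-contained.

```javascript
// Numeric values of Node.TEXT_NODE and Node.ELEMENT_NODE, so the sketch does
// not depend on the browser's `Node` global.
const TEXT_NODE = 3
const ELEMENT_NODE = 1

// Shortened, illustrative subset of the article's INLINE_TAGS.
const INLINE_SUBSET = new Set(['a', 'b', 'em', 'i', 'span', 'strong'])

// Text contributed by a sibling: textContent for Text nodes, innerText for elements.
function nodeText(node) {
  if (node.nodeType === TEXT_NODE) return node.textContent
  if (node.nodeType === ELEMENT_NODE) return node.innerText
  return ''
}

// Collects the text before `offset` in `startNode`, then the text of all
// preceding siblings, climbing while the parent is an inline element.
function collectLeadingText(startNode, offset) {
  let text = startNode.textContent.slice(0, offset)
  let node = startNode
  while (node) {
    for (let sib = node.previousSibling; sib; sib = sib.previousSibling) {
      text = nodeText(sib) + text
    }
    const parent = node.parentElement
    if (!parent || !INLINE_SUBSET.has(parent.tagName.toLowerCase())) break
    node = parent
  }
  return text
}
```

For markup such as <strong>Saladict</strong> is <em>awesome</em>! with "awesome" selected, this variation also picks up the " is " Text node between the two elements.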

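Finally, this is the regular expression used to extract the sentence head from the collected text:

// match head: (\.(?![ .])) lets "a.b" through; the character class excludes chars that end a sentence
const sentenceHeadTester = /((\.(?![ .]))|[^.?!。？！…\r\n])+$/

The leading (\.(?![ .])) is mainly there so that a dot followed by neither a space nor another dot is not treated as a sentence end. This handles forms like a.b, which are especially common in technical writing.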
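To see how the head pattern behaves, here is a small standalone sketch. It assumes the lookahead is meant to exclude both a space and a dot, written (?![ .]); the function name extractHead is made up for this example.

```javascript
// Matches the longest run at the END of the text that contains no sentence
// terminator. A dot is allowed when it is not followed by a space or a dot.
const headTester = /((\.(?![ .]))|[^.?!。？！…\r\n])+$/

// Hypothetical helper: returns the part of `leadingText` that belongs to the
// sentence containing the selection, or '' when nothing matches.
function extractHead(leadingText) {
  return (leadingText.match(headTester) || [''])[0]
}

console.log(JSON.stringify(extractHead('First one. Saladict is '))) // " Saladict is "
console.log(JSON.stringify(extractHead('See a.b for ')))            // "See a.b for "
console.log(JSON.stringify(extractHead('你好。这是')))                // "这是"
```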

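Getting the tail

The same as the head, except that the traversal goes forward. The regular expression here keeps the closing punctuation.

// match tail; the trailing (.)\3{0,2} keeps the closing punctuation, including "..."
const sentenceTailTester = /^((\.(?![ .]))|[^.?!。？！…\r\n])+(.)\3{0,2}/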
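The tail pattern can be tried out the same way. As before, this sketch assumes the lookahead is (?![ .]), and extractTail is a name invented for the example.

```javascript
// Matches from the START of the text up to and including the first sentence
// terminator. Group 3 captures that terminator, and \3{0,2} allows it to
// repeat up to two more times, which keeps an ellipsis written as "...".
const tailTester = /^((\.(?![ .]))|[^.?!。？！…\r\n])+(.)\3{0,2}/

// Hypothetical helper: returns the rest of the sentence after the selection,
// or '' when nothing matches.
function extractTail(tailingText) {
  return (tailingText.match(tailTester) || [''])[0]
}

console.log(JSON.stringify(extractTail(' awesome! Next one.')))   // " awesome!"
console.log(JSON.stringify(extractTail(' is a.b thing... more'))) // " is a.b thing..."
```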

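Compressing line breaks

Once the sentence is assembled, collapse runs of line breaks into a single blank line and strip the whitespace at the start and end of each line.

return (sentenceHead + selectedText + sentenceTail)
  .replace(/(^\s+)|(\s+$)/gm, '\n') // allow one empty line & trim each line
  .replace(/(^\s+)|(\s+$)/g, '') // remove heading or tailing \n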
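The two replacements can be exercised in isolation. The sketch below wraps them in a function; the name tidySentence is an assumption made for the example.

```javascript
// Hypothetical wrapper around the article's two-step cleanup.
function tidySentence(raw) {
  return raw
    // with the m flag, ^ and $ match at every line boundary
    .replace(/(^\s+)|(\s+$)/gm, '\n')
    // without it, they match only the start and end of the whole string
    .replace(/(^\s+)|(\s+$)/g, '')
}

console.log(JSON.stringify(tidySentence('  Hello world.  '))) // "Hello world."
console.log(JSON.stringify(tidySentence('a  \n\n\n  b')))     // "a\n\nb"
```

In the second call, three line breaks with stray spaces around them collapse into one blank line between the two words.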

Complete code

const INLINE_TAGS = new Set([
  // Inline text semantics
  'a', 'abbr', 'b', 'bdi', 'bdo', 'br', 'cite', 'code', 'data', 'dfn', 'em', 'i',
  'kbd', 'mark', 'q', 'rp', 'rt', 'rtc', 'ruby', 's', 'samp', 'small',
  'span', 'strong', 'sub', 'sup', 'time', 'u', 'var', 'wbr'
])

// match head: (\.(?![ .])) lets "a.b" through; the character class excludes chars that end a sentence
const sentenceHeadTester = /((\.(?![ .]))|[^.?!。？！…\r\n])+$/
// match tail; the trailing (.)\3{0,2} keeps the closing punctuation, including "..."
const sentenceTailTester = /^((\.(?![ .]))|[^.?!。？！…\r\n])+(.)\3{0,2}/

/**
 * @returns {string}
 */
export function getSelectionSentence () {
  const selection = window.getSelection()
  const selectedText = selection.toString()
  if (!selectedText.trim()) { return '' }

  let sentenceHead = ''
  let sentenceTail = ''

  const anchorNode = selection.anchorNode
  if (anchorNode.nodeType === Node.TEXT_NODE) {
    // text of the anchor node that precedes the selection
    let leadingText = anchorNode.textContent.slice(0, selection.anchorOffset)
    // prepend the sibling nodes before the anchor node
    for (let node = anchorNode.previousSibling; node; node = node.previousSibling) {
      if (node.nodeType === Node.TEXT_NODE) {
        leadingText = node.textContent + leadingText
      } else if (node.nodeType === Node.ELEMENT_NODE) {
        leadingText = node.innerText + leadingText
      }
    }

    // climb through inline ancestors, prepending their preceding sibling elements
    for (
      let element = anchorNode.parentElement;
      element && INLINE_TAGS.has(element.tagName.toLowerCase()) && element !== document.body;
      element = element.parentElement
    ) {
      for (let el = element.previousElementSibling; el; el = el.previousElementSibling) {
        leadingText = el.innerText + leadingText
      }
    }

    sentenceHead = (leadingText.match(sentenceHeadTester) || [''])[0]
  }

  const focusNode = selection.focusNode
  if (focusNode.nodeType === Node.TEXT_NODE) {
    // text of the focus node that follows the selection
    let tailingText = focusNode.textContent.slice(selection.focusOffset)
    // append the sibling nodes after the focus node
    for (let node = focusNode.nextSibling; node; node = node.nextSibling) {
      if (node.nodeType === Node.TEXT_NODE) {
        tailingText += node.textContent
      } else if (node.nodeType === Node.ELEMENT_NODE) {
        tailingText += node.innerText
      }
    }

    // climb through inline ancestors, appending their following sibling elements
    for (
      let element = focusNode.parentElement;
      element && INLINE_TAGS.has(element.tagName.toLowerCase()) && element !== document.body;
      element = element.parentElement
    ) {
      for (let el = element.nextElementSibling; el; el = el.nextElementSibling) {
        tailingText += el.innerText
      }
    }

    sentenceTail = (tailingText.match(sentenceTailTester) || [''])[0]
  }

  return (sentenceHead + selectedText + sentenceTail)
    .replace(/(^\s+)|(\s+$)/gm, '\n') // allow one empty line & trim each line
    .replace(/(^\s+)|(\s+$)/g, '') // remove heading or tailing \n
}

