Stalls Caused by GCD in iOS Apps and How to Address Them

2023-09-10 15:44:05

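I have recently been investigating the various kinds of stalls that occur in iOS apps and how to fix them.

Stalls in iOS apps are probably more common than most people imagine, especially in the flagship apps of large companies. One reason is that features keep piling up: product teams rarely coordinate, everyone is busy adding functionality, and system resources become a bottleneck. The other is that old devices are replaced slowly. iOS devices are very durable; quite a few iPhone 4S units are still in service, the iPhone 6, a known problem device, is still widely held, and by some estimates devices older than the iPhone 6s account for as much as 40% of the installed base.

So if you add a stall-detection tool to a production app, you will find that stalls are alarmingly frequent. Detecting and fixing them is not simple, though, mainly because they are hard to reproduce on development devices.

I previously wrote an article on monitoring main-thread stalls. The mainstream approach today seems to be to observe Runloop event callbacks and check whether the interval between callbacks exceeds a threshold; if it does, the call stacks of all threads in the app are recorded.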
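The Runloop-based approach described above can be sketched roughly as follows. This is only a minimal illustration, not the implementation of any particular monitoring tool: `StartStallMonitor`, `kStallThresholdMs` and the `onStall` callback are hypothetical names, and a real tool would capture the call stacks of all threads inside `onStall`.

```objc
#import <Foundation/Foundation.h>
#import <stdatomic.h>

// Hypothetical threshold; tune it for the app being monitored.
static const int64_t kStallThresholdMs = 400;

static dispatch_semaphore_t gTick;
static _Atomic(unsigned long) gActivity;

// Runs on the main thread every time the main Runloop changes activity.
static void StallObserverCallback(CFRunLoopObserverRef observer,
                                  CFRunLoopActivity activity, void *info) {
    atomic_store(&gActivity, (unsigned long)activity);
    dispatch_semaphore_signal(gTick);
}

// Hypothetical entry point. `onStall` is called on the monitor thread.
void StartStallMonitor(void (^onStall)(void)) {
    gTick = dispatch_semaphore_create(0);
    CFRunLoopObserverRef observer = CFRunLoopObserverCreate(
        kCFAllocatorDefault, kCFRunLoopAllActivities, true, 0,
        &StallObserverCallback, NULL);
    CFRunLoopAddObserver(CFRunLoopGetMain(), observer, kCFRunLoopCommonModes);
    CFRelease(observer);  // the Runloop keeps its own reference

    // A dedicated thread, so the monitor never occupies a dispatch worker thread.
    [NSThread detachNewThreadWithBlock:^{
        for (;;) {
            dispatch_time_t deadline = dispatch_time(
                DISPATCH_TIME_NOW, kStallThresholdMs * (int64_t)NSEC_PER_MSEC);
            long timedOut = dispatch_semaphore_wait(gTick, deadline);
            unsigned long activity = atomic_load(&gActivity);
            // In these two activities the main thread is busy handling events,
            // as opposed to sleeping while it waits for the next one.
            BOOL busy = (activity == kCFRunLoopBeforeSources ||
                         activity == kCFRunLoopAfterWaiting);
            if (timedOut != 0 && busy) {
                onStall();
            }
        }
    }];
}
```

Note that `onStall` keeps firing once per threshold interval for as long as the main thread stays busy, so a real tool would probably want to deduplicate reports.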

A while ago I came across this call stack in the stall logs reported to our backend:

> 0 libsystem_kernel.dylib __workq_kernreturn
> 1 libsystem_pthread.dylib _pthread_workqueue_addthreads
> 2 libdispatch.dylib _dispatch_queue_wakeup_global_slow
> 3 libdispatch.dylib _dispatch_queue_wakeup_with_qos_slow
> 4 libdispatch.dylib dispatch_async

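In other words, the stall occurred inside dispatch_async. As far as I understood GCD, dispatch_async should never stall: its job is to enqueue the block and have a worker thread from the system thread pool run it.

Yet the call stack above really did show up, and in quite a few samples. The innermost function (frame 0) is clearly a kernel call. Judging by the function names, GCD may have tried to get a thread from the pool, found that all existing threads were busy, and asked the kernel to create a new one. But is the kernel call that creates a thread really that slow? Slow enough to stall the main thread? With these questions in mind I searched through a lot of material, and the most relevant piece I found was this article: http://newosxbook.com/articles/GCD.html

It contains this passage: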

This isn't due to 10.9's GCD being different - rather, it demonstrates the true asynchronous nature of GCD: The main thread has yet to return from requesting the worker (which it does by pthread_workqueue_addthreads_np, as I'll describe later), and already the worker thread has spawned and is mid execution, possibly on another CPU core. The exact state of the main thread with respect to the worker is largely unpredictable.

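As I read it, the author suggests that the thread GCD obtains may be one that is already busy with other work, and that the main thread has to wait for this busy thread to return before it can continue. I have my doubts about that explanation.

With nowhere else to turn, I decided to spend one of my precious TSIs (Technical Support Incidents) and ask an Apple engineer directly. It is worth mentioning that requesting technical support from Apple is a valuable and perfectly practical option: every developer account gets 2 incidents per year, and it is a pity not to use them.

After I sent the question, I received a reply from an engineer on Apple's kernel team. I am sharing a condensed version of it here in question-and-answer form: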

Q: looks like even if it's async dispatching, the main thread still has to wait for the other thread to return, during which time, the other thread happen to be in mid execution of sth. this confuses me, what exactly is the main thread waiting for?

Why does the main thread need to wait for dispatch_async to return, and what exactly is it waiting for?

A: It's hard to say with just a user space backtrace. Frame 0 has clearly sent the current thread into the kernel, and this specific kernel call is /way/ too complex to analyse from outside [1].

A user-space call stack alone cannot answer this; the possible kernel states are too complex.

Q: I know it's suggested that we create limited amount of serial queue, and use target queue probably. but what could happen if we don't follow that rule?

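Apple has always recommended keeping the number of serial GCD queues you create under control, and preferably setting a target queue on them, or problems will follow. I had always been curious what those problems actually were, so I took the chance to ask.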

* On macOS, where the system is happier to overcommit, you end up with a thread explosion. That in turn can lead to problems running out of memory, running out of Mach ports, and so on.

* On iOS, which is not happy about overcommitting, you find that the latency between a block being queued and it running can skyrocket. This can, in turn, have knock-on effects. For example, the last time I looked at a problem like this I found that `NSOperationQueue` was dispatching blocks to the global queue for internal maintenance tasks, so when one subsystem within the app consumed all the dispatch worker threads other subsystems would just stall horribly.

Note: In the context of dispatch, an “overcommit” is where the system had to allocate more threads to a queue then there are CPU cores. In theory this should never be necessary because work you dispatch to a queue should never block waiting for resources. In practice it's unavoidable because, at a minimum, the work you queue can end up blocking on the VM subsystem.

Despite this, it's still best to structure your code to avoid the need for overcommitting, especially when the overcommit doesn't buy you anything. For example, code like this:

    group = dispatch_group_create();
    for (url in urlsToFetch) {
        dispatch_group_enter(group);
        dispatch_async(dispatch_get_global_queue(…), ^{
            … fetch `url` synchronously …
            dispatch_group_leave(group);
        });
    }
    dispatch_group_wait(group, …);

is horrible because it ties up 10 dispatch worker threads for a very long time without any benefit. And while this is an extreme example - from dispatch's perspective, networking is /really/ slow - there are less extreme examples that are similarly problematic. From dispatch's perspective, even the disk drive is slow (-:
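One way to avoid tying up worker threads in a case like the one above is to let an asynchronous networking API do the waiting. The following is a minimal sketch under that assumption, not something taken from the reply itself: `FetchAll`, the `com.example.fetch.results` label and the completion signature are hypothetical, and error handling is reduced to skipping failed URLs.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: fetches every URL without blocking a dispatch worker
// thread, then hands the collected results to `completion`.
void FetchAll(NSArray<NSURL *> *urlsToFetch,
              void (^completion)(NSDictionary<NSURL *, NSData *> *results)) {
    dispatch_group_t group = dispatch_group_create();
    // A single serial queue guards the mutable dictionary.
    dispatch_queue_t resultsQueue =
        dispatch_queue_create("com.example.fetch.results", DISPATCH_QUEUE_SERIAL);
    NSMutableDictionary<NSURL *, NSData *> *results = [NSMutableDictionary dictionary];
    NSURLSession *session = [NSURLSession sharedSession];

    for (NSURL *url in urlsToFetch) {
        dispatch_group_enter(group);
        NSURLSessionDataTask *task =
            [session dataTaskWithURL:url
                   completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            dispatch_async(resultsQueue, ^{
                if (data != nil) {
                    results[url] = data;  // failed requests are simply skipped here
                }
                dispatch_group_leave(group);
            });
        }];
        [task resume];
    }

    // Runs once every task has reported back; no thread sits in dispatch_group_wait.
    dispatch_group_notify(group, resultsQueue, ^{
        completion([results copy]);
    });
}
```

The caller gets its results in a callback instead of blocking in `dispatch_group_wait`, so no thread is parked while the network is slow.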

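This reply is interesting. Anyone who has read the GCD source knows that every queue created with default settings has a priority, and that each priority actually maps to two root queues: if one is default-priority, the other is default-priority-overcommit. As far as I can tell from the libdispatch source, which of the two is used depends on the kind of queue rather than on one overflowing into the other: dispatch_get_global_queue returns the non-overcommit queue, while serial queues created without an explicit target sit on top of the overcommit one, for which the system is willing to create threads beyond the number of CPU cores.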

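On macOS, GCD is happy to overcommit, so dispatched work can keep spawning new threads; the surplus threads then compete for CPU time according to their priority.

On iOS, GCD keeps overcommit in check: once the queue for a given priority is overcommitted, the tasks queued behind it have to wait. CPU resources on mobile devices are tight, so this design makes sense.

So if you create too many serial queues on iOS, tasks submitted later may end up waiting for a long time. This is why the number of queues and their hierarchy need to be strictly controlled. Ideally each subsystem in the app is allotted a fixed number of queues at fixed priorities, so that thread explosion does not keep code from running on time.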
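A queue hierarchy of this kind can be sketched with target queues. The snippet below is only an illustration: the subsystem name, the `com.example.networking` labels, the QoS choice and the helper names are all hypothetical. `dispatch_queue_create_with_target` requires iOS 10 / macOS 10.12 or later.

```objc
#import <Foundation/Foundation.h>

// Hypothetical: the single serial root queue owned by a "networking" subsystem.
static dispatch_queue_t NetworkingRootQueue(void) {
    static dispatch_queue_t root;
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        dispatch_queue_attr_t attr = dispatch_queue_attr_make_with_qos_class(
            DISPATCH_QUEUE_SERIAL, QOS_CLASS_UTILITY, 0);
        root = dispatch_queue_create("com.example.networking", attr);
    });
    return root;
}

// Every queue the subsystem hands out targets the same serial root, so however
// many queues are created, their blocks run on at most one thread at a time.
dispatch_queue_t MakeNetworkingQueue(const char *label) {
    return dispatch_queue_create_with_target(label, DISPATCH_QUEUE_SERIAL,
                                             NetworkingRootQueue());
}
```

With this layout, code inside the subsystem can still use separate queues for separate concerns, but the subsystem as a whole can never demand more than one worker thread.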

Q: I know the system watchdog can kill an app if the main thread is taking too long to respond. I also heard rumors that there are two other cases that may gets your app killed by watchdog. the first is too many new threads are being created like by random usage of dispatching work to global concurrent queue? the second case is if CPU has been kept too busy like 100% for too long, watchdog kills app too?

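I took the chance to ask why the system watchdog kills apps. Rumor has long had it that, besides the main thread being unresponsive for too long, creating too many threads or keeping the CPU overloaded for too long can also get an app killed.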

A: I'm not aware of any specific watchdog check along those lines, but it's not hard to imagine that the above-mentioned knock-on effects might jam up your app sufficiently for the watchdog to kill it for other reasons. Running the CPU for too long generates a crash report but it doesn't actually kill the app. It's essentially a 'warning' crash report about the problem.

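Creating too many threads does not directly cause a watchdog kill, but it can keep the main thread from being serviced in time, so that the app ends up being killed for other reasons. Prolonged CPU overload does not cause a kill either, but the system generates a report to warn the developer. I have indeed seen quite a few of these 'this is not a crash' crash logs.

There were a few more questions and answers that are not directly related to my problem, so I am leaving them out. Finally, here is one more reply that I found interesting. Before reading it, think about the following yourself: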

    dispatch_async(myQueue, ^{
        // line A
    });
    // line B

Which runs first, line A or line B?

Consider a snippet like this:

    dispatch_async(myQueue, ^{
        // line A
    });
    // line B

there's clearly a race condition between lines A and B, that is, between the `dispatch_async` returning and the block running on the queue. This can pan out in multiple ways, including:

* If `myQueue` (which we're assuming is a serial queue) is busy, A has to wait so B will definitely run before A.

* If `myQueue` is empty, there's no idle CPU, and `myQueue` has a higher priority then the thread that called `dispatch_async`, you could imagine the kernel switching the CPU to `myQueue` so that it can run A.

* The thread that called `dispatch_async` could run out of its time quantum after scheduling B on `myQueue` but before returning from `dispatch_async`, which again results in A running before B.

* If `myQueue` is empty and there's an idle CPU, A and B could end up running simultaneously.
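Since none of the orderings listed above is guaranteed, code that needs A to finish before B has to say so explicitly. A minimal sketch of one way to do that, with `RunAThenB` and its parameters being hypothetical names used only for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: runs `lineA` on `workQueue`, then `lineB` on `replyQueue`.
void RunAThenB(dispatch_queue_t workQueue, dispatch_queue_t replyQueue,
               dispatch_block_t lineA, dispatch_block_t lineB) {
    dispatch_async(workQueue, ^{
        lineA();
        // B is submitted only after A has finished, so the order no longer
        // depends on how the scheduler interleaves the two threads.
        dispatch_async(replyQueue, lineB);
    });
}
```

In an app, `replyQueue` would often be `dispatch_get_main_queue()`, so that B can safely update the UI once A is done.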

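The answer

In the end I did not get the precise answer I was hoping for. As the reply says, there are many possible situations and they are too complex; the state of the kernel cannot simply be inferred from a user-space call stack. Still, a few valuable points became reasonably clear:

Takeaway 1

iOS is itself a resource scheduling and allocation system. CPU, disk I/O, VM and so on are all scarce resources, and they affect one another. A main-thread stall may look like a CPU bottleneck, but the kernel may actually be busy scheduling other resources: heavy disk reads and writes in progress, or large amounts of memory being allocated and reclaimed, can both make the following simple thread-creation kernel call stall: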

libsystem_kernel.dylib __workq_kernreturn

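So the only remedy is to analyze the call stack of each thread yourself and work out, from the user scenario, which system resources are being consumed at that moment. By reviewing recently committed code I did later find that some very time-consuming disk I/O tasks had been added (albeit on a background thread), and that was what produced this seemingly unrelated call stack. After reverting them, the stall alerts disappeared.

Takeaway 2

Existing stall-detection tools can only dump call stacks once a timeout has occurred, but the timeout may be the combined effect of tasks A, B and C. A and B may be the truly expensive tasks, while C is cheap but happens to be the last one, so it gets blamed and A and B never appear in the report. I have not yet come up with a particularly good solution. Clearly, libsystem_kernel.dylib __workq_kernreturn is just such an inexpensive task C.

Takeaway 3

When creating queues with GCD, or more generally when an app uses GCD to run background work, it is best to have a queue-usage policy that every team working on the app can follow. That avoids creating too many threads and running into unexpected thread starvation where code cannot run on time. This is hard, especially in large companies where teams often exceed a hundred people.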

