Communication Synthesis in a multiprocessor environment

2023-10-28 来源：世旅网

CommunicationSynthesisinamultiprocessor

environment

ClaudiuZissulescu,BartKienhuis,EdDeprettere

LeidenEmbeddedResearchCenter,

LeidenInstituteofAdvancedComputerScience(LIACS),

LeidenUniversity,TheNetherlands{claus,kienhuis,edd}@liacs.nl

Presentedat

Field-ProgrammableLogicandApplicationsconference(FPL’05)

Tampere,FinlandAugust24-26,2005

Abstract.AtLeidenUniversity,wearedevelopingadesignmethodologythatallowsforfastmappingofnested-loopapplications(e.g.DSP,Imaging,orMulti-Media)writteninasubsetofMatlabontoreconﬁgurabledevices.ThisdesignmethodologyisimplementedintoatoolchainthatwecallCOMPAAN/LAURA[8].Thismethodologygeneratesaprocessnetworkinwhichtheinter-processcom-municationtakesplaceinapoint-to-pointfashion.Fourtypesofpoint-to-pointinter-processorcommunicationexistinthePN.TwoofthemuseaFIFOlikecommunicationandtheothertwouseacachelikememorytoexchangedata.Inthispaper,weinvestigatetherealizationsforthefourcommunicationtypesandshowthatpoint-to-pointcommunicationatthelevelofscalarscanberealizedautomaticallyandveryefﬁcientlyintoday’sFPGAs.

1Introduction

Tobetterexploitthereconﬁgurablehardwaredevicesthatarecomingtomarket,anum-beroftechnologiesaredevelopedtohandlebillionsoftransistorsavailableinthesenewchips.Akeyideainthesetechnologiesisdecouplingthecommunicationfromcomputation.ThisdecouplingallowstheIPcores(thecomputationpart)andtheinter-connect(thecommunicationpart)tobedesignseparately[5].Respectingthisdesignconcept,wearedevelopingadesignmethodologythatallowsfastmappingofnested-loopapplications(e.g.DSP,Imaging,orMulti-Media)writteninasubsetofMatlabontoreconﬁgurabledevices.ThisdesignmethodologyisimplementedintoatoolchainthatwecallCOMPAAN/LAURA[8].

TheCOMPAANtoolanalyzestheMatlabapplicationandderivesautomaticallyaparallelrepresentation,expressedasaProcessNetwork(PN).APNconsistsofcon-currentprocessesthatareinterconnectedviaasynchronousFIFOs.ThecontroloftheinputMatlabprogramisdistributedovertheprocessesandthememoryisdistributedovertheFIFOs.TheLAURAtoolsynthesizesanetworkofhardwareprocessorsfromthegivenPN.AkeyoperationinLAURAistogeneratetheproperhardwarecommu-nicationmechanismforthepointtopointinter-processorscommunication,whichmaybeadifferentcommunicationstructurethanthePNcommunicationFIFOmodel.

2ClaudiuZissulescu,BartKienhuis,EdDeprettere

Ourtoolﬂowhasbeendevelopedfordata-ﬂowalgorithms,havingcommunica-tionatthelevelofscalars(e.g.bytesorwords).ThecommunicationtopologyofthePNisstatic,derivedatcompiletime.Torealizetheinter-processorcommunication,weusepoint-to-pointcommunicationmechanism.Employingbussesand/orcomplexNetworks-on-Chips(NoCs)[7,4]forthecommunicationisnotfeasibleduetothede-laysintheroutingprocessandtheusageoflargepacketsinsteadofscalarsinthecom-municationprotocol.

Aswefoundoutin[10],fourtypesofpoint-to-pointinter-processorcommunicationexistinthePNwegenerate.TwoofthemuseaFIFOlikecommunicationandtheothertwouseacachelikememorytoexchangedata.Inthispaper,weinvestigatetherealiza-tionsforthefourcommunicationtypesanddiscusstheeffectivenessoftherealizations.Therestofthepaperisorganizedasfollows:ﬁrst,wepresenttheCOMPAAN/LAURAﬂowinordertounderstandhowthehardwaremappingisdonebyLAURA.Next,weaddressthecommunicationchannelgenerationandproposeanapproachtosolveeachcommunicationtype.Weﬁnishthispaperwithadiscussionoverthemeritsandim-provementsoftheseapproachesinthecontextofourtoolchain.1.1

COMPAAN/LAURAdesignﬂow

TheprocessnetworksweconsiderinthispaperarederivedusingtheCOMPAANtoolchain.COMPAANtakesasinputparameterizedstaticnestedloopprogramswritteninMatlabandconvertsthiscodetoprocessnetworks.Anintermediatestepistransforma-tionoftheinitialMatlabcodeintosingleassignmentcode(SAC)usingexactdataﬂowanalysis[3].Thelasttooloftheﬂow,calledLAURA,isusedtogenerateaVHDLde-scriptionofanarchitecturefromaPNdescription.Duringthisstep,eachprocessofthePNismappedtoanabstractarchitecturalmodelcalledVirtualProcessor.Eachvirtualprocessorconsistsofthreedistinctcomponents:

–AnExecuteUnit,whichisthecomputationalpartofthevirtualprocessor.ThisunitwrapsinanIPcorethatimplementsthefunctionalityoftheprocess.ItsinterfaceconsistsofanumberofInputargumentsandOutputarguments.

–AReadUnit,whichisresponsibleforassigningvalidtokenstotheinputargumentsoftheExecuteUnit.Sincetherearemoreinputportsthanarguments,theReadUnithastoselectatrun-timefromwhichChanneltoreadtokensusingacontrolprogramthatisderivedbyCOMPAAN.

–AndaWriteUnit,whichisresponsiblefordistributingtheresultsoftheExecuteUnittodifferentChannels.AwriteoperationcanexecuteonlywhenalltheoutputargumentsoftheExecuteUnitareavailablefortheWriteUnit.SimilarlytotheReadUnit,theWriteUnithastoselectachannelatrun-timetowritetokensinto,usingacontrolprogramthatisderivedbyCOMPAAN.

TheapplicationstargetedbyCOMPAANareusuallydata-ﬂowintensive,requiringlargecomputationalpower.Therefore,animportantissueinLAURAisthederivationofef-ﬁcientcommunicationstructuresinhardwareandisthefocusofthispaper.Initially,COMPAANﬁndsintheinputMatlabﬁle,allthepossibleproducer-consumerpairs.Atthatlevelthecommunicationbetweentwoprocessesisdoneusingamultidimensional

CommunicationSynthesisinamultiprocessorenvironment3

array,representedasapolytope.ToselectthetypeofcommunicationalinearizationprocedureisemployedbytheCOMPAANtoolwhichselectstherighttypeofcommuni-cationchannel.

2CommunicationGeneration

InahardwarenetworkgeneratedbyLAURA,eachprocessorexecutesaninternalcontrolprogramatboththeReadandWriteUnits.ThisprogramdescribesalocalscheduleintermsofExecuteUnitexecutions.Ateachexecution,alsorefertoasaniteration,aReadUnitreadsdatafromaChannelandaWriteUnitwritesdatatoaChannel.IntheoriginalMatlabcode,theChannelrepresentsthecommunicationonan-Dimensionalarray(e.g.,a[i,j]).Thisarrayisreplacedby1-Darraybyourtoolchaininthelinearizationstep.

Virtual Processor 1Channel 1ReadExecuteWriteChannel 2Channel 3ReadExecuteWriteVirtual Processor 2ProducerConsumerFig.1.AProducer-Consumerpair

Figure1depictsaclassicalproducerconsumerpair.AVirtualProcessor1sendsdatatothesecondprocessorcalledVirtualProcessor2viatheChannel2.ThechannelrepresentsthedatadependencybetweenaReadUnitandaWriteUnit.ThisrelationisgivenbyaMappingfunction.Thelinearizationstepreplacestheaddressingofanarraywithrelativeaddressingschemebasedonputandgetprimitives.Usually,thede-rivedcommunicationchannelisaFIFO,however,therearecasesinwhichaFIFOisnotsufﬁcienttolinearizean-dimensionalarray[10].WehavefoundthatfourtypesofcommunicationcanbedistinguishedasgiveninFigure2.Theyresultfromtheorder-ingoftheiterationsattheProducerandtheConsumerprocessesandtheexistenceofmultiplicityforagiventoken,whichmeansthatatokenthatissentbyProducerisreadmorethanonceattheConsumerside.Hence,dependingontheorderandexistenceofmultiplicity,anarbitrarycommunicationchannelbelongstooneoffourdisjointclasses:in-orderwithoutmultiplicity(IOM-),in-orderwithmultiplicity(IOM+),out-of-orderwithoutmultiplicity(OOM-),andout-of-orderwithmultiplicity(OOM+).Foreachclassanadequatecommunicationmechanismneedstobeefﬁcientlysynthesizedinhardwareintermsofcyclesperoperation,areaandspeed.

Fromexperience[10],weknowthatonaveragethefollowingdistributioncanbeex-pectedoverthevariouscommunicationtypes:typeIOM-(80%),IOM+(10%),OOM-(9%),OOM+(1%).TypeIOM-togetherwithtypeIOM+,resultinthat90%ofthecommunicationchannels,requireaFIFObuffertorealizethecommunication.Intheremaining10%ofthecases,amorecomplexReorderingChannelisneeded.2.1InOrdercommunication(IOM-)

IntheInOrdercommunication(IOM-)case,theProducerwritesdataintheChannelinthesameorderastheConsumerreadsfromtheChannel.Therefore,thisChannelis

4ClaudiuZissulescu,BartKienhuis,EdDeprettere

In−order without multiplicity (IOM−) :

Produceri4321In−order with multiplicity (IOM+) :Produceri4data dependecyloop schedulej4321Consumerj4321Consumer3211 2 3 4 5 i1 2 3 4 5 iOut−of−order without multiplicity (OOM−) :

Produceri4321j43211 2 3 4 5 iOut−of−order with multiplicity (OOM+) :

Producerij43211 2 3 4 5 i4321ConsumerConsumerFig.2.ThefourcasesofcommunicationbetweenProducerandConsumer

implementedinhardwareusingaFIFObuffer.Itisaccessedusingthetwoprimitivesput(implementedintheWriteunit)andget(implementedintheReadunit).BecausehighlyoptimizedimplementationsofFIFObuffersexistfortoday’sFPGAs,ittakeseachprimitiveonlyasinglecycletowritedataortoreaddatafromaChannel.AhardwareFIFOhasﬁnitememory,thusbothprimitivesareblocking,e.g.theyhaltaprocessorwhennodataisavailableinaFIFOorwhenaFIFOisfull.FindingalowerboundonaChannelisahardprobleminPNsanditisoutsidethescopeofthispaper,althoughasmalldiscussionisgiveninSection3.

2.2InOrderwithMultiplicitycommunication(IOM+)

IntheInOrderwithMultiplicityCommunication(IOM+)case,theorderdataispro-ducedisthesameastheorderinwhichdataisconsumed.However,somedataiscon-sumedmorethanonce,breakingthecommunicationmodelofaFIFOwhereagetoperationisdestructive.Inthismodel,thelife-timeofatokenneedstobetakenintoaccount.Onlyattheendofthelife-timeofatoken,thetokencanbereleasedfromtheFIFO.WhiletheputprimitiveremainsthesameasinthecaseofIOM-,weaddedanewcommunicationprimitivewhichwecalledthepeekprimitive.ThepeekprimitivefetchesdatafromaFIFObufferwithoutdestroyingit.TodestroythecurrentFIFOdataareleasecontrolissynthesizedintheConsumerReadUnit.Also,theoutputoftheFIFOisregisteredbytheMultiplicityRegisterwhichiscontrolledbythereleasecontrol.Apeekisonlyreadingthecontentsoftheregister,whileagetoperationisreadinganewvaluefromtheFIFObufferandplaceitintheMultiplicityRegister.Thecontrolthatdeterminesthelife-timeofatokenisexpressedinthesamewaythecontrolprogramsintheReadandWriteUnitareexpressed.2.3OutofOrdercommunication(OOM-)

IntheOut-of-Ordercommunication(OOM-)case,aConsumerreadsdatainadifferentorderithasbeenwrittenbytheProducer.Hence,thecommunicationchannelallowsaConsumertofetchdataintheorderitexpectit.WerefertothiskindofChannel,

CommunicationSynthesisinamultiprocessorenvironment5

whichallowsforrun-timereorderingoftokens,asaReorderChannel.Themainele-mentsofthisReorderChannelisthereordermemoryandthetaggingoftokensthatarewritten/readto/fromthereorderingmemory.EachtokenneedstobetaggedtoallowtheConsumerProcesstorequestparticulartokensintheordergivenbyitslocalsched-ule.Thetagcomputationtakesplaceinbothpartsinvolvedinthetransaction,i.e.,theProducersideandtheConsumerside.

ValidDataProducerProducerData Out &AddressDataConsumerConsumerReorder MemoryReorder Memory(RAM)(RAM)Data InDataRequestAddress?ACK ProducerACK ConsumerFig.3.TheorganizationoftheReorderchannel

InFigure3,theorganizationoftheReorderChannelisgiven.Themainelementistherandomaccessmemory,calledtheReorderMemory.TheﬁgurealsoshowsaProducerthatwantstowritetokenstotheChannelandaConsumerthatwantstoreadtokens.Atoken(Data)thatiswrittenbytheProducer,istemporarilystoredinareg-istertogetherwithanaddress.ThisaddressiscalculatedbythetaggeneratoroftheProducer.Eachtokenthatisstoredinmemoryhasavalidbit,whichindicatesthataparticularlocation(address)containsvaliddata.Ifthevalidbitisset,theProducerisnotallowedtowritedataandiscompletelystalleduntiltheaddressbecomesavailableagain(ACKProducer).Otherwise,theProducerwritesthetemporarilystoreddataintothememoryandsetsthevalidbit.Attheotherside,theConsumerplacesarequestcommandtotheReorderingChannelforaparticularlocationgivenbyanaddress.Iftherequestedlocationcontainsvaliddata,theConsumerreceivesanacknowledgesig-nal(ACKConsumer),and,atthesametime,thedesireddata.Ifthelocationdoesnotcontainvaliddata,theConsumerstallsuntilvaliddatabecomesavailable.GiventheorganizationoftheReorderingChannelinFigure3,twoissuesdeterminethedesign.Oneisrelatedwiththecomplexityofthetaggenerationandtheotheroneisrelatedwiththeperformanceofthereorderchannelintermsofclockcycles.2.4Taggeneration

Inthegenerationoftags,wetakeadvantagesofthefactthatweoperateonpolytopeswithinCOMPAAN.Thisleadstotwodifferentapproachwecanusetogeneratethetag.OneapproachisbasedontheEhrhartenumerationtheory[2].Usingthistheory,weobtainapseudo-polynomialexpressionthatgivesanuniqueintegervalueforeachpointenclosedbythepolytope.Thisapproachhasbeensuccessfullyexploredinsoftwarein[9].However,thisapproachisnotsuitableforhardwareimplementationduetothecomplexityoftheobtainedpseudo-polynomialexpression.InourexampleinFigure4,

6ClaudiuZissulescu,BartKienhuis,EdDeprettere

thepseudopolynomialforthepolytopeenclosingtheproducerpointsisgivenbythefollowexpression:(−1/4)∗i2+(N−5/2)∗i+j+[−1,5/2]i.Inthisexpression,thepseudopolynomialterm[−1,5/2]iindicatesthatwhentheevaluationofmod(i,2)isequaltozero,thevalue-1isselected;otherwisethevalue5/2isselectedinthepolynomial.

Inthesecondapproach,werelaxtheshapethatenclosestheproducerpolytopetoahyper-rectangularshape.Wecallthishyper-rectangularshapetheBoundingBox.ForaBoundingBox,wecanmakeuseofclassicallinearizationtoconvertan-dimensionrectangletoanone-dimensionalarray[1,6].FindtheBoundingBoxthatbestenclosesthepolytopeisaminimizationproblemthatwesolveusingintegerlinearprogram-ming.TheimprovementovertheEhrhartapproachisthateachBoundingBoxcanbeaddressedusingasimplepolynomialthatcanbeimplementedefﬁcientlyinhardware.Thetagforatokenisobtainedasafunctionoftheiteratorsofaprocessorandhasthe󰀁N

formtag=k=1ck∗xk+x0+c0whereckrepresentsaconstant,xkisaniterationspaceindex.EachtagbecomesanaddressforaRAMmemory.

ProducerMappingN=87654321543211098761413121118171615212019242322262528272930Consumer(i,j) <−− (y,x)N=87654321211373122148422395241062517112618122719282029301516BoundingBoxPolytopej

i12345678910 = N+2y12345678910 = N+2W:(N−3)*(i−1) + j−400010203040506070809xR:(N−3)*(x−1) + y−420 NULL21 NULL22token 1923token 2024token 2125 NULL26 NULL27token 2228token 2329token 2430313233343536373839token 1token 2token 3token 4token 5token 6token 7token 8token 9token 1010111213141516171819 NULLtoken 11token 12token 13token 14 NULLtoken 15token 16token 17token 18 NULL NULL NULLtoken 25token 26 NULL NULL NULLtoken 27token 2840414243444546474849 NULL NULL NULL NULLtoken 29 NULL NULL NULL NULLtoken 30Reorder memory The Linearization shape is given by the smallestrectagular containing the Producer domain

Fig.4.TheLinearrealizationofthereordermemory

IfweextracttheBoundingBoxattheProducersideinFigure4,weobtainthewriteaddress,whichis(N−3)∗(i−1)+j−4.Fromtheaddressgeneratedbythewriteaddress,theConsumerhastoreadtokensintheorderitdesires.Itthereforerequiresareadaddresstoaccessthetokens.Sincethereisawelldeﬁnedmathematicalrelation-shipbetweentheProducerandConsumerasgivenbytheMapping,wecanobtainthereadaddressbymathematicallycomposingthewriteaddresswiththemappingfunc-tion.Thus,theparameterizedreadaddressisequalto(N−3)∗(x−i)+y−4.Inthiscase,reorderingisinducedbythefactthatthepointsintheConsumerpolytopeareaccessedinadifferentorderthentheProducerhaswrittenthantotheReorderChannel.

ThedownsideoftheBoundingBoxapproachisthatthememoryallocatedfortheReorderMemoryislargerthanneededwhencomparedwiththeEhrhartapproach.InFigure4,thecontentoftheReorderMemoryisgiven.Itshowsatwhichaddressapar-ticulartokeniswritten.Forexample,thetokenfromlocation(6,7)iswrittenataddress28.Thisisthe23rdtokenwrittenbytheProducer.ThistokenisreadbytheConsumer

CommunicationSynthesisinamultiprocessorenvironment7

atlocation(6,7)asthe18thtokenread.Fromthecontentofthereorderingmemory,weobservethatmemorylocationsexistthatareneverwrittenorreadduringtheexecutionofanetwork,asgivenbythe”Nulls”.InthecaseofanEhrhartgeneratedaddress,these”Nulls”wouldnotexistattheexpenseofmorecomplexaddresspolynomialsthataredifﬁculttobeimplementedefﬁcientlyinhardware.2.5Communicationprotocol

TheReorderingChannelimplementsaparticularcommunicationprotocolthatinvolveswritingandreadingatoken.Awriteoperationusesonlytheputcommunicationprim-itive,likewewouldhavedoneincaseofaFIFOChannel.ThisprimitiveprovidestotheReorderChannelatokenthatconsistsofataganddata.Ifplaceisavailableatthelocationgivenbythetag,thetokeniswritten;otherwisetheProducerisstalledrealizingablockingwrite.Areadoperationconsistsofthreecommunicationprimitives;acheck,apeekandaget.TogethertheyrealizethereadingofdatafromtheReorderingChannelinanydesiredorder.Eachprimitiveperformsaparticulartask:

–checkinquiresifspeciﬁcdataispresentintheReorderMemory.ThedesiredtaggiventotheReorderingChannelasaRequest.Ifthatdataispresentatthead-dressgivenbythetag,theReorderingChannelsendsanacknowledgesignaltotheConsumer;otherwisetheConsumerblocksrealizingablockingread.

–getreadsatokenfromtherequestedaddressandsetthevalidbittofalseindicatedthatthelocationifavailableforwritingagain.Agetoperationisthereforedestruc-tive.

–peekreadsatokenfromtherequestedaddressbutkeepsthevalidbithightoin-dicatedthatthelocationisstillneededforreading.Usingthepeekoperation,thelife-timeofatokeniscontrolled.

IntheOOM-communicationtype,weonlyusethecheckandgetprimitivestoreaddatasincenomultiplicityisinvolved.Amemorylocationcanbeimmediatelyreleasedwhenreadingatoken.

2.6OutofOrderwithmultiplicitycommunication(OOM+)

IntheOut-of-OrderwithMultiplicitycommunication(OOM+)case,achannelhasthesamecharacteristicsastheOOM-case.AdditionalreleaselogicishoweveraddedattheConsumersidetokeeptrackofthelife-timeoftokens.Consequently,theOOM+communicationtypealsousesthepeekprimitiveifatokendoesnotyetneedtobereleasedfrommemory.Thegetprimitiveisusedwhenthereleaselogicindicatesthatthelife-timeofatokenhascometoitsend.

3MemoryAllocationSchemes

WhenimplementingaChannel,whetheritisaFIFOoraReorderingChannel,weneedtodeterminealowerboundontheamountoftokensthatcanbestoredinaChannel,withoutcausingadeadlocktooccur.Findingthislowerboundinprocessnetworks

8ClaudiuZissulescu,BartKienhuis,EdDeprettere

isingeneraladifﬁcultproblem.Aprocessnetworkisspeciﬁedintermsofpartialorderingsbetweenaproducerandconsumer,i.e.,theProducer/Consumerpair.ToﬁndalowerboundonaChannelrequires,however,atotalorderontheexecutionoftheprocessesinthenetwork.Manydifferenttotalordersexist;leadingtodifferenttrade-offsinevaluationspeedandmemoryrequirement.AtotalorderforaPNcanbeobtainedbyschedulingorbydoingarun-timeevaluationofaprocessnetwork.Nevertheless,wewouldliketoavoidbothmethods.Theschedulinginterfereswiththenotionofdistributedcontrolandtherun-timeevaluationisnotacompiletimeanalysisthatcanbeperformedaspartoftheCOMPAANtool.Also,boththerun-timeandscheduleapproachcannothandletheparameterizednatureofthePNswederive.Instead,weusecompiletimeallocationschemes.Withoutgoingtodeepintodetail,wecurrentlydistinguishtwocompiletimeallocationschemes:

–MemoryAllocationwithoutreleasingthememory(SAC)

Thismemoryallocationschemeisderivedfromthesingleassignmentdataallo-cationfoundbyoneoftheintermediatestepofourtoolchain.Inthisscheme,amemorylocationisassigneddataonlyonce.HenceaConsumerdoesnotneedtoreleaseamemorylocationafterithasconsumeddatafrommemory.ThememoryrequiredonaChannelisthedimensionneededtoaccommodateallproduceddata,whichisgivenbytheBoundingBox.

–MemoryAllocationWithreleasingthememory(SEQ)

ThismemoryallocationschemeisderivedfromtheoriginaldataallocationfoundinthesequentialinputMatlabprogram.Inthisscheme,memorylocationsarere-usedandhenceeachreadorwriteoperationontheChannelhastocheckifamemorylocationiseitherfreeorcontainsvaliddata.

TheSACallocationschemeresultsinthefastestexecutionofaPNasthenetworkcanrunwithmaximumparallelism.Anyotherallocationscheme,liketheSEQallocationscheme,mayrestrictthenetworkparallelismtoamoresequentialexecutionscheme,asexplicitchecksneedtobeperformedonthevalidityofdata.Theguaranteethatthetwodescribedallocationschemesdoindeednotintroducedeadlockduetounderdimensionedcommunicationbuffers,isgivenbythefactthattheallocationschemesareabletoruninboundedamountofmemoryinthesequentialexecution.Weknowfromsimulationthattighterlowerboundsarepossible.Furtherresearchisneededtodeveloptechniquestoderivethesetighterboundsatcompiletime.WeexpectthatresultsfromresearchinmemoryoptimizationforsequentialcodecanalsobeappliedtoimprovethememoryallocationinPNs.

4Hardwarerealization

AllfourcommunicationtypescanbeautomaticallygeneratedbyLAURAandhavebeenimplementedonaVirtexII-6000platformfromXilinx.AFIFOchannelisrealizedusingdifferenttypesofmemoryatcompiletimedependingonthecalculatedsize.Iflessthan1024bitsarerequired,weuseRAM16x1Dmemories,otherwiseweuseRAMB16memoryblocks.Ifonlyasinglelocationisrequired,wesimplyinstantiateaFIFO

CommunicationSynthesisinamultiprocessorenvironment9

channelthatusesonlyaregister.IncaseoftheReorderChannels,wealwaysuseBlockRAMstorealizetheReorderingMemoryasafulldualportmemoryisrequired.

ThehardwarerealizationofaFIFObufferisthefastestandmostefﬁcientoneintermsofcyclesperoperation,asitrequiresonecycleperreadorwriteoperation.InthecaseofaReorderingChannel,theSACimplementationrequiresstillonecycleperwriteoperationastheavailabilityofalocationdoesnotneedtobechecked.However,thereadoperationrequiresthreecyclestoreadonescalarfromthechannel.TheSEQimplementationoftheReorderChannelrequiresupto2cyclesforawriteand3cyclesforareadoperation.Inboththereadandwriteoperation,theavailabilityofalocationneedstobechecked.

5Example

Tohighlightthedifferentcharacteristicsofthecommunicationchannels,welookedattheQRalgorithm[11].Thisalgorithmrequires1IOM+and10IOM-channels.How-ever,wecanimplementthesechannelsalsousing1OOM+and10OOM-channels.Thishighlightsthedifferencesinthehardwareimplementationandperformanceofthedifferentchannelsandthebeneﬁtsofselectingthetherightchanneltypeatcompiletime.Inbothcases,thelowerboundsonthechannelsaredeterminedatcompiletimeusingtheSACallocationscheme.

15432

CyclesMemorySizeFrequency

1285120bits102Mhz

Compiletime

estimations

MemorySize111771

FIFOChannel

RAM16x1DSlices

Table1.Experimentalresultsforvarioushardwarechannels.

FromTable1,weobservethattheReorderChannelsaremoreinefﬁcientthanFIFOchannelsintermsofusageofFPGAmemories.TheReorderMemoryisrealizedusingonlySelectBlockRAMs(RAMB16).Forthe11channels,11RAMB16memoryblocksallocated,whichis180224bitsinmemoryontheFPGA.ForeachReorderingchannel,weneedtoallocationatleastaRAMB16blockwhichcanaccommodate512tokensof32bit,evenifweonlyneedspacefor10tokens.Hence,itisdifﬁculttousethememoryblockefﬁciently;therecanbequitesomespill.Ontheotherhand,foraFIFOchan-nel,wecanselectbetweenSelectBlockRAMsorRAM16x1Dmemories,dependingontherequiredchannelsize.Inthisparticularcase,weuse320RAM16x1DmemoryblocksandthememoryusedfortheFIFOimplementationisonlyslightlylargerthanthecalculatedsizeusingtheSACallocationscheme,i.e.,5120bitsversusthe4928bitscalculated.SincetheSACallocationschemeisused,theReorderChannelsimplementafastwriteoperation.Nevertheless,ittakestwicethenumberofcyclestoruntheQRal-gorithm(258cycles)comparedtowhenusingFIFOchannels(128cycles).Thatareadoperationrequires3cycles,hasabiginﬂuenceontheexecutionspeedofthealgorithm.ThenumberofslicesusedtoimplementtheQRalgorithmwithFIFOsis890andwithReorderingChannelsis1771.TheReorderingChannelrequirestwiceasmanyslices

10ClaudiuZissulescu,BartKienhuis,EdDeprettere

duetothehardwaretaggeneratorsandtheuseofRAMB16memories.Boththeexecu-tionspeedandnumberofslicesusedshowthatitisindeedveryimportanttoselectthepropercommunicationtypeinordertoevaluateanalgorithmasfastaspossiblewiththeleastamountofresources.

6Conclusions

Thedata-ﬂowintensiveapplications,wetargetwithCOMPAAN,requirelargecompu-tationalpoweranditisimportanttoderiveefﬁcientcommunicationstructuresinhard-ware.Fourpoint-to-pointcommunicationtypesaredistinguishedinCOMPAANandwehaveshownthatforeachofthem,wecanderiveefﬁcienthardwareinLAURA.TwotypesusehardwareFIFOimplementationsandtwotypesuseaReorderChannelim-plementation,aswepresentedinthispaper.Fromacasestudy,wehaveshownthataFIFOimplementationisthemostoptimalimplementationofaChannelfromanypointofview(throughput,hardwareresources,memoryusage).BecauseaFIFOcanreadandwriteinasinglecycleitcankeepupwiththemaximumdataﬂowthroughourvirtualprocessors.Fromexperience,weknowthatonaverage90%ofthechannelsinanapplicationcanberealizedwithFIFObuffers.Intheremaining10%,weemploytheReorderingChannelimplementation.Asfuturework,weareinterestedtofurtheropti-mizetheReorderChannelandportingittootherFPGAarchitectures,e.g.Altera.Also,wewanttoinvestigatenewcompiletimeallocationschemestoobtaintighterlowerboundsonthememoryneededinthecommunicationchannels.

References

1.A.Aho,R.Sethi,andJ.Ullman.Compilers:Principles,TechniquesandTools.Addison-WesleyPublishingCo.,1986.2.E.Ehrhart.Polynˆomesarithm´etiquesetM´ethodedesPoly´edresenCombinatoire.Birkh¨auserVerlag,Basel,internationalseriesofnumericalmathematicsvol.35edition,1977.

3.P.Feautrier.Dataﬂowanalysisofarrayandscalarreferences.InternationalJournalofParallelProgramming,20(1):23–53,1991.

4.P.GuerrierandA.Greiner.Agenericarchitectureforon-chippacket-switchedintercon-nections.InDATE’00:ProceedingsoftheconferenceonDesign,automationandtestinEurope,pages250–256.ACMPress,2000.

5.K.Keutzer,S.Malik,R.Newton,J.Rabaey,andA.L.Sangiovanni-Vincentelli.Systemleveldesign:Orthogonalizationofconcernsandplatform–baseddesign.IEEETrans.onCAD,2000.

6.S.Muchnick.AdvancedCompilerDesignandImplementation.MorganKaufmannPublish-ers,Inc.,1997.

7.E.Rijpkema,K.Goossens,A.adulescu,J.vanMeerbergen,P.Wielage,andE.Waterlander.Tradeoffsinthedesignofarouterwithbothguaranteedandbest-effortservicesfornetworksonchip,2003.

8.T.Stefanov,C.Zissulescu,A.Turjan,B.Kienhuis,andE.Deprettere.Systemdesignusingkahnprocessnetworks:Thecompaan/lauraapproach.InProceedingsofDATE2004,Paris,France,Feb16–202004.

CommunicationSynthesisinamultiprocessorenvironment11

9.A.Turjan,B.Kienhuis,andE.Deprettere.Acompiletimebasedapproachforsolvingout-of-ordercommunicationinKahnProcessNetworks.InProceedingsofIEEE13thInternationalConferenceonApplication-speciﬁcSystems,ArchitecturesandProcessors,July17-192002.10.A.Turjan,B.Kienhuis,andE.Deprettere.AnIntegerLinearProgrammingApproachto

ClassifyCommunicationinProcessNetworks.In8thInternationalWorkshoponSoftwareandCompilersforEmbeddedSystems(SCOPES),Amsterdam,Sept.2–32004.

11.R.WalkeandR.Smith.20gﬂopsqrprocessoronaxilinxvirtex-efpga.InAdvancedSignal

ProcessingAlgorithms,Architectures,andImplementationsX,volume4116,2000.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文

全部栏目

Communication Synthesis in a multiprocessor environment