您的当前位置:首页Communication Synthesis in a multiprocessor environment

Communication Synthesis in a multiprocessor environment

2023-10-28 来源:世旅网
CommunicationSynthesisinamultiprocessor

environment

ClaudiuZissulescu,BartKienhuis,EdDeprettere

LeidenEmbeddedResearchCenter,

LeidenInstituteofAdvancedComputerScience(LIACS),

LeidenUniversity,TheNetherlands{claus,kienhuis,edd}@liacs.nl

Presentedat

Field-ProgrammableLogicandApplicationsconference(FPL’05)

Tampere,FinlandAugust24-26,2005

Abstract.AtLeidenUniversity,wearedevelopingadesignmethodologythatallowsforfastmappingofnested-loopapplications(e.g.DSP,Imaging,orMulti-Media)writteninasubsetofMatlabontoreconfigurabledevices.ThisdesignmethodologyisimplementedintoatoolchainthatwecallCOMPAAN/LAURA[8].Thismethodologygeneratesaprocessnetworkinwhichtheinter-processcom-municationtakesplaceinapoint-to-pointfashion.Fourtypesofpoint-to-pointinter-processorcommunicationexistinthePN.TwoofthemuseaFIFOlikecommunicationandtheothertwouseacachelikememorytoexchangedata.Inthispaper,weinvestigatetherealizationsforthefourcommunicationtypesandshowthatpoint-to-pointcommunicationatthelevelofscalarscanberealizedautomaticallyandveryefficientlyintoday’sFPGAs.

1Introduction

Tobetterexploitthereconfigurablehardwaredevicesthatarecomingtomarket,anum-beroftechnologiesaredevelopedtohandlebillionsoftransistorsavailableinthesenewchips.Akeyideainthesetechnologiesisdecouplingthecommunicationfromcomputation.ThisdecouplingallowstheIPcores(thecomputationpart)andtheinter-connect(thecommunicationpart)tobedesignseparately[5].Respectingthisdesignconcept,wearedevelopingadesignmethodologythatallowsfastmappingofnested-loopapplications(e.g.DSP,Imaging,orMulti-Media)writteninasubsetofMatlabontoreconfigurabledevices.ThisdesignmethodologyisimplementedintoatoolchainthatwecallCOMPAAN/LAURA[8].

TheCOMPAANtoolanalyzestheMatlabapplicationandderivesautomaticallyaparallelrepresentation,expressedasaProcessNetwork(PN).APNconsistsofcon-currentprocessesthatareinterconnectedviaasynchronousFIFOs.ThecontroloftheinputMatlabprogramisdistributedovertheprocessesandthememoryisdistributedovertheFIFOs.TheLAURAtoolsynthesizesanetworkofhardwareprocessorsfromthegivenPN.AkeyoperationinLAURAistogeneratetheproperhardwarecommu-nicationmechanismforthepointtopointinter-processorscommunication,whichmaybeadifferentcommunicationstructurethanthePNcommunicationFIFOmodel.

2ClaudiuZissulescu,BartKienhuis,EdDeprettere

Ourtoolflowhasbeendevelopedfordata-flowalgorithms,havingcommunica-tionatthelevelofscalars(e.g.bytesorwords).ThecommunicationtopologyofthePNisstatic,derivedatcompiletime.Torealizetheinter-processorcommunication,weusepoint-to-pointcommunicationmechanism.Employingbussesand/orcomplexNetworks-on-Chips(NoCs)[7,4]forthecommunicationisnotfeasibleduetothede-laysintheroutingprocessandtheusageoflargepacketsinsteadofscalarsinthecom-municationprotocol.

Aswefoundoutin[10],fourtypesofpoint-to-pointinter-processorcommunicationexistinthePNwegenerate.TwoofthemuseaFIFOlikecommunicationandtheothertwouseacachelikememorytoexchangedata.Inthispaper,weinvestigatetherealiza-tionsforthefourcommunicationtypesanddiscusstheeffectivenessoftherealizations.Therestofthepaperisorganizedasfollows:first,wepresenttheCOMPAAN/LAURAflowinordertounderstandhowthehardwaremappingisdonebyLAURA.Next,weaddressthecommunicationchannelgenerationandproposeanapproachtosolveeachcommunicationtype.Wefinishthispaperwithadiscussionoverthemeritsandim-provementsoftheseapproachesinthecontextofourtoolchain.1.1

COMPAAN/LAURAdesignflow

TheprocessnetworksweconsiderinthispaperarederivedusingtheCOMPAANtoolchain.COMPAANtakesasinputparameterizedstaticnestedloopprogramswritteninMatlabandconvertsthiscodetoprocessnetworks.Anintermediatestepistransforma-tionoftheinitialMatlabcodeintosingleassignmentcode(SAC)usingexactdataflowanalysis[3].Thelasttooloftheflow,calledLAURA,isusedtogenerateaVHDLde-scriptionofanarchitecturefromaPNdescription.Duringthisstep,eachprocessofthePNismappedtoanabstractarchitecturalmodelcalledVirtualProcessor.Eachvirtualprocessorconsistsofthreedistinctcomponents:

–AnExecuteUnit,whichisthecomputationalpartofthevirtualprocessor.ThisunitwrapsinanIPcorethatimplementsthefunctionalityoftheprocess.ItsinterfaceconsistsofanumberofInputargumentsandOutputarguments.

–AReadUnit,whichisresponsibleforassigningvalidtokenstotheinputargumentsoftheExecuteUnit.Sincetherearemoreinputportsthanarguments,theReadUnithastoselectatrun-timefromwhichChanneltoreadtokensusingacontrolprogramthatisderivedbyCOMPAAN.

–AndaWriteUnit,whichisresponsiblefordistributingtheresultsoftheExecuteUnittodifferentChannels.AwriteoperationcanexecuteonlywhenalltheoutputargumentsoftheExecuteUnitareavailablefortheWriteUnit.SimilarlytotheReadUnit,theWriteUnithastoselectachannelatrun-timetowritetokensinto,usingacontrolprogramthatisderivedbyCOMPAAN.

TheapplicationstargetedbyCOMPAANareusuallydata-flowintensive,requiringlargecomputationalpower.Therefore,animportantissueinLAURAisthederivationofef-ficientcommunicationstructuresinhardwareandisthefocusofthispaper.Initially,COMPAANfindsintheinputMatlabfile,allthepossibleproducer-consumerpairs.Atthatlevelthecommunicationbetweentwoprocessesisdoneusingamultidimensional

CommunicationSynthesisinamultiprocessorenvironment3

array,representedasapolytope.ToselectthetypeofcommunicationalinearizationprocedureisemployedbytheCOMPAANtoolwhichselectstherighttypeofcommuni-cationchannel.

2CommunicationGeneration

InahardwarenetworkgeneratedbyLAURA,eachprocessorexecutesaninternalcontrolprogramatboththeReadandWriteUnits.ThisprogramdescribesalocalscheduleintermsofExecuteUnitexecutions.Ateachexecution,alsorefertoasaniteration,aReadUnitreadsdatafromaChannelandaWriteUnitwritesdatatoaChannel.IntheoriginalMatlabcode,theChannelrepresentsthecommunicationonan-Dimensionalarray(e.g.,a[i,j]).Thisarrayisreplacedby1-Darraybyourtoolchaininthelinearizationstep.

Virtual Processor 1Channel 1ReadExecuteWriteChannel 2Channel 3ReadExecuteWriteVirtual Processor 2ProducerConsumerFig.1.AProducer-Consumerpair

Figure1depictsaclassicalproducerconsumerpair.AVirtualProcessor1sendsdatatothesecondprocessorcalledVirtualProcessor2viatheChannel2.ThechannelrepresentsthedatadependencybetweenaReadUnitandaWriteUnit.ThisrelationisgivenbyaMappingfunction.Thelinearizationstepreplacestheaddressingofanarraywithrelativeaddressingschemebasedonputandgetprimitives.Usually,thede-rivedcommunicationchannelisaFIFO,however,therearecasesinwhichaFIFOisnotsufficienttolinearizean-dimensionalarray[10].WehavefoundthatfourtypesofcommunicationcanbedistinguishedasgiveninFigure2.Theyresultfromtheorder-ingoftheiterationsattheProducerandtheConsumerprocessesandtheexistenceofmultiplicityforagiventoken,whichmeansthatatokenthatissentbyProducerisreadmorethanonceattheConsumerside.Hence,dependingontheorderandexistenceofmultiplicity,anarbitrarycommunicationchannelbelongstooneoffourdisjointclasses:in-orderwithoutmultiplicity(IOM-),in-orderwithmultiplicity(IOM+),out-of-orderwithoutmultiplicity(OOM-),andout-of-orderwithmultiplicity(OOM+).Foreachclassanadequatecommunicationmechanismneedstobeefficientlysynthesizedinhardwareintermsofcyclesperoperation,areaandspeed.

Fromexperience[10],weknowthatonaveragethefollowingdistributioncanbeex-pectedoverthevariouscommunicationtypes:typeIOM-(80%),IOM+(10%),OOM-(9%),OOM+(1%).TypeIOM-togetherwithtypeIOM+,resultinthat90%ofthecommunicationchannels,requireaFIFObuffertorealizethecommunication.Intheremaining10%ofthecases,amorecomplexReorderingChannelisneeded.2.1InOrdercommunication(IOM-)

IntheInOrdercommunication(IOM-)case,theProducerwritesdataintheChannelinthesameorderastheConsumerreadsfromtheChannel.Therefore,thisChannelis

4ClaudiuZissulescu,BartKienhuis,EdDeprettere

In−order without multiplicity (IOM−) :

Produceri4321In−order with multiplicity (IOM+) :Produceri4data dependecyloop schedulej4321Consumerj4321Consumer3211 2 3 4 5 i1 2 3 4 5 iOut−of−order without multiplicity (OOM−) :

Produceri4321j43211 2 3 4 5 iOut−of−order with multiplicity (OOM+) :

Producerij43211 2 3 4 5 i4321ConsumerConsumerFig.2.ThefourcasesofcommunicationbetweenProducerandConsumer

implementedinhardwareusingaFIFObuffer.Itisaccessedusingthetwoprimitivesput(implementedintheWriteunit)andget(implementedintheReadunit).BecausehighlyoptimizedimplementationsofFIFObuffersexistfortoday’sFPGAs,ittakeseachprimitiveonlyasinglecycletowritedataortoreaddatafromaChannel.AhardwareFIFOhasfinitememory,thusbothprimitivesareblocking,e.g.theyhaltaprocessorwhennodataisavailableinaFIFOorwhenaFIFOisfull.FindingalowerboundonaChannelisahardprobleminPNsanditisoutsidethescopeofthispaper,althoughasmalldiscussionisgiveninSection3.

2.2InOrderwithMultiplicitycommunication(IOM+)

IntheInOrderwithMultiplicityCommunication(IOM+)case,theorderdataispro-ducedisthesameastheorderinwhichdataisconsumed.However,somedataiscon-sumedmorethanonce,breakingthecommunicationmodelofaFIFOwhereagetoperationisdestructive.Inthismodel,thelife-timeofatokenneedstobetakenintoaccount.Onlyattheendofthelife-timeofatoken,thetokencanbereleasedfromtheFIFO.WhiletheputprimitiveremainsthesameasinthecaseofIOM-,weaddedanewcommunicationprimitivewhichwecalledthepeekprimitive.ThepeekprimitivefetchesdatafromaFIFObufferwithoutdestroyingit.TodestroythecurrentFIFOdataareleasecontrolissynthesizedintheConsumerReadUnit.Also,theoutputoftheFIFOisregisteredbytheMultiplicityRegisterwhichiscontrolledbythereleasecontrol.Apeekisonlyreadingthecontentsoftheregister,whileagetoperationisreadinganewvaluefromtheFIFObufferandplaceitintheMultiplicityRegister.Thecontrolthatdeterminesthelife-timeofatokenisexpressedinthesamewaythecontrolprogramsintheReadandWriteUnitareexpressed.2.3OutofOrdercommunication(OOM-)

IntheOut-of-Ordercommunication(OOM-)case,aConsumerreadsdatainadifferentorderithasbeenwrittenbytheProducer.Hence,thecommunicationchannelallowsaConsumertofetchdataintheorderitexpectit.WerefertothiskindofChannel,

CommunicationSynthesisinamultiprocessorenvironment5

whichallowsforrun-timereorderingoftokens,asaReorderChannel.Themainele-mentsofthisReorderChannelisthereordermemoryandthetaggingoftokensthatarewritten/readto/fromthereorderingmemory.EachtokenneedstobetaggedtoallowtheConsumerProcesstorequestparticulartokensintheordergivenbyitslocalsched-ule.Thetagcomputationtakesplaceinbothpartsinvolvedinthetransaction,i.e.,theProducersideandtheConsumerside.

ValidDataProducerProducerData Out &AddressDataConsumerConsumerReorder MemoryReorder Memory(RAM)(RAM)Data InDataRequestAddress?ACK ProducerACK ConsumerFig.3.TheorganizationoftheReorderchannel

InFigure3,theorganizationoftheReorderChannelisgiven.Themainelementistherandomaccessmemory,calledtheReorderMemory.ThefigurealsoshowsaProducerthatwantstowritetokenstotheChannelandaConsumerthatwantstoreadtokens.Atoken(Data)thatiswrittenbytheProducer,istemporarilystoredinareg-istertogetherwithanaddress.ThisaddressiscalculatedbythetaggeneratoroftheProducer.Eachtokenthatisstoredinmemoryhasavalidbit,whichindicatesthataparticularlocation(address)containsvaliddata.Ifthevalidbitisset,theProducerisnotallowedtowritedataandiscompletelystalleduntiltheaddressbecomesavailableagain(ACKProducer).Otherwise,theProducerwritesthetemporarilystoreddataintothememoryandsetsthevalidbit.Attheotherside,theConsumerplacesarequestcommandtotheReorderingChannelforaparticularlocationgivenbyanaddress.Iftherequestedlocationcontainsvaliddata,theConsumerreceivesanacknowledgesig-nal(ACKConsumer),and,atthesametime,thedesireddata.Ifthelocationdoesnotcontainvaliddata,theConsumerstallsuntilvaliddatabecomesavailable.GiventheorganizationoftheReorderingChannelinFigure3,twoissuesdeterminethedesign.Oneisrelatedwiththecomplexityofthetaggenerationandtheotheroneisrelatedwiththeperformanceofthereorderchannelintermsofclockcycles.2.4Taggeneration

Inthegenerationoftags,wetakeadvantagesofthefactthatweoperateonpolytopeswithinCOMPAAN.Thisleadstotwodifferentapproachwecanusetogeneratethetag.OneapproachisbasedontheEhrhartenumerationtheory[2].Usingthistheory,weobtainapseudo-polynomialexpressionthatgivesanuniqueintegervalueforeachpointenclosedbythepolytope.Thisapproachhasbeensuccessfullyexploredinsoftwarein[9].However,thisapproachisnotsuitableforhardwareimplementationduetothecomplexityoftheobtainedpseudo-polynomialexpression.InourexampleinFigure4,

6ClaudiuZissulescu,BartKienhuis,EdDeprettere

thepseudopolynomialforthepolytopeenclosingtheproducerpointsisgivenbythefollowexpression:(−1/4)∗i2+(N−5/2)∗i+j+[−1,5/2]i.Inthisexpression,thepseudopolynomialterm[−1,5/2]iindicatesthatwhentheevaluationofmod(i,2)isequaltozero,thevalue-1isselected;otherwisethevalue5/2isselectedinthepolynomial.

Inthesecondapproach,werelaxtheshapethatenclosestheproducerpolytopetoahyper-rectangularshape.Wecallthishyper-rectangularshapetheBoundingBox.ForaBoundingBox,wecanmakeuseofclassicallinearizationtoconvertan-dimensionrectangletoanone-dimensionalarray[1,6].FindtheBoundingBoxthatbestenclosesthepolytopeisaminimizationproblemthatwesolveusingintegerlinearprogram-ming.TheimprovementovertheEhrhartapproachisthateachBoundingBoxcanbeaddressedusingasimplepolynomialthatcanbeimplementedefficientlyinhardware.Thetagforatokenisobtainedasafunctionoftheiteratorsofaprocessorandhasthe󰀁N

formtag=k=1ck∗xk+x0+c0whereckrepresentsaconstant,xkisaniterationspaceindex.EachtagbecomesanaddressforaRAMmemory.

ProducerMappingN=87654321543211098761413121118171615212019242322262528272930Consumer(i,j) <−− (y,x)N=87654321211373122148422395241062517112618122719282029301516BoundingBoxPolytopej

i12345678910 = N+2y12345678910 = N+2W:(N−3)*(i−1) + j−400010203040506070809xR:(N−3)*(x−1) + y−420 NULL21 NULL22token 1923token 2024token 2125 NULL26 NULL27token 2228token 2329token 2430313233343536373839token 1token 2token 3token 4token 5token 6token 7token 8token 9token 1010111213141516171819 NULLtoken 11token 12token 13token 14 NULLtoken 15token 16token 17token 18 NULL NULL NULLtoken 25token 26 NULL NULL NULLtoken 27token 2840414243444546474849 NULL NULL NULL NULLtoken 29 NULL NULL NULL NULLtoken 30Reorder memory The Linearization shape is given by the smallestrectagular containing the Producer domain

Fig.4.TheLinearrealizationofthereordermemory

IfweextracttheBoundingBoxattheProducersideinFigure4,weobtainthewriteaddress,whichis(N−3)∗(i−1)+j−4.Fromtheaddressgeneratedbythewriteaddress,theConsumerhastoreadtokensintheorderitdesires.Itthereforerequiresareadaddresstoaccessthetokens.Sincethereisawelldefinedmathematicalrelation-shipbetweentheProducerandConsumerasgivenbytheMapping,wecanobtainthereadaddressbymathematicallycomposingthewriteaddresswiththemappingfunc-tion.Thus,theparameterizedreadaddressisequalto(N−3)∗(x−i)+y−4.Inthiscase,reorderingisinducedbythefactthatthepointsintheConsumerpolytopeareaccessedinadifferentorderthentheProducerhaswrittenthantotheReorderChannel.

ThedownsideoftheBoundingBoxapproachisthatthememoryallocatedfortheReorderMemoryislargerthanneededwhencomparedwiththeEhrhartapproach.InFigure4,thecontentoftheReorderMemoryisgiven.Itshowsatwhichaddressapar-ticulartokeniswritten.Forexample,thetokenfromlocation(6,7)iswrittenataddress28.Thisisthe23rdtokenwrittenbytheProducer.ThistokenisreadbytheConsumer

CommunicationSynthesisinamultiprocessorenvironment7

atlocation(6,7)asthe18thtokenread.Fromthecontentofthereorderingmemory,weobservethatmemorylocationsexistthatareneverwrittenorreadduringtheexecutionofanetwork,asgivenbythe”Nulls”.InthecaseofanEhrhartgeneratedaddress,these”Nulls”wouldnotexistattheexpenseofmorecomplexaddresspolynomialsthataredifficulttobeimplementedefficientlyinhardware.2.5Communicationprotocol

TheReorderingChannelimplementsaparticularcommunicationprotocolthatinvolveswritingandreadingatoken.Awriteoperationusesonlytheputcommunicationprim-itive,likewewouldhavedoneincaseofaFIFOChannel.ThisprimitiveprovidestotheReorderChannelatokenthatconsistsofataganddata.Ifplaceisavailableatthelocationgivenbythetag,thetokeniswritten;otherwisetheProducerisstalledrealizingablockingwrite.Areadoperationconsistsofthreecommunicationprimitives;acheck,apeekandaget.TogethertheyrealizethereadingofdatafromtheReorderingChannelinanydesiredorder.Eachprimitiveperformsaparticulartask:

–checkinquiresifspecificdataispresentintheReorderMemory.ThedesiredtaggiventotheReorderingChannelasaRequest.Ifthatdataispresentatthead-dressgivenbythetag,theReorderingChannelsendsanacknowledgesignaltotheConsumer;otherwisetheConsumerblocksrealizingablockingread.

–getreadsatokenfromtherequestedaddressandsetthevalidbittofalseindicatedthatthelocationifavailableforwritingagain.Agetoperationisthereforedestruc-tive.

–peekreadsatokenfromtherequestedaddressbutkeepsthevalidbithightoin-dicatedthatthelocationisstillneededforreading.Usingthepeekoperation,thelife-timeofatokeniscontrolled.

IntheOOM-communicationtype,weonlyusethecheckandgetprimitivestoreaddatasincenomultiplicityisinvolved.Amemorylocationcanbeimmediatelyreleasedwhenreadingatoken.

2.6OutofOrderwithmultiplicitycommunication(OOM+)

IntheOut-of-OrderwithMultiplicitycommunication(OOM+)case,achannelhasthesamecharacteristicsastheOOM-case.AdditionalreleaselogicishoweveraddedattheConsumersidetokeeptrackofthelife-timeoftokens.Consequently,theOOM+communicationtypealsousesthepeekprimitiveifatokendoesnotyetneedtobereleasedfrommemory.Thegetprimitiveisusedwhenthereleaselogicindicatesthatthelife-timeofatokenhascometoitsend.

3MemoryAllocationSchemes

WhenimplementingaChannel,whetheritisaFIFOoraReorderingChannel,weneedtodeterminealowerboundontheamountoftokensthatcanbestoredinaChannel,withoutcausingadeadlocktooccur.Findingthislowerboundinprocessnetworks

8ClaudiuZissulescu,BartKienhuis,EdDeprettere

isingeneraladifficultproblem.Aprocessnetworkisspecifiedintermsofpartialorderingsbetweenaproducerandconsumer,i.e.,theProducer/Consumerpair.TofindalowerboundonaChannelrequires,however,atotalorderontheexecutionoftheprocessesinthenetwork.Manydifferenttotalordersexist;leadingtodifferenttrade-offsinevaluationspeedandmemoryrequirement.AtotalorderforaPNcanbeobtainedbyschedulingorbydoingarun-timeevaluationofaprocessnetwork.Nevertheless,wewouldliketoavoidbothmethods.Theschedulinginterfereswiththenotionofdistributedcontrolandtherun-timeevaluationisnotacompiletimeanalysisthatcanbeperformedaspartoftheCOMPAANtool.Also,boththerun-timeandscheduleapproachcannothandletheparameterizednatureofthePNswederive.Instead,weusecompiletimeallocationschemes.Withoutgoingtodeepintodetail,wecurrentlydistinguishtwocompiletimeallocationschemes:

–MemoryAllocationwithoutreleasingthememory(SAC)

Thismemoryallocationschemeisderivedfromthesingleassignmentdataallo-cationfoundbyoneoftheintermediatestepofourtoolchain.Inthisscheme,amemorylocationisassigneddataonlyonce.HenceaConsumerdoesnotneedtoreleaseamemorylocationafterithasconsumeddatafrommemory.ThememoryrequiredonaChannelisthedimensionneededtoaccommodateallproduceddata,whichisgivenbytheBoundingBox.

–MemoryAllocationWithreleasingthememory(SEQ)

ThismemoryallocationschemeisderivedfromtheoriginaldataallocationfoundinthesequentialinputMatlabprogram.Inthisscheme,memorylocationsarere-usedandhenceeachreadorwriteoperationontheChannelhastocheckifamemorylocationiseitherfreeorcontainsvaliddata.

TheSACallocationschemeresultsinthefastestexecutionofaPNasthenetworkcanrunwithmaximumparallelism.Anyotherallocationscheme,liketheSEQallocationscheme,mayrestrictthenetworkparallelismtoamoresequentialexecutionscheme,asexplicitchecksneedtobeperformedonthevalidityofdata.Theguaranteethatthetwodescribedallocationschemesdoindeednotintroducedeadlockduetounderdimensionedcommunicationbuffers,isgivenbythefactthattheallocationschemesareabletoruninboundedamountofmemoryinthesequentialexecution.Weknowfromsimulationthattighterlowerboundsarepossible.Furtherresearchisneededtodeveloptechniquestoderivethesetighterboundsatcompiletime.WeexpectthatresultsfromresearchinmemoryoptimizationforsequentialcodecanalsobeappliedtoimprovethememoryallocationinPNs.

4Hardwarerealization

AllfourcommunicationtypescanbeautomaticallygeneratedbyLAURAandhavebeenimplementedonaVirtexII-6000platformfromXilinx.AFIFOchannelisrealizedusingdifferenttypesofmemoryatcompiletimedependingonthecalculatedsize.Iflessthan1024bitsarerequired,weuseRAM16x1Dmemories,otherwiseweuseRAMB16memoryblocks.Ifonlyasinglelocationisrequired,wesimplyinstantiateaFIFO

CommunicationSynthesisinamultiprocessorenvironment9

channelthatusesonlyaregister.IncaseoftheReorderChannels,wealwaysuseBlockRAMstorealizetheReorderingMemoryasafulldualportmemoryisrequired.

ThehardwarerealizationofaFIFObufferisthefastestandmostefficientoneintermsofcyclesperoperation,asitrequiresonecycleperreadorwriteoperation.InthecaseofaReorderingChannel,theSACimplementationrequiresstillonecycleperwriteoperationastheavailabilityofalocationdoesnotneedtobechecked.However,thereadoperationrequiresthreecyclestoreadonescalarfromthechannel.TheSEQimplementationoftheReorderChannelrequiresupto2cyclesforawriteand3cyclesforareadoperation.Inboththereadandwriteoperation,theavailabilityofalocationneedstobechecked.

5Example

Tohighlightthedifferentcharacteristicsofthecommunicationchannels,welookedattheQRalgorithm[11].Thisalgorithmrequires1IOM+and10IOM-channels.How-ever,wecanimplementthesechannelsalsousing1OOM+and10OOM-channels.Thishighlightsthedifferencesinthehardwareimplementationandperformanceofthedifferentchannelsandthebenefitsofselectingthetherightchanneltypeatcompiletime.Inbothcases,thelowerboundsonthechannelsaredeterminedatcompiletimeusingtheSACallocationscheme.

15432

CyclesMemorySizeFrequency

1285120bits102Mhz

Compiletime

estimations

MemorySize111771

FIFOChannel

RAM16x1DSlices

Table1.Experimentalresultsforvarioushardwarechannels.

FromTable1,weobservethattheReorderChannelsaremoreinefficientthanFIFOchannelsintermsofusageofFPGAmemories.TheReorderMemoryisrealizedusingonlySelectBlockRAMs(RAMB16).Forthe11channels,11RAMB16memoryblocksallocated,whichis180224bitsinmemoryontheFPGA.ForeachReorderingchannel,weneedtoallocationatleastaRAMB16blockwhichcanaccommodate512tokensof32bit,evenifweonlyneedspacefor10tokens.Hence,itisdifficulttousethememoryblockefficiently;therecanbequitesomespill.Ontheotherhand,foraFIFOchan-nel,wecanselectbetweenSelectBlockRAMsorRAM16x1Dmemories,dependingontherequiredchannelsize.Inthisparticularcase,weuse320RAM16x1DmemoryblocksandthememoryusedfortheFIFOimplementationisonlyslightlylargerthanthecalculatedsizeusingtheSACallocationscheme,i.e.,5120bitsversusthe4928bitscalculated.SincetheSACallocationschemeisused,theReorderChannelsimplementafastwriteoperation.Nevertheless,ittakestwicethenumberofcyclestoruntheQRal-gorithm(258cycles)comparedtowhenusingFIFOchannels(128cycles).Thatareadoperationrequires3cycles,hasabiginfluenceontheexecutionspeedofthealgorithm.ThenumberofslicesusedtoimplementtheQRalgorithmwithFIFOsis890andwithReorderingChannelsis1771.TheReorderingChannelrequirestwiceasmanyslices

10ClaudiuZissulescu,BartKienhuis,EdDeprettere

duetothehardwaretaggeneratorsandtheuseofRAMB16memories.Boththeexecu-tionspeedandnumberofslicesusedshowthatitisindeedveryimportanttoselectthepropercommunicationtypeinordertoevaluateanalgorithmasfastaspossiblewiththeleastamountofresources.

6Conclusions

Thedata-flowintensiveapplications,wetargetwithCOMPAAN,requirelargecompu-tationalpoweranditisimportanttoderiveefficientcommunicationstructuresinhard-ware.Fourpoint-to-pointcommunicationtypesaredistinguishedinCOMPAANandwehaveshownthatforeachofthem,wecanderiveefficienthardwareinLAURA.TwotypesusehardwareFIFOimplementationsandtwotypesuseaReorderChannelim-plementation,aswepresentedinthispaper.Fromacasestudy,wehaveshownthataFIFOimplementationisthemostoptimalimplementationofaChannelfromanypointofview(throughput,hardwareresources,memoryusage).BecauseaFIFOcanreadandwriteinasinglecycleitcankeepupwiththemaximumdataflowthroughourvirtualprocessors.Fromexperience,weknowthatonaverage90%ofthechannelsinanapplicationcanberealizedwithFIFObuffers.Intheremaining10%,weemploytheReorderingChannelimplementation.Asfuturework,weareinterestedtofurtheropti-mizetheReorderChannelandportingittootherFPGAarchitectures,e.g.Altera.Also,wewanttoinvestigatenewcompiletimeallocationschemestoobtaintighterlowerboundsonthememoryneededinthecommunicationchannels.

References

1.A.Aho,R.Sethi,andJ.Ullman.Compilers:Principles,TechniquesandTools.Addison-WesleyPublishingCo.,1986.2.E.Ehrhart.Polynˆomesarithm´etiquesetM´ethodedesPoly´edresenCombinatoire.Birkh¨auserVerlag,Basel,internationalseriesofnumericalmathematicsvol.35edition,1977.

3.P.Feautrier.Dataflowanalysisofarrayandscalarreferences.InternationalJournalofParallelProgramming,20(1):23–53,1991.

4.P.GuerrierandA.Greiner.Agenericarchitectureforon-chippacket-switchedintercon-nections.InDATE’00:ProceedingsoftheconferenceonDesign,automationandtestinEurope,pages250–256.ACMPress,2000.

5.K.Keutzer,S.Malik,R.Newton,J.Rabaey,andA.L.Sangiovanni-Vincentelli.Systemleveldesign:Orthogonalizationofconcernsandplatform–baseddesign.IEEETrans.onCAD,2000.

6.S.Muchnick.AdvancedCompilerDesignandImplementation.MorganKaufmannPublish-ers,Inc.,1997.

7.E.Rijpkema,K.Goossens,A.adulescu,J.vanMeerbergen,P.Wielage,andE.Waterlander.Tradeoffsinthedesignofarouterwithbothguaranteedandbest-effortservicesfornetworksonchip,2003.

8.T.Stefanov,C.Zissulescu,A.Turjan,B.Kienhuis,andE.Deprettere.Systemdesignusingkahnprocessnetworks:Thecompaan/lauraapproach.InProceedingsofDATE2004,Paris,France,Feb16–202004.

CommunicationSynthesisinamultiprocessorenvironment11

9.A.Turjan,B.Kienhuis,andE.Deprettere.Acompiletimebasedapproachforsolvingout-of-ordercommunicationinKahnProcessNetworks.InProceedingsofIEEE13thInternationalConferenceonApplication-specificSystems,ArchitecturesandProcessors,July17-192002.10.A.Turjan,B.Kienhuis,andE.Deprettere.AnIntegerLinearProgrammingApproachto

ClassifyCommunicationinProcessNetworks.In8thInternationalWorkshoponSoftwareandCompilersforEmbeddedSystems(SCOPES),Amsterdam,Sept.2–32004.

11.R.WalkeandR.Smith.20gflopsqrprocessoronaxilinxvirtex-efpga.InAdvancedSignal

ProcessingAlgorithms,Architectures,andImplementationsX,volume4116,2000.

因篇幅问题不能全部显示,请点此查看更多更全内容