From: Pietro Amodeo (pamodeo_at_icmib.na.cnr.it)
Date: Tue Dec 28 2010 - 05:45:16 CST
Hi Axel,
thanks a lot for your reply.
On Tue, December 28, 2010, 4:46 am, Axel Kohlmeyer wrote:
> hi pietro,
>
> On Mon, Dec 27, 2010 at 3:20 PM, Pietro Amodeo <pamodeo_at_icmib.na.cnr.it>
> wrote:
>> Hi,
>>
>> sorry but the table in my last post is wrong:
>> 1) obviously, the reported ratio is Time(1)/Time(N) and NOT
>> Time(N)/Time(1)!!!!
>> 2) the correct figures are:
>>  N   Time(1)/Time(N)
>>  1    1
>>  2    1.9733511924
>>  4    3.5960034869
>>  6    5.1641581203
>>  8    6.5367137981
>> 10    8.0500773076
>> 12    9.1171710303
>>
>> 16    8.8086727989
>>
>> 20    9.6037249284
>> 22   10.103089676
>> 24   10.6848376171
>
> those timings are fairly good.
> i don't know what you are complaining about.
>
> you really have only 12 physical CPU cores on
> your machine and about 10-15% extra speed from
> hyper-threading is quite typical for this kind of setup.
>
> the fact that you don't get perfect scaling can be easily
> explained by two reasons: memory bandwidth contention
> overall and lack of processor affinity that makes the
> contention worse.
>
> memory contention gets worse the larger the system is,
> as that makes CPU caches less efficient. overall, the
> topology and size of caches also have an impact on performance
> and scaling.
I was aware that the speed-up from HT could be poor, and the absolute times
were quite good too. What I was complaining about was only the scaling from
1 to 12 cores, especially compared with results obtained on dual-CPU nodes
with 8-core Opteron processors, where scaling was almost ideal up to
16 cores (>15x). However, as you clearly explained, memory bandwidth
contention and/or lack of processor affinity may easily account for the
worse scaling, all the more so because, as I wrote in my last email, the
benchmarks on the Opteron cluster were run on a smaller protein+membrane
system.
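To put numbers on the scaling I am referring to, here is a minimal sketch in Python that turns the speed-up figures from my table into parallel efficiency (speed-up divided by N) and estimates the gain from HT as the ratio of the 24-thread to the 12-core speed-up. The names `speedup`, `PHYSICAL_CORES`, `efficiency` and `ht_gain` are just illustrative, and the 12-physical-core figure is taken from your reply rather than measured by the script:

```python
# Speed-up figures Time(1)/Time(N), copied from the table quoted above.
speedup = {
    1: 1.0,
    2: 1.9733511924,
    4: 3.5960034869,
    6: 5.1641581203,
    8: 6.5367137981,
    10: 8.0500773076,
    12: 9.1171710303,
    16: 8.8086727989,
    20: 9.6037249284,
    22: 10.103089676,
    24: 10.6848376171,
}

# Assumption: 12 physical cores, as stated in Axel's reply.
# Above this count, processes share cores through hyper-threading.
PHYSICAL_CORES = 12


def efficiency(n):
    """Parallel efficiency: measured speed-up divided by the ideal speed-up n."""
    return speedup[n] / n


def ht_gain():
    """Relative extra speed of 24 threads over the 12 physical cores."""
    return speedup[24] / speedup[PHYSICAL_CORES] - 1.0


for n in sorted(speedup):
    note = "" if n <= PHYSICAL_CORES else "  (hyper-threaded)"
    print(f"N={n:2d}  speedup={speedup[n]:6.2f}  efficiency={efficiency(n):6.1%}{note}")

print(f"HT gain, 24 threads vs 12 cores: {ht_gain():.1%}")
```

With these figures the efficiency at 12 cores comes out at roughly 76%, and the gain from going from 12 to 24 processes at roughly 17%. Efficiency values above 12 processes are not very meaningful, since they divide by a thread count rather than by a number of physical cores.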
>
> as for your CUDA version problem. that looks like a compile
> time issue. you'll have to examine the source code and see,
> if you can adjust the mentioned parameter. on GPUs the
> memory (and cache) architecture is different from CPUs and
> sometimes one has to choose what works well for most
> typical cases and require a recompilation with changed
> parameters. due to continued improvements in the CUDA
> programming interface and the CUDA drivers, this situation
> will improve in the future (e.g. with JIT compilation and selection
> of kernels suitable for specific needs).
As for the CUDA problem, I was wondering whether the MAX_EXCLUSIONS
parameter can simply be increased, or whether architecture/CUDA issues limit
its upper value. Any info about dependencies involving this parameter would
also be useful. A last question concerned the origin of the error, i.e.
whether it depends just on overall system size or rather on a combination of
atom/molecule numbers/sizes. I'll try to work out the answers from the code.
cheers,
Pietro
--
Dr. Pietro Amodeo, Ph.D.
Istituto di Chimica Biomolecolare del CNR
Comprensorio "A. Olivetti", Edificio 70
Via Campi Flegrei 34
I-80078 Pozzuoli (Napoli) - Italy
Phone +39-0818675072
Fax +39-0818041770
Email pamodeo_at_icmib.na.cnr.it