* Enable AVX by default
* Fix linting errors
* Fix win64 build (libxsmm not linked)
Libxsmm is not linked on Win64, so it should be disabled by default
* Fix clang format issues
* Change lowest supported CPU version to LIBXSMM_X86_AVX2
Change the lowest supported CPU version to LIBXSMM_X86_AVX2 to address https://github.com/dmlc/dgl/issues/3459
* Fix unit test
Remove the assumption that libxsmm is enabled in the config by default (only true for Intel CPUs with AVX2 instructions)
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-15-137.us-west-2.compute.internal>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
* add set_stream
* add .record_stream for NDArray and HeteroGraph
* refactor dgl stream Python APIs
* test record_stream
* add unit test for record stream
* use pytorch's stream
* fix lint
* fix cpu build
* address comments
* address comments
* add record stream tests for dgl.graph
* record frames and update dataloader
* add docstring
* update frame
* add backend check for record_stream
* remove CUDAThreadEntry::stream
* record stream for newly created formats
* fix bug
* fix cpp test
* fix None c_void_p to c_handle
* Use an internal cuda stream for CopyDataFromTo
* small fix white space
* Fix to compile
* Make stream optional in copydata for compile
* fix lint issue
* Update cub functions to use internal stream
* Lint check
* Update CopyTo/CopyFrom/CopyFromTo to use internal stream
* Address comments
* Fix backward CUDA stream
* Avoid overloading CopyFromTo()
* Minor comment update
* Overload copydatafromto in cuda device api
Co-authored-by: xiny <xiny@nvidia.com>
* [Dist] Enable maximum try times for socket backend via DGL_DIST_MAX_TRY_TIMES
* reset env before/after test
* print log for info when trying to connect
* fix
* print log in python instead of cpp
* Based on issue #3436. Improve _SegmentCopyKernel's GPU utilization by switching to nonzero-based thread assignment
* fixing lint issues
* Update cub for cuda 11.5 compatibility (#3468)
* fixing type mismatch
* tx guaranteed to be smaller than nnz. Hence removing last check
* minor: updating comment
* adding three unit tests for csr slice method to cover some corner cases
Co-authored-by: Abdurrahman Yasar <ayasar@nvidia.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
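
The nonzero-based thread assignment mentioned for `_SegmentCopyKernel` can be illustrated with a small CPU sketch. This is not the CUDA kernel from the commit. It is a hypothetical Python model (the function name and arguments are assumptions) in which each loop iteration stands in for one GPU thread that owns a single output nonzero and finds its row by binary search on the output `indptr`.

```python
from bisect import bisect_right


def slice_rows_per_nonzero(indptr, indices, rows):
    """Slice CSR rows with one unit of work per output nonzero.

    Hypothetical helper for illustration only; each iteration of the
    second loop plays the role of one GPU thread with index `tx`.
    """
    # Prefix sum of the selected rows' lengths gives the output indptr.
    out_indptr = [0]
    for r in rows:
        out_indptr.append(out_indptr[-1] + indptr[r + 1] - indptr[r])

    nnz = out_indptr[-1]
    out_indices = [0] * nnz
    for tx in range(nnz):
        # Binary search: last output row whose start offset is <= tx.
        # Empty rows are skipped automatically by bisect_right.
        out_row = bisect_right(out_indptr, tx) - 1
        # Offset within the row, translated back to the source matrix.
        src = indptr[rows[out_row]] + (tx - out_indptr[out_row])
        out_indices[tx] = indices[src]
    return out_indptr, out_indices
```

Because the work is split per nonzero rather than per row, a row with many entries no longer serializes on a single thread, which is the utilization problem this kind of assignment is meant to avoid.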
* relabel gpu
* unittest for relabel_ on the GPU
* finish Relabel_ for the GPU
* copyright
* re-enable the unittest for edge_subgraph on the GPU
* fix unittest for tensorflow
* use a fixed number of threads
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: nv-dlasalle <63612878+nv-dlasalle@users.noreply.github.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
* [Performance] improve coo2csr space complexity when row is not sorted
* [Perf] replace std::vector<> by NDArray
* keep both impls of unsorted coo to csr and choose between them dynamically according to graph density
* refine criteria to choose between unsorted algorithms
Co-authored-by: Ubuntu <ubuntu@ip-172-31-34-27.us-west-2.compute.internal>
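
For context on converting a COO matrix whose rows are not sorted, here is a minimal sketch of the textbook counting approach. It is an assumption-laden illustration, not either of the two implementations the commit keeps, and the function name is hypothetical. It counts entries per row, takes a prefix sum to get `indptr`, and then scatters each edge into its row's slot.

```python
def coo_to_csr_unsorted(num_rows, row, col):
    """Counting-based COO -> CSR conversion for unsorted row ids.

    Hypothetical helper for illustration. Returns (indptr, indices, eids)
    where eids[i] is the original COO position of CSR entry i.
    """
    # Pass 1: count the entries in each row.
    indptr = [0] * (num_rows + 1)
    for r in row:
        indptr[r + 1] += 1
    # Prefix sum turns the counts into row start offsets.
    for i in range(num_rows):
        indptr[i + 1] += indptr[i]

    # Pass 2: scatter each edge to the next free slot of its row.
    fill = indptr[:-1]  # slicing copies, so indptr is left untouched
    indices = [0] * len(row)
    eids = [0] * len(row)
    for e, (r, c) in enumerate(zip(row, col)):
        pos = fill[r]
        indices[pos] = c
        eids[pos] = e
        fill[r] += 1
    return indptr, indices, eids
```

This variant needs O(num_rows) extra space on top of the output arrays and preserves the original edge order within each row.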
* Implement range based NDArrayPartition
* Finish implement range based partition support
* Add unit test
* Fix whitespace
* Add Kernel suffix
* Fix argument passing
* Add doxygen docs and improve variable naming
* Add unit test
* Add function for converting a partition book
* Add example to partition_op docs
* Fix dtype conversion for mxnet and tensorflow
* fix.
* fix.
* fix.
* fix.
* Fix test
* Deprecate old DistEmbedding impl, use synchronized embedding impl
* Basic impl of heterogeneous on homogeneous sampling
* make pass
* Pass C++ test
* Add python test code
* lint
* lint
* Add MultiLayerEtypeNeighborSampler
* Add unit test for single machine dataloader
* Add dist dataloader test for edge type sampler
* Fix lint
* fix
* support for per etype sample
* Fix some bugs and enable distributed training with per edge sample
* fix
* Now distributed training works
* turn off some mxnet
* turn off mxnet for some dist test
* fix
* upd
* upd according to the comments
* Fix
* Fix test and now distributed works.
* upd
* upd
* Fix
* Fix bug
* remove dead code.
* upd
* Fix
* upd
* Fix
Co-authored-by: Ubuntu <ubuntu@ip-172-31-71-112.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-2-66.ec2.internal>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
* Split from NCCL PR
* Fix type in comment
* Expand documentation for sparse_all_to_all_push
* Restore previous behavior in example
* Re-work optimizer to use NCCL based on gradient location
* Allow for running with embedding on CPU but using NCCL for gradient exchange
* Optimize single partition case
* Fix pylint errors
* Add missing include
* fix gradient indexing
* Fix line continuation
* Migrate 'first_step'
* Skip tests without enough GPUs to run NCCL
* Improve empty tensor handling for pytorch 1.5
* Fix indentation
* Allow multiple NCCL communicators to coexist
* Improve handling of empty message
* Update python/dgl/nn/pytorch/sparse_emb.py
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
* Update python/dgl/nn/pytorch/sparse_emb.py
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
* Keep empty tensor dimensionless
* th.empty -> th.tensor
* Preserve shape for empty non-zero dimension tensors
* Use shared state, when embedding is shared
* Add support for gathering an embedding
* Fix typo
* Fix more typos
* Fix backend call
* Use NodeDataLoader to take advantage of ddp
* Update training script to share memory
* Only squeeze last dimension
* Better handle empty message
* Keep embedding on the target device GPU if dgl_sparse is false in RGCN example
* Fix typo in comment
* Add asserts
* Improve documentation in example
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
* Split NCCL wrapper from sparse optimizer and sparse embedding
* Add more unit tests for single node nccl
* Fix unit test for tf
* Switch to device histogram
* Fix histogram issues
* Finish migration to histogram
* Handle cases with zero send/receive data
* Start on partition object
* Get compiling
* Updates
* Add unit tests
* Switch to partition object
* Fix linting issues
* Rename partition file
* Add python doc
* Fix python assert and finish doxygen comments
* Remove stubs for range based partition to satisfy pylint
* Wrap unit test in GPU only
* Wrap explicit cuda call in ifdef
* Merge with partition.py
* update docstrings
* Cleanup partition_op
* Add Workspace object
* Switch to using workspace object
* Move last remainder based function out of nccl_api
* Add error messages
* Update docs with examples
* Fix linting errors
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
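
The partition object above distinguishes remainder-based from range-based partitioning. The sketch below shows one plausible way each scheme maps a global ID to a (partition, local ID) pair. The function names, the `range_starts` layout, and the local-ID convention are assumptions for illustration, not the DGL API.

```python
from bisect import bisect_right


def remainder_partition(global_id, num_parts):
    """Assumed remainder scheme: owner is id % num_parts."""
    return global_id % num_parts, global_id // num_parts


def range_partition(global_id, range_starts):
    """Assumed range scheme: partition p owns ids in
    [range_starts[p], range_starts[p + 1]).

    range_starts has num_parts + 1 entries; the last is the total count.
    """
    part = bisect_right(range_starts, global_id) - 1
    return part, global_id - range_starts[part]
```

For example, with three partitions, `remainder_partition(7, 3)` assigns ID 7 to partition 1, while `range_partition(7, [0, 4, 7, 10])` assigns it to partition 2 because 7 is the first ID of the last range.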
* Remove double-checking sorted
* Remove sorting of CSR by default
* Update unit test to use unsorted matrix
* delete whitespace
* Expand unit tests
* Replace cusparse sort
* Fix row column sorting
* Explicitly don't sort columns
* Fix linting errors
* Fix bit-width calculation
* Fix sorting assertion and unit test
* Fix linting
* Improve CPU COO2CSR
* Remove references
* Rename and add documentation to edge encoding/decoding functions
* Fix sorting keys as 64 bit
* Revert cosmetic changes to unit tests
* Update documentation
* Update complexity documentation for coo to csr conversion
* Remove COOIsSorted check in CPU implementation too